A64 -- SME Instructions (alphabetic order)

ADD (to array, array and multiple vectors): Multi-vector accumulate to ZA array vectors.

ADD (to array, multiple and single vector): Multi-vector add by vector to ZA array vectors.

ADD (to array, multiple vectors): Multi-vector add to ZA array vectors.

ADD (to vector, multiple vectors): Multi-vector add by vector to multi-vector.

ADDHA: Add horizontally vector elements to ZA tile.

ADDSPL: Add multiple of Streaming SVE predicate register size to scalar register.

ADDSVL: Add multiple of Streaming SVE vector register size to scalar register.

ADDVA: Add vertically vector elements to ZA tile.
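
ADDHA and ADDVA can be read as a broadcast add over a 32-bit tile. The sketch below is a hypothetical Python model, not Arm pseudocode. It assumes the tile is a list of rows, that `pn` predicates rows and `pm` predicates columns, and that sums wrap modulo 2^32. Function names such as `addha_model` are illustrative only.

```python
def wrap_s32(x: int) -> int:
    """Wrap to the signed 32-bit range, as 32-bit element arithmetic does."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def addha_model(za, zn, pn, pm):
    """ADDHA sketch: zn[col] is added to every active element of each horizontal slice (row)."""
    for row in range(len(za)):
        for col in range(len(za[row])):
            if pn[row] and pm[col]:  # assumed: pn selects rows, pm selects columns
                za[row][col] = wrap_s32(za[row][col] + zn[col])
    return za


def addva_model(za, zn, pn, pm):
    """ADDVA sketch: zn[row] is added to every active element of each vertical slice (column)."""
    for row in range(len(za)):
        for col in range(len(za[row])):
            if pn[row] and pm[col]:
                za[row][col] = wrap_s32(za[row][col] + zn[row])
    return za
```

With `zn = [1, 2]` on a zeroed 2x2 tile, ADDHA gives `[[1, 2], [1, 2]]` and ADDVA gives `[[1, 1], [2, 2]]`.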

BF1CVT, BF2CVT: Multi-vector 8-bit floating-point convert to BFloat16.

BF1CVTL, BF2CVTL: Multi-vector 8-bit floating-point convert to deinterleaved BFloat16.

BFADD: Multi-vector BFloat16 accumulate to ZA array vectors.

BFCLAMP: Multi-vector BFloat16 clamp to minimum/maximum number.

BFCVT (BFloat16 to 8-bit floating-point): Multi-vector BFloat16 convert to 8-bit floating-point.

BFCVT (single-precision to BFloat16): Multi-vector single-precision convert to BFloat16.

BFCVTN: Multi-vector single-precision convert to interleaved BFloat16.
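
The difference between BFCVT and BFCVTN is where results land. The sketch below assumes, from the index wording, that BFCVT writes the converted elements in source order while BFCVTN interleaves the two sources. It rounds to nearest with ties to even, quiets NaNs, and ignores FPCR controls and exception flags. Inputs are assumed to be float32-representable; `bfcvt_model` and `bfcvtn_model` are hypothetical names.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Round x (packed as float32) to a BFloat16 bit pattern, ties to even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF):
        return ((bits >> 16) | 0x0040) & 0xFFFF  # NaN: keep top half, force quiet bit
    lsb = (bits >> 16) & 1  # bias the rounding constant so exact ties go to even
    return ((bits + 0x7FFF + lsb) >> 16) & 0xFFFF


def bfcvt_model(zn1, zn2):
    """BFCVT sketch: two FP32 vectors -> one BF16 vector, results in source order."""
    return [f32_to_bf16_bits(x) for x in zn1] + [f32_to_bf16_bits(x) for x in zn2]


def bfcvtn_model(zn1, zn2):
    """BFCVTN sketch: same conversion, results from zn1 and zn2 interleaved."""
    out = []
    for a, b in zip(zn1, zn2):
        out += [f32_to_bf16_bits(a), f32_to_bf16_bits(b)]
    return out
```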

BFDOT (multiple and indexed vector): Multi-vector BFloat16 dot product by indexed element to single-precision.

BFDOT (multiple and single vector): Multi-vector BFloat16 dot product by vector to single-precision.

BFDOT (multiple vectors): Multi-vector BFloat16 dot product to single-precision.

BFMAX (multiple and single vector): Multi-vector BFloat16 maximum by vector.

BFMAX (multiple vectors): Multi-vector BFloat16 maximum.

BFMAXNM (multiple and single vector): Multi-vector BFloat16 maximum number by vector.

BFMAXNM (multiple vectors): Multi-vector BFloat16 maximum number.

BFMIN (multiple and single vector): Multi-vector BFloat16 minimum by vector.

BFMIN (multiple vectors): Multi-vector BFloat16 minimum.

BFMINNM (multiple and single vector): Multi-vector BFloat16 minimum number by vector.

BFMINNM (multiple vectors): Multi-vector BFloat16 minimum number.

BFMLA (multiple and indexed vector): Multi-vector BFloat16 fused multiply-add by indexed element.

BFMLA (multiple and single vector): Multi-vector BFloat16 fused multiply-add by vector.

BFMLA (multiple vectors): Multi-vector BFloat16 fused multiply-add.

BFMLAL (multiple and indexed vector): Multi-vector BFloat16 multiply-add by indexed element to single-precision.

BFMLAL (multiple and single vector): Multi-vector BFloat16 multiply-add by vector to single-precision.

BFMLAL (multiple vectors): Multi-vector BFloat16 multiply-add to single-precision.

BFMLS (multiple and indexed vector): Multi-vector BFloat16 fused multiply-subtract by indexed element.

BFMLS (multiple and single vector): Multi-vector BFloat16 fused multiply-subtract by vector.

BFMLS (multiple vectors): Multi-vector BFloat16 fused multiply-subtract.

BFMLSL (multiple and indexed vector): Multi-vector BFloat16 multiply-subtract by indexed element from single-precision.

BFMLSL (multiple and single vector): Multi-vector BFloat16 multiply-subtract by vector from single-precision.

BFMLSL (multiple vectors): Multi-vector BFloat16 multiply-subtract from single-precision.

BFMOP4A (non-widening): BFloat16 quarter-tile outer product, accumulating.

BFMOP4A (widening): BFloat16 quarter-tile sum of outer products to single-precision, accumulating.

BFMOP4S (non-widening): BFloat16 quarter-tile outer product, subtracting.

BFMOP4S (widening): BFloat16 quarter-tile sum of outer products to single-precision, subtracting.

BFMOPA (non-widening): BFloat16 outer product, accumulating.

BFMOPA (widening): BFloat16 sum of outer products to single-precision, accumulating.

BFMOPS (non-widening): BFloat16 outer product, subtracting.

BFMOPS (widening): BFloat16 sum of outer products to single-precision, subtracting.

BFMUL (multiple and single vector): Multi-vector BFloat16 multiply by vector.

BFMUL (multiple vectors): Multi-vector BFloat16 multiply.

BFSCALE (multiple and single vector): Multi-vector BFloat16 adjust exponent by vector.

BFSCALE (multiple vectors): Multi-vector BFloat16 adjust exponent.

BFSUB: Multi-vector BFloat16 subtract from ZA array vectors.

BFTMOPA (non-widening): BFloat16 sparse outer product, accumulating.

BFTMOPA (widening): BFloat16 sparse sum of outer products to single-precision, accumulating.

BFVDOT: Multi-vector BFloat16 vertical dot product by indexed element to single-precision.

BMOPA: Bitwise exclusive NOR population count outer product, accumulating.

BMOPS: Bitwise exclusive NOR population count outer product, subtracting.
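
BMOPA and BMOPS count matching bits between every row and column element pair. The sketch below is a hypothetical model on 32-bit element patterns. It assumes `pn` predicates rows and `pm` predicates columns, and that the tile wraps modulo 2^32 in an unsigned view. `bmop_model` is an illustrative name.

```python
MASK32 = 0xFFFFFFFF


def popcount32(x: int) -> int:
    """Number of set bits in the low 32 bits of x."""
    return bin(x & MASK32).count("1")


def bmop_model(za, zn, zm, pn, pm, subtract=False):
    """za[i][j] +/-= popcount(XNOR(zn[i], zm[j])) where row i and column j are both active."""
    for i, a in enumerate(zn):
        for j, b in enumerate(zm):
            if pn[i] and pm[j]:
                count = popcount32(~(a ^ b))  # XNOR = number of equal bit positions
                za[i][j] = (za[i][j] - count if subtract else za[i][j] + count) & MASK32
    return za
```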

F1CVT, F2CVT: Multi-vector 8-bit floating-point convert to half-precision.

F1CVTL, F2CVTL: Multi-vector 8-bit floating-point convert to deinterleaved half-precision.
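
The 8-bit floating-point conversions start by decoding an FP8 byte. As we understand it, the real instructions take the FP8 format (E4M3 or E5M2) from a control register (FPMR) at run time; the sketch below passes it as a plain `fmt` argument instead, and it ignores any scaling or exception behaviour. It also assumes, from the index wording, that F1CVT keeps element order while F1CVTL deinterleaves even and odd elements. All function names are hypothetical.

```python
import math


def decode_fp8(byte: int, fmt: str) -> float:
    """Decode one OCP FP8 byte as E4M3 or E5M2."""
    sign = -1.0 if byte & 0x80 else 1.0
    if fmt == "E4M3":
        exp, man, bias, man_bits = (byte >> 3) & 0xF, byte & 0x7, 7, 3
        if exp == 0xF and man == 0x7:
            return math.nan  # E4M3 has no infinities; S.1111.111 is NaN
    elif fmt == "E5M2":
        exp, man, bias, man_bits = (byte >> 2) & 0x1F, byte & 0x3, 15, 2
        if exp == 0x1F:
            return sign * math.inf if man == 0 else math.nan
    else:
        raise ValueError(f"unknown FP8 format: {fmt}")
    if exp == 0:  # subnormal: no implicit leading one
        return sign * math.ldexp(man, 1 - bias - man_bits)
    return sign * math.ldexp((1 << man_bits) | man, exp - bias - man_bits)


def f1cvt_model(zn, fmt):
    """In-order sketch: low half of zn -> first result vector, high half -> second."""
    vals = [decode_fp8(b, fmt) for b in zn]
    half = len(vals) // 2
    return vals[:half], vals[half:]


def f1cvtl_model(zn, fmt):
    """Deinterleaving sketch: even-numbered elements -> first vector, odd -> second."""
    vals = [decode_fp8(b, fmt) for b in zn]
    return vals[0::2], vals[1::2]
```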

FADD: Multi-vector floating-point accumulate to ZA array vectors.

FAMAX: Multi-vector floating-point absolute maximum.

FAMIN: Multi-vector floating-point absolute minimum.

FCLAMP: Multi-vector floating-point clamp to minimum/maximum number.
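
"Clamp to minimum/maximum number" means the bounds are applied with maximum-number and minimum-number semantics, where a NaN loses to a numeric operand. The sketch below assumes each destination element becomes `minnum(maxnum(zd, zn), zm)`, treats every NaN as quiet, and does not model signed-zero ordering. The function names are hypothetical.

```python
import math


def fmaxnm(a: float, b: float) -> float:
    """Maximum number: a NaN operand loses to a numeric one."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)


def fminnm(a: float, b: float) -> float:
    """Minimum number, with the same NaN rule as fmaxnm."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)


def fclamp_model(zd_group, zn, zm):
    """FCLAMP sketch: every vector in the destination group is clamped to [zn[i], zm[i]]."""
    return [
        [fminnm(fmaxnm(d, lo), hi) for d, lo, hi in zip(zd, zn, zm)]
        for zd in zd_group
    ]
```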

FCVT (narrowing, FP16 to FP8): Multi-vector half-precision convert to 8-bit floating-point.

FCVT (narrowing, FP32 to FP16): Multi-vector single-precision convert to half-precision.

FCVT (narrowing, FP32 to FP8): Multi-vector single-precision convert to 8-bit floating-point.

FCVT (widening): Multi-vector half-precision convert to single-precision.

FCVTL: Multi-vector half-precision convert to deinterleaved single-precision.

FCVTN (FP32 to FP16): Multi-vector single-precision convert to interleaved half-precision.

FCVTN (FP32 to FP8): Multi-vector single-precision convert to interleaved 8-bit floating-point.

FCVTZS: Multi-vector single-precision convert to signed 32-bit integer, rounding toward zero.

FCVTZU: Multi-vector single-precision convert to unsigned 32-bit integer, rounding toward zero.
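
FCVTZS and FCVTZU truncate toward zero and saturate out-of-range values. The sketch below is a per-element model that assumes Arm's usual integer-conversion rules: NaN converts to 0 and infinities saturate. The multi-vector form would apply it to every element of every source vector. Names are hypothetical.

```python
import math

INT32_MIN, INT32_MAX, UINT32_MAX = -(2**31), 2**31 - 1, 2**32 - 1


def fcvtzs_model(x: float) -> int:
    """Round toward zero, saturate to int32; NaN -> 0."""
    if math.isnan(x):
        return 0
    if math.isinf(x):
        return INT32_MAX if x > 0 else INT32_MIN
    return max(INT32_MIN, min(INT32_MAX, math.trunc(x)))


def fcvtzu_model(x: float) -> int:
    """Round toward zero, saturate to uint32 (negatives -> 0); NaN -> 0."""
    if math.isnan(x):
        return 0
    if math.isinf(x):
        return UINT32_MAX if x > 0 else 0
    return max(0, min(UINT32_MAX, math.trunc(x)))
```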

FDOT (2-way, multiple and indexed vector, FP16 to FP32): Multi-vector half-precision dot product by indexed element to single-precision.

FDOT (2-way, multiple and indexed vector, FP8 to FP16): Multi-vector 8-bit floating-point dot product by indexed element to half-precision.

FDOT (2-way, multiple and single vector, FP16 to FP32): Multi-vector half-precision dot product by vector to single-precision.

FDOT (2-way, multiple and single vector, FP8 to FP16): Multi-vector 8-bit floating-point dot product by vector to half-precision.

FDOT (2-way, multiple vectors, FP16 to FP32): Multi-vector half-precision dot product to single-precision.

FDOT (2-way, multiple vectors, FP8 to FP16): Multi-vector 8-bit floating-point dot product to half-precision.

FDOT (4-way, multiple and indexed vector): Multi-vector 8-bit floating-point dot product by indexed element to single-precision.

FDOT (4-way, multiple and single vector): Multi-vector 8-bit floating-point dot product by vector to single-precision.

FDOT (4-way, multiple vectors): Multi-vector 8-bit floating-point dot product to single-precision.

FMAX (multiple and single vector): Multi-vector floating-point maximum by vector.

FMAX (multiple vectors): Multi-vector floating-point maximum.

FMAXNM (multiple and single vector): Multi-vector floating-point maximum number by vector.

FMAXNM (multiple vectors): Multi-vector floating-point maximum number.

FMIN (multiple and single vector): Multi-vector floating-point minimum by vector.

FMIN (multiple vectors): Multi-vector floating-point minimum.

FMINNM (multiple and single vector): Multi-vector floating-point minimum number by vector.

FMINNM (multiple vectors): Multi-vector floating-point minimum number.

FMLA (multiple and indexed vector): Multi-vector floating-point fused multiply-add by indexed element.

FMLA (multiple and single vector): Multi-vector floating-point fused multiply-add by vector.

FMLA (multiple vectors): Multi-vector floating-point fused multiply-add.

FMLAL (multiple and indexed vector, FP16 to FP32): Multi-vector half-precision multiply-add by indexed element to single-precision.

FMLAL (multiple and indexed vector, FP8 to FP16): Multi-vector 8-bit floating-point multiply-add by indexed element to half-precision.

FMLAL (multiple and single vector, FP16 to FP32): Multi-vector half-precision multiply-add by vector to single-precision.

FMLAL (multiple and single vector, FP8 to FP16): Multi-vector 8-bit floating-point multiply-add by vector to half-precision.

FMLAL (multiple vectors, FP16 to FP32): Multi-vector half-precision multiply-add to single-precision.

FMLAL (multiple vectors, FP8 to FP16): Multi-vector 8-bit floating-point multiply-add to half-precision.

FMLALL (multiple and indexed vector): Multi-vector 8-bit floating-point multiply-add by indexed element to single-precision.

FMLALL (multiple and single vector): Multi-vector 8-bit floating-point multiply-add by vector to single-precision.

FMLALL (multiple vectors): Multi-vector 8-bit floating-point multiply-add to single-precision.

FMLS (multiple and indexed vector): Multi-vector floating-point fused multiply-subtract by indexed element.

FMLS (multiple and single vector): Multi-vector floating-point fused multiply-subtract by vector.

FMLS (multiple vectors): Multi-vector floating-point fused multiply-subtract.

FMLSL (multiple and indexed vector): Multi-vector half-precision multiply-subtract by indexed element from single-precision.

FMLSL (multiple and single vector): Multi-vector half-precision multiply-subtract by vector from single-precision.

FMLSL (multiple vectors): Multi-vector half-precision multiply-subtract from single-precision.

FMOP4A (non-widening): Floating-point quarter-tile outer product, accumulating.

FMOP4A (widening, 2-way, FP16 to FP32): Half-precision quarter-tile sum of outer products to single-precision, accumulating.

FMOP4A (widening, 2-way, FP8 to FP16): 8-bit floating-point quarter-tile sum of outer products to half-precision, accumulating.

FMOP4A (widening, 4-way): 8-bit floating-point quarter-tile sum of outer products to single-precision, accumulating.

FMOP4S (non-widening): Floating-point quarter-tile outer product, subtracting.

FMOP4S (widening): Half-precision quarter-tile sum of outer products to single-precision, subtracting.

FMOPA (non-widening): Floating-point outer product, accumulating.

FMOPA (widening, 2-way, FP16 to FP32): Half-precision sum of outer products to single-precision, accumulating.

FMOPA (widening, 2-way, FP8 to FP16): 8-bit floating-point sum of outer products to half-precision, accumulating.

FMOPA (widening, 4-way): 8-bit floating-point sum of outer products to single-precision, accumulating.

FMOPS (non-widening): Floating-point outer product, subtracting.

FMOPS (widening): Half-precision sum of outer products to single-precision, subtracting.
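
The outer-product family differs mainly in how many products feed each tile element. The sketch below models the non-widening form with row (`pn`) and column (`pm`) predicates, and the 2-way widening form as a 2-element dot product per tile element, with predication omitted. It uses Python floats (binary64), so it does not reproduce FP32 fused rounding. Function names are hypothetical.

```python
def fmopa_model(za, zn, zm, pn, pm, subtract=False):
    """Non-widening FMOPA/FMOPS sketch: za[i][j] +/-= zn[i] * zm[j] where row i and column j are active."""
    for i, a in enumerate(zn):
        for j, b in enumerate(zm):
            if pn[i] and pm[j]:
                za[i][j] += -a * b if subtract else a * b
    return za


def fmopa_widening_model(za, zn, zm):
    """Widening 2-way sketch: an N x N tile from 2N narrow elements per source vector."""
    n = len(za)
    for i in range(n):
        for j in range(n):
            za[i][j] += zn[2 * i] * zm[2 * j] + zn[2 * i + 1] * zm[2 * j + 1]
    return za
```

The widening form explains why the narrower sources still fill a wider-element tile: each tile element consumes a pair of elements from each source.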

FMUL (multiple and single vector): Multi-vector floating-point multiply by vector.

FMUL (multiple vectors): Multi-vector floating-point multiply.

FRINTA: Multi-vector single-precision round to integral value, to nearest with ties away from zero.

FRINTM: Multi-vector single-precision round to integral value, toward minus Infinity.

FRINTN: Multi-vector single-precision round to integral value, to nearest with ties to even.

FRINTP: Multi-vector single-precision round to integral value, toward plus Infinity.
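
The four FRINT variants differ only in rounding direction. The sketch below is a per-element model for finite inputs. It returns +0.0 where hardware would preserve a negative zero (for example FRINTA of -0.4). Names are hypothetical.

```python
import math


def frinta_model(x: float) -> float:
    """To nearest, ties away from zero."""
    t = math.trunc(x)
    if abs(x - t) >= 0.5:  # the fractional part is exact for finite floats
        t += 1 if x > 0 else -1
    return float(t)


def frintn_model(x: float) -> float:
    """To nearest, ties to even (Python's round() on a float already does this)."""
    return float(round(x))


def frintm_model(x: float) -> float:
    """Toward minus infinity."""
    return float(math.floor(x))


def frintp_model(x: float) -> float:
    """Toward plus infinity."""
    return float(math.ceil(x))
```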

FSCALE (multiple and single vector): Multi-vector floating-point adjust exponent by vector.

FSCALE (multiple vectors): Multi-vector floating-point adjust exponent.

FSUB: Multi-vector floating-point subtract from ZA array vectors.

FTMOPA (non-widening): Floating-point sparse outer product, accumulating.

FTMOPA (widening, 2-way, FP16 to FP32): Half-precision sparse sum of outer products to single-precision, accumulating.

FTMOPA (widening, 2-way, FP8 to FP16): 8-bit floating-point sparse sum of outer products to half-precision, accumulating.

FTMOPA (widening, 4-way): 8-bit floating-point sparse sum of outer products to single-precision, accumulating.

FVDOT (FP16 to FP32): Multi-vector half-precision vertical dot product by indexed element to single-precision.

FVDOT (FP8 to FP16): Multi-vector 8-bit floating-point vertical dot product by indexed element to half-precision.

FVDOTB: Multi-vector 8-bit floating-point vertical dot product by indexed element to single-precision (bottom).

FVDOTT: Multi-vector 8-bit floating-point vertical dot product by indexed element to single-precision (top).

LD1B (scalar plus immediate, strided registers): Contiguous load of bytes to multiple strided vectors (immediate index).

LD1B (scalar plus scalar, strided registers): Contiguous load of bytes to multiple strided vectors (scalar index).

LD1B (scalar plus scalar, tile slice): Contiguous load of bytes to 8-bit element ZA tile slice.

LD1D (scalar plus immediate, strided registers): Contiguous load of doublewords to multiple strided vectors (immediate index).

LD1D (scalar plus scalar, strided registers): Contiguous load of doublewords to multiple strided vectors (scalar index).

LD1D (scalar plus scalar, tile slice): Contiguous load of doublewords to 64-bit element ZA tile slice.

LD1H (scalar plus immediate, strided registers): Contiguous load of halfwords to multiple strided vectors (immediate index).

LD1H (scalar plus scalar, strided registers): Contiguous load of halfwords to multiple strided vectors (scalar index).

LD1H (scalar plus scalar, tile slice): Contiguous load of halfwords to 16-bit element ZA tile slice.

LD1Q: Contiguous load of quadwords to 128-bit element ZA tile slice.

LD1W (scalar plus immediate, strided registers): Contiguous load of words to multiple strided vectors (immediate index).

LD1W (scalar plus scalar, strided registers): Contiguous load of words to multiple strided vectors (scalar index).

LD1W (scalar plus scalar, tile slice): Contiguous load of words to 32-bit element ZA tile slice.

LDNT1B (scalar plus immediate, strided registers): Contiguous load non-temporal of bytes to multiple strided vectors (immediate index).

LDNT1B (scalar plus scalar, strided registers): Contiguous load non-temporal of bytes to multiple strided vectors (scalar index).

LDNT1D (scalar plus immediate, strided registers): Contiguous load non-temporal of doublewords to multiple strided vectors (immediate index).

LDNT1D (scalar plus scalar, strided registers): Contiguous load non-temporal of doublewords to multiple strided vectors (scalar index).

LDNT1H (scalar plus immediate, strided registers): Contiguous load non-temporal of halfwords to multiple strided vectors (immediate index).

LDNT1H (scalar plus scalar, strided registers): Contiguous load non-temporal of halfwords to multiple strided vectors (scalar index).

LDNT1W (scalar plus immediate, strided registers): Contiguous load non-temporal of words to multiple strided vectors (immediate index).

LDNT1W (scalar plus scalar, strided registers): Contiguous load non-temporal of words to multiple strided vectors (scalar index).
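
The "strided registers" forms name a register list whose members are spaced apart rather than consecutive. The sketch below assumes a stride of 8 for two registers (first register z0-z7 or z16-z23) and a stride of 4 for four registers (first register z0-z3 or z16-z19), and that consecutive vector-length chunks of memory fill the listed registers in order. Predication is ignored. These constraints are our reading of the encodings and should be checked against the instruction pages; the function names are hypothetical.

```python
def strided_regs(first: int, count: int) -> list[int]:
    """Z register numbers in a strided list (assumed stride 8 for two registers, 4 for four)."""
    if count == 2 and 0 <= first < 32 and first % 16 < 8:
        stride = 8
    elif count == 4 and 0 <= first < 32 and first % 16 < 4:
        stride = 4
    else:
        raise ValueError(f"no strided list of {count} registers starts at z{first}")
    return [first + k * stride for k in range(count)]


def ld1_strided_model(memory: bytes, vl_bytes: int, first: int, count: int) -> dict[int, bytes]:
    """Fill each register of the strided list from the next vl_bytes of memory."""
    return {
        reg: memory[k * vl_bytes:(k + 1) * vl_bytes]
        for k, reg in enumerate(strided_regs(first, count))
    }
```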

LDR (array vector): Load ZA array vector.

LDR (table): Load ZT0 register.

LUTI2 (four registers): Lookup table read with 2-bit indexes (four registers).

LUTI2 (single): Lookup table read with 2-bit indexes (single).

LUTI2 (two registers): Lookup table read with 2-bit indexes (two registers).

LUTI4 (four registers, 16-bit and 32-bit): Lookup table read with 4-bit indexes (four registers).

LUTI4 (four registers, 8-bit): Lookup table read with 4-bit indexes and 8-bit elements (four registers).

LUTI4 (single): Lookup table read with 4-bit indexes (single).

LUTI4 (two registers): Lookup table read with 4-bit indexes (two registers).

MOV (array to vector, four registers): Move four ZA single-vector groups to Z four-vector operand: an alias of MOVA (array to vector, four registers).

MOV (array to vector, two registers): Move two ZA single-vector groups to Z two-vector operand: an alias of MOVA (array to vector, two registers).

MOV (tile to vector, four registers): Move ZA four-slice operand to Z four-vector operand: an alias of MOVA (tile to vector, four registers).

MOV (tile to vector, single): Move ZA tile slice to Z vector: an alias of MOVA (tile to vector, single).

MOV (tile to vector, two registers): Move ZA two-slice operand to Z two-vector operand: an alias of MOVA (tile to vector, two registers).

MOV (vector to array, four registers): Move Z four-vector operand to four ZA single-vector groups: an alias of MOVA (vector to array, four registers).

MOV (vector to array, two registers): Move Z two-vector operand to two ZA single-vector groups: an alias of MOVA (vector to array, two registers).

MOV (vector to tile, four registers): Move Z four-vector operand to ZA four-slice operand: an alias of MOVA (vector to tile, four registers).

MOV (vector to tile, single): Move Z vector to ZA tile slice: an alias of MOVA (vector to tile, single).

MOV (vector to tile, two registers): Move Z two-vector operand to ZA two-slice operand: an alias of MOVA (vector to tile, two registers).

MOVA (array to vector, four registers): Move four ZA single-vector groups to Z four-vector operand.

MOVA (array to vector, two registers): Move two ZA single-vector groups to Z two-vector operand.

MOVA (tile to vector, four registers): Move ZA four-slice operand to Z four-vector operand.

MOVA (tile to vector, single): Move ZA tile slice to Z vector.

MOVA (tile to vector, two registers): Move ZA two-slice operand to Z two-vector operand.

MOVA (vector to array, four registers): Move Z four-vector operand to four ZA single-vector groups.

MOVA (vector to array, two registers): Move Z two-vector operand to two ZA single-vector groups.

MOVA (vector to tile, four registers): Move Z four-vector operand to ZA four-slice operand.

MOVA (vector to tile, single): Move Z vector to ZA tile slice.

MOVA (vector to tile, two registers): Move Z two-vector operand to ZA two-slice operand.

MOVAZ (array to vector, four registers): Move and zero four ZA single-vector groups to Z four-vector operand.

MOVAZ (array to vector, two registers): Move and zero two ZA single-vector groups to Z two-vector operand.

MOVAZ (tile to vector, four registers): Move and zero ZA four-slice operand to Z four-vector operand.

MOVAZ (tile to vector, single): Move and zero ZA tile slice to Z vector.

MOVAZ (tile to vector, two registers): Move and zero ZA two-slice operand to Z two-vector operand.

MOVT (scalar to table): Move 8 bytes from general-purpose register to ZT0.

MOVT (table to scalar): Move 8 bytes from ZT0 to general-purpose register.

MOVT (vector to table): Move vector register to ZT0.

RDSVL: Read multiple of Streaming SVE vector register size to scalar register.
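
RDSVL, ADDSVL and ADDSPL scale a small immediate by the streaming vector or predicate length. The sketch below takes the streaming vector length in bits as a parameter (`svl_bits`, hypothetical) and assumes the immediate is a signed 6-bit value (-32 to 31), as in the SVE ADDVL/RDVL encodings. Results wrap to 64 bits like an X register.

```python
def to_s64(x: int) -> int:
    """Interpret the low 64 bits of x as a signed X-register value."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >> 63 else x


def _check_imm(imm: int) -> None:
    if not -32 <= imm <= 31:
        raise ValueError("immediate assumed to be a signed 6-bit value")


def rdsvl_model(imm: int, svl_bits: int) -> int:
    """RDSVL sketch: imm * (streaming vector length in bytes)."""
    _check_imm(imm)
    return to_s64(imm * (svl_bits // 8))


def addsvl_model(xn: int, imm: int, svl_bits: int) -> int:
    """ADDSVL sketch: xn + imm * (streaming vector length in bytes)."""
    _check_imm(imm)
    return to_s64(xn + imm * (svl_bits // 8))


def addspl_model(xn: int, imm: int, svl_bits: int) -> int:
    """ADDSPL sketch: a predicate has one bit per vector byte, so it is SVL/64 bytes long."""
    _check_imm(imm)
    return to_s64(xn + imm * (svl_bits // 64))
```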

SCLAMP: Multi-vector signed clamp to minimum/maximum.

SCVTF: Multi-vector signed 32-bit integer convert to single-precision.

SDOT (2-way, multiple and indexed vector): Multi-vector signed 16-bit integer dot product by indexed element to 32-bit integer.

SDOT (2-way, multiple and single vector): Multi-vector signed 16-bit integer dot product by vector to 32-bit integer.

SDOT (2-way, multiple vectors): Multi-vector signed 16-bit integer dot product to 32-bit integer.

SDOT (4-way, multiple and indexed vector): Multi-vector signed integer dot product by indexed element.

SDOT (4-way, multiple and single vector): Multi-vector signed integer dot product by vector.

SDOT (4-way, multiple vectors): Multi-vector signed integer dot product.
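
A 4-way dot product folds four narrow products into each 32-bit accumulator. The sketch below models 8-bit sources and 32-bit ZA accumulators, one accumulator vector per source vector in the group, with wrapping (non-saturating) addition. For the indexed form it assumes the usual SVE convention: within each 128-bit segment, every lane uses the same 4-byte group of `zm` selected by `index`. ZA vector selection and predication are omitted; names are hypothetical.

```python
def wrap_s32(x: int) -> int:
    """Wrap to the signed 32-bit range."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def sdot4_single_model(za_vectors, zn_group, zm):
    """Multiple-and-single form: lane i += dot(zn[4i:4i+4], zm[4i:4i+4]) for each vector."""
    for acc, zn in zip(za_vectors, zn_group):
        for i in range(len(acc)):
            dot = sum(zn[4 * i + k] * zm[4 * i + k] for k in range(4))
            acc[i] = wrap_s32(acc[i] + dot)
    return za_vectors


def sdot4_indexed_model(za_vectors, zn_group, zm, index):
    """Indexed form: each 128-bit segment (four 32-bit lanes) reuses one 4-byte group of zm."""
    for acc, zn in zip(za_vectors, zn_group):
        for i in range(len(acc)):
            base = 4 * (4 * (i // 4) + index)  # first byte of the selected group in lane i's segment
            dot = sum(zn[4 * i + k] * zm[base + k] for k in range(4))
            acc[i] = wrap_s32(acc[i] + dot)
    return za_vectors
```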

SEL: Multi-vector conditional select.

SMAX (multiple and single vector): Multi-vector signed maximum by vector.

SMAX (multiple vectors): Multi-vector signed maximum.

SMIN (multiple and single vector): Multi-vector signed minimum by vector.

SMIN (multiple vectors): Multi-vector signed minimum.

SMLAL (multiple and indexed vector): Multi-vector signed 16-bit integer multiply-add by indexed element to 32-bit integer.

SMLAL (multiple and single vector): Multi-vector signed 16-bit integer multiply-add by vector to 32-bit integer.

SMLAL (multiple vectors): Multi-vector signed 16-bit integer multiply-add to 32-bit integer.

SMLALL (multiple and indexed vector): Multi-vector signed integer multiply-add long long by indexed element.

SMLALL (multiple and single vector): Multi-vector signed integer multiply-add long long by vector.

SMLALL (multiple vectors): Multi-vector signed integer multiply-add long long.

SMLSL (multiple and indexed vector): Multi-vector signed 16-bit integer multiply-subtract by indexed element from 32-bit integer.

SMLSL (multiple and single vector): Multi-vector signed 16-bit integer multiply-subtract by vector from 32-bit integer.

SMLSL (multiple vectors): Multi-vector signed 16-bit integer multiply-subtract from 32-bit integer.

SMLSLL (multiple and indexed vector): Multi-vector signed integer multiply-subtract long long by indexed element.

SMLSLL (multiple and single vector): Multi-vector signed integer multiply-subtract long long by vector.

SMLSLL (multiple vectors): Multi-vector signed integer multiply-subtract long long.

SMOP4A (2-way): Signed 16-bit integer quarter-tile sum of outer products to 32-bit integer, accumulating.

SMOP4A (4-way): Signed integer quarter-tile sum of outer products, accumulating.

SMOP4S (2-way): Signed 16-bit integer quarter-tile sum of outer products to 32-bit integer, subtracting.

SMOP4S (4-way): Signed integer quarter-tile sum of outer products, subtracting.

SMOPA (2-way): Signed 16-bit integer sum of outer products to 32-bit integer, accumulating.

SMOPA (4-way): Signed integer sum of outer products, accumulating.

SMOPS (2-way): Signed 16-bit integer sum of outer products to 32-bit integer, subtracting.

SMOPS (4-way): Signed integer sum of outer products, subtracting.

SQCVT (four registers): Multi-vector signed saturating extract narrow.

SQCVT (two registers): Multi-vector signed 32-bit integer saturating extract narrow to 16-bit integer.

SQCVTN: Multi-vector signed saturating extract narrow to interleaved integer.

SQCVTU (four registers): Multi-vector signed saturating extract narrow to unsigned integer.

SQCVTU (two registers): Multi-vector signed 32-bit integer saturating extract narrow to unsigned 16-bit integer.

SQCVTUN: Multi-vector signed saturating extract narrow to interleaved unsigned integer.

SQDMULH (multiple and single vector): Multi-vector signed saturating doubling multiply high by vector.

SQDMULH (multiple vectors): Multi-vector signed saturating doubling multiply high.

SQRSHR (four registers): Multi-vector signed saturating rounding shift right narrow by immediate.

SQRSHR (two registers): Multi-vector signed 32-bit integer saturating rounding shift right narrow by immediate to 16-bit integer.

SQRSHRN: Multi-vector signed saturating rounding shift right narrow by immediate to interleaved integer.

SQRSHRU (four registers): Multi-vector signed saturating rounding shift right narrow by immediate to unsigned integer.

SQRSHRU (two registers): Multi-vector signed 32-bit integer saturating rounding shift right narrow by immediate to unsigned 16-bit integer.

SQRSHRUN: Multi-vector signed saturating rounding shift right narrow by immediate to interleaved unsigned integer.
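
The saturating rounding narrows combine three steps: add half of the discarded weight, shift right arithmetically, then saturate to the narrow range. The sketch below models the 32-to-16-bit case. It assumes SQRSHR keeps source order and SQRSHRN interleaves, as the index wording suggests. The two-source interleaved case is shown for brevity even though the encoded register counts may differ. Names are hypothetical.

```python
INT16_MIN, INT16_MAX, UINT16_MAX = -(2**15), 2**15 - 1, 2**16 - 1


def rshr_round(x: int, shift: int) -> int:
    """Rounding arithmetic shift right: add half of the last kept bit's weight, then shift."""
    if not 1 <= shift <= 16:
        raise ValueError("shift assumed to be 1-16 for a 32-to-16-bit narrow")
    return (x + (1 << (shift - 1))) >> shift


def sat(x: int, lo: int, hi: int) -> int:
    return max(lo, min(hi, x))


def sqrshr_model(zn1, zn2, shift):
    """SQRSHR (two registers) sketch: int32 -> int16, signed saturation, source order."""
    return [sat(rshr_round(x, shift), INT16_MIN, INT16_MAX) for x in list(zn1) + list(zn2)]


def sqrshru_model(zn1, zn2, shift):
    """SQRSHRU (two registers) sketch: signed input, unsigned 16-bit saturation."""
    return [sat(rshr_round(x, shift), 0, UINT16_MAX) for x in list(zn1) + list(zn2)]


def sqrshrn_model(zn1, zn2, shift):
    """Interleaved sketch: results from zn1 and zn2 alternate."""
    out = []
    for a, b in zip(zn1, zn2):
        out += [sat(rshr_round(a, shift), INT16_MIN, INT16_MAX),
                sat(rshr_round(b, shift), INT16_MIN, INT16_MAX)]
    return out
```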

SRSHL (multiple and single vector): Multi-vector signed rounding shift left by vector.

SRSHL (multiple vectors): Multi-vector signed rounding shift left.
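
SRSHL and URSHL take a signed shift per element: positive shifts left, negative performs a rounding shift right. The sketch below assumes the shift is the signed value in the low byte of each shift element, as in the SVE and Advanced SIMD forms, and that results wrap to the element size without saturation. Names are hypothetical.

```python
def to_signed(x: int, bits: int) -> int:
    """Interpret the low `bits` bits of x as two's complement."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def _rounding_shift(value: int, shift_element: int) -> int:
    """Shift by the signed low byte of shift_element; negative means rounding shift right."""
    shift = to_signed(shift_element, 8)
    if shift >= 0:
        return value << shift
    return (value + (1 << (-shift - 1))) >> -shift


def srshl_model(element: int, shift_element: int, esize: int = 32) -> int:
    """SRSHL sketch: signed element, result wraps to esize bits."""
    return to_signed(_rounding_shift(to_signed(element, esize), shift_element), esize)


def urshl_model(element: int, shift_element: int, esize: int = 32) -> int:
    """URSHL sketch: element treated as unsigned, result wraps to esize bits."""
    mask = (1 << esize) - 1
    return _rounding_shift(element & mask, shift_element) & mask
```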

ST1B (scalar plus immediate, strided registers): Contiguous store of bytes from multiple strided vectors (immediate index).

ST1B (scalar plus scalar, strided registers): Contiguous store of bytes from multiple strided vectors (scalar index).

ST1B (scalar plus scalar, tile slice): Contiguous store of bytes from 8-bit element ZA tile slice.

ST1D (scalar plus immediate, strided registers): Contiguous store of doublewords from multiple strided vectors (immediate index).

ST1D (scalar plus scalar, strided registers): Contiguous store of doublewords from multiple strided vectors (scalar index).

ST1D (scalar plus scalar, tile slice): Contiguous store of doublewords from 64-bit element ZA tile slice.

ST1H (scalar plus immediate, strided registers): Contiguous store of halfwords from multiple strided vectors (immediate index).

ST1H (scalar plus scalar, strided registers): Contiguous store of halfwords from multiple strided vectors (scalar index).

ST1H (scalar plus scalar, tile slice): Contiguous store of halfwords from 16-bit element ZA tile slice.

ST1Q: Contiguous store of quadwords from 128-bit element ZA tile slice.

ST1W (scalar plus immediate, strided registers): Contiguous store of words from multiple strided vectors (immediate index).

ST1W (scalar plus scalar, strided registers): Contiguous store of words from multiple strided vectors (scalar index).

ST1W (scalar plus scalar, tile slice): Contiguous store of words from 32-bit element ZA tile slice.

STMOPA (2-way): Signed 16-bit integer sparse sum of outer products to 32-bit integer, accumulating.

STMOPA (4-way): Signed 8-bit integer sparse sum of outer products to 32-bit integer, accumulating.

STNT1B (scalar plus immediate, strided registers): Contiguous store non-temporal of bytes from multiple strided vectors (immediate index).

STNT1B (scalar plus scalar, strided registers): Contiguous store non-temporal of bytes from multiple strided vectors (scalar index).

STNT1D (scalar plus immediate, strided registers): Contiguous store non-temporal of doublewords from multiple strided vectors (immediate index).

STNT1D (scalar plus scalar, strided registers): Contiguous store non-temporal of doublewords from multiple strided vectors (scalar index).

STNT1H (scalar plus immediate, strided registers): Contiguous store non-temporal of halfwords from multiple strided vectors (immediate index).

STNT1H (scalar plus scalar, strided registers): Contiguous store non-temporal of halfwords from multiple strided vectors (scalar index).

STNT1W (scalar plus immediate, strided registers): Contiguous store non-temporal of words from multiple strided vectors (immediate index).

STNT1W (scalar plus scalar, strided registers): Contiguous store non-temporal of words from multiple strided vectors (scalar index).

STR (array vector): Store ZA array vector.

STR (table): Store ZT0 register.

SUB (to array, array and multiple vectors): Multi-vector subtract from ZA array vectors.

SUB (to array, multiple and single vector): Multi-vector subtract by vector to ZA array vectors.

SUB (to array, multiple vectors): Multi-vector subtract to ZA array vectors.

SUDOT (4-way, multiple and indexed vector): Multi-vector signed by unsigned 8-bit integer dot product by indexed elements to 32-bit integer.
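
SUDOT and USDOT both compute a 4-way dot product into 32-bit lanes; the difference is which operand's bytes are treated as signed. The sketch below is a hypothetical model (names `sudot4_single_model`, `usdot4_single_model`, and `to_s8` are illustrative). It takes raw byte patterns in `zn` and `zm`, applies the signedness the index names (signed first source by unsigned second source for SUDOT, the reverse for USDOT), and wraps each 32-bit lane. ZA vector selection, predication, and the indexed-element form are omitted.

```python
def to_s8(b: int) -> int:
    """Interpret the low 8 bits of b as a signed byte."""
    b &= 0xFF
    return b - 256 if b & 0x80 else b


def wrap_s32(x: int) -> int:
    """Wrap to the signed 32-bit range."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def _mixed_dot4(acc, zn, zm, signed_n, signed_m):
    """Lane i += sum of 4 byte products, each operand interpreted per its signedness flag."""
    for i in range(len(acc)):
        dot = 0
        for k in range(4):
            a = to_s8(zn[4 * i + k]) if signed_n else zn[4 * i + k] & 0xFF
            b = to_s8(zm[4 * i + k]) if signed_m else zm[4 * i + k] & 0xFF
            dot += a * b
        acc[i] = wrap_s32(acc[i] + dot)
    return acc


def sudot4_single_model(acc, zn, zm):
    """SUDOT sketch: signed zn bytes times unsigned zm bytes."""
    return _mixed_dot4(acc, zn, zm, signed_n=True, signed_m=False)


def usdot4_single_model(acc, zn, zm):
    """USDOT sketch: unsigned zn bytes times signed zm bytes."""
    return _mixed_dot4(acc, zn, zm, signed_n=False, signed_m=True)
```

With `zn = [0xFF, 0, 0, 0]` and `zm = [0xFF, 0, 0, 0]`, SUDOT reads the product as `-1 * 255` while USDOT reads it as `255 * -1`: both give -255, but swapping which byte is 0xFF shows the asymmetry.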

SUDOT (4-way, multiple and single vector): Multi-vector signed by unsigned 8-bit integer dot product by vector to 32-bit integer.

SUMLALL (multiple and indexed vector): Multi-vector signed by unsigned 8-bit integer multiply-add by indexed element to 32-bit integer.

SUMLALL (multiple and single vector): Multi-vector signed by unsigned 8-bit integer multiply-add by vector to 32-bit integer.

SUMOP4A: Signed by unsigned integer quarter-tile sum of outer products, accumulating.

SUMOP4S: Signed by unsigned integer quarter-tile sum of outer products, subtracting.

SUMOPA (4-way): Signed by unsigned integer sum of outer products, accumulating.

SUMOPS: Signed by unsigned integer sum of outer products, subtracting.

SUNPK: Unpack and sign-extend multi-vector elements.
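
SUNPK and UUNPK double the width of every element, differing only in sign versus zero extension. The sketch below models the two-register form. It assumes the low-numbered half of the source fills the first destination vector and the high half the second. Elements are passed as raw bit patterns; names are hypothetical.

```python
def unpack_model(zn, esize, signed):
    """Widen each esize-bit element; low half -> first result vector, high half -> second."""
    def widen(x):
        x &= (1 << esize) - 1
        if signed and x >> (esize - 1):
            x -= 1 << esize  # sign-extend (SUNPK)
        return x  # otherwise zero-extended (UUNPK)

    vals = [widen(x) for x in zn]
    half = len(vals) // 2
    return vals[:half], vals[half:]


def sunpk_model(zn, esize=8):
    return unpack_model(zn, esize, signed=True)


def uunpk_model(zn, esize=8):
    return unpack_model(zn, esize, signed=False)
```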

SUTMOPA: Signed by unsigned 8-bit integer sparse sum of outer products to 32-bit integer, accumulating.

SUVDOT: Multi-vector signed by unsigned 8-bit integer vertical dot product by indexed element to 32-bit integer.

SVDOT (2-way): Multi-vector signed 16-bit integer vertical dot product by indexed element to 32-bit integer.

SVDOT (4-way): Multi-vector signed integer vertical dot product by indexed element.

UCLAMP: Multi-vector unsigned clamp to minimum/maximum.

UCVTF: Multi-vector unsigned 32-bit integer convert to single-precision.

UDOT (2-way, multiple and indexed vector): Multi-vector unsigned 16-bit integer dot product by indexed element to 32-bit integer.

UDOT (2-way, multiple and single vector): Multi-vector unsigned 16-bit integer dot product by vector to 32-bit integer.

UDOT (2-way, multiple vectors): Multi-vector unsigned 16-bit integer dot product to 32-bit integer.

UDOT (4-way, multiple and indexed vector): Multi-vector unsigned integer dot product by indexed element.

UDOT (4-way, multiple and single vector): Multi-vector unsigned integer dot product by vector.

UDOT (4-way, multiple vectors): Multi-vector unsigned integer dot product.

UMAX (multiple and single vector): Multi-vector unsigned maximum by vector.

UMAX (multiple vectors): Multi-vector unsigned maximum.

UMIN (multiple and single vector): Multi-vector unsigned minimum by vector.

UMIN (multiple vectors): Multi-vector unsigned minimum.

UMLAL (multiple and indexed vector): Multi-vector unsigned 16-bit integer multiply-add by indexed element to 32-bit integer.

UMLAL (multiple and single vector): Multi-vector unsigned 16-bit integer multiply-add by vector to 32-bit integer.

UMLAL (multiple vectors): Multi-vector unsigned 16-bit integer multiply-add to 32-bit integer.

UMLALL (multiple and indexed vector): Multi-vector unsigned integer multiply-add long long by indexed element.

UMLALL (multiple and single vector): Multi-vector unsigned integer multiply-add long long by vector.

UMLALL (multiple vectors): Multi-vector unsigned integer multiply-add long long.

UMLSL (multiple and indexed vector): Multi-vector unsigned 16-bit integer multiply-subtract by indexed element from 32-bit integer.

UMLSL (multiple and single vector): Multi-vector unsigned 16-bit integer multiply-subtract by vector from 32-bit integer.

UMLSL (multiple vectors): Multi-vector unsigned 16-bit integer multiply-subtract from 32-bit integer.

UMLSLL (multiple and indexed vector): Multi-vector unsigned integer multiply-subtract long long by indexed element.

UMLSLL (multiple and single vector): Multi-vector unsigned integer multiply-subtract long long by vector.

UMLSLL (multiple vectors): Multi-vector unsigned integer multiply-subtract long long.

UMOP4A (2-way): Unsigned 16-bit integer quarter-tile sum of outer products to 32-bit integer, accumulating.

UMOP4A (4-way): Unsigned integer quarter-tile sum of outer products, accumulating.

UMOP4S (2-way): Unsigned 16-bit integer quarter-tile sum of outer products to 32-bit integer, subtracting.

UMOP4S (4-way): Unsigned integer quarter-tile sum of outer products, subtracting.

UMOPA (2-way): Unsigned 16-bit integer sum of outer products to 32-bit integer, accumulating.

UMOPA (4-way): Unsigned integer sum of outer products, accumulating.

UMOPS (2-way): Unsigned 16-bit integer sum of outer products to 32-bit integer, subtracting.

UMOPS (4-way): Unsigned integer sum of outer products, subtracting.

UQCVT (four registers): Multi-vector unsigned saturating extract narrow.

UQCVT (two registers): Multi-vector unsigned 32-bit integer saturating extract narrow to 16-bit integer.

UQCVTN: Multi-vector unsigned saturating extract narrow to interleaved integer.

UQRSHR (four registers): Multi-vector unsigned saturating rounding shift right narrow by immediate.

UQRSHR (two registers): Multi-vector unsigned 32-bit integer saturating rounding shift right narrow by immediate to 16-bit integer.

UQRSHRN: Multi-vector unsigned saturating rounding shift right narrow by immediate to interleaved integer.

URSHL (multiple and single vector): Multi-vector unsigned rounding shift left by vector.

URSHL (multiple vectors): Multi-vector unsigned rounding shift left.

USDOT (4-way, multiple and indexed vector): Multi-vector unsigned by signed 8-bit integer dot product by indexed element to 32-bit integer.

USDOT (4-way, multiple and single vector): Multi-vector unsigned by signed 8-bit integer dot product by vector to 32-bit integer.

USDOT (4-way, multiple vectors): Multi-vector unsigned by signed 8-bit integer dot product to 32-bit integer.

USMLALL (multiple and indexed vector): Multi-vector unsigned by signed 8-bit integer multiply-add by indexed element to 32-bit integer.

USMLALL (multiple and single vector): Multi-vector unsigned by signed 8-bit integer multiply-add by vector to 32-bit integer.

USMLALL (multiple vectors): Multi-vector unsigned by signed 8-bit integer multiply-add to 32-bit integer.

USMOP4A: Unsigned by signed integer quarter-tile sum of outer products, accumulating.

USMOP4S: Unsigned by signed integer quarter-tile sum of outer products, subtracting.

USMOPA (4-way): Unsigned by signed integer sum of outer products, accumulating.

USMOPS: Unsigned by signed integer sum of outer products, subtracting.

USTMOPA: Unsigned by signed 8-bit integer sparse sum of outer products to 32-bit integer, accumulating.

USVDOT: Multi-vector unsigned by signed 8-bit integer vertical dot product by indexed element to 32-bit integer.

UTMOPA (2-way): Unsigned 16-bit integer sparse sum of outer products to 32-bit integer, accumulating.

UTMOPA (4-way): Unsigned 8-bit integer sparse sum of outer products to 32-bit integer, accumulating.

UUNPK: Unpack and zero-extend multi-vector elements.

UVDOT (2-way): Multi-vector unsigned 16-bit integer vertical dot product by indexed element to 32-bit integer.

UVDOT (4-way): Multi-vector unsigned integer vertical dot product by indexed element.

UZP (four registers): Concatenate elements from four vectors.

UZP (two registers): Concatenate elements from two vectors.

ZERO (double-vector): Zero ZA double-vector groups.

ZERO (quad-vector): Zero ZA quad-vector groups.

ZERO (single-vector): Zero ZA single-vector groups.

ZERO (table): Zero ZT0.

ZERO (tiles): Zero a list of 64-bit element ZA tiles.
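
ZERO (tiles) is expressed in terms of the eight 64-bit tiles, so zeroing a narrower-element tile means zeroing every 64-bit tile that shares its rows. The sketch below assumes the standard ZA layout in which horizontal slice r of tile ZAt with n-byte elements is ZA array vector `r * n + t`. Under that layout, ZAt covers exactly the 64-bit tiles ZAd.D with `d % n == t`; for example ZA0.S maps to mask 0x11 (ZA0.D and ZA4.D). `zero_tiles_mask` is a hypothetical name.

```python
ELEMENT_BYTES = {"B": 1, "H": 2, "S": 4, "D": 8}  # also the number of tiles of that size


def zero_tiles_mask(tiles):
    """8-bit mask of 64-bit tiles ZA0.D-ZA7.D covering a list of (suffix, tile) pairs."""
    mask = 0
    for suffix, tile in tiles:
        n = ELEMENT_BYTES[suffix]
        if not 0 <= tile < n:
            raise ValueError(f"ZA{tile}.{suffix} does not exist")
        for d in range(8):
            if d % n == tile:  # ZAd.D rows are a subset of ZA<tile>.<suffix> rows
                mask |= 1 << d
    return mask
```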

ZIP (four registers): Interleave elements from four vectors.

ZIP (two registers): Interleave elements from two vectors.
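
The two-register ZIP and UZP are inverses of each other. The sketch below assumes ZIP interleaves the two sources and splits the result into low and high halves, and that UZP takes even-numbered elements of the concatenated sources into the first destination and odd-numbered ones into the second. Names are hypothetical.

```python
def zip2_model(zn, zm):
    """Interleave zn and zm; the first half of the sequence fills the first result vector."""
    inter = [v for pair in zip(zn, zm) for v in pair]
    half = len(inter) // 2
    return inter[:half], inter[half:]


def uzp2_model(zn, zm):
    """Concatenate zn:zm; even-numbered elements -> first result, odd-numbered -> second."""
    cat = list(zn) + list(zm)
    return cat[0::2], cat[1::2]
```

Applying `uzp2_model` to the two outputs of `zip2_model` recovers the original pair of vectors.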

2026-03_rel 2026-03-26 20:48:11