User:Ovinus/sandbox3

SSE background

Introduction and history

Intel introduced MMX technology in 1997 with the Pentium P5 line of processors, adding eight 64-bit registers named MM0 through MM7. Each register may hold a single 64-bit integer or several smaller integers that are operated on in parallel, an example of single instruction, multiple data (SIMD). For example, the instruction PADDD MM0, MM1^[a] adds the two 32-bit integers in MM0 to the corresponding 32-bit integers in MM1 in parallel, writing the sums to MM0. Architecturally, these registers were mapped onto the existing x87 FPU registers,^[b] so code cannot freely mix x87 and MMX instructions without conflict.^[c]
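The packed addition described above can be modelled in a few lines of Python. This is only an illustrative sketch of the arithmetic, not how the hardware is programmed: the function name paddd_mmx and the use of plain integers to stand for register contents are assumptions made for the example.

```python
LANE_MASK = 0xFFFFFFFF  # one 32-bit element


def paddd_mmx(dst: int, src: int) -> int:
    """Model PADDD on two 64-bit MMX register values (hypothetical helper).

    Each 32-bit element is added independently and wraps around on
    overflow; no carry passes from the low element into the high one.
    """
    result = 0
    for lane in range(2):  # a 64-bit register holds two 32-bit elements
        shift = 32 * lane
        a = (dst >> shift) & LANE_MASK
        b = (src >> shift) & LANE_MASK
        result |= ((a + b) & LANE_MASK) << shift
    return result


mm0 = 0x00000001_FFFFFFFF
mm1 = 0x00000002_00000001
mm0 = paddd_mmx(mm0, mm1)
print(f"{mm0:016X}")  # 0000000300000000
```

The low element overflows to zero without disturbing the high element, which is the difference between a packed 32-bit addition and an ordinary 64-bit addition.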

AMD aimed to enhance the MMX instruction set with 3DNow!, introduced in 1998, by adding floating-point instructions that use the MMX registers. 3DNow! never garnered much popularity and was officially deprecated in 2010.^[d] In 1999, Intel introduced Streaming SIMD Extensions (SSE) with the Pentium III, adding eight new 128-bit registers named XMM0 through XMM7, along with a control register for them called MXCSR. With AMD64, the 64-bit version of the x86 architecture, the number of SSE registers was doubled to 16 (XMM0 through XMM15). The original SSE instructions mostly operated on single-precision floating-point numbers, and included both SIMD and scalar (single data) instructions; for example, ADDSS XMM0, XMM1 adds the single-precision floats in the lowest 32 bits of XMM0 and of XMM1, writing the sum to the lowest 32 bits of XMM0, while ADDPS XMM0, XMM1^[e] adds four pairs of single-precision floats in parallel. SSE2, released in 2000, added double-precision and many new integer SSE instructions, essentially extending and supplanting the MMX technology. Integer SSE instructions are architecturally no different from floating-point instructions; the bits of a register may be treated as floats by one instruction and as integers by the next. Successive extensions (SSE3, SSSE3, and SSE4) added new instructions performing more complex operations, such as shuffling data within a register.
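The difference between the scalar and packed forms can be sketched as follows. The model is a simplification: a register is represented as a Python list of four floats with element 0 being the lowest 32 bits, and the helper names f32, addss, and addps are chosen for this example only.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def addss(dst, src):
    """Model ADDSS: add only the lowest element, keep the rest of dst."""
    return [f32(dst[0] + src[0])] + list(dst[1:])


def addps(dst, src):
    """Model ADDPS: add all four pairs of elements."""
    return [f32(a + b) for a, b in zip(dst, src)]


xmm0 = [1.0, 2.0, 3.0, 4.0]
xmm1 = [0.5, 0.5, 0.5, 0.5]
print(addss(xmm0, xmm1))  # [1.5, 2.0, 3.0, 4.0]
print(addps(xmm0, xmm1))  # [1.5, 2.5, 3.5, 4.5]
```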

The next major architectural change came with Advanced Vector Extensions (AVX), released in 2011 by Intel, which expanded the SSE registers to 256-bit AVX registers named YMM0 through YMM15. The SSE registers map onto the lower halves of the corresponding AVX registers. AVX extends SSE's floating-point operations to 256 bits but does not extend the integer operations. In terms of functionality, the AVX registers are akin to two SSE registers joined together: most instructions cannot move data between the two 128-bit "lanes". This restriction was loosened somewhat with the 2013 introduction of AVX2, which extended the SSE integer instructions to AVX registers, added new instructions, and added a few lane-crossing instructions. The FMA3 and FMA4 instruction sets, introduced between 2011 and 2013 (of which only the former is still supported), added fused multiply–accumulate floating-point instructions for SSE and AVX registers. Other, miscellaneous instruction sets have since been added for the SSE and AVX registers.
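The lane restriction can be illustrated with a model of a 256-bit in-lane shuffle. The sketch below assumes the AVX2 form of VPSHUFD as its example instruction and represents a register as a list of eight 32-bit elements; the function name vpshufd_ymm is hypothetical.

```python
def vpshufd_ymm(src, imm8):
    """Model VPSHUFD on a 256-bit register held as eight 32-bit elements.

    The same four 2-bit selectors from imm8 are applied to each 128-bit
    lane separately, so an element never leaves the lane it started in.
    """
    if len(src) != 8:
        raise ValueError("expected eight 32-bit elements")
    out = []
    for lane in range(2):  # two 128-bit lanes
        base = 4 * lane
        for i in range(4):
            selector = (imm8 >> (2 * i)) & 0b11
            out.append(src[base + selector])
    return out


ymm1 = [0, 1, 2, 3, 4, 5, 6, 7]
print(vpshufd_ymm(ymm1, 0x1B))  # [3, 2, 1, 0, 7, 6, 5, 4]
```

With the immediate 0x1B each lane is reversed in place; reversing all eight elements of the register would require a lane-crossing instruction.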

AVX-512 added hundreds of new instructions, expanded the 256-bit AVX registers into 512-bit AVX-512 registers, and doubled the register count to 32 (ZMM0 through ZMM31). As in the transition from SSE to AVX, the AVX registers map onto the lower halves of the corresponding AVX-512 registers, and most AVX-512 instructions may also be applied to the smaller registers on CPUs that support the AVX-512VL extension (but always require an EVEX prefix). AVX-512 also adds eight "opmask" registers, named K0 through K7, which can be used to select which elements of a SIMD register are affected by an operation. AVX-512 is subdivided into sets of instructions, of which only AVX-512F ("Foundation") is required; AVX-512-capable CPUs vary in which instruction sets they support.
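The effect of an opmask can be sketched with a model of a masked 32-bit addition on a 512-bit register, represented here as a list of sixteen elements. The function name and its zeroing parameter are assumptions for the example; the parameter stands in for the choice between leaving unselected elements unchanged (merge masking) and clearing them (zero masking), a detail not covered in the text above.

```python
def vpaddd_masked(dst, a, b, k, zeroing=False):
    """Model a masked VPADDD on sixteen 32-bit elements (hypothetical helper).

    Bit i of the opmask value k selects whether element i receives the sum.
    Unselected elements keep the old destination value, or become zero
    when zeroing is requested.
    """
    out = []
    for i, (x, y) in enumerate(zip(a, b)):
        if (k >> i) & 1:
            out.append((x + y) & 0xFFFFFFFF)
        elif zeroing:
            out.append(0)
        else:
            out.append(dst[i])
    return out


zmm0 = [9] * 16
zmm1 = list(range(16))
zmm2 = [100] * 16
k1 = 0b0000000000000101  # select elements 0 and 2 only

print(vpaddd_masked(zmm0, zmm1, zmm2, k1)[:4])                # [100, 9, 102, 9]
print(vpaddd_masked(zmm0, zmm1, zmm2, k1, zeroing=True)[:4])  # [100, 0, 102, 0]
```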

Instruction encoding and table notation

x86 is an example of a CISC (complex instruction set computer) architecture. Its instructions are variable in length and are encoded as a combination of prefixes attached to an opcode, which determines the instruction, followed by several arguments: registers, memory addresses, and immediates (data for immediate use). The following table uses Intel assembly syntax, in which the write destination of an operation (if applicable) is the first argument.

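Each instruction has an opcode (for example, PADDD is 0F FE), which may be followed by its encoded operands. Here the operands are denoted /r, meaning that a ModR/M byte follows the opcode and encodes both a register operand and a second operand that may be a register or a memory location. Immediate byte values are indicated with ib. There is no prefix peculiar to MMX instructions. SSE instructions use a mandatory 66, F2, or F3 prefix or none at all; in particular, the SSE2 forms of the MMX integer instructions are distinguished from their MMX counterparts by a one-byte 66 prefix. AVX and AVX-512 instructions use a VEX prefix and an EVEX prefix, respectively, which determine (among other things) the vector width of the instruction. In 64-bit mode, a REX prefix is used to select a 64-bit operand size and to access the additional registers, such as XMM8 through XMM15.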
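As a rough illustration of the encoding, the following sketch assembles the bytes for the register-to-register form of PADDD. It is a simplified model that covers only registers 0 through 7 and no memory operands; the helper names modrm and encode_paddd are invented for the example.

```python
def modrm(mod: int, reg: int, rm: int) -> int:
    """Pack the three fields of a ModR/M byte: mod (2 bits), reg (3), rm (3)."""
    return (mod << 6) | (reg << 3) | rm


def encode_paddd(dst: int, src: int, sse2: bool = False) -> bytes:
    """Encode PADDD dst, src for register numbers 0-7 (hypothetical helper).

    The opcode is 0F FE; the ModR/M byte uses mod = 11 (register operand),
    with the destination in the reg field and the source in the rm field.
    The SSE2 (xmm) form differs from the MMX (mm) form only by a 66 prefix.
    """
    if not (0 <= dst <= 7 and 0 <= src <= 7):
        raise ValueError("this sketch only handles registers 0-7")
    prefix = b"\x66" if sse2 else b""
    return prefix + bytes([0x0F, 0xFE, modrm(0b11, dst, src)])


print(encode_paddd(0, 1).hex())             # 0ffec1   (PADDD mm0, mm1)
print(encode_paddd(0, 1, sse2=True).hex())  # 660ffec1 (PADDD xmm0, xmm1)
```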

In the table, mm indicates a SIMD register, which should be widened to the corresponding vector length for the corresponding wider instructions. Typically, the mnemonic for an SSE instruction is the same as that of the corresponding MMX instruction, while AVX and AVX-512 mnemonics begin with the letter V. Thus, the SSE2 instruction PADDD xmm, xmm adds 32-bit integers in parallel, writing to the first argument, while the analogous AVX2 instruction is VPADDD ymm, ymm, ymm, which adds its second and third arguments and writes to the first. Instructions that do not follow this exact pattern are indicated in the columns.^[f] r/m32 indicates a 32-bit argument that may be either a 32-bit general-purpose register or a memory location. imm8 indicates an 8-bit integer immediate, an arbitrary byte encoded directly into the instruction. For example, VPSRLQ ymm, ymm, imm8 shifts each 64-bit element of its second argument right by the number of bits specified in imm8, writing to the first argument.
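The role of the immediate in the VPSRLQ example can be sketched as below, with a 256-bit register modelled as a list of four 64-bit elements. The function name vpsrlq_ymm is an assumption for the example, and the model ignores operand encoding entirely.

```python
MASK64 = (1 << 64) - 1


def vpsrlq_ymm(src, imm8):
    """Model VPSRLQ with an immediate count on four 64-bit elements.

    Every element is shifted right logically by the same count; counts
    above 63 clear the element.
    """
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("imm8 must fit in one byte")
    return [((x & MASK64) >> imm8) if imm8 < 64 else 0 for x in src]


ymm1 = [0x8000000000000000, 0xFF, 0x100, 1]
print([hex(x) for x in vpsrlq_ymm(ymm1, 4)])
# ['0x800000000000000', '0xf', '0x10', '0x0']
```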

[a] PADDD stands for packed add double word. In x86, a word is 16 bits in length, so a double word generally refers to a 32-bit integer or chunk of memory.
[b] In particular, each 64-bit MMX register maps to the lower 64 bits of the associated 80-bit x87 register. Unlike x87 registers, which operate as a stack, MMX registers may be accessed in an arbitrary order.
[c] Switching back to the x87 FPU state may be done by executing EMMS (empty MMX state).
[d] A few 3DNow! instructions are still maintained, however, namely the prefetch instructions.
[e] ADDPS stands for add packed single-precision (number).
[f] For example, MOVQ mm, mm (MMX) becomes MOVDQA xmm, xmm (SSE2).
