Intrinsics – Arm Developer

SIMD ISA	Return Type	Name	Arguments	Instruction Group
Neon	`float16x8x4_t`	`vld4q_dup_f16`	`(float16_t const * ptr)`	Load / Stride

Description

Load single 4-element structure and replicate to all lanes of four registers. This instruction loads one 4-element structure from memory and replicates each element of the structure to all lanes of its own register: element 0 fills the first SIMD&FP register, element 1 the second, and so on through the fourth.

Results

Vt.8H → result.val[0]
Vt2.8H → result.val[1]
Vt3.8H → result.val[2]
Vt4.8H → result.val[3]
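The result layout can be illustrated with a portable scalar model. This is only a sketch, not the intrinsic itself: `f16x8x4_model` and `vld4q_dup_f16_model` are hypothetical names introduced here, and lanes are held as raw IEEE half-precision bit patterns in `uint16_t` so that the code builds on targets without `<arm_neon.h>` or a native `float16_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for float16x8x4_t: four vectors of eight lanes.
   Each lane stores a raw half-precision bit pattern. */
typedef struct {
    uint16_t val[4][8];
} f16x8x4_model;

/* Scalar model of vld4q_dup_f16 (assumed name, not an Arm API):
   reads ptr[0..3] and broadcasts ptr[i] to every lane of val[i]. */
static f16x8x4_model vld4q_dup_f16_model(const uint16_t *ptr)
{
    f16x8x4_model r;
    assert(ptr != NULL);
    for (size_t reg = 0; reg < 4; ++reg) {
        for (size_t lane = 0; lane < 8; ++lane) {
            r.val[reg][lane] = ptr[reg];
        }
    }
    return r;
}
```

Calling the model on the four halfwords `0x3C00, 0x4000, 0x4200, 0x4400` (1.0, 2.0, 3.0, 4.0 in half precision) yields `val[0]` filled with `0x3C00`, `val[1]` with `0x4000`, and so on, mirroring the Vt.8H to Vt4.8H mapping.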

This intrinsic compiles to the following instructions:

LD4R {Vt.8H - Vt4.8H},[Xn]

Argument Preparation

ptr register: Xn

Architectures

v7, A32, A64

Operation

if HaveMTE2Ext() then
    SetTagCheckedInstruction(tag_checked);

CheckFPAdvSIMDEnabled64();

bits(64) address;
bits(64) offs;
bits(128) rval;
bits(esize) element;
constant integer ebytes = esize DIV 8;

if n == 31 then
    CheckSPAlignment();
    address = SP[];
else
    address = X[n];

offs = Zeros();
if replicate then
    // load and replicate to all elements
    for s = 0 to selem-1
        element = Mem[address + offs, ebytes, AccType_VEC];
        // replicate to fill 128- or 64-bit register
        V[t] = Replicate(element, datasize DIV esize);
        offs = offs + ebytes;
        t = (t + 1) MOD 32;
else
    // load/store one element per register
    for s = 0 to selem-1
        rval = V[t];
        if memop == MemOp_LOAD then
            // insert into one lane of 128-bit register
            Elem[rval, index, esize] = Mem[address + offs, ebytes, AccType_VEC];
            V[t] = rval;
        else // memop == MemOp_STORE
            // extract from one lane of 128-bit register
            Mem[address + offs, ebytes, AccType_VEC] = Elem[rval, index, esize];
        offs = offs + ebytes;
        t = (t + 1) MOD 32;

if wback then
    if m != 31 then
        offs = X[m];
    if n == 31 then
        SP[] = address + offs;
    else
        X[n] = address + offs;
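For this intrinsic only the replicate branch of the pseudocode applies, with no writeback. The sketch below follows that branch in C under stated assumptions: esize = 16, datasize = 128 and selem = 4 are taken from the `.8H` four-register form, while `vregs_model` and `ld4r_8h_model` are hypothetical names for a toy register file and the load loop. Tag checking, FP/SIMD enable checks and SP alignment checks are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Parameters assumed for LD4R {Vt.8H - Vt4.8H}, [Xn]. */
enum { ESIZE = 16, DATASIZE = 128, SELEM = 4 };
enum { EBYTES = ESIZE / 8, LANES = DATASIZE / ESIZE };

/* Hypothetical register file: 32 vector registers of 8 halfword lanes. */
typedef struct {
    uint16_t v[32][LANES];
} vregs_model;

/* Models the "if replicate" branch: loads SELEM consecutive elements
   starting at address and fills registers t, t+1, ... (modulo 32).
   Returns the final byte offset, i.e. the number of bytes read. */
static size_t ld4r_8h_model(vregs_model *regs, unsigned t,
                            const uint8_t *address)
{
    size_t offs = 0;
    assert(t < 32);
    for (int s = 0; s < SELEM; ++s) {
        uint16_t element;
        /* element = Mem[address + offs, ebytes] */
        memcpy(&element, address + offs, EBYTES);
        /* V[t] = Replicate(element, datasize DIV esize) */
        for (int lane = 0; lane < LANES; ++lane) {
            regs->v[t][lane] = element;
        }
        offs += EBYTES;
        t = (t + 1) % 32;
    }
    return offs;
}
```

Starting at `t = 30` shows the `MOD 32` wrap-around: the four elements land in registers 30, 31, 0 and 1, and the function returns 8, the size in bytes of one 4-element halfword structure.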