Intrinsics – Arm Developer

SIMD ISA	Return Type	Name	Arguments	Instruction Group
Neon	`float32x4_t`	`vbfdotq_f32`	`(float32x4_t r, bfloat16x8_t a, bfloat16x8_t b)`	Vector arithmetic / Dot product
Description

BFloat16 floating-point dot product (vector). This instruction delimits the source vectors into pairs of 16-bit BF16 elements. Within each pair, the elements in the first source vector are multiplied by the corresponding elements in the second source vector. The resulting single-precision products are then summed and added destructively to the single-precision element of the destination vector that aligns with the pair of BF16 values in the first source vector. The instruction ignores the FPCR and does not update the FPSR exception status.

Results

Vd.4S: result

This intrinsic compiles to the following instructions:

BFDOT `Vd.4S,Vn.8H,Vm.8H`

Argument Preparation

r register: Vd.4S
a register: Vn.8H
b register: Vm.8H

Architectures

A32, A64

Operation

    CheckFPAdvSIMDEnabled64();
    bits(datasize) operand1 = V[n];
    bits(datasize) operand2 = V[m];
    bits(datasize) operand3 = V[d];
    bits(datasize) result;

    for e = 0 to elements-1
        bits(16) elt1_a = Elem[operand1, 2 * e + 0, 16];
        bits(16) elt1_b = Elem[operand1, 2 * e + 1, 16];
        bits(16) elt2_a = Elem[operand2, 2 * e + 0, 16];
        bits(16) elt2_b = Elem[operand2, 2 * e + 1, 16];

        bits(32) sum = Elem[operand3, e, 32];
        sum = BFDotAdd(sum, elt1_a, elt1_b, elt2_a, elt2_b, FPCR[]);
        Elem[result, e, 32] = sum;

    V[d] = result;
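Example

The per-lane behaviour described above can be approximated in portable C. The following is a minimal sketch, not Arm's reference implementation. The names `bf16_to_f32`, `f32_to_bf16_trunc`, and `bfdot_model` are hypothetical helpers introduced for illustration. BF16 values are carried as raw `uint16_t` bit patterns, and the accumulation uses ordinary C `float` arithmetic, so rounding, denormal, and NaN handling may differ from what `BFDotAdd` does in hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen a BF16 bit pattern to float.
   BF16 occupies the upper 16 bits of an IEEE 754 binary32 value. */
static float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: narrow a float to BF16 by truncation (no rounding).
   Only intended for building small, exactly representable inputs. */
static uint16_t f32_to_bf16_trunc(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* Scalar model of vbfdotq_f32(r, a, b), updating r in place:
   r[e] += a[2e] * b[2e] + a[2e+1] * b[2e+1] for e = 0..3.
   Assumption: plain float arithmetic stands in for BFDotAdd. */
static void bfdot_model(float r[4], const uint16_t a[8], const uint16_t b[8]) {
    assert(r && a && b);
    for (int e = 0; e < 4; ++e) {
        float p0 = bf16_to_f32(a[2 * e + 0]) * bf16_to_f32(b[2 * e + 0]);
        float p1 = bf16_to_f32(a[2 * e + 1]) * bf16_to_f32(b[2 * e + 1]);
        r[e] += p0 + p1;
    }
}
```

On a target with the BF16 extension, the corresponding call would be `vbfdotq_f32(r, a, b)` from `<arm_neon.h>`, with `r` of type `float32x4_t` and `a`, `b` of type `bfloat16x8_t`, as listed in the table above.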
