SIMD ISA: Neon
Return Type: float32x4_t
Name: vbfmmlaq_f32
Arguments: (float32x4_t r, bfloat16x8_t a, bfloat16x8_t b)
Instruction Group: Vector arithmetic / Matrix multiply
Description
BFloat16 floating-point matrix multiply-accumulate into 2x2 matrix. This instruction multiplies the 2x4 matrix of BF16 values held in the first 128-bit source vector by the 4x2 BF16 matrix in the second 128-bit source vector. The resulting 2x2 single-precision matrix product is then added destructively to the 2x2 single-precision matrix in the 128-bit destination vector. This is equivalent to performing a 4-way dot product per destination element. The instruction ignores the FPCR and does not update the FPSR exception status.
Results
Vd.4S result
This intrinsic compiles to the following instructions:

BFMMLA Vd.4S,Vn.8H,Vm.8H

Argument Preparation
r register: Vd.4S
a register: Vn.8H
b register: Vm.8H
Architectures
A32, A64

Operation

CheckFPAdvSIMDEnabled64();
bits(128) op1 = V[n];
bits(128) op2 = V[m];
bits(128) acc = V[d];

V[d] = BFMatMulAdd(acc, op1, op2);
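
The pseudocode above leaves BFMatMulAdd unexpanded. The following is a minimal scalar sketch in C of the arithmetic the Description implies, not the architectural definition. It makes two assumptions: that the 2x4 matrix in `a` is stored row by row and the 4x2 matrix in `b` column by column (so each destination element is a 4-way dot product of four contiguous BF16 values from each source), and that ordinary `float` arithmetic is an acceptable stand-in for the instruction's own rounding and denormal handling, so results are not guaranteed to be bit-exact. The names `bf16_to_float` and `bfmmla_ref` are hypothetical helpers introduced for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Widen a raw BF16 bit pattern to float. BF16 occupies the upper
 * 16 bits of an IEEE-754 binary32 value, so widening is a shift. */
static float bf16_to_float(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical scalar model: r (2x2, row-major) += A (2x4) * B (4x2).
 * Assumed layout: a[4*i + k] is A[i][k], b[4*j + k] is B[k][j].
 * Plain float arithmetic is used, so rounding may differ from hardware. */
static void bfmmla_ref(float r[4], const uint16_t a[8], const uint16_t b[8])
{
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            float sum = r[2 * i + j];
            /* 4-way dot product of row i of A with column j of B */
            for (int k = 0; k < 4; k++) {
                sum += bf16_to_float(a[4 * i + k]) * bf16_to_float(b[4 * j + k]);
            }
            r[2 * i + j] = sum;
        }
    }
}
```

With A = [[1, 2, 3, 4], [1, 1, 1, 1]], B holding the columns (1, 0, 0, 0) and (0, 1, 0, 0), and an accumulator of {10, 20, 30, 40}, this model yields {11, 22, 31, 41}.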