GitHub - maderix/ANE: Training neural networks on Apple Neural Engine via reverse-engineered private APIs

maderix
2026.03.03
· GitHub · by 이호민
#ANE #Apple Neural Engine #Machine Learning #Performance Benchmarking #Reverse Engineering

Key Points

  • This research project reverse-engineers Apple Neural Engine (ANE) private APIs to enable direct neural network training on Apple Silicon, demonstrating that the ANE is capable of computation beyond inference.
  • By using the undocumented `_ANEClient` and `_ANECompiler` APIs and generating custom MIL programs, the project executes full backpropagation for a transformer layer, with some operations handled by the ANE and others by the CPU.
  • The project is a proof of concept, achieving 11.2% ANE utilization (1.78 TFLOPS sustained) on an M4 chip for a single transformer layer, while acknowledging current limitations such as low peak utilization and partial CPU fallback.

This research project, titled "ANE Training – Backpropagation on Apple Neural Engine," demonstrates the feasibility of training neural networks directly on Apple's Neural Engine (ANE) via reverse-engineered private APIs. It deliberately bypasses Apple's inference-only restriction for the ANE, operating without CoreML, Metal, or the GPU and relying solely on ANE compute. The primary goal is to show that the ANE, a 15.8 TFLOPS (M4) inference accelerator, is hardware-capable of training, and that software support has been the historical barrier.

The core methodology involves reverse-engineering Apple's private, undocumented APIs, specifically `_ANEClient` and `_ANECompiler`, along with the Model Intermediate Language (MIL) format. This enables execution of custom compute graphs, including backpropagation, directly on the ANE hardware.

The training loop implements a single transformer layer (dimension = 768, sequence length = 512). It achieves 9.3 ms/step with 11.2% ANE utilization (1.78 TFLOPS sustained) on an M4 chip. Each training step issues 6 ANE kernel dispatches covering both the forward and backward passes.
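
The reported figures can be sanity-checked with simple arithmetic (numbers taken from the article; nothing here is project code):

```python
# Cross-check the reported throughput and utilization numbers.
peak_tflops = 15.8        # M4 ANE peak (FP16), per the article
sustained_tflops = 1.78   # reported sustained throughput
step_seconds = 9.3e-3     # reported time per training step

utilization = sustained_tflops / peak_tflops
flop_per_step = sustained_tflops * 1e12 * step_seconds  # work done per step

print(f"utilization: {utilization:.1%}")          # matches the ~11% claim
print(f"FLOP per step: {flop_per_step/1e9:.1f} GFLOP")
```

The utilization claim (1.78 / 15.8 ≈ 11.3%) is consistent with the reported 11.2% figure.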

The architecture decomposes the transformer training into the following ANE kernels:

  1. kFwdAttn: Performs RMSNorm, QKV projection, Scaled Dot-Product Attention (SDPA), and output projection.
  2. kFwdFFN: Executes RMSNorm and the SwiGLU Feed-Forward Network (FFN).
  3. kFFNBwd: Computes the FFN backward pass (W2^T + SiLU backward + W1^T + W3^T).
  4. kSdpaBwd1: Handles Wo^T and the first part of the SDPA backward pass (computing ∂V, ∂probs, ∂dp).
  5. kSdpaBwd2: Completes the SDPA backward pass, including the softmax gradient, ∂Q, and ∂K.
  6. kQKVb: Performs the QKV backward pass, calculating Wq^T + Wk^T + Wv^T → ∂x.

The CPU manages RMSNorm backward computations, residual connections, loss computation, weight gradient accumulation (via cblas_sgemm), and Adam optimizer updates.
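
As a rough illustration of the CPU side (a NumPy sketch under assumed shapes, not the project's actual Objective-C code), the weight-gradient GEMM that `cblas_sgemm` performs and a standard Adam update look like this:

```python
import numpy as np

# Toy shapes (hypothetical, for illustration only).
rng = np.random.default_rng(0)
S, D = 8, 4                                            # sequence length, model dim
x  = rng.standard_normal((S, D)).astype(np.float32)    # layer input activation
dy = rng.standard_normal((S, D)).astype(np.float32)    # upstream gradient from the ANE
W  = rng.standard_normal((D, D)).astype(np.float32)    # weight matrix

# Weight gradient: the role played by cblas_sgemm on the CPU (dW = dy^T @ x).
dW = dy.T @ x

# One standard Adam step on the accumulated gradient.
m = np.zeros_like(W)
v = np.zeros_like(W)
lr, b1, b2, eps, t = 1e-3, 0.9, 0.999, 1e-8, 1
m = b1 * m + (1 - b1) * dW
v = b2 * v + (1 - b2) * dW**2
m_hat = m / (1 - b1**t)
v_hat = v / (1 - b2**t)
W -= lr * m_hat / (np.sqrt(v_hat) + eps)
```

After such an update, the new weights are re-embedded into the ANE program for the next batch (see the recompilation step below).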

Key optimizations implemented to improve performance include:

  • Channel-first CPU layout: Matches the ANE IOSurface [1, C, 1, S] format, eliminating transpose overhead.
  • vDSP vectorized RMSNorm: Accelerates the RMSNorm computation significantly.
  • GCD async cblas overlap: dW gradient sgemm calls run on a serial dispatch queue, overlapping with ANE evaluations.
  • Deferred cblas wait: The wait for cblas completion is pushed into the next step's forward pass to maximize overlap.
  • ANE RMSNorm fusion: RMSNorm operations are folded into the forward kernels as MIL operations (reduce_sum + pow + mul).
  • Wo^T fusion: The output-projection backward operation is merged into the SDPA backward kernel, reducing the kernel count from 7 to 6.
  • Forward taps: Intermediate activations (Q, K, V, attention scores, hidden states) are exposed as concatenated outputs to avoid CPU recomputation.
  • exec() restart: A checkpoint-and-restart workaround for an approximately 119-compiles-per-process ANE limit.
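
For reference, the RMSNorm that both the vDSP path and the fused MIL ops (reduce_sum + pow + mul) compute can be written as a minimal NumPy sketch (illustrative, not the project's code):

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    """RMSNorm over the last axis: x / rms(x) * gain.
    Mirrors the reduce_sum -> pow(-0.5) -> mul decomposition used in the MIL fusion."""
    ms = np.mean(x * x, axis=-1, keepdims=True)  # reduce_sum(x^2) / n
    return x * (ms + eps) ** -0.5 * gain         # pow, then elementwise mul

# Shapes matching the article's layer: sequence 512, dimension 768.
x = np.random.default_rng(1).standard_normal((512, 768)).astype(np.float32)
y = rmsnorm(x, np.ones(768, dtype=np.float32))
```

With unit gain, each output row has an RMS of approximately 1, which is easy to check when validating a vectorized implementation against a reference.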

The system's operation involves:

  1. MIL Generation: Objective-C code dynamically constructs the MIL program text at runtime. This specifies low-level operations such as convolutions (for linear layers), matrix multiplications (matmul) for attention, softmax, and element-wise operations.
  2. In-Memory Compilation: The `_ANEInMemoryModelDescriptor` API compiles the generated MIL text and associated weight blobs directly into executable ANE programs in memory, avoiding the need for on-disk .mlmodelc files.
  3. IOSurface I/O: Input and output tensors are exchanged with the ANE via IOSurface shared memory, adhering to a [1, channels, 1, spatial] format using FP16 precision.
  4. Weight Embedding: Weights are baked into the ANE programs as BLOBFILE constants. When weights are updated (e.g., after an optimizer step), the ANE programs are recompiled for the next batch.
  5. Gradient Flow: Intermediate results from the forward pass ("forward taps") are exposed as additional outputs of the ANE kernels and consumed by subsequent backward kernels. Input gradients (∂x) are computed on the ANE, while weight gradients (∂W) are offloaded to the CPU and computed using Apple's Accelerate framework (cblas).
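
The [1, channels, 1, spatial] FP16 layout in step 3 can be sketched as a pure array transform (a NumPy illustration; the real path writes into IOSurface shared memory rather than a NumPy buffer):

```python
import numpy as np

# A (sequence, channels) FP32 activation, as the CPU side might hold it.
S, C = 512, 768
act = np.random.default_rng(2).standard_normal((S, C)).astype(np.float32)

# Channel-first packing: transpose (S, C) -> (C, S), then shape to [1, C, 1, S]
# and convert to FP16, matching the ANE's IOSurface tensor format.
packed = act.T.reshape(1, C, 1, S).astype(np.float16)

# Round-trip back to the CPU layout (loses only FP16 precision).
unpacked = packed.reshape(C, S).T.astype(np.float32)
```

Keeping the CPU-side buffers in this channel-first layout from the start is what the "channel-first CPU layout" optimization above refers to: the transpose disappears entirely.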

Current limitations include the ANE hardware ignoring the attn_mask in SDPA operations, which forces causal attention to be decomposed: Q@K^T on the ANE, masking on the CPU, then softmax and scores@V back on the ANE. The ~119-compile limit is worked around by exec() restarts with checkpointing. The project currently trains only a single transformer layer and uses synthetic data for benchmarking, with support for real tokenized data under development.
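
The masking workaround amounts to the following four-stage pipeline (a NumPy sketch; on the real system stages 1, 3, and 4 run on the ANE and only the masking runs on the CPU):

```python
import numpy as np

rng = np.random.default_rng(3)
S, d = 6, 4  # toy sequence length and head dimension
Q, K, V = (rng.standard_normal((S, d)).astype(np.float32) for _ in range(3))

# Stage 1 (ANE): raw scaled attention scores, Q @ K^T.
scores = Q @ K.T / np.sqrt(d)

# Stage 2 (CPU): apply the causal mask manually, since the ANE ignores attn_mask.
mask = np.triu(np.ones((S, S), dtype=bool), k=1)  # future positions
scores[mask] = -np.inf

# Stage 3 (ANE): numerically stable softmax over the key axis.
probs = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)

# Stage 4 (ANE): weighted sum of values, scores @ V.
out = probs @ V
```

The extra ANE round-trips for the CPU masking step are part of why peak utilization remains low on attention-heavy workloads.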

This project is research-oriented, using private and undocumented Apple APIs under fair use for educational and interoperability purposes. It is explicitly not a production framework or a replacement for existing ML libraries.