GitHub - microsoft/BitNet: Official inference framework for 1-bit LLMs


microsoft
2025.04.20
· GitHub · by Anonymous
#LLM #1-bit LLM #Inference Framework #BitNet #cpp

Key Points

  • bitnet.cpp is Microsoft's official inference framework designed to enable fast and lossless inference of 1-bit Large Language Models, such as BitNet b1.58, on CPU and upcoming NPU platforms.
  • The framework achieves substantial performance improvements, including speedups of up to 6.17x and energy reductions of up to 82.2% on x86 CPUs, making it possible to run a 100B model on a single CPU at human reading speed.
  • Built upon elements of llama.cpp and T-MAC, bitnet.cpp aims to significantly enhance the potential for efficient local deployment of LLMs and to inspire further development of large-scale 1-bit models.

bitnet.cpp is an official inference framework developed by Microsoft for highly quantized Large Language Models (LLMs), specifically focusing on 1-bit LLMs such as BitNet b1.58 models. The primary objective of bitnet.cpp is to enable fast, lossless, and energy-efficient inference of these ultra-low-precision models on edge devices, starting with CPUs (ARM and x86) and with future support planned for NPUs.

The core methodology of bitnet.cpp revolves around a suite of highly optimized kernels tailored for the bit-wise operations inherent in 1-bit and 1.58-bit quantized LLMs. These optimizations leverage the specific properties of low-bit quantization, enabling significant performance gains and energy reductions compared to traditional inference methods. A crucial technical foundation for bitnet.cpp's kernels is the use of Lookup Table (LUT) methodologies, as pioneered in projects like T-MAC. This approach transforms computationally intensive matrix multiplications involving extremely low-bit (e.g., binary or ternary) weights and activations into efficient table lookups or specialized bitwise operations. For instance, in a 1-bit neural network, a multiplication of a binary weight w ∈ {−1, +1} and a binary activation a ∈ {−1, +1} can be replaced by XOR operations and popcounts (counting set bits), or mapped directly to a pre-computed sum based on the input bit patterns, which is considerably faster than floating-point arithmetic. This design principle allows bitnet.cpp to achieve "lossless inference," meaning that the computational speedup does not come at the cost of accuracy degradation for the already quantized models.
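As a minimal illustration of the XOR-and-popcount trick described above (a sketch under assumed conventions, not bitnet.cpp's actual kernel): if two 64-element {−1, +1} vectors are each packed into a 64-bit word with bit 1 encoding +1 and bit 0 encoding −1, then their dot product is 64 − 2 · popcount(a XOR w), and the popcount itself can be served from a small precomputed table, which is the LUT idea in miniature.

```cpp
#include <array>
#include <cstdint>

// Precomputed 256-entry popcount table: one lookup per byte of the XOR result.
static const std::array<uint8_t, 256> kPopcount = [] {
    std::array<uint8_t, 256> t{};
    for (int i = 0; i < 256; ++i)
        t[i] = static_cast<uint8_t>(__builtin_popcount(i));
    return t;
}();

// Dot product of two 64-element {-1,+1} vectors, each packed into a 64-bit
// word (bit = 1 encodes +1, bit = 0 encodes -1). Matching bits contribute +1,
// mismatching bits -1, so dot = 64 - 2 * (#mismatches).
int binary_dot(uint64_t a_bits, uint64_t w_bits) {
    uint64_t x = a_bits ^ w_bits;  // set bits mark sign mismatches
    int mismatches = 0;
    for (int b = 0; b < 8; ++b)    // one table lookup per byte
        mismatches += kPopcount[(x >> (8 * b)) & 0xFF];
    return 64 - 2 * mismatches;
}
```

Real kernels batch these operations over whole weight matrices with SIMD, but the principle is the same: no floating-point multiplies are needed at all.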

The framework reports substantial performance improvements and energy efficiency:

  • On ARM CPUs, bitnet.cpp achieves speedups ranging from 1.37x to 5.07x, with larger models demonstrating greater performance gains. Concurrently, it reduces energy consumption by 55.4% to 70.0%.
  • On x86 CPUs, speedups range from 2.37x to 6.17x, accompanied by energy reductions between 71.9% and 82.2%.
  • Notably, bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving inference speeds comparable to human reading (5-7 tokens per second), which significantly broadens the scope for deploying LLMs on local devices.

bitnet.cpp supports various 1-bit LLM configurations available on Hugging Face, including BitNet-b1.58-2B-4T, bitnet_b1_58-large (0.7B), bitnet_b1_58-3B (3.3B), Llama3-8B-1.58-100B-tokens (8.0B), and Falcon3/Falcon-E families (1B-10B models), across both x86 and ARM architectures. It provides specific quantization types such as i2_s (likely integer 2-bit signed) and tl1 (likely ternary lookup 1-bit), indicating flexibility in handling different low-bit precision schemes. The project acknowledges its foundational reliance on the llama.cpp framework and the T-MAC methodology for its optimized kernels.
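To make the low-bit storage concrete, here is a hypothetical sketch of how ternary weights in {−1, 0, +1} could be packed as signed 2-bit codes, four per byte; the actual i2_s on-disk layout in bitnet.cpp may differ, so treat this as an illustration of 2-bit signed packing only.

```cpp
#include <cstdint>

// Pack four ternary weights in {-1, 0, +1} into one byte, two bits each
// (two's-complement 2-bit codes). NOTE: illustrative layout, not necessarily
// the real i2_s format.
uint8_t pack4(const int8_t w[4]) {
    uint8_t byte = 0;
    for (int i = 0; i < 4; ++i)
        byte |= static_cast<uint8_t>(w[i] & 0x3) << (2 * i);
    return byte;
}

// Recover the four ternary weights from the packed byte.
void unpack4(uint8_t byte, int8_t out[4]) {
    for (int i = 0; i < 4; ++i) {
        int8_t code = (byte >> (2 * i)) & 0x3;
        // Sign-extend the 2-bit code: values 2..3 map to -2..-1.
        out[i] = (code & 0x2) ? static_cast<int8_t>(code - 4) : code;
    }
}
```

A 2-bit code wastes one of its four states on ternary data; finer packings (e.g., five ternary values per byte, since 3^5 = 243 ≤ 256) trade decode simplicity for density, which is one reason multiple kernel/quantization variants like tl1 exist.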