GitHub - Leonheart0910/airllm: AirLLM 70B inference with single 4GB GPU : an extremely efficient inference engine for serving
Key Points
- AirLLM is an inference engine designed to run large language models (e.g., 70B) on a single low-VRAM GPU (e.g., 4GB) by optimizing memory usage through layer-wise decomposition, avoiding the need for upfront quantization or distillation.
- It offers optional block-wise quantization (4-bit/8-bit) for up to 3x inference speed-up, supports CPU inference, includes prefetching for efficiency, and works on macOS.
- The tool supports a broad range of models, including Llama2, Llama3 (up to 405B on 8GB VRAM), Qwen, ChatGLM, Baichuan, Mistral, and InternLM, providing a solution for resource-constrained environments.
AirLLM is an inference engine that enables large language models (LLMs), such as 70B-parameter models, to run on commodity hardware with limited GPU memory (e.g., a single 4GB GPU); it can even run the 405B Llama 3.1 on 8GB of VRAM. Its core methodology focuses on optimizing memory usage by dynamically managing model parameters rather than loading the entire model into GPU VRAM at once.
The primary technical approach involves layer-wise decomposition and on-demand loading of the model. Upon initialization, the original LLM is first decomposed (or "sharded") into its individual layers, which are then saved separately to disk. During inference, instead of residing entirely in GPU memory, only the specific layers required for the current computational step are loaded from disk (or system RAM) into the GPU. Once computation on a layer is complete, its memory can be freed or overwritten for subsequent layers. This technique, effectively a form of offloading or CPU/disk-to-GPU parameter streaming, drastically reduces the peak VRAM footprint required for inference.
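The streaming idea can be sketched in a few lines. The toy example below (file names, shapes, and the ReLU "layers" are invented for illustration, not AirLLM's internals) shards a model's weights to per-layer files on disk, then runs a forward pass that keeps only one layer's weights in memory at a time:

```python
import os
import tempfile

import numpy as np

# Shard a toy "model" into per-layer weight files, mimicking the
# decomposition step AirLLM performs at initialization.
rng = np.random.default_rng(0)
hidden = 8
num_layers = 4

shard_dir = tempfile.mkdtemp()
for i in range(num_layers):
    np.save(os.path.join(shard_dir, f"layer_{i}.npy"),
            rng.normal(size=(hidden, hidden)))

def run_inference(x):
    # Only one layer's weights are resident in memory at any point:
    # load from disk, compute, then free before the next layer.
    for i in range(num_layers):
        w = np.load(os.path.join(shard_dir, f"layer_{i}.npy"))
        x = np.maximum(x @ w, 0.0)
        del w
    return x

out = run_inference(np.ones(hidden))
print(out.shape)  # (8,)
```

Peak memory is bounded by the largest single layer plus activations, which is why a 70B model can fit through a 4GB GPU at the cost of repeated disk reads.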
To mitigate the performance overhead of frequent data transfers between disk/CPU and GPU, AirLLM incorporates prefetching. This mechanism overlaps the loading of upcoming model layers with the computation of the current layer, yielding roughly a 10% speedup.
While the initial design aimed to run models without traditional quantization for memory reduction, AirLLM introduces an optional block-wise quantization feature (4-bit or 8-bit) primarily to accelerate inference speed by up to 3x. This compression is applied specifically to the weights during disk storage. The rationale is that the primary bottleneck in this memory-optimized setup is disk loading speed: by reducing the size of the model weights on disk, less data needs to be transferred, which speeds up disk I/O. Crucially, this quantization targets only the weights and is distinct from full quantization techniques that also quantize activations for on-GPU memory reduction, which carry a higher risk of accuracy degradation.
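A minimal sketch of block-wise 8-bit weight quantization, assuming a symmetric absmax scheme with one scale per block (the block size and exact format here are illustrative, not AirLLM's storage layout):

```python
import numpy as np

BLOCK = 64  # assumed block size; each block gets its own fp32 scale

def quantize_blockwise(w):
    # Symmetric absmax quantization: int8 codes plus one scale per block.
    flat = w.astype(np.float32).ravel()
    pad = (-len(flat)) % BLOCK
    blocks = np.pad(flat, (0, pad)).reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales.astype(np.float32), w.shape

def dequantize_blockwise(q, scales, shape):
    flat = (q.astype(np.float32) * scales).ravel()
    return flat[:int(np.prod(shape))].reshape(shape)

w = np.random.default_rng(0).normal(size=(256, 256)).astype(np.float32)
q, s, shape = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s, shape)
print(q.nbytes / w.nbytes)                    # 0.25: 4x less weight data to read
print(float(np.abs(w - w_hat).max()) < 0.05)  # True: small per-weight error
```

Because only the stored weights are compressed and they are dequantized before the matmul, activations remain in full precision, matching the weight-only scheme described above.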
AirLLM provides a unified AutoModel interface similar to Hugging Face Transformers, allowing users to load and infer various models including Llama2, Llama3, Qwen, ChatGLM, Baichuan, Mistral, InternLM, Mixtral, and Qwen2.5. It supports CPU inference and macOS (Apple Silicon) via the mlx and PyTorch backends. The system requires sufficient disk space for the initial layer-wise decomposition of the model.
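Typical usage follows the Hugging Face pattern; the snippet below is a sketch based on AirLLM's documented interface (the model ID and generation arguments are illustrative, a CUDA GPU is assumed, and the first run downloads and shards the checkpoint, so substantial disk space is needed):

```python
from airllm import AutoModel

# Loads the model layer-sharded; passing compression='4bit' or '8bit'
# enables the optional block-wise weight quantization for faster disk I/O.
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

input_text = ["What is the capital of the United States?"]
input_tokens = model.tokenizer(
    input_text, return_tensors="pt", truncation=True, max_length=128
)
generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(generation_output.sequences[0]))
```

The tokenizer and `generate` call mirror the Transformers API, which is what lets the same script swap between the supported model families with only the model ID changing.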