GitHub - deepseek-ai/FlashMLA: FlashMLA: Efficient Multi-head Latent Attention Kernels

deepseek-ai
2025.03.08
· GitHub · by Anonymous
#Attention Kernels #LLM #Sparse Attention #Deep Learning #GPU Computing

Key Points

  • FlashMLA is DeepSeek's optimized library of attention kernels, powering the DeepSeek-V3 and DeepSeek-V3.2-Exp models with highly efficient dense and sparse attention implementations.
  • It delivers substantial performance gains in both the prefill and decoding stages, achieving up to 660 TFLOPS on NVIDIA H800 GPUs and featuring token-level sparse attention with FP8 KV cache support.
  • The library provides kernels for the SM90/SM100 architectures, including MQA and MHA modes, and is supported on various other GPU platforms through community collaborations.

FlashMLA is DeepSeek's optimized library of attention kernels, designed to power large language models such as DeepSeek-V3 and DeepSeek-V3.2-Exp. It provides implementations of both sparse and dense attention, covering the prefill and decoding stages, and is optimized for NVIDIA GPU architectures (SM90/SM100).

The core methodology of FlashMLA revolves around highly optimized CUDA kernels that aim to maximize throughput (TFLOPS) for compute-bound workloads and memory bandwidth (GB/s) for memory-bound configurations.
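To make the compute-bound vs. memory-bound distinction concrete, a roofline-style back-of-envelope check classifies a workload by comparing its arithmetic intensity with the machine balance. The peak figures below are illustrative assumptions for an H800-class GPU, not numbers taken from FlashMLA:

```python
# Roofline-style sketch: a kernel is compute-bound when its arithmetic
# intensity (FLOPs per byte of memory traffic) exceeds the machine balance.
# PEAK_* values are illustrative assumptions, not FlashMLA measurements.
PEAK_TFLOPS = 990.0   # assumed BF16 tensor-core peak, TFLOPS
PEAK_GBPS = 3350.0    # assumed HBM bandwidth, GB/s

def bound_kind(flops: float, bytes_moved: float) -> str:
    """Classify a workload as compute-bound or memory-bound."""
    intensity = flops / bytes_moved                      # FLOPs per byte
    balance = (PEAK_TFLOPS * 1e12) / (PEAK_GBPS * 1e9)   # ~295 FLOPs/byte
    return "compute-bound" if intensity > balance else "memory-bound"

# Long-sequence prefill reuses each KV byte across many queries, giving high
# intensity; single-token decoding streams the whole KV cache every step.
assert bound_kind(flops=1e12, bytes_moved=1e9) == "compute-bound"
assert bound_kind(flops=1e9, bytes_moved=1e9) == "memory-bound"
```

This split is why the figures below quote TFLOPS for compute-bound configurations and GB/s for memory-bound decoding.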

1. Sparse Attention Kernels:
These kernels implement DeepSeek Sparse Attention (DSA), which is token-level sparse.

  • Decoding Stage: Utilizes flash_mla_with_kvcache.
    • FP8 KV Cache: For decoding, the kernel supports an FP8 KV cache, where matrix multiplication is performed in bfloat16 after dequantization. The FP8 KV cache format per token (656 Bytes) consists of:
      • First 512 bytes: "quantized NoPE" (512 float8_e4m3 values).
      • Next 16 bytes: Scale factors (4 float32 values, each scaling 128 float8_e4m3 values).
      • Last 128 bytes: "RoPE" (64 bfloat16 values), not quantized for accuracy.
    • Sparsity (indices): Token-level sparsity is enabled by an indices tensor of shape (batch_size, seq_len_q, topk). indices[i][j][k] specifies the physical memory location (block index * page_block_size + offset within block) of the k-th relevant token for the j-th query in the i-th batch. Invalid entries are marked with -1.
    • Input: q_i (query), kvcache_i (KV cache), block_table, cache_seqlens, dv, tile_scheduler_metadata, num_splits, is_causal, is_fp8_kvcache, indices.
    • Output: (out, lse), where out is the attention result and lse is the log-sum-exp of attention scores for each query head.
    • Performance: Achieves 410 TFLOPS (compute-bound) on H800 SXM5 for FP8 sparse decoding.
  • Prefill Stage: Uses flash_mla_sparse_fwd.
    • Inputs: q (query, shape [S_q, H_q, D_{qk}]), kv (key-value, shape [S_{kv}, H_{kv}, D_{qk}]), indices (shape [S_q, H_{kv}, topk]), sm_scale. Note: this kernel does not support a batch dimension directly; multi-batch inference requires reshaping inputs and indices. Invalid indices are -1 or ≥ S_{kv}.
    • Output: (out, max_logits, lse).
    • Equivalent PyTorch Operations: Given query Q \in \mathbb{R}^{S_q \times H_q \times D_{qk}}, key-value KV \in \mathbb{R}^{S_{kv} \times H_{kv} \times D_{qk}}, and indices I \in \mathbb{Z}^{S_q \times H_{kv} \times \text{topk}} (assuming H_{kv} = 1 for simplicity):
      1. Select focused KV tokens KV_{focused} \in \mathbb{R}^{S_q \times \text{topk} \times D_{qk}}, where KV_{focused}[i, k, :] = KV[I[i, k], :].
      2. Compute logits P \in \mathbb{R}^{S_q \times H_q \times \text{topk}}:
         P = (Q @ KV_{focused}^T) \cdot \text{sm\_scale} \cdot \log_2(e)
      3. Compute maximum logits M \in \mathbb{R}^{S_q \times H_q}:
         M = \max_{\text{dim}=-1}(P)
      4. Compute the log-sum-exp L \in \mathbb{R}^{S_q \times H_q} (base 2):
         L_{ij} = \log_2 \left( \sum_{k=1}^{\text{topk}} 2^{P_{ijk}} \right)
      5. Compute sparse attention scores S \in \mathbb{R}^{S_q \times H_q \times \text{topk}}:
         S = 2^{(P - L)}
      6. Compute the output O \in \mathbb{R}^{S_q \times H_q \times D_{qk}}:
         O = S @ KV_{focused}
      The kernel returns (O, M, L).
  • Performance: Achieves up to 640 TFLOPS on H800 SXM5 and 1450 TFLOPS on B200.
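The six equivalent operations above can be sketched directly in NumPy. `sparse_prefill_ref` is a hypothetical helper name for illustration, with H_kv = 1 as assumed above:

```python
import numpy as np

def sparse_prefill_ref(q, kv, indices, sm_scale):
    """NumPy sketch of the equivalent operations above (H_kv = 1).

    q:       (S_q, H_q, D_qk)   queries
    kv:      (S_kv, 1, D_qk)    key-value tokens
    indices: (S_q, 1, topk)     token indices; -1 or >= S_kv marks invalid
    Returns (O, M, L), with logits and lse in base 2 as the kernel does.
    """
    s_kv = kv.shape[0]
    idx = indices[:, 0, :]                                  # (S_q, topk)
    invalid = (idx < 0) | (idx >= s_kv)
    # 1. gather focused KV tokens (clip keeps the gather in bounds;
    #    invalid positions are masked out of the logits below)
    kv_focused = kv[np.clip(idx, 0, s_kv - 1), 0, :]        # (S_q, topk, D_qk)
    # 2. base-2 logits: P = (Q @ KV_focused^T) * sm_scale * log2(e)
    p = np.einsum('qhd,qkd->qhk', q, kv_focused) * sm_scale / np.log(2.0)
    p = np.where(invalid[:, None, :], -np.inf, p)
    # 3. max logits and 4. base-2 log-sum-exp (computed stably around M)
    m = p.max(axis=-1)                                      # (S_q, H_q)
    l = m + np.log2(np.exp2(p - m[..., None]).sum(axis=-1))
    # 5. normalized sparse attention scores and 6. output
    s = np.exp2(p - l[..., None])
    o = np.einsum('qhk,qkd->qhd', s, kv_focused)
    return o, m, l
```

Since S = 2^(P - L) equals the softmax of the natural-base logits, O matches standard softmax attention restricted to the selected tokens; working in base 2 is a common kernel trick because exp2 maps to a cheap hardware instruction.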
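The 656-byte-per-token FP8 KV cache layout described in the decoding section can be unpacked as follows. This is a NumPy sketch with a hand-rolled float8_e4m3 decoder and a hypothetical helper name (`unpack_fp8_kv_token`); the real kernel of course uses native GPU FP8 support and dequantizes to bfloat16 on chip:

```python
import numpy as np

def decode_e4m3(b: int) -> float:
    """Decode one float8_e4m3fn byte (1 sign, 4 exponent, 3 mantissa bits)."""
    s = -1.0 if b & 0x80 else 1.0
    e, m = (b >> 3) & 0xF, b & 0x7
    if e == 0:                                  # subnormal
        return s * (m / 8.0) * 2.0 ** -6
    if e == 0xF and m == 0x7:                   # e4m3fn: NaN, no infinities
        return float('nan')
    return s * (1.0 + m / 8.0) * 2.0 ** (e - 7)

def unpack_fp8_kv_token(raw: bytes):
    """Split one 656-byte FP8 KV cache token into dequantized NoPE + RoPE."""
    assert len(raw) == 656
    buf = np.frombuffer(raw, dtype=np.uint8)
    # first 512 bytes: quantized NoPE (512 float8_e4m3 values)
    nope = np.array([decode_e4m3(int(b)) for b in buf[:512]], dtype=np.float32)
    # next 16 bytes: 4 float32 scale factors, one per 128 NoPE values
    scales = np.frombuffer(buf[512:528].tobytes(), dtype=np.float32)
    nope = (nope.reshape(4, 128) * scales[:, None]).reshape(512)
    # last 128 bytes: 64 bfloat16 RoPE values (kept unquantized for accuracy);
    # widen bf16 -> float32 by placing the 16 bits in the high half
    rope_u16 = np.frombuffer(buf[528:].tobytes(), dtype=np.uint16)
    rope = (rope_u16.astype(np.uint32) << 16).view(np.float32)
    return nope, rope
```

Note how the split mirrors the layout above: 512 quantized dims, four per-128-element scales, and a 64-element RoPE tail left in bfloat16.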

2. Dense Attention Kernels:
These implement standard Multi-Head Attention (MHA).

  • Prefill Stage: Leverages functions like flash_attn_varlen_func, flash_attn_varlen_qkvpacked_func, and flash_attn_varlen_kvpacked_func, similar in usage to the flash_attn library.
    • Performance: Achieves up to 1460 TFLOPS (forward) and 1000 TFLOPS (backward) on B200 (reported by NVIDIA).
  • Decoding Stage: Achieves up to 3000 GB/s (memory-bound) and 660 TFLOPS (compute-bound) on H800 SXM5.
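As a rough illustration of why decoding is memory-bound: each decode step must stream the KV cache of every past token, so latency has a floor of KV bytes divided by bandwidth. The byte counts below follow from the layouts in this document, and the bandwidth defaults to the 3000 GB/s figure quoted above; real kernels also read weights and write outputs, so this is only a lower bound:

```python
# Lower-bound estimate of per-step decode latency from KV-cache traffic alone.
BF16_TOKEN_BYTES = 576 * 2   # head_dim_k = 576 at 2 bytes per element
FP8_TOKEN_BYTES = 656        # FP8 layout: 512 e4m3 + 16 scale + 128 bf16 bytes

def decode_step_floor_ms(batch: int, seq_len: int,
                         bytes_per_token: int, gbps: float = 3000.0) -> float:
    """Minimum milliseconds per decode step, assuming the full KV cache of
    every sequence in the batch is read once per generated token."""
    total_bytes = batch * seq_len * bytes_per_token
    return total_bytes / (gbps * 1e9) * 1e3

# e.g. batch 64 at 32k context: the FP8 cache moves 656/1152 ≈ 57% of the
# bytes of a BF16 cache, so its latency floor shrinks proportionally.
bf16 = decode_step_floor_ms(64, 32768, BF16_TOKEN_BYTES)
fp8 = decode_step_floor_ms(64, 32768, FP8_TOKEN_BYTES)
```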

Performance Summary & Requirements:
FlashMLA achieves significant performance gains, including a 5%-15% improvement on compute-bound workloads, reaching up to 660 TFLOPS for dense MLA decoding on NVIDIA H800 SXM5 GPUs.

  • Hardware: SM90 (Hopper) and SM100 (Blackwell) architectures.
  • Software: CUDA 12.8+ (12.9+ for SM100) and PyTorch 2.0+.
  • MLA Mode: "MLA Mode" refers to the attention configuration.
    • MQA (Multi-Query Attention): Typically uses head_dim_k = 576 with head_dim_v = 512. Supported for Dense Decoding (SM90, BF16 KV cache), Sparse Decoding (SM90 & SM100, FP8 KV cache), and Sparse Prefill (SM90 & SM100).
    • MHA (Multi-Head Attention): Uses head_dim_k = 192/128 with head_dim_v = 128. Supported for Dense Prefill (SM100).
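The MQA head dimensions are consistent with the FP8 cache layout described earlier: head_dim_k = 576 splits into 512 NoPE dims plus 64 RoPE dims, while the value side uses only the 512-dim part (head_dim_v = 512). A small shape sanity check with illustrative sizes and an un-paged cache layout (the real kernel pages the cache through a block_table):

```python
import numpy as np

# MQA-style MLA shapes: head_dim_k = 576 = 512 NoPE + 64 RoPE dims,
# matching the 512 float8_e4m3 + 64 bfloat16 values per FP8 cache token.
HEAD_DIM_K, HEAD_DIM_V, ROPE_DIMS = 576, 512, 64
assert HEAD_DIM_K == HEAD_DIM_V + ROPE_DIMS

batch, num_q_heads, seq_kv = 2, 128, 1024       # illustrative sizes
q = np.zeros((batch, 1, num_q_heads, HEAD_DIM_K), np.float32)   # 1 decode token
kvcache = np.zeros((batch, seq_kv, 1, HEAD_DIM_K), np.float32)  # shared KV head
out = np.zeros((batch, 1, num_q_heads, HEAD_DIM_V), np.float32) # 512-dim values
```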

FlashMLA draws inspiration from the FlashAttention 2 & 3 and CUTLASS projects, providing a robust and performant attention kernel library.