GitHub - deepseek-ai/FlashMLA: FlashMLA: Efficient Multi-head Latent Attention Kernels

deepseek-ai

2025.03.08

GitHub · by Anonymous

#Attention Kernels #LLM #Sparse Attention #Deep Learning #GPU Computing

Key Points

1. FlashMLA is DeepSeek's library of optimized attention kernels. It powers the DeepSeek-V3 and DeepSeek-V3.2-Exp models and includes both sparse and dense attention kernels.
2. The library reaches up to 410 TFLOPS for sparse decoding with an FP8 KV cache and up to 660 TFLOPS for dense decoding, and it supports the NVIDIA SM90 and SM100 GPU architectures.
3. FlashMLA documents how to use its kernels for MLA decoding and for sparse and dense MLA prefill. It is inspired by the FlashAttention and cutlass projects and can be applied to a range of hardware platforms.
