deepseek-ai/DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Key Points
- DeepGEMM is a CUDA-based library designed for clean and efficient General Matrix Multiplications (GEMMs), supporting FP8 and BF16 data types across dense and MoE grouped scenarios.
- It features a lightweight JIT module for runtime kernel compilation, simplifying its design by focusing on a limited set of core functions without heavy reliance on complex templates from other libraries.
- Despite its simplicity, DeepGEMM achieves competitive performance, matching or exceeding expert-tuned libraries, with reported peak performance of up to 1550 TFLOPS on H800, and includes specialized kernels for tasks like MoE weight gradients and MQA logits.
DeepGEMM is a CUDA-based library for General Matrix Multiplications (GEMMs), emphasizing efficiency and clarity. It supports FP8 and BF16 (work-in-progress) data types for both standard dense GEMMs and grouped scenarios relevant to Mixture-of-Experts (MoE) models. The library is designed to be lightweight, avoiding heavy reliance on complex template metaprogramming from frameworks like CUTLASS, instead focusing on a limited set of core kernel functions to serve as an accessible resource for NVIDIA GPU kernel optimization. Despite its simplified design, DeepGEMM claims to match or exceed the performance of expert-tuned libraries across various matrix shapes.
The core methodology of DeepGEMM revolves around a Just-In-Time (JIT) compilation strategy using a lightweight C++ module, eliminating the need for pre-installation kernel compilation. Initially, it eschewed NVRTC and post-compilation SASS optimizations, relying on NVCC 12.9+ for automatic FFMA interleaving, though NVRTC support was later introduced as an option for faster compilation.
DeepGEMM provides optimized GEMM kernels under a consistent naming convention that encodes each operand's transposition, with dedicated functions covering the dense GEMM cases.
A key distinction lies in memory layout support:
- SM90 (Hopper) architecture: Supports only the NT memory layout (non-transposed A, transposed B).
- SM100 (Blackwell) architecture: Supports all memory layouts (NT, TN, NN, TT).
For all architectures, the Left-Hand Side (LHS) scaling factor is required to be TMA-aligned and transposed. The format of these scaling factors differs by architecture: SM90 requires FP32, while SM100 requires packed UE8M0 format, where four UE8M0 values are packed into a single 32-bit integer. The library expects users to handle input transpositions and FP8 casting independently.
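The SM100 packing described above can be sketched in NumPy. Note the hedges: `fp32_to_ue8m0` and `pack_ue8m0_x4` are illustrative helper names (DeepGEMM ships its own scaling-factor transform utilities), and rounding each FP32 scale up to the nearest power of two is an assumption of this sketch.

```python
import numpy as np

def fp32_to_ue8m0(scales: np.ndarray) -> np.ndarray:
    # UE8M0 stores only an unsigned biased exponent: value = 2**(e - 127).
    # Round each FP32 scale up to the nearest power of two (an assumed
    # rounding choice; the library's exact rounding may differ).
    e = np.ceil(np.log2(scales)).astype(np.int32) + 127
    return np.clip(e, 0, 255).astype(np.uint8)

def pack_ue8m0_x4(ue8m0_bytes: np.ndarray) -> np.ndarray:
    # Pack each group of four UE8M0 bytes into one 32-bit integer
    # (native little-endian byte order on common platforms).
    assert ue8m0_bytes.size % 4 == 0
    return ue8m0_bytes.reshape(-1, 4).view(np.uint32).ravel()

scales = np.array([1.0, 2.0, 0.5, 4.0], dtype=np.float32)
packed = pack_ue8m0_x4(fp32_to_ue8m0(scales))  # one uint32 per 4 scales
```

Because UE8M0 carries only an exponent, power-of-two scales round-trip exactly; arbitrary scales lose their mantissa bits.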
DeepGEMM specifically caters to grouped GEMMs crucial for MoE models, offering two primary grouping schemes:
- Contiguous Layout Grouping: Designed for scenarios like MoE training forward passes or inference prefilling where experts share the same shape. It groups along the M-axis, while N and K dimensions remain fixed. Tokens processed by different experts are concatenated into a single "contiguous" tensor. A critical requirement is that each expert segment within this contiguous tensor must be aligned to the GEMM M block size, which can be retrieved via `get_mk_alignment_for_contiguous_layout()`. A K-axis grouped API (`k_grouped_fp8_gemm_tn_contiguous`) is also provided for MoE weight gradient computations.
- Masked Layout Grouping: Applicable during inference decoding, particularly when CUDA graphs are enabled and the CPU is unaware of the exact number of tokens each expert receives. This method employs a mask tensor, allowing the kernel to compute only the valid portions, exemplified by `m_grouped_fp8_gemm_nt_masked`.
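The M-alignment requirement of the contiguous layout can be sketched as a small padding computation. `aligned_segments` is a hypothetical helper, and the alignment value 128 below is a placeholder; the real value must be queried via `get_mk_alignment_for_contiguous_layout()`.

```python
def aligned_segments(tokens_per_expert, m_alignment):
    """Round each expert's token count up to the GEMM M block size and
    return (padded_counts, start_offsets) into the contiguous tensor."""
    padded, offsets, offset = [], [], 0
    for n_tokens in tokens_per_expert:
        # Ceil-divide by the alignment, then scale back up.
        p = -(-n_tokens // m_alignment) * m_alignment
        padded.append(p)
        offsets.append(offset)
        offset += p
    return padded, offsets

# With a placeholder alignment of 128 (query the real value via
# get_mk_alignment_for_contiguous_layout()):
counts, starts = aligned_segments([300, 65, 128], 128)
# counts == [384, 128, 128]; starts == [0, 384, 512]
```

The padding rows between an expert's last real token and its aligned segment end are wasted work, which is why the alignment is kept as small as the kernel's M block size allows.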
Additionally, DeepGEMM includes specialized MQA (Multi-Query Attention) kernels: `fp8_mqa_logits` (non-paged, for prefilling) and `fp8_paged_mqa_logits` (for paged decoding). For the non-paged `fp8_mqa_logits`, given queries `q` (E4M3), keys/values `kv` (E4M3) with a float scaling factor, per-head `weights` (float), and `cu_seq_len_k_start`/`cu_seq_len_k_end` (int), the output logits are computed for each query token `i` and each KV token `j` in the range `[cu_seq_len_k_start[i], cu_seq_len_k_end[i])` as follows:
- `s = q[i] @ kv[j]` (per-head dot products, yielding one score per head)
- `s = s * weights[i]` (element-wise weighting of each head)
- `out[i, j] = sum(s)` (summation over heads, yielding a scalar logit).
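The three steps above can be written as a plain-float NumPy reference. This is a semantic sketch only: `mqa_logits_reference` is a hypothetical name, FP8 quantization and scaling are omitted, and filling out-of-range positions with `-inf` is an assumption of this sketch rather than a documented property of the kernel.

```python
import numpy as np

def mqa_logits_reference(q, kv, weights, k_start, k_end):
    """Float reference for the MQA logits computation sketched above.

    q:       [seq_len, num_heads, head_dim]
    kv:      [seq_len_kv, head_dim]
    weights: [seq_len, num_heads]
    k_start, k_end: per-query valid KV ranges (int sequences)
    """
    seq_len, seq_len_kv = q.shape[0], kv.shape[0]
    # Assumed fill value for positions outside the valid KV range.
    out = np.full((seq_len, seq_len_kv), -np.inf, dtype=np.float32)
    for i in range(seq_len):
        for j in range(k_start[i], k_end[i]):
            s = q[i] @ kv[j]      # per-head dot products -> [num_heads]
            s = s * weights[i]    # element-wise per-head weighting
            out[i, j] = s.sum()   # reduce over heads -> scalar logit
    return out
```

A reference like this is mainly useful for validating the fused FP8 kernel's output against an unquantized baseline on small shapes.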
The library also provides utility functions for managing SM counts, Tensor Core utilization, scaling factor transformations, and TMA alignment queries. Environment variables allow fine-tuning JIT behavior, cache directories, compiler selection (NVRTC vs. NVCC), and debugging output.