GitHub - ziwon/ai-data-center-network: A collection of materials for a collaborative study group on AI data center networks
Key Points
- This GitHub repository serves as a curated collection of resources focused on AI data center network design, engineering, and performance optimization.
- It compiles diverse materials, including books, code, articles, talks, and academic papers, covering critical aspects such as network architecture, InfiniBand vs. RoCEv2 comparisons, and distributed AI training.
- The collection also addresses GPU architectures, efficient LLM inference systems, and broader considerations for building robust AI/ML infrastructure in data centers.
This document is a GitHub repository README file titled "AI Data Center Network," serving as a curated collection of study materials and resources pertinent to the design, operation, and optimization of network infrastructure for Artificial Intelligence (AI) and Machine Learning (ML) workloads. It aggregates diverse content types to provide a comprehensive overview of the field.
The repository organizes resources into several key categories:
- Books: Recommended texts covering AI data center network design, deep learning applications for network engineers, and performance engineering for AI systems, including GPU, CUDA, and PyTorch optimization.
- Code: References to practical implementations and concepts such as efficient LLM (Large Language Model) inference systems, building LLMs from scratch, and InfiniBand network architecture.
- Articles: A broad range of contemporary topics including in-depth comparisons of InfiniBand and RoCEv2 for large-scale AI clusters, practical guides to configuring lossless RoCEv2 networks for GPU clusters, discussions on the evolving landscape of fabric technologies, trends in hyperscale AI data centers (e.g., megawatts to gigawatts), and vendor-specific solutions from Juniper, AMD, Broadcom, and Cisco for AI/ML infrastructure. It also covers general data center design requirements and network best practices for AI.
- Talks: Links to presentations on the engineering challenges of training multi-trillion parameter LLMs, detailed analyses of AI network architectures comparing InfiniBand and Ultra Ethernet, and discussions on Remote Direct Memory Access (RDMA).
- Papers: A selection of academic publications primarily focused on advanced LLM inference techniques (e.g., phase splitting, scaling transformer inference), quantization methods such as 8-bit matrix multiplication for Transformers, and foundational scaling laws for neural language models.
- GPU: Specific technical documentation on NVIDIA's H100 Tensor Core GPU and Blackwell Architectures, detailing precision formats like NVFP4, FP8, and FP4 for training and inference efficiency.
- NCCL and Communication Collectives: Resources dedicated to NVIDIA Collective Communications Library (NCCL) and its underlying algorithms, critical for distributed GPU training.
- LLM Arch: A gallery and comparison of various LLM architectures, alongside practical guides for fine-tuning LLMs and specific model examples (e.g., Unsloth, Nemotron-3, Qwen).
- Cable & Data Center: External links related to global network infrastructure like submarine cable maps and grid operations.
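
The NCCL entry above centers on collective communication algorithms, the best known of which is ring all-reduce (a reduce-scatter phase followed by an all-gather phase). The sketch below is a minimal pure-Python simulation of that textbook schedule for intuition only; it assumes the standard ring algorithm and does not reflect NCCL's actual GPU implementation or API:

```python
def ring_allreduce(data):
    """Simulate ring all-reduce: data[r] is rank r's vector, split into
    n equal chunks (one per rank). Returns every rank's final vector,
    each equal to the element-wise sum across ranks."""
    n = len(data)
    chunk = len(data[0]) // n
    bufs = [[list(data[r][c * chunk:(c + 1) * chunk]) for c in range(n)]
            for r in range(n)]

    # Phase 1: reduce-scatter. At each step, rank r sends one chunk to
    # rank r+1, which adds it into its own copy. Messages are collected
    # first so all "sends" in a step use the pre-step state.
    for step in range(n - 1):
        msgs = [(((r + 1) % n), (r - step) % n, list(bufs[r][(r - step) % n]))
                for r in range(n)]
        for dest, c, payload in msgs:
            for i, v in enumerate(payload):
                bufs[dest][c][i] += v
    # Now rank r holds the fully reduced chunk (r + 1) % n.

    # Phase 2: all-gather. Each rank forwards its completed chunk around
    # the ring until every rank has every reduced chunk.
    for step in range(n - 1):
        msgs = [(((r + 1) % n), (r + 1 - step) % n, list(bufs[r][(r + 1 - step) % n]))
                for r in range(n)]
        for dest, c, payload in msgs:
            bufs[dest][c] = payload

    return [[v for c in bufs[r] for v in c] for r in range(n)]
```

This schedule is bandwidth-optimal: each rank transfers only 2(n-1)/n times the payload size, which is why the ring pattern underpins large-scale gradient synchronization.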
The overarching methodology implied by this collection emphasizes the critical role of high-performance, low-latency, high-bandwidth networking in enabling distributed AI computation. Key technical considerations highlighted across the resources include the choice between InfiniBand and Ethernet-based RoCEv2 as the network fabric, lossless network configuration to prevent packet drops in GPU clusters, optimization of collective communication operations across GPUs, and hardware-specific optimizations such as reduced floating-point precisions and specialized accelerators. Together, the resources offer insight into the architectural patterns, software optimizations, and hardware advancements needed to build and scale modern AI data centers for demanding workloads such as large language model training and inference. The document itself does not present specific formulas but references resources where such technical details can be found.
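
The bandwidth requirement stressed above can be made concrete with a standard back-of-envelope estimate: a bandwidth-optimal ring all-reduce moves 2(N-1)/N times the payload per GPU. The numbers below are illustrative assumptions (a 7B-parameter model, FP16 gradients, 8 GPUs), not figures taken from the repository:

```python
def allreduce_bytes_per_gpu(payload_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends (and receives) in a bandwidth-optimal
    ring all-reduce of a payload of the given size."""
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

# Assumed example: FP16 gradients (2 bytes/parameter) of a 7B-parameter
# model, synchronized across 8 GPUs every training step.
grad_bytes = 7e9 * 2
per_gpu = allreduce_bytes_per_gpu(grad_bytes, 8)
print(f"{per_gpu / 1e9:.1f} GB per GPU per step")  # 24.5 GB per GPU per step
```

Moving tens of gigabytes per step is why these resources dwell on fabric choice and losslessness: at such volumes, even brief packet loss or congestion stalls every GPU in the collective.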