Why Nvidia Chose This LPU [Nvidia, Groq, LPU, execuhire]

sudoremove
2026.01.11
YouTube · by 이호민
#LPU #Groq #Nvidia #Inference #AI

Key Points

  1. Groq is a company specializing in the Language Processing Unit (LPU), new hardware designed for high-speed AI inference thanks to its on-chip memory architecture.
  2. Nvidia acquired Groq through what was described as an "executive hire" or technology licensing agreement, with Groq's founder reportedly joining Nvidia.
  3. Groq's LPU achieves superior inference speed and deterministic performance by using on-chip SRAM for high memory bandwidth and a design optimized for efficient multi-chip linear algebra.

Nvidia has entered into an "Inference Technology Licensing Agreement" with Groq, a company specializing in hardware for AI inference. While framed as a licensing deal, it is effectively an "executive hire" (execuhire), with Groq's founder, Jonathan Ross, transitioning to Nvidia. The transaction, reportedly involving approximately $20 billion in cash, is a significant move for Nvidia in the inference market, comparable to Google's acquisition of Character AI ($2.5 billion) and Meta's acquisition of Scale AI ($15 billion).

Groq's core product is the Language Processing Unit (LPU), a proprietary hardware architecture designed specifically to accelerate large language model (LLM) inference. The LPU's primary advantage is its exceptional speed: it can serve a 1-trillion-parameter model (e.g., Kimi K2) at roughly 200 tokens per second with latency as low as 195 milliseconds. Groq also offers an open platform where users can run various open-source models, with competitive pricing for inference services (e.g., ~$2 per 1 million tokens for LLM inference, and very low cost for services like Whisper).
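As a back-of-the-envelope check on the quoted figures (the throughput and per-token price come from the summary above; the 10,000-token response length is an arbitrary illustrative assumption):

```python
# Rough arithmetic on the quoted Groq figures: ~200 tokens/s throughput
# and ~$2 per 1M tokens. The 10,000-token response length is an
# illustrative assumption, not from the source.
tokens = 10_000
throughput_tps = 200          # tokens per second (quoted)
price_per_million = 2.00      # USD per 1M tokens (quoted)

gen_time_s = tokens / throughput_tps
cost_usd = tokens / 1_000_000 * price_per_million

print(f"~{gen_time_s:.0f} s to generate, ~${cost_usd:.3f}")
# → ~50 s to generate, ~$0.020
```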

The technical foundation of Groq's LPU lies in four key design principles: software-first, programmable, deterministic, and on-memory. The most critical architectural differentiator is its on-chip memory-centric design, which heavily leverages SRAM (Static Random-Access Memory) instead of the HBM (High-Bandwidth Memory) traditionally used in GPUs.

In conventional GPU architectures, LLM weights and data are typically stored in HBM and then transferred to the GPU's on-chip SRAM and registers for computation. This data movement between HBM and SRAM often becomes the performance bottleneck, because HBM bandwidth is comparatively low. Groq's LPU instead aims to store the entire model (or significant portions of it) directly in fast on-chip SRAM. This design choice dramatically increases memory bandwidth: the video cites on-chip SRAM bandwidth of approximately 80 TB/s, roughly 10 times the typical HBM bandwidth of around 8 TB/s. The result is near-instantaneous access to model parameters, virtually eliminating memory-bound latency during inference.
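A toy roofline estimate makes the bandwidth argument concrete. Assume a dense model whose weights must all be streamed once per generated token, with 8-bit weights; both are simplifying assumptions (real deployments use batching, sparsity, and multi-chip parallelism, so the absolute numbers will not match Groq's quoted figures), but the 10x ratio between the two memory technologies carries through directly:

```python
# Toy memory-bound latency estimate: time to stream all weights once
# per generated token. A dense 1T-parameter model with 8-bit weights
# is a simplifying assumption for illustration only.
PARAMS = 1e12                 # 1-trillion-parameter model
BYTES_PER_PARAM = 1           # assume 8-bit quantized weights
model_bytes = PARAMS * BYTES_PER_PARAM

for name, bw_tbs in [("HBM", 8), ("on-chip SRAM", 80)]:
    t_ms = model_bytes / (bw_tbs * 1e12) * 1e3   # seconds -> ms
    print(f"{name}: ~{t_ms:.1f} ms/token, ~{1e3 / t_ms:.0f} tokens/s")
# → HBM: ~125.0 ms/token, ~8 tokens/s
# → on-chip SRAM: ~12.5 ms/token, ~80 tokens/s
```

The 10x bandwidth ratio translates one-to-one into per-token latency under this memory-bound model.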

The trade-offs for this SRAM-heavy design include:

  • Cost: SRAM is significantly more expensive per unit capacity than HBM.
  • Die Area: Integrating large amounts of SRAM directly on the chip leads to a considerably larger die size, which can negatively impact manufacturing yield and overall chip cost.
  • Model Size Limitations: While ideal for speed, the finite capacity of on-chip SRAM implies a practical limit to the size of models that can entirely reside on the chip. Larger models might necessitate complex partitioning or reliance on external memory, potentially diminishing the primary benefit.
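The capacity limitation can be quantified with one assumed figure: publicly reported GroqChip-class parts carry on the order of a few hundred MB of on-chip SRAM (the 230 MB value below is an assumption, not from the video). Holding a 1-trillion-parameter model entirely in SRAM then requires thousands of chips:

```python
# How many chips it takes to hold a model entirely in on-chip SRAM.
# SRAM_PER_CHIP_MB = 230 is an assumed per-chip capacity for a
# GroqChip-class part; 1T params at 1 byte each (8-bit) matches the
# earlier example.
import math

model_bytes = 1e12            # 1T params, 8-bit weights
SRAM_PER_CHIP_MB = 230        # assumed per-chip SRAM capacity
chips = math.ceil(model_bytes / (SRAM_PER_CHIP_MB * 1e6))
print(f"~{chips} chips just to hold the weights")
# → ~4348 chips just to hold the weights
```

This is why the multi-chip scaling story discussed below is not optional for Groq: large models inherently span many chips.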

Despite these challenges, Groq's LPU is optimized for inference from the ground up. It functions as a highly efficient linear algebra engine, handling the matrix and vector operations that dominate LLM inference. This specialization, combined with its unique memory architecture, makes it extremely fast for its intended purpose.

Furthermore, Groq's hardware boasts exceptional multi-chip communication. The architecture is designed for seamless, linear connectivity between multiple chips, enabling scaling to larger models without performance degradation. This is achieved through direct, software-controlled communication between chips, bypassing the complex routers and controllers typically found in multi-GPU setups. This deterministic communication, together with the processor's deterministic execution time, ensures predictable latency, a crucial advantage in applications requiring guaranteed response times.

Nvidia's acquisition of Groq and its LPU technology is interpreted as a strategic move to aggressively strengthen its position in the rapidly growing AI inference market. While Nvidia already dominates AI hardware, Groq's specialized inference architecture, particularly its SRAM-centric design and multi-chip communication efficiency, is a valuable addition that could complement or even challenge existing GPU-based inference solutions. The fact that Groq's founder was previously part of Google's TPU team suggests potential intellectual property advantages in the domain of specialized AI accelerators.