RNGD

@furiosaai
2025.04.20
by Anonymous
#AI Inference #LLM #Accelerator #Hardware #Data Center

Key Points

  • The Furiosa RNGD is a Gen 2 data center accelerator featuring a Tensor Contraction Processor (TCP) architecture, specifically designed to accelerate AI inference through efficient tensor contraction operations.
  • Each RNGD chip offers 512 TFLOPS (FP8) at an efficient 180W TDP, with an 8-chip server configuration achieving 4 petaFLOPS (FP8) and demonstrating competitive performance per watt for LLM inference against leading GPUs.
  • Optimized for LLM and multimodal deployment in air-cooled data centers, RNGD is supported by a comprehensive Furiosa Software Stack facilitating seamless integration and maximizing data center utilization.

Furiosa RNGD (renegade) is a Gen 2 data center accelerator designed for high-performance and energy-efficient AI inference, particularly for large language models (LLMs) and multimodal applications. The core architectural innovation distinguishing RNGD is its Tensor Contraction Processor (TCP).

Performance and Efficiency:
The RNGD accelerator prioritizes power efficiency, operating at a Thermal Design Power (TDP) of 180W, which is significantly lower than competing high-performance GPUs like the NVIDIA H100 SXM (700W) or L40S (350W). Benchmarks demonstrate its inference capabilities:

  • Llama 3.1 70B (2,048 input, 128 output tokens, 8 cards): RNGD achieves 957.05 token/s using FuriosaSDK/FP8, compared to 2,064.53 token/s for H100 SXM and 163.53 token/s for L40S (both using TensorRT-LLM 0.15.0/FP8).
  • Llama 3.1 8B (128 input, 4,096 output tokens, 1 card): RNGD delivers 3,935.25 token/s using FuriosaSDK/FP8, while H100 SXM reaches 13,222.06 token/s and L40S achieves 2,989.17 token/s (both using TensorRT-LLM 0.15.0/FP8).

A key metric for enterprise and cloud deployments, where operational costs are critical, is token/s/W. Dividing benchmark throughput by TDP, RNGD yields roughly 0.66 token/s/W on the 8-card Llama 3.1 70B run versus roughly 0.37 for the H100 SXM, and about 21.9 versus 18.9 token/s/W on the single-card Llama 3.1 8B run. Despite lower absolute throughput, RNGD leads on performance per watt.
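The performance-per-watt figures above can be derived directly from the benchmark throughput and TDP numbers. Note that TDP is an upper bound on sustained draw and excludes host power, so these are rough comparative estimates, not measured efficiency:

```python
# Back-of-envelope token/s/W from the benchmark and TDP figures quoted above.
def tokens_per_watt(throughput_tps: float, cards: int, tdp_w: float) -> float:
    """Throughput divided by aggregate accelerator TDP."""
    return throughput_tps / (cards * tdp_w)

# Llama 3.1 70B, 8 cards
rngd_70b = tokens_per_watt(957.05, 8, 180)    # ~0.66 token/s/W
h100_70b = tokens_per_watt(2064.53, 8, 700)   # ~0.37 token/s/W

# Llama 3.1 8B, 1 card
rngd_8b = tokens_per_watt(3935.25, 1, 180)    # ~21.9 token/s/W
h100_8b = tokens_per_watt(13222.06, 1, 700)   # ~18.9 token/s/W
```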

Hardware Specifications:
The RNGD accelerator is fabricated on TSMC's 5nm process technology. Key specifications include:

  • Compute: 256 TFLOPS (BF16) / 512 TFLOPS (FP8) and 512 TOPS (INT8) / 1024 TOPS (INT4). This is achieved through 8 processing elements, each contributing 64 TFLOPS (FP8).
  • Memory: 48 GB of HBM3 memory (2 x HBM3 CoWoS-S) with a bandwidth of 1.5 TB/s. It also features 256 MB of SRAM with 384 TB/s on-chip bandwidth.
  • Interface: PCIe Gen5 x16 host interface.
  • Cooling: Designed for air-cooled data centers.
  • Features: Supports PCIe P2P, BF16, FP8, INT8, INT4 data types, Multiple-Instance and Virtualization, and secure boot & model encryption.
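The quoted compute rates are internally consistent, as a quick cross-check shows. The per-PE rate and the rate ratios between data types follow directly from the spec sheet above:

```python
# Sanity-check of RNGD's per-chip compute figures.
PROCESSING_ELEMENTS = 8
FP8_TFLOPS_PER_PE = 64

peak_fp8_tflops = PROCESSING_ELEMENTS * FP8_TFLOPS_PER_PE  # 512 TFLOPS
peak_bf16_tflops = peak_fp8_tflops // 2                    # 256 TFLOPS (half the FP8 rate)
peak_int8_tops = peak_fp8_tflops                           # 512 TOPS (same rate as FP8)
peak_int4_tops = peak_fp8_tflops * 2                       # 1024 TOPS (double the FP8 rate)

# On-chip SRAM bandwidth (384 TB/s) vs. HBM3 bandwidth (1.5 TB/s):
sram_to_hbm_ratio = 384 / 1.5  # 256x more on-chip bandwidth
```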

Core Methodology: Tensor Contraction Processor (TCP):
At the heart of the RNGD's architecture is the Tensor Contraction Processor (TCP), a compute architecture specifically designed for efficient tensor contraction operations (ISCA 2024). Tensor contraction is the fundamental computational primitive in modern deep learning, representing a higher-dimensional generalization of matrix multiplication. Unlike most commercial deep learning accelerators that incorporate fixed-sized matrix multiplication (matmul) instructions as their primitives, TCP treats tensor operations as first-class citizens.

This approach offers several advantages:

  1. Optimized for Tensor Operations: By directly supporting tensor contraction, TCP avoids the overhead of decomposing workloads into fixed-size matmul tiles, improving both utilization and energy efficiency.
  2. Programming Interface: The programming interface between hardware and software is elevated to treat tensor contraction as a single, unified operation. This streamlines programming and maximizes parallelism and data reuse.
  3. Flexibility and Reconfigurability: The architecture provides flexibility and reconfigurability of compute and memory resources based on tensor shapes, rather than being constrained by fixed-size matrix operations.
  4. Compiler Optimization: The Furiosa Compiler leverages this hardware flexibility to select the most optimized tactics for various deep learning workloads, ensuring powerful and efficient acceleration across different scales of deployment.
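The distinction between a tensor-contraction primitive and fixed-size matmul primitives can be illustrated in plain NumPy einsum notation (this is an illustration of the concept, not Furiosa's programming interface): an attention-score computation contracts query and key tensors over the head dimension while batching over batch and head indices, which a fixed-matmul ISA must lower into many small matmuls.

```python
import numpy as np

# One contraction over d with free indices b, h, i, j -- a higher-dimensional
# generalization of matmul. Shapes are arbitrary illustrative values.
b, h, s, d = 2, 4, 16, 8
rng = np.random.default_rng(0)
q = rng.standard_normal((b, h, s, d))
k = rng.standard_normal((b, h, s, d))

scores = np.einsum("bhid,bhjd->bhij", q, k)

# Equivalent lowering into b*h independent (s, d) x (d, s) matmuls, as a
# matmul-primitive accelerator would execute it.
lowered = np.stack(
    [q[i, j] @ k[i, j].T for i in range(b) for j in range(h)]
).reshape(b, h, s, s)

assert np.allclose(scores, lowered)
```

Treating the whole contraction as a single operation is what lets the hardware and compiler choose how to partition it across compute and memory resources, rather than committing to one fixed tile shape.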

Advanced Packaging Technology:
RNGD incorporates advanced packaging technology, specifically CoWoS-S for HBM3 integration, to achieve optimal single-chip compute density, memory bandwidth, and energy efficiency.

Furiosa NXT RNGD Server:
Furiosa offers a turnkey AI inference server, the NXT RNGD, designed for cost-efficient scalability in air-cooled data centers. The server features:

  • 8 x RNGD Tensor Contraction Processors.
  • 384 GB HBM3 capacity (48 GB per card).
  • 12 TB/s aggregate memory bandwidth.
  • 4 petaFLOPS peak compute (FP8, calculated as 512 TFLOPS/RNGD x 8 RNGDs).
  • 3 kW total power consumption.
  • Dual AMD EPYC 9354 CPUs.
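The server-level figures follow from the per-card specifications (the split of the 3 kW budget between accelerators and the host is an inference, not a quoted figure):

```python
# Aggregating NXT RNGD server specs from per-card values.
CARDS = 8
HBM_GB_PER_CARD = 48
HBM_BW_TBPS_PER_CARD = 1.5
FP8_TFLOPS_PER_CARD = 512
TDP_W_PER_CARD = 180

total_hbm_gb = CARDS * HBM_GB_PER_CARD                  # 384 GB
total_bw_tbps = CARDS * HBM_BW_TBPS_PER_CARD            # 12 TB/s
total_fp8_pflops = CARDS * FP8_TFLOPS_PER_CARD / 1000   # ~4.1 PFLOPS ("4 petaFLOPS")
accel_power_w = CARDS * TDP_W_PER_CARD                  # 1440 W of the 3 kW budget
# Remainder (~1.5 kW) presumably covers CPUs, memory, fans, etc.
```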

Software Ecosystem:
The Furiosa SW Stack provides a comprehensive toolkit for optimizing and deploying LLMs on RNGD, facilitating a seamless transition from model development to production. It includes:

  • Components: A model compressor, serving framework, runtime, compiler, profiler, debugger, and a suite of APIs.
  • Optimization: Built for advanced inference deployment, it supports comprehensive optimization of large language models.
  • Usability: Offers user-friendly APIs for seamless state-of-the-art LLM deployment.
  • Data Center Utilization: Maximizes data center utilization and flexibility through support for containerization, SR-IOV, and Kubernetes, integrating with cloud-native components.
  • Ecosystem Support: Ensures robust ecosystem support with PyTorch 2.x integration, enabling users to leverage open-source AI advancements and transition models into production efficiently.
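Assuming the serving framework exposes an OpenAI-compatible HTTP endpoint, as is common for LLM serving stacks, deployment-side integration reduces to standard HTTP requests. The endpoint URL and model name below are placeholder assumptions, not values from Furiosa's documentation:

```python
import json
from urllib import request

def build_chat_request(base_url: str, model: str, prompt: str) -> request.Request:
    """Build an OpenAI-style /v1/chat/completions request (hypothetical sketch)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }).encode()
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder endpoint and model id for illustration only.
req = build_chat_request("http://localhost:8000",
                         "meta-llama/Llama-3.1-8B-Instruct",
                         "Hello")
# resp = request.urlopen(req)  # uncomment against a live serving endpoint
```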

In summary, the Furiosa RNGD system presents a highly efficient, purpose-built solution for AI inference, differentiating itself through its unique Tensor Contraction Processor architecture and a robust software stack designed for ease of deployment and maximum data center utilization within a low-power envelope.