Reduce AI Service Costs by 67% with Personal PC/Smartphone GPUs
Key Points
- KAIST has developed 'SpecEdge,' a new technology that significantly reduces the cost of large language model (LLM) AI services by leveraging readily available GPUs in personal PCs and smartphones.
- This innovation allows for an approximate 67% reduction in AI service costs per token compared to traditional methods that solely rely on expensive data center GPUs.
- SpecEdge utilizes a "Speculative Decoding" approach where small models on edge devices rapidly generate preliminary token sequences, which are then efficiently verified and corrected by larger models in the data center, improving both speed and resource efficiency.
KAIST's Professor Han Dong-soo's team has developed 'SpecEdge', a novel technology aimed at significantly reducing the operational costs of Large Language Model (LLM) based AI services by leveraging consumer-grade GPUs found in personal computers and smartphones. The current paradigm for LLM services necessitates expensive, high-performance GPUs located in data centers, which drives up service costs. SpecEdge proposes a hybrid inference infrastructure that integrates these readily available, lower-cost edge GPUs with data center GPUs.
The core methodology of SpecEdge is based on an advanced application of Speculative Decoding. In traditional speculative decoding, a smaller, faster "drafting" model generates an initial sequence of tokens, which a larger, more accurate "verifier" model then checks and corrects. SpecEdge extends this concept by distributing the computational load across heterogeneous hardware:
- Edge GPU as Draft Model: A smaller language model is deployed on the user's local edge device (e.g., PC or smartphone) equipped with an edge GPU. This edge GPU rapidly generates a draft sequence of highly probable tokens without waiting for immediate server responses.
- Data Center GPU as Verifier Model: The generated draft sequence from the edge device is then sent to the data center. The large language model (LLM) residing on the powerful data center GPU performs a batch verification and correction of these candidate tokens. This process can be conceptualized as:
- The LLM, given the prefix `x_1 … x_n` and the proposed draft `d_1 … d_k`, computes the probability `p(d_i | x_1 … x_n, d_1 … d_{i-1})` of each drafted token for `i = 1, …, k`.
- It then finds the first position `j` where the drafted token `d_j` does not match the LLM's most probable token at that position, or where `p(d_j | ·)` is below a certain threshold.
- If all `k` tokens are validated, they are accepted. Otherwise, the sequence up to `d_{j-1}` is accepted, and the LLM generates the next token itself from position `j`.
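The verification step above can be sketched in a few lines of Python. This is a toy greedy-verification loop, not SpecEdge's actual code: `verify_draft`, `llm_probs`, and the `threshold` default are illustrative names, and a real verifier would score the draft in a single batched forward pass of the large model.

```python
import numpy as np

def verify_draft(prefix, draft, llm_probs, threshold=0.0):
    """Greedy-style verification of a drafted token sequence (toy sketch).

    llm_probs(tokens) returns one probability distribution per input
    position; dists[i] is taken as the verifier's distribution for
    draft position i.
    """
    dists = llm_probs(prefix + draft)
    accepted = []
    for i, tok in enumerate(draft):
        dist = dists[i]
        best = int(np.argmax(dist))
        # Reject at the first mismatch with the verifier's top token,
        # or when the drafted token's probability falls below threshold.
        if tok != best or dist[tok] < threshold:
            accepted.append(best)   # verifier supplies the corrected token
            return accepted, False
        accepted.append(tok)
    return accepted, True           # whole draft accepted

# Hypothetical verifier that always prefers token 1 at every position.
def fake_probs(tokens):
    return [np.array([0.05, 0.9, 0.05]) for _ in tokens]

out, ok = verify_draft([0], [1, 1, 1], fake_probs)   # draft fully accepted
out2, ok2 = verify_draft([0], [1, 2, 1], fake_probs) # truncated + corrected
```

On full acceptance the server returns all `k` tokens at once, which is where the per-request speedup over token-by-token decoding comes from.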
By offloading the initial, computationally lighter drafting phase to the abundant and low-cost edge GPUs, SpecEdge achieves substantial cost savings and efficiency gains. The edge GPU continuously generates draft tokens without being bottlenecked by server latency, concurrently improving LLM inference speed and infrastructure utilization.
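That overlap, drafting on the edge while the server verifies in parallel, can be illustrated as a toy producer/consumer pipeline. Everything here (`draft_fn`, `verify_fn`, the sleep durations) is a hypothetical stand-in, and the sketch deliberately ignores the rollback a real system performs when a draft is rejected:

```python
import queue
import threading
import time

def edge_drafter(draft_fn, out_q, rounds, k=4):
    # The edge keeps producing k-token drafts without waiting for the server.
    for r in range(rounds):
        out_q.put(draft_fn(r, k))
    out_q.put(None)  # sentinel: no more drafts

def server_verifier(verify_fn, in_q, results):
    # The server verifies drafts as they arrive, overlapping with drafting.
    while (draft := in_q.get()) is not None:
        results.append(verify_fn(draft))

def draft_fn(r, k):
    time.sleep(0.01)   # fast edge-side drafting (simulated)
    return [r] * k

def verify_fn(draft):
    time.sleep(0.03)   # slower batched server verification (simulated)
    return draft       # this toy accepts every draft

q, results = queue.Queue(maxsize=2), []
t1 = threading.Thread(target=edge_drafter, args=(draft_fn, q, 3))
t2 = threading.Thread(target=server_verifier, args=(verify_fn, q, results))
t1.start(); t2.start(); t1.join(); t2.join()
```

The bounded queue is the point of the design: the edge never idles waiting for a round trip, and the server always has a draft ready to verify.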
The results demonstrate significant improvements:
- Cost Reduction: SpecEdge reduces the cost per token by approximately 67.6% compared to existing data center-only inference methods.
- Cost Efficiency: It enhances cost efficiency by 1.91 times when compared to speculative decoding performed solely on data center GPUs.
- Server Throughput: It boosts server throughput by 2.22 times relative to data center-only speculative decoding, allowing the data center GPUs to process more requests concurrently without idle time.
The technology is designed to operate seamlessly over general internet connections, eliminating the need for specialized network environments. The server architecture is optimized to efficiently handle simultaneous verification requests from multiple edge GPUs, leading to a more efficient utilization of data center resources.
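One way such a server loop might group concurrent edge requests is sketched below. This is an assumption-laden illustration, not the paper's implementation: `verify_batch` stands in for a single batched forward pass of the data-center model, and `max_batch` is a hypothetical tuning knob.

```python
from collections import deque

def batch_verify(requests, verify_batch, max_batch=8):
    """Group pending verification requests from many edge devices into
    batched calls to the data-center model (toy sketch).

    requests: iterable of (request_id, draft_tokens) pairs.
    verify_batch: hypothetical batched verifier; takes a list of drafts
    and returns a verified sequence for each.
    """
    pending = deque(requests)
    results = {}
    while pending:
        # Drain up to max_batch requests into one server-side batch.
        batch = [pending.popleft() for _ in range(min(max_batch, len(pending)))]
        ids = [req_id for req_id, _ in batch]
        drafts = [draft for _, draft in batch]
        for req_id, verified in zip(ids, verify_batch(drafts)):
            results[req_id] = verified
    return results

# Hypothetical batched verifier that accepts the first two tokens of each draft.
vb = lambda drafts: [d[:2] for d in drafts]
results = batch_verify([(i, [i] * 4) for i in range(10)], vb, max_batch=4)
```

Batching verification this way is what keeps the data center GPUs busy across many edge clients instead of sitting idle between individual round trips.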
SpecEdge aims to democratize access to high-quality AI by significantly lowering service provision costs, extending the LLM infrastructure beyond centralized data centers to leverage distributed user-side edge resources. This research was recognized as a 'Spotlight' paper (top 3.2% of submissions) at the prestigious Neural Information Processing Systems (NeurIPS) conference in December.