Transformers.js v4 Preview: Now Available on NPM!

2026.02.10
Hugging Face · by 이호민
#AI #JavaScript #NPM #Transformers.js #WebGPU

Key Points

  • Hugging Face has released a preview of Transformers.js v4 on NPM, marking a significant update after a year of development that brings advanced AI models to various JavaScript environments.
  • This version introduces a new C++ WebGPU Runtime for enhanced performance and cross-environment compatibility, leverages specialized ONNX Runtime operators for substantial speedups, and features a restructured monorepo with modular class architecture.
  • Further improvements include a standalone, lightweight Tokenizers.js library, support for new complex model architectures, and a migration to esbuild, which drastically reduces build times and bundle sizes.

Transformers.js v4 (preview), now available on NPM, marks a significant architectural overhaul, initiated in March 2025, to enhance performance and maintainability and to expand model support.

The core methodology for performance improvement revolves around a new WebGPU Runtime, completely re-engineered in C++. This runtime, developed in close collaboration with the ONNX Runtime team, facilitates hardware-accelerated model execution across diverse JavaScript environments, including web browsers, Node.js, Bun, and Deno, enabling WebGPU-accelerated models server-side.

A key aspect of this performance focus is a re-engineered export strategy, particularly for large language models: models are re-implemented operation-by-operation, leveraging specialized ONNX Runtime Contrib Operators. Examples include com.microsoft.GroupQueryAttention for optimized attention mechanisms, com.microsoft.MatMulNBits for efficient quantized matrix multiplication, and com.microsoft.QMoE for quantized Mixture-of-Experts operations. The adoption of com.microsoft.MultiHeadAttention, for instance, yielded a notable ~4× speedup for BERT-based embedding models.

Furthermore, full offline support is achieved by caching WebAssembly (WASM) files locally within the browser, allowing applications to keep working without an internet connection after the initial download.
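To get a feel for what an operator like com.microsoft.MatMulNBits computes, here is a minimal pure-JavaScript sketch of block-wise 4-bit weight quantization followed by an on-the-fly dequantizing dot product. This is a conceptual toy under assumed conventions (symmetric quantization, one scale per block, one value per array slot); the real operator packs two 4-bit values per byte and runs fused on the GPU:

```javascript
// Toy block-wise 4-bit quantization: one float scale per block of weights,
// each weight stored as an integer in 0..15 (offset-by-8 signed encoding).
function quantizeRow(row, blockSize = 4) {
  const quants = new Uint8Array(row.length);
  const scales = [];
  for (let b = 0; b < row.length; b += blockSize) {
    const block = row.slice(b, b + blockSize);
    const maxAbs = Math.max(...block.map((v) => Math.abs(v)), 1e-8);
    const scale = maxAbs / 7; // symmetric 4-bit range: -7..7
    scales.push(scale);
    for (let i = 0; i < block.length; i++) {
      quants[b + i] = Math.round(block[i] / scale) + 8; // shift into 0..15
    }
  }
  return { quants, scales, blockSize };
}

// Dot product of a float vector with one quantized weight row,
// dequantizing each weight on the fly (what MatMulNBits fuses on GPU).
function quantizedDot(x, { quants, scales, blockSize }) {
  let acc = 0;
  for (let i = 0; i < x.length; i++) {
    const scale = scales[Math.floor(i / blockSize)];
    acc += x[i] * (quants[i] - 8) * scale;
  }
  return acc;
}

const row = [0.5, -1.0, 0.25, 0.75, 2.0, -2.0, 1.0, 0.0];
const q = quantizeRow(row);
const x = [1, 1, 1, 1, 1, 1, 1, 1];
console.log(quantizedDot(x, q)); // approximate; exact full-precision result is 1.5
```

The payoff of the fused operator is that the 4-bit weights never have to be materialized as floats in memory: dequantization happens inside the matmul kernel itself.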

Repository restructuring was a major undertaking. The project transitioned from a single-package repository to a monorepo utilizing pnpm workspaces. This enables the development and distribution of multiple sub-packages that depend on the @huggingface/transformers core, streamlining maintenance and distribution for specific use cases. A significant refactoring effort addressed the models.js file, which previously spanned over 8,000 lines. In v4, this monolithic structure was broken down into smaller, modular files, clearly delineating utility functions, core logic, and model-specific implementations. This modular class structure substantially improves readability and simplifies the integration of new models. Additionally, example projects have been externalized from the main repository into a dedicated examples repository to maintain a cleaner core codebase. Code consistency is now enforced with Prettier, and all files have been reformatted accordingly.
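A pnpm-workspaces monorepo of this kind is declared with a single config file at the repository root; a minimal sketch (the directory names here are illustrative assumptions, not the project's actual layout):

```yaml
# pnpm-workspace.yaml (hypothetical layout)
packages:
  - "packages/*"   # e.g. packages/transformers, packages/tokenizers
```

Each sub-package then keeps its own package.json, and pnpm links local dependencies (such as a sub-package depending on @huggingface/transformers) directly from the workspace instead of from the registry.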

The updated export strategy and expanded ONNX Runtime custom operator support have enabled the integration of numerous new models and architectures, including GPT-OSS, Chatterbox, GraniteMoeHybrid, LFM2-MoE, HunYuanDenseV1, Apertus, Olmo3, FalconH1, and Youtu-LLM. Support for advanced architectural patterns such as Mamba (state-space models), Multi-head Latent Attention (MLA), and Mixture of Experts (MoE) has been implemented, all compatible with the WebGPU runtime for hardware-accelerated execution in various JavaScript environments.
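Of the architectural patterns listed above, Mixture-of-Experts routing is easy to sketch in plain JavaScript: a router scores every expert for each token, only the top-k experts actually run, and their outputs are mixed by renormalized router weights. This is a conceptual toy (scalar "experts", no learned parameters), not the library's or any model's actual kernel:

```javascript
// Toy top-k Mixture-of-Experts routing. Each "expert" is just a function;
// the router picks the k highest-probability experts per token.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function moeForward(token, experts, routerLogits, k = 2) {
  const probs = softmax(routerLogits);
  const topK = probs
    .map((p, i) => ({ p, i }))
    .sort((a, b) => b.p - a.p)
    .slice(0, k); // only these experts are evaluated
  const norm = topK.reduce((s, e) => s + e.p, 0);
  // Mix the selected experts' outputs by renormalized router weights.
  return topK.reduce((out, { p, i }) => out + (p / norm) * experts[i](token), 0);
}

const experts = [
  (x) => 2 * x,   // expert 0
  (x) => x + 10,  // expert 1
  (x) => -x,      // expert 2
];
// Router strongly prefers experts 0 and 1 for this token, so expert 2 never runs.
const y = moeForward(3, experts, [4.0, 3.0, -2.0], 2);
console.log(y); // a weighted mix of 2*3 = 6 and 3+10 = 13
```

The sparsity is the point: a quantized MoE operator like com.microsoft.QMoE only has to touch the weights of the k selected experts per token, which is what makes large MoE models tractable in the browser.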

The build system underwent a migration from Webpack to esbuild, resulting in substantial improvements. Build times were reduced by a factor of 10, from 2 seconds to 200 milliseconds. Bundle sizes saw an average decrease of 10%, with the default transformers.web.js export experiencing a significant 53% reduction, leading to faster downloads and quicker application startup times.
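For comparison, an esbuild-based library build typically collapses to a single API call; a minimal configuration sketch (the entry point, target, and output names here are assumptions, not the project's actual build settings):

```javascript
// build.mjs — hypothetical esbuild bundling script (run with: node build.mjs)
import * as esbuild from "esbuild";

await esbuild.build({
  entryPoints: ["src/transformers.js"], // assumed entry point
  bundle: true,                         // inline dependencies into one file
  minify: true,
  format: "esm",
  target: "es2020",
  outfile: "dist/transformers.web.js",  // assumed output name
});
```

Because esbuild parses, bundles, and minifies in Go with heavy parallelism, sub-second builds of this size are typical, consistent with the 2 s → 200 ms figure above.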

A frequently requested feature, the extraction of tokenization logic, has been realized with the introduction of the standalone @huggingface/tokenizers library. This library provides a complete refactor of the tokenization process, designed for seamless operation across browsers and server-side runtimes. It is lightweight, at just 8.8KB (gzipped), with zero dependencies, and fully type-safe, offering a versatile tool for any WebML project independent of the core Transformers.js library.
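The exact API of @huggingface/tokenizers is not shown in this post, but the kind of logic such a library encapsulates can be illustrated with a toy greedy longest-match (WordPiece-style) tokenizer in dependency-free JavaScript. The vocabulary and behavior below are purely illustrative:

```javascript
// Toy WordPiece-style tokenizer: greedy longest-match against a vocabulary,
// with "##" marking word-internal continuation pieces. Illustrative only.
function wordpieceTokenize(text, vocab) {
  const tokens = [];
  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    let start = 0;
    while (start < word.length) {
      let end = word.length;
      let piece = null;
      // Try the longest remaining substring first, shrinking until a match.
      while (end > start) {
        const candidate = (start > 0 ? "##" : "") + word.slice(start, end);
        if (vocab.has(candidate)) { piece = candidate; break; }
        end--;
      }
      if (piece === null) { tokens.push("[UNK]"); break; }
      tokens.push(piece);
      start = end;
    }
  }
  return tokens;
}

const vocab = new Set(["trans", "##form", "##ers", "run", "##ning"]);
console.log(wordpieceTokenize("Transformers running", vocab));
// → ["trans", "##form", "##ers", "run", "##ning"]
```

Real tokenizers add normalization, pre-tokenization rules, special tokens, and ID mapping on top of this core loop, which is why shipping it as a standalone, zero-dependency package is useful well beyond Transformers.js itself.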

Further quality-of-life improvements include an enhanced type system with dynamic pipeline types for better developer experience and type safety, refined logging for clearer user feedback, and expanded support for models exceeding 8 billion parameters. The latter is demonstrated by GPT-OSS 20B (q4f16) running at approximately 60 tokens per second on an M4 Pro Max.