GitHub - brody-0125/dart_sentencepiece_tokenizer: A lightweight, pure Dart implementation of SentencePiece tokenizer. Supports BPE (Gemma) and Unigram (Llama) algorithms.
Key Points
- `dart_sentencepiece_tokenizer` is a lightweight, memory-efficient, pure Dart library for SentencePiece tokenization, supporting the BPE (Gemma) and Unigram (Llama) algorithms without external dependencies.
- The library provides a comprehensive API for encoding, decoding, batch processing, padding, truncation, and offset mapping, and ensures HuggingFace compatibility through `tokenizer.json` serialization, dynamic token addition, and a streaming interface.
- Optimized with typed arrays and O(1) BPE merges, it sustains 500K+ tokens/sec throughput and runs across Flutter, server, and web environments, including ONNX Runtime integration.
dart_sentencepiece_tokenizer is a lightweight, pure Dart implementation of the SentencePiece tokenizer, designed for cross-platform use (Flutter, Server, CLI, Web) with zero external dependencies. It supports the two primary SentencePiece algorithms: Byte Pair Encoding (BPE), as used by Gemma models, and Unigram, utilized by Llama models.
The core methodology is an efficient implementation of these subword tokenization algorithms. For BPE, merge operations run in O(1) using a linked-list data structure together with merge caching, which significantly improves tokenization throughput. The Unigram algorithm, based on a probabilistic model, is also fully supported, selecting variable-length subword units according to their likelihood in a given corpus.
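The linked-list scheme can be sketched as follows. This is an illustrative, self-contained example rather than the library's actual code: symbols live in a doubly linked list so each merge is a constant-time splice. (The library additionally caches merge lookups; this naive sketch rescans adjacent pairs on every iteration.)

```dart
// Illustrative BPE sketch (not the library's actual code): symbols live in a
// doubly linked list, so applying a merge is a constant-time splice.
class _Node {
  String piece;
  _Node? prev, next;
  _Node(this.piece);
}

/// Greedily applies the lowest-ranked merge until none applies.
/// `mergeRanks` maps "left right" symbol pairs to their learned merge order.
List<String> bpeEncode(String word, Map<String, int> mergeRanks) {
  if (word.isEmpty) return [];
  // Start from single-character symbols linked in sequence.
  final head = _Node(word[0]);
  var tail = head;
  for (var i = 1; i < word.length; i++) {
    final n = _Node(word[i])..prev = tail;
    tail.next = n;
    tail = n;
  }
  while (true) {
    // Find the adjacent pair with the lowest (earliest-learned) rank.
    _Node? best;
    var bestRank = 1 << 30;
    for (var n = head; n.next != null; n = n.next!) {
      final rank = mergeRanks['${n.piece} ${n.next!.piece}'];
      if (rank != null && rank < bestRank) {
        bestRank = rank;
        best = n;
      }
    }
    if (best == null) break;
    // Splice out the right-hand node in O(1).
    final nxt = best.next!;
    best.piece += nxt.piece;
    best.next = nxt.next;
    nxt.next?.prev = best;
  }
  final out = <String>[];
  for (_Node? n = head; n != null; n = n.next) {
    out.add(n.piece);
  }
  return out;
}

void main() {
  // Merges learned in order: "l o" first, then "lo w", then "e r".
  print(bpeEncode('lower', {'l o': 0, 'lo w': 1, 'e r': 2})); // [low, er]
}
```

Because each merge only touches the two nodes being joined, the splice itself never depends on the word length; the cost of *finding* the next merge is what the library's caching removes.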
Key features and technical details include:
- Memory Efficiency: The library leverages Dart's typed arrays (`Int32List` for token IDs, `Uint8List` for attention masks, type IDs, and special token masks) to achieve a 50-70% reduction in memory footprint compared to standard Dart lists. Specifically, token IDs consume 4 bytes/token, while `typeIds`, `attentionMask`, and `specialTokensMask` each consume 1 byte/token.
- Comprehensive API:
  - Encoding: Supports single text encoding (`encode`), sentence pair encoding (`encodePair`), and batch encoding (`encodeBatch`, `encodeBatchParallel`, `encodePairBatch`). Encoding outputs an `Encoding` object containing `tokens` (`List<String>`), `ids` (`Int32List`), `attentionMask` (`Uint8List`), `typeIds` (`Uint8List`), `specialTokensMask` (`Uint8List`), `offsets` (`List<(int, int)>` for character spans), `wordIds` (`List<int?>`), and `sequenceIds` (`List<int?>`). Input text length is capped at 500,000 characters to prevent out-of-memory (OOM) errors.
  - Decoding: Provides `decode` for a single token ID sequence and `decodeBatch` for multiple sequences, with an option to skip special tokens.
  - Padding and Truncation: Offers both fluent API configuration (`enablePadding`, `enableTruncation`) and manual methods (`withPadding`, `withTruncation`, `withPaddingToMultipleOf`). Truncation strategies for pair encoding include `longestFirst`, `onlyFirst`, `onlySecond`, and `doNotTruncate`.
  - Offset Mapping: Enables precise character-to-token, token-to-character, and word-to-token index mapping (`charToToken`, `tokenToChars`, `wordToTokens`, `tokenToWord`, `tokenToSequence`).
- Vocabulary Management: Provides access to the vocabulary size (`vocabSize`) and special token IDs (`unkId`, `bosId`, `eosId`, `padId`). It supports conversion between tokens and IDs (`convertTokensToIds`, `convertIdsToTokens`), vocabulary lookup (`vocab.contains`), and retrieval of the full vocabulary (`getVocab`).
- HuggingFace Compatibility: Adheres to HuggingFace tokenization standards, including a `tokenize` method and support for saving/loading tokenizer configurations in the `tokenizer.json` format.
- Dynamic Token Addition: Allows runtime addition of new regular tokens (`addTokens`) and special tokens (`addSpecialTokens`) to the vocabulary, which are then integrated into the tokenization process.
- Streaming API: Features a `TextStreamer` compatible with HuggingFace's `TextStreamer` for real-time decoding of LLM outputs. It includes heuristics for word boundaries, CJK character detection for immediate output, and options to skip special tokens or a specified number of prompt tokens.
- ONNX Runtime Integration: Facilitates direct integration with ONNX Runtime by providing encoded `input_ids` and `attention_mask` as `Int64List`, a common requirement for ONNX models.
- Configuration: Offers predefined configurations for Gemma (`SentencePieceConfig.gemma`, adding BOS and EOS tokens) and Llama (`SentencePieceConfig.llama`, adding BOS token only), alongside a custom configuration option.
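A hypothetical usage sketch tying these pieces together is shown below. The method and field names (`encode`, `decode`, `SentencePieceConfig.gemma`, `tokens`, `ids`, `attentionMask`) come from the feature list above, but the loader name `SentencePieceTokenizer.fromFile` and its exact signature are assumptions, not confirmed API:

```dart
// Hypothetical sketch: loader name and signature are assumed, not confirmed.
final tokenizer = await SentencePieceTokenizer.fromFile(
  'tokenizer.model',                 // binary protobuf .model file
  config: SentencePieceConfig.gemma, // adds BOS and EOS tokens
);

final enc = tokenizer.encode('Hello world');
print(enc.tokens);        // List<String> of subword pieces
print(enc.ids);           // Int32List of token IDs
print(enc.attentionMask); // Uint8List of 0/1 flags

final text = tokenizer.decode(enc.ids, skipSpecialTokens: true);
```

Consult the package's own API documentation for the actual construction and loading calls.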
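The typed-array memory layout and the ONNX handoff can be demonstrated with `dart:typed_data` alone; this standalone snippet assumes nothing about the library itself:

```dart
import 'dart:typed_data';

void main() {
  // Token IDs are stored as 4-byte ints; masks as 1-byte ints.
  final ids = Int32List.fromList([2, 9105, 613, 1]);
  final mask = Uint8List(ids.length)..fillRange(0, ids.length, 1);
  print(Int32List.bytesPerElement); // 4 bytes per token ID
  print(Uint8List.bytesPerElement); // 1 byte per mask entry

  // ONNX Runtime typically expects int64 input tensors, so the final
  // handoff widens the 32-bit IDs into an Int64List in a single copy.
  final inputIds = Int64List.fromList(ids);
  print(inputIds); // [2, 9105, 613, 1]
  print(mask.length == inputIds.length); // true
}
```

Note that `Int64List` is not available when compiling to JavaScript, which is why it is used only at the ONNX boundary rather than for internal storage.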
Performance metrics highlight its efficiency: throughput exceeds 500,000 tokens/sec, model loading typically completes in ~50ms for a 32K vocabulary, and vocabulary lookup runs in O(k) per token, where k is the token length. BPE merge operations are optimized to O(1) per merge. The tokenizer loads binary protobuf `.model` files generated by the SentencePiece C++ library.