GitHub - brody-0125/dart_sentencepiece_tokenizer: A lightweight, pure Dart implementation of SentencePiece tokenizer. Supports BPE (Gemma) and Unigram (Llama) algorithms.

brody-0125
2026.02.08 · GitHub · by 권준호
#BPE #Dart #SentencePiece #Tokenizer #Unigram

Key Points

  • `dart_sentencepiece_tokenizer` is a lightweight, memory-efficient, pure Dart library for SentencePiece tokenization, supporting the BPE (Gemma) and Unigram (Llama) algorithms without external dependencies.
  • The library provides a comprehensive API for encoding, decoding, batch processing, padding, truncation, and offset mapping, and offers HuggingFace compatibility through `tokenizer.json` serialization, dynamic token addition, and a streaming interface.
  • Optimized with typed arrays and O(1) BPE merges, it achieves high throughput (500K+ tokens/sec) and runs across Flutter, server, and web environments, including ONNX Runtime integration.

dart_sentencepiece_tokenizer is a lightweight, pure Dart implementation of the SentencePiece tokenizer, designed for cross-platform use (Flutter, Server, CLI, Web) with zero external dependencies. It supports the two primary SentencePiece algorithms: Byte Pair Encoding (BPE), as used by Gemma models, and Unigram, utilized by Llama models.

The core of the library is an efficient implementation of these two subword tokenization algorithms. For BPE, merge operations run in O(1) thanks to a linked-list data structure and merge caching, which significantly speeds up tokenization. The Unigram algorithm, based on a probabilistic model, is also fully supported, selecting variable-length subword units according to their likelihood in the training corpus.
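To illustrate why a linked list makes each BPE merge O(1): fusing two adjacent symbols is a pointer splice rather than a rebuild of the token array. The sketch below is a simplified illustration under that idea, not the library's actual source (which additionally caches merge results).

```dart
// Illustrative sketch: a doubly linked list of subword pieces lets a BPE
// merge splice two adjacent symbols in O(1), instead of copying the whole
// token list on every merge.
class Node {
  String piece;
  Node? prev, next;
  Node(this.piece);
}

/// Applies the merge rule (left, right) in place; each splice is O(1).
void mergePair(Node? head, String left, String right) {
  for (var n = head; n != null && n.next != null; n = n.next) {
    if (n.piece == left && n.next!.piece == right) {
      n.piece = left + right;   // fuse the pair into one symbol
      final removed = n.next!;
      n.next = removed.next;    // unlink the right symbol: O(1)
      removed.next?.prev = n;
    }
  }
}

void main() {
  // "hello" as single characters: h-e-l-l-o
  final nodes = 'hello'.split('').map(Node.new).toList();
  for (var i = 0; i + 1 < nodes.length; i++) {
    nodes[i].next = nodes[i + 1];
    nodes[i + 1].prev = nodes[i];
  }
  mergePair(nodes.first, 'l', 'l'); // merge rule (l, l) -> ll
  final out = <String>[];
  for (Node? n = nodes.first; n != null; n = n.next) {
    out.add(n.piece);
  }
  print(out); // [h, e, ll, o]
}
```

A production implementation would also consult a merge-rank table and a cache of previously merged pairs, as the library's description suggests; this sketch shows only the O(1) splice itself.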

Key features and technical details include:

  • Memory Efficiency: The library leverages Dart's typed arrays (Int32List for token IDs, Uint8List for attention masks, type IDs, and special token masks) to achieve a 50-70% reduction in memory footprint compared to standard Dart lists. Specifically, token IDs consume 4 bytes/token, while typeIds, attentionMask, and specialTokensMask each consume 1 byte/token.
  • Comprehensive API:
    • Encoding: Supports single text encoding (encode), sentence pair encoding (encodePair), and batch encoding (encodeBatch, encodeBatchParallel, encodePairBatch). Encoding outputs an Encoding object containing tokens (List<String>), ids (Int32List), attentionMask (Uint8List), typeIds (Uint8List), specialTokensMask (Uint8List), offsets (List<(int, int)> for character spans), wordIds (List<int?>), and sequenceIds (List<int?>). The input text length is capped at 500,000 characters to prevent Out-Of-Memory (OOM) errors.
    • Decoding: Provides decode for single token ID sequences and decodeBatch for multiple sequences, with an option to skip special tokens.
    • Padding and Truncation: Offers both fluent API configurations (enablePadding, enableTruncation) and manual methods (withPadding, withTruncation, withPaddingToMultipleOf). Truncation strategies for pair encoding include longestFirst, onlyFirst, onlySecond, and doNotTruncate.
    • Offset Mapping: Enables precise character-to-token, token-to-character, and word-to-token index mapping (charToToken, tokenToChars, wordToTokens, tokenToWord, tokenToSequence).
  • Vocabulary Management: Provides access to vocabulary size (vocabSize) and special token IDs (unkId, bosId, eosId, padId). It supports conversion between tokens and IDs (convertTokensToIds, convertIdsToTokens), vocabulary lookup (vocab.contains), and retrieval of the full vocabulary as a Map<String, int> (getVocab).
  • HuggingFace Compatibility: Adheres to HuggingFace tokenization standards, including a tokenize method returning List<String> and support for saving/loading tokenizer configurations in the tokenizer.json format.
  • Dynamic Token Addition: Allows for the runtime addition of new regular tokens (addTokens) and special tokens (addSpecialTokens) to the vocabulary, which are then integrated into the tokenization process.
  • Streaming API: Features a TextStreamer compatible with HuggingFace's TextStreamer for real-time decoding of LLM outputs. It includes heuristics for word boundaries, CJK character detection for immediate output, and options to skip special tokens or a specified number of prompt tokens.
  • ONNX Runtime Integration: Facilitates direct integration with ONNX Runtime by providing encoded input_ids and attention_mask as Int64List, which is a common requirement for ONNX models.
  • Configuration: Offers predefined configurations for Gemma (SentencePieceConfig.gemma, adding BOS and EOS tokens) and Llama (SentencePieceConfig.llama, adding BOS token only), alongside a custom configuration option.
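Pulling the pieces above together, typical usage might look like the following. This is a hypothetical sketch assembled from the API names listed above; the import path, the loader name (`SentencePieceTokenizer.load`), and the exact parameter names are assumptions, not verified against the package.

```dart
// Hypothetical usage sketch based on the feature list above; names such as
// SentencePieceTokenizer.load and the parameter names are assumptions.
import 'package:dart_sentencepiece_tokenizer/dart_sentencepiece_tokenizer.dart';

Future<void> main() async {
  // Load a binary SentencePiece .model file with the Gemma preset
  // (which adds BOS and EOS tokens).
  final tokenizer = await SentencePieceTokenizer.load(
    'tokenizer.model',
    config: SentencePieceConfig.gemma,
  );

  tokenizer
    ..enablePadding(maxLength: 128)    // fluent padding configuration
    ..enableTruncation(maxLength: 128); // fluent truncation configuration

  final enc = tokenizer.encode('Hello world');
  print(enc.ids);           // Int32List of token IDs
  print(enc.attentionMask); // Uint8List attention mask
  print(enc.offsets);       // character span per token

  // Round-trip back to text, dropping BOS/EOS and padding.
  final text = tokenizer.decode(enc.ids, skipSpecialTokens: true);
  print(text);
}
```

For ONNX Runtime integration, the library reportedly exposes `input_ids` and `attention_mask` as Int64List, so the encoded output could be fed to an ONNX session without manual widening of the 32-bit IDs.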

Performance metrics highlight its efficiency: throughput exceeds 500,000 tokens/sec, model loading typically completes in ~50 ms for a 32K vocabulary, vocabulary lookup runs in O(k) per token (where k is the token length), and BPE merge operations are optimized to O(1) per merge. The tokenizer loads binary protobuf .model files generated by the SentencePiece C++ library.
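The O(k) lookup cost follows from the fact that resolving a piece touches each of its k characters exactly once, whether via a hash of the string or a character trie. The trie sketch below illustrates this; it is an explanatory example, not the library's actual data structure.

```dart
// Illustrative sketch of O(k) vocabulary lookup: a character trie walks
// one node per character of the queried piece. Not the library's code.
class TrieNode {
  final Map<String, TrieNode> children = {};
  int? tokenId; // set when a vocabulary piece ends at this node
}

class Vocab {
  final root = TrieNode();

  void add(String piece, int id) {
    var node = root;
    for (final ch in piece.split('')) {
      node = node.children.putIfAbsent(ch, TrieNode.new);
    }
    node.tokenId = id;
  }

  /// Returns the piece's ID, visiting one node per character: O(k).
  int? lookup(String piece) {
    var node = root;
    for (final ch in piece.split('')) {
      final next = node.children[ch];
      if (next == null) return null; // not in vocabulary
      node = next;
    }
    return node.tokenId;
  }
}

void main() {
  // "▁" is SentencePiece's word-boundary marker.
  final vocab = Vocab()
    ..add('▁he', 0)
    ..add('llo', 1);
  print(vocab.lookup('▁he')); // 0
  print(vocab.lookup('xyz')); // null
}
```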