GitHub - brody-0125/dart_sentencepiece_tokenizer: A lightweight, pure Dart implementation of SentencePiece tokenizer. Supports BPE (Gemma) and Unigram (Llama) algorithms.
Key Points
- `dart_sentencepiece_tokenizer` is a lightweight, memory-efficient, pure Dart library for SentencePiece tokenization, supporting the BPE (Gemma) and Unigram (Llama) algorithms without external dependencies.
- The library provides a comprehensive API for encoding, decoding, batch processing, padding, truncation, and offset mapping, and ensures HuggingFace compatibility through `tokenizer.json` serialization, dynamic token addition, and a streaming interface.
- Optimized with typed arrays and O(1) BPE merges, it sustains 500K+ tokens/sec throughput and runs across Flutter, server, and web environments, including ONNX Runtime integration.
dart_sentencepiece_tokenizer is a lightweight, pure Dart implementation of the SentencePiece tokenizer, designed for cross-platform use (Flutter, Server, CLI, Web) with zero external dependencies. It supports the two primary SentencePiece algorithms: Byte Pair Encoding (BPE), as used by Gemma models, and Unigram, utilized by Llama models.
The core methodology is an efficient implementation of these subword tokenization algorithms. For BPE, merge operations run in O(1) using a linked-list data structure together with merge caching, which significantly improves tokenization throughput. The Unigram algorithm, based on a probabilistic model, is also fully supported, selecting variable-length subword units according to their likelihood in a given corpus.
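The linked-list scheme can be sketched as follows. This is an illustrative, self-contained example rather than the library's actual code: symbols live in a doubly linked list so each merge is a constant-time splice. (The library additionally caches merge lookups; this naive sketch rescans adjacent pairs on every iteration.)

```dart
// Illustrative BPE sketch (not the library's actual code): symbols live in a
// doubly linked list, so applying a merge is a constant-time splice.
class _Node {
  String piece;
  _Node? prev, next;
  _Node(this.piece);
}

/// Greedily applies the lowest-ranked merge until none applies.
/// `mergeRanks` maps "left right" symbol pairs to their learned merge order.
List<String> bpeEncode(String word, Map<String, int> mergeRanks) {
  if (word.isEmpty) return [];
  // Start from single-character symbols linked in sequence.
  final head = _Node(word[0]);
  var tail = head;
  for (var i = 1; i < word.length; i++) {
    final n = _Node(word[i])..prev = tail;
    tail.next = n;
    tail = n;
  }
  while (true) {
    // Find the adjacent pair with the lowest (earliest-learned) rank.
    _Node? best;
    var bestRank = 1 << 30;
    for (var n = head; n.next != null; n = n.next!) {
      final rank = mergeRanks['${n.piece} ${n.next!.piece}'];
      if (rank != null && rank < bestRank) {
        bestRank = rank;
        best = n;
      }
    }
    if (best == null) break;
    // Splice out the right-hand node in O(1).
    final nxt = best.next!;
    best.piece += nxt.piece;
    best.next = nxt.next;
    nxt.next?.prev = best;
  }
  final out = <String>[];
  for (_Node? n = head; n != null; n = n.next) {
    out.add(n.piece);
  }
  return out;
}

void main() {
  // Merges learned in order: "l o" first, then "lo w", then "e r".
  print(bpeEncode('lower', {'l o': 0, 'lo w': 1, 'e r': 2})); // [low, er]
}
```

Because each merge only touches the two nodes being joined, the splice itself never depends on the word length; the cost of *finding* the next merge is what the library's caching removes.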
Key features and technical details include:
- Memory Efficiency: The library leverages Dart's typed arrays (`Int32List` for token IDs, `Uint8List` for attention masks, type IDs, and special token masks) to achieve a 50-70% reduction in memory footprint compared to standard Dart lists. Specifically, token IDs consume 4 bytes/token, while `typeIds`, `attentionMask`, and `specialTokensMask` each consume 1 byte/token.
- Comprehensive API:
  - Encoding: Supports single text encoding (`encode`), sentence pair encoding (`encodePair`), and batch encoding (`encodeBatch`, `encodeBatchParallel`, `encodePairBatch`). Encoding outputs an `Encoding` object containing `tokens` (`List<String>`), `ids` (`Int32List`), `attentionMask` (`Uint8List`), `typeIds` (`Uint8List`), `specialTokensMask` (`Uint8List`), `offsets` (`List<(int, int)>` for character spans), `wordIds` (`List<int?>`), and `sequenceIds` (`List<int?>`). Input text length is capped at 500,000 characters to prevent out-of-memory (OOM) errors.
  - Decoding: Provides `decode` for a single token ID sequence and `decodeBatch` for multiple sequences, with an option to skip special tokens.
  - Padding and Truncation: Offers both fluent API configuration (`enablePadding`, `enableTruncation`) and manual methods (`withPadding`, `withTruncation`, `withPaddingToMultipleOf`). Truncation strategies for pair encoding include `longestFirst`, `onlyFirst`, `onlySecond`, and `doNotTruncate`.
  - Offset Mapping: Enables precise character-to-token, token-to-character, and word-to-token index mapping (`charToToken`, `tokenToChars`, `wordToTokens`, `tokenToWord`, `tokenToSequence`).
- Vocabulary Management: Provides access to the vocabulary size (`vocabSize`) and special token IDs (`unkId`, `bosId`, `eosId`, `padId`). It supports conversion between tokens and IDs (`convertTokensToIds`, `convertIdsToTokens`), vocabulary lookup (`vocab.contains`), and retrieval of the full vocabulary (`getVocab`).
- HuggingFace Compatibility: Adheres to HuggingFace tokenization standards, including a `tokenize` method and support for saving/loading tokenizer configurations in the `tokenizer.json` format.
- Dynamic Token Addition: Allows runtime addition of new regular tokens (`addTokens`) and special tokens (`addSpecialTokens`) to the vocabulary, which are then integrated into the tokenization process.
- Streaming API: Features a `TextStreamer` compatible with HuggingFace's `TextStreamer` for real-time decoding of LLM outputs. It includes heuristics for word boundaries, CJK character detection for immediate output, and options to skip special tokens or a specified number of prompt tokens.
- ONNX Runtime Integration: Facilitates direct integration with ONNX Runtime by providing encoded `input_ids` and `attention_mask` as `Int64List`, a common requirement for ONNX models.
- Configuration: Offers predefined configurations for Gemma (`SentencePieceConfig.gemma`, adding BOS and EOS tokens) and Llama (`SentencePieceConfig.llama`, adding BOS token only), alongside a custom configuration option.
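A hypothetical usage sketch tying these pieces together is shown below. The method and field names (`encode`, `decode`, `SentencePieceConfig.gemma`, `tokens`, `ids`, `attentionMask`) come from the feature list above, but the loader name `SentencePieceTokenizer.fromFile` and its exact signature are assumptions, not confirmed API:

```dart
// Hypothetical sketch: loader name and signature are assumed, not confirmed.
final tokenizer = await SentencePieceTokenizer.fromFile(
  'tokenizer.model',                 // binary protobuf .model file
  config: SentencePieceConfig.gemma, // adds BOS and EOS tokens
);

final enc = tokenizer.encode('Hello world');
print(enc.tokens);        // List<String> of subword pieces
print(enc.ids);           // Int32List of token IDs
print(enc.attentionMask); // Uint8List of 0/1 flags

final text = tokenizer.decode(enc.ids, skipSpecialTokens: true);
```

Consult the package's own API documentation for the actual construction and loading calls.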
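The typed-array memory layout and the ONNX handoff can be demonstrated with `dart:typed_data` alone; this standalone snippet assumes nothing about the library itself:

```dart
import 'dart:typed_data';

void main() {
  // Token IDs are stored as 4-byte ints; masks as 1-byte ints.
  final ids = Int32List.fromList([2, 9105, 613, 1]);
  final mask = Uint8List(ids.length)..fillRange(0, ids.length, 1);
  print(Int32List.bytesPerElement); // 4 bytes per token ID
  print(Uint8List.bytesPerElement); // 1 byte per mask entry

  // ONNX Runtime typically expects int64 input tensors, so the final
  // handoff widens the 32-bit IDs into an Int64List in a single copy.
  final inputIds = Int64List.fromList(ids);
  print(inputIds); // [2, 9105, 613, 1]
  print(mask.length == inputIds.length); // true
}
```

Note that `Int64List` is not available when compiling to JavaScript, which is why it is used only at the ONNX boundary rather than for internal storage.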
Performance metrics highlight its efficiency: throughput exceeds 500,000 tokens/sec, model loading typically completes in ~50ms for a 32K vocabulary, and vocabulary lookup runs in O(k) per token, where k is the token length. BPE merge operations are optimized to O(1) per merge. The tokenizer loads binary protobuf `.model` files generated by the SentencePiece C++ library.