yanolja/YanoljaNEXT-Rosetta-4B-2511 · Hugging Face

2025.11.09
· Hugging Face · by Anonymous
#LLM #translation #transformers #Gemma

Key Points

  1. YanoljaNEXT-Rosetta-4B-2511 is a 4-billion-parameter language model, fine-tuned from Gemma 3, specifically engineered for translating structured data formats like JSON, YAML, and XML while preserving their original structure.
  2. Trained on extensive synthetic multilingual datasets derived from FineWeb Edu and FineWeb2, the model covers 32 languages in equal proportion, excelling at cross-language structured-content conversion.
  3. It demonstrates competitive translation quality, evidenced by strong chrF++ scores against state-of-the-art models, but its primary utility and optimal performance are limited to structured-data input.

The YanoljaNEXT-Rosetta-4B-2511 model is a 4-billion parameter, decoder-only language model developed by Yanolja NEXT Co., Ltd., fine-tuned from google/gemma-3-4b-pt. It is specifically designed for the translation of structured data formats such as JSON, YAML, and XML, aiming to preserve the original data structure during translation. Unlike previous models in the EEVE series, this model does not feature an expanded tokenizer; instead, it exclusively utilizes the Gemma3ForCausalLM component from the base architecture for text generation.

The model's core methodology revolves around fine-tuning on synthetically generated multilingual translation datasets. These datasets were synthesized using the FineWeb corpora, specifically FineWeb Edu and FineWeb2. The training data covers 32 languages in equal proportion: Arabic, Bulgarian, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Tagalog, Thai, Turkish, Ukrainian, and Vietnamese. This comprehensive multilingual training aims to optimize performance across a wide range of language pairs.

For performance evaluation, translation quality is benchmarked with chrF++ scores on the WMT24++ dataset. For English-to-Korean translation, YanoljaNEXT-Rosetta-4B-2511 achieved a chrF++ score of 35.64, demonstrating competitive performance against larger and more established models. For instance, it scores close to openai/gpt-4o (36.08) and surpasses google/gemini-2.5-flash (35.25), tencent/Hunyuan-MT-7B (34.76), and several other Gemma variants and larger models, highlighting its efficiency as a 4-billion-parameter model.
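chrF++ is a character-level F-score. As a rough illustration of the underlying idea, here is a minimal pure-Python sketch of plain chrF (character n-gram F2 score, without the word n-grams that chrF++ adds); the function name and defaults are illustrative, and real evaluations should use a reference implementation such as sacreBLEU.

```python
from collections import Counter

def char_ngrams(text, n):
    # chrF operates on character n-grams with whitespace stripped.
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Sketch of chrF: average character n-gram precision and recall
    over n = 1..max_n, combined with an F-beta score (beta=2 weights
    recall twice as heavily as precision)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Identical strings score 100; strings with no shared character n-grams score 0. chrF++ additionally averages in word unigram and bigram statistics, which is what the scores above report.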

The model is intended for applications requiring structured-data translation, such as localizing product catalogs, translating hotel reviews, or processing other structured content. Its optimization targets these specific data formats, so performance on unstructured text or other data types may vary. Known limitations include occasionally generating invalid JSON, repetitive output, or inaccurate translations.
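Given those failure modes, a caller will typically want to validate the model's output before using it. A minimal, hypothetical guard (the function name and acceptance policy are illustrative, not part of the model card):

```python
import json

def parse_translation(raw: str, expected_keys: set):
    """Parse raw model output as JSON and check that the top-level
    structure survived translation. Returns None when the output is
    unusable (invalid JSON or wrong keys) so the caller can retry."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model produced invalid JSON
    if not isinstance(data, dict) or set(data) != expected_keys:
        return None  # structure was not preserved
    return data
```

A retry loop around `parse_translation` (possibly with a lower temperature on retry) is a common way to handle the occasional invalid output without failing the whole batch.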

The model is released under the Gemma license, inherited from its base model. This work was supported by the Korea Creative Content Agency (KOCCA) grant from the Ministry of Culture, Sports and Tourism (MCST) in 2025.

In terms of technical usage with the transformers library, the model can be loaded using AutoModelForCausalLM.from_pretrained with dtype=torch.bfloat16 and device_map="auto" for efficient memory management (optionally capping per-device usage, e.g., max_memory={0: "23GB"}). AutoTokenizer.from_pretrained loads the corresponding tokenizer. Input prompts are constructed with the model's chat template (tokenizer.apply_chat_template) using system and user roles. The system prompt can define the target language, context, tone, and a glossary, specify "Output format: JSON", and instruct the model to translate immediately. The user prompt contains the source structured data, typically serialized with json.dumps. Generation runs under torch.inference_mode() via model.generate with parameters such as max_new_tokens. The generated tokens are then decoded and parsed back into the target structured format (e.g., JSON).
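The prompt-construction step described above can be sketched as follows. The helper name and the exact system-prompt wording are assumptions for illustration, not the model card's verbatim template; the resulting messages list is what would be passed to tokenizer.apply_chat_template(..., add_generation_prompt=True) before model.generate.

```python
import json

def build_messages(source: dict, target_language: str,
                   context: str = "", tone: str = "",
                   glossary: dict = None):
    # System prompt: target language, optional context/tone, glossary,
    # plus the "Output format: JSON" and translate-immediately instructions.
    lines = [f"Translate the user's content into {target_language}."]
    if context:
        lines.append(f"Context: {context}")
    if tone:
        lines.append(f"Tone: {tone}")
    if glossary:
        lines.append("Glossary:")
        lines.extend(f"- {src} -> {dst}" for src, dst in glossary.items())
    lines.append("Output format: JSON")
    lines.append("Provide the translation immediately without commentary.")
    return [
        {"role": "system", "content": "\n".join(lines)},
        # User turn carries the source structured data as serialized JSON.
        {"role": "user", "content": json.dumps(source, ensure_ascii=False)},
    ]

messages = build_messages(
    {"title": "Ocean View Suite", "amenities": ["Wi-Fi", "breakfast"]},
    target_language="Korean",
    context="Hotel room listing",
    glossary={"suite": "스위트"},
)
```

From here, tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt") produces the input IDs, model.generate(..., max_new_tokens=...) under torch.inference_mode() produces the output tokens, and the decoded completion is parsed back with json.loads.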