NC-AI-consortium-VAETKI/VAETKI · Hugging Face
Key Points
- VAETKI is a 112.2 billion-parameter Mixture-of-Experts (MoE) large language model developed by the NC-AI consortium, prioritizing efficiency and scalability for diverse applications.
- Designed for both research and real-world use, it supports advanced reasoning, domain-specific tasks, and instruction following in Korean, English, Chinese, and Japanese.
- Trained on a 9.8 trillion-token dataset, the model exhibits strong multilingual and reasoning capabilities, but carries limitations concerning factual accuracy, complex reasoning, and potential biases.
VAETKI is a large language model developed collaboratively by the NC-AI consortium, a group of 13 organizations led by NC-AI. Its primary design goals are efficiency and scalability, achieved through a Mixture-of-Experts (MoE) architecture. The released model, VAETKI-100B-A10B, is built for both research and real-world applications, aiming to support advanced reasoning, domain-specific tasks, and agent-oriented systems.
The model uses a causal (auto-regressive) Transformer architecture with an MoE setup. Key architectural specifications include:
- Total Parameters: 112.2 billion
- Activated Parameters: 10.1 billion
- Non-Embedding Parameters: 111.3 billion
- Number of Layers: 48
- Number of Attention Heads: 24
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 32k tokens
- Vocabulary Size: 126k
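The efficiency gain of this configuration comes from activating only 8 of the 128 experts per token, so roughly 10.1B of the 112.2B parameters participate in each forward pass. A minimal sketch of top-k expert routing with these numbers (gate logits are random placeholders, not the model's actual gating network):

```python
import math
import random

# Expert counts from the VAETKI-100B-A10B spec above; the gating logic
# itself is an illustrative top-k sketch, not the model's implementation.
NUM_EXPERTS = 128
TOP_K = 8

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(gate_logits, top_k=TOP_K):
    """Pick the top-k experts for one token and renormalize their weights."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
routing = route_token(logits)
print(routing)  # 8 (expert_index, weight) pairs; weights sum to 1
```

Because each token touches only its 8 routed experts, inference cost scales with the 10.1B activated parameters rather than the 112.2B total.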
VAETKI operates in a "thinking mode" for most tasks, except for Tool Agent tasks which use a "non-thinking mode." It emphasizes strong human preference alignment for instruction following and natural conversation. The model supports instruction following and translation in Korean, English, Chinese, and Japanese.
Training involved both pre-training and post-training stages. The training data comprises a diverse set of datasets totaling 9.8 trillion tokens, including:
- FineWeb-2 (kor_Hang): 54.5B tokens
- FineWeb2-HQ: 338.9B tokens
- The Stack v2: 1.571T tokens
- StackExchange_Mar2023: 2.6B tokens
- Multiple finemath datasets (finemath-3plus, infiwebmath-3plus, finemath-4plus): 37.4B, 23.7B, 10.4B tokens respectively.
- proof-pile-2: 28.2B tokens
- Nemotron series (CC-v2, CC-Math-v1, Pretraining-Code-v1, Pretraining-SFT-v1, PrisMath): 3.360T, 214.3B, 191.4B, 367.2B, 6.2B tokens respectively.
- DCLM-baseline-1.0: 3.190T tokens
- WanJuan-Korean: 68.9B tokens
- MegaMath: 208.0B tokens
- Stack-Edu: 86.7B tokens
- AceReason-1.1-SFT: 31.4B tokens
- OpenScience-OS-Q2: 18.1B tokens
- OpenScience-OS-Q3: 0.7B tokens
- OpenCodeGeneticInstruct (Qwen2.5-32b-instruct, mixtral-8x22b-instruct): 6.8B, 9.0B tokens respectively.
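As a sanity check, the corpus sizes listed above can be summed to confirm they are consistent with the stated 9.8 trillion-token total (dataset names abbreviated from the list; sizes in billions of tokens as given on this card):

```python
# Per-corpus sizes in billions of tokens, copied from the list above.
corpus_tokens_b = {
    "FineWeb-2 (kor_Hang)": 54.5,
    "FineWeb2-HQ": 338.9,
    "The Stack v2": 1571.0,
    "StackExchange_Mar2023": 2.6,
    "finemath-3plus": 37.4,
    "infiwebmath-3plus": 23.7,
    "finemath-4plus": 10.4,
    "proof-pile-2": 28.2,
    "Nemotron-CC-v2": 3360.0,
    "Nemotron-CC-Math-v1": 214.3,
    "Nemotron-Pretraining-Code-v1": 191.4,
    "Nemotron-Pretraining-SFT-v1": 367.2,
    "Nemotron-PrisMath": 6.2,
    "DCLM-baseline-1.0": 3190.0,
    "WanJuan-Korean": 68.9,
    "MegaMath": 208.0,
    "Stack-Edu": 86.7,
    "AceReason-1.1-SFT": 31.4,
    "OpenScience-OS-Q2": 18.1,
    "OpenScience-OS-Q3": 0.7,
    "OpenCodeGeneticInstruct (Qwen2.5-32b-instruct)": 6.8,
    "OpenCodeGeneticInstruct (mixtral-8x22b-instruct)": 9.0,
}
total_t = sum(corpus_tokens_b.values()) / 1000  # billions -> trillions
print(f"{total_t:.2f}T tokens")  # ~9.83T, consistent with the stated 9.8T
```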
To enhance multilingual and reasoning capabilities, large-scale NIA-supported datasets were constructed. During pre-training, 7.6 billion tokens were integrated from Chinese and Japanese corpora, along with data for long-context comprehension and Chain-of-Thought (CoT) reasoning. In post-training, an additional 10-billion-token dataset focusing on specialized Korean studies and mathematical reasoning was developed.
The training procedure was executed on the Naver Cloud MLX Platform, utilizing 1,016 NVIDIA H100 80GB HBM3 GPUs interconnected by InfiniBand 400 Gb/s (6 lanes, with 4 used for RDMA-based inter-node communication). The model architecture, training loop, checkpointing, and distributed optimization logic were implemented based on Megatron-Core v0.14, with internal modifications for research and optimization.
Key hyperparameters evolved during training:
- Learning Rate: Started at 2e-4, reduced to 1e-4, and finalized at 8e-5.
- Batch Size: Increased from 8.1M tokens to 33M tokens, and finally to 46M tokens.
- Context Length: Started at 4096, then increased to 32768.
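The staged schedule can be captured as a small config fragment. Note the card lists each hyperparameter's progression separately and does not state stage boundaries or how the stages of different parameters align, so only the ordering of values below comes from the source:

```python
# Staged training schedule as listed on the card; stage boundaries
# (token counts at which each transition occurs) are not published.
SCHEDULE = {
    "learning_rate": [2e-4, 1e-4, 8e-5],
    "batch_size_tokens": [8_100_000, 33_000_000, 46_000_000],
    "context_length": [4096, 32768],
}
print(SCHEDULE["learning_rate"][-1])  # final LR: 8e-05
```

Ramping batch size up while decaying the learning rate, and extending context length late in training, is a common pattern for large-scale pre-training runs.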
Evaluation results compare VAETKI-100B-A10B against gpt-oss-120b (which has 117B total parameters and 5.1B activated parameters) on various benchmarks.
| Category | Benchmark | VAETKI-100B-A10B | gpt-oss-120b |
|---|---|---|---|
| Korean General | KMMLU-Pro | 58.4 | 61.9 |
| Korean General | CLIcK | 75.5 | 73.0 |
| Korean General | KoBALT | 47.5 | 46.0 |
| Korean Reasoning | HRM8K | 70.6 | 83.3 |
| English General | MMLU-Pro | 71.0 | 79.1 |
| English Reasoning | GPQA-Diamond | 53.2 | 73.1 |
| English Reasoning | HLE (text only) | 5.9 | 8.6 |
| English Reasoning | IFBench | 52.3 | 63.1 |
| English Reasoning | IFEval | 86.0 | 83.6 |
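The per-benchmark gap can be summarized directly from these scores (a positive delta means VAETKI scores higher than gpt-oss-120b):

```python
# (VAETKI-100B-A10B, gpt-oss-120b) score pairs from the evaluation above.
scores = {
    "KMMLU-Pro": (58.4, 61.9),
    "CLIcK": (75.5, 73.0),
    "KoBALT": (47.5, 46.0),
    "HRM8K": (70.6, 83.3),
    "MMLU-Pro": (71.0, 79.1),
    "GPQA-Diamond": (53.2, 73.1),
    "HLE (text only)": (5.9, 8.6),
    "IFBench": (52.3, 63.1),
    "IFEval": (86.0, 83.6),
}
deltas = {name: round(v - g, 1) for name, (v, g) in scores.items()}
wins = [name for name, d in deltas.items() if d > 0]
print(wins)  # VAETKI leads on CLIcK, KoBALT, and IFEval
```

The pattern is consistent with the card's framing: VAETKI is competitive on Korean-centric and instruction-following benchmarks while trailing on the heavier English reasoning suites.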
Limitations of the model include potential for inaccurate or incomplete outputs, hallucinated content, and difficulties with complex multi-step reasoning, precise mathematical computation, and strict code generation. It lacks independent information verification. Training data may contain social or cultural biases, which could be reflected in the model's outputs. The model is not designed for safety-critical or regulated domains. VAETKI is licensed under the MIT License.