skt/A.X-K1 · Hugging Face
Key Points
- A.X K1 is a large-scale Mixture-of-Experts (MoE) language model featuring 519 billion total and 33 billion active parameters, designed for efficient high-capacity reasoning and instruction following.
- Its key innovations include a hybrid reasoning control ("Think" and "Non-Think" modes for adaptable response depth), a multilingual and code-optimized tokenizer, and architectural enhancements like post-MLP RMSNorm and Multi-Token Prediction for training stability.
- Benchmarked against other large models, A.X K1 demonstrates competitive performance across knowledge, instruction following, math, and code domains in both English and Korean, with integration support for vLLM and SGLang for efficient inference.
A.X K1 is a large-scale Mixture-of-Experts (MoE) language model developed by SK Telecom, trained from scratch to enable efficient high-capacity reasoning and instruction following. The model boasts 519 billion total parameters but activates only 33 billion parameters per token, allowing for strong performance while maintaining practical inference efficiency. This design facilitates a hybrid approach, offering users control over reasoning depth versus response latency.
Key Features:
- Large-Scale Sparse MoE: Utilizes a Mixture-of-Experts architecture that activates a small subset of experts per token (8 out of 192 experts, plus 1 shared expert), significantly increasing model capacity with computational costs comparable to smaller dense models. This design supports scalability through expert parallelism.
- Hybrid Reasoning Control (Think / Non-Think): Provides user-controllable reasoning depth. In "Think" mode, the model generates explicit reasoning steps for complex problem-solving and multi-step inferences. In "Non-Think" mode, it delivers concise, direct responses optimized for low-latency applications.
- Optimized Tokenizer: Employs a large-vocabulary BBPE-based tokenizer optimized for token efficiency across five languages (English, Korean, Chinese, Japanese, Spanish), with a focus on source code, structured text, and programming patterns.
- Stability-Oriented Architecture: Incorporates RMSNorm normalization both before and after MLP (MoE) blocks within each Transformer layer, enhancing training stability and robustness in sparse, long-context settings.
Model Details:
- Architecture: Decoder-only Transformer with Mixture-of-Experts.
- Total Parameters: 519 Billion.
- Active Parameters: 33 Billion per token.
- Experts: 192 experts + 1 shared expert.
- Active Experts: 8 experts + 1 shared expert per token.
- Number of Layers: 61 (1 dense + 60 MoE).
- Number of Attention Heads: 64.
- Intermediate Size: 7168.
- Expert Intermediate Size: 2048.
- Normalization: RMSNorm applied before and after the MLP block.
- Attention Mechanism: Multi-Latent Attention (MLA).
- Vocab Size: 163,840.
- Context Length: 131,072 tokens.
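The parameter counts above imply a strongly sparse model; a quick back-of-envelope check shows only about 6.4% of the total parameters are active for any given token:

```python
# Sanity check on A.X K1's sparsity from the figures in the model details:
# 33B active parameters out of 519B total.
total_params = 519e9   # total parameters
active_params = 33e9   # parameters activated per token

active_ratio = active_params / total_params
print(f"Active fraction: {active_ratio:.1%}")  # Active fraction: 6.4%
```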
Core Methodology:
- Mixture-of-Experts Design: The core of A.X K1's architecture is its sparse MoE setup. Instead of activating all parameters for every token, only a select few "experts" (MLP layers) are chosen by a router network. This drastically increases the model's total capacity without a proportional increase in computational cost during inference, as only the activated experts contribute to the forward pass. This enables the model to specialize different parts of its network for various types of input or tasks. The model's capacity grows primarily by adding experts, making it highly scalable, and expert parallelism allows for distributed training and serving.
- Hybrid Reasoning Fusion (Think / Non-Think): This feature is a unique aspect allowing dynamic control over the model's output strategy. In "Think" mode, the model is prompted to explicitly generate internal "thought" processes or reasoning steps before producing the final answer. This is beneficial for complex tasks requiring multi-step logical deductions, similar to chain-of-thought prompting, but integrated directly into the model's generation capabilities. In "Non-Think" mode, the model bypasses these explicit reasoning steps, producing direct and concise answers suitable for applications prioritizing low latency and directness. This is achieved within a single unified model, offering a trade-off between reasoning depth and response speed based on user requirements.
- Post-MLP RMSNorm: Unlike standard Transformer architectures that typically apply normalization only before the MLP block, A.X K1 introduces an additional RMSNorm layer *after* the MLP (MoE) block in each Transformer layer. This design choice is critical for improving training stability, especially in large-scale sparse MoE models, and enhances robustness during reasoning-intensive and long-context generations. RMSNorm is defined as:

$$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^{2} + \epsilon}}$$

where $d$ is the dimension of the input vector $x$, and $\epsilon$ is a small constant for numerical stability.
- Multi-Token Prediction (MTP): During training, A.X K1 utilizes a multi-token prediction objective. In addition to the standard next-token prediction, the model is also trained to predict one future token beyond the immediate next token from a single forward pass. This serves as an auxiliary signal, contributing to the stabilization of training for large-scale models. While beneficial for training, MTP does not alter the standard autoregressive decoding process at inference time. However, it provides advantages for speculative decoding, which can lead to higher inference throughput when used with compatible serving frameworks.
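The top-k expert routing described under Mixture-of-Experts Design can be sketched in NumPy as follows. The hidden size, gating scheme, and random weights are illustrative assumptions for a single token, not A.X K1's actual implementation; only the expert counts (8 of 192, plus 1 shared) come from the model card.

```python
import numpy as np

# Minimal sketch of sparse top-k expert routing: a router scores 192
# experts per token, the top 8 are activated, and 1 shared expert
# always contributes. Hidden size and weights are illustrative.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 192, 8

x = rng.standard_normal(d_model)                        # one token's hidden state
router_w = rng.standard_normal((n_experts, d_model))    # router projection
expert_w = rng.standard_normal((n_experts, d_model, d_model)) * 0.01
shared_w = rng.standard_normal((d_model, d_model)) * 0.01

logits = router_w @ x                        # router score per expert
top_idx = np.argsort(logits)[-top_k:]        # indices of the 8 activated experts
gates = np.exp(logits[top_idx] - logits[top_idx].max())
gates /= gates.sum()                         # softmax over the selected experts only

# Only the 8 selected experts (plus the shared expert) run a forward pass,
# so compute scales with top_k, not with the full expert count.
out = shared_w @ x
for g, i in zip(gates, top_idx):
    out += g * (expert_w[i] @ x)

print(out.shape, len(top_idx))  # (16,) 8
```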
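On the client side, hybrid Think / Non-Think output might be handled as below. This sketch assumes the reasoning trace is wrapped in `<think>...</think>` delimiters, a common convention for hybrid-reasoning models; A.X K1's actual markers may differ, so treat the tag names as assumptions.

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, final_answer) from a model completion.

    Assumes Think-mode output wraps its reasoning in <think>...</think>
    (a hypothetical delimiter; check the model's chat template).
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:                       # Non-Think mode: no reasoning trace
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()     # everything after the closing tag
    return reasoning, answer

think_output = "<think>2 pencils cost 2 * 3 = 6.</think>The total is 6."
print(split_reasoning(think_output))       # ('2 pencils cost 2 * 3 = 6.', 'The total is 6.')
print(split_reasoning("The total is 6."))  # ('', 'The total is 6.')
```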
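The RMSNorm operation used before and after the MoE block translates directly into NumPy (shown here without the optional learned gain):

```python
import numpy as np

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last axis: x / sqrt(mean(x^2) + eps)."""
    d = x.shape[-1]
    rms = np.sqrt(np.sum(x * x, axis=-1, keepdims=True) / d + eps)
    return x / rms

x = np.array([3.0, -4.0])         # mean(x^2) = (9 + 16) / 2 = 12.5
y = rms_norm(x)
print(np.sqrt(np.mean(y * y)))    # ~1.0: the normalized output has unit RMS
```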
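The Multi-Token Prediction objective can be illustrated with a toy loss computation: one forward pass yields logits for the next token (t+1) and an auxiliary prediction of the token after that (t+2), and the auxiliary cross-entropy is added with a small weight. The vocabulary size, weighting factor, and random logits here are illustrative assumptions, not A.X K1's training recipe.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, target: int) -> float:
    """Negative log-probability of the target under softmax(logits)."""
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[target])

vocab = 8
rng = np.random.default_rng(0)
logits_next = rng.standard_normal(vocab)    # head predicting token t+1
logits_next2 = rng.standard_normal(vocab)   # auxiliary MTP head predicting t+2

# Combined objective: standard next-token loss plus a down-weighted
# MTP term (the 0.3 weight is an illustrative choice).
loss = cross_entropy(logits_next, target=3) + 0.3 * cross_entropy(logits_next2, target=5)
print(loss > 0)  # True: both terms are standard cross-entropy losses
```

At inference time this auxiliary head is not needed for ordinary autoregressive decoding, but, as noted above, its draft predictions can feed speculative decoding in compatible serving frameworks.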
Evaluation Results:
A.X K1's performance was evaluated in both "Thinking Mode" and "Non-Thinking Mode" across diverse domains and languages (English and Korean) against DeepSeek-V3.1 and GLM-4.6. In Thinking Mode, A.X K1 showed strong results in Knowledge (e.g., 80.2 KMMLU), Instruction Following (64.7 IFBench prompt-loose), Math (89.8 AIME25), and Code (75.8 LiveCodeBench v6), often being competitive with or outperforming its counterparts in specific benchmarks like LiveCodeBench v6. In Non-Thinking Mode, its performance was generally lower than in Thinking Mode, reflecting the trade-off for conciseness, yet it still performed comparably in certain benchmarks.
Usage:
The model can be integrated with Hugging Face Transformers for direct inference. Initial integrations are also provided for vLLM and SGLang, supporting multi-node, tensor-parallel configurations with long-context support for high inference throughput.
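As a serving sketch, a vLLM launch might look like the following. The flags shown are standard vLLM options; the exact flags and parallelism settings recommended for A.X K1 may differ, so consult the model card before deploying.

```shell
# Hypothetical single-node launch (standard vLLM flags; settings are
# illustrative, not the officially recommended configuration).
vllm serve skt/A.X-K1 \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --trust-remote-code
```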
Limitations:
A.X K1, being a stochastic model, may produce incorrect or misleading information. Its "Think" mode reasoning outputs should not be taken as literal representations of the model's internal decision process. Performance can vary across domains and languages based on data coverage.