Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled · Hugging Face
Key Points
- Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is a fine-tuned model based on Qwen3.5, which distills Chain-of-Thought (CoT) reasoning primarily from Claude-4.6 Opus interactions to enhance structured problem-solving.
- Through Supervised Fine-Tuning, the model learns to meticulously plan solutions within `<think>` tags, adopting an efficient "analyze-break down-formulate-execute" reasoning scaffold to reduce redundant cognitive loops.
- The model offers significantly improved autonomy and stability in coding agent environments, natively supports the "developer" role, and maintains a full chain-of-thought, making it highly effective for complex analytical and logic-dependent tasks.
The Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled model is a specialized large language model (LLM) designed to enhance reasoning capabilities by distilling Chain-of-Thought (CoT) logic from Claude-4.6 Opus onto the Qwen3.5-27B base architecture. Developed with a focus on improving autonomy and stability, particularly within coding agent environments, this model addresses noted limitations of the official Qwen3.5 model, such as Jinja template issues with the "developer" role and inconsistent thinking mode preservation.
The core methodology employed is Supervised Fine-Tuning (SFT) combined with Low-Rank Adaptation (LoRA), utilizing the Unsloth framework for efficient memory and computational optimization. The fine-tuning process leverages a meticulously curated dataset composed of high-quality reasoning distillation data. This includes nohurry/Opus-4.6-Reasoning-3000x-filtered, which provides comprehensive Claude 4.6 Opus reasoning trajectories, and Jackrong/Qwen3.5-reasoning-700x, contributing additional structured problem-solving samples.
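The SFT + LoRA setup described above can be sketched with the Unsloth framework. This is a minimal, non-runnable configuration sketch, not the released training recipe: the base-model repo id, sequence length, and LoRA hyperparameters are illustrative assumptions, and the chat-template markers are the standard Qwen ones.

```python
# Sketch of the SFT + LoRA setup using Unsloth (hyperparameters assumed).
from unsloth import FastLanguageModel
from unsloth.chat_templates import train_on_responses_only

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3.5-27B",  # base model; exact HF repo id assumed
    max_seq_length=8192,            # illustrative value
    load_in_4bit=True,              # memory-efficient 4-bit base weights
)

# Attach low-rank adapters so only a small fraction of weights is trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# After building an SFTTrainer (omitted), mask instruction tokens so the
# loss covers only the model's responses -- the strategy named below.
trainer = train_on_responses_only(
    trainer,                                # an SFTTrainer built beforehand
    instruction_part="<|im_start|>user\n",  # Qwen chat-template markers (assumed)
    response_part="<|im_start|>assistant\n",
)
```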
A critical technical aspect of the SFT stage is the `train_on_responses_only` strategy. This involves masking the initial instructions from the loss calculation, ensuring that the model's loss is computed *purely over the generation of the internal reasoning sequences and the subsequent final solutions*. Formally, if a training sample consists of an input $x$, an internal reasoning sequence $r$, and a final answer $y$, the objective function (e.g., cross-entropy loss) is designed such that:

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{|z|} \log p_\theta\!\left(z_t \mid x, z_{<t}\right), \qquad z = (r, y),
$$

where $z_t$ belongs only to the sequence $z = (r, y)$ and $z_{<t}$ is the preceding tokens within the same sequence, with the initial instructions $x$ conditioned on but ignored for loss calculation. This forces the model to independently generate its internal thought process and the resulting solution.
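The effect of this masking can be illustrated with a minimal, self-contained sketch. The probabilities and mask below are toy values chosen for illustration, not real model outputs:

```python
import math

def response_only_loss(token_probs, loss_mask):
    """Mean cross-entropy over response tokens only.

    token_probs: probability the model assigned to each target token
    loss_mask:   1 for reasoning/answer tokens, 0 for masked prompt tokens
    """
    losses = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(losses) / len(losses)

# Toy sequence: 3 prompt tokens (masked out) + 4 response tokens (trained on)
probs = [0.9, 0.8, 0.95, 0.5, 0.6, 0.7, 0.4]
mask  = [0,   0,   0,    1,   1,   1,   1]
loss = response_only_loss(probs, mask)  # averages -log p over unmasked tokens only
```

The three prompt tokens contribute nothing to the loss regardless of how well the model predicts them; gradient signal comes only from the reasoning and answer tokens.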
All training samples were systematically normalized to enforce a strict output structure: a `<think>…</think>` reasoning block followed by the final answer. This structural imitation is central to the model's adoption of an efficient, streamlined reasoning paradigm, exemplified by a learned scaffold like "Let me analyze this request carefully: 1..2..3...", which aims to reduce redundant cognitive loops observed in the base model.
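A normalization step along these lines can be sketched as follows; the function name and field layout are assumptions for illustration, based on the `<think>` tag convention described above:

```python
def normalize_sample(reasoning: str, answer: str) -> str:
    """Enforce the <think>...</think> + final-answer output structure
    that every training sample is normalized to."""
    return f"<think>\n{reasoning.strip()}\n</think>\n\n{answer.strip()}"

sample = normalize_sample(
    "Let me analyze this request carefully: 1. parse input 2. plan 3. solve",
    "The result follows from step 3.",
)
```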
Key improvements and capabilities of the distilled model include:
- Enhanced Reasoning: It excels at breaking down complex user problems, formulating step-by-step methodologies within explicit `<think>` blocks, and delivering precise, nuanced solutions, making it well-suited for analytical tasks, coding, and mathematical problems.
- Robust Agent Performance: The model natively supports the "developer" role, keeps its thinking mode fully preserved and active (with `<think>` content visible in logs), and demonstrates significantly improved autonomy and stability in coding agent environments, capable of extended, self-correcting operation.
- Structured Thinking: By directly distilling Claude Opus's approach, the model exhibits a more confident and outlined planning phase in its internal reasoning, minimizing "trial-and-error" self-doubt.
- Resource Efficiency: Leveraging Unsloth, the model remains efficient to run, requiring approximately 16.5 GB of VRAM with Q4_K_M quantization and generating at 29–35 tokens/second, while preserving the full 262K context window.
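Because the thinking mode is fully preserved, downstream agent code typically needs to separate the reasoning block from the deliverable answer. A minimal parser, assuming the `<think>…</think>` output convention described earlier (the function name is illustrative):

```python
import re

def split_thinking(response: str):
    """Split a model response into (reasoning, answer).

    Returns (None, response) when no <think> block is present.
    """
    m = re.match(r"<think>(.*?)</think>\s*(.*)", response, re.DOTALL)
    if not m:
        return None, response.strip()
    return m.group(1).strip(), m.group(2).strip()

think, answer = split_thinking("<think>\nplan the fix in 3 steps\n</think>\nPatch applied.")
```

An agent loop can then log or discard `think` while acting only on `answer`.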
While the model demonstrates strong reasoning, it remains an autoregressive LLM and may still hallucinate external facts within its thinking sequences, so claims requiring real-world verification should be checked independently. It is primarily intended for logic-dependent prompting where transparency into the AI's internal thought process is beneficial. The current release is considered a preview, as the ecosystem around its specific architecture and tooling is still evolving.