Key Points
- This paper introduces the Very Big Video Reasoning (VBVR) suite, a comprehensive benchmark designed to push video generation models beyond visual fidelity toward physical-world commonsense and logical reasoning.
- VBVR includes a large-scale, systematically constructed dataset (VBVR-Dataset) with over 2 million samples and 200 tasks categorized by human cognitive faculties, alongside a verifiable evaluation framework (VBVR-Bench).
- Evaluations reveal that current state-of-the-art video models lag significantly behind human performance on reasoning tasks, yet increasing data scale improves model capabilities and induces emergent behaviors.
The paper introduces the Very Big Video Reasoning (VBVR) suite, a novel framework designed to address the critical limitations of current video generation models in logical reasoning and physical commonsense. While models like Sora and Veo excel in visual quality, they exhibit significant deficiencies in higher-level cognitive abilities. To bridge this gap, VBVR provides a large-scale, systematically designed dataset (VBVR-Dataset) and a verifiable evaluation framework (VBVR-Bench).
The core innovation of VBVR-Dataset lies in its task taxonomy, which is meticulously constructed based on established theories from human cognitive science (e.g., Kant, Anderson). This taxonomy decomposes video reasoning into five fundamental cognitive faculties:
- Perception: The ability to extract structured representations from raw sensory input.
- Transformation: The capacity to operate on and compose mental representations.
- Spatiality: An intuitive understanding of position, navigation, and spatial relationships.
- Abstraction: The skill of distilling general patterns and rules from concrete experiences.
- Knowledge: The application of prior knowledge and logical rules to new situations.
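To make the taxonomy concrete, the five faculties above can be sketched as a small Python enum, with one task tagged by the faculties it exercises. The enum names follow the list above; the per-task tagging is illustrative, not taken from the paper:

```python
from enum import Enum

class CognitiveFaculty(Enum):
    """The five cognitive faculties of the VBVR task taxonomy."""
    PERCEPTION = "extract structured representations from raw sensory input"
    TRANSFORMATION = "operate on and compose mental representations"
    SPATIALITY = "understand position, navigation, and spatial relationships"
    ABSTRACTION = "distill general patterns and rules from concrete experiences"
    KNOWLEDGE = "apply prior knowledge and logical rules to new situations"

# Hypothetical tagging: maze solving plausibly combines several faculties.
maze_solving = {
    CognitiveFaculty.PERCEPTION,
    CognitiveFaculty.SPATIALITY,
    CognitiveFaculty.KNOWLEDGE,
}
```

Representing faculties as an enum makes task metadata queryable, e.g. filtering the dataset for all tasks that exercise `SPATIALITY`.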
The dataset comprises 200 distinct tasks and approximately 2 million samples in total, split equally into training and test samples. This represents an order-of-magnitude increase in scale over existing benchmarks such as Video-Zero-Shot and Ruler-Bench, providing a substantial foundation for training robust reasoning models. Task complexity ranges from simple geometric shape recognition to intricate physical simulations and logical planning problems, including polygon recognition, pipe connection, grid navigation, maze solving, and sliding puzzles, each requiring a combination of perception, spatial reasoning, and logical operations. To ensure both data quality and massive scale, VBVR employs a distributed parametric generation pipeline: tasks are rigorously designed and reviewed, implemented via standardized generator templates, and then generated in parallel using cloud services such as AWS Lambda, with results stored in S3.
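A parametric generation pipeline of this kind can be sketched as follows. The generator template below is a hypothetical grid-navigation task (not the paper's actual template): each sample is fully determined by its seed, which makes generation reproducible and embarrassingly parallel. The paper fans generation out over AWS Lambda with results stored in S3; a local process pool stands in here:

```python
import random
from concurrent.futures import ProcessPoolExecutor  # local stand-in for Lambda fan-out

def generate_grid_navigation_sample(seed: int, size: int = 5) -> dict:
    """Hypothetical parametric generator: one grid-navigation sample per seed.

    Seeding the RNG makes every sample reproducible, so workers need no
    shared state and verification can regenerate the ground truth on demand.
    """
    rng = random.Random(seed)
    start = (rng.randrange(size), rng.randrange(size))
    goal = (rng.randrange(size), rng.randrange(size))
    # Ground truth: on an obstacle-free grid, the shortest path length
    # is the Manhattan distance between start and goal.
    shortest = abs(start[0] - goal[0]) + abs(start[1] - goal[1])
    return {"task": "grid_navigation", "seed": seed,
            "start": start, "goal": goal, "shortest_path_len": shortest}

def generate_batch(seeds):
    """Generate many samples in parallel (Lambda in the paper's pipeline)."""
    with ProcessPoolExecutor() as pool:
        return list(pool.map(generate_grid_navigation_sample, seeds))
```

Because each sample carries its seed, the evaluation side can recompute the expected answer instead of storing it, which is what makes a benchmark of this style rule-based and verifiable.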
VBVR-Bench serves as a rule-based, reproducible, and interpretable evaluation framework. Comprehensive evaluations of state-of-the-art video generation models, both open-source (e.g., CogVideoX) and closed-source (e.g., Sora 2, Veo 3.1), reveal that even the top-performing models fall far short of human performance. This stark gap underscores the difficulty models face on tasks demanding strict logical reasoning and physical consistency. The validity of VBVR-Bench is further reinforced by a large-scale human alignment analysis, which demonstrates a strong Pearson correlation between the benchmark's automatic evaluation scores and human preference scores, confirming that it accurately reflects true reasoning capabilities.
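Human-alignment analyses of this kind reduce to computing a Pearson correlation between per-model automatic scores and human preference scores. A minimal pure-Python sketch, with made-up scores purely for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (made-up) per-model scores, NOT the paper's numbers:
auto_scores = [0.12, 0.30, 0.45, 0.55, 0.70]   # VBVR-Bench automatic scores
human_scores = [0.10, 0.28, 0.50, 0.52, 0.75]  # human preference scores
r = pearson(auto_scores, human_scores)  # close to 1.0 indicates strong alignment
```

A correlation near 1.0 across models is what justifies substituting the cheap automatic metric for expensive human evaluation at scale.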
The study also investigates scaling laws and emergent capabilities. Fine-tuning the Wan2.2 model on VBVR-Dataset showed that progressively increasing the amount of training data led to consistent improvement across all metrics, highlighting the critical role of high-quality, large-scale reasoning data. Qualitatively, the VBVR-fine-tuned Wan2.2 model exhibited superior controllable execution compared to Sora 2 on tasks requiring precise manipulation (e.g., "delete a specific symbol," "precisely rotate an object"), where Sora 2 was more prone to errors or deformation. Furthermore, training revealed emergent behaviors, such as the model spontaneously adopting a "Self-chosen completion policy" and offering "Rationalizing" interpretations of scenes, suggesting that extensive reasoning training can unlock deeper cognitive capacities in models.
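Data-scaling trends like this are often summarized by fitting a power law, score ≈ a · size^b, via least squares in log-log space. A minimal sketch with hypothetical checkpoints (the sizes, scores, and resulting exponent are illustrative, not the paper's measurements):

```python
import math

def fit_power_law(sizes, scores):
    """Least-squares fit of score ~ a * size**b in log-log space.

    A positive exponent b summarizes 'performance improves with data scale'.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in scores]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Hypothetical (dataset size, benchmark score) checkpoints:
sizes = [1e4, 1e5, 1e6, 2e6]
scores = [0.08, 0.15, 0.28, 0.33]
a, b = fit_power_law(sizes, scores)  # b > 0: score rises with training data
```

Plotting the checkpoints on log-log axes and checking that the fit stays linear is a quick diagnostic for whether the scaling trend has begun to saturate.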
In conclusion, the VBVR suite provides the largest and most systematic video reasoning dataset and evaluation benchmark to date. It clearly identifies the limitations of current video generation models in complex logical reasoning scenarios and empirically validates that large-scale, high-quality reasoning data is crucial for advancing these models. This foundational work paves the way for the development of future general-purpose video agents endowed with robust physical-world commonsense and logical reasoning abilities.