
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Key Points
- PaddleOCR-VL presents a state-of-the-art and resource-efficient solution for multilingual document parsing, centered around its compact 0.9B Vision-Language Model (VLM), PaddleOCR-VL-0.9B.
- This system employs a two-stage architecture, utilizing PP-DocLayoutV2 for robust layout analysis and integrating a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model for accurate element-level recognition of text, tables, formulas, and charts.
- Evaluated on public benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level parsing and element recognition, supports 109 languages, and offers significant speed advantages, making it highly suitable for practical deployment.
PaddleOCR-VL is a state-of-the-art (SOTA) and resource-efficient solution for multilingual document parsing, designed to handle the growing complexity and volume of modern documents. It addresses the limitations of traditional pipeline-based methods (integration complexity, error propagation) and end-to-end multimodal models (computational overhead, instability, hallucinations for long sequences). The core of PaddleOCR-VL is a compact yet powerful 0.9B vision-language model (VLM) named PaddleOCR-VL-0.9B, enabling accurate recognition of diverse elements (text, tables, formulas, charts) across 109 languages with minimal resource consumption.
The system operates as a two-stage pipeline:
- Layout Analysis (PP-DocLayoutV2): This initial stage is dedicated to localizing semantic regions within a document page and predicting their reading order. This decoupling from the VLM addresses the instability and high computational overhead issues often found in end-to-end VLM approaches for layout analysis, especially in multi-column or mixed text-graphic layouts.
- Element-level Recognition (PaddleOCR-VL-0.9B): Following layout analysis, specific elements (text blocks, tables, formulas, charts) are segmented based on their predicted positions and fed into the PaddleOCR-VL-0.9B for fine-grained content recognition. A lightweight post-processing module then aggregates these outputs into structured Markdown and JSON formats.
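The two-stage flow above can be sketched in a few lines of Python. This is a minimal illustration, not the actual PaddleOCR-VL API: the function and class names (`layout_analysis`, `recognize`, `parse_page`, `Region`) are hypothetical stand-ins for PP-DocLayoutV2, PaddleOCR-VL-0.9B, and the post-processing module.

```python
from dataclasses import dataclass

@dataclass
class Region:
    bbox: tuple   # (x0, y0, x1, y1) in page coordinates
    label: str    # e.g. "text", "table", "formula", "chart"
    order: int    # predicted reading-order index

def layout_analysis(page_image):
    """Stage 1 (PP-DocLayoutV2): detect semantic regions and their reading order.
    A fixed fake result keeps this sketch self-contained."""
    return [
        Region((0, 0, 100, 40), "text", 0),
        Region((0, 50, 100, 90), "table", 1),
    ]

def recognize(region_image, label):
    """Stage 2 (PaddleOCR-VL-0.9B): fine-grained recognition of one element.
    Canned outputs stand in for real model inference."""
    return {"text": "Hello world", "table": "| a | b |"}[label]

def parse_page(page_image):
    """Post-processing: crop regions (omitted), recognize each one in
    reading order, and aggregate the outputs into a Markdown document."""
    regions = sorted(layout_analysis(page_image), key=lambda r: r.order)
    return "\n\n".join(recognize(page_image, r.label) for r in regions)
```

The key design point is the decoupling: the VLM never sees a full page, only pre-localized crops, which is what avoids the layout-induced instability of end-to-end approaches.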
Core Methodology and Architecture:
1. PP-DocLayoutV2 (Layout Analysis Model):
PP-DocLayoutV2 consists of two sequentially connected networks:
- Object Detection and Classification: An RT-DETR [17]-based model performs layout element detection and classification, outputting bounding boxes and class labels.
- Reading Order Prediction: The detected bounding boxes and class labels are passed to a lightweight pointer network [18] with six transformer layers.
- Input Embedding: Selected proposals (foreground elements based on per-class thresholds) are embedded using absolute 2D positional encodings and class label embeddings.
- Geometric Bias: The encoder attention incorporates a geometric bias mechanism from Relation-DETR [18] to explicitly model pairwise geometric relationships between elements.
- Pairwise Relation Head: This head linearly projects element representations into query and key vectors, computing bilinear similarities to produce pairwise logits, resulting in an N×N matrix (for N elements) representing the relative order between each pair of elements.
- Decoding: A deterministic win-accumulation decoding algorithm recovers a topologically consistent reading order.
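A minimal sketch of the last two steps, under stated assumptions: the paper does not publish the exact decoding rule, so `win_accumulation_order` below implements one plausible reading of "win accumulation" (count how many pairwise comparisons each element wins, then sort by descending win count); the weight matrices `Wq`, `Wk` are illustrative.

```python
import numpy as np

def pairwise_logits(feats, Wq, Wk):
    """Bilinear pairwise relation head: logits[i, j] scores
    'element i precedes element j'. feats has shape (N, D)."""
    q = feats @ Wq          # query projections
    k = feats @ Wk          # key projections
    return q @ k.T          # (N, N) pairwise logit matrix

def win_accumulation_order(logits):
    """Deterministic decoding: element i 'wins' against j when
    logits[i, j] > logits[j, i]; sorting by descending win count
    yields a consistent total reading order."""
    wins = (logits > logits.T).sum(axis=1)
    # Stable argsort makes ties deterministic.
    return [int(i) for i in np.argsort(-wins, kind="stable")]
```

When the pairwise preferences are transitive (the common case for document layouts), this recovers exactly the order implied by the logit matrix.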
2. PaddleOCR-VL-0.9B (Element-level Recognition Model):
PaddleOCR-VL-0.9B adopts an architectural style inspired by LLaVA [20], balancing the scale of vision and language models for multi-element recognition.
- Vision Encoder: A NaViT-style [15] encoder, initialized from Keye-VL's [22] vision model, supports native-resolution inputs and dynamic high-resolution preprocessing. This design allows handling images of arbitrary resolution without distortion, reducing hallucinations and improving performance on text-intensive tasks.
- Projector: A randomly initialized 2-layer MLP with GELU [23] activation and a merge size of 2 efficiently bridges visual features from the encoder to the language model's embedding space.
- Language Model: ERNIE-4.5-0.3B [5], a compact yet powerful open-source language model, is used for its inference efficiency. It further enhances positional representation by incorporating 3D-RoPE [24]. The small size of the language model contributes to faster decoding speeds.
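To make the projector concrete, here is a NumPy sketch of a 2-layer GELU MLP with merge size 2. The merging interpretation (concatenating each 2x2 block of neighbouring patch features before projection, cutting the visual sequence length by 4x) is the common convention in comparable VLMs and is assumed here; all dimensions and weights are illustrative.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(vision_feats, W1, b1, W2, b2, merge=2):
    """Merge each merge x merge block of patch features, then map them
    into the language model's embedding space with a 2-layer MLP.
    vision_feats: (H, W, D) patch grid from the vision encoder."""
    H, W, D = vision_feats.shape
    # (H, W, D) -> (H/2 * W/2, 4*D): group 2x2 neighbours, concat channels.
    x = vision_feats.reshape(H // merge, merge, W // merge, merge, D)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, merge * merge * D)
    return gelu(x @ W1 + b1) @ W2 + b2
```

The 4x reduction in visual tokens is what lets the compact 0.3B language model keep its context short and its decoding fast.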
Training Recipe:
PP-DocLayoutV2 Training:
A two-stage strategy is employed:
- RT-DETR Training: The RT-DETR model is initialized with PP-DocLayout_Plus-L [25] pretrained weights and trained for 100 epochs on a self-constructed dataset of over 20,000 high-quality samples for layout detection and classification.
- Pointer Network Training: After freezing the RT-DETR parameters, the pointer network is independently trained for 200 epochs. It uses Generalized Cross Entropy Loss [26] for robustness against noise in the mixed pre-annotated data, with a constant learning rate and the AdamW optimizer.
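The Generalized Cross Entropy loss used here (Zhang & Sabuncu, 2018) interpolates between standard cross-entropy and mean absolute error, which is what makes it robust to noisy pre-annotated labels. A minimal NumPy sketch, with the conventional q=0.7 assumed as the default:

```python
import numpy as np

def gce_loss(probs, labels, q=0.7):
    """Generalized Cross Entropy: L_q = (1 - p_y^q) / q.
    q -> 0 recovers cross-entropy (fast but noise-sensitive);
    q = 1 gives MAE (noise-robust but slow to converge);
    intermediate q trades off between the two."""
    p_y = probs[np.arange(len(labels)), labels]  # probability of true class
    return np.mean((1.0 - p_y ** q) / q)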
PaddleOCR-VL-0.9B Training:
A post-adaptation strategy is used with pre-trained vision (Keye-VL) and language (ERNIE-4.5-0.3B) models, divided into two stages based on the ERNIEKit [27] repository:
- Stage 1 (Pre-training Alignment):
- Objective: Learn to associate visual information with textual representations and align feature spaces.
- Data: 29 million high-quality image-text pairs.
- Settings: 1 epoch, batch size 128, sequence length 16384, a capped maximum input resolution, and data augmentation; the learning rate follows a schedule between a maximum and a minimum value.
- Stage 2 (Instruction Fine-tuning):
- Objective: Adapt the general multimodal understanding to specific downstream element recognition tasks.
- Data: 2.7 million meticulously curated, diverse samples.
- Settings: 2 epochs, batch size 128, sequence length 16384, and a capped maximum input resolution; a finer learning-rate schedule decays from its maximum to its minimum value.
- Tasks: The model is fine-tuned for four types of tasks through explicit instructions:
- OCR: Identify and extract text (characters, words, lines, blocks, simple page-level structures).
- Table Recognition: Parse tabular structures, extract cell contents, identify rows/columns, recognize logical relationships, generate structured representations in OTSL [28] format.
- Formula Recognition: Recognize and interpret mathematical/scientific formulas, convert visual representations to structured LaTeX format, distinguishing inline (\(...\)) and display (\[...\]) equations.
- Chart Recognition: Extract information from various chart types (bar, line, pie), convert to Markdown format tables.
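To make the four task formats concrete, the table below pairs each task with an illustrative instruction and output. These examples are hypothetical: the exact instruction wording used in training is not published in this summary, and the OTSL token names shown are illustrative (see the OTSL paper [28] for the exact vocabulary).

```python
# Hypothetical instruction/output pairs for the four fine-tuning tasks.
TASK_EXAMPLES = {
    "ocr": {
        "instruction": "OCR:",
        "output": "Quarterly Report 2024",
    },
    "table": {
        "instruction": "Table Recognition:",
        # OTSL-style cell/row tokens instead of verbose HTML tags;
        # token names here are illustrative.
        "output": "<fcel>Year<fcel>Revenue<nl><fcel>2024<fcel>1.2M<nl>",
    },
    "formula": {
        "instruction": "Formula Recognition:",
        "output": r"\[ E = mc^2 \]",  # display equation; inline uses \( ... \)
    },
    "chart": {
        "instruction": "Chart Recognition:",
        "output": "| Month | Sales |\n| --- | --- |\n| Jan | 10 |",
    },
}
```

Routing all four element types through one compact model with explicit instructions is what lets a single 0.9B VLM replace several task-specific recognizers.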
Dataset Construction:
A systematic methodology ensures a high-quality and diverse training dataset:
- Data Curation: Data is collected from four sources:
- Open Source Dataset: Aggregates established public datasets (e.g., CASIA-HWDB [29] for text, UniMER-1M [30], MathWriting [31] for math, ChartQA [32], PlotQA [33] for charts).
- Data Synthesizing Dataset: Generates large volumes of missing data types to address natural imbalances.
- Network Accessible Dataset: Amasses public data from the internet (academic papers, newspapers, scanned documents, etc.) for generalization and robustness.
- In-house Dataset: Incorporates extensive proprietary datasets from years of OCR research.
- Automatic Data Annotation: Utilizes the expert model PP-StructureV3 for preliminary pseudo-labeling. These pseudo-labels, along with the original images, are fed via prompt engineering to advanced multimodal large language models (ERNIE-4.5-VL [5], Qwen2.5-VL [24]) for refinement. A hallucination filtering step ensures label quality.
- Hard Cases Mining: An evaluation engine categorizes elements (23 text, 20 table, 4 formula, 11 chart categories) and uses professional metrics (e.g., EditDist for Text, TEDS [41] for Tables, RMS-F1 [42] for Charts, BLEU [43] for Formulas) to identify poor model performance. For identified weaknesses, synthetic challenging examples are generated using resources like Font Library, CSS Library, Corpus, and rendering tools (XeLaTeX, web browsers). Manual annotation is applied for a small number of corner cases.
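The text metric driving hard-case mining, edit distance, is straightforward to implement. A minimal sketch (the length-normalized variant shown is a common convention and is assumed here, not spelled out in the source):

```python
def edit_distance(a, b):
    """Levenshtein distance via single-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def normalized_edit_dist(pred, gt):
    """Length-normalized edit distance; lower is better, 0 = exact match."""
    return edit_distance(pred, gt) / max(len(pred), len(gt), 1)
```

Elements whose normalized score exceeds a per-category threshold are the ones flagged as weaknesses and targeted with synthetic challenging examples.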
Evaluation:
PaddleOCR-VL is comprehensively evaluated on public benchmarks (OmniDocBench v1.0, v1.5 [16], olmOCR-Bench [12]) and in-house benchmarks. It achieves SOTA performance in both page-level document parsing and element-level recognition, significantly outperforming existing pipeline solutions and demonstrating strong competitiveness against top-tier VLMs. On OmniDocBench v1.5, PaddleOCR-VL achieves an overall score of 92.86, outperforming MinerU2.5-1.2B (90.67). It also sets new SOTA results in sub-tasks with a Text-Edit distance of 0.035, Formula-CDM of 91.22, Table-TEDS of 90.89, Table-TEDS-S of 94.76, and Reading Order-Edit of 0.043, highlighting its superior accuracy and efficient inference speeds.