
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Key Points
- PaddleOCR-VL presents a state-of-the-art and resource-efficient solution for multilingual document parsing, centered around its compact 0.9B Vision-Language Model (VLM), PaddleOCR-VL-0.9B.
- This system employs a two-stage architecture, utilizing PP-DocLayoutV2 for robust layout analysis and integrating a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model for accurate element-level recognition of text, tables, formulas, and charts.
- Evaluated on public benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level parsing and element recognition, supports 109 languages, and offers significant speed advantages, making it highly suitable for practical deployment.
PaddleOCR-VL is a state-of-the-art (SOTA) and resource-efficient solution for multilingual document parsing, designed to handle the growing complexity and volume of modern documents. It addresses the limitations of traditional pipeline-based methods (integration complexity, error propagation) and end-to-end multimodal models (computational overhead, instability, hallucinations for long sequences). The core of PaddleOCR-VL is a compact yet powerful 0.9B vision-language model (VLM) named PaddleOCR-VL-0.9B, enabling accurate recognition of diverse elements (text, tables, formulas, charts) across 109 languages with minimal resource consumption.
The system operates as a two-stage pipeline:
- Layout Analysis (PP-DocLayoutV2): This initial stage is dedicated to localizing semantic regions within a document page and predicting their reading order. This decoupling from the VLM addresses the instability and high computational overhead issues often found in end-to-end VLM approaches for layout analysis, especially in multi-column or mixed text-graphic layouts.
- Element-level Recognition (PaddleOCR-VL-0.9B): Following layout analysis, specific elements (text blocks, tables, formulas, charts) are segmented based on their predicted positions and fed into the PaddleOCR-VL-0.9B for fine-grained content recognition. A lightweight post-processing module then aggregates these outputs into structured Markdown and JSON formats.
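The two-stage flow above can be sketched in a few lines of Python. This is a minimal illustration, not the actual PaddleOCR-VL API: the function and class names (`layout_analysis`, `recognize`, `parse_page`, `Region`) are hypothetical stand-ins for PP-DocLayoutV2, PaddleOCR-VL-0.9B, and the post-processing module.

```python
from dataclasses import dataclass

@dataclass
class Region:
    bbox: tuple   # (x0, y0, x1, y1) in page coordinates
    label: str    # e.g. "text", "table", "formula", "chart"
    order: int    # predicted reading-order index

def layout_analysis(page_image):
    """Stage 1 (PP-DocLayoutV2): detect semantic regions and their reading order.
    A fixed fake result keeps this sketch self-contained."""
    return [
        Region((0, 0, 100, 40), "text", 0),
        Region((0, 50, 100, 90), "table", 1),
    ]

def recognize(region_image, label):
    """Stage 2 (PaddleOCR-VL-0.9B): fine-grained recognition of one element.
    Canned outputs stand in for real model inference."""
    return {"text": "Hello world", "table": "| a | b |"}[label]

def parse_page(page_image):
    """Post-processing: crop regions (omitted), recognize each one in
    reading order, and aggregate the outputs into a Markdown document."""
    regions = sorted(layout_analysis(page_image), key=lambda r: r.order)
    return "\n\n".join(recognize(page_image, r.label) for r in regions)
```

The key design point is the decoupling: the VLM never sees a full page, only pre-localized crops, which is what avoids the layout-induced instability of end-to-end approaches.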
Core Methodology and Architecture:
1. PP-DocLayoutV2 (Layout Analysis Model):
PP-DocLayoutV2 consists of two sequentially connected networks:
- Object Detection and Classification: An RT-DETR [17]-based model performs layout element detection and classification, outputting bounding boxes and class labels.
- Reading Order Prediction: The detected bounding boxes and class labels are passed to a lightweight pointer network [18] with six transformer layers.
- Input Embedding: Selected proposals (foreground elements based on per-class thresholds) are embedded using absolute 2D positional encodings and class label embeddings.
- Geometric Bias: The encoder attention incorporates a geometric bias mechanism from Relation-DETR [18] to explicitly model pairwise geometric relationships between elements.
- Pairwise Relation Head: This head linearly projects element representations into query and key vectors, computing bilinear similarities to produce pairwise logits, resulting in an N×N matrix (for N elements) representing the relative order between each pair of elements.
- Decoding: A deterministic win-accumulation decoding algorithm recovers a topologically consistent reading order.
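A minimal sketch of the last two steps, under stated assumptions: the paper does not publish the exact decoding rule, so `win_accumulation_order` below implements one plausible reading of "win accumulation" (count how many pairwise comparisons each element wins, then sort by descending win count); the weight matrices `Wq`, `Wk` are illustrative.

```python
import numpy as np

def pairwise_logits(feats, Wq, Wk):
    """Bilinear pairwise relation head: logits[i, j] scores
    'element i precedes element j'. feats has shape (N, D)."""
    q = feats @ Wq          # query projections
    k = feats @ Wk          # key projections
    return q @ k.T          # (N, N) pairwise logit matrix

def win_accumulation_order(logits):
    """Deterministic decoding: element i 'wins' against j when
    logits[i, j] > logits[j, i]; sorting by descending win count
    yields a consistent total reading order."""
    wins = (logits > logits.T).sum(axis=1)
    # Stable argsort makes ties deterministic.
    return [int(i) for i in np.argsort(-wins, kind="stable")]
```

When the pairwise preferences are transitive (the common case for document layouts), this recovers exactly the order implied by the logit matrix.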
2. PaddleOCR-VL-0.9B (Element-level Recognition Model):
PaddleOCR-VL-0.9B adopts an architectural style inspired by LLaVA [20], balancing the scale of vision and language models for multi-element recognition.
- Vision Encoder: A NaViT-style [15] encoder, initialized from Keye-VL's [22] vision model, supports native-resolution inputs and dynamic high-resolution preprocessing. This design allows handling images of arbitrary resolution without distortion, reducing hallucinations and improving performance on text-intensive tasks.
- Projector: A randomly initialized 2-layer MLP with GELU [23] activation and a merge size of 2 efficiently bridges visual features from the encoder to the language model's embedding space.
- Language Model: ERNIE-4.5-0.3B [5], a compact yet powerful open-source language model, is used for its inference efficiency. It further enhances positional representation by incorporating 3D-RoPE [24]. The small size of the language model contributes to faster decoding speeds.
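To make the projector concrete, here is a NumPy sketch of a 2-layer GELU MLP with merge size 2. The merging interpretation (concatenating each 2x2 block of neighbouring patch features before projection, cutting the visual sequence length by 4x) is the common convention in comparable VLMs and is assumed here; all dimensions and weights are illustrative.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(vision_feats, W1, b1, W2, b2, merge=2):
    """Merge each merge x merge block of patch features, then map them
    into the language model's embedding space with a 2-layer MLP.
    vision_feats: (H, W, D) patch grid from the vision encoder."""
    H, W, D = vision_feats.shape
    # (H, W, D) -> (H/2 * W/2, 4*D): group 2x2 neighbours, concat channels.
    x = vision_feats.reshape(H // merge, merge, W // merge, merge, D)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, merge * merge * D)
    return gelu(x @ W1 + b1) @ W2 + b2
```

The 4x reduction in visual tokens is what lets the compact 0.3B language model keep its context short and its decoding fast.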
Training Recipe:
PP-DocLayoutV2 Training:
A two-stage strategy is employed:
- RT-DETR Training: The RT-DETR model is initialized with PP-DocLayout_Plus-L [25] pretrained weights and trained for 100 epochs on a self-constructed dataset of over 20,000 high-quality samples for layout detection and classification.
- Pointer Network Training: After freezing the RT-DETR parameters, the pointer network is independently trained for 200 epochs. It uses Generalized Cross Entropy Loss [26] for robustness against noise in the mixed pre-annotated data, with a constant learning rate and the AdamW optimizer.
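The Generalized Cross Entropy loss used here (Zhang & Sabuncu, 2018) interpolates between standard cross-entropy and mean absolute error, which is what makes it robust to noisy pre-annotated labels. A minimal NumPy sketch, with the conventional q=0.7 assumed as the default:

```python
import numpy as np

def gce_loss(probs, labels, q=0.7):
    """Generalized Cross Entropy: L_q = (1 - p_y^q) / q.
    q -> 0 recovers cross-entropy (fast but noise-sensitive);
    q = 1 gives MAE (noise-robust but slow to converge);
    intermediate q trades off between the two."""
    p_y = probs[np.arange(len(labels)), labels]  # probability of true class
    return np.mean((1.0 - p_y ** q) / q)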
PaddleOCR-VL-0.9B Training:
A post-adaptation strategy is used with pre-trained vision (Keye-VL) and language (ERNIE-4.5-0.3B) models, divided into two stages based on the ERNIEKit [27] repository:
- Stage 1 (Pre-training Alignment):
- Objective: Learn to associate visual information with textual representations and align feature spaces.
- Data: 29 million high-quality image-text pairs.
- Settings: 1 epoch, batch size 128, sequence length 16384, a capped maximum input resolution, and data augmentation; the learning rate follows a schedule between a maximum and a minimum value.
- Stage 2 (Instruction Fine-tuning):
- Objective: Adapt the general multimodal understanding to specific downstream element recognition tasks.
- Data: 2.7 million meticulously curated, diverse samples.
- Settings: 2 epochs, batch size 128, sequence length 16384, and a capped maximum input resolution; a finer learning-rate schedule decays from its maximum to its minimum value.
- Tasks: The model is fine-tuned for four types of tasks through explicit instructions:
- OCR: Identify and extract text (characters, words, lines, blocks, simple page-level structures).
- Table Recognition: Parse tabular structures, extract cell contents, identify rows/columns, recognize logical relationships, generate structured representations in OTSL [28] format.
- Formula Recognition: Recognize and interpret mathematical/scientific formulas, convert visual representations to structured LaTeX format, distinguishing inline (\(...\)) and display (\[...\]) equations.
- Chart Recognition: Extract information from various chart types (bar, line, pie), convert to Markdown format tables.
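To make the four task formats concrete, the table below pairs each task with an illustrative instruction and output. These examples are hypothetical: the exact instruction wording used in training is not published in this summary, and the OTSL token names shown are illustrative (see the OTSL paper [28] for the exact vocabulary).

```python
# Hypothetical instruction/output pairs for the four fine-tuning tasks.
TASK_EXAMPLES = {
    "ocr": {
        "instruction": "OCR:",
        "output": "Quarterly Report 2024",
    },
    "table": {
        "instruction": "Table Recognition:",
        # OTSL-style cell/row tokens instead of verbose HTML tags;
        # token names here are illustrative.
        "output": "<fcel>Year<fcel>Revenue<nl><fcel>2024<fcel>1.2M<nl>",
    },
    "formula": {
        "instruction": "Formula Recognition:",
        "output": r"\[ E = mc^2 \]",  # display equation; inline uses \( ... \)
    },
    "chart": {
        "instruction": "Chart Recognition:",
        "output": "| Month | Sales |\n| --- | --- |\n| Jan | 10 |",
    },
}
```

Routing all four element types through one compact model with explicit instructions is what lets a single 0.9B VLM replace several task-specific recognizers.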
Dataset Construction:
A systematic methodology ensures a high-quality and diverse training dataset:
- Data Curation: Data is collected from four sources:
- Open Source Dataset: Aggregates established public datasets (e.g., CASIA-HWDB [29] for text, UniMER-1M [30], MathWriting [31] for math, ChartQA [32], PlotQA [33] for charts).
- Data Synthesizing Dataset: Generates large volumes of missing data types to address natural imbalances.
- Network Accessible Dataset: Amasses public data from the internet (academic papers, newspapers, scanned documents, etc.) for generalization and robustness.
- In-house Dataset: Incorporates extensive proprietary datasets from years of OCR research.
- Automatic Data Annotation: Utilizes the expert model PP-StructureV3 for preliminary pseudo-labeling. These pseudo-labels, along with the original images, are fed via prompt engineering to advanced multimodal large language models (ERNIE-4.5-VL [5], Qwen2.5-VL [24]) for refinement. A hallucination filtering step ensures label quality.
- Hard Cases Mining: An evaluation engine categorizes elements (23 text, 20 table, 4 formula, 11 chart categories) and uses professional metrics (e.g., EditDist for Text, TEDS [41] for Tables, RMS-F1 [42] for Charts, BLEU [43] for Formulas) to identify poor model performance. For identified weaknesses, synthetic challenging examples are generated using resources like Font Library, CSS Library, Corpus, and rendering tools (XeLaTeX, web browsers). Manual annotation is applied for a small number of corner cases.
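The text metric driving hard-case mining, edit distance, is straightforward to implement. A minimal sketch (the length-normalized variant shown is a common convention and is assumed here, not spelled out in the source):

```python
def edit_distance(a, b):
    """Levenshtein distance via single-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def normalized_edit_dist(pred, gt):
    """Length-normalized edit distance; lower is better, 0 = exact match."""
    return edit_distance(pred, gt) / max(len(pred), len(gt), 1)
```

Elements whose normalized score exceeds a per-category threshold are the ones flagged as weaknesses and targeted with synthetic challenging examples.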
Evaluation:
PaddleOCR-VL is comprehensively evaluated on public benchmarks (OmniDocBench v1.0, v1.5 [16], olmOCR-Bench [12]) and in-house benchmarks. It achieves SOTA performance in both page-level document parsing and element-level recognition, significantly outperforming existing pipeline solutions and demonstrating strong competitiveness against top-tier VLMs. On OmniDocBench v1.5, PaddleOCR-VL achieves an overall score of 92.86, outperforming MinerU2.5-1.2B (90.67). It also sets new SOTA results in sub-tasks with a Text-Edit distance of 0.035, Formula-CDM of 91.22, Table-TEDS of 90.89, Table-TEDS-S of 94.76, and Reading Order-Edit of 0.043, highlighting its superior accuracy and efficient inference speeds.