Digitizing documents has long been a multi-step problem: first detect the layout, then extract the text, and finally attempt to reconstruct the structure. For Large Vision-Language Models (LVLMs), this often leads to "structural hallucinations" such as jumbled lines, fabricated formulas, or unclosed syntax.
FireRedTeam has released FireRed-OCR-2B, a flagship model designed to treat document parsing as a structural engineering task rather than "impressionistic" text generation. Built on the Qwen3-VL-2B-Instruct architecture, the model sets a new end-to-end state of the art (SOTA), scoring 92.94% overall on the OmniDocBench v1.5 benchmark.
Paradigm Shift: Structural Engineering vs. Text Generation
Developers often find that even the most powerful general-purpose VLMs struggle with the dense spatial logic of technical PDFs. When a model "sees" complex tables or multi-line LaTeX equations, it often fails to preserve the hierarchical relationships between elements.
FireRed-OCR-2B addresses this through a specialized progressive training pipeline consisting of three distinct stages:
1. Multitask pretraining: This stage establishes the spatial foundation by training the model on tasks ranging from detection and region recognition to layout analysis and Markdown generation.
2. Specialized SFT (Supervised Fine-Tuning): The model is fine-tuned on high-quality, standardized Markdown datasets to ensure logical consistency and hierarchical representation.
3. Format-constrained GRPO: The final stage uses reinforcement learning to enforce syntactic validity.
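The staged curriculum above can be sketched as a simple configuration. Note that this is an illustrative outline only: the stage names, task lists, and objectives are paraphrased from the description above, and none of the hyperparameters or training internals are from the actual release.

```python
# Hypothetical sketch of the three-stage training curriculum described above.
# Stage contents are paraphrased from the article; the training loop is elided.
TRAINING_STAGES = [
    {
        "name": "multitask_pretraining",
        "tasks": ["detection", "region_recognition", "layout", "markdown"],
        "objective": "next-token cross-entropy",
    },
    {
        "name": "specialized_sft",
        "tasks": ["markdown_reconstruction"],
        "objective": "next-token cross-entropy on curated Markdown",
    },
    {
        "name": "format_constrained_grpo",
        "tasks": ["end_to_end_parsing"],
        "objective": "group-relative policy optimization with format rewards",
    },
]

def run_pipeline(model, stages=TRAINING_STAGES):
    """Run each stage in order; the per-stage training step is a stub."""
    for stage in stages:
        print(f"Stage {stage['name']}: tasks={stage['tasks']}")
        # train(model, stage)  # actual training loop not public
    return model
```

The point of the progression is that each stage narrows the objective: broad spatial grounding first, then clean structured output, then hard syntactic guarantees.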
Core Innovation: Format-Constrained GRPO
FireRed-OCR's most important technical differentiator is its use of Group Relative Policy Optimization (GRPO) with format constraints. While traditional fine-tuning focuses on character accuracy, GRPO introduces a reinforcement learning loop that rewards the model for specific structural properties:
- Mathematical syntax: ensures that LaTeX equations are mathematically valid.
- Table integrity: maintains consistent row/column counts and correct HTML/Markdown tagging.
- Hierarchical closure: ensures that all open structural tags (such as lists and headers) are properly closed.
- Text accuracy: reduces character-level errors in dense text blocks.
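A format-constraint reward of this kind can be approximated with cheap syntactic checks. The sketch below is an assumption about how such a reward might look, not FireRedTeam's actual reward function; it scores an output by how many structural checks pass (balanced LaTeX braces, paired math delimiters, closed HTML table tags).

```python
import re

def balanced(text: str, open_ch: str, close_ch: str) -> bool:
    """True if open/close delimiters nest correctly (e.g. LaTeX braces)."""
    depth = 0
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth < 0:          # close before open
                return False
    return depth == 0

def tags_closed(text: str, tag: str) -> bool:
    """True if every opening <tag> has a matching closing </tag>."""
    return len(re.findall(rf"<{tag}\b", text)) == len(re.findall(rf"</{tag}>", text))

def format_reward(output: str) -> float:
    """Illustrative format reward: fraction of structural checks that pass."""
    checks = [
        balanced(output, "{", "}"),     # LaTeX braces close
        output.count("$") % 2 == 0,     # inline-math delimiters pair up
        tags_closed(output, "table"),   # table markup is closed
        tags_closed(output, "tr"),
        tags_closed(output, "td"),
    ]
    return sum(checks) / len(checks)
```

In GRPO, a reward like this is computed per sampled completion within a group, and completions are pushed toward the group's better-scoring members, so structurally valid outputs are reinforced without training a separate value network.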
By eliminating the need for a separate critic model, a key benefit of the GRPO algorithm, FireRedTeam has optimized the training process to focus specifically on the high-friction areas of document parsing.
Solving Long-Tail Layout Issues
The "long tail" of document layouts (such as non-standard legal formats, academic papers with overlapping figures, or handwritten annotations) is where most OCR pipelines break. FireRed-OCR relies on a "geometry + semantics" data factory.
This approach uses geometric feature clustering and multidimensional tagging to synthesize a balanced dataset. By combining geometric awareness with semantic understanding, the model maintains "in-the-wild robustness" and outperforms traditional pipeline systems such as PaddleOCR on complex and non-standard layouts (as benchmarked on the FireRedBench dataset).
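One way such geometric balancing could work is to bucket pages by a coarse layout signature and sample evenly across buckets, so rare layouts are not drowned out by common ones. The details of FireRedTeam's data factory are not public; the feature choice and bucketing below are purely illustrative assumptions.

```python
import random
from collections import defaultdict

def layout_key(boxes):
    """Coarse geometric signature of a page from its element bounding boxes
    (x0, y0, x1, y1 in normalized page coordinates): an element-count bucket
    plus an ink-coverage bucket."""
    coverage = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)
    return (min(len(boxes) // 10, 5), round(min(coverage, 1.0), 1))

def balance(pages, per_bin=2, seed=0):
    """Cluster pages by layout signature, then cap each cluster's contribution
    so long-tail layouts are represented alongside common ones."""
    bins = defaultdict(list)
    for page in pages:
        bins[layout_key(page["boxes"])].append(page)
    rng = random.Random(seed)
    sample = []
    for group in bins.values():
        rng.shuffle(group)
        sample.extend(group[:per_bin])
    return sample
```

A real data factory would cluster on richer features (column structure, reading order, element types) and combine them with semantic tags, but the balancing principle is the same: sample by layout cluster, not by raw frequency.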
Performance Benchmarks
In a direct comparison on OmniDocBench v1.5, FireRed-OCR-2B (92.94%) significantly outperforms other end-to-end models, including:
- DeepSeek-OCR 2: 91.09%
- Gemini-3.0 Pro: 90.33%
- Qwen3-VL-235B: 89.15%
Although some "pipeline" solutions (which use separate models for detection and recognition) achieve slightly higher scores, FireRed-OCR-2B delivers the best performance as a single-model, end-to-end approach. This is especially important for developers looking to reduce system complexity and inference latency in production Retrieval-Augmented Generation (RAG) environments.
Important Points
We have summarized the technical significance and performance metrics of the FireRed-OCR-2B release into five key takeaways for AI engineers and data scientists.
Five Key Points: FireRed-OCR-2B
1. New end-to-end SOTA performance: FireRed-OCR-2B achieves a state-of-the-art (SOTA) score of 92.94% on the OmniDocBench v1.5 benchmark, making it the leading single-model solution for document parsing and surpassing significantly larger models such as Qwen3-VL-235B and Gemini-3.0 Pro in structural accuracy.
2. Architectural foundation: Built on Qwen3-VL-2B-Instruct, the model takes the Vision-Language Model (VLM) approach, replacing traditional multi-stage pipelines (separate detection, cropping, and OCR steps) with a unified end-to-end transformer architecture that directly outputs structured Markdown.
3. Structural integrity via GRPO: The key technical differentiator is the use of Group Relative Policy Optimization (GRPO) with format constraints. This reinforcement learning technique rewards outputs that maintain syntactic validity; specifically, it ensures that LaTeX formulas, table tags, and Markdown hierarchies are logically closed and mathematically consistent.
4. "Geometry + semantics" data factory: To solve complex real-world layout problems, FireRedTeam developed a specialized data engine. This "factory" synthesizes datasets by balancing geometric layout features with semantic content, allowing the model to handle overlapping figures, multi-column academic papers, and non-standard forms more reliably than previous iterations.
5. Production relevance: As a single end-to-end model, FireRed-OCR-2B reduces system complexity and inference latency relative to multi-stage pipeline solutions, which makes it well suited to production Retrieval-Augmented Generation (RAG) deployments.


