DeepSeek AI has announced the release of DeepSeek-OCR 2, an open-source system for optical character recognition (OCR) and document comprehension. This iteration introduces a re-engineered vision encoder that processes pages sequentially, mirroring how humans scan complex documents.
At the core of this advancement is DeepEncoder-V2, a transformer architecture inspired by large language models. This component transforms a two-dimensional page representation into a one-dimensional sequence of visual tokens. Crucially, this sequence adheres to a learned reading flow even before text decoding commences, setting it apart from conventional approaches.
Rethinking Document Scan Order
Traditional multimodal models often process images by flattening them into a fixed, top-left to bottom-right raster sequence. This method, combined with static positional encodings, struggles with the complexities of multi-column layouts, nested tables, and mixed language regions common in many documents. Human readers, in contrast, navigate documents following a semantic order, intelligently jumping between distinct content areas.
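As a rough illustration of that raster-order flattening, here is a minimal NumPy sketch; the grid size and embedding width are toy values, not the parameters of any model discussed here:

```python
import numpy as np

# A page split into a grid of patch embeddings: (rows, cols, dim).
rows, cols, dim = 4, 3, 8          # toy values for illustration
patch_grid = np.random.randn(rows, cols, dim)

# Conventional raster flattening: row-major, top-left to bottom-right.
sequence = patch_grid.reshape(rows * cols, dim)

# Static positional indices tied to the raster order, regardless of layout.
positions = np.arange(rows * cols)
print(positions)  # [0 1 2 ... 11] -- the same fixed order for every page
```

A multi-column page flattened this way interleaves unrelated columns in the token sequence, which is exactly the failure mode described above.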
DeepSeek-OCR 2 retains the successful encoder-decoder architecture of its predecessor but replaces the original CLIP ViT-based visual encoder with the new DeepEncoder-V2. The system continues to utilize DeepSeek-3B-A500M as its decoder, a Mixture-of-Experts (MoE) language model featuring approximately 3 billion total parameters, with around 500 million active parameters per token. The primary objective is to empower the encoder to perform causal reasoning over visual information, presenting the decoder with a sequence already optimized for a probable reading order.
Advanced Vision Tokenization and Budget Management
The vision tokenizer, inherited from the initial DeepSeek-OCR, employs an 80-million parameter SAM base backbone followed by two convolutional layers. This stage efficiently downsamples the image, reducing the visual token count by a factor of 16 and compressing features into an 896-dimensional embedding.
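A minimal PyTorch sketch of how a 16× token reduction into 896-dimensional embeddings could be wired up; the kernel sizes, strides, and 256-channel backbone output are assumptions for illustration, not the published architecture:

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Two stride-2 convolutions: 4x fewer positions per spatial axis,
    i.e. 16x fewer visual tokens overall, projected to 896 dimensions."""
    def __init__(self, in_dim=256, out_dim=896):   # in_dim assumed, not confirmed
        super().__init__()
        self.conv1 = nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=2, padding=1)

    def forward(self, feats):                  # feats: (B, C, H, W) from the backbone
        x = self.conv2(self.conv1(feats))      # (B, 896, H/4, W/4)
        return x.flatten(2).transpose(1, 2)    # (B, H*W/16, 896) visual tokens

# 1024x1024 input with 16x16 patches -> 64x64 feature map -> 256 tokens
feats = torch.randn(1, 256, 64, 64)
print(TokenCompressor()(feats).shape)          # torch.Size([1, 256, 896])
```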
To manage dense pages without an overwhelming token count, DeepSeek-OCR 2 implements a multi-crop strategy combining global and local views. A global view at 1024 × 1024 resolution yields 256 tokens, and up to six local crops at 768 × 768 resolution contribute 144 tokens each. The total visual token count per page therefore ranges from 256 to 1120. This upper limit is slightly tighter than the 1156-token budget of the original DeepSeek-OCR's 'Gundam' mode and remains comparable to the token allocation used by Gemini-3 Pro on OmniDocBench.
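The stated budget can be reproduced with a short helper, assuming 16-pixel patches feeding a 16× compression stage (both assumptions rather than confirmed details):

```python
def visual_tokens(side: int, patch: int = 16, reduction: int = 16) -> int:
    """Tokens for a square view: (side/patch)^2 patches, then a 16x reduction."""
    return (side // patch) ** 2 // reduction

def page_budget(n_local_crops: int) -> int:
    """Global 1024x1024 view plus up to six 768x768 local crops."""
    assert 0 <= n_local_crops <= 6
    return visual_tokens(1024) + n_local_crops * visual_tokens(768)

print(page_budget(0))  # 256  -- global view only
print(page_budget(6))  # 1120 -- the per-page maximum
```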
DeepEncoder-V2: A Language Model as Vision Encoder
DeepEncoder-V2 is constructed using a Qwen2-0.5B style transformer, repurposed as the vision encoder. The input sequence is formed by combining all visual tokens from the tokenizer as a prefix, followed by a set of learnable 'causal flow tokens' as a suffix. The number of causal flow tokens mirrors the number of visual tokens.
The attention mechanism within DeepEncoder-V2 is asymmetric. Visual tokens interact bidirectionally, attending to all other visual tokens. Causal flow tokens, however, utilize causal attention, allowing them to observe all visual tokens and only preceding causal flow tokens. Only the outputs generated at the causal flow positions are then transmitted to the decoder. This architecture enables the encoder to learn a transformation from a 2D grid of visual tokens into a 1D causal sequence of flow tokens, encapsulating a proposed reading order and contextual information. This design effectively separates the problem into two distinct stages: DeepEncoder-V2 handles causal reasoning regarding visual structure and reading order, while DeepSeek-3B-A500M then performs causal decoding of text, conditioned on this pre-ordered visual input.
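A small NumPy sketch of that asymmetric mask, written from the description above rather than taken from the release:

```python
import numpy as np

def deepencoder_v2_mask(n: int) -> np.ndarray:
    """Boolean attention mask for n visual tokens followed by n causal flow tokens.
    allowed[i, j] is True if position i may attend to position j."""
    total = 2 * n
    allowed = np.zeros((total, total), dtype=bool)
    allowed[:n, :n] = True                                   # visual -> visual: bidirectional
    allowed[n:, :n] = True                                   # flow -> visual: full access
    allowed[n:, n:] = np.tril(np.ones((n, n), dtype=bool))   # flow -> flow: causal
    return allowed                                           # visual -> flow stays disallowed

mask = deepencoder_v2_mask(3)
print(mask.astype(int))
# Only the outputs at the flow positions (rows 3-5 here) are passed to the decoder.
```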
Rigorous Training Methodology
The training data pipeline for DeepSeek-OCR 2 mirrors that of its predecessor, emphasizing OCR-intensive content, with 80% of the data mixture comprising OCR-specific examples. The research team meticulously rebalanced sampling across text, formulas, and tables using a 3:1:1 ratio to ensure ample exposure to structurally complex examples.
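If the 3:1:1 ratio is read as applying within the 80% OCR share (an interpretation, not a confirmed detail), the resulting sampling weights would be:

```python
# Illustrative arithmetic only; the exact mixture definition is not public.
ocr_share = 0.8                      # OCR-specific examples in the overall mixture
ratio = {"text": 3, "formula": 1, "table": 1}
total = sum(ratio.values())

weights = {k: ocr_share * v / total for k, v in ratio.items()}
weights["other"] = 1.0 - ocr_share
print(weights)  # {'text': 0.48, 'formula': 0.16, 'table': 0.16, 'other': 0.2}
```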
Training proceeds in three structured stages:
- Stage 1: Encoder Pretraining – DeepEncoder-V2 is coupled with a compact decoder and trained using a standard language modeling objective. This stage incorporates multi-scale sampling at 768 × 768 and 1024 × 1024 resolutions. The vision tokenizer is initialized from the original DeepEncoder, while the LLM-style encoder starts from the Qwen2-0.5B base. Approximately 160 A100 GPUs are used for this stage, handling an 8k sequence length with packing and a diverse collection of document image-text samples.
- Stage 2: Query Enhancement – DeepEncoder-V2 is integrated with DeepSeek-3B-A500M, and multi-crop views are introduced. The tokenizer remains fixed, while both the encoder and decoder undergo joint training. This stage leverages 4-stage pipeline parallelism and 40 data parallel replicas, with a global batch size of 1280.
- Stage 3: Decoder Fine-tuning – All encoder parameters are frozen, and only the DeepSeek decoder is trained to better adapt to the newly ordered visual tokens. This stage uses the same batch size but a shorter schedule and a lower learning rate; freezing the encoder significantly boosts training throughput during this phase (a minimal sketch of the freezing step follows this list).
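A minimal PyTorch-style sketch of the Stage 3 freezing step, using placeholder module names and a placeholder learning rate rather than the project's actual training code:

```python
import torch

def configure_stage3(encoder: torch.nn.Module, decoder: torch.nn.Module,
                     lr: float = 1e-5) -> torch.optim.Optimizer:
    """Freeze all encoder parameters and optimize only the decoder."""
    for p in encoder.parameters():
        p.requires_grad_(False)
    encoder.eval()  # frozen encoder runs in inference mode
    trainable = (p for p in decoder.parameters() if p.requires_grad)
    return torch.optim.AdamW(trainable, lr=lr)
```

Because no gradients flow through the frozen encoder, the backward pass becomes much cheaper, which is the likely source of the throughput gain in this phase.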
Benchmark Performance on OmniDocBench
Evaluation of DeepSeek-OCR 2 primarily took place on OmniDocBench-v1.5, a comprehensive benchmark featuring 1355 pages across nine document categories in both Chinese and English. These categories include books, academic papers, forms, presentations, and newspapers, each meticulously annotated with layout elements like text spans, equations, tables, and figures.
DeepSeek-OCR 2 achieved an overall OmniDocBench score of 91.09 while operating with a maximum visual token count of 1120. This is a gain of 3.73 points over the original DeepSeek-OCR baseline, which scored 87.36 with a slightly larger token maximum of 1156.
Key metrics further highlight the advancements: Reading order (R-order) Edit Distance, which quantifies the discrepancy between predicted and ground truth reading sequences, decreased from 0.085 to 0.057. Text edit distance also saw a reduction from 0.073 to 0.048. Furthermore, improvements were observed in formula and table edit distances, indicating enhanced parsing of mathematical expressions and structured data regions.
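These figures are normalized edit distances, where lower is better; a generic sketch of such a metric (not OmniDocBench's exact implementation) looks like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """0.0 means a perfect match; 1.0 means nothing aligns."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))

print(normalized_edit_distance("DeepSeek-OCR 2", "DeepSeek-OCR2"))  # ~0.071
```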
As a document parser, DeepSeek-OCR 2 registered an overall element-level edit distance of 0.100, compared with 0.129 for the original DeepSeek-OCR and 0.115 for Gemini-3 Pro under similar visual token constraints. These results suggest that the causal visual flow encoder improves structural fidelity without increasing the token budget.
Category-wise analysis shows that DeepSeek-OCR 2 improved text edit distance across most document types, including academic papers and books. Performance was weaker on extremely dense newspapers, where text edit distance remained above 0.13; the researchers attribute this to limited newspaper-specific training data and the heavy compression such pages require. Reading order metrics nevertheless improved across all evaluated categories.
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost