DeepSeek-OCR: Contexts Optical Compression

Haoran Wei, Yaofeng Sun, Yukun Li
DeepSeek-AI

Executive Summary

DeepSeek-OCR represents a paradigm shift in how AI systems handle long text contexts by treating vision as a compression medium. Rather than processing text token-by-token, the system converts documents into images and compresses them 7-20× while maintaining high accuracy. At 10× compression, the model achieves 97% OCR precision; even at 20× compression, accuracy remains around 60%. This approach directly addresses the quadratic computational scaling problem in large language models when processing long contexts. Beyond OCR performance, DeepSeek-OCR demonstrates a novel path toward implementing memory forgetting mechanisms in AI systems—older conversation rounds can be stored at progressively lower resolutions, mirroring human memory decay. With production throughput of 200,000+ pages per day on a single A100 GPU, the system has immediate practical value while opening research directions for vision-text compression, context management, and multimodal architecture design.

🧒 ELI5: The Core Idea

Imagine your brain trying to remember a long conversation...

Right now, AI chatbots are like someone with a perfect memory who has to read through every single word of a conversation from the beginning every time you say something new. If you've been chatting for an hour, that's thousands of words to re-read!

DeepSeek-OCR has a clever trick: it takes a picture of old messages.

Think about it—if I show you a photo of a page from a book, you can see all the words at once. The photo file is much smaller than storing each letter separately. DeepSeek-OCR does the same thing: it converts old text into images, and those images take up way less "brain space" (tokens) for the AI to remember.

Even better: it mimics how human memory works! Older messages get stored as smaller, blurrier pictures, just as your memory of last week's chats is fuzzier than your memory of what was said a minute ago.

This lets the AI "remember" 10 times more conversation history while using the same amount of brain power! It's like being able to fit 10 books into your backpack by taking pictures of the pages instead of carrying the actual books.

Research Context & Motivation

The Long Context Problem

Large language models face a fundamental computational bottleneck: processing cost scales quadratically with context length. When a document contains 10,000 tokens, self-attention must manage 10,000² = 100 million pairwise interactions. This becomes prohibitively expensive for long documents, extended multi-turn conversations, and large-scale corpus processing.

The Core Insight: Vision as Compression

Key Observation: A single image containing document text can represent rich information using substantially fewer tokens than the equivalent digital text.

A 1024×1024 image rendered from 1,000 text tokens can be encoded into just 100 vision tokens—a 10× compression ratio—while maintaining 97%+ decoding accuracy.
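
To make the numbers concrete, here is a minimal sketch (assuming Pillow is available; the font, wrap width, and canvas size are illustrative choices rather than the paper's rendering pipeline) of rendering text into a fixed-size image and computing the notional compression ratio:

```python
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_text_to_image(text: str, size: int = 1024) -> Image.Image:
    """Render plain text onto a square white canvas, wrapping lines to fit."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()      # swap in a real TTF for higher-quality rendering
    margin, line_height = 16, 24
    y = margin
    for line in textwrap.wrap(text, width=90):
        if y + line_height > size - margin:
            break                        # text that overflows the canvas is simply dropped
        draw.text((margin, y), line, fill="black", font=font)
        y += line_height
    return img

# Notional compression: ~1,000 text tokens rendered into an image encoded as ~100 vision tokens.
text_tokens, vision_tokens = 1000, 100
print(f"compression ratio ≈ {text_tokens / vision_tokens:.1f}×")
```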

This insight flips the traditional VLM (Vision-Language Model) paradigm. Instead of asking "how can vision encoders help LLMs understand images?", DeepSeek-OCR asks: "how can vision encoders help LLMs process text more efficiently?"

Key Contributions

  1. Quantitative Vision-Text Compression Analysis: First comprehensive study demonstrating 7-20× text compression via optical mapping with measured accuracy bounds
  2. DeepEncoder Architecture: Novel vision encoder maintaining low activation memory and minimal vision tokens under high-resolution inputs through serial connection of window attention and global attention components
  3. Production-Ready OCR System: State-of-the-art performance on OmniDocBench using fewer vision tokens than existing models, with 200k+ pages/day throughput on single GPU
  4. Memory Forgetting Mechanism: Conceptual framework for implementing progressive context compression mimicking human memory decay

Architecture Deep Dive

System Overview

TWO-COMPONENT DESIGN: a DeepEncoder (~380M parameters) compresses a document image into a small number of vision tokens, and the DeepSeek3B-MoE decoder (~570M active parameters) reconstructs the text from those tokens.

DeepEncoder: The Core Innovation

🎯 Simple Explanation

DeepEncoder is like a smart camera with two lenses:

  1. First lens (SAM - 80M params): Takes quick, detailed snapshots of small areas. It looks at little windows of the image, one piece at a time. This is fast and doesn't use much memory because it only looks at small chunks.
  2. Squisher in the middle (16× compressor): Takes all those little snapshots and squishes them down. Like when you zip a file on your computer—same information, much smaller size.
  3. Second lens (CLIP - 300M params): Looks at the whole squished-down picture at once to understand the big picture and add "knowledge" about what things mean.

The genius: By doing the hard work (looking at details) when things are split into small pieces, and only looking at everything together AFTER it's been squished, it stays fast and doesn't run out of memory!

DeepEncoder Technical Architecture

Design Requirements

  1. Process high resolutions (up to 1280×1280)
  2. Maintain low activation memory
  3. Generate few vision tokens
  4. Support multiple resolution inputs
  5. Moderate parameter count (fit on single GPU)

Three-Stage Pipeline

STAGE 1: LOCAL PERCEPTION

The SAM-based component (~80M params) runs window attention over image patches, capturing fine-grained local detail while keeping activation memory low.

STAGE 2: TOKEN COMPRESSION

A 16× convolutional compressor shrinks the patch tokens (roughly 4,096 for a 1024×1024 input) down to a compact set (roughly 256).

STAGE 3: GLOBAL CONTEXT

The CLIP-based component (~300M params) applies global attention over the compressed tokens to capture layout, structure, and semantics.
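
The serial structure is easy to express in code. Below is a schematic PyTorch-style sketch; the module boundaries, dimensions, and the two stride-2 convolutions are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """16× token reduction: two stride-2 convolutions over the 2D token grid (4× per axis)."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
        # tokens: (B, grid*grid, dim) -> (B, dim, grid, grid) -> compress -> (B, (grid//4)**2, dim)
        b, _, d = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, d, grid, grid)
        x = self.conv(x)
        return x.flatten(2).transpose(1, 2)

class DeepEncoderSketch(nn.Module):
    def __init__(self, local_encoder: nn.Module, global_encoder: nn.Module, dim: int = 768):
        super().__init__()
        self.local_encoder = local_encoder      # window-attention ViT (SAM-style): cheap at high res
        self.compressor = TokenCompressor(dim)  # 16× reduction BEFORE any global attention
        self.global_encoder = global_encoder    # dense-attention ViT (CLIP-style) on few tokens

    def forward(self, patches: torch.Tensor, grid: int = 64) -> torch.Tensor:
        local = self.local_encoder(patches)         # (B, 4096, dim) for a 1024×1024, patch-16 input
        compressed = self.compressor(local, grid)   # (B, 256, dim)
        return self.global_encoder(compressed)      # (B, 256, dim) vision tokens for the decoder

# Shape check with identity stand-ins for the two attention stacks:
enc = DeepEncoderSketch(nn.Identity(), nn.Identity(), dim=768)
print(enc(torch.randn(1, 64 * 64, 768)).shape)      # torch.Size([1, 256, 768])
```

The key property is that the dense-attention stage only ever sees the ~256 compressed tokens, which keeps activation memory flat even for high-resolution inputs.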

Why This Architecture Wins

| Typical VLM Approach | Problem | DeepEncoder Solution |
| --- | --- | --- |
| Tile-based (InternVL) | Low native resolution → excessive fragmentation → too many tokens | High native resolution (1024+) → minimal tiling needed |
| Adaptive resolution (Qwen2-VL) | Massive activation memory → GPU OOM on large images | Window attention + compression before global attention |
| Dual-tower (Vary) | Complex preprocessing, hard to parallelize | Single serial pipeline, simple and efficient |

Multi-Resolution Support

📸 Camera Modes Analogy

DeepEncoder is like a camera with different quality settings:

The genius: ONE model can switch between all these modes. You tell it "use 100 tokens" or "use 400 tokens" and it adjusts on the fly!

| Mode | Resolution | Vision Tokens | Use Case |
| --- | --- | --- | --- |
| Tiny | 512×512 | 64 | Simple documents, maximum compression |
| Small | 640×640 | 100 | Standard documents, good balance |
| Base | 1024×1024 | 256 (182 valid) | Detailed documents, preserves aspect ratio |
| Large | 1280×1280 | 400 (285 valid) | High-quality OCR, complex layouts |
| Gundam | n×640×640 tiles + 1024×1024 global view | n×100 + 256 | Newspapers, multi-page documents |
| Gundam-Master | n×1024×1024 tiles + 1280×1280 global view | n×256 + 400 | Maximum quality, ultra-high resolution |
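
In practice, mode selection can be a simple lookup keyed on how much text a page is expected to contain. The sketch below is illustrative; the thresholds are assumptions rather than tuned values, and the tiling (Gundam) modes are omitted:

```python
# Illustrative mode table and selector; thresholds are assumptions, not tuned values.
MODES = {
    "tiny":  {"resolution": 512,  "vision_tokens": 64},
    "small": {"resolution": 640,  "vision_tokens": 100},
    "base":  {"resolution": 1024, "vision_tokens": 256},
    "large": {"resolution": 1280, "vision_tokens": 400},
}

def pick_mode(estimated_text_tokens: int, target_ratio: float = 10.0) -> str:
    """Pick the cheapest mode that keeps compression at or below the target ratio."""
    for name, cfg in MODES.items():                      # dict order: cheapest first
        if estimated_text_tokens / cfg["vision_tokens"] <= target_ratio:
            return name
    return "large"                                       # tiling (Gundam) modes not modeled here

print(pick_mode(800))    # -> 'small' (800 / 100 = 8×)
print(pick_mode(3000))   # -> 'large' (3000 / 400 = 7.5×)
```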

The MoE Decoder

DeepSeek3B-MoE Architecture:

f_decode: R^(n×d_latent) → R^(N×d_text)
where n ≤ N, and N/n is the compression ratio

Compressed vision tokens → Reconstructed text

Training Methodology

Data Engine: Comprehensive and Diverse

OCR 1.0 Data (Traditional OCR)

Document pages in roughly 100 languages paired with ground-truth text, spanning both clean digital renders and scanned material.

OCR 2.0 Data (Deep Parsing)

Charts, chemical formulas, and plane-geometry figures paired with structured targets (HTML tables, SMILES strings, coordinate dictionaries).

General Vision Data (20% of total)

General image-understanding data retained so the encoder keeps broad visual capability beyond documents.

Text-Only Data (10% of total)

Pure text data that preserves the decoder's language-modeling ability.

Two-Stage Training Pipeline

Stage 1: Training DeepEncoder Independently

Stage 2: Training DeepSeek-OCR End-to-End

Experimental Results: Compression Study

Vision-Text Compression Bounds

Tested on Fox Benchmark (100 English pages, 600-1300 tokens)

Key Finding: Near-lossless compression achievable up to 10× ratio. Beyond this, performance degrades but remains useful.

| Text Tokens (Ground Truth) | Precision @ 64 Vision Tokens | Compression | Precision @ 100 Vision Tokens | Compression | Pages |
| --- | --- | --- | --- | --- | --- |
| 600-700 | 96.5% | 10.5× | 98.5% | 6.7× | 7 |
| 700-800 | 93.8% | 11.8× | 97.3% | 7.5× | 28 |
| 800-900 | 83.8% | 13.2× | 96.8% | 8.5× | 28 |
| 900-1000 | 85.9% | 15.1× | 96.8% | 9.7× | 14 |
| 1000-1100 | 79.3% | 16.5× | 91.5% | 10.6× | 11 |
| 1100-1200 | 76.4% | 17.7× | 89.8% | 11.3× | 8 |
| 1200-1300 | 59.1% | 19.7× | 87.1% | 12.6× | 4 |

Why Performance Degrades Beyond 10×

  1. Layout Complexity: Longer documents tend to have more complex layouts with multiple columns, tables, mixed fonts
  2. Visual Resolution Limits: At 512×512 or 640×640, pages with 1,200+ text tokens render so small that the text becomes blurry; even humans struggle to read them
  3. Token Capacity Bounds: 64 vision tokens simply cannot encode all the nuanced variations in 1200 text tokens without information loss

Practical Performance: OmniDocBench

Overall Results

| Model | Avg Tokens/Page | Edit Distance (EN) | Notes |
| --- | --- | --- | --- |
| **Pipeline Models** | | | |
| MinerU-2.1.1 | ~6000 | 0.162 | High quality, very expensive |
| PPstructure-v3 | ~6000 | 0.152 | Best pipeline, but token-heavy |
| **End-to-End Models (High Token Count)** | | | |
| InternVL3-78B | 6790 | 0.218 | Excellent but expensive |
| Qwen2.5-VL-72B | 3949 | 0.214 | High quality, moderate cost |
| MinerU2.0 | 6790 | 0.133 | Top performer but token-heavy |
| **End-to-End Models (Low Token Count)** | | | |
| GOT-OCR2.0 | 256 | 0.287 | Efficient but lower quality |
| DeepSeek-OCR (Small) | 100 | 0.221 | Beats GOT-OCR2.0 with 2.5× fewer tokens |
| DeepSeek-OCR (Large) | 400 (285 valid) | 0.138 | Matches top models with ~15× fewer tokens |
| DeepSeek-OCR (Gundam) | 795 | 0.127 | Beats MinerU2.0 with 8.5× fewer tokens |

Per-Document-Type Analysis

| Document Type | Tiny (64) | Small (100) | Base (256) | Gundam |
| --- | --- | --- | --- | --- |
| Slides | 0.116 | 0.111 | 0.080 | 0.085 |
| Books | 0.147 | 0.085 | 0.037 | 0.035 |
| Financial Reports | 0.207 | 0.079 | 0.027 | 0.289 |
| Academic Papers | 0.395 | 0.131 | 0.052 | 0.039 |
| Newspapers | 0.940 | 0.744 | 0.645 | 0.122 |

(Edit distance by mode; lower is better.)

Insights from Per-Type Performance

Practical Implication: Different document types have different optimal compression ratios. Adaptive token allocation based on document type can optimize cost-performance tradeoff.

Advanced Capabilities: "Deep Parsing"

OCR 2.0: Beyond Text Recognition

🎨 What is Deep Parsing?

Imagine you're scanning a science textbook. Regular OCR just reads the words. But what about the charts, the chemical structure diagrams, the geometry figures, and the photos?

DeepSeek-OCR does all of this automatically! One unified prompt, and it figures out what type of content it's looking at and provides the appropriate structured output.

Deep Parsing Capabilities

| Content Type | Input | Output Format | Applications |
| --- | --- | --- | --- |
| Charts | Line, bar, pie, composite charts | HTML table with structured data | Financial analysis, data extraction from reports |
| Chemical Formulas | Molecular structure images | SMILES notation | Chemistry research, drug discovery databases |
| Geometry | Plane geometry figures | Structured dictionary: segments, coordinates, types | Math education, geometric reasoning |
| Natural Images | Photos in documents | Dense caption describing the scene | Document understanding, accessibility |

Unified Interface: Single Prompt

<image>\nParse the figure.

With this single prompt, DeepSeek-OCR automatically:

  1. Identifies the content type
  2. Selects appropriate parsing strategy
  3. Returns structured output in the correct format

Multilingual Support

DeepSeek-OCR performs OCR in roughly 100 languages; labeled data for low-resource languages is bootstrapped via the "model flywheel" described later in this write-up.

The Memory Forgetting Mechanism

Mimicking Human Memory Decay

🧠 How Human Memory Works

Think about remembering a conversation: what was said a minute ago is crystal clear, yesterday's exchange is fuzzier, and last month's is reduced to the gist.

DeepSeek-OCR proposes the same idea for AI: Convert old conversation text to images, then progressively make those images smaller and blurrier as time passes!

Implementation Concept

Multi-Level Context Compression

TEMPORAL DECAY STRATEGY

  1. Current turn (just happened): Keep as text tokens—full fidelity
  2. Recent history (1-5 turns ago): Convert to Gundam mode (high resolution) → ~10× compression
  3. Medium history (6-20 turns ago): Downsample to Large mode (1280×1280) → 15× compression
  4. Older history (21-50 turns ago): Downsample to Base mode (1024×1024) → 20× compression
  5. Ancient history (50+ turns ago): Downsample to Small/Tiny mode (640×640 or 512×512) → 30-40× compression
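
Such a policy can be expressed as a small lookup table. The sketch below simply mirrors the tiers listed above; the token budgets are rough illustrations rather than measured values:

```python
from dataclasses import dataclass

@dataclass
class MemoryTier:
    max_age: int    # maximum turns since the message for this tier (inclusive)
    storage: str    # "text" or an optical mode name
    tokens: int     # rough token budget per ~1,000 original text tokens

# Tiers mirror the temporal decay strategy above; the budgets are illustrative only.
TIERS = [
    MemoryTier(0,     "text",   1000),   # current turn: full-fidelity text
    MemoryTier(5,     "gundam",  100),   # recent: ~10× compression
    MemoryTier(20,    "large",    67),   # medium: ~15×
    MemoryTier(50,    "base",     50),   # older: ~20×
    MemoryTier(10**9, "tiny",     30),   # ancient: ~30-40×, eventually discarded
]

def tier_for(turn_age: int) -> MemoryTier:
    return next(t for t in TIERS if turn_age <= t.max_age)

print(tier_for(3).storage)    # 'gundam'
print(tier_for(30).storage)   # 'base'
```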

Progressive Resolution Degradation

Text → Image_High_Res → Image_Med_Res → Image_Low_Res → Discard

As conversations age, progressively downsample the rendered images. This mirrors both how human memories fade with time and how visual detail fades with distance.

Practical Benefits

| Without Forgetting | With Optical Forgetting | Benefit |
| --- | --- | --- |
| 100-turn conversation: 100,000 tokens total | Recent: 10,000 tokens (text); Medium: 2,000 tokens (vision); Old: 1,000 tokens (vision) | 87% token reduction |
| Quadratic attention cost: 100,000² operations | ~13,000² operations over the compressed context | ~60× fewer operations |
| Context limit of 100k tokens ≈ 100 turns max | Same 100k-token limit ≈ 1,000+ turns | 10× longer conversations |
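
The arithmetic behind those rows is easy to reproduce from the illustrative budgets above:

```python
text_only = 100_000                       # 100 turns kept verbatim as text tokens
compressed = 10_000 + 2_000 + 1_000       # recent text + medium/old vision tokens

token_reduction = 1 - compressed / text_only
attention_savings = text_only**2 / compressed**2   # self-attention cost grows quadratically

print(f"token reduction: {token_reduction:.0%}")          # 87%
print(f"attention ops saved: ~{attention_savings:.0f}×")  # ~59×
```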

Research Implications

This approach suggests a new paradigm for ultra-long-context LLMs: recent turns stay as text tokens, while older turns persist as progressively more compressed optical representations.

Open Question: Can LLMs be pretrained with digital-optical text interleaving to natively support this compression mechanism?

Comparison with Existing Systems

End-to-End OCR Models

| Model | Tokens/Page | Strengths | Limitations |
| --- | --- | --- | --- |
| Nougat | 2352 | First academic-paper OCR, pioneering work | Huge token count, limited to academic papers |
| GOT-OCR2.0 | 256 | Efficient, supports OCR 2.0 tasks | Lower accuracy than pipeline methods |
| Qwen2.5-VL-72B | 3949 | High quality, general VLM | Token-heavy, expensive inference |
| InternVL3-78B | 6790 | Excellent quality, handles extreme resolutions | Excessive fragmentation, very expensive |
| MinerU2.0 | 6790 | Top accuracy on OmniDocBench | Most expensive, 6000+ tokens/page |
| DeepSeek-OCR | 100-800 | Best accuracy per token, adaptive modes, fast inference | Not a chatbot (no SFT), requires completion-style prompts |

Vision Encoders in VLMs

| Architecture | Example | Problem | DeepEncoder Advantage |
| --- | --- | --- | --- |
| Dual-Tower | Vary, DeepSeek-VL | Complex preprocessing, hard to parallelize | Single serial pipeline, simple deployment |
| Tile-Based | InternVL2/3 | Low native resolution → excessive fragmentation → too many tokens | High native resolution (1024+) → minimal tiling, fewer tokens |
| Adaptive Resolution | Qwen2-VL, NaViT | Massive activation memory → GPU OOM on large images | Window attention + compression → manageable memory |
| Serial Compression | DeepEncoder | N/A | Combines the benefits: high resolution + low activation + few tokens |

Unique Positioning

DeepSeek-OCR fills a critical gap: near state-of-the-art document OCR accuracy at a vision-token budget roughly an order of magnitude smaller than comparable end-to-end models.

Production Deployment

Performance Characteristics

| Metric | Value | Notes |
| --- | --- | --- |
| Throughput | 200,000+ pages/day | Single A100-40G GPU |
| Cluster Scale | 33M pages/day | 20 nodes × 8 A100s (160 GPUs) |
| Inference Speed | ~9 pages/second | Base mode (256 tokens) |
| Memory per Image | ~2-4 GB peak | Depends on resolution mode |
| Model Size | 3.4 GB | ~380M-param encoder + ~570M active decoder params |

Cost-Performance Analysis

Comparison: DeepSeek-OCR vs Traditional Pipelines

A traditional pipeline such as MinerU2.0 spends roughly 6,000-6,800 tokens per page to reach a 0.133 edit distance, whereas DeepSeek-OCR reaches 0.138 with 400 tokens (Large mode) and 0.127 with 795 tokens (Gundam mode); see the OmniDocBench table above.

Use Cases

  1. LLM/VLM Training Data Generation: Convert large PDF corpora to structured text at scale
  2. Document Search Indexing: Extract searchable text from scanned documents
  3. Archival Digitization: Process historical documents, newspapers, books
  4. Scientific Literature Processing: Extract text, formulas, charts from research papers
  5. Financial Document Analysis: Parse reports, extract structured data from charts

Technical Innovations Explained

Why Window Attention + Global Attention Works

🔍 The Two-Stage Processing Trick

The Problem: Looking at a high-resolution image with global attention (where every pixel attends to every other pixel) requires MASSIVE memory.

For a 1024×1024 image = 1 million pixels:
Global attention needs: 1M × 1M = 1 trillion operations! 🤯

The Solution: Split the work into two stages:

  1. Stage 1 (Window Attention): Look at small 16×16 windows independently. Each window only attends to itself.
    • Operations: 256 × 256 × (1M/256 windows) ≈ 268 million (~4,000× less!)
    • This stage captures LOCAL details (edges, characters, small patterns)
  2. Compression: Squish those ~4,096 window outputs down 16×, leaving only 256 tokens
  3. Stage 2 (Global Attention): Now use global attention on the compressed tokens.
    • Operations: 256 × 256 ≈ 65 thousand (trivial!)
    • This stage captures GLOBAL context (layout, relationships, meaning)

Result: You get both local details AND global understanding, without exploding memory!
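
The back-of-the-envelope numbers above can be checked directly (treating each position as a pixel is a simplification of what the encoder actually attends over):

```python
pixels = 1024 * 1024                  # ~1M positions in this simplified framing
window = 16 * 16                      # positions per local window
windows = pixels // window            # 4,096 windows

naive_ops = pixels ** 2               # full global attention: ~1.1e12
windowed_ops = windows * window ** 2  # attention restricted to each window: ~2.7e8
compressed_tokens = windows // 16     # 16× token compression -> 256
stage2_ops = compressed_tokens ** 2   # global attention after compression: 65,536

print(f"naive: {naive_ops:.1e}, windowed: {windowed_ops:.1e} "
      f"({naive_ops / windowed_ops:.0f}× fewer), stage 2: {stage2_ops:,}")
```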

Position Encoding for Variable Resolution

The Challenge: The model is trained on specific resolutions (512, 640, 1024, 1280). How does it handle arbitrary sizes?

The Solution: Dynamic Positional Encoding Interpolation

The encoder's positional embeddings are tied to the patch grid of its training resolutions. For an unseen resolution, the learned 2D embedding grid is resized (for example, by bilinear interpolation) to match the new number of patches, so the model can accept sizes it never saw during training.
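
For ViT-style encoders this is commonly done by reshaping the learned position embeddings into their 2D grid and bilinearly resizing them; the snippet below is a generic sketch, not the DeepEncoder source:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """pos_embed: (1, old_grid*old_grid, dim) learned embeddings, resized to a new patch grid."""
    dim = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)   # (1, dim, H, W)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# Example: embeddings learned for a 64×64 patch grid (1024/16) adapted to 80×80 (1280/16).
pe = torch.randn(1, 64 * 64, 768)
print(resize_pos_embed(pe, 64, 80).shape)   # torch.Size([1, 6400, 768])
```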

Why MoE for the Decoder?

Mixture-of-Experts Benefits

Traditional Decoder: Every token activates entire model → expensive

MoE Decoder: Each token routes to 6 of 64 experts → only activates small portion

| Aspect | Dense 3B Model | DeepSeek3B-MoE |
| --- | --- | --- |
| Total Parameters | 3B | 3B (64 experts × ~45M each) |
| Active per Token | 3B | ~570M (6 routed + 2 shared experts) |
| FLOPs | High | ~5× lower |
| Expressivity | Baseline | Higher (64 specialized sub-models) |
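
To make the routing idea concrete, here is a generic top-k mixture-of-experts layer; the expert count and activation pattern follow the table above, but the hidden sizes are illustrative and this is not the DeepSeek implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k MoE feed-forward layer: 64 routed experts, 6 active per token, plus a shared expert."""
    def __init__(self, dim: int = 1280, n_experts: int = 64, k: int = 6):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.shared = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)              # routing probabilities over experts
        top_w, top_idx = probs.topk(self.k, dim=-1)            # only k experts fire per token
        out = self.shared(x)                                    # shared expert is always active
        for slot in range(self.k):
            for e in top_idx[:, slot].unique().tolist():        # run each selected expert on its tokens
                mask = top_idx[:, slot] == e
                out[mask] = out[mask] + top_w[mask, slot, None] * self.experts[e](x[mask])
        return out

print(TopKMoE()(torch.randn(8, 1280)).shape)   # torch.Size([8, 1280])
```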

Why This Matters for OCR: document pages mix many content types (plain text, tables, formulas, chemical structures, multiple scripts), and a sparse expert mixture can specialize across them while keeping per-token compute low.

The "Model Flywheel" for Minority Languages

Bootstrapping OCR for 100 Languages

Problem: No labeled OCR data for minority languages (Sinhala, Khmer, etc.)

Solution: Iterative Self-Improvement

  1. Step 1: Use PP-DocLayout (layout model) → finds text regions (works across languages)
  2. Step 2: Use fitz (PyMuPDF) to extract raw text → creates initial training data
  3. Step 3: Train GOT-OCR2.0 on this noisy data → gets basic OCR ability
  4. Step 4: Use trained model to label more documents → creates better training data
  5. Step 5: Retrain on improved data → model gets better
  6. Repeat steps 4-5: Model quality improves with each iteration

Result: 600K labeled samples for minority languages, good enough for production use

Critique & Limitations

Current Limitations

  1. No Supervised Fine-Tuning (SFT):
  2. Unproven for True Context Compression:
  3. Geometry Parsing Still Challenging:
  4. Compression-Accuracy Tradeoff:
  5. Limited to Static Documents:

Architectural Concerns

Future Research Directions

Short-Term Improvements (6-12 months)

  1. Add Supervised Fine-Tuning Stage:
  2. Adaptive Compression:
  3. Improved Geometry Parsing:
  4. Streaming Inference:

Medium-Term Exploration (1-2 years)

  1. Digital-Optical Interleaved Pretraining:
  2. Learned Forgetting Mechanisms:
  3. Hierarchical Vision Tokenization:
  4. Cross-Modal Compression:

Long-Term Vision (2+ years)

  1. Lossless Compression Bounds:
  2. Multi-Agent Collaborative Compression:
  3. Real-Time Adaptive Systems:
  4. Unified Multimodal Memory Architecture:

Novel Research Questions Opened

Fundamental Questions

  1. Compression-Understanding Paradox: If text is compressed 10×, has the model truly "understood" it, or just memorized a lookup table? How do we distinguish compression from understanding?
  2. Optimal Rendering Strategies: Is standard text rendering optimal, or should we design special "compression-friendly" fonts/layouts that maximize information density?
  3. Cross-Lingual Compression: Do different languages compress differently when rendered optically? Should compression strategies vary by language?
  4. Semantic Preservation: At what compression ratio do semantics degrade? Can we prioritize semantic information over syntactic details?
  5. Temporal vs Spatial Compression: The paper suggests time-based forgetting mimics spatial distance. Are these truly analogous, or do they require different mechanisms?

Implementation Insights for Practitioners

When to Use DeepSeek-OCR

✅ Ideal Use Cases

❌ When to Use Alternatives

Deployment Considerations

| Aspect | Recommendation | Rationale |
| --- | --- | --- |
| GPU Requirements | A100-40G or better | Base model fits; Gundam mode needs memory headroom |
| Batch Size | 16-32 images | Balances throughput vs memory |
| Mode Selection | Start with Small (100 tokens) | Best accuracy-cost tradeoff for most documents |
| Prompt Engineering | Use completion-style prompts | Model is not fine-tuned for instruction following |
| Error Handling | Retry with higher resolution on failures | Adaptive quality based on content complexity |
| Preprocessing | Convert PDFs at 200 DPI | Optimal quality-size tradeoff per the paper |
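
The "start cheap, escalate on failure" recommendation can be wrapped in a few lines; `run_ocr` below is a hypothetical callback standing in for whatever inference entry point you use:

```python
from typing import Callable, Tuple

MODE_LADDER = ["small", "base", "large", "gundam"]    # cheapest to most expensive

def ocr_with_escalation(
    image,
    run_ocr: Callable[[object, str], Tuple[str, float]],   # (image, mode) -> (text, confidence)
    min_confidence: float = 0.9,
) -> str:
    """Try cheap modes first; re-run at higher resolution only when the result looks unreliable."""
    text = ""
    for mode in MODE_LADDER:
        text, confidence = run_ocr(image, mode)
        if confidence >= min_confidence:
            return text
    return text                                             # best effort from the largest mode

# Demo with a stand-in OCR callback that only "succeeds" at base resolution or above:
fake_ocr = lambda img, mode: ("recognized text", 0.95 if mode != "small" else 0.5)
print(ocr_with_escalation("page.png", fake_ocr))
```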

Cost-Benefit Analysis

Example: Processing 1M Pages

| Approach | GPU Hours | API Token Cost | Quality | Total Cost |
| --- | --- | --- | --- | --- |
| GPT-4V API | N/A | $20,000 | Excellent | $20,000 |
| InternVL3-78B | 5000 | N/A | Excellent | $15,000 |
| MinerU2.0 | 3000 | N/A | Best | $9,000 |
| DeepSeek-OCR (Small) | 200 | N/A | Very Good | $600 |

Assuming: A100 at $3/hour, GPT-4V at $0.02/page for OCR-length content

Result: DeepSeek-OCR is 15-30× cheaper while maintaining high quality
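
The totals in the table follow directly from the stated assumptions and can be reproduced in a few lines:

```python
A100_RATE = 3.00         # $/GPU-hour (assumption stated above)
GPT4V_PER_PAGE = 0.02    # $/page for OCR-length content (assumption stated above)
PAGES = 1_000_000

costs = {
    "GPT-4V API":           PAGES * GPT4V_PER_PAGE,
    "InternVL3-78B":        5000 * A100_RATE,
    "MinerU2.0":            3000 * A100_RATE,
    "DeepSeek-OCR (Small)":  200 * A100_RATE,
}
baseline = costs["DeepSeek-OCR (Small)"]
for name, cost in costs.items():
    print(f"{name:>22}: ${cost:>8,.0f}   ({cost / baseline:.0f}× the DeepSeek-OCR cost)")
```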

Connections to Broader AI Research

Relation to Other Compression Techniques

| Technique | Mechanism | Compression | Lossiness | Relation to DeepSeek-OCR |
| --- | --- | --- | --- | --- |
| Token Pruning | Remove redundant tokens | 2-5× | Low | Complementary; could prune before optical compression |
| KV Cache Compression | Compress the attention cache | 2-4× | Medium | Orthogonal; applies during inference, not context encoding |
| Summarization | LLM rewrites text shorter | 3-10× | High | Similar goal, but optical compression preserves visual structure |
| Retrieval | Store externally, fetch on demand | N/A | None | Could store old contexts as images for retrieval |
| Optical Compression | Render text as an image | 7-20× | Low-High | The novel modality-crossing approach introduced here |

Implications for Multimodal Foundation Models

DeepSeek-OCR suggests a paradigm shift in multimodal model design:

  1. Vision as Infrastructure, Not Feature: Vision encoders should be optimized for text processing efficiency, not just image understanding
  2. Modality-Crossing Compression: Transform data into most efficient modality for representation (text→image→compressed text)
  3. Heterogeneous Token Budgets: Different parts of context can use different modalities based on age/importance
  4. Unified Attention Across Modalities: LLMs attend to both text tokens and vision tokens representing compressed text

Memory Systems in AI

DeepSeek-OCR connects to a broader line of work on memory architectures in AI, in which older context is stored at lower fidelity than recent context rather than kept uniformly.

Additional Comments

Why This Paper Matters

  1. Paradigm Shift: First work to systematically treat vision encoding as compression mechanism for text processing
  2. Quantitative Bounds: Establishes empirical compression-accuracy tradeoffs with clear experimental validation
  3. Production Viability: Not just a research prototype—deployed system processing millions of pages
  4. Biological Inspiration: Memory forgetting mechanism mirrors neuroscience findings on memory consolidation
  5. Open Source: Code and weights available, enabling reproducible research and practical deployment

Surprising Findings

Underexplored Aspects in Paper

  1. Semantic vs Syntactic Preservation: Does compression preserve meaning better than exact text? No analysis of semantic similarity metrics
  2. Multimodal Pretraining Analysis: What happens if LLMs see compressed text during pretraining? Would they naturally develop decompression abilities?
  3. Compression Artifacts: What types of errors occur at different compression ratios? Character-level, word-level, sentence-level?
  4. Cross-Language Transfer: Does OCR ability on English transfer to unseen languages? Few-shot adaptation?
  5. Adversarial Robustness: Can carefully designed documents "fool" the compression, forcing token allocation to irrelevant content?

Conclusions

DeepSeek-OCR represents a significant conceptual advance in addressing long-context challenges in large language models through a novel paradigm: treating vision as a compression medium for text. Key takeaways:

Core Contributions

Paradigm Implications

For Researchers

  1. Immediate Exploration: Digital-optical interleaved pretraining to validate native compression abilities in LLMs
  2. Architecture Research: Adaptive compression ratios, content-aware token allocation, hierarchical representations
  3. Theoretical Analysis: Information-theoretic limits of optical compression, semantic preservation bounds
  4. Applications Beyond OCR: Audio compression, video summarization, code compression

For Practitioners

  1. Immediate Deployment: Use for large-scale document processing, training data generation (15-30× cost savings vs alternatives)
  2. Mode Selection Strategy: Start with Small mode (100 tokens), scale up only when accuracy insufficient
  3. Integration Patterns: Combine with retrieval systems (store contexts as compressed images), implement adaptive quality based on downstream task needs
  4. Future-Proofing: Design systems with compression-friendly memory architectures anticipating next-gen LLMs with native optical compression

Open Questions

Broader Impact

DeepSeek-OCR opens pathways toward scalable ultra-long context processing without proportional computational cost increases. By establishing vision-text compression as a viable paradigm and demonstrating production viability, it challenges assumptions about modality separation in foundation models. The work suggests that optimal AI systems may not cleanly separate vision and language processing, but instead fluidly transform between modalities to optimize computational efficiency.

Most Importantly: This is early-stage work with substantial room for improvement. The 10× near-lossless compression achieved is just the beginning. With proper pretraining integration, learned compression policies, and adaptive mechanisms, future systems might achieve 20-50× compression while maintaining high fidelity—fundamentally changing how we think about context windows in AI.

The question is no longer "can we compress contexts optically?" but rather "how much further can this paradigm be pushed?" The answer will shape the next generation of multimodal foundation models.

References