TMTPost: Leading new insights on future business and life · October 22, 16:13
DeepSeek Releases Open-Source OCR Model That Efficiently Compresses Image Text

 

Chinese artificial intelligence company DeepSeek has released the open-source model DeepSeek-OCR, designed to extract and compress text from images and PDFs, supplying large-scale, high-quality datasets for training large language models (LLMs) and vision-language models (VLMs) while significantly reducing computational requirements. The model has been published on GitHub, accompanied by a research paper titled DeepSeek-OCR: Contexts Optical Compression. DeepSeek-OCR uses optical compression to encode textual information into visual representations stored in an optical format. This approach addresses the main computational bottleneck LLMs face when processing long-form content such as research papers, legal contracts, and financial reports: by converting text into images, the system can process large volumes of documents more efficiently, in a way that resembles the gradual forgetting mechanism of human memory. Performance metrics show that DeepSeek-OCR maintains over 96% accuracy at tenfold data compression, about 90% accuracy at 10-12x compression, and about 60% at 20x. The model is also highly scalable: a single A100-40G GPU can generate more than 200,000 pages of training data per day. DeepSeek-OCR's ability to compress long-form text opens new possibilities for LLM training, particularly in scenarios that require processing massive amounts of data, since converting dialogues, research materials, and multi-page documents into images reduces token counts and computational overhead. While the model is a significant technical advance, DeepSeek has been relatively slow to roll out new models, prompting discussion about the company's development strategy.

🖼️ DeepSeek has released the open-source OCR model DeepSeek-OCR, which extracts and compresses text from images and PDFs, aiming to supply large-scale, high-quality datasets for training LLMs and VLMs while significantly reducing computational cost.

💾 DeepSeek-OCR uses optical compression to encode textual information into visual representations, addressing the computational bottleneck LLMs face with long-form text: converting text into images enables more efficient document processing and mimics the forgetting mechanism of human memory.

📊 The research data show that DeepSeek-OCR achieves over 96% accuracy at tenfold data compression, about 90% at 10-12x compression, and still about 60% even at 20x, demonstrating the potential of compact language models to effectively decode compressed visual text.

🚀 DeepSeek-OCR is highly scalable: a single A100-40G GPU can reportedly generate more than 200,000 pages of training data per day, providing strong support for processing massive datasets.

🤝 While DeepSeek-OCR is a significant technical advance, DeepSeek has been relatively slow to release new models such as R2, prompting discussion and analysis of the company's strategy in the rapidly evolving AI field.

Chinese artificial intelligence firm DeepSeek has released DeepSeek-OCR, an open-source model designed to extract and compress text from images and PDFs, aiming to provide large-scale, high-quality datasets for training large language models (LLMs) and vision-language models (VLMs) while dramatically reducing computational requirements.

The model was made publicly available on GitHub yesterday, accompanied by a research paper titled DeepSeek-OCR: Contexts Optical Compression.

The technology behind DeepSeek-OCR leverages optical compression to encode textual information into visual representations, which are stored in an optical format.

According to the company, this approach addresses the major computational bottlenecks LLMs face when processing long-form content such as research papers, legal contracts, financial reports, and dialogue histories. By converting text into images, the system allows models to process extensive documents more efficiently, simulating a gradual forgetting mechanism similar to human memory.
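The "gradual forgetting" idea described above can be pictured as a resolution schedule: older context pages are re-rendered at lower resolution so they cost fewer vision tokens the further back they sit. The sketch below is purely illustrative; the base resolution, halving schedule, and `resolution_for_age` helper are assumptions for intuition, not DeepSeek's actual configuration:

```python
# Illustrative sketch of a forgetting-style resolution schedule: each step
# back in the context halves the rendered image resolution, down to a floor.
# Values are hypothetical, not taken from the DeepSeek-OCR paper.

def resolution_for_age(age: int, base: int = 1024, floor: int = 64) -> int:
    """Halve the image resolution for each step back in the context window."""
    return max(base >> age, floor)

# The most recent page stays at full resolution; older pages get coarser.
print(resolution_for_age(0))  # 1024
print(resolution_for_age(1))  # 512
print(resolution_for_age(5))  # 64 (1024 >> 5 = 32, clamped to the floor)
```

Coarser renderings carry less recoverable text, which is exactly the trade-off the forgetting analogy points at: distant context degrades gracefully instead of occupying a full token budget.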

Performance metrics shared in the research indicate that DeepSeek-OCR can achieve over 96% accuracy with a tenfold reduction in data, 90% accuracy at compression rates of 10–12 times, and around 60% accuracy with a 20-fold reduction.
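As a rough way to see what these ratios mean in practice, the trade-off can be expressed as a simple token-budget calculation. The accuracy figures below are the approximate numbers quoted above; the `vision_tokens_needed` helper is a hypothetical simplification, not DeepSeek's methodology:

```python
# Illustrative sketch: optical compression trades vision-token budget
# against OCR accuracy. Accuracy figures approximate those reported
# in the research; the mapping itself is a simplification.

REPORTED_TRADEOFF = {
    # compression ratio (text tokens / vision tokens) -> approx. accuracy
    10: 0.96,
    12: 0.90,
    20: 0.60,
}

def vision_tokens_needed(text_tokens: int, compression_ratio: int) -> int:
    """Vision tokens required to represent `text_tokens` at a given ratio."""
    return -(-text_tokens // compression_ratio)  # ceiling division

# A 5,000-token legal contract rendered as images at 10x compression
# would occupy roughly 500 vision tokens instead of 5,000 text tokens.
print(vision_tokens_needed(5000, 10))  # 500
print(REPORTED_TRADEOFF[10])           # 0.96
```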

This demonstrates that compact language models can effectively decode compressed visual text, potentially enabling larger models to adopt similar capabilities with fewer resources. The model is also highly scalable: a single A100-40G GPU can reportedly generate more than 200,000 pages of training data per day.

DeepSeek-OCR’s ability to compress long-form textual content opens new possibilities for LLM training, particularly for scenarios requiring the processing of massive amounts of data. By converting dialogues, research materials, and multi-page documents into images, the approach reduces token counts and computational overhead, potentially allowing models to handle larger datasets without a corresponding spike in GPU demand.
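A back-of-the-envelope estimate shows the scale this implies. The per-GPU page throughput comes from the figures above; the average tokens per page and the `daily_token_savings` helper are assumptions chosen for illustration:

```python
# Back-of-the-envelope throughput estimate from the figures in the article.
# Assumptions (hypothetical): ~600 text tokens per page on average,
# 10x optical compression, and the reported 200,000 pages/day per A100-40G.

PAGES_PER_GPU_PER_DAY = 200_000  # reported in the article
TEXT_TOKENS_PER_PAGE = 600       # assumed average, not from the paper
COMPRESSION_RATIO = 10           # ratio with >96% reported accuracy

def daily_token_savings(num_gpus: int) -> int:
    """Text tokens avoided per day by storing pages as compressed vision tokens."""
    text_tokens = num_gpus * PAGES_PER_GPU_PER_DAY * TEXT_TOKENS_PER_PAGE
    vision_tokens = text_tokens // COMPRESSION_RATIO
    return text_tokens - vision_tokens

# Under these assumptions, a 20-GPU cluster shifts roughly 2.4B text
# tokens/day down to 240M vision tokens, a ~2.16B-token daily reduction.
print(daily_token_savings(20))  # 2160000000
```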

The open-source release has already attracted attention within the AI community, with DeepSeek-OCR garnering over 1,400 stars on GitHub shortly after its debut.

Analysts note that while the model represents a significant technical advancement, DeepSeek has been relatively slow in rolling out new models like R2. Some experts speculate that this may suggest the company is temporarily falling behind in the rapidly evolving AI field.

Others, however, interpret the cautious pace as a deliberate strategy to strengthen internal capabilities and lay the groundwork for a next-generation AI model.

 



Related tags

DeepSeek-OCR · open-source model · OCR · optical compression · LLM · VLM