Image compression techniques: a new perspective on how AI models process text

A recent OCR paper from DeepSeek reveals a new technique called "optical compression". The technique lets an AI model encode the text in an image more efficiently than plain text: roughly 10 text tokens can be extracted from a single image token with near-100% accuracy. This approach can significantly improve data-processing efficiency and may even change how AI models process and store information. Researchers are exploring how to apply the technique to existing models, and envision future AI models handling long text the way human memory does, with earlier information stored in a blurrier form. The innovation opens up a new research direction for AI, especially for processing the huge amount of text that has never been digitized.

💡 **Efficiency gains from optical compression**: DeepSeek's research shows that an AI model can extract more information from a single image token than from the equivalent raw text, achieving a compression ratio of roughly 10:1. This means that after converting text into an image, the model can process the data more efficiently, unlocking more text data for training language models and potentially allowing more information to be fed in at inference time.

🧠 **An analogy with human memory and information processing**: The technique draws on a characteristic of human memory: when processing long text, earlier images can have their resolution reduced over time, making them blurrier and cheaper to store. This may be closer to how the brain handles information and offers a new way of thinking about how AI models understand text.

🚀 **Potential applications and open challenges**: Optical compression has already been tried, in a preliminary way, on existing multimodal LLMs, and has spawned related services and projects. Even so, it has yet to become a major research focus for AI labs. Training a new model entirely on text rendered as images may face challenges, such as how to train and validate it efficiently and how to handle the internal representation of text tokens.

📚 **Unlocking undigitized text data**: A large number of books have never been digitized; only about 30% of printed books have been. Optical compression offers a new way for AI to process this vast body of undigitized text and could greatly enrich the sources of training data for AI models.

Several days ago, DeepSeek released a new OCR paper. OCR, or “optical character recognition”, is the process of converting an image of text - say, a scanned page of a book - into actual text content. Better OCR is obviously relevant to AI because it unlocks more text data to train language models on[1]. But there’s a more subtle reason why really good OCR might have deep implications for AI models.

Optical compression

According to the DeepSeek paper, you can pull out 10 text tokens from a single image token with near-100% accuracy. In other words, a model’s internal representation of an image is ten times as efficient as its internal representation of text. Does this mean that models shouldn’t consume text at all? When I paste a few paragraphs into ChatGPT, would it be more efficient to convert that into an image of text before sending it to the model? Can we supply 10x or 20x more data to a model at inference time by supplying it as an image of text instead of text itself?

This is called “optical compression”. It reminds me of a funny idea from June of this year to save money on OpenAI transcriptions: before uploading the audio, run it through ffmpeg to speed it up by 2x. The model is smart enough to still pull out the text, and with one simple trick you’ve cut your inference costs and time by half. Optical compression is the same kind of idea: before uploading a big block of text, take a screenshot of it (and optionally downscale the quality) and upload the screenshot instead.
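To make the screenshot side of the trick concrete, here's a minimal sketch using Pillow. The function name, the crude line-wrapping, and the downscale factor are all my own illustrative choices, not anything from the DeepSeek paper:

```python
# Render a block of text onto a white canvas, then optionally downscale it.
# All names and parameters here are illustrative.
from PIL import Image, ImageDraw, ImageFont

def render_text_to_image(text: str, width: int = 800,
                         downscale: float = 0.5) -> Image.Image:
    font = ImageFont.load_default()
    chars_per_line = 100                      # crude fixed-width wrapping
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    line_height = 16
    img = Image.new("RGB", (width, line_height * len(lines) + 20), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, 10 + i * line_height), line, fill="black", font=font)
    if downscale < 1.0:                       # the "optical compression" knob
        img = img.resize((int(img.width * downscale), int(img.height * downscale)))
    return img

if __name__ == "__main__":
    render_text_to_image("a long block of prompt text ... " * 50).save("prompt.png")
```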

Some people are already sort-of doing this with existing multimodal LLMs. There’s a company selling this as a service, an open-source project, and even a benchmark. It seems to work okay! Bear in mind that this is not an intended use case for existing models, so it’s plausible that it could get a lot better if AI labs start actually focusing on it.
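If you want to try this yourself on an existing model, the sketch below sends a rendered page to a vision-capable chat model via the OpenAI Python SDK. The model name and prompt are placeholders, and, as noted above, this isn't an intended use case, so results will vary:

```python
# Feed a screenshot of text to a multimodal model and ask it to work with the
# contents. Assumes OPENAI_API_KEY is set and "prompt.png" exists.
import base64
from openai import OpenAI

client = OpenAI()

with open("prompt.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the text shown in this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```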

The DeepSeek paper suggests an interesting way[2] to use tighter optical compression for long-form text contexts. As the context grows, you could decrease the resolution of the oldest images so they’re cheaper to store, but are also literally blurrier. The paper draws an analogy between this and human memory, where fresh memories are quite vivid but older ones are vaguer and have less detail.
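A toy version of that fading-memory scheme might look like the following, where older pages in the context get progressively smaller (and therefore blurrier). The decay schedule here is made up for illustration; the paper's actual mechanism is described around its Figure 13:

```python
# Keep recent pages sharp and downscale older ones. pages[-1] is the newest page.
from PIL import Image

def fade_context(pages: list[Image.Image], min_scale: float = 0.25) -> list[Image.Image]:
    faded = []
    for age, page in enumerate(reversed(pages)):      # age 0 = newest page
        scale = max(min_scale, 1.0 - 0.1 * age)       # lose 10% resolution per step
        size = (max(1, int(page.width * scale)), max(1, int(page.height * scale)))
        faded.append(page.resize(size))
    return list(reversed(faded))                      # restore oldest-first order
```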

Why would this work?

Optical compression is pretty unintuitive to many software engineers. Why on earth would an image of text be expressible in fewer tokens than the text itself?

In terms of raw information density, an image obviously contains more information than its equivalent text. You can test this for yourself by creating a text file, screenshotting the page, and comparing the size of the image with the size of the text file: the image is about 200x larger. Intuitively, the word “dog” only contains a single word’s worth of information, while an image of the word “dog” contains information about the font, the background and text color, kerning, margins, and so on. How, then, could it be possible that a single image token can contain ten tokens worth of text?
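As a back-of-envelope check on that 200x figure (the exact numbers depend on screen resolution, font, and image format, so treat these as illustrative):

```python
text_chars = 500                      # one paragraph of plain ASCII text
text_bytes = text_chars               # ~1 byte per character in UTF-8
screenshot_bytes = 100 * 1024         # a ~100 KB PNG screenshot of that paragraph
print(screenshot_bytes / text_bytes)  # ≈ 200x larger
```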

The first explanation is that text tokens are discrete while image tokens are continuous. Each model has a finite number of text tokens - say, around 50,000. Each of those tokens corresponds to an embedding of, say, 1000 floating-point numbers. Text tokens thus only occupy a scattering of single points in the space of all possible embeddings. By contrast, the embedding of an image token can be any sequence of those 1000 numbers. So an image token can be far more expressive than a series of text tokens.
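You can put rough numbers on that argument using the figures above (a ~50,000-token vocabulary and ~1,000-dimensional embeddings). These are the post's illustrative numbers, not the paper's:

```python
import math

vocab_size = 50_000
embedding_dim = 1_000

bits_per_text_token = math.log2(vocab_size)   # ≈ 15.6 bits per discrete token
bits_per_image_token = embedding_dim * 8      # assume a crude 8 bits per dimension

print(f"10 text tokens: ~{10 * bits_per_text_token:.0f} bits")
print(f"1 image token:  ~{bits_per_image_token} bits (loose upper bound)")
```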

Another way of looking at the same intuition is that text tokens are a really inefficient way of expressing information. This is often obscured by the fact that text tokens are a reasonably efficient way of sharing information, so long as the sender and receiver both know the list of all possible tokens. When you send an LLM a stream of tokens and it outputs the next one, you’re not passing around slices of a thousand numbers for each token - you’re passing a single integer that represents the token ID. But inside the model this is expanded into a much more inefficient representation (inefficient because it encodes some amount of information about the meaning and use of the token)[3]. So it’s not that surprising that you could do better than text tokens.
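Here's what that expansion looks like in a standard transformer stack, sketched with PyTorch: the model is handed small integer IDs, but the first thing it does is look each one up in an embedding table and turn it into a dense vector. The specific IDs below are arbitrary:

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 50_000, 1_000
embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([464, 3290, 318, 922])  # four token IDs: just 4 integers
vectors = embedding(token_ids)                   # shape (4, 1000): 4,000 floats
print(token_ids.shape, vectors.shape)
```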

Zooming out a bit, it’s plausible to me that processing text as images is closer to how the human brain works. To state the obvious, humans don’t consume text as textual content; we consume it as image content (or sometimes as audio). Maybe treating text as a sub-category of image content could unlock ways of processing text that are unavailable when you’re just consuming text content. As a toy example, emoji like :) are easily understandable as image content but require you to “already know the trick” as text content[4].

Final thoughts

Of course, AI research is full of ideas that sound promising but just don’t work that well. It sounds like you should be able to do this trick on current multimodal LLMs - particularly since many people just use them for OCR purposes anyway - but it hasn’t worked well enough to become common practice.

Could you train a new large language model on text represented as image content? It might be tricky. Training on text tokens is easy - you can simply take a string of text and ask the model to predict the next token. How do you train on an image of text?

You could break up the image into word chunks and ask the model to generate an image of the next word. But that seems to me like it’d be really slow, and tricky to check whether the model was correct or not (e.g. how do you quickly break a file into per-word chunks, how do you match the next word in the image, etc.). Alternatively, you could ask the model to output the next word as a token. But then you probably have to train the model on enough tokens so it knows how to manipulate text tokens. At some point you’re just training a normal LLM with no special “text as image” superpowers.
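For concreteness, a hedged sketch of the second scheme (image of the prefix in, next text token out) might look like this. The encoder is a placeholder rather than a real vision backbone; the point is only to show the shape of the training step:

```python
import torch
import torch.nn as nn

class ImageToNextToken(nn.Module):
    def __init__(self, vocab_size: int = 50_000, dim: int = 512):
        super().__init__()
        # Placeholder "vision encoder": flatten a small grayscale page render.
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, dim), nn.ReLU())
        self.head = nn.Linear(dim, vocab_size)   # predict the next *text* token

    def forward(self, page_image: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(page_image))

model = ImageToNextToken()
pages = torch.rand(8, 1, 64, 64)                 # batch of 8 rendered text prefixes
next_tokens = torch.randint(0, 50_000, (8,))     # ground-truth next text tokens
loss = nn.functional.cross_entropy(model(pages), next_tokens)
loss.backward()
```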


  1. AI labs are desperate for high-quality text, but only around 30% of written books have been digitized. It’s really hard to find recent data on this, but as a very rough estimate: Google Books had ~40M books in 2023, while Google estimated there to have been ~130M books in total in 2010. That comes out to roughly 30%.

  2. See Figure 13.

  3. Not to skip too far ahead, but this is one reason to think that representing a block of text tokens in a single image might not be such a great idea.

  4. Of course current LLMs can interpret these emojis. Less-toy examples: image-based LLMs might have a better feel for paragraph breaks and headings, might be better able to take a big picture view of a single page of text, and might find it easier to “skip through” large documents by skimming the start of each paragraph. Or they might not! We won’t know until somebody tries.
