Zeroth Principles of AI, September 25

The Connection Between the Morris Worm and ChatGPT

The article explores a surprising connection between the Morris Worm and ChatGPT. Using a case study from the Model Free Methods Workshop, it contrasts Reductionist and Holistic approaches to spelling error detection. It argues that modern large language models (LLMs) such as GPT can detect spelling errors in many languages because they achieve Natural Language Understanding (NLU) by learning those languages, rather than relying on the language-specific dictionaries of traditional Natural Language Processing (NLP). It also presents Robert H. Morris's trigram-based spelling detection algorithm, which is language agnostic and historically significant: it learns trigram statistics from a training corpus and flags combinations never previously encountered, an idea implemented in the UNIX program typo.c.


There is a surprising connection between the Morris Worm and ChatGPT.

Model Free Methods Workshop

In my video Model Free Methods Workshop I have the students analyze four problems at a high level, solving each one in two different ways: first from a Reductionist Stance, and then from a Holistic Stance.

I knew when I held the workshop that Holism was a dirty word in academia and initially used Model Free Methods as a euphemism.

The last problem is about mind reading. But the first one is in many ways more interesting.

The first task is to write a program to detect spelling errors in text. Not correct them, just detect them.

In the workshop video I ask “What is the first question you have to ask if you are writing a spelling error catching program Reductionistically?”

The surprising (or not) answer is “You have to pick a language”.

If we are creating a spelling error detector using 20th Century NLP, then we would have word lists or dictionaries, and possibly grammars or “WordNet”. And such dictionaries are obviously language specific since they list words in the target language.
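For concreteness, here is a minimal sketch of that Reductionist detector in Python, assuming a one-word-per-line word list (the path and the tokenizer are illustrative, not from the workshop):

    # Reductionist sketch: detection is a dictionary lookup, so the word
    # list fixes the language before the project even starts.
    import re

    def load_dictionary(path):
        # e.g. a one-word-per-line list like the classic /usr/share/dict/words
        with open(path) as f:
            return {line.strip().lower() for line in f}

    def flag_unknown_words(text, dictionary):
        # Naive, English-biased tokenizer -- exactly the kind of
        # language-specific choice the Reductionist Stance forces on you.
        words = re.findall(r"[A-Za-z']+", text)
        return [w for w in words if w.lower() not in dictionary]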

Can you do this without specifying the language before you start the project? Well, modern LLMs like GPT already know dozens of languages and can effectively detect all errors in your documents. And some can rewrite them. They are not using NLP, they are using 21st Century NLU – Natural Language Understanding.

How come LLMs can understand multiple languages? They learned them. And systems that learn are Holistic. So there is the clue. Our Holistic Stance solution to detecting spelling errors is a system that is able to learn all the languages we want to use it on.

Now we can ask “What was the first and simplest Learned Language Model?”

The First Language Model

Here is an algorithm that was published many years ago.

The idea is to gather (“learn”) trigram statistics for the target language and then to flag all trigrams that have not been previously encountered (with at least some frequency) in the training corpus. This is not 100% reliable, but it can pragmatically outperform many of the old NLP approaches, and it is language agnostic, which NLP never was.
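As a sketch of that idea in Python (my own reconstruction, not the typo.c source; the min_count threshold is an illustrative parameter):

    from collections import Counter

    def trigrams(word):
        # Pad with spaces so word boundaries yield trigrams too (" th", "he ").
        padded = f" {word} "
        return [padded[i:i + 3] for i in range(len(padded) - 2)]

    def learn(corpus_words):
        # Gathering mode: accumulate trigram statistics from a training corpus.
        counts = Counter()
        for word in corpus_words:
            counts.update(trigrams(word.lower()))
        return counts

    def flag(words, counts, min_count=1):
        # Detect mode: flag any word containing a trigram seen fewer than
        # min_count times in training, i.e. never encountered, by default.
        return [w for w in words
                if any(counts[t] < min_count for t in trigrams(w.lower()))]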

We note that this algorithm belongs to the category “Statistical Methods for Natural Language Understanding”. Modern LLMs are more “context-exploiting” than they are “statistical”, but that’s another show.

An example shows how this trigram algorithm works in practice: You are about to submit your fourth paper to a yearly conference. How can you flag typos in this paper?

You can feed the three older papers you published through the program in a trigram statistics gathering mode. Then feed the fourth (new) paper through in detect mode, and it will point out all words that contain trigrams it has not seen before. This is amazingly effective. No matter how much jargon there is in your papers, if it recurs year to year, it’s not going to be flagged.
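Reusing learn and flag from the sketch above, the two modes might look like this (the file names are hypothetical):

    import re

    def words_in(path):
        # Crude tokenizer for the sketch; splits a paper into word-like tokens.
        with open(path) as f:
            return re.findall(r"\w+", f.read())

    # Gathering mode: learn trigram statistics from the three published papers.
    counts = learn(words_in("paper1.txt") + words_in("paper2.txt")
                   + words_in("paper3.txt"))

    # Detect mode: print every word in the new paper with an unseen trigram.
    for word in sorted(set(flag(words_in("paper4.txt"), counts))):
        print(word)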

The four papers could have been written in French and it would have worked anyway. The system learns some amount of the language from its corpus, which is the first three papers.

Very clever idea. Sounds straightforward. Was it ever made available as a commercial product? Yes, some of the earliest UNIX release tapes contained a program named typo.c that did exactly this. You can find the program on GitHub.

And it is the first program that I know of that learns a Model for a Human Language and then uses it to successfully accomplish a language understanding task. As such, I think it is historically significant.

The author of the program? Robert H. Morris of Multics and UNIX fame, and father of Robert Tappan Morris, professor at MIT, who is famous for creating the Morris Worm.
