A New Property of Text Perplexity in Language Models and an Analysis of the Typical Set

cs.AI updates on arXiv.org, September 15

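This paper proves a new asymptotic un-equipartition property for the perplexity of long texts generated by a language model and provides supporting experimental evidence from open-source models. The result shows that the logarithmic perplexity of any large text generated by a language model must asymptotically converge to the average entropy of its token distributions, which defines a typical set. The paper then refines the typical set to include only grammatically correct texts and proves that this refined typical set is a vanishingly small subset of all possible grammatically correct texts, indicating that language models are strongly constrained in the range of their possible behaviors and outputs.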

arXiv:2405.13798v4 Announce Type: replace-cross Abstract: We prove a new asymptotic un-equipartition property for the perplexity of long texts generated by a language model and present supporting experimental evidence from open-source models. Specifically, we show that the logarithmic perplexity of any large text generated by a language model must asymptotically converge to the average entropy of its token distributions. This defines a "typical set" that all long synthetic texts generated by a language model must belong to. We refine the concept of "typical set" to include only grammatically correct texts. We then show that this refined typical set is a vanishingly small subset of all possible grammatically correct texts for a very general definition of grammar. This means that language models are strongly constrained in the range of their possible behaviors and outputs. We make no simplifying assumptions (such as stationarity) about the statistics of language model outputs, and therefore our results are directly applicable to practical real-world models without any approximations. We discuss possible applications of the typical set concept to problems such as detecting synthetic texts and membership inference in training datasets.
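The convergence claim in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's experimental setup: it replaces a real language model with a hypothetical toy function, `toy_next_token_distribution`, whose output depends on the previous token and the position (so the process is deliberately non-stationary). It samples a long text and compares the logarithmic perplexity with the average entropy of the token distributions encountered along the way. The vocabulary size, the weighting scheme, and the tolerance `eps` in `in_typical_set` are all assumptions chosen for illustration.

```python
import math
import random

VOCAB_SIZE = 8  # assumed toy vocabulary size


def toy_next_token_distribution(prev_token, position):
    """Hypothetical stand-in for a language model's next-token distribution.

    The weights depend on the previous token and on the position, so the
    resulting process is not stationary.
    """
    sharpness = 1 + (position % 2)  # alternate between flatter and peakier
    weights = [
        (1.0 + ((prev_token * 3 + tok * 5 + position) % 7)) ** sharpness
        for tok in range(VOCAB_SIZE)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sample_text_stats(num_tokens, seed=0):
    """Sample a text; return (log-perplexity, average entropy) in nats/token."""
    rng = random.Random(seed)
    prev_token = 0
    neg_log_likelihood = 0.0
    entropy_sum = 0.0
    for position in range(num_tokens):
        probs = toy_next_token_distribution(prev_token, position)
        token = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        neg_log_likelihood -= math.log(probs[token])
        entropy_sum += entropy(probs)
        prev_token = token
    return neg_log_likelihood / num_tokens, entropy_sum / num_tokens


def in_typical_set(log_perplexity, avg_entropy, eps):
    """Membership check: log-perplexity within eps of the average entropy."""
    return abs(log_perplexity - avg_entropy) <= eps


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        log_ppl, avg_h = sample_text_stats(n)
        print(n, round(log_ppl, 4), round(avg_h, 4),
              in_typical_set(log_ppl, avg_h, eps=0.05))
```

In this toy setting the gap between the two quantities should shrink as the text grows longer, which is the behavior the abstract describes for generated text. A detector for synthetic text along the lines the abstract mentions would apply a check like `in_typical_set` to a candidate text scored under a real model; how to choose `eps` in practice is not specified here.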


Related tags

language model, perplexity, typical set, grammatical correctness, text generation
