Communications of the ACM - Artificial Intelligence, Sep. 25, 18:00
The Limits of Scaling Large Language Models

 

Large language models (LLMs) demonstrate remarkable capabilities across many tasks, but scaling is not a panacea. While scale has accelerated some applications (such as robotics), it has shown little effect in others (such as identifying misinformation). Blindly relying on scaling is inadvisable; we must ask where data scaling actually applies. The shape and stability of the data itself are key: structured, stable domains (such as machine translation and robotics) benefit from scaling, while dynamic, fragmented domains (such as misinformation detection) do not. In addition, the feasibility of data acquisition, the definition of data quality, and evaluation frameworks are critical. Intentional data scaling, combined with analysis of data shape, can improve training efficiency and advance AI.

📈 Scaling large language models (LLMs) has succeeded in domains such as machine translation and robotics, thanks to the structure and stability of their data; these domains see significant performance gains from data scaling.

📉 In domains with dynamic, fragmented data, such as misinformation detection, scaling has limited effect: these domains lack stable topological features, so simply adding data does not improve model performance.

🔍 The shape and stability of the data itself are the key factors in whether scaling succeeds; Topological Data Analysis helps identify the intrinsic dimensions and patterns of data, guiding assessments of where data scaling applies.

📊 The feasibility of data acquisition, the definition of data quality (accuracy, reliability, completeness, information density, timeliness, diversity), and evaluation frameworks (which must reflect real-world complexity and user satisfaction) also constrain the effectiveness of scaling.

🎯 Intentional data scaling emphasizes collecting "fit-for-purpose" data for specific applications; combined with data-shape analysis, it improves training efficiency and enables more efficient, sustainable AI development.

Large language models (LLMs) have revolutionized the AI landscape, demonstrating remarkable capabilities across a wide range of tasks. Each new model seemingly reinforces the notion that modern transformer-based AI can conquer any challenge if armed with sufficient compute and data. However, while scaling has accelerated certain applications, such as robotics, it has yet to show significant impact in others, such as identifying misinformation. We should not naively ‘throw’ scaling at every problem. Instead, we should consider what types of problems are more likely to benefit from scaling, and in particular data scaling, and shift towards more intentional data acquisition.

We argue that the shape of data [1] itself may hold valuable clues that could inform the success of data-driven scaling. For example, the presence of structural patterns and stability of data across multiple scales can help determine when data-driven scaling will be advantageous, and pinpoint where data scaling hits bounds.

Moreover, the practicalities of data acquisition impose additional constraints that we must factor into the scaling equation upfront. Factors such as availability and verifiability of quality data, complexity and resource intensity of data collection, and availability of rigorous evaluation benchmarks determine not just the effectiveness but also the viability of data-driven scaling.

Isn’t Scaling All You Need?

Scaling laws for LLMs have been driving our thinking on how to build optimal models almost since the inception of transformers [12]. In 2020, OpenAI researchers demonstrated the relationships between compute budget, model size, and dataset size for an optimal model [6]. Subsequently, researchers from Google DeepMind [5] and other frontier labs investigated the compute-optimal growth of model and dataset size under a fixed compute budget, demonstrating that model size and training data should be scaled equally. The scaling narrative suggests that this path could continue to yield improvements indefinitely, allowing LLMs to get better and better and to address more and more use cases over time. However, most scaling laws rely on the inherent assumption that we have an infinite supply of quality data. In reality, quality data, and in particular human-generated quality data, is of course finite. As models grow into hundreds of billions of parameters, we tend to run out of quality data [3].
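To make the compute-optimal trade-off concrete, here is a minimal sketch in Python, assuming the parametric loss L(N, D) = E + A/N^alpha + B/D^beta and the approximation C ≈ 6ND used in the DeepMind analysis; the constants are the approximate values fitted by Hoffmann et al. and are illustrative only, not a definitive implementation.

```python
# Sketch of Chinchilla-style compute-optimal allocation.
# Loss model: L(N, D) = E + A / N**alpha + B / D**beta, with approximate
# constants fitted by Hoffmann et al. (2022); treat them as illustrative.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def compute_optimal(C):
    """Given a compute budget C (FLOPs), with C ~= 6 * N * D, return the
    loss-minimizing parameter count N and training-token count D."""
    # Substituting D = C / (6 N) and setting dL/dN = 0 yields a
    # closed-form power law: N grows roughly as C**0.45 here, so N and D
    # scale almost equally with compute.
    k = (alpha * A / (beta * B * 6**beta)) ** (1 / (alpha + beta))
    N = k * C ** (beta / (alpha + beta))
    D = C / (6 * N)
    return N, D

for C in (1e21, 1e23, 1e25):
    N, D = compute_optimal(C)
    print(f"C={C:.0e}: N≈{N:.2e} params, D≈{D:.2e} tokens, D/N≈{D/N:.0f}")
```

The closed form follows from balancing the two decaying loss terms; with these constants the optimal token count grows slightly faster than the parameter count, which is why "just make the model bigger" without more data eventually wastes compute.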

Low-quality data includes both irrelevant information, such as duplicates and diluted content, and unreliable or meaningless data. Some amount of low-quality data—assuming we can reliably identify it—is not only useful but necessary: being able to distinguish between high-quality and low-quality data helps improve the model's ability to detect and fix mistakes. However, training on large quantities of low-quality data—abundant on the Web—without identifying it as such is likely to harm model performance and reliability. Larger models are particularly sensitive to even small amounts of unreliable data. Their exceptional ability to memorize subtle patterns backfires: they memorize even one-off outliers, resulting in undesirable outputs such as the infamous 'glue on pizza' response stemming from an ironic Reddit thread [a].

Synthetic data can help address some of the shortage of quality data in domains where we can verify the quality of the generated data automatically. Indeed, we can attribute some recent improvements in math and coding to synthetic data [10]. Math and coding are two fields where automatic verification is feasible (for example, by formalizing a math problem into a formal problem statement and then tackling it until the verifier is satisfied). However, synthetic data is very unlikely to take the place of human-generated and sensor-generated data across most domains, as many real-world nuances are difficult to simulate. Moreover, a lack of diversity and quality control in the synthetic data generation pipeline may harm model robustness and introduce bias, and overdependence on it risks model collapse [9].

Together, these practical constraints of scaling laws challenge the notion that making models larger by simply adding more training data will lead to continuous and meaningful improvements across domains and tasks.

Where Data-Driven Scaling Thrives and Stumbles

Let’s consider the advancements that large models are catalyzing in machine translation, robotics, and drug discovery. What are the common threads among these success stories? And what is missing for other use cases and AI capabilities?

Consider machine translation. Early deterministic AI translation models struggled with the contextual nuances of language, such as capturing culturally specific elements. Transformer-based language models now excel in this area due to several factors. First, the relatively static nature of language, with its abstractable rules and gradual vocabulary evolution, offered a stable foundation for model training. Second, high-quality translation data, often sourced from reputable publishers and professional translations, further enabled effective training. Expansive datasets, including non-parallel corpora, enhanced the naturalness and contextual understanding of these models.

Finally, LLMs' vastly superior capacity to understand context (for example, an entire document) led to significant gains in translation quality. As a result, LLM-based techniques have become the new standard for machine translation [2].

Robotics—from autonomous driving to factory automation—also demonstrates how data-driven scaling can outperform purely rule-based approaches. Vast sensor logs capture many of the scenarios and environments where robots operate. The “long tail” of rare events, such as extreme conditions or erratic human behavior, demands richer inputs. Higher-resolution imagery and synthetic data increasingly address these demands. As sensor and camera costs drop and real-world data becomes more abundant, these conditions create a virtuous cycle of continuous improvement and propel robotics toward greater reliability than what was possible with prior methods.

Similarly, drug discovery benefits from a vast trove of past experiments, while fundamental biological processes remain relatively stable over time. However, coverage gaps, such as underexplored compound families, pose significant challenges. Pinpointing these gaps ensures that newly gathered data remains relevant and interpretable for model training. Meanwhile, stable features that emerge across experiments highlight critical research directions, enabling more targeted and intentional data acquisition.

At the other end of the spectrum is one of AI's most formidable challenges: robust and reliable reasoning, where AI systems often fail surprisingly at tasks that are easy for humans. At the time of writing, Google's Gemini 2.0 [b] and OpenAI's o1 [c] have just raised the bar on reasoning benchmarks. And yet, OpenAI's o1 still fails on simple alterations of existing benchmarks [7] or more complex math problems [4, d]. In our discussions with various field experts, they almost unanimously agreed that this challenge is primarily rooted in model architecture and learning algorithms, rather than data-driven scaling.

Predictive Power of Data Shape

Successful use cases offer clues on where data scaling will be helpful under the current learning paradigm. We find the framework of Topological Data Analysis [1], which aims to identify the intrinsic dimensions and patterns within datasets, particularly useful. While scientists discussed the concept of the 'shape of data' as early as 2009, it remains relatively underexplored [11].

Topological features derived from data, such as its compositional and structural patterns and their evolution over time, provide insights into whether certain applications are suitable for data-driven scaling. In applications that require an understanding of connectivity and higher-order relationships within a dataset, data shape can reveal stable structures across multiple scales (that is, across different levels of granularity or abstraction) [13]. For example, translation between languages exhibits regular and persistent patterns at different scales (across sentences, paragraphs, and documents). In general, language patterns are stable over time. We know what type of data we need to expand to new languages. And while it may be challenging to acquire data for rare or spoken-only languages, it is easy to judge whether newly acquired data is what we need.
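To illustrate what "stable structures across multiple scales" can mean concretely, here is a minimal sketch of 0-dimensional persistence (connected components) on a toy point cloud, using a union-find over sorted pairwise distances. Real Topological Data Analysis pipelines use richer filtrations and dedicated libraries such as Ripser or giotto-tda; treat this as a didactic approximation only.

```python
import itertools
import math

def persistence_0d(points):
    """0-dimensional persistence sketch: grow a distance threshold and
    record the scale at which each connected component merges into an
    older one. Long-lived components indicate structure that is stable
    across scales; short-lived ones are noise."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Sort all pairwise distances (edges) — a minimal stand-in for the
    # Vietoris–Rips filtration on a point cloud.
    edges = sorted(
        (math.dist(p, q), i, j)
        for (i, p), (j, q) in itertools.combinations(enumerate(points), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:               # two components merge at scale d
            parent[rj] = ri
            deaths.append(d)
    return deaths  # merge scales; one component (born at 0) never dies

# Two well-separated clusters: one merge happens at a much larger scale
# than the others, revealing two persistent components.
cloud = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
print(sorted(persistence_0d(cloud)))
```

The gap between the early, small merge scales and the single large one is exactly the kind of persistent feature that signals stable structure; data whose merge scales are spread uniformly would offer no such foothold.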

In contrast, use cases where data lacks strong, persistent topological features or where the structure is highly fragmented or unstable over time, may not be as well suited for data scaling approaches. In particular, tasks that involve noisy, unstructured, or random data, where no clear topological or geometric patterns emerge, can be more challenging for models to handle effectively.

Journalistic fact checking and exposing misinformation is one such example. LLMs have had success on some fact-checking tasks, primarily because they understand language better and can match text against sources in a more flexible way, particularly when grounded with real-time search data. However, new misinformation techniques evolve rapidly, rendering earlier patterns less helpful. The almost infinite variety of misinformation makes the scope of the problem formidable, even for a human fact checker. It is not clear what data would effectively predict future misinformation. Uncertainty management—critical for navigating complex 'grey area' fact-checking tasks—also remains a largely unsolved problem from a technical standpoint, with limited benefit to gain from more data. Finally, understanding what information is reliable is both intuitive and contextual, and sometimes even culturally specific or subjective. Researchers are actively pursuing many approaches to identifying misinformation, but they are less focused on collecting more data as a sole solution.

Data Acquisition as Another Predictor

Beyond data shape, the feasibility of data-driven scaling is largely determined by the nature of the data-acquisition process. Intuitively, if quality data is available and accessible, the potential for scaling increases significantly. For example, as we continue to collect sensor data from autonomous cars, their reliability and performance will continue to improve.

The specific type of data to collect also plays a crucial role. For example, training on step-by-step–style content that embodies "procedural knowledge" leads to significant performance improvement in the current learning paradigm [8]. The abundance of such data in domains such as math and coding, coupled with the emergence of strong evaluation metrics, has fueled rapid progress in these areas. However, acquiring or creating procedural knowledge for other tasks remains a significant challenge.

As we discuss acquiring quality data, it is crucial to note that the very definition of data quality is nuanced. For starters, quality data needs to be accurate, reliable, and complete. Additionally, training data must be dense with useful information and offer novel insights compared to the data already available for training. High-quality data needs to be at the right level of granularity for the use case, and timely and diverse enough to provide meaningful insights. In other words, the definition of high-quality data is not universal; quality is tied to a use case and the value that the trained model delivers to the user.

Finally, we cannot fully assess the impact of data without critically looking at today’s evaluation frameworks. Our evaluation approaches need to better reflect the nuances and needs of real-world users, and focus less on relatively simplistic single-turn benchmarks. Current benchmarks test limited aspects of performance which do not necessarily translate to user value. The next generation of evaluation approaches needs to consider how AI models handle real-world complexities, measure stochastic performance, and reflect user satisfaction and economic value.

Understanding the data-acquisition challenges is a critical component of making an informed decision about a potential scaling initiative, including an ultimate assessment of whether the benefits of scaling outweigh the costs.

The Promise of Intentional Data Scaling

We should be intentional in data-driven scaling. By focusing on use cases with a strong hypothesis about the efficacy of scaling, and by collecting fit-for-purpose data based on the needs of those use cases, we can reduce the volume of data we need and make model training more efficient and sustainable. Even with existing data, intentional filtering and selection are crucial to ensure that a larger fraction of the training mixture is high quality.
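As one hedged illustration of such filtering, the sketch below drops exact duplicates and low-information documents from a candidate pool; the thresholds and the unique-word-ratio heuristic are placeholders of our own devising, not a method from the article, and production pipelines would add near-duplicate detection and model-based quality scoring.

```python
import hashlib
import re

def quality_filter(docs, min_unique_ratio=0.4, min_words=20):
    """Illustrative sketch of intentional data selection: drop exact
    duplicates and low-information documents so a larger fraction of the
    training mixture is high quality. Thresholds are placeholder
    heuristics, not tuned values."""
    seen = set()
    kept = []
    for doc in docs:
        # Exact-duplicate removal on normalized (lowercased,
        # whitespace-collapsed) text.
        norm = re.sub(r"\s+", " ", doc.lower()).strip()
        digest = hashlib.sha256(norm.encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Crude information-density proxy: ratio of unique words.
        words = norm.split()
        if len(words) < min_words:
            continue
        if len(set(words)) / len(words) < min_unique_ratio:
            continue  # highly repetitive, likely diluted content
        kept.append(doc)
    return kept
```

Even heuristics this crude remove a surprising share of web-scale noise; the point is that selection is a deliberate design decision, made per use case, rather than a by-product of crawling more.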

Figure.  Operationalizing data-shape based assessment of use cases: The table lists the types of questions that will help understand the shape of data for specific use cases. We demonstrate the application of this framework to two use cases, machine translation and understanding misinformation.

As we continue to learn how to define the shape of data, and how these dimensions impact model performance, an evolution of this approach could play a role in active learning, where models prioritize the right type of data during training via human-in-the-loop and model-in-the-loop, potentially accelerating progress even further. Moreover, the relation between topological dimensions of data and model performance is likely to provide us crucial pointers on where current learning paradigms fail, and hence inform the next generations of learning paradigms as well as relative value of various datasets.

By adopting a more intentional approach, we can build a more focused and efficient AI-powered future, using resources efficiently and paving the way for tackling complex AI challenges that require more than just data and scale.

