All Content from Business Insider, October 2
AI Faces a Data Shortage: Synthetic Data and Proprietary Enterprise Data Become Key

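The rapid development of artificial intelligence is running into a shortage of training data. Neema Raphael, Goldman Sachs' chief data officer, says usable data on the web is close to exhausted, which may affect how new AI systems are built. Synthetic data is currently being used to fill the gap, but it carries the risk of uneven output quality. Raphael believes that proprietary datasets held by companies, such as trading records and client interaction data, may be the real key to the data problem. Using that data effectively, however, requires a deep understanding of its business context as well as normalization. Overreliance on synthetic data also raises philosophical questions about the future direction of AI.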

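📊 The state of the AI data shortage: Neema Raphael, Goldman Sachs' chief data officer, says AI training data is nearly exhausted, which may force AI systems to be built in new ways, for example by training new models on the outputs of existing models rather than relying entirely on fresh data.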

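💡 The role and risks of synthetic data: To cope with the shortage, developers are turning to synthetic data (machine-generated text, images, code, and so on). Although the supply of synthetic data is unlimited, it risks feeding models low-quality output, or "AI slop," which could hurt the performance and creativity of AI models.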

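🔑 The importance of proprietary enterprise data: Raphael stresses that the proprietary datasets companies have accumulated internally (such as trading records and client interactions) are the key "treasure trove" for solving AI's data problem. Used correctly, this information would greatly increase the value of AI tools.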

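🚀 Challenges of using the data, and the future: Using enterprise data effectively is not only a matter of quantity; it also requires understanding the data's business context and normalizing it so that it is fit for the business to consume. Overreliance on synthetic data could lead to a "creative plateau" for AI, prompting deeper reflection on where AI is headed.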

The meteoric rise of artificial intelligence may appear unstoppable — but it's facing a shortage of training data.

"We've already run out of data," Neema Raphael, Goldman Sachs' chief data officer and head of data engineering, said on the bank's "Exchanges" podcast published on Tuesday.

Raphael said that this shortage may already be influencing how new AI systems are built.

He pointed to China's DeepSeek as an example, saying one hypothesis for its purported development costs is that it was trained on the outputs of existing models rather than on entirely new data.

"I think the real interesting thing is going to be how previous models then shape what the next iteration of the world is going to look like in this way," Raphael said.

With the web tapped out, developers are turning to synthetic data — machine-generated text, images, and code. That approach offers limitless supply, but also risks overwhelming models with low-quality output or AI slop.

However, Raphael said he doesn't think the lack of fresh data will be a massive constraint, in part because companies are sitting on untapped reserves of information.

"I think from a consumer world model, I think it's interesting we've definitely in the synthetic sort of explosion of data. But from an enterprise perspective, I think there's still a lot of juice I'd say to be squeezed in that," he said.

That means the real frontier may not be the open internet, but the proprietary datasets held by corporations. From trading flows to client interactions, firms like Goldman sit on information that could make AI tools far more valuable if harnessed correctly.

Raphael's comments come as the industry grapples with "peak data" since the breakout of ChatGPT three years ago.

In January, OpenAI cofounder Ilya Sutskever said at a conference that all the useful data online had already been used to train models, warning that AI's era of rapid development "will unquestionably end."

The next frontier: proprietary data

For businesses, Raphael stressed, the obstacle isn't just finding more data — it's ensuring that the data is usable.

"The challenge is understanding the data, understanding the business context of the data, and then being able to normalize it in a way that makes sense for the business to consume it," he said.

Still, Raphael suggested that heavy reliance on synthetic data raises a deeper question about AI's trajectory. "I think what might be interesting is people might think there might be a creative plateau," he said.

He wondered what would happen if models keep training only on machine-generated content.

"If all of the data is synthetically generated, then how much human data could then be incorporated?" he said.

"I think that'll be an interesting thing to watch from a philosophical perspective," he added.

Read the original article on Business Insider
