Mashable · 14:57, two days ago
AI Training Data Sourcing Dispute: Common Crawl vs. the Publishers

A detailed investigation reveals that major AI companies may be obtaining data from paywalled publications, via the Common Crawl Foundation, to train their models. Common Crawl, a nonprofit that maintains a public archive of the web, is accused of giving AI companies a "backdoor" to paywalled content from outlets such as the New York Times. Although Common Crawl denies the accusations, insisting its data comes only from publicly accessible webpages and that its archive format is "immutable," multiple lines of evidence suggest the archive may already contain paywalled content and that content-removal requests are processed slowly. The episode deepens the conflict between the AI industry and news publishers over copyright and data use.

🔍 **AI companies obtain paywalled training data via Common Crawl**: The investigation shows that AI companies including Google, Anthropic, and OpenAI may train their models on the massive web archive provided by the Common Crawl Foundation. Common Crawl collects and publishes an enormous dataset that reportedly gives AI companies access to paywalled content from publications such as the New York Times and Wired, raising disputes over data provenance and copyright.

🌐 **Common Crawl's defense vs. its actual practices**: The Common Crawl Foundation insists its data comes only from publicly accessible webpages, and its executive director, Richard Skrenta, believes AI models should be able to access everything on the internet. The report notes, however, that Common Crawl has accepted donations from OpenAI and other AI companies and lists NVIDIA as a "collaborator," that its archive may contain paywalled content, that removing that content has been slow and of doubtful effectiveness, and that its public search tool may return misleading results.

⚖️ **Publishers' predicament and legal challenges**: By scraping publishers' content and serving it directly to readers, AI chatbots have hit journalism with a "traffic apocalypse." Many publishers have become aware of Common Crawl's activities and have tried to block its crawler through technical means, but that protects only future content. Some publishers have asked Common Crawl to remove their content, with little progress, and Common Crawl's claim that its archive format is "immutable" makes deletion a hard problem, further fueling legal disputes with AI companies.

🚫 **Slow removals and an "immutable" archive**: Although Common Crawl says it is processing publishers' removal requests, evidence suggests those requests have not been effectively fulfilled, and the archive appears not to have been modified since 2016. Executive director Skrenta explains that the storage format is "meant to be immutable," meaning data cannot be deleted once added. Moreover, Common Crawl's public search tool may mask the true scope of the archive by returning misleading results.

🤝 **The ongoing contest between AI and publishing**: The fight over the AI industry's use of copyrighted material is far from over. OpenAI and others still face multiple lawsuits from major publishers, including the New York Times and Mashable's parent company, Ziff Davis. Common Crawl's relationships with AI companies, and the way it collects and distributes data, make the dispute over data ownership and usage rights increasingly complex and contentious.

If you've ever wondered how AI companies like Google, Anthropic, OpenAI, and Meta get their training data from paywalled publishers such as the New York Times, Wired, or the Washington Post, we may finally have an answer.

In a detailed investigation for The Atlantic, reporter Alex Reisner reveals that several major AI companies have quietly partnered with the Common Crawl Foundation — a nonprofit that scrapes the web to build a massive public archive of the internet for research purposes. According to the report, Common Crawl, whose database spans multiple petabytes, has effectively opened a backdoor that allows AI companies to train their models on paywalled content from major news outlets. In a blog post published today, Common Crawl strongly denies the accusations.

The foundation’s website claims its data is collected from freely available webpages. But its executive director, Richard Skrenta, told The Atlantic he believes AI models should be able to access everything on the internet. "The robots are people too," Skrenta told The Atlantic.

AI chatbots like ChatGPT and Google Gemini have sparked a crisis for the journalism industry. These chatbots scrape information from publishers and serve it directly to readers, siphoning clicks and visitors away from those publishers. The phenomenon has been called the traffic apocalypse and the AI armageddon. (Disclosure: Ziff Davis, Mashable’s parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

As The Atlantic report states, some news publishers have become aware of Common Crawl’s activities, and a number of them have blocked the foundation’s scraper by adding an instruction to their website’s code. However, that only protects future content, not anything that has already been scraped.
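The "instruction" in question is typically a rule in a site's robots.txt file. As a sketch (the exact rules any given publisher uses are not stated in the article), a site opting out of Common Crawl's crawler, which the foundation's own blog post names CCBot, would serve something like:

```
# robots.txt at https://example-publisher.com/robots.txt
# Ask Common Crawl's crawler (CCBot) not to fetch any path on this site
User-agent: CCBot
Disallow: /
```

Note that robots.txt is a voluntary convention: it depends on the crawler honoring it, and, as the article points out, it does nothing about pages already in the archive.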

Multiple publishers have requested that Common Crawl remove their content from its archives. The foundation has stated that it’s complying, albeit slowly, due to the sheer volume of data; one organization shared with The Atlantic multiple emails from Common Crawl stating that the removal process was "50 percent, 70 percent, and then 80 percent complete." Yet Reisner found that none of those takedown requests seem to have been fulfilled — and that Common Crawl’s archives haven’t been modified since 2016.

Skrenta told The Atlantic that the file format used for storing the archives is "meant to be immutable," meaning content can’t be deleted once it’s added. However, Reisner reports that the site’s public search tool, the only non-technical way to browse Common Crawl’s archives, returns misleading results for certain domains — masking the scope of what has been scraped and stored.

Mashable reached out to Common Crawl, and a team member pointed us to a public blog post from Skrenta. In it, Skrenta denied claims that the organization misled publishers, stating that its web crawler does not bypass paywalls. He also emphasized that Common Crawl is financially independent and “not doing AI’s dirty work.”

"The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has 'lied to publishers' about our activities," the blog post says. It further states, "Our web crawler, known as CCBot, collects data from publicly accessible web pages. We do not go 'behind paywalls,' do not log in to any websites, and do not employ any method designed to evade access restrictions."

However, as Reisner reports, Common Crawl has previously received donations from OpenAI, Anthropic, and other AI-focused companies. It also lists NVIDIA as a "collaborator" on its website. Beyond collecting raw text, Reisner writes, the foundation also helps assemble and distribute AI training datasets — even hosting them for broader use.

Whatever the case, the fight over how the AI industry uses copyrighted material is far from over. OpenAI, for example, remains at the center of several lawsuits from major publishers, including the New York Times and Mashable’s parent company, Ziff Davis.
