Mashable, Oct. 23, 16:39
Internet Archive page captures plunge, with news sites hit hardest

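The Internet Archive's Wayback Machine is an important tool for preserving the history of the internet, but recent data shows a sharp decline in the number of pages it captures, especially on news websites. According to a report, the Wayback Machine took 1.2 million snapshots of the homepages of 100 major news websites between the start of 2025 and mid-May; between mid-May and early October, that figure fell to about 148,000, a drop of 87 percent. Snapshots of CNN's homepage, for example, fell from 34,524 to 1,903. The director of the Wayback Machine attributed the decline to a breakdown in some specific archiving projects in May and said some snapshots have not yet had their index structure built. A five-month delay is unusual, however, and the Internet Archive cited only operational reasons such as "resource allocation" without giving details. News media websites have become an important historical record, and the Wayback Machine faces resource constraints in preserving them: its 2023 expenses far exceeded its revenue, and it suffered a data breach last year.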

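📉 Sharp drop in captures: After mid-May 2025, the Wayback Machine's captures of the homepages of 100 major news websites fell by 87 percent, from 1.2 million snapshots earlier in the year to about 148,000, pointing to a significant problem with its archiving capacity.

📰 News websites affected: The decline hits news websites in particular. In the digital age these sites have become an important historical record, so reduced archiving could mean that coverage of current events is lost.

🛠️ Technical and operational challenges: The director of the Wayback Machine said the gap stems partly from a breakdown in specific archiving projects and from index structures that have not yet been built. He also cited operational reasons such as "resource allocation," but the details and any long-term fix remain unclear.

💰 Resource pressure: As a nonprofit, the Internet Archive faces heavy operating costs. Its 2023 expenses were $32.7 million against revenue of only $23 million, which may have directly affected its ability to maintain and expand its archiving capacity.

🔒 Security incident: A large data breach last October took the Wayback Machine and related services offline, and full restoration took weeks. This may have further added to the organization's operational complexity and resource strain.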

The Internet Archive's Wayback Machine is an invaluable resource that does exactly what it says in the nonprofit organization's name: It archives the internet. The Internet Archive is responsible for archiving around 500 million webpages per day.

However, there has been a concerning change to the platform in recent months. According to a new report by Nieman Lab, the Internet Archive's Wayback Machine has been archiving certain websites much less lately. Even more concerning: Many of those websites are news-related.

According to the report by Nieman Lab, the Wayback Machine archived 1.2 million snapshots from 100 major news websites' homepages between Jan. 1 and May 15, 2025. Suddenly, though, in mid-May, this changed.

The Wayback Machine took only 148,628 snapshots from those same 100 news websites' homepages between May 17 and Oct. 1, 2025. That's a whopping 87 percent drop in the number of archived pages between the first four and a half months of the year and the four and a half months that followed.

CNN's homepage, for example, was archived by the Wayback Machine 34,524 times between Jan. 1 and May 15. Since then, the Wayback Machine holds only 1,903 snapshots of the homepage.
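
The percentages quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch and not part of Nieman Lab's analysis: the helper names `percent_drop` and `days_inclusive` are illustrative, the 1.2 million figure is the rounded count quoted in the report, and the per-day comparison is an added assumption to account for the two date ranges being slightly different lengths.

```python
from datetime import date


def percent_drop(before: float, after: float) -> float:
    """Return the percentage decrease from `before` to `after`."""
    return (before - after) / before * 100


def days_inclusive(start: date, end: date) -> int:
    """Count the days in a date range, including both endpoints."""
    return (end - start).days + 1


# Snapshot counts as quoted in the article (1.2 million is a rounded figure).
all_before, all_after = 1_200_000, 148_628
cnn_before, cnn_after = 34_524, 1_903

total_drop = percent_drop(all_before, all_after)
cnn_drop = percent_drop(cnn_before, cnn_after)

# Assumption: normalize by the length of each window to compare daily rates.
days_before = days_inclusive(date(2025, 1, 1), date(2025, 5, 15))
days_after = days_inclusive(date(2025, 5, 17), date(2025, 10, 1))
daily_drop = percent_drop(all_before / days_before, all_after / days_after)

print(f"All 100 homepages: {total_drop:.1f}% fewer snapshots")
print(f"CNN homepage:      {cnn_drop:.1f}% fewer snapshots")
print(f"Per-day rate:      {daily_drop:.1f}% lower ({days_before} vs. {days_after} days)")
```

With these inputs the overall drop comes out at roughly 87.6 percent, consistent with the 87 percent cited, and the CNN drop at roughly 94.5 percent. Because the two windows are nearly the same length (135 and 138 days), comparing per-day rates barely changes the picture.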

Mashable reported in July that, thanks to a new designation by California Senator Alex Padilla, the Internet Archive will join a network of more than 1,000 libraries around the country tasked with archiving government documents for public view.

Mark Graham, the director of the Wayback Machine, told Nieman Lab that "a breakdown in some specific archiving projects in May ... caused less archives to be created for some sites." According to Graham, some of the missing snapshots have just not had their index structure built yet and would be added to the Wayback Machine archive soon. 

As Nieman Lab pointed out, a five-month delay due to index issues is uncommon. According to Graham, the Internet Archive has been experiencing delays due to "various operational reasons" such as "resource allocation." The Internet Archive did not elaborate or provide any further information to Nieman Lab about the issue.

Newspapers have long been archived for the historical record. However, in the age of the internet, most newspapers, aside from the legacy media giants, have largely gone unarchived. News media websites have taken their place as the historical record. And, since 1996, the Internet Archive has taken up the responsibility of storing those webpage archives.

However, the nonprofit has seen difficulties in recent years. As Nieman Lab reports, the Internet Archive's 2023 expenses were $32.7 million. It takes a lot of resources to not only crawl the internet but store the data too. The nonprofit only brought in $23 million in revenue that same year.

In addition, the Internet Archive fell victim last October to a huge data breach that took the site, along with the Wayback Machine, offline. It took weeks for the site to be fully restored.
