TechCrunch News, August 6
Some people are defending Perplexity after Cloudflare ‘named and shamed’ it

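Cloudflare recently accused AI search engine Perplexity of bypassing websites' blocking mechanisms to scrape data, setting off a broad debate about how AI web crawlers behave. Supporters argue that an AI agent visiting a public website on a user's behalf is no different from a human browsing and deserves the same treatment; opponents stress that site owners have the right to decide who may access their content. The episode exposes a gray area in how information is accessed on the web in the AI era. With AI traffic now exceeding human activity, two questions are pressing: how to balance AI models' demand for training data against protecting site content, and how to reliably tell AI agents apart from malicious crawlers. The new Web Bot Auth standard and the spread of AI agents both suggest that interaction on the web will face more complex challenges.

🔍 AI crawler behavior sparks controversy: Cloudflare accuses Perplexity of ignoring robots.txt rules and using a disguised browser to get around blocks and scrape site content. Some regard this as "hacker" behavior, while others argue that an AI agent visiting a public website for a user is no different from human browsing and should not be treated differently.

🌐 User intent versus site ownership: Perplexity's supporters argue that when a user asks for information from a specific website, the AI agent should be allowed to access it, just as a human using a browser would be. Site owners, however, want to earn revenue from direct traffic and advertising, so they may not want AI agents to bypass their access restrictions. This puts the agent's access in conflict with the owner's rights over the site.

🤖 New web security challenges in the AI era: As AI traffic surges, even exceeding human activity, reliably distinguishing AI agents from malicious crawlers becomes critical. The Web Bot Auth standard promoted by Cloudflare aims to identify AI requests cryptographically, but its effectiveness remains to be proven. It is a sign that web security now faces new challenges brought by AI.

📉 AI's impact on traditional web traffic: The large-scale scraping needed to train AI models could drive down traditional search engine traffic (a predicted 25% drop by 2026). The spread of AI agents may change how users get information, and site owners who block AI agents too aggressively may hurt their own business interests, creating a new standoff.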

When Cloudflare accused AI search engine Perplexity of stealthily scraping websites on Monday, while ignoring a site’s specific methods to block it, this wasn’t a clear-cut case of an AI web crawler gone wild.

Many people came to Perplexity’s defense. They argued that Perplexity accessing sites in defiance of the website owner’s wishes, while controversial, is acceptable. And this is a controversy that will certainly grow as AI agents flood the internet: Should an agent accessing a website on behalf of its user be treated like a bot? Or like a human making the same request?

Cloudflare is known for providing anti-bot crawling and other web security services to millions of websites. Essentially, Cloudflare’s test case involved setting up a new website with a new domain that had never been crawled by any bot, setting up a robots.txt file that specifically blocked Perplexity’s known AI crawling bots, and then asking Perplexity about the website’s content. And Perplexity answered the question.
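To make the robots.txt part of that test concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the user-agent tokens (`PerplexityBot`, `Perplexity-User`), the Chrome string, and the `example.com` URL are illustrative assumptions, not the actual file or values from Cloudflare's experiment.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the crawler tokens below are assumptions for
# illustration, not the file Cloudflare actually published.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative generic browser string (an assumption, not a captured header).
CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules in ROBOTS_TXT permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/page"
    for agent in ("PerplexityBot", "Perplexity-User", CHROME_UA):
        print(agent.split("/")[0], "->", is_allowed(agent, page))
```

The rules are matched against the name a client declares, and compliance is voluntary. A request that presents a generic browser string is not covered by the crawler-specific entries, so robots.txt alone cannot stop it.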

Cloudflare researchers found the AI search engine used “a generic browser intended to impersonate Google Chrome on macOS” when its web crawler itself was blocked. Cloudflare CEO Matthew Prince posted the research on X, writing, “Some supposedly ‘reputable’ AI companies act more like North Korean hackers. Time to name, shame, and hard block them.”

But many people disagreed with Prince’s assessment that this was actual bad behavior. Those defending Perplexity on sites like X and Hacker News pointed out that what Cloudflare seemed to document was the AI accessing a specific public website when its user asked about that specific website. 

“If I as a human request a website, then I should be shown the content,” one person on Hacker News wrote, adding, “why would the LLM accessing the website on my behalf be in a different legal category as my Firefox web browser?”

A Perplexity spokesperson previously denied to TechCrunch that the bots were the company’s and called Cloudflare’s blog post a sales pitch for Cloudflare. Then on Tuesday, Perplexity published a blog in its defense (and generally attacking Cloudflare), claiming the behavior was from a third-party service it uses occasionally.

But the crux of Perplexity’s post made much the same appeal as its online defenders.

“The difference between automated crawling and user-driven fetching isn’t just technical — it’s about who gets to access information on the open web,” the post said. “This controversy reveals that Cloudflare’s systems are fundamentally inadequate for distinguishing between legitimate AI assistants and actual threats.”

Perplexity’s accusations aren’t exactly fair, either. One argument that Prince and Cloudflare used for calling out Perplexity’s methods was that OpenAI doesn’t behave in the same way.

“OpenAI is an example of a leading AI company that follows these best practices. They respect robots.txt and do not try to evade either a robots.txt directive or a network level block. And ChatGPT Agent is signing http requests using the newly proposed open standard Web Bot Auth,” Prince wrote in his post.

Web Bot Auth is a Cloudflare-supported standard being developed by the Internet Engineering Task Force that hopes to create a cryptographic method for identifying AI agent web requests.
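The general idea of cryptographically identifying a request can be sketched as follows. This is not the Web Bot Auth wire format: it uses an HMAC from Python's standard library as a stand-in for the public-key signatures a real scheme would rely on. The key, the field layout, and every function name are hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared secret. A real scheme would more likely use an
# asymmetric key pair, so a site only needs the agent's public key.
AGENT_KEY = b"demo-secret-not-for-production"


def signature_base(method: str, authority: str, path: str, agent: str) -> bytes:
    """Build a canonical string from the request fields being vouched for."""
    return "\n".join([method.upper(), authority.lower(), path, agent]).encode()


def sign_request(method: str, authority: str, path: str, agent: str,
                 key: bytes = AGENT_KEY) -> str:
    """Agent side: compute a hex signature over the canonical request fields."""
    base = signature_base(method, authority, path, agent)
    return hmac.new(key, base, hashlib.sha256).hexdigest()


def verify_request(method: str, authority: str, path: str, agent: str,
                   signature: str, key: bytes = AGENT_KEY) -> bool:
    """Site side: recompute the signature and compare in constant time."""
    expected = sign_request(method, authority, path, agent, key)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    sig = sign_request("GET", "example.com", "/page", "example-agent")
    print(verify_request("GET", "example.com", "/page", "example-agent", sig))
    print(verify_request("GET", "example.com", "/page", "other-agent", sig))
```

Unlike a user-agent string, which any client can copy, a signature of this kind can only be produced by a holder of the key, so a site can check who is asking instead of trusting a self-declared name.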

The debate comes as bot activity reshapes the internet. As TechCrunch has previously reported, bots seeking to scrape massive amounts of content to train AI models have become a menace, especially to smaller sites. 

For the first time in the internet’s history, bot activity is currently outstripping human activity online, with AI traffic accounting for over 50%, according to Imperva’s Bad Bot report released last month. Most of that activity is coming from LLMs. But the report also found that malicious bots now make up 37% of all internet traffic. That’s activity that includes everything from persistent scraping to unauthorized login attempts.

Until LLMs, the internet generally accepted that websites could and should block most bot activity, using CAPTCHAs and other services (such as Cloudflare), given how often it was malicious. Websites also had a clear incentive to work with specific good actors, such as Googlebot, guiding it on what not to index through robots.txt. Google indexed the internet, which sent traffic to sites.

Now, LLMs are eating an increasing amount of that traffic. Gartner predicts that search engine volume will drop by 25% by 2026. Right now humans tend to click website links from LLMs at the point when those clicks are most valuable to the website, which is when they are ready to conduct a transaction.

But if humans adopt agents as the tech industry predicts they will — to arrange our travel, book our dinner reservations, and shop for us — would websites hurt their business interests by blocking them? The debate on X captured the dilemma perfectly:

“I WANT perplexity to visit any public content on my behalf when I give it a request/task!” wrote one person in response to Cloudflare calling Perplexity out. “What if the site owners don’t want it? they just want you [to] directly visit the home, see their stuff” argued another, pointing out that the site owner who created the content wants the traffic and potential ad revenue, not to let Perplexity take it.

“This is why I can’t see ‘agentic browsing’ really working — much harder problem than people think. Most website owners will just block,” a third predicted.
