Perplexity, one of the more prominent AI-driven answer engines, is under fire. It allegedly scraped articles from hundreds of local and national news outlets, Wired among them, without permission. The allegations arrive in the midst of a national conversation about the ethical implications of AI technologies. According to the complaint, Perplexity went further, ignoring sites’ no-crawl directives and employing various stealth techniques to disguise the fact that its activity was scraping.
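For context, no-crawl directives are usually published in a site’s robots.txt file, and a well-behaved crawler checks them before fetching anything. Below is a minimal sketch of that check using Python’s standard urllib.robotparser; the site URL and bot name are hypothetical, not details from the complaint.

```python
# Illustrative sketch: how a well-behaved crawler honors a site's
# no-crawl directives (robots.txt) before fetching a page.
# The site URL and user-agent token are hypothetical examples.
from urllib import robotparser

SITE = "https://example-news-site.com"   # hypothetical publisher
USER_AGENT = "ExampleBot"                # hypothetical crawler name

parser = robotparser.RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # download and parse the site's directives

article_url = f"{SITE}/2024/some-article"
if parser.can_fetch(USER_AGENT, article_url):
    print("robots.txt permits fetching this page")
else:
    print("robots.txt disallows this page; a compliant crawler stops here")
```

The protocol is voluntary by design: nothing in HTTP enforces it, which is why ignoring it is treated as a breach of web etiquette rather than a technical barrier.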
During an interview at the Disrupt 2024 conference, Perplexity’s CEO, Aravind Srinivas, struggled to articulate the company’s stance on plagiarism when questioned by TechCrunch’s Devin Coldewey. This lack of clarity has raised further concerns about the company’s practices and commitment to ethical standards in content usage.
Cloudflare, a web infrastructure and security company, has reported extensive activity from Perplexity across tens of thousands of domains, amounting to millions of requests daily. According to Cloudflare’s researchers, Perplexity’s crawlers regularly ignored blocks placed by websites, continuing to scrape “in an attempt to circumvent the website’s preferences” and trampling clearly stated web etiquette.
Cloudflare’s subsequent analysis confirmed that these requests were coming from Perplexity, at first under its declared user-agent. When that special-purpose crawler hit blocks, the company switched to a generic user-agent mimicking Google Chrome on macOS. The approach should alarm anyone concerned with ethical web scraping.
“We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked” – Cloudflare
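To make the quoted claim concrete: a crawler identifies itself through the User-Agent header it sends with each request, and many sites block bots by matching tokens in that header. The toy sketch below, with invented header strings and an invented blocklist, shows why swapping a declared crawler identity for a generic Chrome-on-macOS string defeats that kind of block.

```python
# Illustrative sketch of why user-agent-based blocking fails against a
# client that swaps its declared crawler identity for a generic browser
# string. All values here are hypothetical examples.

DECLARED_CRAWLER_UA = "ExampleBot/1.0 (+https://example.com/bot)"
GENERIC_CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)

# A site blocking by user-agent substring only catches declared crawlers.
BLOCKED_UA_TOKENS = ["ExampleBot"]

def is_blocked(user_agent: str) -> bool:
    """Return True if the user-agent matches a token on the blocklist."""
    return any(token in user_agent for token in BLOCKED_UA_TOKENS)

print(is_blocked(DECLARED_CRAWLER_UA))  # True: the declared crawler is blocked
print(is_blocked(GENERIC_CHROME_UA))    # False: the impersonated browser slips through
```

Catching that kind of impersonation requires signals beyond the header itself.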
The allegations don’t stop at unauthorized scraping. News outlets have also accused Perplexity of plagiarism, further muddying the platform’s public image. Perplexity’s spokesperson, Jesse Dwyer, dismissed Cloudflare’s blog post as a “sales pitch,” arguing that the evidence presented did not substantiate claims of wrongdoing.
Perhaps most troubling, Dwyer insisted that the screenshots Cloudflare posted in its report “clearly show that no content was ever reached.” Cloudflare’s own research says otherwise: it found the activity widespread and persistent.
“This activity was observed across tens of thousands of domains and millions of requests per day. We were able to fingerprint this crawler using a combination of machine learning and network signals.” – Cloudflare
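Cloudflare has not published the details of that fingerprinting, but the general idea of combining per-request network signals into a bot-likelihood score can be sketched as follows. This is a toy heuristic with invented signal names, weights, and thresholds, not Cloudflare’s actual method.

```python
# Toy illustration of scoring a request from several network signals.
# This is NOT Cloudflare's method; it only sketches the general idea of
# combining signals into a single bot-likelihood score. Field names,
# weights, and thresholds are invented for the example.
from dataclasses import dataclass

@dataclass
class RequestSignals:
    requests_per_minute: float   # observed rate from the same source
    honors_robots_txt: bool      # did the client fetch and respect robots.txt?
    tls_matches_browser: bool    # TLS fingerprint consistent with the claimed browser?
    asn_is_datacenter: bool      # traffic originates from a hosting provider?

def bot_score(s: RequestSignals) -> float:
    """Combine weighted signals into a 0..1 bot-likelihood score."""
    score = 0.0
    if s.requests_per_minute > 60:
        score += 0.35
    if not s.honors_robots_txt:
        score += 0.25
    if not s.tls_matches_browser:
        score += 0.25
    if s.asn_is_datacenter:
        score += 0.15
    return min(score, 1.0)

suspect = RequestSignals(
    requests_per_minute=300,
    honors_robots_txt=False,
    tls_matches_browser=False,
    asn_is_datacenter=True,
)
print(f"bot score: {bot_score(suspect):.2f}")  # 1.00 -> likely automated
```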
In light of these findings, Cloudflare has begun implementing measures against predatory scraping. Last year the company released a free tool that helps webmasters identify and block AI bots, and it has since announced a marketplace where publishers can sell AI scrapers access to their content.
Even as the evidence and backlash accumulate, Dwyer has continued to insist that the bot described in Cloudflare’s blog post “isn’t even ours.” The denial has done little to calm concerns about Perplexity’s practices, and the dispute raises unsettling questions about content ownership in an increasingly automated digital ecosystem.
The debate underscores the tension between advancing AI technology and protecting creators’ rights against unethical use of their content. As companies like Perplexity explore ways to address these concerns, they face mounting pressure to adopt transparent and responsible practices.