Reddit Accuses ‘Data Scraper’ Companies of Stealing Its Information

Eight years ago, SerpApi, a start-up in Austin, Texas, dived headlong into the byzantine world of using robots to “scrape” Google’s search results, so it could collect information to help customers appear higher in those results.

Then OpenAI’s ChatGPT came along, kicking off an artificial intelligence revolution. As more tech companies began building A.I. chatbots to keep up, they needed large amounts of data to train their A.I. models — data that SerpApi had already gathered.

Practically overnight, a class of companies like SerpApi — known as “data scrapers” — found a new business selling data scraped from Google to companies looking to train their A.I. chatbots.

On Wednesday, the internet message board Reddit decided to fight the data scrapers. It filed a lawsuit in the U.S. District Court for the Southern District of New York claiming that four companies had stolen its data by scraping Google search results in which Reddit content appeared.

Three of those companies — SerpApi; a Lithuanian start-up, Oxylabs; and a Russian company, AWMProxy — sold data to A.I. companies like OpenAI and Meta, according to the lawsuit. The fourth company, Perplexity, is a San Francisco start-up that makes an A.I. search engine.

“Recognizing they lack permission to access the data directly from Reddit, defendants have devised a scheme to scrape the data from Google’s search results,” Reddit’s lawsuit said. “They do so by masking their identities, hiding their locations and disguising web scrapers as regular people to circumvent or bypass the technical restrictions meant to stop them. And they do it at an industrial scale.”

Reddit said it was seeking a permanent injunction against the companies, as well as financial damages, and wanted to prohibit the use or sale of any previously scraped Reddit data.

Representatives from SerpApi, Perplexity, Oxylabs and AWMProxy did not immediately respond to requests for comment.

Scraping the internet is a long-standing, if thorny, practice. In the internet’s earlier days, Google created an empire by using robots to scrape web pages and categorize them, then offering a search engine that combed through those categories to help people find the information they needed. Along the way, companies began scraping Google and selling their findings to businesses seeking to appear higher in Google search results.

The relationship between the scrapers and the scraped was seen as symbiotic. Google’s scraping could help direct web traffic to publishers’ sites. Those that scraped Google could sell that information to help web publishers build their sites in ways that made them easier for Google to surface.

“It was all the original ecosystem of the web,” said Doug Leeds, a co-founder of Really Simple Licensing, a nonprofit that works to help publishers and creators obtain compensation when A.I. uses their work. “It wasn’t necessarily a problem back then, because there was a monetization method for all the companies involved.”

Now, some feel the relationship has turned from symbiotic to parasitic. A.I. companies have used their own bots to hoover up as much information as possible without paying for the data. In response, companies like Reddit began locking down their websites to prevent A.I. companies from freely profiting off the data.

Book publishers like Simon & Schuster and news organizations like The New York Times — which has sued OpenAI and Microsoft, claiming copyright infringement — have struck deals to sell licenses to their data for millions of dollars.

Reddit, which is used by more than 416 million people a week, said it believed it had particularly valuable data. Its users chat about a wide variety of topics, from makeup brands and Swiss dog breeds to role-playing video games and international travel tips. Such discussions can aid A.I. companies that are aiming to improve the “natural language” abilities of their chatbots.

In 2023, Reddit asked outsiders to begin paying for access to its data. It forged licensing deals with Google, which uses Reddit data to train its Gemini chatbot, and OpenAI, which needs data to train ChatGPT.

But not all companies wanted to sign deals. Instead, some found a way to use Reddit’s information through data scrapers, according to the lawsuit.

SerpApi, Oxylabs and AWMProxy began scraping billions of Google search queries a month and used those searches to surface Reddit data, Reddit’s lawsuit said. The companies then packaged that data and resold it to others, which used it to train their A.I. systems.

Perplexity was one of those buyers, according to Reddit’s lawsuit. Perplexity had scraped Reddit data in the past without payment but agreed to stop after Reddit sent it a cease-and-desist order. Even so, citations to Reddit data in Perplexity search results jumped “fortyfold,” the lawsuit said. In the suit, Reddit said it had spent “tens of millions of dollars” on anti-scraping systems.

“Perplexity’s business model is effectively to take Reddit’s content from Google search results,” then feed it into an A.I. model and “call it a new product,” the lawsuit said.

Reddit said it had set a trap for Perplexity by creating a “test post” on its site that could “only be crawled by Google’s search engine and was not otherwise accessible anywhere on the internet.” Within hours, Perplexity search results had surfaced the content of that test post, the lawsuit said.

Google, which is not a plaintiff in Reddit’s lawsuit, has tried and failed to stop SerpApi and other data scrapers, according to the lawsuit and previous reporting from The Information.

“Google has always actively respected the choices websites make through robots.txt, but sadly there’s a bunch of stealthy scrapers that do not,” José Castaneda, a Google spokesman, said in a statement. He was referring to robots.txt, a file that web publishers use to tell crawlers, including Google’s, which pages they may not scrape.
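For readers curious how that opt-out works in practice, here is a minimal sketch in Python using the standard library's robots.txt parser. The rules, the site address and the crawler name "ExampleScraperBot" are hypothetical and are not taken from Reddit's or Google's actual configuration. The file is only a request: a crawler has to choose to check it, which is the gap Mr. Castaneda described.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for illustration only):
# allow Google's crawler everywhere, and ask every other crawler to stay out.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from text rather than downloading them, so the
    # example runs without network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/r/travel/some-thread"  # placeholder address
    print("Googlebot allowed:", may_fetch("Googlebot", url))
    print("ExampleScraperBot allowed:", may_fetch("ExampleScraperBot", url))
```

A well-behaved crawler would run a check like this before requesting each page. Nothing in the file itself blocks a scraper that skips the check or presents itself under a different name.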

Reddit may be fighting an uphill battle. While its lawsuit was filed in New York, some data-scraping start-ups like those targeted in the suit are based in Europe and Asia. And many of those companies have found workarounds for scraping bans.

Still, Reddit plans to persist. In June, it sued Anthropic, accusing the A.I. company of unlawfully using its data. On Wednesday, the social network said in its lawsuit that it would continue taking steps to protect its data from unauthorized use.

Mike Isaac is The Times’s Silicon Valley correspondent, based in San Francisco. He covers the world’s most consequential tech companies, and how they shape culture both online and offline.

The post Reddit Accuses ‘Data Scraper’ Companies of Stealing Its Information appeared first on New York Times.