Social media platform Reddit has filed a copyright lawsuit against Perplexity, accusing the AI company of illegally scraping its data in order to train the model powering its search engine.
The complaint, filed in New York federal court on Wednesday, marks the latest legal tussle involving AI groups over the alleged use of copyrighted material.
Reddit also sued three smaller groups: Lithuanian data scraper Oxylabs UAB, former Russian botnet AWMProxy, and Texas start-up SerpApi.
Reddit claims the three groups provided data-scraping services for hoovering up copyrighted Reddit content “by masking their identities, hiding their locations, and disguising their web scrapers as regular people”.
“AI companies are locked in an arms race for quality human content, and that pressure has fuelled an industrial-scale ‘data laundering’ economy,” Ben Lee, chief legal officer at Reddit, said in a statement.
Perplexity was “a willing customer of at least one of its co-defendants”, the social media company wrote in the filing, alleging that the San Francisco-based AI group “desperately” needed to fuel its “answer engine” by scraping data through Google search results.
“We strongly disagree with Reddit’s allegations and intend to vigorously defend ourselves in court,” SerpApi said.
Two people familiar with the matter told the Financial Times that Reddit had confronted Perplexity about its alleged theft and suggested they enter discussions about a paid partnership, but that its founder Aravind Srinivas was not interested.
Reddit had also contacted Google with its concerns, asking the tech giant to investigate whether Perplexity was scraping Reddit’s proprietary data through its search engine and, if so, to work out how to prevent this, the people added.
A spokesman for Google declined to comment.
The suit adds to dozens of copyright lawsuits that have been filed against AI companies since the advent of generative AI systems, which are trained using vast amounts of text data, including content from the internet. Copyright holders have claimed their content has been used without consent or fair compensation.
Reddit, which went public in March 2024 and is known for hosting devoted online communities, has struck multimillion-dollar partnerships with Google and OpenAI allowing them to train their large language models on its content.
By contrast, Reddit alleged in the complaint that the defendants had circumvented its data protection measures to obtain its copyrighted material without permission.
Lee said Reddit was “a prime target because it’s one of the largest and most dynamic collections of human conversation ever created”.
In June, Reddit filed a similar lawsuit against Anthropic, alleging the AI start-up had scraped its platform more than 100,000 times since July 2024. Anthropic responded at the time that it “disagreed” with Reddit’s claims and would “defend ourselves vigorously”.
Perplexity and Oxylabs did not immediately respond to a request for comment. AWMProxy could not be reached for comment.