The Race to Block OpenAI’s Scraping Bots Is Slowing Down

It’s too soon to tell how deals between AI companies and publishers will shake out. OpenAI already has one clear win, though: Its web crawlers aren’t blocked by top news outlets at the rate they once were.
The generative AI boom set off a data gold rush, and a subsequent data-protection rush (for many news websites, anyway) in which publishers sought to block AI crawlers and prevent their work from becoming training data without consent. When Apple debuted a new AI agent this summer, for example, a number of top outlets swiftly opted out of Apple’s scraping using the Robots Exclusion Protocol, or robots.txt, the file that allows webmasters to control bots. There are so many new AI bots on the scene that keeping up can feel like playing whack-a-mole.
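As an illustration, a site opting out of these crawlers might add entries like the following to its robots.txt file. This is a hypothetical sketch; GPTBot and Applebot-Extended are the user-agent tokens OpenAI and Apple document for their respective crawlers, and the sitewide disallow is just one possible policy:

```
# Block OpenAI's crawler from the entire site
User-agent: GPTBot
Disallow: /

# Opt out of Apple's AI training crawler as well
User-agent: Applebot-Extended
Disallow: /
```

A publisher could instead disallow only specific paths, or remove an entry entirely after striking a licensing deal.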
OpenAI’s GPTBot has more name recognition and is blocked more often than competitors like Google-Extended. The number of high-profile media websites using robots.txt to “disallow” OpenAI’s GPTBot rose dramatically from its launch in August 2023 until that fall, then climbed steadily (but more gradually) from November 2023 to April 2024, according to an analysis of 1,000 popular news outlets by the Ontario-based AI-detection startup Originality AI. At its peak, just over a third of the websites blocked the bot; that share has since dropped to closer to a quarter. Within a smaller pool of the most prominent news outlets, the blocking rate is still above 50 percent, but it is down from peaks earlier this year of nearly 90 percent.
But in May, after Dotdash Meredith announced a licensing deal with OpenAI, that number dropped significantly. It then dipped again at the end of May, when Vox announced its own arrangement, and again this August, when WIRED’s parent company, Condé Nast, struck a deal. The upward trend in blocking appears to be over, at least for now.
These dips make obvious sense. When companies enter into partnerships and give permission for their data to be used, they are no longer incentivized to barricade it, so it follows that they would update their robots.txt files to permit crawling; make enough deals, and the overall percentage of sites blocking crawlers will likely drop. Some outlets unblocked OpenAI’s crawlers the same day they announced a deal, like The Atlantic. Others took a few days to a few weeks, like Vox, which announced its partnership at the end of May but unblocked GPTBot on its properties in late June.
Robots.txt is not legally binding, but it has long functioned as the standard that governs web crawler behavior. For most of the internet’s existence, people running websites expected each other to abide by the file. When a WIRED investigation earlier this summer found that the AI startup Perplexity appeared to be ignoring robots.txt commands, Amazon’s cloud division launched an investigation into whether Perplexity had violated its rules. It’s not a good look to ignore robots.txt, which is likely why so many prominent AI companies, including OpenAI, explicitly state that they use it to determine what to crawl. Originality AI CEO Jon Gillham believes this adds extra urgency to OpenAI’s push to make deals. “It’s clear that OpenAI views being blocked as a threat to their future ambitions,” says Gillham.
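The compliance these companies describe can be sketched with Python’s standard-library robots.txt parser: a well-behaved crawler consults the file’s rules before fetching any URL. The robots.txt lines and URLs below are hypothetical, chosen to mirror a sitewide disallow of GPTBot:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that disallows OpenAI's GPTBot sitewide
# but says nothing about other crawlers.
rules = [
    "User-agent: GPTBot",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# A compliant crawler checks the parsed rules before each fetch.
print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

In practice a crawler would load the live file with `RobotFileParser.set_url(...)` and `read()` instead of parsing hardcoded lines; the point is that honoring or ignoring these rules is entirely a choice made in the crawler’s code.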