www.techspot.com
Editor's take: AI bots have recently become the scourge of websites dealing with written content and other media. From Wikipedia to the humble personal blog, no one is safe from the network sledgehammer wielded by OpenAI and other tech giants in search of fresh content to feed their AI models.

The Wikimedia Foundation, the nonprofit organization hosting Wikipedia and other widely popular websites, is raising concerns about AI scraper bots and their impact on the foundation's internet bandwidth. Demand for content hosted on Wikimedia servers has grown significantly since the beginning of 2024, with AI companies actively consuming an overwhelming amount of traffic to train their products.

Wikimedia projects, which include some of the largest collections of knowledge and freely accessible media on the internet, are used by billions of people worldwide. Wikimedia Commons alone hosts 144 million images, videos, and other freely licensed files, and it is suffering especially from the unregulated crawling activity of AI bots.

The Wikimedia Foundation has experienced a 50 percent increase in bandwidth used for multimedia downloads since January 2024, with traffic coming predominantly from bots. Automated programs are scraping the Wikimedia Commons image catalog to feed the content to AI models, the foundation states, and its infrastructure isn't built to endure this kind of parasitic traffic.

Wikimedia's team saw clear evidence of the effects of AI scraping in December 2024, when former US President Jimmy Carter passed away and millions of viewers accessed his page on the English edition of Wikipedia.
The 2.8 million people reading the president's bio and accomplishments were "manageable," the team said, but many users were also streaming the 1.5-hour video of Carter's 1980 debate with Ronald Reagan. The resulting doubling of normal network traffic congested a small number of Wikipedia's connection routes to the internet for around an hour. Wikimedia's Site Reliability team was able to reroute traffic and restore access, but the hiccup shouldn't have happened in the first place.

By examining the bandwidth issue during a system migration, Wikimedia found that at least 65 percent of the most resource-intensive traffic came from bots, which pass through the cache infrastructure and hit Wikimedia's "core" data center directly.

The organization is working to address this new kind of network challenge, which now affects the entire internet as AI and tech companies actively scrape every ounce of human-made content they can find. "Delivering trustworthy content also means supporting a 'knowledge as a service' model, where we acknowledge that the whole internet draws on Wikimedia content," the organization said.

Wikimedia is promoting a more responsible approach to infrastructure access through better coordination with AI developers. Dedicated APIs could ease the bandwidth burden and make it easier to identify and act against "bad actors" in the AI industry.
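To make the "responsible access" idea concrete, here is a minimal sketch of what polite, identifiable API use looks like, as opposed to anonymous bulk scraping. The endpoint shown is the public MediaWiki Action API that Wikimedia Commons already exposes; the User-Agent string, contact address, and one-request-per-second rate limit are illustrative assumptions, not Wikimedia policy values.

```python
# Sketch: a polite API client that identifies itself and rate-limits its
# requests, instead of anonymously hammering Wikimedia's servers.
import time
from urllib.parse import urlencode

# Public MediaWiki Action API for Wikimedia Commons.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"

# A descriptive User-Agent with contact info (hypothetical values here)
# lets site operators identify the client and reach out rather than
# blocking it outright.
HEADERS = {"User-Agent": "ExampleDatasetBot/1.0 (contact@example.org)"}


def build_imageinfo_request(title: str) -> tuple[str, dict]:
    """Return the URL and headers for an imageinfo query on one file."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "imageinfo",
        "iiprop": "url|size|mime",
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}", HEADERS


class RateLimiter:
    """Spread requests out over time instead of bursting them."""

    def __init__(self, min_interval_s: float = 1.0):
        self.min_interval_s = min_interval_s
        self._last = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval_s:
            time.sleep(self.min_interval_s - elapsed)
        self._last = time.monotonic()


url, headers = build_imageinfo_request("File:Example.jpg")
print(url)
```

In practice the returned URL and headers would be passed to an HTTP client (for example `requests.get(url, headers=headers)`), with `RateLimiter.wait()` called before each request. The point is the shape of the behavior: self-identification plus pacing, which is exactly what dedicated, coordinated APIs make enforceable.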