Wikipedia Partners with Kaggle to Offer AI Data and Reduce Bot Scraping

Published: April 17, 2025

Wikipedia is one of the most visited websites in the world and a major source of information for both people and machines.

However, AI bots have been scraping its content at high volumes, putting increasing strain on Wikipedia’s servers.

To address the issue, the Wikimedia Foundation has taken a new approach. Rather than blocking bots, it is offering developers a better option. 

On April 17, 2025, the Foundation announced a partnership with Kaggle, a data science platform owned by Google.

The Partnership Offer

The partnership introduces a new, openly licensed dataset that includes structured content from Wikipedia in English and French. 

Developers can now access this dataset directly on Kaggle, which removes the need to scrape raw web pages.

The dataset is designed with machine learning in mind. It includes short summaries, article descriptions, image links, infobox data, and organized article sections. 

It does not, however, include references, audio files, or other non-text content. The data is provided as JSON, a clean and consistent format that is easy for machines to read.
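To give a rough idea of what working with this kind of structured JSON might look like, here is a minimal Python sketch. The field names (`name`, `abstract`, `description`, `image`, `infobox`, `sections`) and the sample record are assumptions made for illustration; they are not taken from the dataset's documentation, so check the actual schema on Kaggle before relying on them.

```python
import json

# Hypothetical sample record. The field names are illustrative assumptions,
# not the dataset's documented schema.
SAMPLE_LINE = json.dumps({
    "name": "Example article",
    "description": "A short article description.",
    "abstract": "A short summary of the article.",
    "image": {"content_url": "https://example.org/image.png"},
    "infobox": [{"name": "Founded", "value": "2001"}],
    "sections": [
        {"name": "History", "text": "First section text."},
        {"name": "Usage", "text": "Second section text."},
    ],
})


def summarize_record(line: str) -> dict:
    """Parse one JSON record and pull out the kinds of fields the article lists."""
    record = json.loads(line)
    return {
        "title": record.get("name", ""),
        "summary": record.get("abstract", ""),
        "description": record.get("description", ""),
        # The image may be missing, so fall back to an empty dict.
        "image_url": (record.get("image") or {}).get("content_url"),
        "infobox": {item["name"]: item["value"] for item in record.get("infobox", [])},
        "section_names": [s["name"] for s in record.get("sections", [])],
    }


if __name__ == "__main__":
    print(summarize_record(SAMPLE_LINE))
```

Because each record is already structured, a few dictionary lookups replace the HTML parsing that scraping a rendered page would require.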

What Developers Want

Many AI models rely on Wikipedia for training data, but scraping the site is inefficient and risky.

The new dataset addresses this by offering clean, ready-to-use information. It also saves time, reduces errors, and makes legal reuse simpler.

Large development teams aren’t the only ones with something to gain; smaller teams and individual researchers benefit as well. They now have access to the same kind of structured data that was once only available to large tech firms.

Benefits for Wikipedia

Wikipedia gains just as much. The new strategy reduces server load and discourages aggressive scraping. It also allows the Foundation to guide how its content is used.

Kaggle’s Role in the Collaboration

Kaggle is widely used by data scientists to host datasets, public notebooks, and competitions. By partnering with Kaggle, Wikimedia aims for broad access and easy adoption.

Brenda Flynn, Kaggle’s partnerships lead, stated:

“Kaggle is excited to play a role in keeping this data accessible, available, and useful.”

Now, anyone on Kaggle can explore, analyze, and even build projects using Wikipedia’s structured content.

Why Not Just Block the Bots?

Blocking AI bots may seem like a simpler solution, but Wikipedia was built on the idea of openness. Cutting off bots entirely would conflict with its original mission.

Instead, the Foundation chose to meet developers halfway. It offers a cleaner, more reliable option, and in return, it hopes developers will stop scraping and use the official dataset.

Lolade

Contributor & AI Expert