Wikipedia is one of the most visited websites in the world and a major source of information for both people and machines.
However, AI bots have been scraping its content at high volumes, putting increasing strain on Wikipedia’s servers.
To address the issue, the Wikimedia Foundation has taken a new approach. Rather than blocking bots, it is offering developers a better option.
On April 17, 2025, the Foundation announced a partnership with Kaggle, a data science platform owned by Google.
The Partnership Offer
The partnership introduces a new, openly licensed dataset that includes structured content from Wikipedia in English and French.
Developers can now access this dataset directly on Kaggle, which removes the need to scrape raw web pages.
The dataset is designed with machine learning in mind. It includes short summaries, article descriptions, image links, infobox data, and organized article sections.
It does not, however, include references, audio files, or other non-text content. The data is provided as JSON, a clean, consistent format that is easy for machines to read.
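To give a sense of what working with structured JSON looks like compared with parsing scraped HTML, here is a minimal sketch in Python. It assumes the records arrive as one JSON object per line, and the field names (`name`, `description`, `abstract`, `sections`) and the sample records are illustrative assumptions rather than the dataset's confirmed schema; check the dataset page on Kaggle for the actual layout.

```python
import io
import json

# Hypothetical sample: two records, one JSON object per line.
# Field names and values are illustrative assumptions, not the real schema.
SAMPLE = io.StringIO(
    '{"name": "Ada Lovelace", "description": "English mathematician", '
    '"abstract": "Ada Lovelace was a mathematician and writer.", '
    '"sections": [{"name": "Early life"}]}\n'
    '{"name": "Paris", "description": "Capital of France", '
    '"abstract": "Paris is the capital of France.", "sections": []}\n'
)


def iter_articles(lines):
    """Yield one parsed article dict per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(article):
    """Return a compact view of one article, tolerating missing fields."""
    return {
        "title": article.get("name", ""),
        "description": article.get("description", ""),
        "section_count": len(article.get("sections", [])),
    }


summaries = [summarize(a) for a in iter_articles(SAMPLE)]
for s in summaries:
    print(s["title"], "|", s["description"], "|", s["section_count"], "section(s)")
```

In real use, the in-memory sample would be replaced by an open file handle for a downloaded data file. Because each record is already structured, no HTML parsing or cleanup step is needed before the text reaches a training or analysis pipeline.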
What Developers Want
Many AI models rely on Wikipedia for training data, but scraping the site is inefficient and risky.
The new dataset solves this issue by offering clean, ready-to-use information. It also saves time, reduces errors, and makes legal reuse simpler.
Smaller teams and individual researchers also stand to gain. They now have access to the same kind of structured data that was once only available to large tech firms.
Benefits for Wikipedia
Wikipedia gains just as much. The new strategy reduces server load and discourages aggressive scraping. It also allows the Foundation to guide how its content is used.
Kaggle’s Role in the Collaboration
Kaggle is widely used by data scientists to host datasets, public notebooks, and competitions. By partnering with Kaggle, Wikimedia aims to ensure broad access and easy adoption.
Brenda Flynn, Kaggle’s partnerships lead, stated:
“Kaggle is excited to play a role in keeping this data accessible, available, and useful.”
Now, anyone on Kaggle can explore, analyze, and even build projects using Wikipedia’s structured content.
Why Not Just Block the Bots?
Blocking AI bots may seem like a simpler solution, but Wikipedia was built on the idea of openness. Cutting off bots entirely would conflict with its original mission.
Instead, the Foundation chose to meet developers halfway. It offers a cleaner, more reliable option, and in return, it hopes developers will stop scraping and use the official dataset.