toomuchtodo 5 days ago

This sounds like a good way to take the ML/AI consumption load off Wikimedia infra?

immibis 5 days ago

The consumption load isn't the problem. You can download a complete dump of Wikipedia and even if every AI company downloaded the newest dump every time it came out, it would be a manageable server load - you know, probably double-digit terabytes per month, but that's manageable these days. And if that would be a problem, they could charge a reasonable amount to get it on a stack of BD-R discs, or heck, these companies can easily afford a leased line to Wikimedia HQ.

The problem is the non-consumptive load where they just flat-out DDoS the site for no actual reason. They should be criminally charged for that.

Late edit: Individual page loads to answer specific questions aren't a problem either. DDoS is the problem.

parpfish 5 days ago

I'd assume that AI companies would use the wiki dumps for training, but there are probably tons of bots that query Wikipedia from the web when doing some sort of web search/function call.

philipkglass 5 days ago

The raw wiki dumps contain "wikitext" markup that is significantly different from the nice readable pages you see while browsing Wikipedia.

Compare:

https://en.wikipedia.org/wiki/Transistor

with the raw markup seen in

https://en.wikipedia.org/w/index.php?title=Transistor&action...

That markup format is very hard to parse/render because it evolved organically to mean "whatever Wikipedia software does." I haven't found an independent renderer that handles all of its edge cases correctly. The new Kaggle/Wikimedia collaboration seems to solve that problem for many use cases, since the announcement says

This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines. The dataset upload, as of 15 April 2025, includes high-utility elements such as abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections (excluding references and other non-prose elements).
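
In practice that means you can consume the dump as plain JSON records. A minimal sketch in Python (the file name and field names here are assumptions; the actual schema is documented on the dataset page):

  import json

  # File and field names are assumptions; see the Kaggle dataset page
  # for the real schema of the Structured Contents JSON.
  with open("enwiki_structured_contents.jsonl", encoding="utf-8") as f:
      for line in f:
          article = json.loads(line)
          title = article.get("name")
          abstract = article.get("abstract")
          sections = article.get("sections", [])
          # hand title/abstract/sections to whatever pipeline you're testing
          print(title, len(sections), (abstract or "")[:80])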

freeone3000 5 days ago

Just run your own copy of the Wikipedia code. It’ll be cheaper than whatever inference you’re doing.

paulryanrogers 5 days ago

IDK why this was downvoted. Wikimedia wikitext can be transformed with some REs. Not exactly fast, but likely far easier than playing cat and mouse with bot blockers.
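
For anyone wondering what that looks like, a rough sketch in Python (this only covers the easy cases and won't expand templates or nested markup):

  import re

  def strip_wikitext(text):
      # Drop simple, non-nested {{...}} templates
      text = re.sub(r"\{\{[^{}]*\}\}", "", text)
      # Remove <ref .../> and <ref>...</ref> footnotes, then any leftover tags
      text = re.sub(r"<ref[^>]*?/>", "", text)
      text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)
      text = re.sub(r"<[^>]+>", "", text)
      # [[target|label]] -> label, [[target]] -> target
      text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)
      # == Heading == -> Heading
      text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.M)
      # Bold/italic quote markers
      text = re.sub(r"'{2,}", "", text)
      return text

Template-heavy content (infoboxes, citation templates) still comes out mangled, which is the part the structured JSON dump actually solves.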

jsheard 5 days ago

The bots which query in response to user prompts aren't really the issue. The really obnoxious ones just crawl the entire web aimlessly looking for training data, and wikis or git repos with huge edit histories and on-demand generated diffs are a worst case scenario for that because even if a crawler only visits each page once, there's a near-infinite number of "unique" pages to visit.

noosphr 5 days ago

You'd assume wrong.

I was at an interview for a tier-one AI lab, and the PM I was talking to refused to believe that the torrent dumps from Wikipedia were fresh and usable for training.

When you spend all your time fighting bot detection measures, it's hard to imagine someone willingly putting their data out there for free.

kmeisthax 5 days ago

As someone who has actually tried scraping Wikimedia Commons for AI training[0], they're correct only in the most literal sense. Wikitext is effectively unparseable, so just using the data dump directly is a bad idea.

The correct way to do this is to stand up a copy of MediaWiki on your own infra and then scrape that. That will give you shittons of HTML to parse and tokenize. If you can't work with that, then you're not qualified to do this kind of thing, sorry.
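
For concreteness, a minimal sketch of the HTML side in Python (requests/BeautifulSoup are just for illustration; the URL assumes your own local MediaWiki instance with the dump imported):

  import requests
  from bs4 import BeautifulSoup

  # Assumption: a local MediaWiki install serving the imported dump
  url = "http://localhost:8080/wiki/Transistor"
  html = requests.get(url, timeout=30).text

  soup = BeautifulSoup(html, "html.parser")
  content = soup.find("div", id="mw-content-text")
  if content is not None:
      # Strip non-prose elements before tokenizing
      for tag in content.find_all(["table", "style", "sup"]):
          tag.decompose()
      paragraphs = [p.get_text(" ", strip=True) for p in content.find_all("p")]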

[0] If you're wondering, I was scraping Wikimedia Commons directly from their public API, from my residential IP with my e-mail address in the UA. This was primarily out of laziness, but I believe this is the way you're "supposed" to use the API.

Yes, I did try to work with Wikitext directly, and yes that is a terrible idea.

noosphr 5 days ago

This is starting to get into the philosophical question of what training data should look like.

From the same set of interviews I made the point that the only way to meaningfully extract the semantics of a page meant for human consumption is to use a vision model that uses typesetting as a guide for structure.

The perfect example was the contract they sent, which looked completely fine but was a Word document with only WYSIWYG formatting, e.g. headings were just extra-large bold text rather than marked up as headings. If you used the programmatically extracted text as training data you'd be in trouble.
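
In other words, the best a programmatic extractor can do is guess structure from visual cues, something like this sketch (python-docx; the file name is hypothetical and the size/bold thresholds are made up):

  from docx import Document
  from docx.shared import Pt

  doc = Document("contract.docx")  # hypothetical file
  for para in doc.paragraphs:
      if not para.runs:
          continue
      # No real heading styles, so infer "headings" from formatting;
      # the Pt(14) threshold is a guess and varies per document.
      sizes = [r.font.size for r in para.runs if r.font.size is not None]
      looks_big = bool(sizes) and max(sizes) >= Pt(14)
      looks_bold = all(r.bold for r in para.runs)
      print("HEADING" if (looks_big or looks_bold) else "BODY", para.text)

Exactly the kind of heuristic that silently falls apart, hence the argument for a vision model that reads the typeset page instead.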

immibis 4 days ago

Sounds like they're breaking the CFAA and should be criminally prosecuted.

mrbungie 5 days ago

Wikimedia or someone else could offer some kind of MCP service/proxy/whatever for real-time data consumption (i.e. for use cases where the dump data is not useful enough), billed by usage.

ipaddr 5 days ago

Does any repo exist with an updated bot list to block these website-killing bots?

squigz 5 days ago

I'm confused. Are you suggesting that the AI companies actively participate in malicious DDOS campaigns against Wikimedia, for no constructive reason?

Is there a source on this?

kbelder 4 days ago

Not maliciousness. Incompetence.

Bot traffic is notoriously stupid: reloading the same pages over and over, surging one hour and then gone the next, getting stuck in loops, not understanding HTTP response codes... It's only gotten worse with all the AI scrapers. Somehow, they seem even more poorly written than the search engine bots.

immibis 4 days ago

Mine disappeared after about a week of serving them all the same dummy page on every request. They were fetching the images on the dummy page once for each time they fetched the page...

ashvardanian 5 days ago

It's a good start, but I was hoping for more data. Currently, it's only around 114 GB across 2 languages (<https://www.kaggle.com/datasets/wikimedia-foundation/wikiped...>):

  - English: "Size of uncompressed dataset: 79.57 GB chunked by max 2.15GB."
  - French: "Size of the uncompressed dataset: 34.01 GB chunked by max 2.15GB."
In 2025, the standards for ML datasets are quite high.

yorwba 5 days ago

I guess it's limited to only two languages because each version of Wikipedia has its own set of templates and they want to make sure they can render them correctly to JSON before releasing the data.

As for the size, it's small compared to the training data of most LLMs, but large relative to their context size. Probably best used for retrieval-augmented generation or similar information retrieval applications.

0cf8612b2e1e 5 days ago

I wish the Kaggle site were better. Unnecessary amounts of JS to browse a forum.

smcin 5 days ago

If enough of us report it

('Issues: Other' https://www.kaggle.com/contact#/other/issue )

they might do something about it.

astrange 5 days ago

There's not really much connection between asking someone to redo their whole website and them actually doing it. That seems like work.

smcin 2 hours ago

Also, the discussion forums aren't "their whole website". Kaggle is about running data-science notebooks and kernels on GPU/TPU(/CPU), and no one's suggesting they change that. The discussion forums (like the competition leaderboards, datasets, and submission API) are adjacent pieces to that.

smcin 4 days ago

There absolutely is, in this case. Kaggle was acquired by Google in 2017 and is a showcase for compute on Google Cloud, Google Colab, and Kaggle Kernels. Fixing the JS on their forums would be a rounding error in their budget.

(Also, FYI, I've previously posted feedback pieces in Kaggle forums that got a very warm direct response from the executives, although that was before the acquisition.)

So, for the average website, you'd be right, but not for Google Cloud/Colab's showcase property.

https://news.ycombinator.com/item?id=13822675

bk496 5 days ago

It would be cool if all the HTML tables on Wikipedia were put under individual datasets

bilsbie 5 days ago

Wasn’t this data always available?

riyanapatel 5 days ago

I like the concept of Kaggle and appreciate it - I also do agree that UI aspects hinder me from taking the time to explore its capabilities. Hoping this new partnership helps structure data for me.

BigParm 5 days ago

I was paying experts in a wide variety of industries for interviews in which I meticulously documented and organized the comprehensive role of the human in that line of work. I thought I was building a treasure chest, but it turns out nobody wants that shit.

Anyways, just a story on small-time closed data for reference.