philipkglass 5 days ago

The raw wiki dumps contain "wikitext" markup that is significantly different from the nice readable pages you see while browsing Wikipedia.

Compare:

https://en.wikipedia.org/wiki/Transistor

with the raw markup seen in

https://en.wikipedia.org/w/index.php?title=Transistor&action...

That markup format is very hard to parse/render because it evolved organically to mean "whatever the Wikipedia software does." I haven't found an independent renderer that handles all of its edge cases correctly. The new Kaggle/Wikimedia collaboration seems to solve that problem for many use cases, since the announcement says:

This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines. The dataset upload, as of 15 April 2025, includes high-utility elements such as abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections (excluding references and other non-prose elements).
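For illustration, consuming that dump might look something like the sketch below; the file name and field names ("name", "abstract") are guesses based on the description above, not a confirmed schema, so check the actual files.

    # Sketch: iterate over a structured-content dump, assumed to be JSON Lines
    # (one article object per line). Field names are assumptions, not the real schema.
    import json

    def iter_articles(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

    for article in iter_articles("enwiki_structured.jsonl"):
        print(article.get("name"), "-", (article.get("abstract") or "")[:80])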

freeone3000 5 days ago

Just run your own copy of the Wikipedia software (MediaWiki). It’ll be cheaper than whatever inference you’re doing.
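For example, a self-hosted MediaWiki exposes the action API's parse endpoint, which renders wikitext to HTML. A rough sketch, assuming an instance running at localhost:8080 (the URL is a placeholder):

    # Sketch: render wikitext to HTML via a self-hosted MediaWiki action API.
    # http://localhost:8080 is a placeholder for wherever your instance runs.
    import requests

    def render_wikitext(wikitext):
        resp = requests.post(
            "http://localhost:8080/api.php",
            data={
                "action": "parse",
                "text": wikitext,
                "contentmodel": "wikitext",
                "prop": "text",
                "format": "json",
                "formatversion": "2",
            },
        )
        resp.raise_for_status()
        return resp.json()["parse"]["text"]  # rendered HTML

    print(render_wikitext("'''Bold''' and [[Transistor|a link]]."))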

paulryanrogers 5 days ago

IDK why this was downvoted. Wikimedia wikitext can be transformed with some regexes. Not exactly fast, but likely far easier than playing cat and mouse with bot blockers.
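Something along these lines, as a sketch only; it handles a few common constructs and will break on nested templates, tables, and the weirder corners of wikitext:

    # Illustrative regexes for stripping common wikitext markup. Not a full parser.
    import re

    def strip_wikitext(text):
        text = re.sub(r"<ref[^>/]*/>", "", text)                       # self-closing refs
        text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)    # <ref>...</ref>
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)                     # innermost {{templates}}
        text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
        text = re.sub(r"'{2,}", "", text)                              # bold/italic quote marks
        return text

    print(strip_wikitext("A '''transistor''' is a [[semiconductor device|device]].{{citation needed}}"))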