It's a good start, but I was hoping for more data. Currently, it's only around 114 GB across 2 languages (<https://www.kaggle.com/datasets/wikimedia-foundation/wikiped...>):
- English: "Size of uncompressed dataset: 79.57 GB chunked by max 2.15GB."
- French: "Size of the uncompressed dataset: 34.01 GB chunked by max 2.15GB."
In 2025, the standards for ML datasets are quite high. I guess it's limited to only two languages because each language edition of Wikipedia has its own set of templates, and they presumably want to make sure they can render them correctly to JSON before releasing the data.
As for the size, it's small compared to the training data of most LLMs, but large relative to their context size. Probably best used for retrieval-augmented generation or similar information retrieval applications.
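For what it's worth, here's a minimal sketch of what that retrieval-style use could look like. It assumes each line of an uncompressed chunk is a JSON object with `name` and `abstract` fields (those field names and the filename are my guesses; check the actual schema on the Kaggle page), and it uses plain TF-IDF as a stand-in for a real embedding model plus vector store:

```python
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs, titles = [], []
# Hypothetical filename and field names -- verify against the dataset's schema.
with open("enwiki_chunk_0.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        abstract = record.get("abstract")
        if abstract:
            titles.append(record.get("name", ""))
            docs.append(abstract)

# TF-IDF index as a stand-in for an embedding store; a real RAG setup would
# swap in a neural embedding model and a vector database.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)

def retrieve(query: str, k: int = 5):
    """Return the k article titles whose abstracts best match the query."""
    scores = cosine_similarity(vectorizer.transform([query]), matrix).ravel()
    return [(titles[i], float(scores[i])) for i in scores.argsort()[::-1][:k]]

print(retrieve("history of the French Revolution"))
```

Nothing fancy, but it shows why the chunked JSON layout is convenient: each ~2 GB file can be indexed independently without touching the full dump.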