chubot 10 days ago

That’s hilarious, but at least Llama was trained on libgen, an archive of most books and publications by humanity, no? Except for the ones which were not digitized I guess

So I imagine there is a big pile of Reddit comments, Twitter messages, and libgen and arXiv PDFs in there

So there is some shit, but also painstakingly encoded knowledge (i.e. writing), and yeah, it is miraculous that LLMs are right as often as they are

cratermoon 10 days ago

libgen is far from an archive of "most" books and publications, not even close.

The most recent numbers from libgen itself are 2.4 million non-fiction books and 80 million science journal articles. The Atlantic's LibGen database, published in 2025, has 7.5 million books.[0] The publishing industry estimates that many more books are published every year, and as of 2010 Google already counted over 129 million books.[1] At best, an LLM like Llama will have 20% of all books in its training set (rough arithmetic below).

0. https://www.theatlantic.com/technology/archive/2025/03/libge...

1. https://booksearch.blogspot.com/2010/08/books-of-world-stand...
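For scale, a rough back-of-the-envelope in Python, using only the figures cited above (the denominator is Google's 2010 count, so the true total of books is higher today and these shares are generous):

    # Back-of-the-envelope coverage estimate from the figures cited above.
    # Denominator is Google's 2010 count of distinct books, so these are
    # upper bounds on how much of "all books" these collections cover.
    libgen_nonfiction = 2_400_000    # libgen's own non-fiction book count
    atlantic_db = 7_500_000          # books in The Atlantic's 2025 LibGen database
    all_books_2010 = 129_000_000     # Google's 2010 count of distinct books

    print(f"libgen non-fiction: {libgen_nonfiction / all_books_2010:.1%}")  # ~1.9%
    print(f"Atlantic LibGen DB: {atlantic_db / all_books_2010:.1%}")        # ~5.8%

Either way it lands comfortably under that 20% ceiling.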

UltraSane 10 days ago

On libgen.mx they claim to have 33,569,200 books and 84,844,242 articles

cratermoon 9 days ago

Still a long way short of "all", and falling further behind every year.

ChadNauseam 10 days ago

It's a miracle, but it's all thanks to the post-training. When you think about it, for so-called "next token predictors", LLMs talk in a way that almost nobody actually talks, with perfect spelling and punctuation. Post-training somehow gets them to predict something along the lines of what a reasonably intelligent assistant with perfect grammar would say. LLMs are probably smarter than what is exposed through their chat interface, since it's unlikely the post-training process gets them to impersonate the smartest character they'd be capable of impersonating.

chubot 10 days ago

I dunno, I actually think that, say, Claude AI SOUNDS smarter than it is right now

It has phenomenal recall. I just asked it about "SmartOS", something I knew about, vaguely, in ~2012, and it gave me a pretty darn good answer. On that particular subject, I think it probably gave a better answer than anyone I could e-mail, call, or text right now

It was significantly more informative than wikipedia - https://en.wikipedia.org/wiki/SmartOS

But I still find it easy to stump it and get it to hallucinate, which makes it seem dumb

It is like a person with good manners and a lot of memory who is quite good at comparisons (although you have to verify, which is usually fine)

But I would not say it is "smart" at coming up with new ideas or anything

I do think a key point is that a "text calculator" is doing a lot of work ... i.e. summarization and comparison are extremely useful things. They can accelerate thinking