chubot 10 days ago

That’s hilarious, but at least Llama was trained on libgen, an archive of most books and publications by humanity, no? Except for the ones which were not digitized I guess

So I imagine there is a big pile of Reddit comments, Twitter messages, and libgen and arXiv PDFs in there

So there is some shit, but also painstakingly encoded knowledge (i.e. writing), and yeah, it is miraculous that LLMs are right as often as they are

cratermoon 10 days ago

libgen is far from an archive of "most" books and publications, not even close.

The most recent numbers from libgen itself are 2.4 million non-fiction books and 80 million science journal articles. The Atlantic's LibGen database, published in 2025, has 7.5 million books.[0] The publishing industry estimates that many more books are published every year, and as of 2010 Google already counted over 129 million books.[1] At best, an LLM like Llama will have 20% of all books in its training set (rough arithmetic below).

0. https://www.theatlantic.com/technology/archive/2025/03/libge...

1. https://booksearch.blogspot.com/2010/08/books-of-world-stand...
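For scale, a rough back-of-the-envelope in Python, using only the figures cited above (the denominator is Google's 2010 count, so the true total of books is higher today and these shares are generous):

    # Back-of-the-envelope coverage estimate from the figures cited above.
    # Denominator is Google's 2010 count of distinct books, so these are
    # upper bounds on how much of "all books" these collections cover.
    libgen_nonfiction = 2_400_000    # libgen's own non-fiction book count
    atlantic_db = 7_500_000          # books in The Atlantic's 2025 LibGen database
    all_books_2010 = 129_000_000     # Google's 2010 count of distinct books

    print(f"libgen non-fiction: {libgen_nonfiction / all_books_2010:.1%}")  # ~1.9%
    print(f"Atlantic LibGen DB: {atlantic_db / all_books_2010:.1%}")        # ~5.8%

Either way it lands comfortably under that 20% ceiling.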

UltraSane 10 days ago

On libgen.mx they claim to have 33,569,200 books and 84,844,242 articles

cratermoon 9 days ago

Still a long way short of "all", and falling further behind every year.

ChadNauseam 10 days ago

It's a miracle, but it's all thanks to the post-training. When you think about it, for so-called "next token predictors", LLMs talk in a way that almost nobody actually talks, with perfect spelling and punctuation. Post-training somehow gets them to predict something along the lines of what a reasonably intelligent assistant with perfect grammar would say. LLMs are probably smarter than what is exposed through their chat interface, since it's unlikely the post-training process gets them to impersonate the smartest character they'd be capable of impersonating.

chubot 10 days ago

I dunno, I actually think that, say, Claude AI SOUNDS smarter than it is right now

It has phenomenal recall. I just asked it about "SmartOS", something I knew about, vaguely, in ~2012, and it gave me a pretty darn good answer. On that particular subject, I think it probably gave a better answer than anyone I could e-mail, call, or text right now

It was significantly more informative than wikipedia - https://en.wikipedia.org/wiki/SmartOS

But I still find it easy to stump it and get it to hallucinate, which makes it seem dumb

It is like a person with good manners and a lot of memory who is quite good at comparisons (although you have to verify, which is usually fine)

But I would not say it is "smart" at coming up with new ideas or anything

I do think a key point is that a "text calculator" is doing a lot of work ... i.e. summarization and comparison are extremely useful things. They can accelerate thinking