mcphage 10 days ago

> Cyc grew to contain approximately 30 million assertions at a cost of $200 million and 2,000 person-years. Yet despite Lenat’s repeated predictions of imminent breakthrough, it never came.

That seems like pretty small potatoes compared to how much has been spent on LLMs these days.

Or to put it another way: if global funding for LLM development had been capped at $200m, how many of them would even exist?

gwern 10 days ago

Language models repeatedly delivered practical, real-world economic value at every step of the way from at least n-grams on. (Remember the original 'unreasonable effectiveness of data'?) The applications were humble and weren't like "write all my code for me and then immanentize the eschaton", but they were real things like spelling error detection & correction, text compressors, voice transcription boosters, embeddings for information retrieval, recommenders, knowledge graph creation (ironically enough), machine translation services, etc. In contrast, Yuxi goes through the handful of described Cyc use-cases from their entire history, and it's not impressive.
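All of those humble applications rest on the same mechanic: count word sequences in a corpus and score candidates by probability. A minimal bigram sketch (the toy corpus and add-alpha smoothing are illustrative assumptions, not anything from the thread):

```python
from collections import Counter

# Toy corpus standing in for the billions of words real systems trained on.
corpus = "the cat sat on the mat the cat ate the fish".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word, alpha=1.0):
    """Add-alpha smoothed P(word | prev); smoothing keeps unseen pairs nonzero."""
    vocab = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab)

def score(sentence):
    """Product of bigram probabilities, used to rank candidate outputs."""
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

# A spelling corrector or transcription re-ranker prefers the likelier string:
assert score("the cat sat") > score("the cat sit")
```

The same ranking trick, scaled up, is what powered the spelling correctors, transcription boosters, and translation re-rankers mentioned above.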

mcphage 9 days ago

> Remember the original 'unreasonable effectiveness of data'?

That came out in 2009, correct? I wonder how much was spent on LLMs up to that point.

> In contrast, Yuxi goes through the handful of described Cyc use-cases from their entire history, and it's not impressive.

They're also not humble. Maintain a semantic database of terrorist cells? Create a tutoring AI? These seem closer to the things that LLMs are currently being used for, with middling success, after vastly more money has been pumped into the field.

Whereas most of the uses you describe for early language models are far more humble (spelling error detection & correction, text compressors), and also a lot more successful.

Which makes me think that CYC went first for the big targets, and fell on its face, rather than spending a few decades building up more modest accomplishments. In hindsight that would have obviously been a much better strategy, but honestly—it feels like that would have been an obviously better strategy in non-hindsight as well. I don't know why CYC went that way.

gwern 9 days ago

> That came out in 2009, correct? I wonder how much was spent on LLMs up to that point.

Quite a lot. Look back at the size of the teams working on language models at IBM, Microsoft, Google, etc, and think about all the decades of language model research going back to Shannon and quantifying the entropy of English. Or the costs to produce the datasets like the Brown Corpus which were so critical. And keep in mind that a lot of the research and work is not public for language models; stuff like NSA interest is obvious, but do you know what Bob Mercer did before he vanished into the black hole of Renaissance Capital? I recently learned from a great talk (spotted by OP, as it happens) https://gwern.net/doc/psychology/linguistics/bilingual/2013-... that it was language modeling!

I can't give you an exact number, of course, but consider that the fully-loaded cost of a researcher at somewhere like IBM/MS/G is usually at least several hundred thousand dollars a year, how many decades and how many paper authors there were, and how many man-years must've been spent on now-forgotten projects in the '80s and '90s scaling to corpora of billions of words to train the n-gram language models (sometimes requiring clusters). I would have to guess it's at least hundreds of millions cumulative.

> They're also not humble.

Funnily enough, the more grandiose use-cases of LMs actually were envisioned all the way back at the beginning! In fact, there's an incredible 1943 science fiction story you've never heard of which takes language models, quite literally, as the route to a Singularity. You really have to read it to believe it: "Fifty Million Monkeys", Jones 1943 https://gwern.net/doc/fiction/science-fiction/1943-jones.pdf

> I don't know why CYC went that way.

If you read the whole OP, which I acknowledge is quite a time investment, I think Yuxi makes a good case for why Lenat culturally aimed for the 'boil the ocean' approach, how they dismissed small, incremental, easily benchmarked applications as distractions that encouraged deeply flawed paradigms, and how they could maintain that stance for so long. (Which shouldn't be too much of a surprise. Look how much traction DL critics on HN get, even now.)

mcphage 9 days ago

> Quite a lot. Look back at the size of the teams working on language models at IBM, Microsoft, Google, etc, and think about all the decades of language model research going back to Shannon and quantifying the entropy of English.

I wonder at what point the money spent on LLMs matched the $200 million that was ultimately spent on CYC.

> Funnily enough, the more grandiose use-cases of LMs actually were envisioned all the way back at the beginning!

Oh, I know—but those grandiose use cases still have yet to materialize, despite the time and money spent. But the smaller scale use cases have borne fruit.

> there's an incredible science fiction story you've never heard of which takes language models, quite literally, as the route to a Singularity, from 1943. You really have to read it to believe it: "Fifty Million Monkeys", Jones 1943

Thanks, I'll read that.

> If you read the whole OP, which I acknowledge is quite a time investment, I think Yuxi makes a good case for why Lenat culturally aimed for the 'boil the ocean' approach and how they refused to do more incremental small easily-benchmarked applications as distractions and encouraging deeply flawed paradigms and how they could maintain it for so long.

I read a chunk of it, but yeah, not the whole way through.

zozbot234 10 days ago

> That seems like pretty small potatoes compared to how much has been spent on LLMs these days.

It seems to be a pretty high cost, at more than $6 per assertion. Wikidata - the closest thing we have to a "backbone for the Semantic Web" right now - contains around 1.6 billion assertions describing 115 million real-world entities, and that's a purely volunteer project.
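The per-assertion comparison is simple arithmetic (figures taken from the thread):

```python
cyc_cost = 200e6             # dollars spent on Cyc
cyc_assertions = 30e6        # hand-written assertions
wikidata_assertions = 1.6e9  # volunteer-contributed statements

cost_per_assertion = cyc_cost / cyc_assertions
print(f"${cost_per_assertion:.2f} per Cyc assertion")  # prints $6.67
print(f"Wikidata is {wikidata_assertions / cyc_assertions:.0f}x Cyc's size")  # prints 53x
```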

og_kalu 9 days ago

Global funding would never have been capped at $200M for LMs, because they were obviously useful from the get-go and only got more useful with more investment.

Forget Cyc, forget LLMs. We abandoned symbolic AI for neural networks in NLP long before the advent of the science-fiction-esque transformer LLMs. That's how terrible it was.

It wasn't for a lack of trying either. NNs were the underdogs. Some of the greatest minds desperately wanted the symbolic approach to be a valid one and tried for literally decades, and while I wouldn't call it a 'failure', it just couldn't handle anything fuzzy outside a rigidly defined problem space. Which is unfortunate, seeing as fuzziness is exactly the kind of intelligence that actually exists in the real world.

masfuerte 10 days ago

It's funny, because AI companies are currently spending fortunes on mathematicians, physicists, chemists, software engineers, etc. to create good training data.

Maybe this money would be better spent on creating a Lenat-style ontology, but I guess we'll never know.

throwanem 10 days ago

We may. LLMs are capable, even arguably at times inventive, but lack the ability to test against ground truth; ontological reasoners can never exceed the implications of the ground truth they're given, but within that scope reason perfectly. These seem like obviously complementary strengths.
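That trade-off can be made concrete with a toy forward-chaining reasoner: it derives every consequence of its facts and rules, and nothing more (the facts and rules here are illustrative, not drawn from any real ontology):

```python
# Rules as (premises, conclusion); the knowledge base as a set of atomic facts.
rules = [
    ({"penguin"}, "bird"),
    ({"bird"}, "has_feathers"),
    ({"bird", "not_flightless"}, "can_fly"),
]
facts = {"penguin"}

# Forward chaining: keep firing rules until no new fact is derived.
changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

# Everything entailed by the ground truth is derived...
assert "has_feathers" in facts
# ...but nothing beyond it: absent 'not_flightless', the reasoner
# stays silent about 'can_fly' rather than guessing.
assert "can_fly" not in facts
```

Perfect within its ground truth, mute outside it, which is exactly the gap an LLM's fuzzy generalization could fill.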