I think it also misses the way you can automate non-trivial tasks. For example, I am working on a project where there are tens of thousands of different datasets, each with its own metadata and structure, but the underlying data is mostly the same. Because the metadata and structure all differ, it's practically impossible to combine everything into one big dataset without a team of engineers going through each dataset and meticulously restructuring and conforming its metadata to a new monolithic schema. I don't have the money to hire that team of engineers, but I can massage LLMs into doing that work for me. These are ideal tasks for AI-type algorithms to solve. It makes me quite excited for the future, as many tasks like this could be handed to AI agents when they would otherwise be impossible to do yourself.
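Rough sketch of what I mean, with made-up field names and a placeholder call_llm instead of any particular provider's API; the LLM only proposes a field mapping, and everything it returns is treated as untrusted output:

    import json

    # Hypothetical target schema; the real monolithic schema would be larger.
    TARGET_SCHEMA = ["site_id", "timestamp", "measurement", "unit"]

    def call_llm(prompt: str) -> str:
        """Placeholder: swap in whatever LLM client/provider you actually use."""
        raise NotImplementedError

    def propose_mapping(dataset_metadata: dict) -> dict:
        # Ask the model to map this dataset's idiosyncratic field names onto the target schema.
        prompt = (
            f"Map this dataset's field names onto the target schema {TARGET_SCHEMA}.\n"
            f"Metadata: {json.dumps(dataset_metadata)}\n"
            'Reply with JSON of the form {"source_field": "target_field"}, '
            "using null where there is no match."
        )
        return json.loads(call_llm(prompt))  # untrusted output; may not even parse

    # e.g. propose_mapping({"fields": ["stn", "obs_time", "val_mm"]})

The point is that the model does the tedious per-dataset guessing, not that its guesses are trusted blindly.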
I agree, but only for situations where the probabilistic nature is acceptable. It would be the same if you had a large team of humans doing the same work. Inevitably misclassifications would occur on an ongoing basis.
Compare this to the situation where you have a team develop schemas for your datasets which can be tested and verified, and fixed in the event of errors. You can't really "fix" an LLM or human agent in that way.
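For contrast, a minimal sketch (field names made up) of what "tested, verified, and fixable" looks like with a hand-built mapping: when an assertion fails, there is one concrete line you can correct, which isn't something you can do to an LLM.

    # A hand-written mapping for one dataset; wrong entries can simply be edited.
    FIELD_MAP = {"stn": "site_id", "obs_time": "timestamp", "val_mm": "measurement"}

    def conform(record: dict) -> dict:
        # Pure lookup: the same input always produces the same output.
        return {FIELD_MAP[k]: v for k, v in record.items() if k in FIELD_MAP}

    def test_conform():
        out = conform({"stn": "A1", "obs_time": "2024-01-01T00:00", "val_mm": 3.2})
        assert out == {"site_id": "A1", "timestamp": "2024-01-01T00:00", "measurement": 3.2}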
So I feel like computing traditionally excelled at many tasks that humans couldn't do - computers are crazy fast and, as a rule, don't make mistakes. LLMs give up that speed and accuracy, becoming something more like scalable humans (their "intelligence" is debatable, though possibly a moving target - I've yet to see an LLM that I would trust more than a very junior developer). LLMs (and ML generally) will always have higher error margins; it's how they can do what they do.
Yes, but I see it as multiple steps. Perhaps the LLM solution has probabilistic issues and only gets you 80% of the way there, but that has probably already given you ideas about how to solve the problem better. In my case the problem is somewhat intractable because of the size and complexity of the way the data is stored: the issue isn't that there are ten thousand different sets of metadata, it's that the structure of that metadata is diffuse. So the first step is LLMs, and the second step is to use what they produce as the structure for building a deterministic pipeline. The LLM pass helps identify the main points of what needs to be conformed to the monolithic schema; then I will build more production-ready, deterministic pipelines. At least that is the plan. I'll write a Substack about it eventually if it works, haha.
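Roughly what step two could look like, using the same made-up names as above: the LLM-proposed mappings get reviewed once, written to a file, and from then on a plain deterministic pipeline applies them and fails loudly on anything unexpected.

    import json
    from pathlib import Path

    def load_reviewed_mappings(path: str = "reviewed_mappings.json") -> dict:
        """{dataset_id: {source_field: target_field}}, reviewed by a human once."""
        return json.loads(Path(path).read_text())

    def conform_dataset(dataset_id: str, records: list[dict], mappings: dict) -> list[dict]:
        field_map = mappings[dataset_id]
        conformed = []
        for rec in records:
            row = {field_map[k]: v for k, v in rec.items() if k in field_map}
            missing = set(field_map.values()) - set(row)
            if missing:
                # Fail loudly instead of guessing, so the mapping file can be fixed.
                raise ValueError(f"{dataset_id}: missing target fields {missing}")
            conformed.append(row)
        return conformed

No LLM in the loop at runtime, so the probabilistic part is confined to the one-off mapping step where a human can review it.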
I'm reminded of the game Factorio: Essentially the entire game loop is "Do a thing manually, then automate it, then do the higher-level thing the automation enables you to do manually, then automate that, etc etc"
So if you want to translate that: there is value in doing a processing step manually to learn how it works - but once you understand it, automation can actually benefit you, because only then are you even able to do larger, higher-level processing steps "manually" that would otherwise take an infeasible amount of time and energy.
Where I'd agree, though, is that you should never lose the basic understanding and transparency of the lower-level steps if you can avoid it in any way.