Is there a benchmark or eval showing why this might be a better approach than actually modeling the problem? If you're selling this to a non-ML person, I get the draw. But you'd still have to show why using these LLMs would be better than training something simpler / more lightweight.
That said, it's likely that you'll get good zero-shot performance, so the model-building phase could benefit from fine-tuning the prompt given the dataset, instead of training the underlying model itself.
Just to clarify, we're not directly using the LLMs as the "predictor" models for the task. We're making the LLMs do the modeling work for you.
For example, take the classic "house price prediction" problem. We don't use an LLM to make the predictions; we use LLMs to model the problem and write code that trains an ML model to predict house prices. That model would most likely end up being an XGBoost regressor or something like that.
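To make that concrete, here's a rough sketch of the kind of training script that might get written for a problem like this. The CSV path, column names, and hyperparameters are hypothetical placeholders, not something from the actual system:

```python
# Hypothetical sketch: train an XGBoost regressor on tabular house data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

df = pd.read_csv("houses.csv")                      # hypothetical dataset
X = df[["sqft", "bedrooms", "bathrooms"]]           # hypothetical feature columns
y = df["price"]                                     # hypothetical target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters here are illustrative defaults, not tuned values.
model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, preds):,.0f}")
```

The point is that the LLM's output is ordinary training code like the above, and the artifact you deploy is the resulting model, not the LLM itself.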
As to your point about evals, great question! We've done some testing but haven't yet carried out a systematic eval. We intend to run this on OpenAI's MLE-Bench to quantify how well it actually does at creating models.
Hope I didn't misunderstand your comment!