PaulDavisThe1st 1 day ago

Training LLMs to generate some internal command structure for a tool is conceptually similar to what we've already done with them, but the training data for it is essentially non-existent and would be hard to generate.

Karrot_Kream 1 day ago

My experience has been that generating structured output with zero-, one-, and few-shot prompts works quite well. We've used it at $WORK for zero-shot stuff and it's been good enough, and I've done few-shot prompting for some personal projects where it's been solid. JSON Schema-based enforcement of responses at temperature 0 works quite well.

LLMs do sometimes hallucinate their responses, but keeping the output format tightly constrained (e.g. structured dicts of booleans) reduces hallucinations, and even when they do happen, at temperature 0 the rate seems to stay under 0.1% of responses, even with zero-shot prompting. (At least with the datasets and prompts I've looked at.)

(Though yes, keep in mind that 0.1% hallucination = 99.9% correctness, which is really not that high when we're talking about high-reliability systems. For zero-shot, though, that far exceeded my expectations.)
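
To make "tightly constrained output + schema enforcement at temperature 0" concrete, here's a minimal Python sketch. The schema, field names, and call_llm wrapper are hypothetical stand-ins for whatever provider client you use; the only points it's illustrating are a flat dict of booleans, temperature 0, and rejecting anything that doesn't parse or validate:

  import json
  import jsonschema  # pip install jsonschema

  # A flat dict of booleans -- the kind of tightly constrained shape I mean.
  # Field names are made up for the example.
  SCHEMA = {
      "type": "object",
      "properties": {
          "is_spam": {"type": "boolean"},
          "contains_pii": {"type": "boolean"},
          "needs_review": {"type": "boolean"},
      },
      "required": ["is_spam", "contains_pii", "needs_review"],
      "additionalProperties": False,
  }

  def classify(message: str, call_llm) -> dict:
      """call_llm is whatever client wrapper you have; run it at temperature 0."""
      prompt = (
          "Classify the message below. Respond with JSON only, matching this "
          "schema exactly:\n" + json.dumps(SCHEMA) + "\n\nMessage: " + message
      )
      raw = call_llm(prompt, temperature=0.0)
      parsed = json.loads(raw)             # malformed JSON -> exception, retry or flag upstream
      jsonschema.validate(parsed, SCHEMA)  # wrong shape -> exception, retry or flag upstream
      return parsed

If your provider supports schema-constrained decoding you can push the enforcement server-side as well; the client-side check above is the portable fallback for the small fraction of responses that still drift.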