Lucasoato 6 hours ago

> Find the least odd prime factor of 2019^8+1

God that's absurd. The mathematical skills involved in that reasoning are very advanced; the whole process is a bit long, but that's impressive for a model that can potentially be self-hosted.

pitpatagain 5 hours ago

Also probably in the training data: https://www.quora.com/What-is-the-least-odd-prime-factor-of-...

It's a public AIME problem from 2019.

dartos 4 hours ago

People have to realize that many problems that are hard for humans are in a dataset somewhere.

zamadatix 3 hours ago

In a twofold way: 1) don't bother testing its reasoning with an example you pulled from a public data set, and 2) search the problem you think is novel first; you may find an existing answer in seconds instead of waiting up to minutes for an LLM to attempt to reproduce it.

There is an in-between measure of usefulness: take a problem you know is in the dataset, modify it to values that aren't, and measure how often the model accurately adapts to the new values in its response. This is less a test of reasoning strength and more a test of whether a given model is more useful than searching its dataset.
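One way to run that kind of check, sketched here as a hypothetical example (the helper name and the mod-16 shortcut are mine, not from the thread): compute the ground-truth answer for the modified base yourself, then compare it against whatever the model produces.

```python
# Hypothetical helper for a modified version of the AIME problem:
# "least odd prime factor of n^8 + 1" for a base n not in the training data.

def is_prime(p: int) -> bool:
    """Simple trial-division primality test; fine for small candidates."""
    if p < 2:
        return False
    d = 2
    while d * d <= p:
        if p % d == 0:
            return False
        d += 2 if d > 2 else 1
    return True

def least_odd_prime_factor_of_pow8_plus_1(n: int, limit: int = 100_000) -> int | None:
    """Smallest odd prime p < limit dividing n**8 + 1, or None if none is found.

    Any odd prime divisor p satisfies p == 1 (mod 16), because n^8 == -1 (mod p)
    forces the multiplicative order of n mod p to be exactly 16.
    """
    target = n**8 + 1
    for p in range(17, limit, 16):  # only candidates congruent to 1 mod 16
        if is_prime(p) and target % p == 0:
            return p
    return None

print(least_odd_prime_factor_of_pow8_plus_1(2019))  # 97, the original AIME answer
```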

gowld 3 hours ago

The process is only long because it babbled several useless ideas (direct factoring, direct exponentiation, Sophie Germain) before (and in the middle of) the short correct process.
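For anyone wondering, the short correct process is presumably the standard order argument for this AIME problem, sketched here for reference (not quoted from the model's trace):

```latex
% If an odd prime p divides 2019^8 + 1, then 2019^8 \equiv -1 \pmod{p},
% so the multiplicative order of 2019 mod p is exactly 16, which forces 16 \mid p-1.
\[
  2019^{8} \equiv -1 \pmod{p}
  \;\Longrightarrow\; \operatorname{ord}_p(2019) = 16
  \;\Longrightarrow\; p \equiv 1 \pmod{16}.
\]
% The smallest primes congruent to 1 mod 16 are 17 and 97. Checking both:
\[
  2019 \equiv 13 \pmod{17},\qquad 13^{2} \equiv -1 \pmod{17}
  \;\Longrightarrow\; 2019^{8} \equiv 1 \not\equiv -1 \pmod{17};
\]
\[
  2019 \equiv 79 \pmod{97},\qquad 79^{2} \equiv 33,\quad 33^{2} \equiv 22,\quad 22^{2} \equiv -1 \pmod{97}
  \;\Longrightarrow\; 2019^{8} \equiv -1 \pmod{97}.
\]
% So 97 is the least odd prime factor.
```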

Vetch 3 hours ago

I think it's exploring in-context. Bringing up related ideas and not getting confused by them is pivotal to these models eventually being able to contribute as productive reasoners. These traces will be immediately helpful in a real-world iterative loop where you don't already know the answers or how to correctly phrase the questions.

int_19h 1 hour ago

This model seems to be really good at this. It's decently smart for an LM this size, but more importantly, it can reliably catch its own bullshit and course-correct. And it keeps hammering at the problem until it actually has a working solution even if it takes many tries. It's like a not particularly bright but very persistent intern. Which, honestly, is probably what we want these models to be.