> “Our algorithm right now is provably faster, in theory,” Ahmadi said. He’s hopeful, he added, that in 10 to 20 years, it will also be so in practice.
I don't see what a decade or two will add to this technique.
My understanding is that the proposed method is faster in the sense of sampling efficiency (of the cost function to construct the Taylor series), but not in the sense of FLOPS. The higher derivatives do not come for free.
It was nevertheless interesting to see the mathematical maneuvering that massages the Taylor series into a more digestible form and proves the error bounds.
This is highly problem dependent. Sometimes, you get second derivatives almost for free. It all depends on your cost function.
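To make that concrete, here's a rough sketch (a toy least-squares cost I made up, nothing to do with the paper): for f(x) = ½‖Ax − b‖², the Gauss–Newton approximation of the Hessian, JᵀJ, reuses the Jacobian you already need for the gradient, so the curvature information is essentially free.

```python
import numpy as np

# Toy least-squares cost f(x) = 0.5 * ||A x - b||^2 (illustrative only).
# The gradient is J^T r and the Gauss-Newton Hessian approximation is J^T J,
# so the curvature information reuses the Jacobian already needed for the
# gradient -- no extra evaluations of the cost function are required.
def gradient_and_gn_hessian(x, A, b):
    r = A @ x - b        # residual vector
    J = A                # Jacobian of the residual (constant for a linear model)
    grad = J.T @ r       # gradient of the cost
    hess = J.T @ J       # Gauss-Newton Hessian: "almost free" given J
    return grad, hess
```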
It also depends on your stopping tolerance. Computing a higher-order derivative incurs a constant extra cost per iteration, whereas a higher-order method converges faster, and not just by a constant factor. Hence, as you ask for more and more precise results, the ratio of the cost of the high-order method to that of the low-order method goes to zero.
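A toy 1-D illustration of that scaling (my own example, not from the paper): minimising f(x) = eˣ − x, gradient descent needs a number of iterations growing like log(1/tol), while Newton's grows like log(log(1/tol)), so the ratio keeps shrinking as the tolerance tightens.

```python
import math

# Toy comparison on f(x) = exp(x) - x, minimiser x* = 0 (illustrative only).
# Gradient descent converges linearly, Newton quadratically, so as the
# tolerance shrinks the iteration-count ratio keeps growing.
def iterations(step, x0, tol, max_iter=100_000):
    x, n = x0, 0
    while abs(math.exp(x) - 1.0) > tol and n < max_iter:   # stop when |f'(x)| <= tol
        x, n = step(x), n + 1
    return n

gd = lambda x: x - 0.1 * (math.exp(x) - 1.0)               # fixed-step gradient descent
newton = lambda x: x - (math.exp(x) - 1.0) / math.exp(x)   # Newton step: x - f'(x)/f''(x)

for tol in (1e-2, 1e-6, 1e-10):
    print(tol, iterations(gd, 1.0, tol), iterations(newton, 1.0, tol))
```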
The cost of Newton's method is not so much computing the second derivative anyway, but solving the resulting linear system... So for small-dimensional problems, Newton's method is still the go-to optimization method; it is stupidly efficient. (With a good line search; this partly answers your "I don't see what a decade or two will add".)
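For reference, a minimal damped-Newton sketch (the generic textbook version with a backtracking line search, not the paper's algorithm) that shows where the per-iteration cost actually sits:

```python
import numpy as np

# Minimal damped Newton with backtracking (Armijo) line search -- a generic
# textbook sketch, not the paper's algorithm.  In n dimensions the dominant
# per-iteration cost is the O(n^3) linear solve, not the Hessian evaluation.
def damped_newton(f, grad, hess, x0, tol=1e-10, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        step = np.linalg.solve(hess(x), g)                    # the expensive part
        t = 1.0
        while f(x - t * step) > f(x) - 0.5 * t * (g @ step):  # backtracking line search
            t *= 0.5
        x = x - t * step
    return x

# Small smooth convex example where Newton converges in a handful of iterations.
f    = lambda x: float(np.sum(np.exp(x) + x**2))
grad = lambda x: np.exp(x) + 2.0 * x
hess = lambda x: np.diag(np.exp(x) + 2.0)
print(damped_newton(f, grad, hess, x0=np.ones(3)))
```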
Weirdly, the original algo is showcased visually and is easy to understand even for 9th graders, while the new one... can't be, or Quanta would have shown it to us just like that?
What I'm trying to say is that if the new algo is so much better that it can't be explained visually, then perhaps simplicity was sacrificed, and simplicity is what kept the original algo going for 300 years...
The paper has some examples, and a very clear one for the univariate case[1].
Yes, weird of Quanta to not include such an example.
The illustrated example isn’t multivariate and can be drawn in 2 dimensions. The harder examples are also harder to visualise.
> My understanding is that the proposed method is faster in the sense of sampling efficiency (of the cost function to construct the Taylor series), but not in the sense of FLOPS. The higher derivatives do not come for free.
Sure, but as long as that extra cost stays cheaper than the work saved by needing fewer iterations, it's still a net win. For example, the article talks about how AI training uses gradient descent, and I'm pretty sure the gradient-descent update itself is a tiny fraction of training time compared to evaluating the math kernels in all the layers; therefore taking fewer steps should be a substantial win.
> Like the original version of Newton’s method, each iteration of this new algorithm is still computationally more expensive than methods such as gradient descent. As a result, for the moment, the new work won’t change the way self-driving cars, machine learning algorithms or air traffic control systems work. The best bet in these cases is still gradient descent.
Unfortunately not.
Each iteration is (much) more expensive, but you should be able to get to the final answer in fewer iterations.