Very, very cool -- thanks for posting! Quanta is yet again proving itself to be the very best casual science publication in the English-speaking world. This is a short one, but as always, clean, original graphics + big pictures of the participants is a winning combo IMO.
Other iterative methods, like gradient descent, converge toward the true minimum at a linear rate. Newton’s method converges toward it... at an [exponential] rate...
The rate of convergence would scale with the number of derivatives used: Just as using two derivatives allowed Newton to approach the true minimum at a quadratic rate, using three derivatives enabled the researchers to approach it at a cubic rate, and so on.
Was not aware of this reality myself from my ~undergrad ML math education. Obviously we're getting very, very good at optimizing these cheap linear algorithms, but as someone who thinks DL represents the intuitive faculties of human cognition--and thus is missing/badly-recreating our unique deliberative faculties--this gives me hope for future qualitative breakthroughs. I'm sure it's not this simple, but working with ~1500-dimension spaces (or, sometimes, in the millions!) makes this description sound pretty damn promising. I'm still confident that medium-term progress on deliberative cognition lies in the very uncool Neat/symbolic direction, but this is the first time I'm really thinking hard about what it would take to crack it the hard, more durable way (i.e. biological verisimilitude).
In terms of the paper itself (https://arxiv.org/html/2311.06374v2), this is by far the most interesting part for this novice, from section 1.2:
We prove that our algorithm is well-defined in the sense that the semidefinite programs it executes are always feasible and that the next iterate is always uniquely defined...[, and that the program] has local convergence of order d. Compared to the classical Newton method, this leads to fewer calls to the Taylor expansion oracle (a common oracle in this literature...) at the price of requiring higher-order derivatives.
I've long said math is just a collection of intellectual tools, but it's very satisfying to read what feels like an unusually flexible application of this premise, relying on "programs" that invoke "oracles". More fundamentally, the specifics of SDP are immediately promising for any among us inclined to extend connectionism to some of the workings of the cosmos itself, IMO: https://en.wikipedia.org/wiki/Semidefinite_programming

Why did you replace "quadratic" with "exponential"? Quadratic isn't exponential.
The number of correct digits in the answer grows exponentially with the iteration count: "quadratic convergence" means the error is roughly squared each iteration, so after k iterations the initial error has been raised to roughly the 2^k-th power. Equivalently, the number of correct digits roughly doubles each step.
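To make that concrete, here's a tiny sketch of my own (equation-solving rather than minimization, but the convergence story is the same): Newton's method for solving x^2 = 2. The error is roughly squared every iteration, so the exponent of the error roughly doubles each time you print it.

    from decimal import Decimal, getcontext

    getcontext().prec = 60                  # enough digits to watch the doubling
    target = Decimal(2).sqrt()

    x = Decimal(1)                          # deliberately rough starting guess
    for i in range(6):
        x = x - (x * x - 2) / (2 * x)       # Newton step: x <- x - f(x)/f'(x)
        print(i, abs(x - target))           # error: ~1e-1, 2e-3, 2e-6, 2e-12, ...

That doubling of correct digits per step is what "quadratic convergence" buys you, and it's why the article's editor rendered it as an "exponential" rate.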
People have tried applying Newton's method and various quasi-Newton methods to ML problems, and my understanding is that they aren't worth it for large problems.
In ML, you care about getting a few digits for a very large, very nonlinear, high-dimensional optimization problem. The cost function is gnarly, it's expensive to evaluate, computing derivatives is even more expensive, and there are many different optima. You don't even really care too much which optimum you get.
The theory of Newton's method (Kantorovich's theorem) says that you obtain quadratic convergence once you're in a small ball surrounding your optimum. How big this ball is depends on the smoothness of the cost function. E.g., if you try to minimize a positive definite quadratic function with Newton's method, you will converge to the minimum in one step (the basin is all of R^n), with the first order necessary equations being equivalent to solving a symmetric positive definite linear system. But if the function is more wild and crazy, the ball may be quite small; if you try to run an unadulterated Newton's method outside this ball, the iteration can easily diverge.
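A minimal sketch of both halves of that claim (my own toy functions, nothing from the thread): on a positive definite quadratic the Newton step lands on the minimizer immediately, while plain Newton applied to something as tame-looking as arctan(x) = 0 blows up if you start outside the basin.

    import numpy as np

    # Positive definite quadratic f(x) = 0.5 x^T A x - b^T x: one Newton step
    # solves the (linear) first-order conditions A x = b exactly.
    A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
    b = np.array([1.0, 2.0])
    x = np.zeros(2)
    x = x - np.linalg.solve(A, A @ x - b)    # Hessian = A, gradient = A x - b
    print(np.allclose(x, np.linalg.solve(A, b)))   # True: done in one step

    # Plain Newton on arctan(x) = 0 diverges once |x0| is too large (~1.39).
    x = 2.0
    for _ in range(5):
        x = x - np.arctan(x) * (1.0 + x * x) # x - f(x)/f'(x), f'(x) = 1/(1+x^2)
        print(x)                             # |x| grows without bound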
With this in mind, Newton's method is usually used when 1) you want lots of digits (or ALL the digits), 2) you have a good reason to believe you're within the basin of convergence, or 3) you know your function has a single global optimum and you're willing to use a damped Newton's method and wait out some linear convergence until you reach that basin. At least if you're training a neural network, none of this is true. ($)
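For what "damped" means in practice, here's a rough sketch: generic Armijo backtracking on the Newton direction, where f, grad, and hess are placeholders for whatever smooth objective you're minimizing (not anything from the paper).

    import numpy as np

    def damped_newton_step(x, f, grad, hess, c=1e-4, max_halvings=30):
        g = grad(x)
        p = np.linalg.solve(hess(x), -g)     # full Newton direction
        t = 1.0
        # shrink the step until it gives a sufficient decrease in f
        while f(x + t * p) > f(x) + c * t * (g @ p) and max_halvings > 0:
            t *= 0.5
            max_halvings -= 1
        return x + t * p                     # t = 1 recovers plain Newton

Far from the optimum the damping keeps you from overshooting, at the cost of slower, roughly linear progress; near it, t = 1 is accepted and you get the quadratic convergence back.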
That said, even if you're minimizing a nice, smooth, univariate function with exactly one optimum, the secant method will typically get the same result with fewer FLOPs. Although its order of convergence is lower (~1.618, the golden ratio), each iteration is cheaper than a Newton step, so secant delivers a target error with fewer FLOPs even though it takes more iterations. In my experience, though, the distinction between "Newton" and "secant" (i.e. quasi-Newton) in low dimensions is less clear cut: sometimes your Hessian is easy to evaluate and it's simpler to program and run Newton, and the difference between the two approaches is pretty minimal in low dimensions anyway... secant might be faster, but not that much faster.
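A rough sketch of that trade on a toy root-finding problem of my own choosing, cos(x) = x: Newton pays for one f and one f' evaluation per step, while the secant update reuses the previous f value and so costs one new evaluation per step, at the price of a few extra iterations.

    import math

    f = lambda x: math.cos(x) - x
    df = lambda x: -math.sin(x) - 1.0

    x = 1.0
    for _ in range(5):                       # Newton: one f and one f' call per step
        x = x - f(x) / df(x)
    print("newton:", x)

    a, b = 0.0, 1.0
    fa, fb = f(a), f(b)
    for _ in range(6):                       # secant: one new f call per step
        a, fa, b = b, fb, b - fb * (b - a) / (fb - fa)
        fb = f(b)
    print("secant:", b)                      # both land on ~0.7390851332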
This last point also relates to the original paper: it seems likely that this algorithm will have an even worse "iterations vs. FLOPs" tradeoff than Newton's.
Now, I think the one place where Newton really shines is interval arithmetic. There are interval Newton methods which can be used not only to compute a zero of f(x) in R^n but to prove its existence and uniqueness. For this reason, they get used in verified numerics and for things like rigorously finding all zeros of a function in a compact domain (e.g. by combining with subdivision).
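Here's a very stripped-down 1-D sketch of that idea, hand-rolled and using plain floating point rather than outward-rounded interval arithmetic, so it is not actually rigorous (a real verified-numerics library would handle the rounding): if the interval Newton image N(X) = m - f(m)/F'(X) lands strictly inside X and the derivative enclosure F'(X) excludes zero, then f has exactly one zero in X.

    def interval_newton_image(lo, hi, f, dlo, dhi):
        """One interval Newton step N(X) = m - f(m)/[dlo, dhi] for X = [lo, hi].
        Assumes the derivative enclosure [dlo, dhi] does not contain zero."""
        m = 0.5 * (lo + hi)
        fm = f(m)
        quotients = (fm / dlo, fm / dhi)     # endpoints of f(m) / F'(X)
        return m - max(quotients), m - min(quotients)

    f = lambda x: x * x - 2.0
    lo, hi = 1.0, 2.0
    n_lo, n_hi = interval_newton_image(lo, hi, f, 2.0, 4.0)  # f'([1,2]) = [2,4]
    print(n_lo, n_hi)                        # [1.375, 1.4375]
    print(lo < n_lo and n_hi < hi)           # True: unique zero (sqrt(2)) in [1,2]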
($): Although, actually, I've seen people train neural networks like this. I believe that if you have a very SMALL neural network (like, tens to thousands of parameters), you once again enter the regime of "modern" optimization and can potentially train it very accurately using Gauss-Newton.
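For flavor, a minimal Gauss-Newton sketch on a toy two-parameter "network" y = w2*tanh(w1*x) of my own invention, started near the solution in keeping with the basin caveat above; each step solves the linearized least-squares system J^T J dp = J^T r.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-2.0, 2.0, 50)
    y = 1.5 * np.tanh(0.8 * x) + 0.01 * rng.standard_normal(50)  # data to fit

    w1, w2 = 1.0, 1.0                        # start near the true (0.8, 1.5)
    for _ in range(10):
        t = np.tanh(w1 * x)
        r = y - w2 * t                       # residuals
        J = np.column_stack([w2 * (1.0 - t**2) * x, t])  # d(model)/d(w1, w2)
        dw1, dw2 = np.linalg.solve(J.T @ J, J.T @ r)     # Gauss-Newton step
        w1, w2 = w1 + dw1, w2 + dw2
    print(w1, w2)                            # roughly (0.8, 1.5)

In practice you'd add damping (Levenberg-Marquardt style) for robustness, but for tiny, well-posed fits like this the plain Gauss-Newton iteration is enough.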