vlovich123 5 days ago

I have a fundamental problem with the Hutter prize stating that intelligence is related to compression & then sponsoring a prize for lossless compression. Intelligence is related to lossy compression. Lossless is mainly a mechanistic act.

5
_hark 5 days ago

Say we have some dataset composed of D bytes.

Next, say I find some predictive model of the data M, where M is composed of N bytes. Furthermore, let us say that the entropy of the dataset under the model is H bytes.

Then, if N + H < D, my model compresses the data.

It doesn't matter if the model is deterministic or probabilistic: a probabilistic model can be used to (losslessly) compress a dataset with entropy coding.

One more argument for compression being equivalent to intelligence: Across many fields of statistical machine learning, there are generalization bounds which have the form:

test error <= train error + model complexity

That is, we don't expect to do any worse on new (test) data, than the sum of the train error and the model complexity (smallest compressed size). Notice that if you interpret the train error as the entropy of the data (i.e. error under a cross entropy loss), then the models which satisfy the statistical generalization bound correspond to those which best compress the data. In other words: the model which produces the shortest description of the data is the one which is expected to generalize best.

vlovich123 4 days ago

> It doesn't matter if the model is deterministic or probabilistic: a probabilistic model can be used to (losslessly) compress a dataset with entropy coding.

But if you can choose to lose information you can obviously achieve a higher compression score. That's literally what optical & auditory compression exploits. Indeed, we know people generally don't memorize the entire Wikipedia article. Rather they convert what they learn into some internally consistent story that they can then recite at any time & each time they recite it it's even differently worded (maybe memorizing some facts that help solidify the story).

Again, I have no problem with compression and decompression being equated to intelligence provided both are allowed to be lossy (or at least one facet of intelligence). That's because you get to inject structure into the stored representation that may not otherwise exist in the original data and you get to choose how to hydrate that representation. That's why LZMA isn't "more intelligent" than ZIP - the algorithm itself is "smarter" at compression but you're not getting to AGI by working on a better LZMA.

It's also why H264 and MP3 aren't intelligent either. While compression is lossy decompression is deterministic. That's why we can characterize LLMs as "more intelligent" than LZMA even though LZMA compresses losslessly better than LLMs.

_hark 4 days ago

I agree with you in spirit. I just thought you might be interested in some of the technical details regarding the relationship between compression and generalization!

I'll have a paper out next week which makes your point precise, using the language of algorithmic rate--distortion theory (lossy compression applied to algorithmic information theory + neural nets).

I think another way of understanding this is through the "Value Equivalence Principle", which points out that if we are learning a model of our environment, we don't want to model everything in full detail, we only want to model things which affect our value function, which determines how we will act. The value function, in this sense, implies a distortion function that we can define lossy compression relative to.

dooglius 5 days ago

This seems to assume that there is a tractable way to encode H efficiently, but this seems very difficult given a model that is focused on understanding the content. Ex: I can easily write a program that can do basic arithmetic, but given say a bitmap scan of elementary school math materials, such a program gives me no easy way to compress that; rather something generic like PNG (that does not know or understand the material) will far outperform.

_hark 5 days ago

Great point. This points to the related issue: what do we want to compress? Do we want to compress "the answer", here the arithmetic expression's solution, or do we want to compress the image?

You can formalize this with rate--distortion theory, by defining a distortion function that says what your objective is. That implies a well-defined complexity relative to that distortion function.

Okay to be clear, I've written a paper on exactly this topic which will be announced in a week or so. So you won't find anything on the subject yet, haha. But I use almost exactly this example.

Jerrrry 4 days ago

  >Okay to be clear, I've written a paper on exactly this topic which will be announced in a week or so. So you won't find anything on the subject yet, haha. But I use almost exactly this example.

I would use Floating Point arithmetic as the example/analogy: one trades off accuracy/precision for exactness/correct-ness when in the extremities. Answers near the more granular representations will be only be able to represented by their nearest value. If this is forsee-able, the floating point implementation can be adjusted to change where the floating point's "extremities'" are.

I used the above analogy and the following when articulating the magnitude of near-lossless-ness that large LLM's have managed to attain, especially when all of the humanities corpus is compressed into a USB flash drive; the Kolmogorov complexity re-mitted/captured is similar to a master-piece like the Mona Lisa having to be described in X many brush-strokes to achieve Y amount of fidelity.

aleph_minus_one 4 days ago

Marcus Hutter is free to sponsor a prize according to his definition of intelligence. You are free to sponsor a prize according to your intelligence definition involving a lossy compression.

Dylan16807 4 days ago

Of course he's "free to".

Are you trying to imply that people need to put up big sums of money when they want to critique someone else's definitions be taken seriously? If yes, I think that's ridiculous. If no, I can't figure out the point of your comment.

echoangle 5 days ago

Isn’t the intelligence shown by compressing lossless the scheme you use? Applying the algorithm is the easy part, the proof of intelligence is inventing the algorithm which compresses.

vlovich123 3 days ago

Yes, you are proving intelligence if you invent the algorithm which compresses. If the prize was for inventing an algorithm that could then build the lossless compression scheme itself then you'd be onto something. But the prize is for the human who invents the better compression algorithm and proof of intelligence of the human would be self-evident.

Jerrrry 3 days ago

Why are you explicitly saying "human"...

AlotOfReading 5 days ago

Hutter put up the prize as a way of getting people to approximate his AIXI algorithm, which he already considers to be the pinnacle of artificially intelligence. That's also why lossy compression isn't interesting. It'd be an approximation of an approximation and not particularly useful for hutter's purpose.

chriswarbo 4 days ago

> his AIXI algorithm, which he already considers to be the pinnacle of artificially intelligence

This paper from Hutter in 2015 https://arxiv.org/abs/1510.04931 considers AIXI to be "subjective", since the choice of Universal Turing Machine is left unspecified:

> We show that Legg-Hutter intelligence and thus balanced Pareto optimality is entirely subjective, and that every policy is Pareto optimal in the class of all computable environments. This undermines all existing optimality properties for AIXI. While it may still serve as a gold standard for AI, our results imply that AIXI is a relative theory, dependent on the choice of the UTM.

patmcc 3 days ago

Intelligence is knowing what you can lose, and losing that and no more.

Jerrrry 3 days ago

I like this definition for a "proto"-intelligent trait.

pizza 4 days ago

"If it were a fact, it wouldn't be called intelligence."

I think the other way to read it is that you're fundamentally going to have to choose (by your freedom to control your algorithm's design) which of the maps from (2^|x|) -> (2^|compress(x)|) are the ones that actually end up getting no compression (ie |compress(x)| actually > |x| - bc of pigeonhole principle) and which ones are the ones that do get to be called compressed. So you do have an implied kind of lossiness w.r.t. to what parts of the input domain are actually compressed.