nojvek 4 days ago

I’d take this bet.

> With this offer, you can tune your algorithm to my data.

One can generate a 1 GB or a 10 GB file. It is highly likely that there is some repeated pattern in there that a sliding-window search could find, enough to shave off 50-100 bytes.

Then the decompressor is essentially: at this index, expand this pattern. The compressed file excludes the byte range covered by the second copy of the pattern.

One may not always get such a pattern, but on multiple tries it’s very feasible. Payout just needs one win in 50 tries.

You could generate a 100 GB file. The bigger the file, the higher the chance of a repeated pattern.

The challenge is won if compressed_file + decompressor is at least one byte smaller than the original file.

One could use a self-executing decompressor to save a few bytes of file overhead.
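
For concreteness, here is a minimal sketch of the scheme being described (my own illustration, not nojvek's code; the 10-byte window, fixed 4-byte offsets, and function names are assumptions):

    import struct

    K = 10  # window length; must exceed the 9-byte record below to save anything

    def compress(data: bytes):
        """Return (record, stripped) if some K-byte window repeats, else None."""
        seen = {}  # window -> first offset (memory-hungry for multi-GB inputs)
        for i in range(len(data) - K + 1):
            w = data[i:i + K]
            if w in seen and i >= seen[w] + K:               # non-overlapping copies only
                record = struct.pack(">IIB", seen[w], i, K)  # 4 + 4 + 1 = 9 bytes
                return record, data[:i] + data[i + K:]       # drop the second copy
            seen.setdefault(w, i)
        return None

    def decompress(record: bytes, stripped: bytes) -> bytes:
        first, second, k = struct.unpack(">IIB", record)
        return stripped[:second] + stripped[first:first + k] + stripped[second:]

Even when this finds a repeat, the 9-byte record beats a 10-byte window by only a single byte, and under the challenge terms the decompressor program itself counts toward the total, which is what the replies below pick apart.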

SideQuark 4 days ago

Your idea fails pretty easily. Simply do the math and you'll find it won't work.

One argument: if your method worked, then we could compress nearly any file. This violates the pigeonhole principle.
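
A quick counting check of that principle (my arithmetic, not part of the original comment): there are 256^n files of exactly n bytes but strictly fewer files that are shorter, so no lossless compressor can shrink every n-byte input.

    n = 8                                        # any n gives the same ratio
    exact_n = 256 ** n                           # files of exactly n bytes
    shorter = sum(256 ** k for k in range(n))    # files of 0..n-1 bytes = (256**n - 1) / 255
    print(exact_n, shorter, exact_n // shorter)  # at most ~1/255 of inputs can shrink at all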

Another example: for a 4 GB file you'd need roughly a 4-byte integer to specify where the repeat started and another for where the second copy was, so 8 bytes, plus a length byte. That 9-byte record means you need a repeated sequence of 10+ bytes for this to compress anything at all. There are 256^10 = 2^80 possible 10-byte sequences and only ~2^32 windows in a 4 GB file, so the odds that any given window recurs are around 1 in 2^48, and even counting every pair of windows the chance of any 10-byte repeat at all is only about 1 in 2^17.
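
A quick arithmetic check of those figures (my own back-of-envelope, approximate):

    N = 2 ** 32                    # ~number of 10-byte windows in a 4 GB file
    space = 256 ** 10              # = 2**80 possible 10-byte sequences
    per_window = N / space         # chance a *given* window recurs: ~1 in 2**48
    pairs = N * (N - 1) / 2        # ~2**63 window pairs (birthday paradox)
    any_repeat = pairs / space     # chance of any 10-byte repeat: ~1 in 2**17
    print(per_window, any_repeat)  # ~3.6e-15 and ~7.6e-06
    print(50 * any_repeat)         # ~4e-04: still roughly a 1-in-2600 shot over 50 tries

And that is before paying for the decompressor itself, which also counts against the total.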

Tweak methods and estimate as you wish. It fails.

alt227 3 days ago

The whole point is that Mike had control over creating the source file, and so obviously checked for chance repeated patterns and removed them before giving the file to the challenger.

On top of this, it takes more data to store the addresses of a repeated byte pattern than the pattern itself saves, so you end up creating more data than you remove.

If it were this easy, hundreds of people would have taken on the challenge and won the £5k.

crazygringo 4 days ago

As a general principle, absolutely.

In practice, I wonder what size of file we're talking about before you'd get net compression on random data 50% of the time.

I have no intuition whether it's 1 GB, 1 TB, 1 PB, or beyond.

SideQuark 4 days ago

Nope. As your file gets longer so does the data needed to specify where the repeats are. See my other estimate in this thread. It fails spectacularly.
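
A rough sketch of that scaling argument (my own back-of-envelope, taking "likely" to mean the ~50% birthday-paradox threshold): in a file of 2^m bytes, the two offsets alone cost about m/8 bytes each, so a repeat has to be longer than the record that points at it, yet a repeat that long only becomes likely in a file far larger than 2^m.

    import math

    for m in (30, 40, 50, 60):                 # ~1 GB, 1 TB, 1 PB, 1 EB files
        overhead = 2 * math.ceil(m / 8) + 1    # bytes for two offsets plus a length byte
        k_needed = overhead + 1                # repeat length that saves at least one byte
        m_required = 4 * k_needed              # file exponent where such a repeat is ~50% likely
        print(f"2**{m}-byte file: need a {k_needed}-byte repeat, "
              f"likely only at ~2**{m_required} bytes")

By this estimate the required size always outruns the actual size, so there is no file size at which the trick starts paying off, which also answers the 1 GB / 1 TB / 1 PB question above.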

Jerrrry 4 days ago

good compression = pattern eventually

maximum compression = indistinguishable from random data