I wonder if it would have been possible to win the challenge legitimately?
If a randomly-generated file happened to contain some redundancy through sheer chance, you could hand-craft a compressor to take advantage of it. This compressor would not work in general for random data, but it could work for this one particular case.
It's a bet worth taking, because the payoff, 50:1 ($5,000 to $100), is pretty good. Play the game 50 times and you might get a file you could compress.
The challenge, then, would be for the person offering the bet to generate a really random file that contained no such redundancy. That might not be easy.
I think you can make some argument about why this isn't possible at 50:1 odds.
A plausible "decompressor" is at least, say, 30 or 100 bytes, so the random file needs to have 30 bytes less entropy than you expected, which happens with probability X where X << 1/50. Sum over the whole domain of reasonable decompressors, and you still don't get there.
This argument could do with more rigor, but I think it's correct. Give me 100 million to 1 odds, though, and I'll take my chances trying to brute force a compressor.
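For what it's worth, here's a rough sketch of that counting bound in Python; the 2**40 figure for the number of "plausible hand-written decompressors" is an arbitrary stand-in of mine, not something from the challenge:

    # Fraction of random n-byte files that any single fixed decompressor can
    # shrink by at least k bytes: there are only 256**(n - k) shorter inputs it
    # could decode from, so the fraction is at most 256**-k.
    def weak_fraction(k_bytes):
        return 256.0 ** -k_bytes          # e.g. 2**-240 for k = 30

    # Be generous and allow 2**40 distinct "reasonable" decompressors:
    bound = (2 ** 40) * weak_fraction(30)
    print(bound)                          # ~6e-61, nowhere near the 1/50 you'd need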
This is actually an extremely interesting question. 'Weak' files that are more easily compressible than others certainly exist, but with low probability.
For example, the all-zeros file is a member of the set of all random 3-megabyte files, and it would certainly be possible to compress it if, by great good fortune, you were lucky enough to receive it - albeit something that is unlikely to happen even once in the lifetime of the universe, given current physical theories.
Is it possible to quantify this idea of a 'weak' file more accurately?
I know very little about this, but a little googling suggests that the measure you're looking for is entropy, which has a mathematical definition: https://en.wikipedia.org/wiki/Entropy_(information_theory)
Yes, you can think of it in terms of prefix-free codes (WLOG, any uniquely decodable code). They're uniquely decodable; for things that are not uniquely decodable, you could put overlapping codes over the same symbol. If you make a matrix like this, where the rows are the bitstrings of length b and the columns are individual bits:
000 ... 000
000 ... 001
...
111 ... 110
111 ... 111
then you have 2^b rows. Suppose you look at the sub-bitstrings of length k, k < b. They all appear the same number of times if you count them wherever they appear, at any position, across the entire matrix. However, you also know for sure that if a prefix-free codeword appears in a particular row, nothing else in that row can overlap with it at its span. What does that imply? That the prefix-free codewords have a greater 'occupancy percentage' of a single row than all other sub-bitstrings. That means you must find fewer of them, on average, inside a single row.
But since we know that all sub-bitstrings appear the same number of times throughout the entire matrix, what else can we deduce? That the prefix-free codewords must appear /over more rows/ on average if they cannot appear as many times within any single row. That means they will occur as a sub-pattern in full bitstrings more often than typical random sub-patterns.
So weakness here corresponds to the presence of patterns (prefix-free codes) that are:
- non-overlapping within bitstrings
- widely distributed across bitstrings
- due to their wide distribution, there's a higher chance of encountering these patterns in any given random file
- therefore, weak files are more compressible because they contain widely-distributed, non-overlapping patterns that compression algorithms can take advantage of
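Here's a toy check of that "same total count, spread over more rows" mechanism (my own illustration, not a proof of the parent's claim), comparing a pattern that can overlap itself (00) with one that can't (01) across all bitstrings of length 12:

    from itertools import product

    def occurrences(row, pat):
        # count overlapping occurrences of pat inside row
        return sum(row.startswith(pat, i) for i in range(len(row) - len(pat) + 1))

    b = 12
    rows = ["".join(bits) for bits in product("01", repeat=b)]
    for pat in ("00", "01"):
        total = sum(occurrences(r, pat) for r in rows)   # same for both: 11 * 2**10 = 11264
        covered = sum(pat in r for r in rows)            # rows containing pat at all
        print(pat, total, covered)

Both patterns occur 11264 times in total, but 01, which cannot overlap itself, shows up in noticeably more distinct rows than 00 does.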
Human communication is full of forward error correction. So much of it that we can compress text at around 10:1. When compression first took off we could already manage more than 8:1, which made it very useful in an era before digital photography was big. We developed lossy compression for media.
4K is 8.3 megapixels, at least 24 bits per pixel, and 24 frames a second, which works out to about 4.8 Gbps if you include the audio. Netflix streams at 15.6 Mbps, which is more than 300:1. We talked here a couple of years ago about how HBO redid their “dead channel” intro to make the “white noise” compatible with video compression so it didn’t look like ass.
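Back-of-the-envelope check of those numbers (video only; the audio barely moves the total):

    pixels = 3840 * 2160                # ~8.3 megapixels per 4K frame
    raw_bps = pixels * 24 * 24          # 24 bits/pixel * 24 frames/sec
    print(raw_bps / 1e9)                # ~4.78 Gbps uncompressed
    print(raw_bps / 15.6e6)             # ~306x a 15.6 Mbps stream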
One thing you can do, as the other commenter pointed out, is consider entropy of the file.
However, this restriction is stronger than the challenge requires. We don't actually need a file with low entropy; in fact I claim that weak files exist with entropy 8 - epsilon for every epsilon > 0 (8 bits per byte being the maximum).
What we actually need is a sufficiently large chunk in a file to have low entropy. The largeness is in absolute terms, not relative terms.
A very simple example would be to take a very large file with maximum entropy and add 200 0's to the end. This would not decrease the entropy of the file much, but it opens the door to a compression algorithm that should be able to save ~100 bytes.
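A minimal sketch of that idea, reading the 200 0's as zero bytes (the function names and the 4-byte length header are my own choices, and the ~196-byte saving ignores the size of the decompressor program itself, which the challenge also counts - roughly why the parent claims only ~100 bytes):

    import os

    def compress(data: bytes) -> bytes:
        # strip trailing zero bytes, remember how many were stripped
        run = 0
        while run < len(data) and data[-1 - run] == 0:
            run += 1
        return run.to_bytes(4, "big") + data[:len(data) - run]

    def decompress(blob: bytes) -> bytes:
        run = int.from_bytes(blob[:4], "big")
        return blob[4:] + b"\x00" * run

    original = os.urandom(3 * 1024 * 1024) + bytes(200)   # max-entropy file + 200 zero bytes
    packed = compress(original)
    assert decompress(packed) == original
    print(len(original) - len(packed))                    # ~196 bytes saved (200 minus the header)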
Note that if this large chunk occurs in the middle of the file, then you will need extra space to encode that position. For example, a random bit string of length 2^n is decently likely to have a run of n zeroes. But this doesn’t help you because you need n bits just to encode where that run happens.
But storing an index for a file of length 2^n takes only n bits, so you need that run of 0’s to be of length n+1 to win
What if we didn't even attempt to compress the file? Theoretically, there is a random number generator + seed combination that would output the same bytes as the source file.
Neat idea but chances are the length of the seed is equal to the length of the original file.
There is a polynomial expression that will match the file. I wonder if expressing it would be compressible to a smaller size than the original file? [edit] I'm wrong - not all sequences can be expressed with a polynomial.
This reminds me of a data compression scheme I came up with once:
Treat an n-bit file as a polynomial over the finite field with characteristic 2. Now, some polynomials over this field are irreducible, but many have factors of x and of (x+1). Factor the polynomial into P(x) * x^a * (x+1)^b and just collect these terms, storing only P, a, and b.
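A sketch of what that looks like, treating the file's bits as coefficients of a polynomial over GF(2) (the helper names are mine; for random input the exponents a and b come out tiny on average, so very little is actually saved):

    import os

    def gf2_divmod(p, d):
        # long division in GF(2)[x]; ints hold coefficients (bit i = x^i)
        q = 0
        dbits = d.bit_length()
        while p.bit_length() >= dbits:
            shift = p.bit_length() - dbits
            q ^= 1 << shift
            p ^= d << shift
        return q, p

    def gf2_mul(a, b):
        # carry-less multiplication in GF(2)[x]
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            b >>= 1
        return r

    def factor_out(p):
        # pull out the x^a and (x+1)^b factors, leaving P with P(0) = P(1) = 1
        a = 0
        while p and not p & 1:            # divisible by x <=> constant term is 0
            p >>= 1
            a += 1
        b = 0
        while p:
            q, r = gf2_divmod(p, 0b11)    # 0b11 is the polynomial x + 1
            if r:
                break
            p, b = q, b + 1
        return p, a, b

    bits = int.from_bytes(os.urandom(16), "big")   # a 128-bit "file"
    P, a, b = factor_out(bits)

    # sanity check: P * x^a * (x+1)^b reconstructs the original bits
    recon = P << a
    for _ in range(b):
        recon = gf2_mul(recon, 0b11)
    assert recon == bits
    print(a, b)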
That's not how seeds work. Seeds are tiny.
Actually this would work perfectly if you knew it was generated in a single pass by a known random number generator and you had tons of time to brute force it.
If the file were generated by a natural source of entropy then forget it.
Or even if modified in a trivial way like adding 1 to every byte.
What is with the many downvotes but no comments? Everything I said is factual.
Seeds are something like 32 bits, though it depends on the exact implementation. But they're nowhere near the length of the files.
When implementing a PRNG, you can make its seed as big as you want. There is no mathematical law that dictates or limits the size of a seed.
But I assume the GGP assumes that the author was lazy and used a publicly available PRNG instead of a custom-made one. (A long time ago someone broke the login security check on HN using a trick like that. Obviously, it's already fixed.)
I mean sure you could in theory, but in practice that's not how common built-in random number generators work.
I was responding to:
> chances are the length of the seed is equal to the length of the original file
And why would the chances be that? You'd really have to go out of your way for that. I don't even know if there are libraries that can handle a seed and state length on the scale of megabytes. No, chances are 99.99+% it used a seed of a few bytes, because that's how common random number generators designed for efficiency work.
A PRNG with a 32-bit seed can only generate up to 2^32 different sequences of 32-bit numbers, and with only 32 bits of internal state the output is guaranteed to repeat after 2^32 generated numbers. Without doing the math, even for a 3 MB file the chances are very slim that there'll be a seed that generates the entire file. The number of unique 3 MB files is astronomically larger than the number of sequences you can generate, even if you tried every possible substring of all those sequences. So you need a PRNG that can generate more sequences, which you can only achieve by making the seed larger. What the person you were replying to was implying is that the size of the seed might need to approach the size of the sequence you're trying to generate.
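To put numbers on that gap (logs only, since the raw counts are unmanageably large):

    from math import log10

    seed_bits = 32
    file_bits = 3 * 1024 * 1024 * 8     # a 3 MB file = 25,165,824 bits
    # at most 2**seed_bits reachable sequences vs 2**file_bits possible files
    digits = (file_bits - seed_bits) * log10(2)
    print(f"files outnumber reachable sequences by roughly 10^{digits:,.0f}")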
Somewhere (discussed on HN) someone devised a "better-than-perfect" compressor. Most inputs get compressed (smaller than input), except for one input that does not. That one input is cryptographically impossible to find - or something along those lines.
Unfortunately I can't find the article I'm describing here, maybe someone else can? It was a long time ago so I might be misrepresenting it slightly.
It can't exist.
Presume this compressor guarantees the output will always be at least one byte smaller than the input (with the exception of the impossible-to-find input).
So just keep running your data in a loop through the compressor. If you start with a 1MB file, it will take a maximum of a million iterations until the output shrinks down to zero bytes, which is the smallest possible file. Record how many iterations it took.
You can now extract your file by feeding a zero-byte file into the decompressor and running it for the same number of iterations. Which means you can now store every 1MB (or smaller) file in the world in just 20 bits... But that would mean there are only a million possible 1MB files?
Even if you put some minimum output size limitation on the compressor, say it can't produce any file less than 512 bits, the same argument applies. It's just that the numbers get bigger.
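The pigeonhole version of that argument, with a tiny n so the numbers stay printable:

    n = 4                                        # pretend files are 4 bytes long
    shorter = sum(256 ** k for k in range(n))    # every file strictly shorter than n bytes
    print(256 ** n, shorter)                     # 4,294,967,296 inputs vs 16,843,009 outputs
    # a compressor that always shrinks by at least one byte would have to map the
    # first set injectively into the second, which is impossible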
That's how all compressors work, in that likely files (eg. ASCII, obvious patterns, etc) become smaller and unlikely files become bigger.
> likely files (eg. ASCII, obvious patterns, etc) become smaller
Likely files for a real human workload are like that, but if "most inputs" is talking about the set of all possible files, then that's a whole different ball game and "most inputs" will compress very badly.
> unlikely files become bigger
Yes, but when compressors can't find useful patterns they generally only increase size by a small fraction of a percent. There aren't files that get significantly bigger.
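Easy to check with a stock compressor, e.g. zlib on a megabyte of random bytes (the exact overhead depends on the library and settings, but it's a tiny fraction of a percent):

    import os, zlib

    data = os.urandom(1_000_000)
    out = zlib.compress(data, 9)
    print(len(out) - len(data))                      # grows by on the order of 100 bytes
    print(100 * (len(out) - len(data)) / len(data))  # i.e. around 0.01 percent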
In some cases it can be certain: ASCII encoded in the usual 8 bits per character has fat to trim even if the text is random within that space.
Right, but the point was, the case where it became bigger was ~impossible to find.
Yeah good point, kinda glossed over that part of the original post. Don't think that that's possible fwiw.
IMO, the fun part of compression algorithms is that the set of files that become smaller is as narrow as possible while the set of files that become bigger is as big as possible, so _most_ files don't compress well! The trick is to get the set of files that get smaller to be just the useful files and nothing else.
You could design the opposite, a system where ultra-rare files get compressed a lot but most don't compress.
But as stated, what you said is not possible. To compress by even a single byte, only about 1/256 of all files can be compressible, or you'd have multiple different inputs compressing down to the same output.