elahieh 4 days ago

I'm rather skeptical of the claim that the data was compressed in 2018, because no further information is given apart from a hash value.

If the claim is true, they must have identified some "non-random" aspect of the original data, in which case they could have given more info.

tugu77 4 days ago

Easy.

Save the sha256 hash of original.dat in compressed.dat. The decompressor cats /dev/random until it produces data of the right size with the correct hash.

Now there are two cases.

1. The reconstructed data is actually equal to original.dat. Challenge won, cash in $5000.

2. The reconstructed data differs from original.dat. It has the same hash though, so you found a collision in sha256. World fame.

In either case, win!
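
A minimal sketch of that scheme in Python (function and file names are just for illustration; os.urandom stands in for catting /dev/random). The catch is the expected number of iterations:

    import hashlib
    import os

    def compress(path, out_path):
        # "Compression": store only the length and the sha256 of the original.
        data = open(path, "rb").read()
        with open(out_path, "wb") as f:
            f.write(len(data).to_bytes(8, "big"))
            f.write(hashlib.sha256(data).digest())  # 40 bytes total, always

    def decompress(in_path, out_path):
        # "Decompression": draw random data of the right size until the hash matches.
        with open(in_path, "rb") as f:
            size = int.from_bytes(f.read(8), "big")
            digest = f.read(32)
        while True:
            candidate = os.urandom(size)
            if hashlib.sha256(candidate).digest() == digest:
                # Either candidate equals the original (challenge won) or it is
                # a sha256 collision (world fame). Expected tries: about 2**256.
                open(out_path, "wb").write(candidate)
                return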

hgomersall 3 days ago

Not necessarily. Consider a big file of uniformly distributed random bytes. It's easy to show that in practice some bytes are more common than others (because random), and therefore that the expected spacing between occurrences of those specific bytes is less than 256, which gives you a small fraction of a bit you can save by recoding those specific bytes (as the distance from the last occurrence of that value).

With a big enough file those fractions of a bit add up to a non-trivial number of bits. You can be cunning about how you encode your deltas too (the next delta makes use of the bits left unused by the previous delta).

I haven't worked through all the details, so it may be that in the end everything rebalances and the answer is no, but I'd like to withhold judgement for the moment.
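
A quick sanity check of the premise (a sketch, not a worked-out codec): count byte frequencies in a large random sample and see how far the empirical entropy falls below 8 bits per byte. That shortfall is the fraction of a bit the argument hopes to harvest.

    import collections
    import math
    import os

    N = 10_000_000
    data = os.urandom(N)
    counts = collections.Counter(data)

    # Empirical entropy in bits per byte. For an ideal uniform source this is
    # exactly 8; for any finite sample the counts wobble, so it lands a hair below.
    entropy = -sum((c / N) * math.log2(c / N) for c in counts.values())
    print(f"empirical entropy: {entropy:.6f} bits/byte")
    print(f"apparent saving:   {(8 - entropy) * N / 8:.1f} bytes out of {N}")

Whether that apparent saving survives once you also have to store whatever table or delta scheme exploits it is exactly the part that hasn't been worked through.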

l33t7332273 3 days ago

>It's easy to show that in practice some bytes are more common than others (because random)

I don’t follow. Wouldn’t that be (because not random)?

hgomersall 3 days ago

If you generate a billion bytes using a random byte generator and bin the resulting array into 256 bins, the histogram will not be perfectly flat. You can use that non-flatness to encode your bits more efficiently. I suspect just assigning codes directly won't work well, because the bin counts are so close that you'll struggle to get codes that are efficient enough, but I suspect you can use a second-order quantity, the distance between occurrences of a specific byte value, as the encoded value. That has a much more pronounced distribution, heavily weighted to small values.
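
For what it's worth, here's a sketch of what that gap distribution looks like (assumption: pick one arbitrary byte value and measure the distances between its successive occurrences in random data):

    import os

    data = os.urandom(10_000_000)
    target = 0x42  # arbitrary byte value to track

    # Distances between successive occurrences of `target`.
    gaps, last = [], None
    for i, b in enumerate(data):
        if b == target:
            if last is not None:
                gaps.append(i - last)
            last = i

    mean_gap = sum(gaps) / len(gaps)
    below_mean = sum(1 for g in gaps if g < mean_gap) / len(gaps)
    print(f"mean gap: {mean_gap:.1f} (about 256 for uniform random bytes)")
    print(f"share of gaps below the mean: {below_mean:.1%}")
    # The gap distribution is geometric: monotonically decreasing, so the
    # smallest gaps are the individually most frequent values.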