Item 42231550

anamexis • 4 days ago

He does use the filenames. If you change the filenames randomly (such that the files sort differently), it does not work.

Dylan16807 • 4 days ago

> such that the files sort differently

But if you change them without making them sort differently, everything is fine. He depends on the order, not the filenames. You could even remove the filenames entirely, as long as you patch the code to account for such a strange environment.

1 reply

mafuy • 3 days ago

Not really a good point. If the order of bytes does not matter, then I can compress any file of your liking to O(log n) size :P

1 reply

Dylan16807 • 3 days ago

Wait, whose point are you saying is not good?

I'm saying order does matter and it's the only thing that matters about the separate files using this code.

1 reply

anamexis • 3 days ago

I think the question is, if you remove the filenames entirely, how do you keep the parts ordered?

(Someone else suggested sorting them by file size.)

1 reply

Dylan16807 • 3 days ago

You have to be storing them outside a traditional filesystem to not have filenames, so the way you keep them ordered depends on what your storage mechanism is.

For example, you could store them in a Set object in many programming languages, one that preserves insertion order. Or you could be extracting them one by one from a tar file that has blank filenames stored in it.

1 reply

anamexis • 3 days ago

For the Set example, where would the insertion order come from? For the tar file, the tar file would be larger than the input file it's supposed to be "compressing".

1 reply

Dylan16807 • 3 days ago

> For the Set example, where would the insertion order come from?

It would come from however the files were transferred from competitor computer to verifier computer.

> For the tar file, the tar file would be larger than the input file it's supposed to be "compressing".

It sure would be! I don't see how that's relevant to the filename discussion though?

1 reply

anamexis • 3 days ago

The whole point of the filename discussion is that it's a trick to "compress" the data, such that the sum of the file size of the input files is smaller than the file size of the decompressed output file.

Neither of your ideas work with this.

In terms of just transferring the files in order, then you need to delineate the start and end of each file, which will take more space than the byte you are removing. Same with the tar file.

1 reply

Dylan16807 • 3 days ago

> The whole point of the filename discussion is that it's a trick to "compress" the data

I think you've fundamentally misunderstood my point.

You seem to think I'm trying to make it work without a trick, but I'm not doing that. I'm saying yes there is a trick, but the trick is not based around filenames. The trick needs a sequence of variable-size blobs of bytes, and there's a lot of ways to maintain a sequence.

1 reply

anamexis • 3 days ago

The trick in the OP is absolutely based around filenames, as a source of ordering the input files. I agree that you can use the same idea if you can order the files a different way, but I don't understand why that is significant.

1 reply

Dylan16807 • 3 days ago

Because it seems very unfair to call it "filename shenanigans". The filenames are only there to point the script at the right file. Filename shenanigans would be something like putting actual bytes into the filename.

If you patched `cat` to ignore filename and just spit out each file as given, the script would still work without a single change. If you slightly changed the script to loop over the results of `ls`, it could still be compatible with scrambled filenames.

A script that didn't cheat would also be using filenames to a similar level.

In other words the filenames are a completely fair implementation detail. And that detail can be swapped out without changing the trick in any meaningful way.

The trick is based on having a series of variable-sized blobs of bytes. That's all it needs. If I use javascript instead of sh, and my decompressor is `[...s].join('5')`, I'm using the same trick.

1 reply

anamexis • 3 days ago

> If you patched `cat` to ignore filename and just spit out each file as given, the script would still work without a single change. If you slightly changed the script to loop over the results of `ls`, it could still be compatible with scrambled filenames.

This isn't true! If you scrambled the filenames, the files would be put together in the wrong order and the result would be incorrect. You would need to also transmit the order that the files would be put together separately, which again, together with the size of the files themselves, would be greater than the size of the output.

The key thing here is that the trick works by storing the information of how the blobs are ordered out-of-band. In the OP, that out-of-band place to store the blob order is filename. In your JS example of `[...s].join('5')`, where does the order of [...s] come from? It's not something you can hand-wave away, it's the key thing that makes the trick work.

1 reply

Dylan16807 • 3 days ago

> This isn't true! If you scrambled the filenames

I said "could" because you'd have to either do a limited scramble or hotwire ls to use the right order despite the scrambling. Or sort by date or inode, probably.

> The key thing here is that the trick works by storing the information of how the blobs are ordered out-of-band.

Yes. That is the key, not the filenames.

> In the OP, that out-of-band place to store the blob order is filename.

It is, but the actual use of filenames is not a shenanigan, and the blob order could be easily accomplished without any particular filenames.

> In your JS example of `[...s].join('5')`, where does the order of [...s] come from? It's not something you can hand-wave away, it's the key thing that makes the trick work.

It comes from the process of loading the blobs onto the computer. I'm not trying to hand-wave it, I'm saying it doesn't need filenames or anything resembling filenames. Maybe it came from a tar. Maybe I sent each file in a separate email. All that matters is having an order, and having an order happens by default when you have multiple files. As long as you don't go out of your way to reorder things, the trick works.

1 reply

anamexis • 3 days ago

> It comes from the process of loading the blobs onto the computer. I'm not trying to hand-wave it, I'm saying it doesn't need filenames. Maybe it came from a tar. Maybe I sent each file in a separate email. All that matters is having an order, and having an order happens by default when you have multiple files. As long as you don't go out of your way to reorder things, the trick works.

I guess that's where we disagree. I think you don't have an order by default, you need to explicitly define it, and transmit it, and store it somehow. Which is after all, why it's not true compression. When you account for that metadata, the "compressed" data is not smaller than the original.

In the OP, the cheat was using filenames to store that data. In a tar file, it's using the tar file metadata to store it. In your email, you're storing the email metadata to keep that ordering. In all cases, order is a key thing that you need to explicitly define, transmit, and store. And in all cases, this metadata takes up more space than is saved by the whole scheme.

1 reply

Dylan16807 • 3 days ago

> I guess that's where we disagree. I think you don't have an order by default, you need to explicitly define it, and transmit it, and store it somehow.

It's files on a computer. Those always have an order. Acting like there isn't an order takes active work.

> Which is after all, why it's not true compression. When you account for that metadata, the "compressed" data is not smaller than the original.

Yeah sure, I have never disagreed on this.

> In the OP, the cheat was using filenames to store that data.

Let me make my argument extra clear, and you can tell me if you disagree with either point, and exactly how you disagree with that point:

A) OP did not do anything untoward with filenames. They did a simple loop and even threw away the actual contents of the filename. Even code that wasn't cheating would have a similar use of filenames.

B) OP's trick is not "based on" filenames, it's based on having an order. There are many ways to have an order, and their choice of using filenames is very shallowly integrated into their code.

1 reply

anamexis • 3 days ago

Sure, I disagree with this:

B) OP's trick is not "based on" filenames, it's based on having an order. There are many ways to have an order, and their choice of using filenames is very shallowly integrated into their code.

I think this is a distinction without a difference. OP's trick is based on having an order via filenames. They could have used a different trick that used something besides filenames for ordering, but they didn't.

If instead of asking if he could use multiple files, he asked if he could use a tar file, or email each file separately and specified that they should be fed to the decompressor in order, he would likely have been declined outright because the cheat is more obvious.

1 reply

Dylan16807 • 3 days ago

Okay, well I can respect that interpretation enough. I don't quite see it that way but I don't think I'm going to convince you.

Specifically I don't think it rises to the level of violating rules on filenames. That's why I think the distinction can matter.

1 reply

tugu77 • 2 days ago

The rules should bar contestents from saving entropy outside the payload of the file. Whether that's in a file name or some file system data structure or in some timing side channel is insubstantial.

And once you ban that, it's impossible from an information theoretic point to win the challenge.

hombre_fatal • 4 days ago

Not in any significant way. The decompressor could be changed to require you to feed the files into it in the correct order or expect some other sorting.

What you're saying is like saying that you encoded info in filenames because decompress.sh expects a file "compressed.dat" to exist. It's not describing any meaningful part of the scheme.

1 reply

anamexis • 4 days ago

The filenames contain information that you need in some way for the scheme to work.

You are combining different parts and inserting a missing byte every time you combine the files. You need to combine the parts in the correct order, and the order is part of the information that makes this work.

If the ordering isn't coming from filenames, it needs to come from somewhere else.

1 reply

mhandley • 4 days ago

You could do the same spitting trick but only split at progressively increasing file lengths at the character '5'. The "compression" would be worse, so you'd need a larger starting file, but you could still satisfy the requirements this way and be independent of the filenames. The decompressor would just sort the files by increasing length before merging.

2 replies

gus_massa • 3 days ago

Nice idea, but doesn't this require a linear increase of the length of the partial files and a quadratic size of the original file?

If the length of a file is X, then in the next file you must skip the first X characters and look for a "5" that in average is in the X+128 position. So the average length of the Nth file is 128*N and if you want to reduce C bytes the size of the original file should be ~128C^2/2 (instead of the linear 128*C in the article).

2 replies

Dylan16807 • 3 days ago

It does, but that's fine. He only needs to save 150-200 bytes and the file is 3MB. 128*200²/2 is 2.5MB Though I think it would be 256* here? Still, it can be made to work.

mhandley • 3 days ago

Yes, I think it is quadratic. I don't claim it's practical (the original isn't practical either though), but just that the dependency on filenames isn't fundamental.

1 reply

gus_massa • 3 days ago

> I don't claim it's practical

Don"t interpret my comment as a complain, I was just triying to understand.

This is reposted from time to time, but I don't remember someome proposing this trick before. It's a nice idea.

anamexis • 4 days ago

That's a neat idea.