jandrewrogers 2 days ago

Deletion via encryption only works if every user’s data is completely separate from every other user’s data in the storage layer. This is rarely the case in databases, indexes, etc. It is also often infeasible if the number of users is very large (key schedule state alone will blow up your CPU cache).
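
For context, the scheme means giving every user their own encryption key, so “deleting” the user is destroying that key. A minimal sketch (Python, using the cryptography package; the dict is a stand-in for a real KMS, and all names are made up):

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    user_keys = {}  # stand-in for a real KMS/HSM

    def encrypt_for_user(user_id, plaintext: bytes) -> bytes:
        key = user_keys.setdefault(user_id, AESGCM.generate_key(bit_length=256))
        nonce = os.urandom(12)
        return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

    def decrypt_for_user(user_id, blob: bytes) -> bytes:
        key = user_keys[user_id]  # KeyError once the key has been shredded
        return AESGCM(key).decrypt(blob[:12], blob[12:], None)

    def shred_user(user_id) -> None:
        # "Deletes" the user: every copy of their ciphertext, wherever it
        # ended up (backups, logs, exports), becomes unrecoverable.
        user_keys.pop(user_id, None)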

Databases with data from multiple users largely can’t work this way unless you are comfortable with a performance loss of several orders of magnitude. It has been built many times, but the performance is so poor that it is deemed unusable.

blagie 2 days ago

The real mess isn't the data in databases, but the copies on laptops for offline analysis, in log files, backups, etc.

It's easy enough to have a SQL query that deletes a user's data from the production database for real.

It's all the other places the data goes that's a mess, and a robust system of deletion via encryption could work fine in most of those places, at least in the abstract with the proper tooling.

alisonatwork 2 days ago

Some of these issues could perhaps be addressed by having fixed retention of PII in the online systems, and encryption at rest in the offline systems. If a user wants to access data of theirs which has gone offline, they take the decryption hit. Of course it helps to be critical about how much data should be retained in the first place.

It is true that protecting the user's privacy costs more than not protecting it, but some organizations feel a moral obligation or have a legal duty to do so. And some users value their own privacy enough that they are willing to deal with the decreased convenience.

As an engineer, I find it neat that figuring out how to delete data is often a more complicated problem than figuring out how to create it. I welcome government regulations that encourage more research and development in this area, since from my perspective that aligns actually-interesting technical work with the public good.

jandrewrogers 2 days ago

> As an engineer, I find it neat that figuring out how to delete data is often a more complicated problem than figuring out how to create it.

Unfortunately, this is a deeply hard problem in theory. It is not as though it has not been thoroughly studied in computer science. When GDPR first came out I was actually doing core research on “delete-optimized” databases. It is a hard problem in other domains too. Regulations don’t have the power to dictate mathematics.

I know of several examples in multiple countries where data deletion laws are flatly ignored by the government because it is literally impossible to comply even though they want to. Often this data supports a critical public good, so simply not collecting it would have adverse consequences to their citizens.

tl;dr: delete-optimized architectures are so profoundly pathological for query performance, and to a lesser extent insert performance, that no one can use them for most practical applications. It is fundamental to the computer science of the problem. Denial of this reality leads to issues like the above, where non-compliance is required because the law didn’t concern itself with the physics of computation.

If the database is too slow to load the data then it doesn’t matter how fast your deterministic hard deletion is because there is no data to delete in the system.

Any improvements in the situation are solving minor problems in narrow cases. The core theory problems are what they are. No amount of wishful thinking will change this situation.

Gigachad 2 days ago

Instantaneous deletes might be impossible, but I really doubt that it’s physically impossible to eventually delete user data. If you soft delete first to hide the user’s data, and then it takes hours, weeks, or months to eventually purge it from all systems, that’s fine. Regulators aren’t expecting you to edit old backups, only that they eventually get cleared within a reasonable time.

Seems that companies are capable of moving mountains when the task is tracking the user and bypassing privacy protections. But when the task is deleting the user’s data, it’s “literally impossible”.
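
Something like the following is all that’s really being asked for: a soft delete that hides the data immediately, plus a periodic purge job that hard-deletes it after a grace period. A sketch with sqlite3 and made-up table names; a real system would fan the purge out to logs, caches, and backup rotation on their own schedules:

    import sqlite3
    import time

    conn = sqlite3.connect("app.db")

    def soft_delete_user(user_id: int) -> None:
        # Hide the data right away; nothing should serve it past this point.
        conn.execute("UPDATE users SET deleted_at = ? WHERE id = ?",
                     (time.time(), user_id))
        conn.commit()

    def purge_deleted_users(grace_seconds: int = 30 * 24 * 3600) -> None:
        # Run from a scheduler; hard-deletes anything past the grace period.
        cutoff = time.time() - grace_seconds
        conn.execute(
            "DELETE FROM user_events WHERE user_id IN ("
            "  SELECT id FROM users WHERE deleted_at IS NOT NULL AND deleted_at < ?)",
            (cutoff,))
        conn.execute(
            "DELETE FROM users WHERE deleted_at IS NOT NULL AND deleted_at < ?",
            (cutoff,))
        conn.commit()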

alisonatwork 2 days ago

It would be interesting to hear more about your experience with systems where deletion has been deemed "literally impossible".

Every database I have come across in my career has a delete function. Often it is slow. In many places I worked, deleting or expiring data cost almost as much as or sometimes more than inserting it... but we still expired the data because that's a fundamental requirement of the system. So everything costs 2x, so what? The interesting thing is how to make it cost less than 2x.

catlifeonmars 1 day ago

You can use row-based encryption and store an encrypted encryption key alongside each row. You use a master key to decrypt the row encryption key and then decrypt the row each time you need to access it. This is the standard way of implementing it.

You can instead switch to a password-based key derivation function for the row encryption key if you want the row to be encrypted under a user-provided password.
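
Roughly like this (a sketch with the Python cryptography package; in practice the master key lives in a KMS and the wrapped key is just another column on the row):

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM
    from cryptography.hazmat.primitives.kdf.scrypt import Scrypt

    master_key = AESGCM.generate_key(bit_length=256)  # normally held in a KMS

    def encrypt_row(plaintext: bytes):
        # Fresh data key per row, wrapped ("enveloped") under the master key.
        row_key = AESGCM.generate_key(bit_length=256)
        n1, n2 = os.urandom(12), os.urandom(12)
        wrapped_key = n1 + AESGCM(master_key).encrypt(n1, row_key, None)
        ciphertext = n2 + AESGCM(row_key).encrypt(n2, plaintext, None)
        return wrapped_key, ciphertext  # both get stored alongside the row

    def decrypt_row(wrapped_key: bytes, ciphertext: bytes) -> bytes:
        row_key = AESGCM(master_key).decrypt(wrapped_key[:12], wrapped_key[12:], None)
        return AESGCM(row_key).decrypt(ciphertext[:12], ciphertext[12:], None)

    def password_row_key(password: bytes, salt: bytes) -> bytes:
        # Variant: derive the row key from a user-supplied password instead.
        return Scrypt(salt=salt, length=32, n=2**14, r=8, p=1).derive(password)

Dropping the master key (or a per-user master key) then renders everything wrapped under it unreadable.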

jandrewrogers 1 day ago

This has been tried many times. The performance is so poor as to be unusable for most applications. The technical reasons are well-understood.

The issue is that, at a minimum, you have added 32 bytes to a row just for the key. That is extremely expensive and in many cases will be a large percentage of the entire row; many years ago PostgreSQL went to heroic efforts to shave 2 bytes off each row for performance reasons. It also limits you to row storage, which means query performance will be poor.
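
(To make that concrete: for a narrow row, say an 8-byte id, an 8-byte timestamp, and a 16-byte value, a bare 32-byte key alone doubles the row, and a wrapped key with a typical AEAD nonce and auth tag is closer to 60 bytes.)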

That aside, you overlooked the fact that you'll have to compute a key schedule for each row. None of the setup costs of the encryption can be amortized, which makes processing a row extremely expensive computationally.

There is no obvious solution that actually works. This has been studied and implemented extensively. The reason no one does it isn't because no one has thought of it before.

catlifeonmars 1 day ago

You’re not wrong about the downsides. However, you’re wrong about the costs being prohibitive in general. I’ve personally worked on quite a few applications that do this and the additional cost has never been an issue.

Obviously context matters, and there are some applications where the benefit does not outweigh the cost.

infinite8s 1 day ago

I think you and the GP are probably talking about different scale orders of magnitude.

catlifeonmars 1 day ago

Very likely!

But I think there must also be constraints other than scale. The profit margins must also be razor thin.