eurekin 2 days ago

At work we dutifully delete all data on a GDPR request

3
sahila 2 days ago

How do you manage deleting data from backups? Do you know not take backups?

crdrost 2 days ago

"When data subjects exercise one of their rights, the controller must respond within one month. If the request is too complex and more time is needed to answer, then your organisation may extend the time limit by two further months, provided that the data subject is informed within one month after receiving the request."

Backup retention policy 60 days, respond within a week or two telling someone that you have purged their data from the main database but that these backups exist and cannot be changed, but that they will be automatically deleted in 60 days.

The only real difficulty is if those backups are actually restored, then the user deletion needs to be replayed, which is something that would be easy to forget.

Gigachad 2 days ago

Probably most just ignore backups. But there were some good proposals where you encrypt every users data with their own key. So a full delete is just deleting the users encryption key, rendering all data everywhere including backups inaccessible.

jandrewrogers 2 days ago

Deletion via encryption only works if every user’s data is completely separate from every other user’s data in the storage layer. This is rarely the case in databases, indexes, etc. It also is often infeasible if the number of users is very large (key schedule state alone will blow up your CPU cache).

Databases with data from multiple users largely can’t work this way unless you are comfortable with a several order of magnitude loss of performance. It has been built many times but performance is so poor that it is deemed unusable.

blagie 2 days ago

The entire mess isn't with data in databases, but on laptops for offline analysis, in log files, backups, etc.

It's easy enough to have a SQL query to delete a users' data from the production database for real.

It's all the other places the data goes that's a mess, and a robust system of deletion via encryption could work fine in most of those places, at least in the abstract with the proper tooling.

alisonatwork 2 days ago

Some of these issues could perhaps be addressed by having fixed retention of PII in the online systems, and encryption at rest in the offline systems. If a user wants to access data of theirs which has gone offline, they take the decryption hit. Of course it helps to be critical about how much data should be retained in the first place.

It is true that protecting the user's privacy costs more than not protecting it, but some organizations feel a moral obligation or have a legal duty to do so. And some users value their own privacy enough that they are willing to deal with the decreased convenience.

As an engineer, I find it neat that figuring out how to delete data is often a more complicated problem than figuring out how to create it. I welcome government regulations that encourage more research and development in this area, since from my perspective that aligns actually-interesting technical work with the public good.

jandrewrogers 2 days ago

> As an engineer, I find it neat that figuring out how to delete data is often a more complicated problem than figuring out how to create it.

Unfortunately, this is a deeply hard problem in theory. It is not as though it has not been thoroughly studied in computer science. When GDPR first came out I was actually doing core research on “delete-optimized” databases. It is a problem in other domains. Regulations don’t have the power to dictate mathematics.

I know of several examples in multiple countries where data deletion laws are flatly ignored by the government because it is literally impossible to comply even though they want to. Often this data supports a critical public good, so simply not collecting it would have adverse consequences to their citizens.

tl;dr: delete-optimized architectures are so profoundly pathological to query performance, and a lesser extent insert performance, that no one can use them for most practical applications. It is fundamental to the computer science of the problem. Denial of this reality leads to issues like the above where non-compliance is required because the law didn’t concern itself with the physics of computation.

If the database is too slow to load the data then it doesn’t matter how fast your deterministic hard deletion is because there is no data to delete in the system.

Any improvements in the situation are solving minor problems in narrow cases. The core theory problems are what they are. No amount of wishful thinking will change this situation.

Gigachad 2 days ago

Instantaneous deletes might be impossible, but I really doubt that it’s physically impossible to eventually delete user data. If you soft delete first to hide user data, and then maybe it takes hours, weeks, months to eventually purge from all systems, that’s fine. Regulators aren’t expecting you to edit old backups, only that they eventually get cleared in reasonable time.

Seems that companies are capable of moving mountains when the task is tracking the user and bypassing privacy protections. But when the task is deleting the users data it’s “literally impossible”

alisonatwork 2 days ago

It would be interesting to hear more about your experience with systems where deletion has been deemed "literally impossible".

Every database I have come across in my career has a delete function. Often it is slow. In many places I worked, deleting or expiring data cost almost as much as or sometimes more than inserting it... but we still expired the data because that's a fundamental requirement of the system. So everything costs 2x, so what? The interesting thing is how to make it cost less than 2x.

catlifeonmars 2 days ago

You can use row based encryption and store the encrypted encryption key alongside each row. You use a master key to decrypt the row encryption key and then decrypt the row each time you need to access it. This is the standard way of implementing it.

You can instead switch to a password-based key derivation function for the row encryption key if you want the row to be encrypted by a user provided password

jandrewrogers 1 day ago

This has been tried many times. The performance is so poor as to be unusable for most applications. The technical reasons are well-understood.

The issue is that, at a minimum, you have added 32 bytes to a row just for the key. That is extremely expensive and in many cases will be a large percentage of the entire row; many years ago PostgreSQL went to heroic efforts to reduce 2 bytes per row for performance reasons. It also limits you to row storage, which means query performance will be poor.

That aside, you overlooked the fact that you'll have to compute a key schedule for each row. None of the setup costs of the encryption can be amortized, which makes processing a row extremely expensive computationally.

There is no obvious solution that actually works. This has been studied and implemented extensively. The reason no one does it isn't because no one has thought of it before.

catlifeonmars 1 day ago

You’re not wrong about the downsides. However you’re wrong about the costs being prohibitive on general. I’ve personally worked on quite a few applications that do this and the additional cost has never been an issue.

Obviously context matters and there are some applications where the cost does not outweigh the benefit

infinite8s 1 day ago

I think you and the GP are probably talking about different scale orders of magnitude.

catlifeonmars 1 day ago

Very likely!

But I think there must also be constraints other than scale. The profit margins must also be razor thin.

liamYC 2 days ago

Smart, how do you backup the users encryption keys?

aiiane 2 days ago

A set of encryption keys is a lot smaller than the set of all user data, so it's much more viable to have both more redundant hot storage and more frequently rotated cold storage of just the keys.

Trasmatta 2 days ago

Most companies don't keep all backups in perpetuity, and instead have rolling backups over some period of time.

alisonatwork 2 days ago

Backups can have a fixed retention period.

sahila 2 days ago

Sure, but now when the backup is restored two weeks later, is the user redeleted or just forgotten about?

alisonatwork 2 days ago

Depends on the processes in place at the company. Presumably if a backup is restored, some kind of replay has to happen after that, otherwise all the other users are going to lose data that arrived in the interim. A catastrophic failure where both two weeks of user data and all the related events get irretrievably blackholed could still happen, sure, but any company where that is a regular occurrence likely has much bigger problems than complying with GDPR.

The point is that none of these problems are insurmountable - they are all processes and practices that have been in place since long before GDPR and long before I started in this industry 25+ years ago. Even if deletion is only eventually consistent, even if a few pieces of data slip through the cracks, it is not hard to have policies in place that at least provide a best effort at upholding users' privacy and complying with the regulations.

Organizations who choose not to bother, claiming that it's all too difficult, or that because deletion cannot be done 100% perfectly it should not even be attempted at all, are making weak excuses. The cynical take would be that they are just covering for the fact that they really do not respect their users' privacy and simply do not want to give up even the slightest chance of extracting value from that data they illegally and immorally choose to retain.

simonw 2 days ago

Purely out of interest, how do you verify that the GDPR request is coming from the actual user and not an imposter?

dijksterhuis 2 days ago

> The organisation might need you to prove your identity. However, they should only ask you for just enough information to be sure you are the right person. If they do this, then the one-month time period to respond to your request begins from when they receive this additional information.

https://ico.org.uk/for-the-public/your-right-to-get-your-dat...

eurekin 2 days ago

In my domain, our set of services only authorizes Customer Centre system to do so. I guess I'd need to ask them for details, but I always assumed they have checks in place

gruez 2 days ago

That won't work in this case, because I doubt GDPR requests override court orders.