Rietty 2 days ago

> we had a hack to drop all messages in a process's mailbox through the introspection facilities and sometimes we automated that with cron...

What happens to the messages? Do they get processed at a slower rate or on a subsystem that works in the background without having more messages being constantly added? Or do you just nuke them out of orbit and not care? That doesn't seem like a good idea to me since loss of information. Would love to know more about this!

2
toast0 2 days ago

Nuked; it's the only way to be sure. It's not that we didn't care about the messages in the queue, it's just there's too many of them, they can't be processed, and so into the bin they go. This strategy is more viable for reads and less viable for writes, and you shouldn't nuke the mnesia processes's queues, even when they're very backlogged ... you've got to find a way to put backpressure on those things --- maybe a flag to error out on writes before they're sent into the overlarge queue.

Mostly this is happening in the context of request/response. If you're a client and connect to the frontend you send a auth blob, and the frontend sends it to the auth daemon to check it out. If the auth daemon can't respond to the frontend in a reasonable time, the frontend will drop the client; so there's no point in the auth daemon looking at old messages. If it's developed a backlog so high it can't get it back, we failed and clients are having trouble connecting, but the fastest path to recovery is dropping all the current requests in progress and starting fresh.

In some scenarios even if the process knew it was backlogged and wanted to just accept messages one at a time and drop them, that's not fast enough to catch up to the backlog. The longer you're in unrecoverable backlog, the worse the backlog gets, because in addition to the regular load from clients waking up, you've also got all those clients that tried and failed going to retry. If the outage is long enough, you do get a bit of a drop off, because clients that can't connect don't send messages that require waking up other clients, but that effect isn't so big when you've only got a large backlog a few shards.

cess11 2 days ago

If the user client is well implemented either it or the user notices that an action didn't take effect and tries again, similar to what you would do if a phone call was disconnected unexpectedly or what most people would do if a clicked button didn't have the desired effect, i.e. click it repeatedly.

In many cases it's not a big problem if some traffic is wasted, compared to desperately trying to process exactly all of it in the correct order, which at times might degrade service for every user or bring the system down entirely.