This can easily happen in a BEAM system. Say you have some shared state you want to access, so you create a gen_server to protect it. A gen_server is basically a huge mutex: it is just a normal BEAM process that handles requests sent to its message queue and then sends a reply message back. Let's say it can process a request normally in 20µs; a 15ms pause would then stack up 750 messages in its message queue.

Now maybe that is not enough to cause a huge outage on its own, but maybe as part of your handling you are using the message queue in an unsafe way. When you do a selective `receive`, the BEAM will search the whole message queue for a message that matches. There are certain patterns the BEAM can optimize to avoid scanning the whole queue (I think it only optimizes the gen/rpc style pattern, where you match on a reference created just before the `receive`; almost everything else is unsafe). If you are using an unsafe pattern while you have a message queue backlog, it will destroy the throughput of the system, because the time taken to process a message becomes a function of the message queue length, and the message queue length becomes a function of how long it takes to process a message.
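For concreteness, here is a minimal sketch of the difference (module, function, and message shapes are made up for illustration, not any particular library's protocol). The unsafe version has to scan every message in the mailbox; the safe version matches only on a reference created immediately before the `receive`, which is the shape the compiler can optimize so messages that were already queued before `make_ref()` are never examined:

```erlang
-module(selective_receive_demo).
-export([unsafe_call/2, safe_call/2]).

%% Unsafe: any {reply, _} matches, so the runtime scans the whole mailbox,
%% including every backlogged message that has nothing to do with this call.
unsafe_call(Server, Req) ->
    Server ! {request, self(), Req},
    receive
        {reply, Reply} -> {ok, Reply}
    after 5000 ->
        {error, timeout}
    end.

%% Safer: the receive matches only on a reference created just before it
%% (the pattern gen_server:call/gen:call uses), which the compiler recognizes,
%% so messages queued before make_ref() are skipped entirely.
safe_call(Server, Req) ->
    Ref = make_ref(),
    Server ! {request, self(), Ref, Req},
    receive
        {reply, Ref, Reply} -> {ok, Reply}
    after 5000 ->
        {error, timeout}
    end.
```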
Also, the great thing is you might not even have an explicit `receive` statement in your gen_server code. You might just be using a library that does a `receive` somewhere that is unsafe with a large message queue, and now you are burned. The BEAM also added an alternative addressing mechanism you can use instead of sending directly to a process's pid, which should be a lot safer, but I think a lot of libraries still do not use it. That alternative is 'alias' (https://www.erlang.org/doc/system/ref_man_processes.html#pro...), and it does something slightly different from what I thought: it protects the queue from 'lost' messages. Without aliases, timeouts can leave the process message queue polluted with reply messages that are no longer being waited on, which leads to the same problem of large message queues dragging down the throughput of a process. Usually, though, long-lived processes have a loop that handles whatever messages end up in the queue.
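A rough sketch of the alias pattern (names are mine; `erlang:alias/0` and `erlang:unalias/1` are OTP 24+). After a timeout the alias is deactivated, so a late reply sent to it is simply dropped by the runtime instead of sitting in the mailbox forever:

```erlang
-module(alias_call_demo).
-export([call/2]).

call(Server, Req) ->
    Alias = erlang:alias(),
    Server ! {request, Alias, Req},
    receive
        {Alias, Reply} ->
            erlang:unalias(Alias),
            {ok, Reply}
    after 5000 ->
        erlang:unalias(Alias),
        %% a reply that raced in just before unalias/1 may already be queued,
        %% so flush it rather than leave it polluting the mailbox
        receive {Alias, _Late} -> ok after 0 -> ok end,
        {error, timeout}
    end.
```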
> Let's say it can process a request normally in 20µs.
Then what if the OS/thread hangs? Or maybe even a hardware issue? It seems a bit odd to have the critical path blocked by a single mutex. That's a recipe for problems, or am I missing something?
Hardware issues happen, but if you're lucky it's a simple failure and the box stops dead. Not much fun, but recovery can be quick and automated.
The real trouble is when the hardware fault is something like one of the 16 NIC queues stopping, so most connections work but not all (depending on the hash of the 4-tuple), or some bit in RAM failing so you're hitting thousands of ECC-correctable errors per second and your effective CPU capacity is down to 10%... the system is now too slow to work properly, but manages to stay connected to dist and still attracts traffic it can't reasonably serve.
But OS/thread hangs are avoidable in my experience. If you run your BEAM system with very few OS processes, there's no reason for the OS to cause trouble.
But on the topic of a 15ms pause... that pause is likely causally related to cascading pauses; it might be the beginning or the end or the middle. When one thing slows down, others do too, and some processes can't recover once the backlog gets over a critical threshold, which is kind of unknowable without experiencing it. WhatsApp had a couple of hacks to deal with this:

A) Our gen_server aggregation framework used our hacky version of priority messages to let the worker determine the age of requests and drop them if they were too old.

B) We had a hack to drop all messages in a process's mailbox through the introspection facilities, and sometimes we automated that with cron... Very few processes can work through a mailbox with 1 million messages; dropping them all gets to recovery faster.

C) We tweaked garbage collection to run less often when the mailbox was very large. I think this is addressed by off-heap mailboxes now, but when GC looks through the mailbox every so many iterations and the mailbox is very large, it can drive an unrecoverable cycle: eventually GC time limits throughput to below the accumulation rate and you'll never catch up.

D) We added process stats so we could see accumulation and drain rates, estimate time to drain (or whether the process would never drain), and built monitoring around that.
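For a rough idea of what A), C) and D) look like in plain OTP terms, here is a minimal sketch (the module name, the 15ms budget, and the helpers are my assumptions, not the actual WhatsApp framework): the caller stamps each request with its send time so the worker can drop stale ones, the mailbox is kept off-heap so GC doesn't rescan a huge backlog, and the queue length is exposed so monitoring can watch accumulation and drain rates.

```erlang
-module(aged_worker).
-behaviour(gen_server).
-export([start_link/0, request/1, queue_len/0]).
-export([init/1, handle_call/3, handle_cast/2]).

-define(MAX_AGE_MS, 15).   %% drop anything older than this; value is arbitrary

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% The caller stamps the request with the time it was sent (A).
request(Req) ->
    gen_server:call(?MODULE, {aged, erlang:monotonic_time(millisecond), Req}).

%% Expose the backlog size for monitoring (D).
queue_len() ->
    {message_queue_len, Len} =
        erlang:process_info(whereis(?MODULE), message_queue_len),
    Len.

init([]) ->
    %% Off-heap mailbox: GC no longer walks the (possibly huge) queue (C).
    process_flag(message_queue_data, off_heap),
    {ok, #{}}.

handle_call({aged, SentAt, Req}, _From, State) ->
    Age = erlang:monotonic_time(millisecond) - SentAt,
    case Age > ?MAX_AGE_MS of
        true  -> {reply, {error, too_old}, State};  %% caller gave up long ago
        false -> {reply, {ok, Req}, State}          %% stand-in for real work
    end;
handle_call(_Other, _From, State) ->
    {reply, ignored, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```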
> we had a hack to drop all messages in a process's mailbox through the introspection facilities and sometimes we automated that with cron...
What happens to the messages? Do they get processed at a slower rate, or on a subsystem that works in the background without more messages constantly being added? Or do you just nuke them from orbit and not care? That doesn't seem like a good idea to me, since it loses information. Would love to know more about this!
Nuked; it's the only way to be sure. It's not that we didn't care about the messages in the queue, it's just that there were too many of them, they couldn't be processed, and so into the bin they went. This strategy is more viable for reads and less viable for writes, and you shouldn't nuke the mnesia processes' queues even when they're very backlogged... you've got to find a way to put backpressure on those things, maybe a flag to error out on writes before they're sent into the overlarge queue.
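Standard OTP doesn't give you a "drop another process's mailbox" call, so that presumably went through their own introspection tooling; from inside a process, though, the classic drain loop is just this (a sketch, names are mine):

```erlang
-module(mailbox_flush).
-export([flush_mailbox/0]).

%% Drain everything currently sitting in the calling process's mailbox and
%% return how many messages were thrown away.
flush_mailbox() ->
    flush_mailbox(0).

flush_mailbox(N) ->
    receive
        _Any -> flush_mailbox(N + 1)
    after 0 ->
        N
    end.
```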
Mostly this is happening in the context of request/response. If you're a client and you connect to the frontend, you send an auth blob, and the frontend sends it to the auth daemon to check it out. If the auth daemon can't respond to the frontend in a reasonable time, the frontend will drop the client, so there's no point in the auth daemon looking at old messages. If it's developed a backlog so high it can't work it off, we failed and clients are having trouble connecting, but the fastest path to recovery is dropping all the current requests in progress and starting fresh.
In some scenarios, even if the process knew it was backlogged and wanted to just accept messages one at a time and drop them, that wouldn't be fast enough to catch up. The longer you're in unrecoverable backlog, the worse the backlog gets, because in addition to the regular load from clients waking up, you've also got all the clients that tried and failed coming back to retry. If the outage is long enough, you do get a bit of a drop-off, because clients that can't connect don't send messages that require waking up other clients, but that effect isn't so big when you've only got a large backlog on a few shards.
If the user client is well implemented, either it or the user notices that an action didn't take effect and tries again, similar to what you would do if a phone call disconnected unexpectedly, or what most people do when a clicked button doesn't have the desired effect, i.e. click it repeatedly.
In many cases it's not a big problem if some traffic is wasted, compared to desperately trying to process exactly all of it in the correct order, which at times might degrade service for every user or bring the system down entirely.
Depending on what you want to do, there are ways to change where the blocking occurs, like https://blog.sequinstream.com/genserver-reply-dont-call-us-w...
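In Erlang terms (the linked post is about Elixir's GenServer.reply, but the underlying mechanism is the same), the idea is roughly this; the module and function names here are made up:

```erlang
-module(deferred_reply_server).
-behaviour(gen_server).
-export([start_link/0, slow_request/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

slow_request(Req) ->
    gen_server:call(?MODULE, {slow_request, Req}, 30000).

init([]) ->
    {ok, #{}}.

%% Return {noreply, ...} immediately so the server loop can pick up the next
%% message; a worker process answers the original caller when the work is done.
handle_call({slow_request, Req}, From, State) ->
    spawn_link(fun() ->
        Result = {done, Req},            %% stand-in for the actual slow work
        gen_server:reply(From, Result)
    end),
    {noreply, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

The caller still blocks in gen_server:call/3, but the server itself no longer serializes the slow work, so its mailbox doesn't pile up behind one long request.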
Part of the problem with the BEAM is that it doesn't have great ways of dealing with concurrency beyond gen_server (effectively a mutex) and ETS tables (https://www.erlang.org/doc/apps/stdlib/ets). So usually the solution is to use ETS where possible, which is kind of like a ConcurrentHashMap in other languages, or to shard or replicate the shared state so it can be accessed in parallel. For read-only data that does not change very often, the BEAM also has persistent_term (https://www.erlang.org/doc/apps/erts/persistent_term.html).
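A minimal sketch of both alternatives (table and key names are mine):

```erlang
-module(shared_state_demo).
-export([init/0, bump/1, read/1, put_config/1, get_config/0]).

%% ETS: a shared table that many processes can read and update concurrently,
%% roughly the BEAM's answer to a ConcurrentHashMap.
init() ->
    ets:new(counters_tab, [named_table, public, set,
                           {read_concurrency, true},
                           {write_concurrency, true}]).

bump(Key) ->
    ets:update_counter(counters_tab, Key, 1, {Key, 0}).

read(Key) ->
    case ets:lookup(counters_tab, Key) of
        [{Key, V}] -> V;
        []         -> 0
    end.

%% persistent_term: reads are effectively free, but put/erase triggers a
%% global GC, so only use it for data that changes rarely.
put_config(Config) ->
    persistent_term:put({?MODULE, config}, Config).

get_config() ->
    persistent_term:get({?MODULE, config}, #{}).
```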
> this can easily happen in a BEAM system
Wait... I thought all you had to do was write it in Erlang and it scales magically!
Erlang makes a lot of hard things possible, some tricky things easy, and some easy things tricky.
There's no magic fairy dust. Just a lot of things that fit together in nice ways if you use them well, and blow up in predictable ways if you have learned how to predict the system.
I know that, having built several myself. Sometimes people get a bit ahead of themselves with the marketing though.