simscitizen 4 days ago

Oh I've debugged this before. Native memory allocator had a scavenge function which suspended all other threads. Managed language runtime had a stop the world phase which suspended all mutator threads. They ran at about the same time and ended up suspending each other. To fix this you need to enforce some sort of hierarchy or mutual exclusion for suspension requests.

> Why you should never suspend a thread in your own process.

This sounds like a good general principle, but suspending threads in your own process is kind of necessary for e.g. many GC algorithms. Now imagine multiple of those runtimes running in the same process.
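For what it's worth, the "mutual exclusion for suspension requests" fix can be sketched roughly like this. This is a toy simulation in Python, not how any real allocator or runtime does it: `suspension_lock`, `stop_the_world`, and `run_demo` are names made up for illustration, and nothing is actually suspended. The point is only that whoever wants to suspend other threads must first win a single process-wide lock, so two suspenders can never freeze each other halfway.

```python
import threading

# Hypothetical process-wide arbiter: only the holder may suspend other threads.
suspension_lock = threading.Lock()

# Bookkeeping for the demo only, to show that suspenders never overlap.
_counter_lock = threading.Lock()
_active_suspenders = 0
max_concurrent_suspenders = 0
log = []


def stop_the_world(name):
    """Do 'stop the world' work while holding the exclusive right to suspend."""
    global _active_suspenders, max_concurrent_suspenders
    with suspension_lock:
        with _counter_lock:
            _active_suspenders += 1
            max_concurrent_suspenders = max(
                max_concurrent_suspenders, _active_suspenders
            )
        # A real implementation would suspend the other threads here, do its
        # work, and resume them before releasing suspension_lock.
        log.append(name)
        with _counter_lock:
            _active_suspenders -= 1


def run_demo():
    """Race an allocator scavenge against a GC stop-the-world phase."""
    threads = [
        threading.Thread(target=stop_the_world, args=("allocator-scavenge",)),
        threading.Thread(target=stop_the_world, args=("gc-stop-the-world",)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return max_concurrent_suspenders


if __name__ == "__main__":
    print(run_demo(), log)
```

The catch is that this only helps if every component that suspends threads agrees to take the same lock, which is exactly the hard part when independent runtimes share a process.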

hyperpape 4 days ago

> suspending threads in your own process is kind of necessary for e.g. many GC algorithms

I think this is typically done by having the compiler/runtime insert safepoints, which cooperatively yield at specified points to allow the GC to run without mutator threads being active. Done correctly, this shouldn't be subject to the problem the original post highlighted, because it doesn't rely on the OS's ability to suspend threads when they aren't expecting it.
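To make the safepoint idea concrete, here is a minimal sketch in Python. It is a simulation under assumptions, not any real runtime's mechanism: `SafepointCoordinator`, `poll`, `mutator_exit`, and `run_demo` are hypothetical names, and the poll takes a lock, whereas a real runtime would make the poll much cheaper. Mutators park themselves at points they choose, and the collector only runs once every live mutator has parked.

```python
import threading


class SafepointCoordinator:
    """Toy cooperative stop-the-world: mutators park themselves at polls."""

    def __init__(self, mutator_count):
        self._cond = threading.Condition()
        self._mutators = mutator_count  # mutators that have not exited yet
        self._parked = 0                # mutators currently waiting in poll()
        self._stop_requested = False

    def poll(self):
        """Safepoint: return at once unless a stop has been requested."""
        with self._cond:
            if not self._stop_requested:
                return
            self._parked += 1
            self._cond.notify_all()      # tell the collector we are parked
            while self._stop_requested:
                self._cond.wait()
            self._parked -= 1

    def mutator_exit(self):
        """A finished mutator no longer needs to reach a safepoint."""
        with self._cond:
            self._mutators -= 1
            self._cond.notify_all()

    def stop_the_world(self, collect):
        """Request a stop, wait for all live mutators to park, then collect."""
        with self._cond:
            self._stop_requested = True
            while self._parked < self._mutators:
                self._cond.wait()
            collect()                    # no mutator is running managed work
            self._stop_requested = False
            self._cond.notify_all()      # release the parked mutators


def mutator(coord, iterations, progress, idx):
    for _ in range(iterations):
        progress[idx] += 1  # stand-in for managed work
        coord.poll()        # cooperative safepoint
    coord.mutator_exit()


def run_demo(mutator_count=3, iterations=10000):
    coord = SafepointCoordinator(mutator_count)
    progress = [0] * mutator_count
    snapshots = []

    def collect():
        before = list(progress)
        snapshots.append((before, list(progress)))  # nothing moves meanwhile

    threads = [
        threading.Thread(target=mutator, args=(coord, iterations, progress, i))
        for i in range(mutator_count)
    ]
    for t in threads:
        t.start()
    coord.stop_the_world(collect)
    for t in threads:
        t.join()
    return progress, snapshots


if __name__ == "__main__":
    print(run_demo())
```

Note that no thread is ever stopped from outside here. Each one stops itself at a point where it holds no locks the collector needs, which is why this approach sidesteps the problem from the original post.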

achierius 4 days ago

This is a good approach but can be tricky. What if your thread spends a lot of time in a tight loop, e.g. a big inlined matmul kernel? Since you never hit a function call, you don't get safepoints that way -- you can add them to the back-edge of every loop, but that can be a bit unappetizing from a performance perspective.

chipsa 4 days ago

If you don’t create any GC-able objects in the loop, why would you need to call the GC? And if you do, the allocation should involve a function call.

And if you do need to call the GC, you could manually insert function calls every x loop iterations.
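A rough sketch of the "every x iterations" idea, written as a chunked loop so the hot inner loop contains no poll at all. This is only an illustration in Python: `dot_with_chunked_polls`, `poll`, and `chunk` are hypothetical names, and `poll` stands in for whatever safepoint check the runtime provides.

```python
def dot_with_chunked_polls(xs, ys, poll, chunk=1024):
    """Dot product that reaches a safepoint once per `chunk` iterations."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        # Hot inner loop: no poll here, so no per-iteration overhead.
        for i in range(start, end):
            total += xs[i] * ys[i]
        poll()  # one safepoint per chunk
    return total


if __name__ == "__main__":
    polls = []
    result = dot_with_chunked_polls(
        [1.0] * 5000, [2.0] * 5000, lambda: polls.append(1)
    )
    print(result, len(polls))
```

The trade-off is latency: a stop request has to wait for at most one chunk to finish, so `chunk` bounds how long the collector can be kept waiting by this loop.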

MarkSweep 4 days ago

> suspending threads in your own process is kind of necessary for e.g. many GC algorithms

True. Maybe the more precise rule is “only suspend threads for a short amount of time and don’t acquire any locks while doing it”?

The .NET runtime follows this rule by suspending threads only for a very short time. After suspending a thread, it immediately resumes it if the thread is not running managed code (i.e. it is in some native library or a syscall). If the thread is running managed code, the thread is hijacked by replacing either the instruction pointer or the return address with the address of a function that will wait for the GC to finish. The thread is then immediately resumed. See the details here:

https://github.com/dotnet/runtime/blob/main/docs/design/core...

> Now imagine multiple of those runtimes running in the same process.

Can that possibly reliably work? Sounds messy.