Proper thread placement and numa handling does have a massive impact on modern amd cpus - significantly more so than on Xeon systems. This might be anecdotal, but I’ve seen performance improve by 50% on some real world workloads.
NUMA feels like a really big deal on AMD now.
I recently refactored an evolutionary algorithm from Parallel.ForEach over one gigantic population to an isolated population+simulation per thread. The difference is so dramatic (100x+) that loss of large scale population dynamics seems to be more than offset by the # of iterations you can achieve per unit time.
Communicating information between threads of execution should be assumed to be growing more expensive (in terms of latency) as we head further in this direction. More threads is usually not the answer for most applications. Instead, we need to back up and review just how fast one thread can be when the dependent data is in the right place at the right time.
Yes - I almost view the server as a small cluster in a box, and an internal network with the associated performance impact when you start going out of box
Is cross thread latency more expensive in time, or more expensive relative to things like local core throughput?
Time and throughput are inseparable quantities. I would interpret "local core throughput" as being the subclass of timing concerns wherein everything happens in a smaller physical space.
I think a different way to restate the question would be: What are the categories of problems for which the time it takes to communicate cross-thread more than compensates for the loss of cache locality? How often does it make sense to run each thread ~100x slower so that we can leverage some aggregate state?
The only headline use cases I can come up with for using more than <modest #> of threads is hosting VMs in the cloud and running simulations/rendering in an embarrassingly parallel manner. I don't think gaming benefits much beyond a certain point - humans have their own timing issues. Hosting a web app and ferrying the user's state between 10 different physical cores under an async call stack is likely not the most ideal use of the computational resources, and this scenario will further worsen as inter-thread latency increases.
When I was caring more about hardware configuration on databases in big virtual machine hosts not configuring NUMA was an absolute performance killer, more than 50% performance on almost any hardware because as soon as you left the socket the interconnect suuuuucked.