There are large differences in load/store performance across implementations. On Apple Silicon, for example, a single M1 Max core can stream about 100GB/s all by itself. That's a significant advantage over competing designs that only hit that kind of memory bandwidth with all-cores workloads. For example, five generations of Intel Xeon processors, from Sandy Bridge through Skylake, were built to achieve about 20GB/s of streaming bandwidth from a single core. That's one reason the M1 was so exceptional when it was released: its single-thread (1T) memory performance is much better than what you get from everyone else.
As for claims of the M1 Max having >400GB/s of memory bandwidth: that isn't achievable from the CPUs alone. You need the CPU and GPU running full tilt to hit that limit. In practice you can hit maybe 250GB/s from the CPUs if you bring them all to bear, including the efficiency cores. That's still extremely good performance.
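If you want to sanity-check an aggregate number like that yourself, the usual approach is to give every core its own buffer to stream through and divide total bytes by wall time. Here's a minimal sketch of that, assuming pthreads; the thread count and buffer size are arbitrary choices of mine (not anything from Apple or AnandTech), and the volatile sink is just there to keep compilers from deleting the reads:

// Rough aggregate-bandwidth sketch. Build with something like: cc -O2 stream.c -lpthread
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS 8                     /* arbitrary: set to your core count */
#define WORDS (32u * 1024 * 1024)      /* 256 MB of uint64_t per thread */

static volatile uint64_t sink;         /* keeps the reads from being optimized out */

static void *stream(void *arg) {
    const uint64_t *buf = arg;
    uint64_t acc = 0;
    for (size_t i = 0; i < WORDS; i++)
        acc += buf[i];                 /* plain sequential reads */
    sink = acc;
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    uint64_t *bufs[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        bufs[t] = malloc((size_t)WORDS * sizeof(uint64_t));
        if (!bufs[t]) return 1;
        for (size_t i = 0; i < WORDS; i++)
            bufs[t][i] = i;            /* touch every page before timing */
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, stream, bufs[t]);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gb = (double)NTHREADS * WORDS * sizeof(uint64_t) / 1e9;
    printf("%.1f GB read in %.3f s = %.1f GB/s aggregate\n", gb, secs, gb / secs);
    return 0;
}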
I don't think a single M1 core can do 100GB/s. This source says 68GB/s peak: https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...
That's the plain M1. The Max can do a bit more. Same site, since you favor it: https://www.anandtech.com/show/17024/apple-m1-max-performanc...
> From a single core perspective, meaning from a single software thread, things are quite impressive for the chip, as it’s able to stress the memory fabric to up to 102GB/s. This is extremely impressive and outperforms any other design in the industry by multiple factors, we had already noted that the M1 chip was able to fully saturate its memory bandwidth with a single core and that the bottleneck had been on the DRAM itself. On the M1 Max, it seems that we’re hitting the limit of what a core can do – or more precisely, a limit to what the CPU cluster can do.
Wow
Btw, what's about as important is that in practice you don't need to write super clever code to do that: the 68GB/s is easy to reach with textbook code, no cleverness required.
68 GB/s of memory read/write can easily be reached (assuming the memory bandwidth is there to reach it) on any current architecture by running a basic loop adding 64-bit scalars. What could be less clever than that?
Needs to be more than one accumulator.
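(Spelled out: with a single accumulator every add depends on the previous one, so the loop is bound by add latency rather than by how fast memory can feed the core. A minimal sketch of the difference; the four-way unroll is an arbitrary illustrative choice, not a claim about what any particular compiler emits:)

#include <stdint.h>
#include <stddef.h>

/* Single accumulator: every add waits on the previous one, so throughput
   is capped by add latency, not by how fast memory can stream. */
uint64_t sum_one(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: the dependency chains run in parallel,
   so loads can keep streaming and the loop gets much closer to being
   memory-bandwidth bound. Compilers often do this for you once
   unrolling/vectorization is enabled. */
uint64_t sum_four(const uint64_t *a, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}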
I mean:
const uint64_t size = /* some large value */;
uint64_t *a = /* heap buffer of `size` random values */;
uint64_t *b = /* heap buffer of `size` random values */;
uint64_t *c = calloc(size, sizeof(uint64_t));
uint64_t i = 0;
while (i < size) {
    c[i] = a[i] + b[i];
    i++;
}
// Disable all optimizations so the above isn't optimized away/vectorized
That's the world's simplest loop, with 16 bytes of memory read per iteration, so even if your core is a piece of crap that averages a single increment and addition per cycle it only needs to run at ~4.3 GHz to pass the bar anyway. Running this code on my MacBook and my x86 desktop with compiler optimizations off, I'm not seeing either fail to reach 64 GB/s.
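For reference, a rough timing harness around the same loop might look like the sketch below. The array size is an arbitrary choice of mine, and it only counts the 16 bytes of loads per iteration, so add the 8-byte store if you want total traffic rather than just reads. Build with optimizations off to match the setup above.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const size_t size = 64u * 1024 * 1024;      /* arbitrary: 512 MB per array */
    uint64_t *a = malloc(size * sizeof *a);
    uint64_t *b = malloc(size * sizeof *b);
    uint64_t *c = malloc(size * sizeof *c);
    if (!a || !b || !c) return 1;
    for (size_t i = 0; i < size; i++) {          /* fill and touch every page */
        a[i] = i;
        b[i] = i * 3;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < size; i++)
        c[i] = a[i] + b[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double read_gb = 16.0 * size / 1e9;          /* two 8-byte loads per iteration */
    printf("%.2f GB read in %.3f s = %.1f GB/s (c[1] = %llu)\n",
           read_gb, secs, read_gb / secs, (unsigned long long)c[1]);
    free(a); free(b); free(c);
    return 0;
}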