This has puzzled me for a while. The cited system has 2x89.6 GB/s bandwidth. But a single CCD can do at most 64GB/s of sequential reads. Are claims like "Apple Silicon having 400GB/s" meaningless? I understand a typical single logical CPU can't do more than 50-70GB/s, and it seems like a group of CPUs typically shares a memory controller which is similarly limited.
To rephrase: is it possible to cause 100% memory bandwidth utilization with only 1 or 2 CPUs doing the work per CCD?
On Zen 3, I am able to use nearly the full 51.2GB/sec from a single CPU core. I have not tried using two as I got so close to 51.2GB/sec that I had assumed that going higher was not possible. Off the top of my head, I got 49-50GB/sec, but I last measured a couple years ago.
By the way, if the cores were able to load things at full speed, they would be able to use 640GB/sec each. That is 2 AVX-512 loads per cycle at 5GHz. Of course, they never are able to do this due to memory bottlenecks. Maybe Intel’s Xeon Max series with HBM can, but I would not be surprised to see an unadvertised internal bottleneck there too. That said, it is so expensive and rare that few people will ever run code on one.
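To spell out the arithmetic behind both figures (the 51.2GB/sec ceiling above, which is presumably dual-channel DDR4-3200, and the hypothetical 640GB/sec per-core load rate), here is a throwaway C calculation; the 5GHz clock and two-loads-per-cycle numbers are just the assumptions from the paragraph above:

#include <stdio.h>

int main(void) {
    // The 51.2GB/sec figure above: presumably dual-channel DDR4-3200,
    // i.e. 3200 MT/s * 2 channels * 8 bytes per transfer.
    double ddr4 = 3200e6 * 2 * 8;   // 51.2e9 bytes/sec

    // Hypothetical per-core load ceiling: two 64-byte AVX-512 loads
    // per cycle at a 5GHz clock.
    double core = 2 * 64 * 5e9;     // 640e9 bytes/sec

    printf("DDR4-3200 dual channel: %.1f GB/s\n", ddr4 / 1e9);
    printf("per-core load ceiling:  %.1f GB/s\n", core / 1e9);
    return 0;
}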
People have studied the Xeon Max! Spoiler: yes, it's limited to ~23GB/s per core. It can't achieve anywhere close to the theoretical bandwidth of the HBM even with all cores active. It's a pretty bad design in my opinion.
https://www.ixpug.org/images/docs/ISC23/McCalpin_SPR_BW_limi...
Its overall total bandwidth is still an integer factor better than DDR5 SPR; I think they went for minimal investment and time to market for the SPR-with-HBM product rather than heavy investment to hit full bandwidth utilization. Which may have made sense for Intel overall, given the business context.
There are large differences in load/store performance across implementations. On Apple Silicon, for example, a single M1 Max core can stream about 100GB/s all by itself. This is a significant advantage over competing designs that are built to hit that kind of memory bandwidth only with all-cores workloads. For example, five generations of Intel Xeon processors, from Sandy Bridge through Skylake, were built to achieve about 20GB/s streams from a single core. That is one reason why the M1 was so exceptional at the time it was released. The 1T memory performance is much better than what you get from everyone else.
As far as claims of the M1 Max having > 400GB/s of memory bandwidth, this isn't achievable from CPUs alone. You need all CPUs and GPUs running full tilt to hit that limit. In practice you can hit maybe 250GB/s from CPUs if you bring them all to bear, including the efficiency cores. This is still extremely good performance.
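For what it's worth, figures like that ~250GB/s all-cores number come from measurements along these lines. A rough sketch, not a rigorous benchmark: the thread count and buffer size are arbitrary assumptions, there's no thread pinning, and the reads are plain scalar sums (build with something like cc -O2 -pthread):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NTHREADS 8                        // assumption: set to the core count
#define BUF_BYTES (256ull * 1024 * 1024)  // 256 MiB per thread

// Each thread streams (reads) through its own buffer once.
static void *stream(void *arg) {
    uint64_t *buf = arg;
    uint64_t sum = 0;
    for (size_t i = 0; i < BUF_BYTES / sizeof(uint64_t); i++)
        sum += buf[i];
    return (void *)(uintptr_t)sum;        // keep the reads from being elided
}

int main(void) {
    pthread_t tid[NTHREADS];
    void *bufs[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        bufs[t] = malloc(BUF_BYTES);
        memset(bufs[t], 1, BUF_BYTES);    // fault the pages in before timing
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, stream, bufs[t]);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("aggregate read bandwidth: ~%.1f GB/s\n",
           (double)NTHREADS * BUF_BYTES / secs / 1e9);
    return 0;
}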
I don't think a single M1 CPU core can do 100GB/s. This source says 68GB/s peak: https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...
That's the plain M1. The Max can do a bit more. Same site since you favor it: https://www.anandtech.com/show/17024/apple-m1-max-performanc...
> From a single core perspective, meaning from a single software thread, things are quite impressive for the chip, as it’s able to stress the memory fabric to up to 102GB/s. This is extremely impressive and outperforms any other design in the industry by multiple factors, we had already noted that the M1 chip was able to fully saturate its memory bandwidth with a single core and that the bottleneck had been on the DRAM itself. On the M1 Max, it seems that we’re hitting the limit of what a core can do – or more precisely, a limit to what the CPU cluster can do.
Wow
BTW, what's about as important is that in practice you don't need to write super clever code to do that: these 68GB/s are easy to reach with textbook code.
68 GB/s of memory read/write can be easily reached (assuming the memory bandwidth is there to reach in the first place) on any current architecture by running a basic loop adding 64-bit scalars. What could be even less clever than that?
Needs to be more than one accumulator.
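For example, something along these lines for a pure read-and-sum loop (just a sketch; `a` and `size` are the same hypothetical array and length as in the snippet below). With a single accumulator every add waits on the previous one; independent partial sums let the loads run ahead:

// Four independent partial sums so the loads aren't serialized
// behind a single add's dependency chain.
uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for (uint64_t i = 0; i + 4 <= size; i += 4) {
    s0 += a[i + 0];
    s1 += a[i + 1];
    s2 += a[i + 2];
    s3 += a[i + 3];
}
uint64_t total = s0 + s1 + s2 + s3; // consume the result so the loop isn't elided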
I mean:
const uint64_t size = /* some large value */;
uint64_t *a = malloc(size * sizeof *a); // fill with random values
uint64_t *b = malloc(size * sizeof *b); // fill with random values
uint64_t *c = calloc(size, sizeof *c);
uint64_t i = 0;
while (i < size) {
    c[i] = a[i] + b[i];
    i++;
}
// Build with all optimizations disabled so the above isn't optimized away/vectorized
That's the world's simplest loop, with 16 bytes of memory read per iteration, so even if your core is a piece of crap that averages a single increment and addition per cycle, it just needs to run at ~4.3GHz to still pass the bar anyways. Running this code on my MacBook and my x86 desktop with compiler optimizations off, I'm not seeing either fail to reach 64GB/s. Aren't those 400GB/s a figure that only applies when the GPU, with its much wider interface, is accessing the memory?
That figure is at the memory controller.
It applies as a maximum speed limit all the time, but it's unlikely that a CPU would cause the memory controller to reach it. Why it's important is that it causes increased latency whenever other bus controllers are competing for bandwidth, but I don't think Apple has documented their internal bus architecture or the performance counters you'd need to observe that.
Another POV is that maybe the max memory bandwidth figure is too vague to guide people optimizing libraries. It would be nice if Apple Silicon were as fast as "400GB/s" sounds. Grounded closer to reality, the parts are 65W.
> The cited system has 2x89.6 GB/s bandwidth.
The following applies for certain only to the Zen4 system; I have no experience with Zen5.
That is the theoretical max bandwidth of the DDR5 memory (/controller) running at 5600MT/s (5600MT/s × 2 channels × 64 bits/transfer ÷ 8 bits/byte = 89.6GB/s). There is also a bandwidth limitation between the memory controller (IO die) and the cores themselves (CCDs), along the Infinity Fabric. Infinity Fabric runs at a different clock speed than the cores, their cache(s), and the memory controller; by default, 2/3 of the memory controller clock. So, if the Memory controller's CLocK (MCLK) is 2800MHz (for 5600MT/s), the FCLK (Infinity Fabric CLocK) will run at 1866.66MHz. With 32 bytes of read bandwidth per clock, you get 59.7GB/s maximum sequential memory read bandwidth per CCD<->IOD interconnect.
Many systems (read: motherboard manufacturers) will overclock the FCLK when applying automatic overclocking, such as when selecting XMP/EXPO profiles, and I believe some EXPO profiles include an FCLK overclock themselves. (Note that 5600MT/s RAM is overclocked; the fastest officially supported Zen4 memory speed is 5200MT/s, and most memory kits run at their slower JEDEC defaults, typically 4800MT/s, until overclocked with their built-in profiles.) In my experience, Zen4 will happily accept FCLK up to 2000MHz, while Zen4 Threadripper (7000 series) seems happy up to 2200MHz. This particular system has the FCLK overclocked to 2000MHz, which will hurt latency[0] (due to not being 2/3 of MCLK) but increase bandwidth. 2000MHz × 32 bytes/cycle = 64GB/s read bandwidth, as quoted in the article.
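Putting the arithmetic from the last two paragraphs in one place, as a throwaway C calculation (the 2/3 FCLK ratio and 32 bytes/FCLK-cycle figures are the ones stated above):

#include <stdio.h>

int main(void) {
    // DDR5-5600, dual channel: 5600 MT/s * 2 channels * 64 bits / 8 bits per byte
    double dram      = 5600e6 * 2 * 64 / 8;       // 89.6 GB/s

    // Stock: FCLK = 2/3 of MCLK (2800 MHz) = 1866.66 MHz, 32 bytes read per cycle
    double fclk      = (5600e6 / 2) * 2.0 / 3.0;
    double ccd_stock = fclk * 32;                 // ~59.7 GB/s per CCD

    // FCLK overclocked to 2000 MHz, as on the system in the article
    double ccd_oc    = 2000e6 * 32;               // 64 GB/s per CCD

    printf("DDR5-5600 dual channel:   %.1f GB/s\n", dram / 1e9);
    printf("per-CCD read @ FCLK 1867: %.1f GB/s\n", ccd_stock / 1e9);
    printf("per-CCD read @ FCLK 2000: %.1f GB/s\n", ccd_oc / 1e9);
    return 0;
}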
First: these are theoretical maximums. Even the most "perfect" benchmark won't hit these, and if it does, there are other variables at play not being taken into account (likely lower-level caches). You will never, ever see the theoretical maximum memory bandwidth in any real application.
Second: no, it is not possible to see maximum memory bandwidth on Zen4 from only one CCD, assuming you have sufficiently fast DDR5 that the FCLK cannot be equal to the MCLK. This is an architectural limitation, although rarely hit in practice for most of the target market. A dual-CCD chip has sufficient Infinity Fabric bandwidth to saturate the memory before the fabric becomes the bottleneck (but as alluded to in the article, unless tuned incredibly well, you'll likely run into contention issues and hit either a latency or bandwidth wall in real applications). My quad-CCD Threadripper can achieve nearly 300GB/s, due to having 8 (technically 16) DDR5 channels operating at 5800MT/s and FCLK at 2200MHz; I would need an octo-CCD chip to achieve maximum memory bandwidth utilization.
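Same arithmetic for that quad-CCD Threadripper (8 channels at 5800MT/s, FCLK at 2200MHz), showing why the Infinity Fabric side rather than the DRAM is the cap:

#include <stdio.h>

int main(void) {
    // 8 DDR5 channels at 5800 MT/s, 8 bytes per channel per transfer
    double dram   = 5800e6 * 8 * 8;      // 371.2 GB/s theoretical DRAM bandwidth

    // 4 CCDs, each reading 32 bytes per FCLK cycle at 2200 MHz
    double fabric = 4 * 2200e6 * 32.0;   // 281.6 GB/s, i.e. "nearly 300 GB/s"

    printf("DRAM side:   %.1f GB/s\n", dram / 1e9);
    printf("Fabric side: %.1f GB/s\n", fabric / 1e9);
    return 0;
}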
Third: no, claims like "Apple Silicon having 400GB/s" are not meaningless. Those numbers are arrived at the exact same way as above, and the same way Nvidia determines the maximum memory bandwidth of their GPUs. Platform differences (especially CPU vs GPU, but even CPU vs CPU, since Apple, AMD, and Intel all have very different topologies) make the numbers incomparable to each other directly. As an example, Apple Silicon can probably achieve higher per-core memory bandwidth than Zen4 (or 5), but also shares bandwidth with the GPU; this may not be great for gaming, for instance, where memory bandwidth requirements are high for both the CPU and GPU, but may be fine for ML inference, since the CPU sits mostly idle while the GPU does most of the work.
[0] I'm surprised the author didn't mention this. I can only assume they didn't know it, and haven't tested other frequencies or read much on the overclocking forums about Zen4. Which is fair enough; it's a very complicated topic with a lot of hidden nuances.
> Note that 5600MT/s RAM is overclocked; the fastest officially supported Zen4 memory speed is 5200MT/s
This specifically did change in Zen 5: the maximum supported speed is now 5600MT/s.
Easily; the memory subsystem on AMD's consumer parts is embarrassingly weak (as it is on desktop and portable consumer devices in general, save for Apple's and select bespoke designs).