btw, what's about as important is that in practice you don't need to write super clever code to do that; 68 GB/s is easy to reach with textbook code, no cleverness required
68 GB/s of memory read/write can easily be reached (assuming the hardware has that much memory bandwidth in the first place) on any current architecture by running a basic loop adding 64-bit scalars. What could be less clever than that?
Needs to be more than one accumulator.
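(For a sum-style reduction, that looks roughly like the sketch below; a minimal illustration, not from the thread. One accumulator serializes every add on the previous one, while several independent accumulators let the loads keep streaming. The names and the unroll factor of 4 are arbitrary, and it assumes a uint64_t array a of length size, with size a multiple of 4.)

// One accumulator chains every add on the previous result, so the loop
// is bound by add latency rather than memory bandwidth. Independent
// accumulators break that dependency chain.
uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for (uint64_t i = 0; i < size; i += 4) {
    s0 += a[i];
    s1 += a[i + 1];
    s2 += a[i + 2];
    s3 += a[i + 3];
}
uint64_t sum = s0 + s1 + s2 + s3;

(This matters most once the adds are vectorized or floating point, where add latency is several cycles; a scalar 64-bit integer add typically has single-cycle latency.)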
I mean:
const uint64_t size = 1ull << 25; // some large value
uint64_t *a = malloc(size * sizeof *a); // fill with random values
uint64_t *b = malloc(size * sizeof *b); // fill with random values
uint64_t *c = calloc(size, sizeof *c); // zero-initialized
uint64_t i = 0;
while (i < size) {
    c[i] = a[i] + b[i];
    i++;
}
// Compile with all optimizations disabled so the above isn't optimized away/vectorized
That's the world's simplest loop, with 16 bytes of memory read per iteration, so even if your core is a piece of crap that averages a single increment and addition per cycle, it just needs to run at ~4.3 GHz to pass the bar anyways. Running this code on my MacBook and on my x86 desktop with compiler optimizations off, I'm not seeing either fail to reach 68 GB/s.
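If you want to try this yourself, here's a minimal self-contained harness along those lines (my own sketch, not the exact program above; the array size and fill values are arbitrary) that times the loop and prints the effective read bandwidth:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const uint64_t size = 1ull << 25; // 32 Mi elements, 256 MiB per array
    uint64_t *a = malloc(size * sizeof *a);
    uint64_t *b = malloc(size * sizeof *b);
    uint64_t *c = malloc(size * sizeof *c);
    if (!a || !b || !c) return 1;
    for (uint64_t i = 0; i < size; i++) { // cheap stand-in for random data
        a[i] = i;
        b[i] = i * 7;
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < size; i++)
        c[i] = a[i] + b[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    // 16 bytes read per element; add another 8 if you count the store to c
    printf("read bandwidth: %.1f GB/s (c[0]=%llu)\n",
           16.0 * size / secs / 1e9, (unsigned long long)c[0]);
    free(a); free(b); free(c);
    return 0;
}

Compile with cc -O0 bench.c && ./a.out. With optimizations on, you'd want the result to be used somewhere (the printf of c[0] helps), or the compiler may vectorize the loop or drop it entirely.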