
Earlier this year, I took a month to reexamine my coding habits and rethink some past design choices. I hope to rewrite and improve my FOSS libraries this year, and I needed answers to a few questions first. Perhaps some of these questions will resonate with others in the community, too.

  - Are coroutines viable for high-performance work?
  - Should I use SIMD intrinsics for clarity or drop to assembly for easier library distribution?
  - Has hardware caught up with vectorized scatter/gather in AVX-512 & SVE?
  - How do secure enclaves & pointer tagging differ on Intel, Arm, & AMD?
  - What's the throughput gap between CPU and GPU Tensor Cores (TCs)?
  - How costly are misaligned memory accesses & split-loads, and what gains do non-temporal loads/stores offer?
  - Which parts of the standard library hit performance hardest?
  - How do error-handling strategies compare overhead-wise?
  - What's the compile-time vs. run-time trade-off for lazily evaluated ranges?
  - What practical, non-trivial use cases exist for meta-programming?
  - How challenging is Linux kernel bypass with io_uring vs. POSIX sockets?
  - How close are we to effectively using Networking TS or heterogeneous Executors in C++?
  - What are best practices for propagating stateful allocators in nested containers, and which libraries support them?

These questions span from micro-kernel optimizations (nanoseconds) to distributed systems (micro- and millisecond latencies). Rather than tackling them all in one post, I compiled my explorations into a repository, extending my previous Google Benchmark tutorial (<https://ashvardanian.com/posts/google-benchmark>), to serve as a sandbox for performance experimentation.

Some fun observations:

  - Compilers now auto-vectorize 3x3x3 and 4x4x4 single- and double-precision matrix multiplications remarkably well! The smaller kernel is only ~60% slower than the larger one despite ~70% fewer operations, and the compiler-generated code outperforms my vanilla SSE/AVX while coming within 10% of AVX-512 (see the matmul sketch after this list).
  - Nvidia TCs vary dramatically across generations in supported numeric types, throughput, tile shapes, thread synchronization (thread / quad-pair / warp / warp-group), and operand storage. Post-Volta, manual PTX is often needed, as the intrinsics lag behind the hardware, though the new TileIR (introduced at GTC) promises improvements for dense linear-algebra kernels.
  - The AI wave is driving CPUs and GPUs to converge in mat-mul throughput & programming complexity. It took me a day to debug AMX TMM register initialization, and Arm's SME is equally odd. Sierra Forest packs 288 cores per socket, and AVX10.2 drops the 256-bit-only option in favor of mandatory 512-bit support... I wonder if discrete Intel GPUs are even needed, given these CPU advances?
  - In common floating-point ranges, scalar sine approximations can be up to 40x faster than standard-library implementations, even without SIMD (a toy version is sketched below). It's a bit hand-wavy, though; I wish more projects documented their error bounds and shipped 1 & 3.5 ULP variants like Sleef does.
  - Meta-programming tools like CTRE can outperform typical run-time RegEx engines by 5x and simplify building parsers compared to hand-crafted FSMs (see the CTRE example below).
  - Kernel bypass (DPDK/SPDK) and io_uring were once clearly distinct in complexity and performance, but the gap is narrowing: even pre-5.5 io_uring can boost UDP throughput by 4x on loopback IO, though its newer zero-copy and concurrency optimizations remain challenging to use (a minimal liburing sketch follows below).
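
To ground the first observation, here's the shape of kernel I mean: a minimal sketch in plain scalar C++ (the function name and row-major layout are my own, not code from the repo) that recent GCC and Clang fully vectorize at `-O3`:

```cpp
#include <array>
#include <cstdio>

using mat4 = std::array<float, 16>; // row-major 4x4

// Plain scalar loops: modern compilers auto-vectorize this at -O3,
// no intrinsics required (inspect with -fopt-info-vec or on godbolt.org).
mat4 matmul4x4(mat4 const &a, mat4 const &b) noexcept {
    mat4 c {};
    for (int i = 0; i != 4; ++i)
        for (int k = 0; k != 4; ++k)     // i-k-j loop order keeps the innermost
            for (int j = 0; j != 4; ++j) // loop contiguous and SIMD-friendly
                c[i * 4 + j] += a[i * 4 + k] * b[k * 4 + j];
    return c;
}

int main() {
    mat4 identity {};
    for (int i = 0; i != 4; ++i) identity[i * 4 + i] = 1.f;
    mat4 r = matmul4x4(identity, identity);
    std::printf("r[0][0] = %.1f\n", r[0]); // prints 1.0
}
```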
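
For the sine observation, here's a toy version of the idea, assuming inputs stay in a narrow range like [-π/4, π/4]. The coefficients are plain Taylor terms for readability; production libraries like Sleef use minimax polynomials plus range reduction to hit their documented ULP bounds:

```cpp
#include <cmath>
#include <cstdio>

// Odd 7th-order polynomial evaluated with Horner's scheme: a handful of
// multiply-adds, no table lookups, no branches, trivially inlinable.
inline float sin_poly(float x) noexcept {
    float x2 = x * x;
    return x * (1.f + x2 * (-1.f / 6.f + x2 * (1.f / 120.f + x2 * (-1.f / 5040.f))));
}

int main() {
    for (float x = -0.75f; x <= 0.80f; x += 0.25f)
        std::printf("x = %+.2f  poly = %+.6f  libm = %+.6f\n",
                    x, sin_poly(x), std::sin(x));
}
```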
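
The CTRE point in code, assuming the single-header `ctre.hpp` and a C++20 compiler: the pattern is a template argument, so it's parsed and compiled into a matcher at build time, with no run-time NFA construction or interpretation as in `std::regex`:

```cpp
#include <ctre.hpp> // https://github.com/hanickadot/compile-time-regular-expressions
#include <cstdio>
#include <string_view>

// The regex lives in the type system: parsed and optimized at compile time.
constexpr bool is_iso_date(std::string_view s) noexcept {
    return static_cast<bool>(ctre::match<R"(\d{4}-\d{2}-\d{2})">(s));
}

static_assert(is_iso_date("2025-01-31"));  // matching even works in constant expressions
static_assert(!is_iso_date("31/01/2025"));

int main() {
    std::printf("%s\n", is_iso_date("1999-12-31") ? "match" : "no match");
}
```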
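
And for the io_uring comparison, a minimal liburing sketch of the submit/complete flow that replaces a blocking read() syscall (Linux 5.6+ for IORING_OP_READ, link with -luring; error handling omitted for brevity). The networking variants swap io_uring_prep_read for io_uring_prep_recv/io_uring_prep_send on a socket:

```cpp
#include <liburing.h> // e.g., apt install liburing-dev; link with -luring
#include <fcntl.h>
#include <cstdio>

int main() {
    io_uring ring;
    io_uring_queue_init(8, &ring, 0);            // shared SQ/CQ rings, 8 entries

    int fd = open("/etc/hostname", O_RDONLY);
    char buf[256];

    io_uring_sqe *sqe = io_uring_get_sqe(&ring); // grab a submission slot
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);                      // one syscall submits the whole batch

    io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);              // block until a completion lands
    std::printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);               // mark the CQE as consumed

    io_uring_queue_exit(&ring);
}
```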

The repository is loaded with links to my favorite CppCon lectures, GitHub snippets, and tech blog posts. Recognizing that many high-level concepts are handled differently across languages, I've also started porting examples to Rust & Python in separate repos. Coroutines look bad everywhere :(

Overall, this research project was rewarding! Most questions found answers in code, except pointer tagging and secure enclaves, which still elude me in the public cloud. I'd love to hear from others, especially on comparing High-Level Synthesis for small matrix multiplications on FPGAs against hand-written VHDL/Verilog for integral types. Let me know if you have ideas for other cool, obscure topics to cover!
