sitkack 5 hours ago

Wonderful!

The edge cases are the gold, btw; collect the whole set and keep them in a human- and machine-readable format.
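
Even an append-only JSONL file goes a long way. Rough sketch (field names made up, adapt to whatever you actually capture):

    import datetime
    import json

    def log_edge_case(path, component, symptom, detection, resolution):
        record = {
            "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "component": component,    # e.g. "rack7/node42/eth2"
            "symptom": symptom,        # what you observed
            "detection": detection,    # how it was caught (alert, benchmark diff, customer report)
            "resolution": resolution,  # what actually fixed it
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")

    log_edge_case("edge_cases.jsonl", "rack7/node42/eth2",
                  "intermittent 30% throughput drop under all-to-all load",
                  "nightly fabric sweep regression",
                  "reseated DAC cable")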

I'd also go through and, using a color-coded set of cables, insert known-bad cables (one at a time at first) while the system is running an aggressive all-to-all workload, and see how quickly you can identify the faults.
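
A crude way to spot the bad link during that exercise is an all-pairs bandwidth sweep. Rough sketch, assuming iperf3 servers are already running on every host (hostnames and threshold are made up):

    import itertools
    import json
    import subprocess

    HOSTS = ["node01", "node02", "node03", "node04"]
    EXPECTED_GBPS = 90.0   # roughly line rate for a 100G NIC minus overhead

    def measure(src, dst):
        # run iperf3 on src against dst and pull the received rate out of the JSON report
        out = subprocess.run(["ssh", src, "iperf3", "-c", dst, "-t", "5", "-J"],
                             capture_output=True, text=True, check=True).stdout
        return json.loads(out)["end"]["sum_received"]["bits_per_second"] / 1e9

    for src, dst in itertools.permutations(HOSTS, 2):
        gbps = measure(src, dst)
        if gbps < 0.8 * EXPECTED_GBPS:
            print(f"SUSPECT LINK {src} -> {dst}: {gbps:.1f} Gbit/s")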

It is the gray failures that will bring the system down, often several at once: a single failure can go undetected for months and then finally tip the system past an inflection point later on.

Are your workloads ephemeral and/or do they live migrate? Or will physical hosts have long uptimes? It is nice to be able to rebaseline the hardware before and after host kernel upgrades so you can detect any anomalies.
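
Something like this run at boot, keyed by kernel version, with your real suite swapped in for the toy benchmark (paths and thresholds made up):

    import json
    import pathlib
    import platform
    import time

    BASELINE = pathlib.Path("/var/lib/perf-baseline.json")

    def toy_benchmark():
        # stand-in for a real suite (fio, STREAM, stress-ng, ...); returns seconds
        start = time.perf_counter()
        x = 0
        for i in range(10_000_000):
            x += i * i
        return time.perf_counter() - start

    kernel = platform.release()
    score = min(toy_benchmark() for _ in range(5))   # best of 5 runs

    history = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    previous = list(history.values())[-1] if history else None
    if previous and abs(score - previous) / previous > 0.05:
        print(f"ANOMALY on {platform.node()}: {previous:.3f}s -> {score:.3f}s on kernel {kernel}")
    history[kernel] = score
    BASELINE.write_text(json.dumps(history, indent=2))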

You would be surprised how large a systemic performance degradation major cloud providers have seen creep in over months because "all machines are the same": high precision but low absolute accuracy. It is nice to run the same benchmarks on bare metal and then again under virtualization.

I am sure you know, but you are running a multivariate longitudinal experiment, science the shit out of it.

ca508 5 hours ago

Long-running hosts at the moment, but we can drain most workloads off a specific host/rack if required and reschedule them pretty fast. We have the advantage of a custom scheduler/orchestrator we've been working on for years, so we have a lot more control on that layer than we would with Kube or Nomad.

Re: Live Migration We're working on adding Live Migration support to our orchestrator atm. We aim to have it running this quarter. That'll make things super seamless.

Re: kernels We've already seen some perf improvements somewhere between 6.0 and 6.5 (I forget the exact reason/version) - but it was some fix specific to the Sapphire Rapids CPUs we had. But I wish we had more time to science on it, it's really fun playing with all the knobs and benchmarking stuff. Some of the telemetry on the new CPUs is also crazy - there's stuff like Intel PCM that can pull super fine-grained telemetry direct from the CPU/chipset https://github.com/intel/pcm. Only used it to confirm that we got NUMA affinity right so far - nothing crazy.
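
For the NUMA affinity check, the lowest-tech version is just comparing a local-pinned vs cross-node-pinned run of something memory-bound. Rough sketch (./membench is a stand-in, not our actual tooling):

    import subprocess
    import time

    CMD = ["./membench"]   # stand-in for a memory-bound workload

    def timed(numa_args):
        start = time.perf_counter()
        subprocess.run(["numactl", *numa_args, *CMD], check=True)
        return time.perf_counter() - start

    local = timed(["--cpunodebind=0", "--membind=0"])
    remote = timed(["--cpunodebind=0", "--membind=1"])
    # if the cross-node run isn't noticeably slower, the pinning probably isn't taking effect
    print(f"local {local:.2f}s vs remote {remote:.2f}s ({remote / local:.2f}x penalty)")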

sitkack 4 hours ago

Last thing.

You will need a way to coordinate LM with users, because some workloads are sensitive to LM blackouts. Not many are, but the ones that are are exactly the kinds of things customers will just leave over.

If you are draining a host, make sure new VMs land on hosts that can be guaranteed to be maintenance-free for the next x days (see the sketch after these points). This allows customers to restart their workloads on their own schedule with a guarantee that they won't be impacted. It also encourages good hygiene.

Allow customers to trigger migration.

Charge extra for a long-running, maintenance-free host.
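
Roughly what I mean by the placement rule above (sketch, field names made up):

    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from typing import Optional

    @dataclass
    class Host:
        name: str
        draining: bool
        next_maintenance: Optional[datetime]   # None = nothing scheduled

    def placement_candidates(hosts, min_free_days=30):
        # only place new VMs on hosts guaranteed quiet for the next min_free_days
        horizon = datetime.utcnow() + timedelta(days=min_free_days)
        return [h for h in hosts
                if not h.draining
                and (h.next_maintenance is None or h.next_maintenance > horizon)]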

It is good you are hooked into PCM already. You will experience accidentally antagonistic workloads, and PCM will really help debug those issues.

If I were building a DC, I'd put as many NICs into a host as possible and use SR-IOV to pass the NICs into the guests. The switches should be sized to allow full speed on all NICs. I know it sounds crazy, but if you design for a typical CRUD serving tree, you are saving a buck while making your software problem 100x harder.
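
Carving up the NICs is mostly a sysfs exercise. Rough sketch (interface names and VF counts made up; the sysfs paths are the standard kernel SR-IOV interface):

    import pathlib

    def enable_vfs(ifname, count):
        dev = pathlib.Path(f"/sys/class/net/{ifname}/device")
        total = int((dev / "sriov_totalvfs").read_text())
        if count > total:
            raise ValueError(f"{ifname} only supports {total} VFs")
        (dev / "sriov_numvfs").write_text("0")          # reset before changing the count
        (dev / "sriov_numvfs").write_text(str(count))

    for nic in ["enp65s0f0", "enp65s0f1"]:
        enable_vfs(nic, 8)
    # then hand each VF to a guest via vfio-pci / a libvirt hostdev entry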

Everything should have enough headroom so it never hits the knee of a contention curve.
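
Back-of-the-envelope for why: even in the simplest queueing model (M/M/1), mean latency is service_time / (1 - utilization), so it blows up as you approach saturation:

    service_time_us = 10.0
    for util in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
        latency = service_time_us / (1 - util)
        print(f"{util:.0%} utilized -> ~{latency:.0f} us mean latency")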