In modern times, we have taken the "everything breaks all the time, so make redundancy and failover cheap/free" approach.
VMS (and the hardware it runs on) takes the opposite approach: keep everything alive forever, even through hardware failures.
So the VMS machines of the day had dual-redundant everything, including interconnected memory across machines and SCSI interconnects; everything you could think of was redundant.
VMS clusters could be configured in a hot/hot standby setup, where 2 identical cabinets full of redundant hardware could fail over mid-instruction and keep going. You can't do that with the modern approach. The documentation filled almost an entire wall of office bookcases. There was a lot of documentation.
These days, usually nothing is redundant inside the box; instead we duplicate the boxes and treat them as cheap sheep, a dime a dozen.
Which approach is better? That's a great question. I'm not aware of any academic exercises on the topic.
All that said, most people don't need decade-long uptimes. Even the big clouds don't bother trying to get to decade-long uptimes, as they regularly have outages.
One of the things that blew my mind in my early career was seeing my mentor open the side of a VMS machine (I can’t remember the hardware model, sorry), slide out a giant board of RAM, slide in another board of the same physical size but with a CPU on it, and then enable the CPU.
The daughter boards in that machine could carry RAM or CPUs in the same slot, and they were swappable without reboots!
Exactly! One would never, ever do that with x86.
You have not seen it, but vendors have been selling such stuff for ~20 years. Google for linux + hardware + cpu hotplug or memory hotplug. The PCI bus helps here.
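The software half of this is visible on plain Linux: the kernel exposes logical CPU hotplug through sysfs. Below is a minimal sketch, assuming root and a kernel built with CONFIG_HOTPLUG_CPU; the CPU number is just an example, and this only soft-offlines/onlines a logical CPU the kernel already knows about. Physically hot-adding a CPU or DIMM additionally needs platform/ACPI support.

    # Minimal sketch of the Linux CPU hotplug sysfs interface (run as root).
    # Writing 0/1 to /sys/devices/system/cpu/cpuN/online offlines/onlines a
    # logical CPU; cpu0 usually has no 'online' file and cannot be removed.
    from pathlib import Path

    def set_cpu_online(cpu: int, online: bool) -> None:
        # The kernel reacts immediately; no reboot involved.
        Path(f"/sys/devices/system/cpu/cpu{cpu}/online").write_text("1" if online else "0")

    def cpu_is_online(cpu: int) -> bool:
        return Path(f"/sys/devices/system/cpu/cpu{cpu}/online").read_text().strip() == "1"

    set_cpu_online(1, False)                 # soft-offline logical CPU 1
    print("cpu1 online?", cpu_is_online(1))
    set_cpu_online(1, True)                  # bring it back without a reboot

Hot-added memory shows up the same way, as blocks under /sys/devices/system/memory/memoryN/ with a similar online file.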
I have seen it. Yes, they technically exist. Nobody buys them though.
They are ridiculously expensive. Their use-case in modern compute is a rounding error towards zero. We just don't build computers like that anymore, for good reason: memory and CPUs rarely fail, and when they do, they take down the whole box and you just replace it. In 99.99% of all cases it's cheaper and easier to do it that way.
There are vanishingly few use-cases where it makes sense to hotplug CPUs/memory, and vendors charge accordingly.
Like I said in my parent comment, virtually nobody needs uptimes measured in literal decades. If you are in the 0.01% (rounded up) of compute that actually needs that, the chances of needing to do it with x86 are even smaller.
One example is the VISA and Mastercard payment processing platforms. The way they are designed requires 24/7 operation and literal decades of uptime. When they have partial outages, they make international headlines and end up writing letters like this: https://www.parliament.uk/globalassets/documents/commons-com...