That only works for inference, not training.
Why so?
Because training usually requires bigger batches, a backward pass on top of the forward pass, optimizer states kept in memory, and so on. That means it needs far more GPU memory than inference, so much more that you can't fit it on a single GPU.
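To make that concrete, here's a rough back-of-envelope sketch (the per-parameter byte counts are assumptions for mixed-precision training with Adam, and activations aren't even counted) of how the memory compares:

```python
# Rough memory estimate for a 7B-parameter model (illustrative numbers).
# Assumptions: fp16 weights for inference; mixed-precision training with
# Adam, which keeps fp16 weights + fp16 gradients + fp32 master weights
# + two fp32 Adam moments per parameter (~16 bytes/param).

params = 7e9  # 7 billion parameters

inference_bytes = params * 2                    # fp16 weights only
training_bytes = params * (2 + 2 + 4 + 4 + 4)   # weights, grads, master copy, Adam m and v

gib = 1024 ** 3
print(f"Inference weights: ~{inference_bytes / gib:.0f} GiB")  # ~13 GiB
print(f"Training states:   ~{training_bytes / gib:.0f} GiB")   # ~104 GiB
```

Even before activations, that's far beyond the 24 GB of a high-end consumer card, which is why the training states end up spread across many GPUs.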
If you're training on more than one GPU, the speed at which you can exchange data between them suddenly becomes your bottleneck. To alleviate that, you need an extremely fast, direct GPU-to-GPU interconnect, something like NVLink, and consumer GPUs don't provide that.
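As a rough illustration of why the interconnect matters (the bandwidth figures below are assumed ballpark peaks, not measurements), in data-parallel training every GPU has to exchange the full set of gradients on every step:

```python
# Back-of-envelope time to exchange gradients each training step.
# Assumptions: 7B parameters, fp16 gradients (2 bytes each), and rough
# peak bandwidths: ~32 GB/s for PCIe 4.0 x16 vs ~600 GB/s for NVLink
# on datacenter GPUs. Real all-reduce traffic and throughput differ,
# but the ratio is the point.

grad_bytes = 7e9 * 2  # fp16 gradients for a 7B-parameter model

for name, bandwidth_gb_s in [("PCIe 4.0 x16", 32), ("NVLink", 600)]:
    seconds = grad_bytes / (bandwidth_gb_s * 1e9)
    print(f"{name}: ~{seconds:.2f} s per gradient exchange")
```

If the compute for a step takes only a fraction of a second, spending nearly half a second moving gradients over PCIe dominates the step time, while NVLink keeps the exchange in the tens of milliseconds.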
Even if you could train on a single GPU, you probably wouldn't want to, because of the sheer amount of time it would take.
But does this prevent a cluster of consumer GPUs from being used for training? Or does it just make it slower and less efficient?
Those are genuine questions, not rhetorical ones.