It's still a BF16 model.
DeepSeek has shown that fp8 is more cost-effective than fp16, so isn't that also valid for a model in the tens of billions of parameters?
I don't understand your point. The optimiser isn't fp8 anyway, is it? It's just the weights. I think the extent of fp8's "effectiveness" is greatly exaggerated. Yes, DeepSeek did use fp8, and they implemented it nicely, but that doesn't mean everybody now has to be using fp8 all of a sudden.
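To make the "it's just the weights" point concrete, here is a minimal mixed-precision training sketch. It is purely illustrative, not anyone's actual recipe: it keeps the master weights and Adam state in FP32 and only does the forward/backward compute in a lower-precision copy. bfloat16 is used as the low-precision stand-in so the snippet runs on stock PyTorch; fp8 training follows the same structure but adds per-tensor scaling and fp8 GEMM kernels.

```python
import torch

# Toy layer: FP32 "master" weight plus FP32 Adam state.
# Only the forward/backward compute uses a low-precision copy.
# (bfloat16 here as a stand-in; fp8 would add scaling on top.)
master_w = torch.randn(256, 256)            # FP32 master weights
adam_m = torch.zeros_like(master_w)         # FP32 optimizer state (1st moment)
adam_v = torch.zeros_like(master_w)         # FP32 optimizer state (2nd moment)
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

x = torch.randn(32, 256)
target = torch.randn(32, 256)

for step in range(1, 11):
    # Low-precision copy used only for compute; gradients flow into it.
    w_lp = master_w.to(torch.bfloat16).requires_grad_(True)
    y = x.to(torch.bfloat16) @ w_lp
    loss = (y.float() - target).pow(2).mean()
    loss.backward()

    # The optimizer update stays in FP32 on the master copy.
    g = w_lp.grad.float()
    adam_m.mul_(beta1).add_(g, alpha=1 - beta1)
    adam_v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    m_hat = adam_m / (1 - beta1 ** step)
    v_hat = adam_v / (1 - beta2 ** step)
    master_w -= lr * m_hat / (v_hat.sqrt() + eps)
```

Note how the Adam moments and the master weights never leave FP32, so the memory and bandwidth savings from dropping the compute precision are real but bounded.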