I am confused. I’m not a doctor, but why would a model perform poorly at detecting diseases in X-rays across different genders and races unless the diseases present themselves differently in X-rays for different races? Shouldn’t the model not have the race and gender information to begin with? Like a model trained on detecting lesions should perform equally well on ANY X-Ray unless lesions show up differently in different demographics.
You and the article are both correct. The disease can present differently as a function of these other characteristics, and since the training dataset doesn't contain enough samples of those different presentations, the model can't diagnose them effectively.
> [...] unless lesions show up differently in different demographics.
Well, first, the model looks at the entire X-ray, and lesions probably do show up differently. Maybe it's genetic or sex-based, or it's down to how lesions develop due to environmental factors that are correlated with race or gender. There may also be a smaller segment of white people with the same type of lesion and similarly poor detection.
> Like a model trained on detecting lesions should perform equally well on ANY X-Ray unless lesions show up differently in different demographics.
This is not true in practice.
For a model to perform well looking at ANY X-ray, it would need examples of every kind of X-ray.
That includes variation along race, gender, amputee status, etc.
The point of classification models is to discover differentiating features.
We don’t know those features beforehand, so we give the model as much relevant information as we can and have it discover those features.
There may very well be differences between X-rays of black women and other X-rays; we don’t know for sure.
We can’t make that assumption when building a dataset.
Even believing that there are no possible differences between X-rays of different races is itself a bias that would be reflected in the dataset.
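To make that concrete: the way you'd actually check is to stratify evaluation by the demographic metadata instead of reporting one aggregate number. A rough sketch, assuming you already have per-image predictions; the column names ("sex", "race", "label", "score") are placeholders, not anything from the article:

```python
# Minimal sketch: per-subgroup evaluation of an X-ray classifier.
# "label" = ground truth, "score" = model output, "sex"/"race" = metadata.
import pandas as pd
from sklearn.metrics import roc_auc_score

def per_group_auc(df: pd.DataFrame, group_col: str) -> pd.Series:
    """AUC of the model's scores, computed separately for each subgroup."""
    return df.groupby(group_col).apply(
        lambda g: roc_auc_score(g["label"], g["score"])
    )

# df = pd.read_csv("predictions.csv")  # one row per X-ray, with metadata
# print(per_group_auc(df, "sex"))
# print(per_group_auc(df, "race"))
# A large gap between subgroups is exactly the disparity being discussed.
```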
For a start, women have a different body shape, and you can (unreliably) tell a woman from a man from an X-ray. The model can pick up on those cues as a side effect and end up less accurate for demographics it was not trained on.
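Here's a toy illustration of that side effect, on purely synthetic data (nothing to do with real X-rays): give a classifier a weak "real" signal plus a proxy cue that tracks the label in the training population but not in the group you deploy on, and the accuracy drops.

```python
# Toy "shortcut learning" demo: a proxy feature (stand-in for body-shape
# cues) correlates with the label during training but not at test time.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shortcut_corr):
    y = rng.integers(0, 2, n)                    # lesion present / absent
    true_signal = y + rng.normal(0, 1.0, n)      # weak genuine signal
    shortcut = np.where(rng.random(n) < shortcut_corr, y, 1 - y)  # proxy cue
    X = np.column_stack([true_signal, shortcut + rng.normal(0, 0.1, n)])
    return X, y

X_train, y_train = make_data(5000, shortcut_corr=0.95)  # cue mostly tracks label
X_test, y_test = make_data(5000, shortcut_corr=0.50)    # cue uninformative here

clf = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))   # high: leans on the cue
print("test accuracy:", clf.score(X_test, y_test))      # drops noticeably
```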