> StableDiffusion is trained on scraping nearby text on the page
And that nearby text was written by humans. It may not be explicitly labelled in HTML attributes, but if the context weren't related to the image, the scraping wouldn't work.
If you go looking in LAION, it's often complete garbage. I think people underestimate how bad it is. Aesthetic finetuning does somehow fix it, but not by writing better captions.