I dunno, my only idea is that maybe if you use traditional face detection to find the face/eyes and then do classification (assuming you aren't doing that already?).
Right now that's pretty much what I do. I use YuNet to get faces, crop them out and run detection. It's probably a factor of a not enough data/poor model choice.