From what I understand of VHS (and I could be wrong here, since I’ve not written software to read VHS tapes), they don’t have a separate audio track. They just encode NTSC or PAL signals as they would be broadcast over the airwaves. That means audio will be encoded in the signal after the video frame. It also means Teletext gets recorded too (which has been useful for Teletext archivists / historians).
Stereo audio, like colour video, was an advancement that came after broadcasting had already been standardised, which means they had to find room in the signal to squeeze that additional information in (this is why TV sets that aren’t in sync with the broadcast feed go black and white). Stereo was a relatively recent addition, maybe late 80s or early 90s (I clearly remember when the technology was turned on but can’t recall how old I was), so it wouldn’t surprise me if stereo audio was subject to the same syncing issues as colour video.
VHS HiFi seems to be a different format entirely, but one that used the same storage media (like how CDs have a few different storage formats supported by the same optical disc hardware).