Subtitling adds even more issues that machine translation simply can't handle, because, like a good book translation, it's an art form.
Making good subtitles means prioritizing readability over accuracy. You have a limited amount of space for your text, and you want to keep the characters-per-second rate low, so you cut words, ruthlessly. But you have to choose which words to cut so that the line still makes sense, which means identifying filler words you can drop, or figuring out ways to re-phrase something into a shorter sentence.
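To make the characters-per-second constraint concrete, here is a rough Python sketch of the purely mechanical part: measure the reading speed of a line and drop filler words until it fits. The `FILLERS` list, the function names, and the `max_cps` value used below are all made up for illustration, not taken from any subtitling standard or tool.

```python
import re

# Hypothetical filler list; a real subtitler picks these by judgment, line by line.
FILLERS = {"well", "um", "uh", "just", "really", "actually", "basically"}


def chars_per_second(text: str, duration_s: float) -> float:
    """Reading speed of a subtitle: characters shown per second on screen."""
    return len(text) / duration_s


def trim_fillers(text: str, duration_s: float, max_cps: float) -> str:
    """Drop filler words, left to right, until the line fits under max_cps."""
    words = text.split()
    i = 0
    while i < len(words) and chars_per_second(" ".join(words), duration_s) > max_cps:
        # Strip punctuation so "Well," still matches "well".
        bare = re.sub(r"\W", "", words[i]).lower()
        if bare in FILLERS:
            del words[i]
        else:
            i += 1
    return " ".join(words)


line = "Well, I just really think we should actually go now"
# max_cps=15 is an arbitrary placeholder, not a recommended limit.
print(trim_fillers(line, duration_s=2.0, max_cps=15))
```

Note what the sketch can't do: it has no idea whether "just" or "really" carries meaning in a given line, and it can't re-phrase anything. Once the filler list runs out, it simply gives up and leaves the line too long.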
You probably also want to preserve the tone and style of the dialogue, which means you have to choose the right synonyms, not just the most common ones.
And if you're creating subtitles for the hearing-impaired, it becomes even more necessary to understand what's going on in the video. If someone slams a door center-screen, you can cut that from the subtitles if you have more important things to display, but if someone slams a door off-screen, you absolutely have to include it, because that's the kind of information a hearing-impaired person needs.
Good luck training your little machine-learning network to identify which sound effects originate from objects on-screen and which originate off-screen...