Televisions have considerably more temporal data to work with than an audio stream does. It's very easy to hack together interpolated images, not so easy to predict/denoise/upres time-series audio information.
Past a certain point it's probably easier/more efficient to use the Airpods as a speech-to-text mic and then infer a "high quality" text-to-speech version on your connected device.