This is the first truly multimodal model from OpenAI: you can send an image in, and the visual properties of that image are preserved in the output from the network. Previously, the input image would be reduced to a text prompt by the model, and that text sent to the DALL·E 3 model, which would return an image URL. Will we get API updates that let us do this?
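For contrast, here's a minimal sketch of that previous two-step flow using the OpenAI Python SDK (model names are just examples); note that every visual detail has to survive as plain text in between:

```python
# Sketch of the pre-4o two-step flow: a chat model produces a text
# prompt, then DALL-E 3 renders it and returns an image URL.
# Model names here are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: the chat model reduces the request (and any input image)
# to a plain-text prompt -- all visual detail must survive as text.
chat = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Describe a logo: a fox reading a book, flat vector style."}],
)
prompt = chat.choices[0].message.content

# Step 2: hand that text to DALL-E 3, which responds with an image URL.
image = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
print(image.data[0].url)
```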
Also, will we be able to tap into a realtime streaming session through the API to replicate the audio/video streams shown in the demos? Given the Be My Eyes partnership, I imagine some kind of API like this already exists, but will it be opened up to more developers?
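To be concrete about what I'm asking for, here's a purely hypothetical sketch of a bidirectional WebSocket interface; the endpoint URL, event names, and payload fields are all invented for illustration, since nothing like this is publicly documented:

```python
# Purely hypothetical: the endpoint, event names, and payload fields
# below are invented to illustrate the shape of a realtime streaming
# API, not an actual OpenAI interface.
import asyncio
import json
import os

import websockets  # pip install websockets


async def main() -> None:
    uri = "wss://api.openai.com/v1/realtime"  # invented endpoint
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Push captured microphone/camera frames up as they arrive...
        await ws.send(json.dumps({"type": "audio_chunk", "data": "<base64 PCM>"}))
        # ...and consume the model's audio as it streams back.
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "audio_delta":
                print("got audio delta,", len(event.get("data", "")), "base64 chars")


asyncio.run(main())
```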
Even setting streaming aside, will the Chat API receive support for audio input/output as well? Previously, one might have used a TTS model to voice the model's text output, but with a truly multimodal model the audio output can carry nuance (tone, pacing, emotion) that can't really be expressed in text.
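For reference, the current workaround looks something like this with the OpenAI Python SDK (model and voice names are just examples); the issue is that everything has to pass through the intermediate `text` string, which is exactly where the nuance gets lost:

```python
# Current workaround: generate text with a chat model, then voice it
# with a separate TTS model. Any nuance not encoded in the text itself
# is lost at this hand-off.
from openai import OpenAI

client = OpenAI()

chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Break the bad news gently."}],
)
text = chat.choices[0].message.content

# The TTS model only ever sees the plain text, not the model's intent.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=text)
speech.stream_to_file("reply.mp3")
```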