But I would like to see how this is integrated into applications by third party developers where the AI is doing a specific job. Is it still as impressive?
The biggest challenge I've had building autonomous "agents" with generic LLMs is that they are overly gullible and accommodating, so I end up reverting to legacy chatbot logic trees etc. to keep them on task and doing a job. Also, STT is rife with speaker interjections, which frustrates users to the point that they just want to talk to a person. Hard to see if this is really solved yet.
I’ve found that you can create declarative logic trees from JSON and use that as a prompt for the LLM, which it can then use to traverse the tree accordingly. The only issue I’ve encountered is when it wants to jump to part of the tree which is invalid in the current state. For example, you want to move a user into a flow where certain input is required, but the input hasn’t been provided yet. A transition is suggested to the program by the LLM, but it’s impossible so the LLM has to be prompted that the transition is invalid and to correct itself. If it fails to transition again, a default fallback can be given but it’s not ideal at all.
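To make the invalid-transition handling concrete, here is a minimal sketch in Python rather than the original stack. The tree, the `requires` field, the `ask_llm` callable, and the `CANCEL` fallback are all made up for illustration; the real project's shapes were different.

```python
import json

# Hypothetical tree: each state lists its allowed events, and a transition
# may name inputs that must already have been collected ("requires").
TREE = {
    "greeting": {"on": {"START": {"target": "ask_email"}}},
    "ask_email": {"on": {
        "SUBMIT": {"target": "confirm", "requires": ["email"]},
        "CANCEL": {"target": "greeting"},
    }},
    "confirm": {"on": {}},
}

def check_transition(state, event, context):
    """Return an error message if the event is invalid here, else None."""
    transition = TREE[state]["on"].get(event)
    if transition is None:
        allowed = ", ".join(sorted(TREE[state]["on"])) or "none"
        return f"Event {event!r} is not allowed in state {state!r}. Allowed: {allowed}."
    missing = [name for name in transition.get("requires", []) if name not in context]
    if missing:
        return f"Event {event!r} needs input that is missing: {', '.join(missing)}."
    return None

def next_state(state, context, ask_llm, max_retries=1, fallback_event="CANCEL"):
    """Ask the LLM for an event, re-prompt on invalid picks, then fall back."""
    prompt = (f"Tree: {json.dumps(TREE)}\nCurrent state: {state}\n"
              "Reply with one event name.")
    for _ in range(max_retries + 1):
        event = ask_llm(prompt).strip()
        error = check_transition(state, event, context)
        if error is None:
            return TREE[state]["on"][event]["target"]
        # Tell the model why the transition was rejected and let it retry.
        prompt += f"\nYour last answer was invalid. {error} Pick a valid event."
    # Retries exhausted: take the default fallback if it is valid, else stay put.
    if check_transition(state, fallback_event, context) is None:
        return TREE[state]["on"][fallback_event]["target"]
    return state
```

`ask_llm` is just any function that takes a prompt string and returns text, so a canned fake works for testing the recovery path.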
However, another nice aspect of having the tree declared in advance is that it shows human beings what the system is capable of and how it’s intended to be used. This has proven to be pretty useful, as letting the LLM call whatever functions it sees fit based on broad intentions and system capabilities leaves humans in the dark a bit.
So, I like the structure and dependability. Maybe one day we can depend on LLM magic and not worry about a team understanding the ins and outs of what should or shouldn’t be possible, but we don’t seem to be there yet at all. That could be in part because my prompts were bad, though.
Since the LLM has an okay grasp of how finite state machines and XState work, it seems to do a good job of navigating the tree properly and reliably. It essentially does so by outputting the information it thinks the state machine should use as a transition in a JSON object, which gets parsed and passed to a transition function. The parsing would fail occasionally, so there was a recursive “what’s wrong with this JSON?” prompt to get it to fix its own malformed JSON, haha. That was meant to be a temporary hack but it worked well, so it stayed. There were a few similar tools for trying to correct errors. That might be one of the strangest developments in programming for me… Deploying non-deterministic logic to fix itself in production. It feels wrong, but it works remarkably well. You just need sane fallbacks and recovery tactics.
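Roughly what the “what’s wrong with this JSON?” trick looks like, as a sketch only. I've written it as a bounded loop instead of true recursion so it can't spin forever, and `ask_llm` is a stand-in for whatever model call you use (not a real library function).

```python
import json

def parse_with_repair(raw, ask_llm, max_repairs=2):
    """Parse JSON from the model, asking it to fix its own output on failure.

    Returns the parsed object, or None once the repair budget is spent so the
    caller can apply its own fallback.
    """
    text = raw
    for attempt in range(max_repairs + 1):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            if attempt == max_repairs:
                break
            # Feed the parser error back so the model knows what to correct.
            text = ask_llm(
                "What's wrong with this JSON? "
                f"Parser error: {err}. Reply with the corrected JSON only.\n{text}"
            )
    return None
```

Returning `None` instead of raising is a design choice here: it forces the caller to have the sane fallback mentioned above.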
It was a proprietary project so I can’t share the source, but I think reading up on XState JSON configuration might explain most of it. You can describe most of your machine in a serializable format.
You can actually store a lot of useful data in state names, context, meta, and effect/action names to aid with the prompting and weaving state flows together in a language-friendly way. I also liked that the prompt would be updated by information that went along with the source code, so a deployment would reliably carry the correct information.
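As a rough illustration of carrying prompt text inside the machine definition, here is a Python dict shaped loosely like an XState config (`states`, `on`, `meta`). The state names, the `meta.prompt` wording, and the `describe_state` helper are all hypothetical.

```python
# Hypothetical machine definition; the prompt hints live next to the states
# they describe, so they ship with the same deployment as the logic.
MACHINE = {
    "id": "survey",
    "initial": "ask_role",
    "states": {
        "ask_role": {
            "meta": {"prompt": "Find out what the user's job role is."},
            "on": {"ROLE_GIVEN": "ask_tools", "SKIP": "wrap_up"},
        },
        "ask_tools": {
            "meta": {"prompt": "Ask which tools they use day to day."},
            "on": {"TOOLS_GIVEN": "wrap_up"},
        },
        "wrap_up": {"meta": {"prompt": "Thank the user and end the survey."}},
    },
}

def describe_state(machine, state_name):
    """Render one state as plain text suitable for pasting into an LLM prompt."""
    state = machine["states"][state_name]
    goal = state.get("meta", {}).get("prompt", "n/a")
    lines = [f"Current state: {state_name}", f"Goal: {goal}", "Available events:"]
    for event, target in state.get("on", {}).items():
        lines.append(f"- {event} -> {target}")
    return "\n".join(lines)
```

Descriptive event names like `ROLE_GIVEN` do a lot of the work here, since the model reads them as plain language.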
The LLM essentially hid a decision tree from the user and smoothed over the experience of navigating it through adaptive and hopefully intuitive language. I’d personally prefer to provide more deterministic flows that users can engage with on their own, but one really handy feature of this was the ability to jump out of child states into parent states without needing to say, list links to these options in the UI. The LLM was good at knowing when to jump from leaves of the tree back up to relevant branches. That’s not always an easy UI problem to solve without an AI to handle it for you.
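The leaf-to-branch jump is basically event bubbling in a hierarchical state machine: if the leaf doesn't handle an event, its ancestors get a chance. A simplified sketch, using a flat dict with parent pointers instead of a real nested statechart, and with invented state and event names:

```python
# Hypothetical hierarchy; "parent" points one level up, None marks the root.
STATES = {
    "survey": {"parent": None, "on": {"RESTART": "survey.intro"}},
    "survey.intro": {"parent": "survey", "on": {"BEGIN": "survey.work"}},
    "survey.work": {"parent": "survey", "on": {"CHANGE_TOPIC": "survey.intro"}},
    "survey.work.tools": {"parent": "survey.work", "on": {"NEXT": "survey.work.hours"}},
    "survey.work.hours": {"parent": "survey.work", "on": {}},
}

def resolve(state, event):
    """Walk from the leaf towards the root until some state handles the event."""
    node = state
    while node is not None:
        target = STATES[node]["on"].get(event)
        if target is not None:
            return target
        node = STATES[node]["parent"]
    return None

def reachable_events(state):
    """All events the LLM may pick from this leaf; nearer states win on clashes."""
    events = {}
    node = state
    while node is not None:
        for event, target in STATES[node]["on"].items():
            events.setdefault(event, target)
        node = STATES[node]["parent"]
    return events
```

Passing `reachable_events(leaf)` into the prompt is what lets the model offer "go back up" options without the UI listing them.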
edit: Something I forgot to add is that the client wanted to be able to modify these trees themselves, so the whole machine configuration was generated by a graph in a database that could be edited. That part was powered by Strapi. There was structured data in there and you could define a state, list which transitions it can make, which actions should be triggered and when, etc. The client did the editing directly in Strapi with no special UI on top.
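For a sense of how an editable graph can become a machine config, here is a small sketch. The row shapes below are invented for illustration and are not Strapi's actual data model or API; the idea is just states plus transitions in, serializable config out, with validation so a bad edit fails loudly.

```python
# Hypothetical rows as they might come back from the CMS.
STATE_ROWS = [
    {"name": "intro", "initial": True, "actions": ["greet"]},
    {"name": "question", "initial": False, "actions": []},
    {"name": "done", "initial": False, "actions": ["save_answers"]},
]
TRANSITION_ROWS = [
    {"source": "intro", "event": "BEGIN", "target": "question"},
    {"source": "question", "event": "ANSWERED", "target": "done"},
]

def build_machine(state_rows, transition_rows, machine_id="survey"):
    """Assemble a serializable machine config from state and transition rows."""
    initial = [row["name"] for row in state_rows if row["initial"]]
    if len(initial) != 1:
        raise ValueError("exactly one state must be marked initial")
    # "entry" holds the action names to trigger when the state is entered.
    states = {
        row["name"]: {"entry": list(row["actions"]), "on": {}}
        for row in state_rows
    }
    for t in transition_rows:
        if t["source"] not in states or t["target"] not in states:
            raise ValueError(f"transition references an unknown state: {t}")
        states[t["source"]]["on"][t["event"]] = t["target"]
    return {"id": machine_id, "initial": initial[0], "states": states}
```

Validating at build time matters more than usual when non-developers edit the graph directly.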
Their objective is surveying people in a more engaging and personable way. They really wanted surveys which adapt to users rather than piping people through static flows or exposing them to redundant or irrelevant questions. Initially this was done with XState and no LLM (it required some non-ideal UI and configuration under the hood to make those jumps to parent states I mentioned, but it worked). I can't say how effective it is, but they really like it. The AI hype was very, very strong on that team.
This is not using TTS or STT. Audio and image data can be tokenized as readily as text. This is simply an LLM that happens to have been trained to receive and emit audio and image tokens as well as text tokens. Interjections are a lot more palatable in this paradigm, as most of the demos show.
I would wager like 100:1 that this is just introducing some TTS/STT layers. The video processing layer is probably doing something similar: taking an extremely limited number of 'screenshots', running typical image captioning on them with another layer, and then feeding the captions in as input. So the demo, to me, seems most likely to be 3 separate 'plugins' operating in unison: text to speech, speech to text, and image to text.
The interjections are likely just the software being programmed to aggressively begin output following any lull after an input pattern. Note in basically all the videos, the speakers have to repeatedly cut off the LLM as it starts speaking in conversationally inappropriate locations. In the main video which is just an extremely superficial interaction, the speaker made sure to be constantly speaking when interacting, only pausing once to take a breath that I noticed. He also struggled with the timing of his own responses as the LLM still seems to be attached to its typical, and frequently inappropriate, rambling verbosity (though perhaps I'm not one to critique that).
Literally the first paragraph of the linked blog.
"GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs."
Then
"Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.
With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network."
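A toy sketch of why the old three-model pipeline loses information. Every function here is a made-up stub (no real STT, LLM, or TTS is involved); it only shows that once audio is reduced to a transcript, tone and speaker never reach the middle model.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    words: str
    tone: str      # e.g. "cheerful", "sarcastic"
    speaker: str

def transcribe(utterance):
    """Stub STT: keeps only the words, dropping tone and speaker identity."""
    return utterance.words

def text_model(text):
    """Stub text-only model: it can only react to the transcript."""
    return f"You said: {text}"

def synthesize(text):
    """Stub TTS: always speaks in the same flat voice."""
    return Utterance(words=text, tone="neutral", speaker="assistant")

def cascaded_voice_mode(utterance):
    """Audio -> text -> text -> audio, as in the quoted description."""
    return synthesize(text_model(transcribe(utterance)))
```

Feeding in the same words with a cheerful and a sarcastic tone gives identical replies, which is the loss the blog post is describing.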
I'm sure you'll find this part is a lot quicker to process, giving the instant response (the old gpt4-turbo is generally very quick with simple requests like this). Rather impressively all it would need is an additional custom instruction.
Very clever and eerily human.