I really should package it up so people can try it. The one problem that makes it a little unnatural is that determining when the user is done talking is tough. What's needed is a speech conversation turn-taking dataset and model; that's missing from off-the-shelf speech recognition systems. But it should be trivial for a company like OpenAI to build. That's what I'd work on right now if I were there, because truly natural voice conversations are going to unlock a whole new set of users and use cases for these models.
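For context on why this is the unnatural part: the off-the-shelf approach is just voice activity detection plus a fixed silence timeout, so pausing mid-thought gets you cut off. Here's a minimal sketch of that naive baseline (my code, assuming the webrtcvad package and 16 kHz, 16-bit mono PCM; a learned turn-taking model would replace the fixed threshold):

```python
# Naive endpointing: declare the turn over after END_OF_TURN_MS of silence.
# The fixed threshold is exactly what a turn-taking model would replace.
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30            # webrtcvad accepts 10/20/30 ms frames
END_OF_TURN_MS = 700     # arbitrary: too short clips you off, too long lags

def user_is_done(frames):
    """frames: iterable of 30 ms chunks of 16-bit mono PCM bytes."""
    vad = webrtcvad.Vad(3)           # aggressiveness 0-3
    silence_ms = 0
    for frame in frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            silence_ms = 0           # speech resets the silence counter
        else:
            silence_ms += FRAME_MS
            if silence_ms >= END_OF_TURN_MS:
                return True          # probably done... or just thinking
    return False
```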
Total end-to-end latency is a few hundred milliseconds: from speech-to-text, to the LLM, then to a POS lookup to validate the SKU (so no hallucinations are possible!), and finally back to generated speech. The latency is starting to feel really natural. Building out a general system to achieve this low latency will, I think, end up being a big unlock for enabling diverse applications.
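To make the shape concrete, here's a toy sketch of that pipeline (the three stage functions are dummy stand-ins for the STT, LLM, and TTS calls; the point is the POS catalog check sitting between the LLM and the speech output):

```python
# Toy version of the low-latency voice-ordering pipeline described above.
# transcribe/complete/speak are stubs standing in for Whisper/GPT/TTS.

def transcribe(audio: bytes) -> str:           # STT stage
    return "one large oat-milk latte"

def complete(text: str) -> tuple[str, str]:    # LLM stage: picks a SKU + reply
    return "SKU-4217", f"Got it, adding {text} to your order."

def speak(reply: str) -> bytes:                # TTS stage
    return reply.encode()

def handle_utterance(audio: bytes, pos_catalog: set[str]) -> bytes:
    text = transcribe(audio)
    sku, reply = complete(text)
    if sku not in pos_catalog:                 # validate against the POS, so a
        reply = "Sorry, we don't carry that."  # hallucinated SKU never ships
    return speak(reply)

print(handle_utterance(b"<pcm audio>", {"SKU-4217"}))
```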
Yep - it needs to be ready as soon as I'm done talking, and I need to be able to interrupt it. If those things can be done, then it can also start tentatively talking if I pause and immediately stop if I continue.
I don't want to have to think about how to structure the interaction as an explicit call/response chain, nor do I want to have to be super careful to keep talking until I've finished my thought to prevent it from doing its thing at the wrong time.
> determining when the user is done talking is tough.
Sometimes that task is tough for the speaker too, not just the listener. Courteous interruptions, or the lack thereof, might be a shibboleth for determining when we are speaking to an AI.

I was just googling a bit to see what's out there now for whisper/llama combos and came across this: https://github.com/yacineMTB/talk
There's a demo linked on the GitHub page that seems relatively fast at responding conversationally, but still maybe 1-2 seconds at times. Impressive that it's entirely offline.
Is there any extra work OpenAI's product might be doing that yours isn't, contributing to this latency, considering the scale they operate at and any reputational risks to their brand?
With a few tweaks this is a general-purpose solver for robotics planning. There are still a few hard problems between this and a working solution, but it is one of the hard problems solved.
Will we be seeing general-purpose robots performing simple labor powered by ChatGPT within the next half decade?
1. It's not smart enough to recognize from the initial image that this is a bolt-style seat lock (which a human can).
2. The manual is not shown to the viewer, so I can't infer how the model knows this is a 4mm bolt (or if it is just guessing given that's the most likely one).
3. I don't understand how it can know the toolbox is using metric Allen wrenches.
Additionally, is this just the same vision model that exists in Bing Chat?
https://www.deepmind.com/blog/rt-2-new-model-translates-visi...
You have someone with a toolbox and a manual (seriously, who has a manual for their bike?), asking the most basic question about how to lower a seatpost. My 5-year-old kid knows how to do that.
Surely there's a better way to demonstrate the groundbreaking impact of AI on humanity than this. I dunno, something like how to tie my shoelaces.
Yeah, but with an enormous ecological footprint.
Also, not suitable for small lightweight robots like drones.
For driving - https://wayve.ai/thinking/lingo-natural-language-autonomous-...
I can already see an "Alexa/Siri/Google Home" replacement, a "Google Image Search" replacement; ed-tech startups that were solving problems with AI by taking a photo are also doomed, and more to follow.
1. Domain-specific AI - Training an AI model on highly technical and specific topics that general-purpose AI models don't excel at.
2. Integration - If you're going to build on an existing AI model, don't focus on adding more capabilities. Instead, focus on integrating it into companies' and users' existing workflows. Use it to automate internal processes and connect systems in ways that weren't previously possible. This adds a lot of value and isn't something that companies developing AI models are liable to do themselves.
The two will often go hand-in-hand.
And the ability to ingest images was a highlight and a big part of the hype of the GPT-4 announcement back in March: https://openai.com/research/gpt-4
Rather than die, why not just pivot to doing multi-modal on top of Llama 2 or some open source model or whatever? It wouldn’t be a huge change
A lot of businesses/governments/etc can’t use OpenAI due to their own policies that prohibit sending their data to third party services. They’ll pay for something they can run on-premise or in their own private cloud
I wouldn’t count focused, revenue-oriented players with Meta’s shit in their pocket out just yet.
Because history shows that the first out of the gate is not the definitive winner much of the time. We aren't still using Gopher. We aren't searching with AltaVista. We don't connect to the internet with AOL.
AI is going to change many things. That is all the more reason to keep working on how best to make it work, not give up and assume that efforts are "doomed" just because someone else built a functional tool first.
BTW, I expect these technologies to be democratized and the training to be in the hands of more people, if not everyone.
Most of them accurately detect that it's a sunk-cost fallacy to continue, but it looks like a form of positive thinking... and that's the power of community!
ChatGPT already made it so that you could easily copy & paste any full-text question and receive an answer with 90% accuracy. The only flaw was that problems that also used diagrams or figures would be out of ChatGPT's domain.
With image support, students could just take screenshots or document scans and have ChatGPT give them a valid answer. From what I’ve seen, more students than not will gladly abuse this functionality. The counter would be to either leave the grading system behind, or to force in-person schooling with no homework, only supervised schoolwork.
I mean, what is the point of doing schoolwork when some of the greatest minds of our time have decided the best way for the species to progress is to be replaced by machines?
Imagine you're 16 years old right now: you know about ChatGPT, you know about OpenAI and their plans, and you're being told you need to study hard to get a good career... but you're also reading up on what the future looks like according to the technocracy.
You'd be pretty fucking confused right now wouldn't you?
It must be really hard at the moment to want to study and not cheat...
This is obviously not easy or going to happen without time and resources, but that is how adaptation goes.
They can still log in on their phone to cheat though. I wonder if OpenAI will add linked accounts and parental controls at some point. Instance 2 of ChatGPT might "tell" on the kid for cheating by informing Instance 1 running the AI Teacher plugin.
A proper notice about them removing the feature would've been nice. Maybe I missed it (someone please correct me if wrong), but the last I heard officially it was temporarily disabled while they fix something. Next thing I know, it's completely gone from the platform without another peep.
OpenAI is killing it, right? People are coming up with interesting use cases, but the main way most people interact with AI appears to be ChatGPT.
However, they still don't seem to be able to nail image generation; all the cool stuff keeps happening in Midjourney and Stable Diffusion.
If the API is available in time (Halloween), my multi-modal talking skeleton head with an ESP32 camera that makes snarky comments about your costume just got slightly easier on the software side.
Ironically, this is basically the exact line of reasoning for why I didn't embark on any such endeavors.
There's a recent paper by Hugging Face called IDEFICS[2] that claims to be an open-source implementation of Flamingo (an older paper about few-shot multi-modal task understanding), and I think this space will be heating up soon.
Just now I opened the app, went to settings, went to "New Features", and all I saw was Bing browsing disabled (unable to enable). OK, I didn't even know that was a thing that worked at one point. Maybe I need an update? Go to the App Store; nope, I'm up to date. Kill the app, relaunch, open settings; now "New Features" isn't even listed. I can promise you I won't be browsing the settings part of this app regularly to see if there is a new feature. Heck, not only do they not email/push about new features, they don't even message in-app about them. I really don't understand.
Maybe they are doing so well they don't have to care about communicating with customers right now, but it really annoys me and I wish they did better.
I suspect they do care about communicating with customers, but it's total chaos and carnage internally.
I do love these companies that succeed in spite of their marketing & design and not because of it. It shows you have something very special.
Sounds like their marketing is doing just fine. If you were to just leave and forget about it, then sure, they need to work on their retention. But you won’t, so they don’t.
> We are deploying image and voice capabilities gradually
>
> OpenAI’s goal is to build AGI that is safe and beneficial. We believe in making our tools available gradually, which allows us to make improvements and refine risk mitigations over time while also preparing everyone for more powerful systems in the future. This strategy becomes even more important with advanced models involving voice and vision.
Agreed. Other notable mentions: choosing "ChatGPT" as their product name and not having mobile apps.
Frustratingly, at least the image gen is live on Bing, but I guess Microsoft is paying more than me for access.
Sarcasm aside, I understand your complaint, but still, a little funny.
I also wonder how Apple (& Google) are going to be able to provide this for free? I would love to be a fly on the wall in the meetings they have about this; imagine all the innovator's-dilemma-like discussions they'd be forced to have ("we have to do this" vs. "this will eat up our margins").
This might be a little out there, but I think Apple is making the correct move in letting the dust settle. Similar to how Zuckerberg burned $20 billion for Apple to come out with Vision Pro, I see something similar playing out with Llama. Although this is a low-conviction take, because software is Facebook's ballgame (hardware not so much).
It’s the same reason why an Uber in NYC used to cost $20 and now costs $80 for the same trip. Venture capital subsidizing market capture.
Imagine how much they would have to pay for testers at that scale.
I really really hope this is available in more languages than English.
Also, Google: where's Gemini?
The LLM boom of the last year (OpenAI, Llama, et al.) has me giddy as a software person. It's a reach, but I truly feel like I'm watching the pyramids of our time get made.
Just as the GUI made computer software available to billions, LLMs will be the next revolution.
I'm just as excited as you! The only downside is that it now makes me feel bad that I'm not doing anything with it yet.
From a convenience perspective, it saves me LOADS of time over texting myself on Signal about my specs/design rabbit holes, then copying & pasting into Firefox and getting into the discussion. So yeah, happy for this.
I think this could bring back Google Glass, actually. Imagine wearing them while cooking, and having ChatGPT give you active recipe instructions as well as real-time feedback. I could see that within the next 1-3 years.
Anyone know the details?
I also heard it was able to do near-perfect CAPTCHA solves in the beta?
Does anyone know if you can throw in a PDF that has no OCR on it and have it summarize it with this?
Jokes aside, I have paused my subscription because even GPT4 seemed to become dumber at tasks to the point that I barely used it, but the constant influx of new features is tempting me to renew it just to check them out...
After maybe 3 iterations, GPT-4 started claiming that it is not capable of reading from a Word document, even though it had done exactly that the previous 3 times. I have to click the regenerate button to get it to work.
Digital Artists, Illustrators, Writers, Novelists, News anchors, Copywriters, Translators, Programmers (to a lesser extent), etc.
We'll have to wait a bit until it can solve the P vs NP problem or other unsolved mathematical problems unsupervised with a transparent proof which mathematicians can rigorously check themselves.
Not really. A malevolent AGI doesn't need to move to get anything it needs (it could ask / manipulate / bribe people to do all the stuff requiring movement).
We should be fine as long as it's not a malevolent AGI with enough resources to kick physical things off in the direction it wants.
So no, but maybe less than it used to?
I'm not sure what to think about the fact that I would benefit from a couple of cameras in my fridge connected to an app that would remind me to buy X or Y and tell me that I defrosted something in the fridge three days ago and it's probably best to chuck it in the bin already.
Sadly, they lost the "open" a long time ago... It would be wonderful to have these models open-sourced...
It doesn't really need to do much besides writing down my tasks/todos and updating them, occasionally maybe providing feedback or writing a code snippet. This all seems within the current capabilities of OpenAI's offering.
Sadly, voice chat is still not available on PC, where I do my development.
Fingers crossed we get there soon, though.
One part of that is about preventing it from producing "illegal" output, their example being the production of nitroglycerine, which is decidedly not illegal to make in the US generally (particularly if not using it as an explosive, though usually unwise) and possible to make accidentally when otherwise performing nitration (which is in general dangerous) -- so it's pretty pointless to outlaw at a small scale in any case. It's certainly not illegal to learn about. (And it's generally of only minimal risk to the public, since anyone making it in any quantity is more likely to blow themselves up than anything else.)
Today, learning about it is as simple as picking up a book or doing an internet search -- https://www.google.com/search?q=how+do+you+make+nitroglyceri.... But in OpenAI's world you just get detected by the censorship and told no. At least they've cut back on the offensive finger-wagging.
As LLM systems replace search, I fear that we're moving in a dark direction where the narrow-minded morality and childlike understanding of the law of a small number of office workers who have never even picked up a screwdriver or test tube and made something physical (and the fine-tuning sweatshops they direct) classify everything they don't personally understand as too dangerous to even learn about.
One company hobbling their product wouldn't be a big deal, but they're pushing for government controls to prevent competition, and even if they miss, these efforts may stick everyone else with similar hobbling.
I'm more interested in this. I wonder how it performs compared to competitors' models, or even open-source ones?
> analyze a complex graph for work-related data
Does this mean that I can take a screenshot of e.g. Apple stock chart and it will be able to reason about it and provide insights and analysis?
GPT-4 currently can display images but cannot reason about or understand them at all. I think it's one thing to have some image recognition and be able to detect that the picture "contains a time-series chart that appears to be displaying Apple stock" vs. "Apple stock appears to be 40% up YTD but 10% down from its all-time high from earlier in July, closing at $176 as of the last recorded date".
I'm very curious how capable ChatGPT will be at actually reasoning about complex graphical data.
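For what it's worth, here's a sketch of how that request might look through the API; the image-input message shape and the model name are my assumptions, since vision isn't generally available in the API yet:

```python
# Hypothetical chart-analysis request; model name and message shape assumed.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
with open("aapl_chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "How is this stock doing YTD, and how far is it "
                     "off its recent high?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```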
Alexa just launched their own LLM-based service last week.
"The phrase “potato, potahto” comes from a song titled “Let’s Call the Whole Thing Off”, written by George and Ira Gershwin for the 1937 film “Shall We Dance”, starring Fred Astaire and Ginger Rogers. The song humorously highlights regional differences in American English pronunciation. The lyrics go through a series of words with alternate pronunciations, like “tomato, tomahto” and “potato, potahto”. The idea is that, despite these differences, we should move past them, hence the line “let’s call the whole thing off”. Over time, the phrase has been adopted in everyday language to signify a minor disagreement or difference in opinion that isn’t worth arguing about."
It's comparing American and British pronunciations, not different regional American ones. Also, "let's call the whole thing off" suggests they should break up over their differences, with the bridge and later choruses then involving a change of heart ("let's call the calling off off").
The ability to have a real-time back-and-forth feels truly magical and allows for much denser conversation. It also opens up the opportunity for multiple people to talk to a chatbot at once, which is fun.
Where’s that Gemini, Google?
1. According to the demo, they seem to pair voice input with TTS output. What if I wanna use voice to describe a program I want it to write?
2. Furthermore, if you're gonna do a voice assistant, why not go the full way with wake-words and VAD? (See the sketch after this list.)
3. Not releasing it to everyone is potentially a way to create a hype cycle prior to users discovering that the multimodality is rather meh.
4. The bike demo could actually use visual feedback to show what it's talking about, à la Segment Anything. It's pretty confusing to get a paragraph-long explanation of which tool to pick.
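On point 2, wake-words are pretty approachable these days. A minimal loop sketch using Picovoice's Porcupine (the pvporcupine and pvrecorder packages; the access key and keyword choice are placeholders):

```python
# Minimal wake-word loop: block until a keyword is heard, then hand the
# microphone over to VAD plus streaming transcription (not shown here).
import pvporcupine
from pvrecorder import PvRecorder

porcupine = pvporcupine.create(access_key="YOUR_PICOVOICE_KEY",
                               keywords=["jarvis"])  # a built-in keyword
recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()
try:
    while True:
        frame = recorder.read()            # one frame of 16-bit PCM samples
        if porcupine.process(frame) >= 0:  # >= 0 is the detected keyword index
            print("Wake word detected; start listening for the command")
finally:
    recorder.stop()
    porcupine.delete()
```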
In my https://chatcraft.org, we added voice incrementally, so I can swap between typing and voice. We can also combine it with function calling, etc. We also use the OpenAI APIs, except in our case there is no weird waitlist: you pop in your API key and get access to voice input immediately.
Are you sure you're not the one who's asking for a cool demo?
3. Rolling out releases gradually is something most tech companies do these days, particularly when they could attract a large audience and consume a lot of resources. There are solid technical reasons for this.
You may not need to roll things out gradually for a small site, but things are different at scale.
Patiently awaiting rollout so I can chat about implementing UIs I like, and have GPT-4 deliver a boilerplate with an implemented layout... Figma/XD plugins will probably arrive very soon too.
UX/UI design is probably solved at this point.
Not an issue now, but maybe in the future if these tools end up becoming full-blown replacements for educators and educational resources.
Maybe it will not be called the Chat API but rather the Multimodal API.
;)
https://en.m.wikipedia.org/wiki/Project_Milo
Milo had an AI structure that responded to human interactions, such as spoken word, gestures, or predefined actions in dynamic situations. The game relied on a procedural generation system which was constantly updating a built-in "dictionary" that was capable of matching key words in conversations with inherent voice-acting clips to simulate lifelike conversations. Molyneux claimed that the technology for the game was developed while working on Fable and Black & White.
My concern is that when I say "FastPFOR" it'll get transcribed as "fast before" or something like that. Transcription really falls apart in highly technical conversations in my experience. If ChatGPT can use context to understand that I'm saying "FastPFOR" that'll be a game changer for me.
Is anyone doing this? Is there a reason it doesn't work as well as I'm imagining?
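One partial answer: OpenAI's Whisper transcription endpoint accepts a prompt parameter that biases it toward vocabulary you supply, which targets exactly this case. A minimal sketch (the term list and file name are just examples):

```python
# Bias Whisper toward project jargon via the transcription `prompt`.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
with open("standup.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        prompt="FastPFOR, SIMD, bit-packing, varint decoding",  # domain terms
    )
print(transcript.text)  # far more likely to render "FastPFOR" correctly
```

It's not full conversational context, but it's a real lever available today.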
> Plus and Enterprise users will get to experience voice and images in the next two weeks. We’re excited to roll out these capabilities to other groups of users, including developers, soon after.
Text + Vision models will only become exciting once we can conditionally sample images given text and text given images (and all other combinations).
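Spelling out the combinations (my notation, not the parent's): with images x and text y, a fully multimodal model would support

```latex
% Sampling modes a joint image-text model should support:
x \sim p(x \mid y)    % text-to-image
y \sim p(y \mid x)    % image-to-text (what ChatGPT's vision feature does)
(x, y) \sim p(x, y)   % unconditional joint generation
```

Today we only get the second of these.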
Again, the model architecture and information are closed, as expected.
"We will be expanding access Plus and Enterprise users will get to experience voice and images in the next two weeks. We’re excited to roll out these capabilities to other groups of users, including developers, soon after."
BUT: "We’re rolling out voice and images in ChatGPT to Plus and Enterprise"
> We’ve created GPT-4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks.
> March 14, 2023
This is technically solvable with more compute thrown at the problem. Think bigger!
Same as programmers and artists.
It's a tool.
It must be used by humans.
It won't replace them, it will augment them.
ChatGPT seems to be down at the moment (10:55, 25-Sept-2023).
It displays only a blank screen with the falsehood disclaimer.
Originally it immediately spit out a bunch of bullet points about losing weight or something (I didn't read it).
The released version just says "Sorry, I can't help with that."
It's kind of funny but also a little bit telling about the prevalence of prejudice in our society when you look at a few other examples they had to fine-tune. For example, show it some flags and ask it to make predictions about the characteristics of a person from that country; by default it would go into plenty of detail just on the basis of the flag images.
Now it says "Sorry, I can't help with that".
My take is that in those cases it should explain the poor logic of trying to infer substantive information about people based on literally nothing more than the country they are from or a picture of them.
Part of it is that LLMs just have a natural tendency to run in the direction you push them, so they can be amplifiers of anything.
I am also terrified of my job prospects in the near future.
Are we really this emotional and irrational? Folks, let's all take a moment to remember that AI is nowhere near conscious. It's an illusion based on patterns that mimic humans.
The speed of user-visible progress over the last 12 months is astonishing.
I've gone from a firm conviction 18 months ago that this type of stuff was 20+ years away to, these days, wondering whether Vernor Vinge's technological singularity is not only possible but coming shortly. It feels like some aspects of it have already hit the IT world: it's always been an exhausting race to keep up with modern technologies, but now whole paradigms and frameworks are being devised and upturned on such a short timescale that large, slow corporate behemoths can barely devise a strategy around a new technology and put a team together by the time it's passé.
(Yes, yes: I understand generative AI / LLMs aren't conscious; I understand their technological limitations; I understand that ultimately they are just statistically guessing the next word; but in the day-to-day world, they work so darn well for so many use cases!)
Because the pace of development is intense. I would love to be financially independent and watch this with excitement and perhaps take on risky and fun projects.
Now I'm thinking - how do I double or triple my income so that I reach financial independence in 3 years instead of 10 years.