Anything with voice controls for routine use is a pretty tough sell. Doing this when you're not completely alone would be annoying to everyone around you.
Most of their examples seem like they could have been done with a right click drop down menu so they don't really need to "re-invent the mouse pointer".
So is this thing talking to Google's servers all the time for the AI integration? So it won't work if you're not connected to the internet? Privacy concerns are obvious; now Google wants to have an AI watching literally everything you do on your computer?
Does it cost the user anything for the LLM use? If it's free will it stay free forever? That's quite a lot to give away if they're expecting people to use it to change a single word like in one of their examples. I guess they're expecting to make the money back by gathering data about literally everything you do on your computer.
There might be a killer app for AI integration with personal computers that has yet to be invented, but this doesn't look like it.
Sometimes I go to a different page to take a screenshot and other times I'm browsing for a file, and other times I'm highlighting some log lines. Cursor did this well, with selecting text in the terminal auto-focusing the Cursor agent textbox so you could talk to the agent and then select some text and you didn't have to re-select the original agent textbox again. The agent is a top-level function in that system not "just another app I have to switch to" to take my context with.
I have some small amount of bias because I've always felt input-constrained on computers. I have to move my hands to go places and that's exasperating. I've tried head tracking, had a vim pedal for a while, and used tiling WMs, and things like this to aid but while my vim-fu is pretty good and I function inside things very well with it, my cross-application interface isn't.
In the end, perhaps we all have our home offices with our Apple Vision Pros and we talk to them like this to maneuver faster through our machines and get our ideas into them.
Cool research. I wonder what we'll end up with.
Why not constrain your computing? It will require some programming chops, but you can note down your common tasks, figure out where actual input is required, and automate the rest.
The second demo seems to be a wash: there's no time saved in saying "move this" versus "move crab". And an app-specific contextual menu would probably be faster.
The third demo doesn't seem to warrant the use of a pointer at all, since there is only one way to interpret the prompt.
None of this means that this approach will not be successful, but there's a reason why so many attempts to revolutionize user interfaces ended up going nowhere. Talking to your computer was always supposed to be the future, but in practice, it's slower and more finicky than typing.
In fact, the only new UI paradigm of the past 28+ years appears to have been touchscreens and swipe gestures on phones. But they are a matter of necessity. No one wants to finger-paint on a desktop screen.
and that's aside from the obvious privacy problems.
And being able to take photos/videos with the glasses (like the Meta ones nowadays) is really useful with my kid because he often does funny or cute stuff and I don't have time to pull my phone out to take a video/photo of it. I guess it could be useful for video calls too so my parents can see him.
But I just don't see anyone sitting in an office, or even at home, talking to their computer. It's really only useful for hands-free settings like when you are driving, or in the kitchen etc.
Maybe you can share a scenario for that one? I can’t figure a scenario where all of this needs to be true. It seems like a recipe for accidents.
It reminds me of Microsoft Recall in the sense that some portion of the screen is going to be continuously transmitted outside of the user's control.
What happens when someone browses something very private (planning a surprise engagement, looking at medical data, planning a protest)? All that data gets slurped to Google and is subject to a warrant or discovery, or used to build your advertising fingerprint.
Maybe the idea is that the data is sent to AI only when you right click, but that seems like a very thin firewall that a product manager will breach in the interests of delivering "predictive AI" via some kind of precomputed results.
Profit!
It's wild that they even put this out as a demo. It should have been picked apart in the internal meeting. There is no way I'd ever show my product taking 5s to change a 1 to a 2 in a piece of text that the user was already hovering over, or taking 10s to drag and drop a line of text from one box to another. Even the image of finding a route between two images could be done quickly if images were auto-OCR'd, which is a setting on most image viewers.
Now you get to hear every person in the office do that around you.
Like, good tech, but do googlers live in the real world? Do they genuinely like the idea of an open office full of people talking to their computers? Do they all live alone without human contact?
I'm sure Don Hopkins can tell you a long annotated tale about the NeWS pizza ordering app that displayed a real-time dynamically-updated rotating pie on the screen as you filled out your order.
the agent occasionally spots your real problem like an experienced engineer
I like text selection exactly how it is. I want precise controls.
It's fine for a touch interface like a phone, but on a computer I expect precision. As much as I can get.
Imagine trying to convince someone in the 90s that that's a step forward.
I'm imagining a webpage with a link - instead of opening a new link to quickly google something or opening three new tabs based on hyperlinks, i can point at a paragraph or line and ask it to tell me about it.
Maybe I can point at a song on Spotify and have it find me the youtube video, or vice versa (of course this is assuming a tool like this wouldn't stay locked into one ecosystem.. which it will).
Point is that the concept of talking to the computer with mouse as pointer is pretty cool and i guess a step closer to that whole sci-fi "look at this part of the screen and do something"
Anyway, I built a prototype on this idea, but instead of relying only on hover, I press Option to select a node in a custom AST-ish semantic layer I designed around a minimalist UI grammar, and Option + up/down arrows to move to the parent/child node. This way, I have an accurate pointer to the element I want to talk about, plus a minimal context window (parent component, state, a few navigation-related queries).
What I learned from using it, though, is that the killer use case isn't necessarily the flashy "talk to this UI element" interaction shown in the Google demos. I do use it that way too; I have `Option + Shift + click` to copy a selector to the clipboard, so I can give an LLM connected to the live medium a precise reference to the element I want to discuss.
But the place where it has been most useful day to day is much simpler: source navigation. Point at the thing in the UI, jump to the code that is responsible for it. The difficult part is jumping to the code you care about (the code for UI or for the semantic element?), but in my system that distinction turned out to be usually obvious, which is what makes the interaction useful.
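To make the interaction above concrete, here is a minimal sketch of parent/child navigation over a semantic tree, with a selector string and a source location per node. This is not the prototype described above: the `Node` class, the `file:line` source format, the ` > ` selector syntax, and the example component names are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)  # compare nodes by identity; the tree has parent/child cycles
class Node:
    """One element of a hypothetical semantic UI tree."""

    name: str
    source: str  # hypothetical "file:line" of the code responsible for this node
    children: list[Node] = field(default_factory=list)
    parent: Node | None = field(default=None, repr=False)

    def add(self, child: Node) -> Node:
        """Attach a child and return it, so trees can be built inline."""
        child.parent = self
        self.children.append(child)
        return child


def move_up(node: Node) -> Node:
    """Option + Up: go to the parent, or stay put at the root."""
    return node.parent or node


def move_down(node: Node) -> Node:
    """Option + Down: go to the first child, or stay put at a leaf."""
    return node.children[0] if node.children else node


def selector(node: Node | None) -> str:
    """Build a root-to-node path that can be pasted into an LLM prompt."""
    parts = []
    while node is not None:
        parts.append(node.name)
        node = node.parent
    return " > ".join(reversed(parts))


# Hypothetical UI: App contains TodoList, which contains TodoItem.
app = Node("App", "app.py:1")
todo = app.add(Node("TodoList", "todo.py:10"))
item = todo.add(Node("TodoItem", "todo.py:42"))

print(selector(item))        # the reference handed to the model
print(move_up(item).source)  # where "jump to source" would land for the parent
```

Keeping the source location on the node is what makes the "point at it, jump to the code" step a plain lookup rather than a search.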
If we manage that, my plans for a pure XML based shell might not be too futuristic '<in><ls/></in><out><tree><file date="CDATA[...]">' ...
Also featured in the Starfire vision video from 1992: https://youtu.be/jhe1DFY-SsQ?t=286
Of course learning proper cad software is probably the right thing here, but having Claude write python scripts which generate HTML files which reference three.js to provide a 3d view has gotten me surprisingly far. If something could take my pointer click and reverse whatever coordinate transforms are between the source code and my screen such that the model sees my click in terms of the same coordinate system it's writing python in, well that would be pretty slick.
Until then I've just had it list every surface in a legend, each colored differently, so I can say "three inches down from the top of pole six, and rotate it so the hoop part of the bolt faces northwest."
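The "reverse the coordinate transforms" idea can be sketched in a much simpler setting than a real three.js scene. The example below assumes a flat 2D view where model coordinates map to screen pixels through a single affine transform; the `Affine2D` class and the example numbers are made up for illustration. A perspective 3D view would additionally need the camera and projection matrices inverted and the depth resolved (for example by casting a ray into the scene), which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Affine2D:
    """Maps model (x, y) to screen (px, py):

    px = a*x + b*y + tx
    py = c*x + d*y + ty
    """

    a: float
    b: float
    c: float
    d: float
    tx: float
    ty: float

    def apply(self, x: float, y: float) -> tuple[float, float]:
        return (self.a * x + self.b * y + self.tx,
                self.c * x + self.d * y + self.ty)

    def inverse(self) -> "Affine2D":
        """Return the transform that maps screen pixels back to model units."""
        det = self.a * self.d - self.b * self.c
        if det == 0:
            raise ValueError("transform is not invertible")
        ia, ib = self.d / det, -self.b / det
        ic, id_ = -self.c / det, self.a / det
        # The inverse translation is -(M^-1 @ t).
        return Affine2D(ia, ib, ic, id_,
                        -(ia * self.tx + ib * self.ty),
                        -(ic * self.tx + id_ * self.ty))


# Assumed view: 10 pixels per model unit, y axis flipped (screen y grows
# downward), model origin drawn at pixel (400, 300).
view = Affine2D(a=10, b=0, c=0, d=-10, tx=400, ty=300)

click = view.apply(3, 2)                  # where model point (3, 2) is drawn
model_point = view.inverse().apply(*click)  # the click, back in model units
print(click, model_point)
```

With something like this in the loop, the model would receive the click in the same units the generating script uses, instead of a spoken description relative to a colored legend.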
Interesting but not “reimagining” anything.
I think the real story here is how vibe coding now enables flashy demo sites like this to be built for a concept that hasn’t yet earned it.
(Not going to happen)
1. select text
2. dictate action
Feels very similar to Helix's select text and act on it.
I think text selection could also be voice controlled (with a modal voice input), so one could say: "select sentence, action mode, copy and paste it in my list and remove duplicates"
We couldn't quite track you well enough before. So we're fixing that under the guise of "AI powered capabilities."
Horizontal dragging with a mouse is actually really hard. Nobody's going to use it like that.
Your arm can easily move your hand and cursor up/down by pivoting your shoulder, but there's no mechanism for left/right movement. It's always an arc.
Or put another way: selection will be a lot slower and more tedious than the demo.
But Google is a very ill-positioned candidate for such an OS. I would rather trust Apple and local-first on-device models.
I'm mostly using my system to make comments on long AI-generated documents (especially design documents). I find it works well to have the AI generate something, and then I read through it, making comments along the way.
You can get pretty far just repeating the things you see... "I'm reading [heading] and [comments]". But I do find some use in selecting content and saying "I don't agree with this" or whatever else.
The result is just an augmented message. It looks like:
<transcript>
Let's see what we've got here.
<selection doc="proposal.md" location="paragraph 3">
The system already...
</selection>
No, I don't like how this is approaching the problem, ...
</transcript>
Then I just send this as a user message. Claude Code (and I'm guessing any of the agentic systems) picks up on the markup very easily. It also helps to label it as a transcript, as it can understand there may be errors, and things like spelling and punctuation are inferred not deliberate. (Some additional instruction is necessary to help it understand, for example, that it should look for homophones that might make more sense in context.)
It makes reviewing feel pretty relaxed and natural. I've played around with similar note taking systems, which I think could be great for studying in school, but haven't had the focus on that particular problem to take it very far.
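For anyone who wants to try the augmented-transcript format shown earlier in this comment, here is a rough sketch of how such a message could be assembled from interleaved speech and selection events. The `Speech` and `Selection` types and the `render_transcript` function are hypothetical names, and escaping the text with `xml.sax.saxutils` is an assumption of this sketch, not something the original system is known to do.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Speech:
    """A chunk of dictated text (may contain transcription errors)."""
    text: str


@dataclass
class Selection:
    """Text the user highlighted while talking."""
    doc: str
    location: str
    text: str


def render_transcript(events) -> str:
    """Interleave speech and selections, in order, into one user message."""
    lines = ["<transcript>"]
    for ev in events:
        if isinstance(ev, Selection):
            lines.append(
                f"<selection doc={quoteattr(ev.doc)} "
                f"location={quoteattr(ev.location)}>"
            )
            lines.append(escape(ev.text))
            lines.append("</selection>")
        else:
            lines.append(escape(ev.text))
    lines.append("</transcript>")
    return "\n".join(lines)


message = render_transcript([
    Speech("Let's see what we've got here."),
    Selection("proposal.md", "paragraph 3", "The system already..."),
    Speech("No, I don't like how this is approaching the problem, ..."),
])
print(message)
```

The only real design decision is keeping the events in the order they happened, so the model can tell which remark refers to which selection.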
But I think the best thing really is giving the agent a richer understanding of what the user is experiencing and doing and just creating a rich representation of that. The keywords can be useful, but almost only as checkpoints: a keyword can identify the moment to take the transcript and package it up and deliver it.
One difference perhaps in design motivation: I have really embraced long latency interactions. I use ChatGPT with extended thinking by default, and just suck it up when the answer didn't really require thinking. I deliver 10 points of feedback at once instead of little by little. (Often halfway through I explicitly contradict myself, because I'm thinking out loud and my ideas are developing.) I just don't stress out about latency or feedback, and so low-latency but lower-intelligence interactions don't do it for me (such as ChatGPT's advanced voice mode, or probably Thinking Machine's work). I think this focus is in part a value statement: I'm trying to do higher quality work, not faster work.
you select text in vscode, and write a comment, and the llm gets both
Perhaps a text box and file upload isn’t the perfect interface for every use case but it is versatile which is a huge barrier to overcome.
Assume that today the most efficient way for humans to transfer information to machines is via voice. Assume that the most efficient way for machines to convey rich information to humans is by printing HTML.
Then a combination of screen + eye tracking + voice is all you need. The mouse doesn't make sense anymore.
At some point I fully expect eye tracking (or attention tracking) to be common enough to be a first-class input method.
On a less serious note, the audience for this is people who want to optimize for what seems like the least amount of effort.
Wait…it's May. Ugh, I'm so confused. :spiral eyes emoji:
Unless of course, their AI gets the same special privileges as the gpu in accessing drm content, and everything else is still locked out.
They have so many great software engineers but are unable to use them to speed up coding AI research. Hopefully with Sergey's focus it will get better.
This cursor thing is just another experiment nobody cares about.
Would be tiresome though to hold hand out all day - but good for mobile and handwriting/drawing. Need zero latency.
Furthermore, the mousepad could become the magnetic sensor and not the screen in order to rest the palm. The nail bands then become the equivalent of the mouse so it's a hall effect mouse. But could the pad detect finger twitch for the buttons, though?
There's a reason chairs are still around. They are 2000+ years old and still waiting to be replaced.
One should be extremely skeptical of claims of replacing tech that has been around for a very long time.
Bullshit!
You don't need any new metaphors to support such (questionable) flows - at all.
Swipes instead of selection rectangles are annoying - you don't see the traces of the swipes on these demo gifs! So, you've effectively "selected" something - but you have to keep in mind WHAT you selected.
Total ridiculous bullshit.
Nightmares are dreams as well and this is a nightmare like Windows Recall.
Technically wonderful though.
This has a good utility.
I also don't think people want to constantly talk to their computers.
People don’t. Tech companies think they do.
I'm hoping for a const-reference joke.
Aaaaand now I can't remember the name of it
Only when you live alone might you be comfortable constantly speaking with your devices. Only if your life is perfectly predefined can you let your fridge order the same food that has just gone stale or has been eaten. And only when you are young and healthy and not in any way differing from the "standard" would you be capable of working like these "researchers" imagine you to.
I'm not that person. I'm constantly failing at doing "triple-finger-taps" whenever I'm in need of one. I have a smartwatch with pedestrian navigation and never bothered to remember which vibration pattern means which turn. I don't configure different vibration patterns for different callers on the phone. I have a folding phone, but I almost never do side-by-side windows and when I do, I need to find out how to do that first -- and then how to leave that mode without losing my mind. I almost never use AI features on my phone not because I don't want to, but because I never remember how to activate them. I don't re-configure my gadgets to "fit my mood". I hate recommendations like "you like X, here's Y, it's the same!" I hate that I can't rest my mouse cursor on websites anymore without selecting something actionable, moving, animating or autoplaying.
All of the examples on the linked page are workflows I would never do this way. I won't be talking to my shopping list to double the ingredients. I won't be drawing gestures with my mouse on a document to activate a voice command. I won't use voice commands in general because as it turns out, I'm not capable of bringing out a complete coherent sentence without pausing and/or changing my mind and/or realizing I'm wrong once.
I appreciate those demos for the progress they are showing. It's impressive and astonishing to see restaurants getting extracted from videos or pictures getting expanded or text edited better than I ever could. It's all modern-day magic in a way. One thing it all isn't is a product. We don't have those anymore -- all we get are gimmicks. We don't do common interfaces anymore either, we are separating people in Google/Apple/Xiaomi camps.
And most importantly we don't use that technology for good except for a bunch of people writing e-mails all day, doing shopping lists and booking one of top restaurants in Tokyo for the same evening on a whim. We are long overdue for a remake of "American Psycho", but this time it will be a documentary instead of a satire.
This reads like an April Fools joke. Even the title sounds like satire.