Check out this video to see it in action: https://preview.screen.studio/share/r1h4DuAk. There are lots more demos at https://github.com/browser-use/browser-use on how we control the web with prompts.
We started coding a decade ago with Selenium bots and macros to automate tasks. Then we both moved into ML. Last November, we asked ourselves, “How hard could it be to build the interface between LLMs and the web?”
We launched on Show HN (https://news.ycombinator.com/item?id=42052432) and have since been addressing various challenges of browser automation, such as:
- Automation scripts break when the website changes
- Automation scripts are annoying to build
- Captchas and rate limits
- Parsing errors and API key management
- And, perhaps worst of all, login screens
People use us to fill out their forms, extract data behind login walls, or automate their CRM. Others use the xPaths browser-use clicked on and build their scripts faster, or directly rerun the actions of browser-use deterministically. We’re currently working on robust task reruns, agent memory for long tasks, parallelization for repetitive tasks, and many other sweet improvements.
One interesting aspect is that some companies now want to change their UI to be more agent-friendly. Some developers even replace ugly UIs with nice ones and use browser-use to copy data over.
Besides the open-source package, we have an API. We host the browser and LLMs for you, handle proxy rotation and persistent sessions, and let you run multiple instances in parallel. We price at $30/month, significantly lower than OpenAI's Operator.
On the open-source side, browser-use remains free. You can use any LLM, from Gemini to Sonnet, Qwen, or even DeepSeek-R1. It's licensed under MIT, giving you full freedom to customize it.
We’d love to hear from you—what automation challenges are you facing? Any thoughts, questions, experiences are welcome!
You are using debugging tools such as CDP, launching Playwright without a sandbox, and guiding users to launch Chrome in debug mode so browser-use can connect to their main browser.
The debugging tools you use have active exploits that Google doesn't fix, because they are meant for debugging, not for production or general use. Combined with your other two design choices, this lets an exploit escalate and infect the user's main machine.
Have you considered not using all these debugging permissions to productionize your service?
The tools they are guiding users to set up and run are "inherently insecure" [https://issues.chromium.org/issues/40056642].
So if you go to a site that can take advantage of these loopholes, your browser is likely to be compromised, and the attack could escalate from there.
They do have a cloud offering that should not carry these risks, but then you have to enter your passwords into their cloud browser environment, which presents a different set of risks. Their cloud offering is basically similar to Skyvern, or to a higher-cost subscription tier we have at rtrvr.ai.
[1] https://developer.chrome.com/docs/chromedriver/get-started
By "in production" I meant the way you are advising your users to set up the local installation. Even if you launch browser-use locally within a container, if you're restarting the user's Chrome in debug mode and controlling it with CDP from within the container, then the door is wide open to exploits and the container doesn't do anything?!
Are you working on unifying the tools that the LLM uses with the MCP / model context protocol?
As far as I understand, lots of other providers (like Bolt/Stackblitz etc.) are migrating towards this. Currently, there aren't many tools available in the upstream specification other than file I/O and some minor interactions for system use, but it would be pretty awesome if tools and services (say, a website service) could be reflected there, as it would save a lot of development overhead for the "LLM bindings".
Very interesting stuff you're building!
https://github.com/Saik0s/mcp-browser-use/blob/main/README.m...
The first: it refused to correctly load the browser tab and would get stuck in a loop trying. I was able to manually override this behavior for the purpose of prototyping.
The second: it hallucinated form input values. I gave it strict instructions on how to fill out a form, and when it didn't know what to do with an address field, it just wrote 123 Main St instead of reporting that it couldn't complete the form.
The thing I really want, and haven't found in any of the browser agents I've tried, is a feedback loop. I don't personally know what the final format looks like. But I want to know that what I expected to happen inside the browser actually happened, and I want it to be verifiable. Otherwise I feel like I'm sending requests into a black hole.
Hmm really? Maybe you could use the sensitive data api to make it more deterministic? https://docs.browser-use.com/customize/sensitive-data
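To make the idea behind that API concrete, here's a library-agnostic sketch of the placeholder pattern such a sensitive-data feature typically relies on (the names and shapes below are illustrative, not browser-use's actual internals): the model only ever sees placeholder keys, and real values are substituted just before the action executes.

```python
# Sketch of the placeholder pattern behind a sensitive-data API:
# the LLM sees only placeholder names, never real values, and the
# real values are swapped in at execution time so they never enter
# the prompt or the model's output.

SECRETS = {"x_username": "jane@example.com", "x_password": "hunter2"}

def resolve_placeholders(value: str, secrets: dict[str, str]) -> str:
    """Replace <secret>key</secret> markers with the real value at action time."""
    for key, real in secrets.items():
        value = value.replace(f"<secret>{key}</secret>", real)
    return value

# The model outputs an action referencing the placeholder only:
llm_action = {"input_text": {"index": 3, "text": "<secret>x_password</secret>"}}

# Substitution happens locally, right before the keystrokes are sent:
typed = resolve_placeholders(llm_action["input_text"]["text"], SECRETS)
```

Because the substitution happens outside the prompt, the model can't hallucinate a made-up address into a field you marked sensitive; it either uses the placeholder or fails visibly.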
How would you imagine a perfect feedback loop?
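For instance, one concrete shape such a loop could take (a rough sketch, not an existing browser-use feature): every step pairs an action with a post-condition on observable page state, and the run report records whether each expectation actually held, so nothing is silently assumed.

```python
# Sketch of a verifiable feedback loop: each step declares an action
# plus a post-condition to check against (mock) page state afterwards.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    action: Callable[[dict], None]   # mutates the (mock) page state
    expect: Callable[[dict], bool]   # post-condition to verify

def run_with_verification(steps: list[Step], page: dict) -> list[tuple[str, bool]]:
    report = []
    for step in steps:
        step.action(page)
        report.append((step.name, step.expect(page)))  # verified, not assumed
    return report

# Mock page: a form that should end up submitted with a known address.
page = {"address": "", "submitted": False}
steps = [
    Step("fill address",
         lambda p: p.__setitem__("address", "1 Infinite Loop"),
         lambda p: p["address"] == "1 Infinite Loop"),
    Step("submit form",
         lambda p: p.__setitem__("submitted", True),
         lambda p: p["submitted"]),
]
report = run_with_verification(steps, page)
```

A failed expectation could then trigger a retry or surface to the caller, instead of the request vanishing into the black hole described above.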
Coincidentally, I played with it over the last weekend using a Gemini model. It's quite promising!
https://x.com/caydengineer/status/1889835639316807980
One thing I'm hoping for is an increase in speed. Right now, the agent is slow for complex tasks, so we're still in an era where it might be better to codify popular tasks (e.g., sending a WhatsApp message) instead of handling them with browser automation. Have y'all looked into Groq / Cerebras?
Agent call 1: Send WhatsApp message (to=Magnus, text=hi). Inside, you open WhatsApp and search for Magnus (without an LLM).
Agent call 2: Select the contact from all possible Magnus contacts.
Script 3: Type the message and click send.
So in total, 2 calls - with Gemini, you could already achieve this in 10-15 seconds.
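The hybrid flow sketched above can be illustrated with a toy example (everything here is a stand-in: the contact book is hard-coded and the "LLM" is a stub): scripted steps run deterministically, and a model call is spent only where judgment is needed, i.e. disambiguating among matches.

```python
# Toy sketch of hybrid automation: deterministic scripted steps, with
# an "LLM" call (stubbed here) only for the ambiguous decision.

def search_contacts(query: str) -> list[str]:
    # Scripted step: no LLM needed to open the app and type a query.
    book = ["Magnus Carlsen", "Magnus Müller", "Magda Nowak"]
    return [c for c in book if query.lower() in c.lower()]

def llm_pick(candidates: list[str], hint: str) -> str:
    # Stub standing in for the single LLM call that disambiguates.
    return next(c for c in candidates if hint in c)

def send_message(contact: str, text: str) -> dict:
    # Scripted step: type the message and click send deterministically.
    return {"to": contact, "text": text, "sent": True}

matches = search_contacts("Magnus")          # scripted
contact = llm_pick(matches, "Carlsen")       # 1 LLM call
result = send_message(contact, "hi")         # scripted
```

The latency then scales with the number of genuinely ambiguous decisions, not with the number of clicks.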
Do you have any built-in features that address these issues?
On most platforms, browser-use only requires the interactive elements, which we extract, and does not need images or videos. We have not yet implemented this optimization, but it will reduce costs for both parties.
Our goal is to abstract backend functionality from webpages. We could cache this and only update the cache when ETags change.
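A minimal sketch of that ETag-gated caching idea (the extraction function and the in-memory cache here are stand-ins for real network and DOM work): re-extract a page's structure only when the server reports a new ETag, otherwise serve the cached result.

```python
# Sketch of ETag-gated caching for extracted page structure:
# re-extract only when the reported ETag changes.

cache: dict[str, tuple[str, list[str]]] = {}  # url -> (etag, elements)
extract_calls = 0

def extract_elements(url: str) -> list[str]:
    # Stand-in for expensive DOM extraction.
    global extract_calls
    extract_calls += 1
    return [f"button#{url}-submit"]

def get_elements(url: str, current_etag: str) -> list[str]:
    etag, elements = cache.get(url, (None, None))
    if etag != current_etag:              # first visit, or page changed
        elements = extract_elements(url)
        cache[url] = (current_etag, elements)
    return elements

first = get_elements("example.com", "abc")   # extracts
second = get_elements("example.com", "abc")  # same ETag: cache hit
third = get_elements("example.com", "def")   # new ETag: re-extract
```

In a real deployment the ETag would come from a cheap conditional request (If-None-Match), so unchanged pages cost one header round-trip instead of a full extraction.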
Websites that really don't want us will come up with audio captchas and new creative methods.
Agents are different from bots. Agents are intended as a direct user clone and could also bring revenue to websites.
Which you or other AIs will then figure out a way around. You literally mention "extract data behind login walls" as one of your use cases, so it sounds like you just don't give a shit about the websites you are impacting.
It's like saying, "If you really don't want me to break into your house and rifle through your stuff, you should just buy a more expensive security system."
As this project is MIT-licensed, companies like Amazon can deploy a managed version and compete against you, with prices close to zero in their free tier and higher quotas than what you are offering.
I predict this project will likely change to AGPL or a new business license to combat this.
LinkedIn's API sucks. I run an analytics platform[0] that uses it and it only has 10% of what our customers are asking for. It'd be great to use browser-use, but in my experience, you run into all sort of issues with browser automation on LinkedIn.
A bigger problem on LinkedIn for us is all the nested UI elements and different scrolling elements. With some configuration in our extraction layer in buildDomTree.js and some custom actions, I believe someone could build a really cool LinkedIn agent.
- No more ads. No more banner ads. No more Google search ads. No more promoted stories. No more submarine ads.
- No more spam. Low quality content is nuked.
- No more clickbait. Inauthentic headlines from dubious sources are removed.
- No more rage comments. Angry commenters are muted. I have enough to worry about in my day.
- No more low-information comments. All the "Same." and "Nice." replies are removed to help me focus.
An agent of the future is there to preserve my precious time and attention. It will safeguard it to a level never before seen, even if it threatens the very business model the internet's consumer tools are based on. You work for me. You help me. Google, Reddit, et al. are adversarial relationships.
In the future, advertisers pay me for the privilege of pitching me. No ad reaches me without my consent or payment.
Please fix the internet. We've waited thirty years for this.
Furthermore, a valid point: if Pepsi spends $1M on ads, why don't you get a piece of it when they pitch to you?
Edit: for anyone else looking for this, it seems that you can: https://github.com/browser-use/browser-use/blob/70ae758a3bfa...
I misunderstood; looking at the demo videos, it seemed like you constantly update elements with borders/IDs, so I assumed that's what is then passed to vision.
> Do you have any great resources on where to get started?
A great place to start is https://chromium.googlesource.com/chromium/src/+/main/docs/a....
Then we present this list to the LLM with the task and the LLM outputs input_text(id 3, Hello World).
Finally, we execute the Playwright code to perform the actual action of inputting text into this element.
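The two steps above can be sketched end to end (the element table, the parser, and the executor below are simplified stand-ins, not browser-use's actual code): interactive elements get numeric IDs, the model's textual action is parsed, and the ID is resolved to a selector for the real browser call.

```python
# Sketch of the loop described above: indexed interactive elements in,
# a textual LLM action like "input_text(id 3, Hello World)" out,
# parsed and dispatched against the element table.
import re

elements = {3: "input#search", 7: "button#submit"}  # id -> selector

def parse_action(llm_output: str) -> tuple[str, int, str]:
    m = re.fullmatch(r"(\w+)\(id (\d+), (.*)\)", llm_output)
    return m.group(1), int(m.group(2)), m.group(3)

def execute(name: str, element_id: int, arg: str) -> str:
    # Stand-in for the Playwright call, e.g. page.fill(selector, arg).
    return f"{name} on {elements[element_id]} with {arg!r}"

action = parse_action("input_text(id 3, Hello World)")
result = execute(*action)
```

Keeping the model's output in this constrained action grammar is what makes the final browser step a plain lookup-and-call rather than another round of interpretation.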
The first site (Idealista) I tried it on flagged and blocked me / my home IP as a bot within 10 seconds.
As an armchair observer, the agents + browser space feels like it’s waiting for someone to make the open source framework that everyone piles on to.
Proxy rotation sounds like a solid way to monetize for businesses.
btw - our biggest challenge is exactly this, solving thousands of issues that arise on the fly.
The other way to achieve this with Browser Use is to save the history from `history = agent.run()` and rerun it with `agent.rerun_history(history)`.
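To illustrate that rerun pattern in a library-agnostic way (this is a sketch; the action shapes below are made up and are not browser-use's actual history format): persist the actions a run produced, then replay them later without any LLM calls.

```python
# Sketch of deterministic rerun: save recorded actions as JSON,
# load them back, and replay them without a model in the loop.
import json, os, tempfile

recorded = [
    {"action": "click", "xpath": "//button[@id='login']"},
    {"action": "input_text", "xpath": "//input[@name='q']", "text": "hello"},
]

def save_history(history: list[dict], path: str) -> None:
    with open(path, "w") as f:
        json.dump(history, f)

def load_history(path: str) -> list[dict]:
    with open(path) as f:
        return json.load(f)

def rerun(history: list[dict]) -> list[str]:
    # Stand-in for driving the real browser via the recorded xPaths.
    return [f"{h['action']} -> {h['xpath']}" for h in history]

path = os.path.join(tempfile.gettempdir(), "history.json")
save_history(recorded, path)
log = rerun(load_history(path))
```

The replay is cheap and repeatable as long as the page structure holds; when an xPath stops matching, that single step can be handed back to the agent.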
I'd love to see if this can be of any use to you!
I ran into a tool called Promptwright on the Discord that was an example of this.
So many people building with / on top of browser-use now - spawned a whole cottage industry! :)
I just about fell out of my chair laughing at your cloud hosted tier with the tagline "We have to eat somehow™" aka "please pay us"
I signed up for the paid tier and I'm hopeful this can help us integrate legacy CRMs with our company's unified communication sales tool.
Either way good luck!
The cool thing is that we can extract xPaths from the agent runs and re-run these scripts deterministically. I think that's a big advantage over pure vision-based systems like Operator.
I just saw this win an AI hackathon in Toronto, but they said it was their own thing, which is quite dishonest. Everyone was rightfully impressed, me as well, not gonna lie. I was a bit sus that someone could come up with something like this in a weekend, but they were from U of Waterloo, the Vector Institute and whatnot, so I said "maybe". Now I know they were just a bunch of scammers, sad.
Anyway, this is a great project, congratulations. It's so good it's making other people win already, lol. I have so many use cases for this. I truly wish you the best!
Edit: Downvote me all you want, if you love scammers so much I can send you their contact so you can "invest" in their trash. Lol.
One of the judges explicitly asked if they actually made this thing or was it something else like "a video" showing what it would be like.
One of the team members confidently replied it was real and that they made it all during the weekend.
It was a bit too good to be true.
Edit: I found a video of the thing. I initially posted it here but decided to delete it, the reason for that is I don't think they deserve to be publicly shamed. We were all having fun and they probably got a little carried away. If any of them sees this just don't do that next time. Play fair.