https://github.com/ishan0102/vimGPT/blob/682b5e539541cd6d710...
> "You need to choose which action to take to help a user do this task: {objective}. Your options are navigate, type, click, and done. Navigate should take you to the specified URL. Type and click take strings where if you want to click on an object, return the string with the yellow character sequence you want to click on, and to type just a string with the message you want to type. For clicks, please only respond with the 1-2 letter sequence in the yellow box, and if there are multiple valid options choose the one you think a user would select. For typing, please return a click to click on the box along with a type with the message to write. When the page seems satisfactory, return done as a key with no value. You must respond in JSON only with no other fluff or bad things will happen. The JSON keys must ONLY be one of navigate, type, or click. Do not return the JSON inside a code block."
I made them a tool that parses an Excel file with a specific structure and calls some endpoints in their internal system to submit the data.
I was curious, so I asked how they were doing it currently. They led me to a computer at the back of their office. The wallpaper had two rectangles, one labeled MS EXCEL and the other INTERNET EXPLORER. Then the person opened these apps, carefully positioned both windows exactly into those rectangles, and ran some auto-clicker (the kind cheaters would use in RuneScape) which moved the cursor and copied and pasted the values from the Excel file into the various forms on the website.
Amazing.
Now normally we would just have contacted the company and asked them for a data extraction so we could cross-reference the data. But since it wasn't clear who was at fault, and we knew it would take weeks for that extraction, we looked for an internal solution first.
Now, there was a subsystem in the machine that worked only in Internet Explorer, with an old authentication scheme, through which we could see the information we needed. So I, being the only person on the team without formal analyst training but having made my way there from a clerk job, knew exactly what to do.
I fired up the old IE and Excel, wrote a VBA script in 5 minutes that did exactly what you described (click there, copy that, etc.), and 30 minutes later we had our extraction and had resolved the issue completely before the packages were even shipped.
All hail Excel.
copying (or in most cases even worse: re-typing) form data from one location on the screen into yet another webform.
Username, password, email address, physical address, credit card info etc etc.
Some extensions try to help with data entry, but none of them work properly and consistently enough to really help. Even consistently filling in just a username and password is too much to ask.
It's my number 1 frustration when using the internet (worse than ads) and I find it mind-blowing that this hasn't been solved yet with or without LLMs.
I would pay a monthly fee for any software that solves this once and for all, and it sounds like it's coming (and I'm already paying their monthly fee).
Simple: it's because not solving this problem is how our godawful industry makes most of its money. Empowering the user means relinquishing control over their "journey"[0]. Ergonomics means fewer opportunities to upsell or show ads.
I don't have the link handy, but I'm reminded of one of the earliest Windows user interface guidelines documents, from the Windows 95/98 era, which, in a section about theming/visual style, already recognized that they had to allow for full flexibility, because vendors would insist on fucking the experience up for the sake of branding anyway, and resisting it was futile[1].
--
[0] - I'm trying really hard to hold back my contempt towards terms like this, and the whole salesy way of viewing human-computer interactions.
[1] - They put it in much more polite terms, but the feeling of helplessness was already there.
The commercial web? Not the above.
This is just a baseline. I'm sure an LLM can help with the issue, but the biggest problem is that these varied HTTP-with-datastore apps are islands passing messages in bottles back and forth, while a bash pipeline is akin to fiber optics.
Up to this point, these products have been quite brittle. The recent explosion of AI tech seems like quite a boon for this space.
I'm kind of in a happy accident situation because I was working on something for RPA, which then became a layer that was factored as its own product, but now might be able to come full circle as a result of AI.
Essentially, this layer can function as a "delivery medium" for RPA agent creation that you can use on any device without a download. However, as it has many other uses, I've been working on those, though I've been seeking a great reason to get back into RPA.
I have a cool idea to leverage human-guided AI creation of data maps and action tours for RPA, but, similar to what you say, unless great care is taken you can end up with a brittle approach. Also, as the market has been quite saturated with many reasonable approaches, I just haven't felt compelled.
Yet now I think the possible merging of GPT level AIs with browser instrumentation to deliver an augmented way to browse the web makes that incredibly compelling.
So I'm incredibly thrilled that I have this happy accident of BrowserBox^0 (the factored-out layer originally from the RPA work above), which provides a pluggable/iframe-embeddable interface for remotely controlling a headless browser. So now I want to look at unifying BrowserBox with this kind of GPT-driven exploration.
It's even cooler because, as BB enables co-browsing by default (multiplayer browsing) and turns the browser into a "client-server" architecture, I can see that plugging in GPT-4V as a connecting client, with some kind of minimal API affordance for it to use (like the very cool Vimium keyboard-enabled browsing in the OP), would be such an interesting project to try!
We're open source, so if you want to check us out or join this quest, come say hi and get involved if you're game!
Integrating things like ChatGPT will still require people who know what they are doing to look at it, and I wouldn't be surprised if the first advice they give is "don't use ChatGPT for it".
Also remember that this is essentially v1 of the software, the Windows 95 of this adoption cycle.
If you go to platforms like Upwork, there are thousands of VAs in low-cost labor countries who do nothing but manual data entry work. IMO that's a complete waste of human capital, and I've made it my personal mission to automate such tedious and un-creative data work with https://kadoa.com.
I heard recently that "click-work" works out to about $4/hr.* If you could do that x50, passively, it's a fine income.
* - see https://journals.sagepub.com/doi/full/10.1177/14614448231183... or listen to https://kpfa.org/episode/against-the-grain-october-30-2023/ ... it's a fascinating study. Terrible pay (way below minimum wage) but surprisingly high worker satisfaction. The users seem to view it as entertainment, essentially categorizing it as casual gaming.
The "asshole innovator" in me wonders if one could simply make it more entertaining and forego paying the user entirely.
Yup.
I was briefly part of a decades-long effort to migrate off a mainframe backend. It was basically a very expensive shared flat-file database (e.g. FileMaker Pro), used by thousands of applications, neither inventoried nor managed. Surely a handful were critical for daily operations, but no one remembered which ones.
And the source data (quality) was filthy.
I suggested we pay some students to manually copy just the bits of data our spiffy "modern" apps needed.
No one was amused.
--
I also suggested we find a suitable COBOL runtime and just forklift the mainframe's "critical" infra into a virtual machine.
No one was amused.
Lastly, I suggested we throttle access to every unidentified mainframe client. Progressively making it slower over time. Surely we'd hear about anything critical breaking.
That suggestion flew like a lead zeppelin.
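The progressive-throttling idea above can be sketched as a simple delay schedule: unidentified clients get a response delay that grows over time, so anything critical surfaces as a complaint long before it breaks outright. Everything here (function name, step size, cap) is illustrative; the mainframe in the story had no such hook:

```python
from datetime import date

def throttle_delay_ms(start: date, today: date,
                      step_ms: int = 50, cap_ms: int = 5000) -> int:
    """Delay to add to each unidentified client's request, in milliseconds.

    Grows by `step_ms` per week elapsed since throttling started,
    capped so requests still complete and users can still complain.
    """
    weeks = max(0, (today - start).days // 7)
    return min(cap_ms, weeks * step_ms)
```

The cap matters: the goal is to flush out forgotten critical clients, not to break them outright.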
I started a similar experiment if anyone else is thinking along the same lines :)
It is going to be incredibly difficult, moving forward, to distinguish bot traffic if this is deployed at scale.
The problem I see is that this isn't going to be cheap, or even affordable, in the short term.
1. receiving payslips from the accountant, and then
2. manually initiating bank transfers to each employee for the amount in the corresponding payslip, and then
3. manually initiating a bank transfer to the tax authority to pay the withheld salary taxes.
This is completely useless manual labor. There should be no reason for this to be a manual procedure, and yet it's almost impossible to automate. The accountant portal either has no API, or it has an API that only lets you download the data as PDF, and/or the API costs good money. The bank either has no API, or it requires you to sign up for a developer account as if you're going to publish a public app, when you're just looking to automate some internal procedures.
So the easiest way to pay salaries and taxes is still to hire a person to do it manually. Hopefully one day that won't be necessary anymore. I wouldn't trust an AI to actually initiate the bank transfers, but maybe they can just prepare the transactions and then a person has to approve the submission.
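The prepare-then-approve split in the last sentence could look something like this: software drafts the transfer batch from the payslips, and a human reviews it before anything touches the bank. The record layout and field names below are invented for illustration; real accountant portals and banks each have their own formats:

```python
from decimal import Decimal

def prepare_transfer_batch(payslips):
    """Turn payslip records into a draft transfer batch for human approval.

    `payslips` is a list of dicts with keys "employee", "iban",
    "net_pay", and "tax_withheld" (amounts as decimal strings).
    Nothing is submitted anywhere; the output is a draft to review.
    """
    transfers = []
    total_tax = Decimal("0")
    for slip in payslips:
        transfers.append({
            "to": slip["iban"],
            "amount": Decimal(slip["net_pay"]),
            "reference": f"Salary - {slip['employee']}",
        })
        total_tax += Decimal(slip["tax_withheld"])
    # One aggregated transfer to the tax authority for the withheld taxes.
    transfers.append({
        "to": "TAX-AUTHORITY-IBAN",  # placeholder account
        "amount": total_tax,
        "reference": "Withheld salary taxes",
    })
    return {"status": "awaiting_approval", "transfers": transfers}
```

Using `Decimal` rather than floats avoids rounding surprises in monetary sums; the human approver stays the last gate before submission.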
Automating repetitive GUI workflows is the goal of https://github.com/OpenAdaptAI/OpenAdapt
I think Adept pivoted along the way but original concept was very similar to this.
As a technical blind person, my only concern is the inherent loss of privacy while sharing stuff with the big models.
It also shows that GPT-4V created a new angle in web scraping.
I guess this or similar code will be leveraged in many projects, like:
1. Scraping XXX websites, say LinkedIn or Twitter. They use all kinds of methods in the DOM to prevent it, but fighting a well-working GPT-4V + OCR pipeline would be ultra hard.
2. "Give me an analysis of what these XXX companies are doing." This could be done for competitors, to understand the landscape of some industry, or even plainly to get news.
Large-scale scraping that doesn't depend on the source code of the pages is a powerful infrastructural change.
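A vision-based scraper along these lines boils down to sending a screenshot plus an extraction prompt to the model. The sketch below only builds a request payload in the content-parts shape the OpenAI chat API expects for image input; the model name is an assumption, and an actual call would need an API key and network access:

```python
import base64

def build_vision_request(screenshot_png: bytes, instruction: str,
                         model: str = "gpt-4-vision-preview") -> dict:
    """Build a chat-completion payload pairing a screenshot with a prompt."""
    image_b64 = base64.b64encode(screenshot_png).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
```

Because the input is pixels rather than the DOM, none of the usual markup obfuscation tricks apply.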
Maybe a future version could use vision only for difficult situations where it gets stuck, and otherwise use the text-based browser?
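That hybrid approach might be structured as a simple fallback: try the cheap text/DOM path first and reach for the vision model only when it fails or stalls. Both handlers below are stand-ins for illustration:

```python
def run_step(page_html, screenshot, text_handler, vision_handler):
    """Prefer the cheap text-based handler; fall back to vision when stuck.

    `text_handler` and `vision_handler` are stand-ins for a DOM-based
    and a GPT-4V-based action chooser, respectively. A `None` result or
    any exception from the text path counts as "stuck".
    """
    try:
        action = text_handler(page_html)
        if action is not None:
            return action, "text"
    except Exception:
        pass  # treat any text-path failure as "stuck"
    return vision_handler(screenshot), "vision"
```

Since vision calls are slower and pricier, routing only the stuck cases to them keeps the common path cheap.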
i know - ai isn't meant to be sentient. but if it looks like a duck and quacks like a duck...
how do i know that the comments here aren't done by dedicated hacker news ai bots?
the potential danger could come from lack of supervision down the road.
i didn't get much sleep last night so this is less coherent than it could be.
I'm still going through the source, but it's a really nice idea and a great example of enriching GPT with tools like Vimium.
For handicapped people who depend on tools like this for accessibility, it's justified, but I wouldn't use it myself if it uses too much power.
I'm sure OpenAI and friends love operating at a loss until everyone uses their products, then enshittify or raise prices, like Netflix, Microsoft, Google, etc., but CO2 emissions can't be easily reversed.
I'd be glad to hear other points of view though; maybe everything we do on computers is already bad for the environment anyway, and comparing which one pollutes more is futile, idk.