https://github.com/ishan0102/vimGPT/blob/682b5e539541cd6d710...
> "You need to choose which action to take to help a user do this task: {objective}. Your options are navigate, type, click, and done. Navigate should take you to the specified URL. Type and click take strings where if you want to click on an object, return the string with the yellow character sequence you want to click on, and to type just a string with the message you want to type. For clicks, please only respond with the 1-2 letter sequence in the yellow box, and if there are multiple valid options choose the one you think a user would select. For typing, please return a click to click on the box along with a type with the message to write. When the page seems satisfactory, return done as a key with no value. You must respond in JSON only with no other fluff or bad things will happen. The JSON keys must ONLY be one of navigate, type, or click. Do not return the JSON inside a code block."
I made them a tool that parses an Excel file with a specific structure and calls some endpoints in their internal system to submit the data.
I was curious, so I asked how they were doing it currently. They led me to a computer at the back of their office. The wallpaper had two rectangles, one labeled MS EXCEL and the other INTERNET EXPLORER. Then the person opened these apps, carefully positioned both windows exactly into those rectangles, and ran some auto-clicker (the kind cheaters would use in RuneScape) which moved the cursor and copied and pasted the values from the Excel file into the various forms on the website.
Amazing.
Now normally we would just have contacted the company and asked them for a data extraction so we could cross-reference the data. But since it wasn't clear who was at fault, and we knew it would take weeks for that extraction, we looked for an internal solution first.
Now, there was a subsystem in the machine that worked only in Internet Explorer, with an old authentication scheme, through which we could see the information we needed. So I, being the only person on the team without formal analyst training but having made my way there from a clerk job, knew exactly what to do.
I fired up the old IE and Excel, wrote a VBA script in 5 minutes that did exactly what you described (click there, copy that, etc.), and 30 minutes later we had our extraction and had resolved the issue completely before the packages were even shipped.
All hail Excel.
copying (or in most cases even worse: re-typing) form data from one location on the screen into yet another webform.
Username, password, email address, physical address, credit card info etc etc.
Some extensions try to help with data entry, but none of them work properly and consistently enough to really help. Even consistently filling in just a username and password is too much to ask.
It's my number 1 frustration when using the internet (worse than ads) and I find it mind-blowing that this hasn't been solved yet with or without LLMs.
I would pay a monthly fee for any software that solves this once and for all, and it sounds like it's coming (and I'm already paying their monthly fee).
Simple: it's because not solving this problem is how our godawful industry makes most of its money. Empowering the user means relinquishing control over their "journey"[0]. Ergonomics means fewer opportunities to upsell or show ads.
I don't have the link handy, but I'm reminded of one of the earliest Windows user interface guidelines documents, from the Windows 95/98 era, which, in a section about theming/visual style, already recognized that they had to allow for full flexibility, because vendors would insist on fucking the experience up for the sake of branding anyway, and resisting it was futile[1].
--
[0] - I'm trying really hard to hold back my contempt towards terms like this, and the whole salesy way of viewing human-computer interactions.
[1] - They put it in much more polite terms, but the feeling of helplessness was already there.
The commercial web? Not the above.
This is just a baseline. I'm sure an LLM can help with the issue, but the biggest problem is that these varied HTTP-with-datastore apps are islands passing messages in bottles back and forth, while a bash pipeline is akin to fiber optics.
Up to this point, these products have been quite brittle. The recent explosion of AI tech seems like quite a boon for this space.
I'm kind of in a happy accident situation because I was working on something for RPA, which then became a layer that was factored as its own product, but now might be able to come full circle as a result of AI.
Essentially, this layer can function as a "delivery medium" for RPA agent creation that you can use on any device without a download. However, as it has many other uses, I've been working on those, though I've been seeking a great reason to get back into RPA.
I have a cool idea to leverage human-guided AI creation of data maps and action tours for RPA, but, similar to what you say, unless great care is taken you can end up with a brittle approach. Also, as the market has been quite saturated with many reasonable approaches, I just haven't felt compelled.
Yet now I think the possible merging of GPT level AIs with browser instrumentation to deliver an augmented way to browse the web makes that incredibly compelling.
So I'm incredibly thrilled that I have this happy accident of BrowserBox^0 (the factored-out layer originally from the RPA work above), which provides a pluggable/iframe-embeddable interface for remotely controlling a headless browser. So now I want to look at unifying BrowserBox with this kind of GPT-driven exploration.
It's even cooler because, as BB enables co-browsing by default (multiplayer browsing) and turns the browser into a "client-server" architecture, I can see that plugging in GPT-4V as a connecting client, with some kind of minimal API affordance for it to use (like the very cool Vimium keyboard-enabled browsing in the OP), would be such an interesting project to try!
We're open source, so if you want to check us out or join this quest, come say hi and get involved if you're game!
Integrating things like ChatGPT will still require people who know what they are doing to look at it, and I wouldn't be surprised if the first advice they give is "don't use ChatGPT for it".
Also remember that this is essentially v1 of the software, the Windows 95 of this adoption cycle.
If you go to platforms like Upwork, there are thousands of VAs in low-cost labor countries who do nothing but manual data entry work. IMO that's a complete waste of human capital, and I've made it my personal mission to automate such tedious and un-creative data work with https://kadoa.com.
I heard recently that "click-work" works out to about $4/hr.* If you could do that x50, passively, it's a fine income.
* - see https://journals.sagepub.com/doi/full/10.1177/14614448231183... or listen to https://kpfa.org/episode/against-the-grain-october-30-2023/ ... it's a fascinating study. Terrible pay (way below minimum wage) but surprisingly high worker satisfaction. The users seem to view it as entertainment, essentially categorizing it as casual gaming.
The "asshole innovator" in me wonders if one could simply make it more entertaining and forego paying the user entirely.
Yup.
I was briefly part of a decades-long effort to migrate off a mainframe backend. It was basically a very expensive shared flat-file database (e.g. FileMaker Pro), used by thousands of applications, neither inventoried nor managed. Surely a handful were critical for daily operations, but no one remembered which ones.
And the source data (quality) was filthy.
I suggested we pay some students to manually copy just the bits of data our spiffy "modern" apps needed.
No one was amused.
--
I also suggested we find a suitable COBOL runtime and just forklift the mainframe's "critical" infra into a virtual machine.
No one was amused.
Lastly, I suggested we throttle access to every unidentified mainframe client. Progressively making it slower over time. Surely we'd hear about anything critical breaking.
That suggestion flew like a lead zeppelin.
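The progressive-throttling idea above can be sketched as a simple delay schedule: unidentified clients get a response delay that grows over time, so anything critical surfaces as a complaint long before it breaks outright. Everything here (function name, step size, cap) is illustrative; the mainframe in the story had no such hook:

```python
from datetime import date

def throttle_delay_ms(start: date, today: date,
                      step_ms: int = 50, cap_ms: int = 5000) -> int:
    """Delay to add to each unidentified client's request, in milliseconds.

    Grows by `step_ms` per week elapsed since throttling started,
    capped so requests still complete and users can still complain.
    """
    weeks = max(0, (today - start).days // 7)
    return min(cap_ms, weeks * step_ms)
```

The cap matters: the goal is to flush out forgotten critical clients, not to break them outright.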
I started a similar experiment if anyone else is thinking along the same lines :)
It is going to be incredibly difficult, moving forward, to distinguish bot traffic if this is deployed at scale.
The problem I see is that this isn't going to be cheap, or even affordable, in the short term.
1. receiving payslips from the accountant, and then
2. manually initiating bank transfers to each employee for the amount in the corresponding payslip, and then
3. manually initiating a bank transfer to the tax authority to pay the withheld salary taxes.
This is completely useless manual labor. There should be no reason for this to be a manual procedure, and yet it's almost impossible to automate. The accountant portal either has no API, or it has an API that only lets you download the data as PDF, and/or the API costs good money. The bank either has no API, or it requires you to sign up for a developer account as if you're going to publish a public app, when you're just looking to automate some internal procedures.
So the easiest way to pay salaries and taxes is still to hire a person to do it manually. Hopefully one day that won't be necessary anymore. I wouldn't trust an AI to actually initiate the bank transfers, but maybe they can just prepare the transactions and then a person has to approve the submission.
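The prepare-then-approve split in the last sentence could look something like this: software drafts the transfer batch from the payslips, and a human reviews it before anything touches the bank. The record layout and field names below are invented for illustration; real accountant portals and banks each have their own formats:

```python
from decimal import Decimal

def prepare_transfer_batch(payslips):
    """Turn payslip records into a draft transfer batch for human approval.

    `payslips` is a list of dicts with keys "employee", "iban",
    "net_pay", and "tax_withheld" (amounts as decimal strings).
    Nothing is submitted anywhere; the output is a draft to review.
    """
    transfers = []
    total_tax = Decimal("0")
    for slip in payslips:
        transfers.append({
            "to": slip["iban"],
            "amount": Decimal(slip["net_pay"]),
            "reference": f"Salary - {slip['employee']}",
        })
        total_tax += Decimal(slip["tax_withheld"])
    # One aggregated transfer to the tax authority for the withheld taxes.
    transfers.append({
        "to": "TAX-AUTHORITY-IBAN",  # placeholder account
        "amount": total_tax,
        "reference": "Withheld salary taxes",
    })
    return {"status": "awaiting_approval", "transfers": transfers}
```

Using `Decimal` rather than floats avoids rounding surprises in monetary sums; the human approver stays the last gate before submission.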
Automating repetitive GUI workflows is the goal of https://github.com/OpenAdaptAI/OpenAdapt
I think Adept pivoted along the way but original concept was very similar to this.
As a technical blind person, my only concern is the inherent loss of privacy while sharing stuff with the big models.
It also shows that GPT-4V created a new angle in web scraping.
I guess this or similar code will be leveraged in many projects, like:
1. Scraping XXX websites, say LinkedIn or Twitter. They use all kinds of methods in the DOM to prevent it, but fighting a well-working GPT-4V + OCR pipeline would be ultra hard.
2. "Give me an analysis of what these XXX companies are doing." This could be done for competitors, to understand the landscape of some industry, or even plainly to get news.
Large-scale scraping that doesn't depend on the source code of the pages is a powerful infrastructural change.
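A vision-based scraper along these lines boils down to sending a screenshot plus an extraction prompt to the model. The sketch below only builds a request payload in the content-parts shape the OpenAI chat API expects for image input; the model name is an assumption, and an actual call would need an API key and network access:

```python
import base64

def build_vision_request(screenshot_png: bytes, instruction: str,
                         model: str = "gpt-4-vision-preview") -> dict:
    """Build a chat-completion payload pairing a screenshot with a prompt."""
    image_b64 = base64.b64encode(screenshot_png).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }
```

Because the input is pixels rather than the DOM, none of the usual markup obfuscation tricks apply.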
Maybe a future version could use vision only for difficult situations where it gets stuck, and otherwise use the text-based browser?
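That hybrid approach might be structured as a simple fallback: try the cheap text/DOM path first and reach for the vision model only when it fails or stalls. Both handlers below are stand-ins for illustration:

```python
def run_step(page_html, screenshot, text_handler, vision_handler):
    """Prefer the cheap text-based handler; fall back to vision when stuck.

    `text_handler` and `vision_handler` are stand-ins for a DOM-based
    and a GPT-4V-based action chooser, respectively. A `None` result or
    any exception from the text path counts as "stuck".
    """
    try:
        action = text_handler(page_html)
        if action is not None:
            return action, "text"
    except Exception:
        pass  # treat any text-path failure as "stuck"
    return vision_handler(screenshot), "vision"
```

Since vision calls are slower and pricier, routing only the stuck cases to them keeps the common path cheap.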
i know - ai isn't meant to be sentient. but if it looks like a duck and quacks like a duck...
how do i know that the comments here aren't done by dedicated hacker news ai bots?
the potential danger could come from lack of supervision down the road.
i didn't get much sleep last night so this is less coherent than it could be.
I'm still going through the source, but it's a really nice idea and a great example of enriching GPT with tools like Vimium.
For handicapped people who depend on tools like this for accessibility, it's justified, but I wouldn't use it myself if it uses too much power.
I'm sure OpenAI and friends love operating at a loss until everyone uses their products, then enshittify or raise prices, like Netflix, Microsoft, Google, etc., but CO2 emissions can't be easily reversed.
I'd be glad to hear other points of view though; maybe everything we do on computers is already bad for the environment anyway, and comparing which one pollutes more is futile, idk.