An example of this is California drought data: grabbing it automatically is incredibly difficult because it involves scraping HTML tables. I tried to build an API that presents drought data so volunteers would have an easier time building data visualizations, but I ended up exhausted by all the scraping work.
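For the curious, the core of the tedium is turning markup into rows. A minimal sketch with Python's stdlib `html.parser` (made-up drought data; real pages are far messier, which is exactly the problem):

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the cell text of every row in a <table>."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# Hypothetical sample; real drought tables have nested markup, colspans, etc.
html = ("<table><tr><th>County</th><th>Level</th></tr>"
        "<tr><td>Kern</td><td>D3</td></tr></table>")
p = TableParser()
p.feed(html)
# p.rows now holds the header row plus one data row
```

Every quirk in the real markup (nested tables, colspans, stray whitespace) means another special case in code like this, which is why doing it by hand wears you down.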
I then moved on to a new project: building a free-to-use Padmapper for affordable housing. The data for income-restricted apartment units is managed by a government-contracted vendor: a city or county declares income-stabilization policies and legally enforces them against landowners, and the landowners then send their lists of units to the vendor.
This would be great, except the vendor does the bare minimum. Padmapper looks amazing, but it's really only applicable to the upper middle class due to explosive housing costs in the Bay Area. So, to provide a more modern website and mobile application for the community, I started to scrape the vendor's website. It was terrible; I kept getting throttled. So I gave up.
We have a new WrapAPI API Builder that looks like a browser, and it's as easy to use as one too. You can define your API's inputs with a quick tap on the address bar, and point and click at the data you want to extract.
We also have a Chrome extension that's smarter and better-integrated than ever. It records your requests and automatically creates parameter inputs for the values that change between requests to the same endpoint. The contents of your captures are immediately ready for you to start defining outputs and the data to extract, too.
Let me know if you have any questions or feedback!
This is a big thing on many sites now.
Also, since that is the case, you could build this in a few hours using something like https://github.com/bda-research/node-crawler. Yes, it would have no GUI, so you lose that.
Just reading about Kantu now. It reminds me of http://www.sikuli.org/
Is this happening on your site? If not, I'd appreciate some tips on coding it, and on handling exception cases where the wizard can't stay in sync or the user clicks on unintended page elements.
The most helpful part is that you can pass a callback which will trigger before/during/after each step, which can let you ensure that the state of the page matches what you're expecting. In our case, we use it to make sure that you're switched to the right tab, etc. Take a look! I highly recommend it.
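The per-step callback pattern described above can be sketched generically (hypothetical names and shapes; the actual tour library's API differs):

```python
def run_tour(steps, callback):
    """Drive a guided tour, invoking the callback around each step so the
    host app can fix up page state first (e.g. switch to the right tab)."""
    for step in steps:
        callback("before", step)  # chance to sync page state
        step["show"]()            # display this step's tooltip
        callback("after", step)   # e.g. clean up highlights

# Usage: record the callback phases for two trivial steps.
events = []
steps = [{"name": "intro", "show": lambda: None},
         {"name": "address-bar", "show": lambda: None}]
run_tour(steps, lambda phase, step: events.append((phase, step["name"])))
```

The point is that the callback fires on every transition, so a mismatch between the tour and the live page can be caught and corrected before the next tooltip appears.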
I often see one or more commenters write what seems like an excessively positive thought dump on Show HNs. It just doesn't seem like the natural conversational tone everyone uses, but I can't quite put my finger on it.
Has anyone else noticed it? Is there a term for this sort of writing style?
That endpoint will then emit a state token, which includes the session cookies. You can feed that state token into your next request, and it'll authenticate you.
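To illustrate, here's one guess at what such a token could look like under the hood (a purely hypothetical encoding; the real token format is opaque to callers):

```python
import base64
import json

def make_state_token(cookies):
    """Pack session cookies into an opaque, URL-safe token.
    (Hypothetical format -- shown only to make the flow concrete.)"""
    raw = json.dumps(cookies).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")

def cookies_from_token(token):
    """Recover the cookies so the next request can send them back."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw)

# A login endpoint would emit the token...
token = make_state_token({"sessionid": "abc123", "csrftoken": "xyz"})
# ...and a later endpoint would turn it back into a Cookie header.
cookies = cookies_from_token(token)
cookie_header = "; ".join(f"{k}={v}" for k, v in cookies.items())
```

However it's encoded, the effect is the same: the caller never touches cookies directly, it just threads the token from one call into the next.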
I wanted to give you a heads-up that the YouTube video at the end of your Joyride tutorial is broken.
It tries to play this: https://www.youtube.com/watch?v=10yKzP3gtkc
Why?
When I was last working inside an organization and reviewing vendors for a product, it really left a bad taste in my mouth when they had "Ask for Pricing." I get it, my consulting work is basically Ask for Pricing, I understand the business strategy. But it's such a headache to sit through bullshit product demos for multiple vendors over a few weeks just to hear that their pricing structure is way out of line.
There is this idea that a lot of companies have, where they're more "professional" or conversion-optimized by removing public pricing and putting everyone through a sales funnel. But that concept only works if 1) you have a great product and 2) you have a great sales team, capable of making my time to failure in the conversion process fast and painless. Every company thinks they have this, but they almost never do. I really don't think you want to optimize your business for keeping stamp enthusiasts happy.
In the back of their heads, some people imagine the service is going to be huge, and then they worry that all the profits will be paid out to WrapAPI.
Better to have a high headline number and then offer discounts for certain uses (non-profits, open source, students, etc.). People are optimistic about how much money they might make, so a high headline future price for when you graduate from the free tier is not necessarily a bad thing.
WrapAPI seems to tackle the same task (web scraping) from a very different angle. I wonder if anyone has used both and can compare.
Let's say you have a web-based inventory management system or CRM that requires a login, but you want to take data a customer has sent you in a spreadsheet and automatically batch enter it into the CRM, which doesn't have that functionality. You could then:
1. Create an API endpoint that allows you to log into that system and return a state token
2. Create a second API endpoint that parametrizes the inputs of the form used to create a new inventory entry
3. Chain those 2 API endpoints together so that the 2 actions are actually combined into one API call
Our focus is not only on getting data, but on automating the many things that you or your company does with websites, to save time.
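The three numbered steps above can be sketched as a batch-import loop (hypothetical endpoint names; `call` stands in for an HTTP POST to a WrapAPI endpoint):

```python
import csv
import io

# Hypothetical endpoint names -- the real ones come from your own account.
LOGIN_ENDPOINT = "crm/login"
CREATE_ENDPOINT = "crm/create-entry"

def batch_import(csv_text, credentials, call):
    """Log in once (step 1), then create one inventory entry per
    spreadsheet row (step 2), carrying the state token forward each time."""
    state_token = call(LOGIN_ENDPOINT, credentials)  # emits session state
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The state token authenticates every follow-up call.
        results.append(call(CREATE_ENDPOINT, {"stateToken": state_token, **row}))
    return results
```

With step 3 (a chained endpoint), the login and create calls would collapse into a single API call, so the caller wouldn't handle the state token at all.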
I've used similar services like parsehub.com in the past, and if they hadn't had a pricing page I never would have tried them. Just my 2 cents.
You are using xpath here right?
They were bought by Palantir and wound down gracefully, keeping people's data available for a while and communicating well.
It was a great product, but it was still hard to build a practical business model around it.
This WrapAPI v2 seems like an alternative, but I would use it with care: the economic model is uncertain and it seems really new. Still promising! :)
The company that runs this software as a service needs to be very careful. 3Taps was similar and got destroyed for relaying data scraped from Craigslist.
Contacting the server after its operator has expressed its wish for you to stop is a violation of the CFAA (in that you are "exceeding authorized access" and/or gaining "unauthorized access" to a protected computer system). If it's found that the site's ToS is binding upon you, which it typically would be, you don't really even need separate notice to be held liable.
Storing a copy of a web page in RAM creates a copy that is eligible for copyright protection, and it is likely that any implied license to read that page will be invalidated by the access revocation.
IANAL.
https://books.google.ca/books?id=a-yu2-JUQNAC&pg=PT249&lpg=P...