Scraping is really something that's better done in the back end; these days there are plenty of libraries that let you access websites from Java and run all the JavaScript needed to render the page properly.
Second, if I were going to scrape, I'd rather do it with WebDriver than anything else; injecting some client-side scraping tools and using WebDriver purely as a driver, not a driver/scraper, sounds remarkably better.
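A rough sketch of that driver-plus-injection idea. The Selenium calls, the selector, and the artoo URL below are my assumptions for illustration, not something spelled out in this thread: WebDriver only navigates and evaluates JavaScript, while a client-side tool like artoo does the actual scraping in the page.

```python
# Sketch: use WebDriver as a driver only, and inject a client-side
# scraping helper (e.g. artoo.js) into the loaded page.

ARTOO_SRC = "https://medialab.github.io/artoo/public/dist/artoo.latest.min.js"  # assumed URL

def bootstrap_js(src: str) -> str:
    """Return a JS snippet that appends a <script> tag loading `src`."""
    return (
        "var s = document.createElement('script');"
        f"s.src = {src!r};"
        "document.body.appendChild(s);"
    )

# With a running driver (setup assumed, not shown in the thread):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("http://www.foo.com")
#   driver.execute_script(bootstrap_js(ARTOO_SRC))
#   # hypothetical scrape call once artoo has loaded:
#   data = driver.execute_script("return artoo.scrape('li.item', {title: 'text'});")
```

The point of the split is that the selector logic lives in the browser (where it was prototyped), and WebDriver just automates page loads and hands the results back.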
I see no reason ever not to use a browser to consume HTML content.
If we want to Publish Elsewhere, Syndicate to Own Site (the #IndieWeb dubs this PESOS), if we want to have our own experiences we can talk about, client side is the way to go.
And I just loooove listening to artoo beep over and over ;)
But combining both would be nice: it would make it possible to automate scrapers that were developed quickly, directly in the browser, with artoo.
I don't know if it's possible, but could this run as a Chrome extension, in a background script: loading various pages, executing code on them, and so on, storing the data in the extension's localStorage?
It could also store the scrapers' code for reuse.
It's annoying to have to run scripts multiple times, tweaking them after each run to get exactly what you need. It's a waste of time...
$ ipython
In [1]: from pyquery import PyQuery as pq
In [2]: pq("http://www.foo.com")("<some jquery selectors>")
(inspect the output, repeat until it's right)... or do it with requests + lxml.etree, or whatever you want.
When you have what you need, copy and paste it into a file.
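To illustrate the requests + lxml.etree variant of that loop: the markup below is made up for the example, standing in for what `requests.get(url).text` would return, so the snippet stays self-contained.

```python
from lxml import etree

# In practice this string would come from: requests.get("http://www.foo.com").text
html = (
    "<html><body>"
    "<h2><a href='/a'>First</a></h2>"
    "<h2><a href='/b'>Second</a></h2>"
    "</body></html>"
)

tree = etree.HTML(html)

# Tweak the XPath interactively until it returns exactly what you need,
# then move the finished expression into a script.
titles = tree.xpath("//h2/a/text()")
```

Here `titles` comes back as `['First', 'Second']`; the workflow is the same as the pyquery one, just with XPath instead of jQuery-style selectors.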
Might be better to use another name?
It also disables CSP. I'm not exactly sure how the extension works; maybe it's toggled on/off on a per-tab basis and defaults to off, which would be quite safe, but if it defaults to on then it can be kind of risky.
I basically built a bookmarklet that lets you define the actions locally in your browser, and then run the scrapes on your own box, essentially allowing unmetered scraping without charging per page.