Scraping is really something that's better done in the back end; these days there are plenty of libraries that let you access websites from Java and run all the JavaScript needed to render the page properly.
Second, if I were going to scrape, I'd rather do it with WebDriver than anything else; injecting some client-side scraping tools and using WebDriver purely as a driver, not a driver/scraper, sounds remarkably better.
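A rough sketch of that driver-plus-injection idea. The Selenium calls, the selector, and the artoo URL below are my assumptions for illustration, not something spelled out in this thread: WebDriver only navigates and evaluates JavaScript, while a client-side tool like artoo does the actual scraping in the page.

```python
# Sketch: use WebDriver as a driver only, and inject a client-side
# scraping helper (e.g. artoo.js) into the loaded page.

ARTOO_SRC = "https://medialab.github.io/artoo/public/dist/artoo.latest.min.js"  # assumed URL

def bootstrap_js(src: str) -> str:
    """Return a JS snippet that appends a <script> tag loading `src`."""
    return (
        "var s = document.createElement('script');"
        f"s.src = {src!r};"
        "document.body.appendChild(s);"
    )

# With a running driver (setup assumed, not shown in the thread):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("http://www.foo.com")
#   driver.execute_script(bootstrap_js(ARTOO_SRC))
#   # hypothetical scrape call once artoo has loaded:
#   data = driver.execute_script("return artoo.scrape('li.item', {title: 'text'});")
```

The point of the split is that the selector logic lives in the browser (where it was prototyped), and WebDriver just automates page loads and hands the results back.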
I see no reason ever not to use a browser to consume HTML content.
If we want to Publish Elsewhere, Syndicate to Own Site (the #IndieWeb dubs this PESOS), if we want to have our own experiences we can talk about, client side is the way to go.
And I just loooove listening to artoo beep over and over ;)
But combining both would be nice: it would make it possible to automate scrapers that were developed quickly, directly in the browser, with artoo.
I don't know if it's possible, but could this run as a Chrome extension, in a background script: loading various pages, executing code on them, and so on, storing the data in the extension's localStorage?
It could also store the scrapers' code for reuse.
It's annoying to have to run scripts multiple times, tweaking them after each run to get exactly what you need. It's a waste of time...
$ ipython
In [1]: from pyquery import PyQuery as pq
In [2]: pq("http://www.foo.com")("<some jquery selectors>")
(inspect the output, repeat until it's right)... or do it with requests + lxml.etree, or whatever you want.
When you have what you need, copy and paste it into a file.
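To illustrate the requests + lxml.etree variant of that loop: the markup below is made up for the example, standing in for what `requests.get(url).text` would return, so the snippet stays self-contained.

```python
from lxml import etree

# In practice this string would come from: requests.get("http://www.foo.com").text
html = (
    "<html><body>"
    "<h2><a href='/a'>First</a></h2>"
    "<h2><a href='/b'>Second</a></h2>"
    "</body></html>"
)

tree = etree.HTML(html)

# Tweak the XPath interactively until it returns exactly what you need,
# then move the finished expression into a script.
titles = tree.xpath("//h2/a/text()")
```

Here `titles` comes back as `['First', 'Second']`; the workflow is the same as the pyquery one, just with XPath instead of jQuery-style selectors.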
Might be better to use another name?
It also disables CSP. I'm not exactly sure how the extension works; maybe it's toggled on/off on a per-tab basis and defaults to off, which would be quite safe, but if it defaults to on then it can be kind of risky.
I basically built a bookmarklet that lets you define the actions locally in your browser, and then run the scrapes on your own box, essentially allowing unmetered scraping without charging per page.