Use Node.js to Extract Data from the Web (opens in new tab)

(storminthecastle.com)

80 pointsjohnrobinsn12y ago34 comments

34 comments

Don't forget streams, the more `node.js` way to parse HTML:

    var http = require('http');
    var tr = require('trumpet')();
    var request = require('request');
    request.get('http://www.echojs.com")
      .pipe(tr.createReadStream("article > span"))
      .pipe(process.stdout);

That's it! See https://github.com/substack/node-trumpet and their tests for more.

substack12y ago

You probably meant:

    var tr = require('trumpet')();
    tr.createReadStream('article > span')
      .pipe(process.stdout);
    
    var request = require('request');
    request.get('http://www.echojs.com').pipe(tr);

Bonus: I just noticed a simple bug in the selector engine from running your intended code that I just fixed in trumpet@1.5.6.

kanzure12y ago

And then there's hyperquest because maybe you want to do more than five simultaneous requests:

https://github.com/substack/hyperquest

ssafejava12y ago

True - you can also disable the globalAgent or change the number of pooled connections. Connection pooling was generally a bad idea (tm) in Node and afaik will be removed in the near future.

zenocon12y ago

I've done a considerable amount of scraping; if you're poking around at nicely designed web pages, node/cheerio will be nice, but if you need to scrape data out of a DOM mess with quirks and iframes w/in iframes and forms buried 6 posts deep (inside iframes with quirks), I'd use PhantomJS + CasperJS. Having a real browser sometimes makes a difference.

MrBlue12y ago

PhantomJS + CasperJS is definitely the way to go when scraping data from complex pages. It's also great for circumventing bot detection. :)

enscr12y ago

I find scrapy (python) to be more robust for large scale scraping. There are cases where you want/need the javascript action and that's when you need a real browser. Otherwise the rendering would just slow things down.

techaddict00912y ago

Does this help in scraping website which provide data via jquery ? I mean does this render the javascript on page ?

klibertp12y ago

Yes. It interprets and executes JS like a real browser would. Which is nice. For Python: http://jeanphix.me/Ghost.py/

nodesocket12y ago

Have you played around with node.io? https://github.com/chriso/node.io

Encapsulates all this functionality in an easy to use interface.

httpteapot12y ago

Last commit 3 months ago. Do you know if this project is still alive?

nacs12y ago

Haven't used node.io but 3 months isn't that old.

Also, if you check the issues page for the project ( https://github.com/chriso/node.io/issues ), the author seems to be responding to any open issues with the latest comment by author being a month ago.

chrisohara12y ago

Author here.

Still active, although development has slowed down.

If you have any questions or issues just submit an issue @ Github and I'll help asap.

MrBlue12y ago

Node.io is pretty much dead.

nostrademons12y ago

There're also Node.js bindings for Gumbo if folks want HTML5 compliance:

https://github.com/karlwestin/node-gumbo-parser

It might be interesting if someone were to implement a Cheerio-like API on top of that, as Cheerio has a nicer API but Gumbo's parser is more spec-compliant.

aroman12y ago

Cheerio is really really awesome. I've used it to build a considerably sophisticated web scraping backend to wrap my school's homework website and re-expose/augment via node/mongo/backbone/websockets.

There are definitely some bugs in cheerio if you're looking to do some really fancy selector queries, but for the most part it's extremely performant and pleasant to use.

If anyone is interested in seeing what a sophisticated, parallalized usage of cheerio looks like, feel free to browse through the app I was mentioning above -- it's open source: https://github.com/aroman/keeba/blob/master/jbha.coffee

victorhooi12y ago

Hmm, interesting.

I'm also looking at doing a web-scraping project with Node.js.

I was going to go with CasperJS (http://casperjs.org/), which seems fairly active and is based on PhantomJS.

Their quickstart guide is actually creating a scraper:

http://docs.casperjs.org/en/latest/quickstart.html

However, I'm wondering how this (Cheerio) compares - anybody have any experiences?

premasagar12y ago

See also http://noodlejs.com for a Node-based web scraper that also handles JSON and other file formats.

It was initially built as a hack project to replace a core subset of YQL. (I helped to guide an intern at my company Dharmafly, Aaron Acerboni, when he built it).

dfrodriguez14312y ago

I like to use the readability API so I don't need to see the HTML of every single site. I did an example here: http://danielfrg.github.io/blog/2013/08/20/relevant-content-...

chatman12y ago

Isn't scrapy easier to use than this?

hackula112y ago

Cheerio is really easy for anyone familiar with jQuery (most node.js devs I would imagine).

level0912y ago

its probably more organized and easier to read than a huge number of nested callbacks

AsymetricCom12y ago

there's a lot of better ways to do this. Most of them involve documented standards so your code doesn't break the moment someone changes something.

mholt12y ago

This is cool... if the content is structured. (Ever tried finding addresses in arbitrary text? Much harder: http://smartystreets.com/products/liveaddress-api/extract)

babby12y ago

Come on, that's not really a scraping problem, it's more of a text parsing problem coupled with an API lookup or scrape to verify the address.

Though, id probably just google for some good address regexes, match against pages, for each address just throw them into something like maps.google.com/?q=[address] then try to scrape whatever normally pops up for a valid result. Also helps if you're expecting addresses to be in a certain country.

greenido12y ago

Similar to what I wrote a week ago: http://greenido.wordpress.com/2013/08/21/yahoo-finance-api-w... :)

tommoor12y ago

I run an API that could help with this type of thing where the page includes microformats (A surprising amount) at http://pagemunch.com

shospes12y ago

We also used cheerio and node.js and built an click & extract interface around it: http://www.site2mobile.com/.

garyjob12y ago

Interesting, I encounter the same set of problems as well last year when working on two side projects. Ended up building a webscraping service with a point and click interface on top of it : https://krake.io

level0912y ago

here is how I like to do it :

  from pyquery import PyQuery as pq
  doc = pq('http://google.com')
  print doc('#hplogo')

tectonic12y ago

Remember to use SelectorGadget (http://selectorgadget.com) to help generate your CSS selectors.

zerni12y ago

nice!

I did a webcrawler with node.js myself last year. It's only a quick try but you can find the worker class here: https://gist.github.com/zerni/6337067

Unfortunately jsdom had a memory leak so the crawler died after a while...

cheeaun12y ago

If you want to fix the memory leak, I remember you need to do `window.close()` after the job is done.

zerni12y ago

thanks mate!

j / k navigate · click thread line to collapse

34 comments

STRML12y ago

Don't forget streams, the more `node.js` way to parse HTML:

    var http = require('http');
    var tr = require('trumpet')();
    var request = require('request');
    request.get('http://www.echojs.com")
      .pipe(tr.createReadStream("article > span"))
      .pipe(process.stdout);

That's it! See https://github.com/substack/node-trumpet and their tests for more.

substack12y ago

You probably meant:

    var tr = require('trumpet')();
    tr.createReadStream('article > span')
      .pipe(process.stdout);
    
    var request = require('request');
    request.get('http://www.echojs.com').pipe(tr);

Bonus: I just noticed a simple bug in the selector engine from running your intended code that I just fixed in trumpet@1.5.6.

kanzure12y ago

And then there's hyperquest because maybe you want to do more than five simultaneous requests:

https://github.com/substack/hyperquest

ssafejava12y ago

True - you can also disable the globalAgent or change the number of pooled connections. Connection pooling was generally a bad idea (tm) in Node and afaik will be removed in the near future.

zenocon12y ago

MrBlue12y ago

PhantomJS + CasperJS is definitely the way to go when scraping data from complex pages. It's also great for circumventing bot detection. :)

enscr12y ago

techaddict00912y ago

Does this help in scraping website which provide data via jquery ? I mean does this render the javascript on page ?

klibertp12y ago

Yes. It interprets and executes JS like a real browser would. Which is nice. For Python: http://jeanphix.me/Ghost.py/

nodesocket12y ago

Have you played around with node.io? https://github.com/chriso/node.io

Encapsulates all this functionality in an easy to use interface.

httpteapot12y ago

Last commit 3 months ago. Do you know if this project is still alive?

nacs12y ago

Haven't used node.io but 3 months isn't that old.

chrisohara12y ago

Author here.

Still active, although development has slowed down.

If you have any questions or issues just submit an issue @ Github and I'll help asap.

MrBlue12y ago

Node.io is pretty much dead.

nostrademons12y ago

There're also Node.js bindings for Gumbo if folks want HTML5 compliance:

https://github.com/karlwestin/node-gumbo-parser

It might be interesting if someone were to implement a Cheerio-like API on top of that, as Cheerio has a nicer API but Gumbo's parser is more spec-compliant.

aroman12y ago

There are definitely some bugs in cheerio if you're looking to do some really fancy selector queries, but for the most part it's extremely performant and pleasant to use.

victorhooi12y ago

Hmm, interesting.

I'm also looking at doing a web-scraping project with Node.js.

I was going to go with CasperJS (http://casperjs.org/), which seems fairly active and is based on PhantomJS.

Their quickstart guide is actually creating a scraper:

http://docs.casperjs.org/en/latest/quickstart.html

However, I'm wondering how this (Cheerio) compares - anybody have any experiences?

premasagar12y ago

See also http://noodlejs.com for a Node-based web scraper that also handles JSON and other file formats.

It was initially built as a hack project to replace a core subset of YQL. (I helped to guide an intern at my company Dharmafly, Aaron Acerboni, when he built it).

dfrodriguez14312y ago

I like to use the readability API so I don't need to see the HTML of every single site. I did an example here: http://danielfrg.github.io/blog/2013/08/20/relevant-content-...

chatman12y ago

Isn't scrapy easier to use than this?

hackula112y ago

Cheerio is really easy for anyone familiar with jQuery (most node.js devs I would imagine).

level0912y ago

its probably more organized and easier to read than a huge number of nested callbacks

AsymetricCom12y ago

there's a lot of better ways to do this. Most of them involve documented standards so your code doesn't break the moment someone changes something.

mholt12y ago

This is cool... if the content is structured. (Ever tried finding addresses in arbitrary text? Much harder: http://smartystreets.com/products/liveaddress-api/extract)

babby12y ago

Come on, that's not really a scraping problem, it's more of a text parsing problem coupled with an API lookup or scrape to verify the address.

greenido12y ago

Similar to what I wrote a week ago: http://greenido.wordpress.com/2013/08/21/yahoo-finance-api-w... :)

tommoor12y ago

I run an API that could help with this type of thing where the page includes microformats (A surprising amount) at http://pagemunch.com

shospes12y ago

We also used cheerio and node.js and built an click & extract interface around it: http://www.site2mobile.com/.

garyjob12y ago

level0912y ago

here is how I like to do it :

  from pyquery import PyQuery as pq
  doc = pq('http://google.com')
  print doc('#hplogo')

tectonic12y ago

Remember to use SelectorGadget (http://selectorgadget.com) to help generate your CSS selectors.

zerni12y ago

nice!

I did a webcrawler with node.js myself last year. It's only a quick try but you can find the worker class here: https://gist.github.com/zerni/6337067

Unfortunately jsdom had a memory leak so the crawler died after a while...

cheeaun12y ago

If you want to fix the memory leak, I remember you need to do `window.close()` after the job is done.

zerni12y ago

thanks mate!

j / k navigate · click thread line to collapse