var tr = require('trumpet')();
var request = require('request');

// the selector stream is read-only, so read matched elements
// out of createReadStream() and pipe the page's HTML into tr
tr.createReadStream('article > span')
    .pipe(process.stdout);

request.get('http://www.echojs.com').pipe(tr);
That's it! See https://github.com/substack/node-trumpet and their tests for more.

var tr = require('trumpet')();
tr.createReadStream('article > span')
.pipe(process.stdout);
var request = require('request');
request.get('http://www.echojs.com').pipe(tr);
Bonus: I just noticed a simple bug in the selector engine from running your intended code, which I just fixed in trumpet@1.5.6.

Encapsulates all this functionality in an easy-to-use interface.
Also, if you check the issues page for the project ( https://github.com/chriso/node.io/issues ), the author seems to be responding to open issues; his latest comment was about a month ago.
Still active, although development has slowed down.
If you have any questions or issues just submit an issue @ Github and I'll help asap.
https://github.com/karlwestin/node-gumbo-parser
It might be interesting if someone were to implement a Cheerio-like API on top of that, as Cheerio has a nicer API but Gumbo's parser is more spec-compliant.
There are definitely some bugs in cheerio if you're looking to do some really fancy selector queries, but for the most part it's extremely performant and pleasant to use.
If anyone is interested in seeing what a sophisticated, parallelized use of cheerio looks like, feel free to browse through the app I mentioned above; it's open source: https://github.com/aroman/keeba/blob/master/jbha.coffee
I'm also looking at doing a web-scraping project with Node.js.
I was going to go with CasperJS (http://casperjs.org/), which seems fairly active and is based on PhantomJS.
Their quickstart guide actually walks you through creating a scraper:
http://docs.casperjs.org/en/latest/quickstart.html
However, I'm wondering how this (Cheerio) compares - anybody have any experiences?
It was initially built as a hack project to replace a core subset of YQL. (I helped guide Aaron Acerboni, an intern at my company Dharmafly, when he built it.)
Though, I'd probably just Google for some good address regexes, match them against pages, and for each address throw it into something like maps.google.com/?q=[address], then try to scrape whatever normally pops up for a valid result. It also helps if you're expecting addresses to be in a certain country.
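That approach might be sketched like this (the regex below only handles simple US-style street addresses and is purely illustrative; real address extraction needs a far richer pattern):

```javascript
// Very naive US-style street address pattern: number, street name, suffix.
// Illustrative only; production address matching is much harder than this.
var addressPattern = /\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}(?:St|Ave|Rd|Blvd|Dr|Ln)\.?\b/g;

function extractAddresses(pageText) {
    return pageText.match(addressPattern) || [];
}

// build a Google Maps query URL for each matched address
function mapsUrl(address) {
    return 'http://maps.google.com/?q=' + encodeURIComponent(address);
}

var sample = 'Visit us at 221 Baker St or 1600 Pennsylvania Ave.';
var found = extractAddresses(sample);
console.log(found.map(mapsUrl));
```

From there you'd fetch each maps URL and scrape the result page to see whether the address resolves.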
from pyquery import PyQuery as pq

doc = pq(url='http://google.com')
print doc('#hplogo')

I did a webcrawler with node.js myself last year. It's only a quick try, but you can find the worker class here: https://gist.github.com/zerni/6337067
Unfortunately, jsdom had a memory leak, so the crawler died after a while...