Once I had a good pattern in place I could easily create subclasses of the data type I was trying to scrape, basically pointing each of the modeled data methods at an XPath specific to that page.
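For what it's worth, that pattern can be sketched in a few lines of plain Ruby. The class and field names here are mine, and I'm using stdlib REXML in place of a real HTML parser, just to show the shape: the base class turns a field-to-XPath mapping into accessor methods, and each page-specific subclass only declares its own XPaths.

```ruby
require "rexml/document"

# Base class: declare a field with its XPath, get a reader method for free.
class BaseScraper
  def initialize(html)
    @doc = REXML::Document.new(html)
  end

  def self.field(name, xpath)
    define_method(name) { REXML::XPath.first(@doc, xpath)&.text }
  end
end

# A page-specific subclass only supplies the XPaths for its page layout.
class ProductPage < BaseScraper
  field :title, "//h1"
  field :price, "//*[@id='price']"
end

page = ProductPage.new('<html><body><h1>Widget</h1><span id="price">9.99</span></body></html>')
puts page.title  # => Widget
puts page.price  # => 9.99
```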
We have a low-frequency discovery process that walks the site to build a representative metadata structure. A high-frequency process then reads that structure to build the list of URLs to fetch and parse on each run.
The behaviour can then be modified, and work divided between processes, by passing command-line arguments that filter the metadata.
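Roughly, the split looks like this (file name, fields, and the section filter are all invented for illustration): the discovery pass writes the metadata out, and the high-frequency pass reads it back, filtered by an optional command-line argument, to get its URL list.

```ruby
require "json"

# Discovery pass (low frequency): write out what we learned about the site.
metadata = [
  { "section" => "news",   "url" => "http://example.com/news" },
  { "section" => "sports", "url" => "http://example.com/sports" },
]
File.write("site_metadata.json", JSON.dump(metadata))

# Fetch pass (high frequency): `ruby fetch.rb sports` would restrict the run
# to one section; with no argument, every URL in the metadata is fetched.
wanted = ARGV.first
urls = JSON.parse(File.read("site_metadata.json"))
           .select { |m| wanted.nil? || m["section"] == wanted }
           .map { |m| m["url"] }
puts urls
```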
Incidentally, at scale I find that the trickiest part is the orchestration: scheduling crawls, making sure resources are used efficiently without overloading the target sites, and properly detecting errors.
Mechanize allows you to write clean, efficient scraper code without all the boilerplate. It's the nicest scraping solution I've yet encountered.
I've spent a lot of time working on web scrapers for two of my projects, http://themescroller.com (dead) and http://www.remoteworknewsletter.com, and I think the holy grail is to build a Rails app around your scraper. You can write your scrapers as libs, and then make them executable as rake tasks, or even cronjobs. And because it's a Rails app you can save all scraped data as actual models and have them persisted in a database. With Rails it's also super easy to build an API around your data, or build a quick backend for it via Rails scaffolds.
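The lib-plus-rake-task shape is roughly this (all names invented; in a real Rails app the task would run with `:environment` and persist via ActiveRecord instead of returning plain hashes):

```ruby
require "rake"
extend Rake::DSL

# Hypothetical scraper lib — in the Rails version this would fetch pages
# and build model attributes.
class ListingScraper
  def self.run
    [{ title: "Example listing", url: "http://example.com/1" }]
  end
end

# Rakefile-style wrapper, so cron can just run `rake scraper:listings`.
namespace :scraper do
  desc "Scrape listings and persist them"
  task :listings do
    rows = ListingScraper.run
    # In the Rails version: rows.each { |r| Listing.create!(r) }
    puts "scraped #{rows.size} listings"
  end
end

Rake::Task["scraper:listings"].invoke
```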
[0] https://github.com/jnicklas/capybara [1] http://www.rubydoc.info/github/jnicklas/capybara/
As for Scrapy itself, it's a big framework, written on top of an even bigger framework which is probably better described as a platform at this point. I've used Scrapy in a couple of projects and I also worked with Twisted before, which made things significantly easier for me, and it was still quite a bit of a hassle to set things up. IIRC configuring a pipeline for saving images to disk with their original names was kind of a nightmare. It does perform extremely well and scales to insane workloads, but I would never use it for a simple scraper for a single site. For those, requests+lxml works extremely well.
Typhoeus has a built-in concurrency mechanism: callbacks plus a configurable number of concurrent HTTP requests. You create a hydra object, then create the first request object with a URL and a callback (you have to check for errors like 404 yourself); in the callback you extract further URLs from the page and push them onto the hydra again, each with its own callback.
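Here's the same queue-and-callback pattern sketched in plain Ruby threads rather than Typhoeus itself, with no network involved — `PAGES` stands in for fetching a URL and extracting its links, and the batch size plays the role of hydra's max concurrency:

```ruby
# Stubbed site graph: URL => links found on that page.
PAGES = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/c"],
  "/b" => [],
  "/c" => [],
}

pending = ["/"]
seen    = { "/" => true }
visited = []

until pending.empty?
  batch = pending.shift(10)                     # like hydra's max_concurrency
  threads = batch.map do |url|
    Thread.new { [url, PAGES.fetch(url, [])] }  # stand-in for the HTTP fetch
  end
  threads.each do |t|
    url, links = t.value                        # the "on_complete" callback body
    visited << url
    links.each do |l|
      next if seen[l]
      seen[l] = true
      pending << l                              # push newly found URLs back on the queue
    end
  end
end

puts visited.inspect  # => ["/", "/a", "/b", "/c"]
```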
Instead of

  open("out.json", "w") { |f| f.puts JSON.dump(showings) }

you can write

  open("out.json", "w") { |f| f << JSON.dump(showings) }

which avoids adding any new lines (not that it really matters in the case of JSON). Or just print to stdout and redirect:

  $ ruby scraper.rb > showings.json