Once I had a good pattern in place I could easily create subclasses of the data type I was trying to scrape, basically pointing each of the modeled data methods at an XPath specific to that page.
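For what it's worth, that pattern can be sketched in a few lines of plain Ruby. The class and field names here are mine, and I'm using stdlib REXML in place of a real HTML parser, just to show the shape: the base class turns a field-to-XPath mapping into accessor methods, and each page-specific subclass only declares its own XPaths.

```ruby
require "rexml/document"

# Base class: declare a field with its XPath, get a reader method for free.
class BaseScraper
  def initialize(html)
    @doc = REXML::Document.new(html)
  end

  def self.field(name, xpath)
    define_method(name) { REXML::XPath.first(@doc, xpath)&.text }
  end
end

# A page-specific subclass only supplies the XPaths for its page layout.
class ProductPage < BaseScraper
  field :title, "//h1"
  field :price, "//*[@id='price']"
end

page = ProductPage.new('<html><body><h1>Widget</h1><span id="price">9.99</span></body></html>')
puts page.title  # => Widget
puts page.price  # => 9.99
```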
We have a low-frequency discovery process that walks the site to build a representative metadata structure. A high-frequency process then reads that structure to build the list of URLs to fetch and parse on each run.
The behaviour can then be modified, and work divided between processes, by passing command-line arguments that filter the metadata.
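Roughly, the split looks like this (file name, fields, and the section filter are all invented for illustration): the discovery pass writes the metadata out, and the high-frequency pass reads it back, filtered by an optional command-line argument, to get its URL list.

```ruby
require "json"

# Discovery pass (low frequency): write out what we learned about the site.
metadata = [
  { "section" => "news",   "url" => "http://example.com/news" },
  { "section" => "sports", "url" => "http://example.com/sports" },
]
File.write("site_metadata.json", JSON.dump(metadata))

# Fetch pass (high frequency): `ruby fetch.rb sports` would restrict the run
# to one section; with no argument, every URL in the metadata is fetched.
wanted = ARGV.first
urls = JSON.parse(File.read("site_metadata.json"))
           .select { |m| wanted.nil? || m["section"] == wanted }
           .map { |m| m["url"] }
puts urls
```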
Incidentally, at scale I find that the trickiest part is the orchestration: scheduling crawls, making sure resources are used efficiently without overloading the target sites, and properly detecting errors.
Mechanize allows you to write clean, efficient scraper code without all the boilerplate. It's the nicest scraping solution I've yet encountered.
I've spent a lot of time working on web scrapers for two of my projects, http://themescroller.com (dead) and http://www.remoteworknewsletter.com, and I think the holy grail is to build a Rails app around your scraper. You can write your scrapers as libs, and then make them executable as rake tasks, or even cronjobs. And because it's a Rails app you can save all scraped data as actual models and have them persisted in a database. With Rails it's also super easy to build an API around your data, or build a quick backend for it via Rails scaffolds.
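The lib-plus-rake-task shape is roughly this (all names invented; in a real Rails app the task would run with `:environment` and persist via ActiveRecord instead of returning plain hashes):

```ruby
require "rake"
extend Rake::DSL

# Hypothetical scraper lib — in the Rails version this would fetch pages
# and build model attributes.
class ListingScraper
  def self.run
    [{ title: "Example listing", url: "http://example.com/1" }]
  end
end

# Rakefile-style wrapper, so cron can just run `rake scraper:listings`.
namespace :scraper do
  desc "Scrape listings and persist them"
  task :listings do
    rows = ListingScraper.run
    # In the Rails version: rows.each { |r| Listing.create!(r) }
    puts "scraped #{rows.size} listings"
  end
end

Rake::Task["scraper:listings"].invoke
```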
[0] https://github.com/jnicklas/capybara [1] http://www.rubydoc.info/github/jnicklas/capybara/
As for Scrapy itself, it's a big framework, written on top of an even bigger framework which is probably better described as a platform at this point. I've used Scrapy in a couple of projects and I also worked with Twisted before, which made things significantly easier for me, and it was still quite a bit of a hassle to set things up. IIRC configuring a pipeline for saving images to disk with their original names was kind of a nightmare. It does perform extremely well and scales to insane workloads, but I would never use it for a simple scraper for a single site. For those, requests+lxml works extremely well.
Typhoeus has a built-in concurrency mechanism: callbacks plus a configurable number of concurrent HTTP requests. You create a hydra object, then create the first request object with a URL and a callback (you have to check for errors like 404 yourself); in the callback you extract further URLs from the page and push them onto the hydra again, each with its own callback.
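Here's the same queue-and-callback pattern sketched in plain Ruby threads rather than Typhoeus itself, with no network involved — `PAGES` stands in for fetching a URL and extracting its links, and the batch size plays the role of hydra's max concurrency:

```ruby
# Stubbed site graph: URL => links found on that page.
PAGES = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/c"],
  "/b" => [],
  "/c" => [],
}

pending = ["/"]
seen    = { "/" => true }
visited = []

until pending.empty?
  batch = pending.shift(10)                     # like hydra's max_concurrency
  threads = batch.map do |url|
    Thread.new { [url, PAGES.fetch(url, [])] }  # stand-in for the HTTP fetch
  end
  threads.each do |t|
    url, links = t.value                        # the "on_complete" callback body
    visited << url
    links.each do |l|
      next if seen[l]
      seen[l] = true
      pending << l                              # push newly found URLs back on the queue
    end
  end
end

puts visited.inspect  # => ["/", "/a", "/b", "/c"]
```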
Instead of

  open("out.json", "w") { |f| f.puts JSON.dump(showings) }

you can write

  open("out.json", "w") { |f| f << JSON.dump(showings) }

which avoids adding any new lines (not that it really matters in the case of JSON). Or just print to stdout and redirect:

  $ ruby scraper.rb > showings.json