You can check it out right here https://github.com/justuswilhelm/kata/blob/master/python/cor...
I'm messing around with the Parallel Universe Java fiber implementation. What I do is spawn fibers to download pages and send the String responses over a channel to another fiber, which maintains a thread pool to parse response bodies as they come in and creates new fibers to crawl the links it finds.
I'm really just doing this to get more familiar with async programming, and specifically the Parallel Universe Java libraries, but one thing I'm struggling with is how best to make it well behaved (e.g. right now there's no bound on the number of outstanding HTTP requests).
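For what it's worth, here's a minimal sketch of that pattern in Python's asyncio rather than Quasar, since the shape is the same: a semaphore bounds the number of in-flight fetches, and a queue plays the role of the channel to the parsing fiber. The URLs and the simulated fetch are placeholders, not a real HTTP client.

```python
import asyncio

MAX_IN_FLIGHT = 10  # bound on outstanding requests

async def fetch(url: str, sem: asyncio.Semaphore, out: asyncio.Queue) -> None:
    """Simulated download; swap the sleep for a real HTTP call."""
    async with sem:                 # blocks once MAX_IN_FLIGHT fetches are in flight
        await asyncio.sleep(0)      # stand-in for the network round trip
        await out.put((url, f"<html>{url}</html>"))

async def parse(out: asyncio.Queue, n: int) -> list:
    """Consumer on the other end of the channel, like the parsing fiber."""
    parsed = []
    for _ in range(n):
        url, _body = await out.get()
        parsed.append(url)
    return parsed

async def crawl(urls: list) -> list:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    out: asyncio.Queue = asyncio.Queue()
    fetchers = [asyncio.create_task(fetch(u, sem, out)) for u in urls]
    parsed = await parse(out, len(urls))
    await asyncio.gather(*fetchers)
    return parsed

pages = asyncio.run(crawl([f"http://example.com/{i}" for i in range(25)]))
```

The semaphore is the whole trick: without it, every URL gets an outstanding request at once.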
While I have access to plenty of great general technology advice, this post is bound to attract people well versed in crawling.
My question is: in terms of crawling speed (and I know this depends on several factors), how many pages per day can a good crawler handle?
The crawler I built is doing about 120K pages per day, which for our initial needs is not bad at all, but I wonder whether in the crawling world this is peanuts or a decent chunk of pages.
- How many servers (if distributed)?
- How many cores/server?
- What kind of processing takes place for each page?
Does it just download and save the pages somewhere (local filesystem, cloud storage, database), or does it extract (semi-)structured data? And so on.
Specifics aside, these days it's not hard to crawl millions of pages/day on commodity servers. Some related posts: http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-bil...
http://blog.semantics3.com/how-we-built-our-almost-distribut...
http://engineering.bloomreach.com/crawling-billions-of-pages...
By fine-grained I mean that fetching, crawling, extraction and whatever other processing you're doing should be separate, discrete steps.
Example naive topology:
Fetcher: Pops next URL off a queue, fetches it, stores the raw data somewhere, emits a "fetched" event.
Link extractor: Subscribes to fetch events, extracts every URL from the data, each of which is emitted as a "link" event.
Crawling scheduler: Listens to link events, schedules "fetch" events for each URL. This is where you might add filtering and prioritization rules, for example.
Now you have three queues and three consumers, which can run in parallel with any number of worker processes dedicated to them. A naive solution could use something like a database for the events, but a dedicated queue such as RabbitMQ would fare better.
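The topology above could be sketched like this, with in-process deques standing in for RabbitMQ queues and a hypothetical three-page link graph standing in for the web:

```python
from collections import deque

# Hypothetical link graph standing in for the live web.
WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}

fetch_q, fetched_q, link_q = deque(["a"]), deque(), deque()
store, seen = {}, {"a"}

def fetcher():
    """Pops a URL, 'fetches' it, stores the raw data, emits a fetched event."""
    url = fetch_q.popleft()
    store[url] = WEB[url]            # the page's link list stands in for raw data
    fetched_q.append(url)

def link_extractor():
    """Consumes fetched events, emits one link event per outgoing URL."""
    url = fetched_q.popleft()
    link_q.extend(store[url])

def scheduler():
    """Consumes link events; filtering and prioritization rules live here."""
    link = link_q.popleft()
    if link not in seen:             # simplest possible filter: dedup
        seen.add(link)
        fetch_q.append(link)

# In production each stage is its own pool of workers; here we just
# drain the queues in one loop until everything settles.
while fetch_q or fetched_q or link_q:
    if fetch_q:
        fetcher()
    if fetched_q:
        link_extractor()
    while link_q:
        scheduler()
```

Because each stage only talks to a queue, you can scale any one of them independently by adding consumers.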
I generally use distributed crawlers, which means I can scale to millions of pages per day (assuming different domains). The biggest limiting factor is the database layer: how many writes can I do in a day.
If I need to go faster, I just spin up another crawler worker, which connects to the queue and starts pulling jobs.
I believe anything under a million pages/day should be doable with a home-built, single-server system.
Unlike Blekko, we are just capturing the source and dumping it into a DB without doing any analysis. As soon as you start trying to parse anything in the crawl data, your hardware requirements go through the roof. Running wget or curl in parallel is enough to crawl millions of pages per day; I often use http://puf.sourceforge.net/ when I need to do a quick crawl.
"puf -nR -Tc 5 -Tl 5 -Td 20 -t 1 -lc 200 -dc 5 -i listofthingstodownload" will easily do 10-20 million pages per day if you are spreading your requests across a lot of hosts.
You should be able to achieve > 120k per day for sure though. That's less than two per second.
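The back-of-the-envelope math, for reference:

```python
pages_per_day = 120_000
seconds_per_day = 24 * 60 * 60            # 86,400
rate = pages_per_day / seconds_per_day    # ~1.39 requests/second

# And what a million pages/day would take:
million_rate = 1_000_000 / seconds_per_day  # ~11.6 requests/second
```

Sustaining ~1.4 requests/second is well within reach of a single machine, which is the point being made.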
1) I check whether or not the page we just scraped has any of the tags we are looking for.
2) We then extract any information within those tags (images, etc.)
3) We follow through every link and, if it's not in the seen/scraped list, we add it to the queue.
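Those three steps can be sketched with Python's stdlib HTMLParser; the wanted tags, sample HTML, and seed URL here are all hypothetical:

```python
from html.parser import HTMLParser

WANTED = {"img"}  # hypothetical set of tags we are looking for

class PageScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hits, self.links = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in WANTED and "src" in attrs:   # steps 1-2: wanted tag, extract its info
            self.hits.append(attrs["src"])
        if tag == "a" and "href" in attrs:     # step 3: collect outgoing links
            self.links.append(attrs["href"])

seen, queue = {"http://example.com/"}, []
scraper = PageScraper()
scraper.feed('<img src="/cat.png"><a href="http://example.com/next">next</a>')
for link in scraper.links:
    if link not in seen:                       # only enqueue unseen URLs
        seen.add(link)
        queue.append(link)
```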
Not sure if this helps to narrow it down.
Thanks!
I'm just now finishing a project for an ISP, building a cache of webpages using my project jBrowserDriver. They can basically turn on as many VMs as they need to scale out horizontally, and the servers all seamlessly load balance themselves and pull work off a central queue. One important part is to handle failures and crashes while isolating their impact from everything else; in this approach, separate OS processes are helpful.
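That central-queue worker pattern looks roughly like this; for brevity the sketch runs threads in one process, whereas a real deployment like the one described would use separate OS processes or VMs so a crash takes down only one worker:

```python
import queue
import threading

def worker(jobs: "queue.Queue", results: "queue.Queue") -> None:
    """Pull work until a sentinel arrives; in production each worker
    is its own OS process, so failures stay isolated."""
    while True:
        url = jobs.get()
        if url is None:                  # sentinel: shut this worker down
            break
        results.put((url, "fetched"))    # stand-in for the real fetch

jobs, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(jobs, results)) for _ in range(3)]
for w in workers:
    w.start()

urls = [f"http://example.com/{i}" for i in range(9)]
for u in urls:
    jobs.put(u)
for _ in workers:                        # one sentinel per worker
    jobs.put(None)

done = [results.get() for _ in urls]
for w in workers:
    w.join()
```

Scaling out is then just "start another worker pointed at the queue", which is exactly the property described above.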
At blekko, we did ~ 100k pages/day/server with our production crawler, running on a cluster which was also doing anti-web-spam, inverting outgoing links into incoming links, indexing everything, and analytics batch jobs supporting development.
So unless you're doing a LOT of work on every webpage, you're kinda slow.
The easiest mistake to make is to not be async enough. This Python example is great.
I've written more web crawlers than I can count, in PHP, Python, Scala, Golang, Node.js, and Perl. Right now, assuming you just want to gather some form of JSON/HTML from the response, I would use Golang and gokogiri with XPaths (and of course json unmarshal for JSON). It will make you laugh at 120k per day. Feel free to ping me if you would like to discuss making me one of those freelancers.
https://dl.dropboxusercontent.com/u/44889964/Descartes%20%20...
This translates to about 700MM/month. The bump you see this month is just us adding more crawling nodes to our cluster.
Writing an efficient and _well behaved_ web crawler is, imho, quite a complicated undertaking. Others here have already pointed out that it's more or less a scalability problem, hence a single number doesn't make sense. reinhardt has provided a list of links which, from a quick glance, look very interesting and might take you further.