Building blocks of a scalable webcrawler. (opens in new tab)

(blog.marc-seeger.de)

80 points0x4415y ago29 comments

29 comments

IIRC, sriramk from around here (http://news.ycombinator.com/user?id=sriramk) had also 'rolled his own' web-crawler as a project in college about 5-6 (?) years back. He blogged about it fairly actively back then, and I really enjoyed following his journey (esp. when after months of dev and testing, he finally 'slipped it into the wild'). Tried to dredge up those posts, but he seems to have taken them down :( A shame really - they were quite a fascinating look at the early-stage evolution of a programmer!

Sriram, you around? ;)

sriramk15y ago

Thanks but mine was definitely a toy. I think I got it to around 100K pages or so but that's about it (seemed like a big deal back then).

You can see some of those posts here (http://web.archive.org/web/20041206230457/www.dotnetjunkies....). Quite embarrassing to see the quality of my output from back then

Basically, I did the following

- Pull down dmoz.org's datasets (not sure whether I crawled it or whether they had a dump - I think the latter) - Spin up crawlers (implemented in C# at the time) on various machines, writing to a central repo. The actual design of the crawler was based on Mercator (check out the paper on citeseer) - Use Lucene to construct TF.IDF indices on top of the repository - Throw up a nice UI (with the search engine name spelled out in a Google-like font). The funny part is that this probably impressed the people evaluating the project more than anything else.

I did do some cool hacks around showing a better snippet than Google did at the time but I just didn't have the networking bandwidth to do anything serious. Fun for a college project.

The funny thing is a startup which is involved in search contacted me a few weeks ago precisely because of this project. I had to tell that person how much of a toy it was :)

rb2k_15y ago

Do you remember how fast the "toy" was? (pages/second, domains/s, ...) :)

1 more reply

rb2k_15y ago

Uh, look what the cat dragged in: my thesis :)

Hope some of you enjoy the read, I'm open for comments and criticism

cd3415y ago

I'm guessing English isn't your primary language. There are numerous typos and grammatical errors throughout the document. Spellcheck would catch about 90% of them as they aren't project names. The grammar errors might be more difficult since you are getting the right wordstem and you're getting words that are close.

It also seems like 2/3 of the way through, you started templating your answers and reviews and didn't do as thorough an analysis of the competitive solutions.

The thesis does demonstrate that you know and understand the technology but I don't get the sense you have an in-depth understanding of what was done. While the results suggest the project was successful, it seems more like you were an observer, validating decisions. It also seems you don't agree the decisions made were the correct ones based on some of the underlying tones.

Still, there is a lot of great information in there, presented very well. You might consider submitting it to highscalability.com.

Would you implement the current structure the same way after writing your thesis?

rb2k_15y ago

Thank you for the feedback!

Typos: Yeah, I'm German. Could you just point out some of the errors (2-3), that would help me to look for them harder next time :)

Lack of detail towards the end: The thesis was written after most of the project was done and I wanted to give people new to the field an introduction to the tools I used and the problems I encountered. All of this was an actual internship project and the ability to use it as my thesis was just a nice "addon".

That's probably why you (rightfully) noticed that some of the competitive solutions (e.g. graph databases) might have not gotten the level of detail and research they deserved. It was a balance between delivering a working product and putting the thesis on a theoretically sound basis while moving to another country :)

In general, I'd re-implement it more or less the same way. I would probably do one or more of the following things:

- take a look at how Riak search turned out

- switch from MySQL to Postgres

- Think about another way of determining popularity than incoming links (can get problematic when trying to recrawl sites... you'd have to keep track of all of the domains that link to a certain site. Maybe graph databases would be a good solution for this problem)

- start with coding EVERYTHING in an asynchronous manner. Maybe use em-synchrony (https://github.com/igrigorik/em-synchrony)

- write more tests (the more the better)

1 more reply

arkitaip15y ago

Very timely and interesting. I am currently looking for a crawler that tightly integrated with Drupal and that can be easily managed through Drupal nodes. Any suggestions on a solution for a small site that only needs to handle thousands of pages/urls?

rb2k_15y ago

I don't really know what the "managed through Drupal nodes" means in this context. For a simple drupal fulltext search I can recommend apache solr ( http://drupal.org/project/apachesolr ).

For regular crawling:

I found anemone ( http://anemone.rubyforge.org/ ) to be a lovely framework for single page crawls.

Other interesting candidates:

https://github.com/hasmanydevelopers/RDaneel

http://www.redaelli.org/matteo-blog/projects/ebot/

http://nutch.apache.org/ (meh, java)

toumhi15y ago

scrapy (http://scrapy.org/) is a well-documented and open source python scraping framework that I've used in a couple of projects.

1 more reply

nowarninglabel15y ago

Excellent to have this up. I'm glad that you made it available. I'm doing web crawling w/ Drupal, so always interesting to see how others are doing it.

rubyrescue15y ago

great paper. You mentioned that you would have considered Riak if it had search. Now that it does, if you did this again would you use seriously consider using it instead?

rb2k_15y ago

The crawler currently runs on a single large EC2 instance. I could see myself trying to use a bunch of EC2 micro instances instead and then use Riak + Riak Search.

I actually tried putting a dump of the data into Riak and it seemed to hold up pretty well on my macbook.

Another problem was the fact that Riak didn't allow me to do server-side increments on the "incoming links" counter which mysql, mongodb or redis allowed. However, I think that this is something that could be solved using Redis as a caching layer.

I have to admit that I would love to use Riak for something just because it seems to be a really slick piece of software, so it's hard to stay objective :)

yesno15y ago

I like Ted Dziuba solution:

http://teddziuba.com/2010/10/taco-bell-programming.html

Full-stack programmer at work!

rb2k_15y ago

I loved his approach too, but if you want to end up with something that you can search freely for properties it gets a little tiresome with just bash and *nix tools :)

inovica15y ago

A good read and very timely from my perspective. We created a crawler in Python a couple of years ago for RSS feeds, but we ran into a number of issues with it, so put it on hold as we concentrated on work that made money :) We started to look at the project last week and we've been looking at rolling our own versus looking at frameworks like Scrapy. The main thing for us is being able to scale. Anyone who has knowledge of creating a distributed crawler in Python I'd welcome some advice.

Thanks again. Really good post

rb2k_15y ago

After having written the thesis and thought about that stuff for another few weeks, my résumé would be:

- Use asynchronous I/O to maximize single-node speed (twisted should be a good choice for python). It might be strange in the beginning, but it usually makes up for it, especially with languages that aren't good at threading (ruby, python, ...).

- Redis is awesome! Fast, functional, beautiful :)

- Riak seems to be a great distributed datastore if you really have to scale over multiple nodes.

- Solr or Sphinx are just better optimized than most datastores when it comes to fulltext-search

- Take a day to look at graph databases (I'm still not 100% sure if I could have used one for my use cases)

inovica15y ago

Thanks for the tips! I really appreciate it. I'll check these out. All getting very exciting for my Christmas project!

boyter15y ago

If you are just doing RSS feeds I would say go it yourself. Armed with Feedparser (http://feedparser.org/) you can implement what you want pretty quickly.

For both http://www.searchforphp.com/ and http://www.searchforpython.com/ I wrote my own RSS reader. To make it scale out I just used Pythons multiprocessing to parse it out to 50 or so concurrent downloads. I can tear through thousands or feeds pretty quickly that way. The next step to multiple machines is just throw in a queue system and get a list of feeds from it.

Pretty simple stuff really.

richcollins15y ago

I'm having good luck using node.js's httpClient and vertex.js for crawl state / persistence.

rb2k_15y ago

Oh, node.js is definitely a great direction to go!

One of my problems was that a lot of the "usual" libraries are written in a synchronous/blocking manner behind the scenes. This is something that the node.js ecosystem would probably solve right from the start.

The downside of a relatively new library like httpClient is, that it is missing things like automatically following redirects. While this can be implemented in the crawler code, it complicates things.

How big are the datasets that vertex.js/tokyo cabinet is able to handle for you?

Node.js is on the list of things I'd like to play with a bit more (just like Scala, Erlang, graph databases, mirah, ...). Is your crawler's source code available by any chance?

richcollins15y ago

My dataset is still small, but you can scale a single TC db to nearly arbitrary size (8EB). It can also write millions of kv pairs / second.

Vertex.js can't quite keep up with TC as its written in javascript. However, it does let you batch writes into logical transactions, which you can use to get fairly high throughput.

The source isn't open as its fairly specific to my app, http://luciebot.com/. I'd be happy to chat about the details without releasing the source. richcollins@gmail.com / richcollins on freenode.

1 more reply

nl15y ago

Can someone please explain what FPGA-aware garbage collection is?

rb2k_15y ago

Straight from the man himself: http://buytaert.net/files/fpl05-paper.pdf

Abstract:

During codesign of a system, one still runs into the impedance mismatch between the software and hardware worlds. This paper identies the different levels of abstraction of hardware and software as a major culprit of this mismatch. For example, when programming in high-level object-oriented languages like Java, one has disposal of objects, methods, memory management, that facilitates development but these have to be largely . . . abandoned when moving the same functionality into hardware. As a solution, this paper presents a virtual machine, based on the Jikes Research Virtual Machine, that is able to bridge the gap by providing the same capabilities to hardware components as to software components. This seamless integration is achieved by introducing an architecture and protocol that allow recongurable hardware and software to communicate with each other in a transparent manner i.e. no component of the design needs to be aware whether other components are implemented in hardware or in software. Further, in this paper we present a novel technique that allows recongurable hardware to manage dynamically allocated memory. This is achieved by allowing the hardware to hold references to objects and by modifying the garbage collector of the virtual machine to be aware of these references in hardware. We present benchmark results that show, for four different, well- known garbage collectors and for a wide range of applications, that a hardware-aware garbage collector results in a marginal overhead and is therefore a worthwhile addition to the developer's toolbox.

j / k navigate · click thread line to collapse

29 comments

shrikant15y ago

Sriram, you around? ;)

sriramk15y ago

Thanks but mine was definitely a toy. I think I got it to around 100K pages or so but that's about it (seemed like a big deal back then).

You can see some of those posts here (http://web.archive.org/web/20041206230457/www.dotnetjunkies....). Quite embarrassing to see the quality of my output from back then

Basically, I did the following

I did do some cool hacks around showing a better snippet than Google did at the time but I just didn't have the networking bandwidth to do anything serious. Fun for a college project.

The funny thing is a startup which is involved in search contacted me a few weeks ago precisely because of this project. I had to tell that person how much of a toy it was :)

rb2k_15y ago

Do you remember how fast the "toy" was? (pages/second, domains/s, ...) :)

1 more reply

rb2k_15y ago

Uh, look what the cat dragged in: my thesis :)

Hope some of you enjoy the read, I'm open for comments and criticism

cd3415y ago

It also seems like 2/3 of the way through, you started templating your answers and reviews and didn't do as thorough an analysis of the competitive solutions.

Still, there is a lot of great information in there, presented very well. You might consider submitting it to highscalability.com.

Would you implement the current structure the same way after writing your thesis?

rb2k_15y ago

Thank you for the feedback!

Typos: Yeah, I'm German. Could you just point out some of the errors (2-3), that would help me to look for them harder next time :)

In general, I'd re-implement it more or less the same way. I would probably do one or more of the following things:

- take a look at how Riak search turned out

- switch from MySQL to Postgres

- start with coding EVERYTHING in an asynchronous manner. Maybe use em-synchrony (https://github.com/igrigorik/em-synchrony)

- write more tests (the more the better)

1 more reply

arkitaip15y ago

rb2k_15y ago

I don't really know what the "managed through Drupal nodes" means in this context. For a simple drupal fulltext search I can recommend apache solr ( http://drupal.org/project/apachesolr ).

For regular crawling:

I found anemone ( http://anemone.rubyforge.org/ ) to be a lovely framework for single page crawls.

Other interesting candidates:

https://github.com/hasmanydevelopers/RDaneel

http://www.redaelli.org/matteo-blog/projects/ebot/

http://nutch.apache.org/ (meh, java)

toumhi15y ago

scrapy (http://scrapy.org/) is a well-documented and open source python scraping framework that I've used in a couple of projects.

1 more reply

nowarninglabel15y ago

Excellent to have this up. I'm glad that you made it available. I'm doing web crawling w/ Drupal, so always interesting to see how others are doing it.

rubyrescue15y ago

great paper. You mentioned that you would have considered Riak if it had search. Now that it does, if you did this again would you use seriously consider using it instead?

rb2k_15y ago

The crawler currently runs on a single large EC2 instance. I could see myself trying to use a bunch of EC2 micro instances instead and then use Riak + Riak Search.

I actually tried putting a dump of the data into Riak and it seemed to hold up pretty well on my macbook.

I have to admit that I would love to use Riak for something just because it seems to be a really slick piece of software, so it's hard to stay objective :)

yesno15y ago

I like Ted Dziuba solution:

http://teddziuba.com/2010/10/taco-bell-programming.html

Full-stack programmer at work!

rb2k_15y ago

I loved his approach too, but if you want to end up with something that you can search freely for properties it gets a little tiresome with just bash and *nix tools :)

inovica15y ago

Thanks again. Really good post

rb2k_15y ago

After having written the thesis and thought about that stuff for another few weeks, my résumé would be:

- Redis is awesome! Fast, functional, beautiful :)

- Riak seems to be a great distributed datastore if you really have to scale over multiple nodes.

- Solr or Sphinx are just better optimized than most datastores when it comes to fulltext-search

- Take a day to look at graph databases (I'm still not 100% sure if I could have used one for my use cases)

inovica15y ago

Thanks for the tips! I really appreciate it. I'll check these out. All getting very exciting for my Christmas project!

boyter15y ago

If you are just doing RSS feeds I would say go it yourself. Armed with Feedparser (http://feedparser.org/) you can implement what you want pretty quickly.

Pretty simple stuff really.

richcollins15y ago

I'm having good luck using node.js's httpClient and vertex.js for crawl state / persistence.

rb2k_15y ago

Oh, node.js is definitely a great direction to go!

How big are the datasets that vertex.js/tokyo cabinet is able to handle for you?

Node.js is on the list of things I'd like to play with a bit more (just like Scala, Erlang, graph databases, mirah, ...). Is your crawler's source code available by any chance?

richcollins15y ago

My dataset is still small, but you can scale a single TC db to nearly arbitrary size (8EB). It can also write millions of kv pairs / second.

Vertex.js can't quite keep up with TC as its written in javascript. However, it does let you batch writes into logical transactions, which you can use to get fairly high throughput.

The source isn't open as its fairly specific to my app, http://luciebot.com/. I'd be happy to chat about the details without releasing the source. richcollins@gmail.com / richcollins on freenode.

1 more reply

nl15y ago

Can someone please explain what FPGA-aware garbage collection is?

rb2k_15y ago

Straight from the man himself: http://buytaert.net/files/fpl05-paper.pdf

Abstract:

j / k navigate · click thread line to collapse