Designing Web Crawlers, Spiders and Bots


The dominance of Google, the search engine giant, has always led me to explore the technologies that power it. That curiosity pulled me into a whole new world of web crawlers, indexers, full-text search, MapReduce-based file systems like Hadoop, and plenty of other interesting technology.

The first thing I wanted to start with was writing a web crawler, just for fun.

I wrote my first web crawler in my first year of college, in Visual Basic. It was a crude little thing: a web browser control embedded in a form and a simple HTML parser to pull out text. But I remember how excited I was to write it.

In the years since, my understanding of web crawlers has matured a lot, and I have been able to build complex crawlers from scratch or adapt state-of-the-art crawlers like Nutch and Bixo to my needs.

Most of the crawlers and scrapers I write are in C# or Java, built around solid design patterns, and robust enough to crawl millions of web pages within a week. They can run across clusters of machines in master-slave mode, fetching pages and storing data in anything from simple databases to complex MapReduce file systems.
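Whatever the scale, the heart of every crawler I describe above is the same loop: a frontier queue of URLs to visit, a visited set to avoid re-fetching, and a link extractor feeding new URLs back into the frontier. Here is a minimal Java sketch of that loop; the class and method names (`SimpleCrawler`, `extractLinks`, `crawl`) are my own for illustration, and the fetcher is passed in as a function so the structure is clear without any networking code.

```java
import java.util.*;
import java.util.function.Function;
import java.util.regex.*;

public class SimpleCrawler {

    // Naive href extractor for illustration; a production crawler
    // would use a real HTML parser instead of a regex.
    private static final Pattern HREF =
        Pattern.compile("href=[\"'](http[^\"']+)[\"']");

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Breadth-first crawl: frontier queue plus visited set.
    // 'fetch' maps a URL to its HTML (or null on failure), so it can
    // be backed by HttpURLConnection in practice or a map in a test.
    public static Set<String> crawl(String seed,
                                    Function<String, String> fetch,
                                    int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;   // already seen, skip
            String html = fetch.apply(url);
            if (html == null) continue;        // fetch failed
            for (String link : extractLinks(html)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

Passing the fetcher in as a `Function` is also what makes the distributed versions possible: the master hands URL partitions to slaves, and each slave plugs in its own fetching and storage behind the same loop.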

Over a series of articles I will explain how to write web crawlers, from basic to very complex, using effective design patterns.

Check out the articles as a series. They will include my work with common crawling engines like Bixo and search engines like Nutch.

1. What’s a web crawler or web scraper?

2. How do you build your own search engine?

3. Is it possible to take on Google? Yes, but how? Vertical search…? Localized search? Nah…

4. What is indexing, and how does an indexer work? Rambling on the Apache Lucene project.

5. What’s a bot? How does it automate the things we do daily? Will spam really rule the internet one day?

…and many more.