Techno m’lounge – Where technology meets human senses.

Technology, if not handled with care, becomes disruptive. I am a live example…

Designing Web Crawlers, Spiders and Bots


The sheer stature of the search engine giant Google has always pushed me to explore the technologies it uses to power itself. That has led me into the whole new world of web crawlers, indexers, full-text search, MapReduce-based systems like Hadoop, and lots more techy stuff.

The first thing I wanted to start with was writing a web crawler, just for fun.

I wrote my first web crawler in Visual Basic during my first year of college. It was a crude little thing: a web browser control embedded in a VB form, plus a simple HTML parser to extract the text. But I still remember how excited I was to write it.

Over the years my understanding of web crawlers has matured a lot, and I have been able to spin up complex crawlers from scratch or modify state-of-the-art crawlers like Nutch and Bixo for my needs.

Most of the crawlers and scrapers I write are in C# or Java and use advanced design patterns. They are robust enough to crawl millions of web pages within a week, and they can run across several clusters of machines in master-slave mode, fetching data and storing it in anything from simple databases to complex MapReduce-backed file systems.

Over a series of articles I will explain how to write web crawlers, from basic to very complex, using effective design patterns.

Check out the articles as a series. They will include my work with crawling engines like Bixo and search engines like Nutch.

1. What's a web crawler or web scraper?

2. How to make your own search engine?

3. Is it possible to take on Google? Yes, but how? Vertical search…? Localized search? Nah…

4. What is indexing and how does an indexer work? Ramblings on the Apache Lucene project.

5. What's a bot? How does it automate the stuff we do daily? Will spam really rule the internet one day?

and many more…
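To give a taste of where the series starts, here is a minimal sketch of the loop at the heart of even the simplest crawler: fetch a page, pull out its links, and queue the ones not yet seen. Everything in it is made up for illustration: the class name `TinyCrawler`, the regex-based link extraction (a real crawler would use a proper HTML parser), and the `fetch` function, which stands in for an HTTP client so the example runs without touching the network.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyCrawler {
    // Naive matcher for quoted href values in anchor tags (illustration only).
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the href targets of the anchor tags found in the page. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /**
     * Breadth-first crawl starting at seed, attempting at most maxPages URLs.
     * fetch is a hypothetical stand-in for an HTTP client; it returns the
     * page body, or null when the page cannot be fetched.
     */
    public static Set<String> crawl(String seed, Function<String, String> fetch, int maxPages) {
        Set<String> visited = new LinkedHashSet<>(); // keeps visit order
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be fetched
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already attempted this URL
            }
            String html = fetch.apply(url);
            if (html == null) {
                continue; // nothing to parse
            }
            for (String link : extractLinks(html)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A three-page fake "web" keyed by URL.
        Map<String, String> fakeWeb = Map.of(
                "a", "<a href=\"b\">B</a> <a href='c'>C</a>",
                "b", "<a href=\"a\">back to A</a>",
                "c", "no links here");
        System.out.println(crawl("a", fakeWeb::get, 10)); // prints [a, b, c]
    }
}
```

The visited set is what keeps the crawler from looping forever on pages that link back to each other. Politeness delays, robots.txt handling and URL normalization are deliberately left out of this sketch.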


Foray into Cloud Computing – The Amazon Services EC2


Last time I did a project for a client using Amazon S3: a data backup solution that stored the data from client machines on Amazon S3 and then synced it with the local copies. It was real fun foraying into the world of cloud computing and playing around with services over the web.

This time I again got a chance to play with Amazon's services, and with more of them. The task was to set up a live, on-demand video streaming server using Wowza Media Server, Amazon Elastic Compute Cloud (EC2), Amazon DevPay and video streams from various RSS feeds.

The fun started with learning to use something called EC2 UI for Firefox. It's a cool Firefox plugin that lets you manage EC2 instances from your browser.

For people who hate PuTTY and the black console, this is a life saver. Me being a geek, however, it was just a matter of getting my hands wet.

To start, you need to create a key pair and download its private key, which comes as a file with the extension ".pem". Amazon EC2 does not use passwords for authentication; it uses public-key authentication instead. PuTTY cannot read the .pem file directly, so the private key first has to be converted to PuTTY's own format (.ppk). During login, PuTTY uses that private key to sign the server's challenge. Amazon keeps only the public half of the key pair, which is placed on the instance and used to verify the signature.

Check PuTTY's manual on how to use PuTTYgen to convert the .pem file into a .ppk file.
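For anyone curious how key-based login works in principle, here is a rough sketch of a challenge-response exchange using only the standard `java.security` classes. It is not the actual SSH protocol that PuTTY and EC2 speak, just the core idea: the client proves it holds the private key by signing a challenge, and the server checks that signature with nothing but the public key. The class name `KeyAuthSketch`, the method names and the choice of RSA with SHA-256 are my own assumptions for the demo.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.SecureRandom;
import java.security.Signature;

public class KeyAuthSketch {
    /** Generates a fresh 2048-bit RSA key pair (the demo's stand-in for an EC2 key pair). */
    public static KeyPair newKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Client side: sign the server's challenge with the private key. */
    public static byte[] sign(PrivateKey privateKey, byte[] challenge) {
        try {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(privateKey);
            signer.update(challenge);
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Server side: verify the signature using only the public key. */
    public static boolean verify(PublicKey publicKey, byte[] challenge, byte[] signature) {
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(publicKey);
            verifier.update(challenge);
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair pair = newKeyPair();

        // The server picks a random challenge so old signatures cannot be replayed.
        byte[] challenge = new byte[32];
        new SecureRandom().nextBytes(challenge);

        byte[] signature = sign(pair.getPrivate(), challenge);
        System.out.println("login accepted: " + verify(pair.getPublic(), challenge, signature));
    }
}
```

The point to take away is that the private key never leaves the client machine, which is why losing the .pem file means losing access to the instance.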

This project gave me unique insights into the wonderfulness of EC2 and the power that cloud computing possesses.

Will surely write more on this in the future! Keep watching…
