<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Techno m'lounge - Where technology meets human senses. &#187; Web Crawlers</title>
	<atom:link href="http://sumitghosh.co.in/tag/web-crawlers/feed/" rel="self" type="application/rss+xml" />
	<link>http://sumitghosh.co.in</link>
	<description>Technology, if not handled with care, becomes disruptive. I am a live example...</description>
	<lastBuildDate>Sun, 13 Jun 2010 00:56:14 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Designing Web Crawlers, Spiders and Bots</title>
		<link>http://sumitghosh.co.in/designing-web-crawlers-spiders-and-bots/</link>
		<comments>http://sumitghosh.co.in/designing-web-crawlers-spiders-and-bots/#comments</comments>
		<pubDate>Sun, 21 Jun 2009 17:36:02 +0000</pubDate>
		<dc:creator>Sumit Ghosh</dc:creator>
				<category><![CDATA[Crawler Bots and Search Engines]]></category>
		<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Bots]]></category>
		<category><![CDATA[Search Engines]]></category>
		<category><![CDATA[Web Crawlers]]></category>

		<guid isPermaLink="false">http://sumitghosh.co.in/?p=213</guid>
		<description><![CDATA[Google's standing as the search engine giant has always pushed me to explore the technologies that power it. That curiosity led me into the world of web crawlers, indexers, full-text search, MapReduce and distributed file systems as found in Hadoop, and a lot more techy stuff.
The first thing [...]]]></description>
			<content:encoded><![CDATA[<p>
Google's standing as the search engine giant has always pushed me to explore the technologies that power it. That curiosity led me into the world of web crawlers, indexers, full-text search, MapReduce and distributed file systems as found in Hadoop, and a lot more techy stuff.</p>
<p>The first thing I wanted to do was write a web crawler, just for the fun of it.</p>
<p>I wrote my first web crawler in Visual Basic during my first year of college. It was a crude little thing: a simple web browser control embedded in a VB form, plus a simple HTML parser to extract text. But I still remember how excited I was to write it.</p>
<p>Over the years, my understanding of web crawlers has matured a lot, and I have been able to build complex crawlers from scratch or modify state-of-the-art crawlers like Nutch and Bixo for my needs.</p>
<p>Most of the crawlers or scrapers I write are in C# or Java and use advanced design patterns. They are robust enough to crawl millions of web pages within a week, and they can run across several clusters of computers in master-slave mode, fetching data and storing it in anything from simple databases to complex MapReduce-backed file systems.</p>
<p>Over a series of articles, I will explain how to write web crawlers, from basic to very complex, using effective design patterns.</p>
<p>Follow the articles as a series. They will include my work with common crawling engines like Bixo and search engines like Nutch.</p>
<p>1. What&#8217;s a web crawler or web scraper?</p>
<p>2. How do you make your own search engine?</p>
<p>3. Is it possible to take on Google? Yes, but how? Vertical search&#8230;? Localized search? Nah&#8230;</p>
<p>4. What is indexing, and how does an indexer work? Ramblings on the Apache Lucene project.</p>
<p>5. What&#8217;s a bot? How does it automate the stuff we do daily? Will spam really rule the internet one day?</p>
<p>and many more&#8230;<br />

</p>
]]></content:encoded>
			<wfw:commentRss>http://sumitghosh.co.in/designing-web-crawlers-spiders-and-bots/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
