Wednesday, February 4, 2009

How Internet Search Engines Work

About Internet Search Engines

The internet contains a vast collection of information, spread across remote web servers in every part of the world. The difficulty of locating the right information on the internet led to the creation of search technology: the internet search engine. A search engine provides links to relevant information based on your query. Popular internet search engines include Google, Yahoo, MSN, Lycos and Ask Jeeves. To understand the terminology and techniques for positioning your website pages for higher ranking in search engines, a basic knowledge of how a search engine functions is essential.

Functions of Internet Search Engines

A search engine is computer software that is continually updated to take advantage of the latest technologies in order to provide improved search results. Every search engine performs the same functions of collecting, organizing, indexing and serving results, but each does so in its own way, employing algorithms and techniques that are closely guarded trade secrets. In short, the functions of a search engine can be categorized as follows:
Crawling the internet for web content.
Indexing the web content.
Storing the website contents.
Search algorithms and results.

Crawling and Spidering the Web

Crawling is the process of following links on the web from one website to another and gathering the contents of those websites for storage in the search engine's databases. A crawl can start afresh (from a popular, link-rich website such as Yahoo) or from an existing, older index of websites. The crawler (also known as a web robot or web spider) is a program that downloads web content (web pages, images, documents and other files) and then follows the hyperlinks within that content to download the linked pages. The linked content can be on the same site or on a different website.

The crawl continues until it reaches a logical stop, such as a dead end with no further links or the set maximum number of levels within a website's link structure. If a website is not linked from any other site on the internet, the crawler will be unable to locate it. Therefore, if a website is new and has no inbound links from other sites, it has to be submitted to each of the search engines for crawling.

An efficient crawler fetches multiple websites at the same time, so as to collect billions of pages as frequently as it can. News and media sites are crawled more often (every hour or so) by advanced search engines like Google, in order to deliver updated news and content in their search results. A crawler also avoids flooding a single website with a high volume of simultaneous requests; it spreads its visits over a period of time so that the website does not crash under the load. Usually search engines crawl only a few (three or four) levels deep from the homepage of a website. The term deep crawl denotes a crawler or spider that can index pages many levels deep; Google is an example of a deep crawler.
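The crawl-and-follow behavior described above can be sketched as a breadth-first traversal with a depth limit. This is only an illustration of the idea, not any search engine's actual code: the fetch function is injected so the sketch can run without network access, and the URLs and page contents are made up for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed_url, fetch, max_depth=3):
    """Breadth-first crawl: follow links up to max_depth levels from the seed.

    `fetch` is any callable returning the HTML for a URL (or None on failure).
    A real crawler would also obey robots.txt and rate-limit requests per host.
    """
    seen = {seed_url}
    queue = deque([(seed_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html           # hand the content over for indexing
        if depth >= max_depth:      # logical stop: depth limit reached
            continue
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:    # avoid re-crawling the same page
                seen.add(link)
                queue.append((link, depth + 1))
    return pages

# Usage with a fake three-page site (dict lookup stands in for HTTP fetch):
site = {
    "http://example.com/": '<a href="/a">a</a>',
    "http://example.com/a": '<a href="/b">b</a>',
    "http://example.com/b": '<a href="/c">c</a>',
}
print(sorted(crawl("http://example.com/", site.get, max_depth=1)))
```

With max_depth=1 the crawl stores the seed page and the pages one link away, but goes no deeper — the same kind of level limit described above.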

Crawlers or web robots follow the guidelines specified for them by the website owner using the robots exclusion protocol (robots.txt). The robots.txt file lists the files or folders that the owner does not want the crawler to index in its database. Many search engine crawlers also dislike unfriendly URLs, such as those generated by database-driven websites. These URLs carry parameters after a question mark (such as http://somedomain.com/article.php?cat=1&id=3). Search engines are wary of such URLs because the website can overwhelm the crawler by varying the parameters to generate thousands of new pages with similar content for indexing. Thus, crawlers often disregard changes in the parameters rather than treat each variation as a new URL to spider.
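As an example, a robots.txt that keeps crawlers out of a /private/ folder can be checked with Python's standard urllib.robotparser. The domain and paths here are illustrative, not from the article:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: all crawlers are asked to skip /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler asks before fetching each URL:
print(parser.can_fetch("*", "http://somedomain.com/article.php"))     # allowed
print(parser.can_fetch("*", "http://somedomain.com/private/x.html"))  # blocked
```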

Search engine friendly URLs, which present the parameters as ordinary path segments, are used to work around this problem.
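On the server this is typically done with a rewrite rule (for example Apache's mod_rewrite) that maps a clean path back to the dynamic script. The idea can be sketched in Python; the /article/... path pattern is an illustrative assumption, reusing the article.php?cat=1&id=3 URL from above:

```python
import re

# Hypothetical mapping: the friendly URL /article/1/3 stands in for
# the dynamic URL /article.php?cat=1&id=3.
FRIENDLY = re.compile(r"^/article/(?P<cat>\d+)/(?P<id>\d+)$")

def resolve(path):
    """Translate a search-engine-friendly path to the underlying dynamic URL."""
    m = FRIENDLY.match(path)
    if m is None:
        return path  # not a friendly URL; serve as-is
    return "/article.php?cat={cat}&id={id}".format(**m.groupdict())

print(resolve("/article/1/3"))  # -> /article.php?cat=1&id=3
```

The crawler only ever sees the clean path, so every article looks like a distinct, parameter-free page.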
