Tech Thoughts

Thursday, December 13, 2007

How does Google collect and rank results for your website?

During some recent study and analysis, I came across new concepts about PageRank and search engine optimization. Here I explain how Googlebot, Google's spider, collects data from your website. This article should give you some useful information about these important topics.

Crawling and Indexing

A lot of things have to happen before you see a web page containing your Google search results. Our first step is to crawl and index the billions of pages of the World Wide Web. This job is performed by Googlebot, our 'spider,' which connects to web servers around the world to fetch documents. The crawling program doesn't really roam the web; it instead asks a web server to return a specified web page, then scans that web page for hyperlinks, which provide new documents that are fetched the same way. Our spider gives each retrieved page a number so it can refer to the pages it fetched.

Our crawl produces an enormous set of documents, but these documents aren't searchable yet. Without an index, if you wanted to find a term like civil war, our servers would have to read the complete text of every document every time you searched.

So the next step is to build an index. To do this, we 'invert' the crawl data; instead of having to scan for each word in every document, we juggle our data in order to list every document that contains a certain word. For example, the word 'civil' might occur in documents 3, 8, 22, 56, 68, and 92, while the word 'war' might occur in documents 2, 8, 15, 22, 68, and 77.
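The inversion step can be sketched in a few lines of Python. The documents and IDs below are invented for illustration; a real index also stores word positions, weights, and much more.

```python
# Build a toy inverted index: map each word to the sorted list of
# document IDs that contain it.
def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(doc_id)
    for postings in index.values():
        postings.sort()
    return index

docs = {
    3: "civil engineering handbook",
    8: "the american civil war",
    22: "civil war battlefields",
    2: "war and peace",
}
index = build_index(docs)
# index["civil"] -> [3, 8, 22]; index["war"] -> [2, 8, 22]
```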

Once we've built our index, we're ready to rank documents and determine how relevant they are. Suppose someone comes to Google and types in civil war. In order to present and score the results, we need to do two things:
Find the set of pages that contain the user's query somewhere
Rank the matching pages in order of relevance

We've developed an interesting trick that speeds up the first step: instead of storing the entire index on one very powerful computer, Google uses hundreds of computers to do the job. Because the task is divided among many machines, the answer can be found much faster. To illustrate, let's suppose an index for a book was 30 pages long. If one person had to search for several pieces of information in the index, it would take at least several seconds for each search. But what if you gave each page of the index to a different person? Thirty people could search their portions of the index much more quickly than one person could search the entire index alone. Similarly, Google splits its data between many machines to find matching documents faster.
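The divide-and-conquer idea can be sketched like this: split the index into shards, query each shard independently (sequentially here for simplicity; Google runs the shards in parallel on separate machines), and merge the partial answers. The shard contents and function names are illustrative, not Google's actual design.

```python
def search_shard(shard, word):
    # Each shard holds the postings for a subset of the documents.
    return shard.get(word, [])

def search(shards, word):
    # Query every shard and merge the partial results.
    results = []
    for shard in shards:
        results.extend(search_shard(shard, word))
    return sorted(results)

# Two shards, each covering a different range of document IDs.
shards = [
    {"civil": [3, 8, 22], "war": [2, 8, 15, 22]},
    {"civil": [56, 68, 92], "war": [68, 77]},
]
# search(shards, "civil") -> [3, 8, 22, 56, 68, 92]
```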

How do we find pages that contain the user's query? Let's return to our civil war example. The word 'civil' was in documents 3, 8, 22, 56, 68, and 92; the word 'war' was in documents 2, 8, 15, 22, 68, and 77. Let's write the documents across the page and look for those with both words.

civil        3   8       22  56  68      92
war       2      8   15  22      68  77
both words       8       22      68

Arranging the documents this way makes clear that the words 'civil' and 'war' appear in three documents (8, 22, and 68). The list of documents that contain a word is called a 'posting list,' and looking for documents with both words is called 'intersecting a posting list.' (A fast way to intersect two posting lists is to walk down both at the same time. If one list skips from 22 to 68, you can skip ahead to document 68 on the other list as well.)
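The skip-ahead intersection described above is the classic two-pointer walk over sorted posting lists; a minimal Python sketch:

```python
def intersect(a, b):
    # Walk both sorted posting lists at once, always advancing the
    # pointer that trails behind; matching IDs go into the result.
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

civil = [3, 8, 22, 56, 68, 92]
war = [2, 8, 15, 22, 68, 77]
# intersect(civil, war) -> [8, 22, 68]
```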


Ranking Results

Now we have the set of pages that contain the user's query somewhere, and it's time to rank them in terms of relevance. Google uses many factors in ranking. Of these, the PageRank algorithm might be the best known. PageRank evaluates two things: how many links there are to a web page from other pages, and the quality of the linking sites. With PageRank, five or six high-quality links from websites such as www.cnn.com and www.nytimes.com would be valued much more highly than twice as many links from less reputable or established sites.
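A toy version of the PageRank idea can be written as a power iteration over a small link graph. This is a sketch of the textbook algorithm, not Google's production system; the damping factor 0.85 is the value suggested in the original PageRank paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    # links maps each page to the list of pages it links to.
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            # Each page shares its rank equally among the pages it links to.
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

# A page that everyone links to ends up with the highest rank.
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```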

But we use many factors besides PageRank. For example, if a document contains the words 'civil' and 'war' right next to each other, it might be more relevant than a document discussing the Revolutionary War that happens to use the word 'civil' somewhere else on the page. Also, if a page includes the words 'civil war' in its title, that's a hint that it might be more relevant than a document with the title '19th Century American Clothing.' In the same way, if the words 'civil war' appear several times throughout the page, that page is more likely to be about the civil war than if the words only appear once.
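Those signals (term frequency, title matches, words appearing next to each other) can be combined into a toy scoring function. The weights below are arbitrary illustrations, not Google's actual formula, and the proximity check only handles a two-word query for simplicity.

```python
def score(doc, query_words):
    # doc is a dict with 'title' and 'body' text; all weights are invented.
    title = doc["title"].lower().split()
    body = doc["body"].lower().split()
    s = 0.0
    for word in query_words:
        s += 1.0 * body.count(word)   # term frequency in the body
        s += 5.0 * title.count(word)  # a title match is a strong hint
    # Proximity bonus: the two query words right next to each other.
    for i in range(len(body) - 1):
        if body[i] == query_words[0] and body[i + 1] == query_words[1]:
            s += 3.0
    return s

query = ["civil", "war"]
relevant = {"title": "the civil war", "body": "the civil war began in 1861"}
less = {"title": "19th century clothing", "body": "civil society during the war"}
# score(relevant, query) beats score(less, query)
```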



As a rule, Google tries to find pages that are both reputable and relevant. If two pages appear to have roughly the same amount of information matching a given query, we'll usually try to pick the page that more trusted websites have chosen to link to. Still, we'll often elevate a page with fewer links or lower PageRank if other signals suggest that the page is more relevant. For example, a web page dedicated entirely to the civil war is often more useful than an article that mentions the civil war in passing, even if the article is part of a reputable site such as Time.com.

Once we've made a list of documents and their scores, we take the documents with the highest scores as the best matches. Google does a little bit of extra work to try to show snippets – a few sentences – from each document that highlight the words that a user typed. Then we return the ranked URLs and the snippets to the user as results pages.
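Snippet generation can be sketched as picking a window of words around the first query match; real snippet selection is far more sophisticated, but the core idea looks like this (all names here are illustrative):

```python
def snippet(text, query_words, context=4):
    # Return a few words around the first occurrence of any query word.
    words = text.split()
    for i, w in enumerate(words):
        if w.lower().strip(".,") in query_words:
            start = max(0, i - context)
            return " ".join(words[start:i + context + 1])
    return " ".join(words[:2 * context])

doc = "In 1861 the American Civil War began after decades of tension."
# snippet(doc, {"civil", "war"}) returns a window around "Civil"
```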

As you can see, running a search engine takes a lot of computing resources. For each search that someone types in, over 500 computers may work together to find the best documents, and it all happens in under half a second.

You can follow the link to learn more:
http://www.mattcutts.com/blog/


This article originated from here.
