The Google PageRank system: Is it really in use?
Sections in this article:
1.0-Introduction
2.0-Applying the PR system for the first time
· 2.1-During indexing
· 2.2-Afer reindexing
3.0-Applying thrPR system for reindexing
· 3.1-During reindexing
· 3.2-After reindexing
4.0-Conclusion
1.0-Introduction:
Google’s patented PageRank system of ranking a web page(or any other document that is available on the Internet) is ofcourse the world’s best algorithm for a search engine. It is this PageRank that has made Google different from other search engines. Since its introduction, PageRank has got almost all responsible webmasters on fire, looking for a link from a highly ranked web page. This system has also resulted in the development of
A separate stream of websites like Link Directories and Reciprocal Link Directories. When reviewed theoretically this system seems to be almost correct, but, when it comes to practical application it fails. The extent of failure of the PageRank system depends on how and when Google applies it to deliver search results.
Below is an extract from the article ‘The Anatomy of a Lareg-Scale Hyper Textual Web Search Engine’ by Larry Page and Sergey Brin.
“------------------Extract starts---------------“
2.1 PageRank: Bringing Order to the Web
The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu). For the type of full text searches in the main Google system, PageRank also helps a great deal.
2.1.1 Description of PageRank Calculation
Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation. There are many other details which are beyond the scope of this paper.
2.2 Anchor Text
The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.
This idea of propagating anchor text to the page it refers to was implemented in the World Wide Web Worm [McBryan 94] especially because it helps search non-text information, and expands the search coverage with fewer downloaded documents. We use anchor propagation mostly because anchor text can help provide better quality results. Using anchor text efficiently is technically difficult because of the large amounts of data which must be processed. In our current crawl of 24 million pages, we had over 259 million anchors which we indexed.
“------------------Extract ends---------------“
The formula in the first section of the extract is referred to as the PageRank Formula or the PR Formula further in this article.
A few possibilites of how and when Google calculates the PageRank of a webpage (and how the system might fail) are discussed below.
To discuss in detail we might the following information as a base:
Let us consider a 2 database tables with the following fields. The database model may be of any kind that is with more or less fields for more options. But here is just a simple one to enable you to understand this article.
Table name: page_index
page_id
page_url
page_title
page_description
page_keywords
page_content
indexed_date
outgoing_links
incoming_links
page_rank
Table name: links
link_id
anchor_text
link_from_url
link_to_url
Here the table ‘page_index’ is the table that is searched when a user uses the search engine. The table links is used to maintain the records links containing in the page indexed. The fields page_id and link_id are auto incremented each time when a record is inserted(so that they act a unique serial number).
2.0-Applying the PR system for the first time
2.1-Applying the PageRank system for the first time if the PageRank is calculated during the indexing process:
To start let us consider two webpages A and B. We consider that the web has only two pages, so as to avoid complexity in understanding the problem. However this problem can be understood in the complex form also. The same results are obtained in every case- whether complex or simple. With each having a link to each other. When the indexing process is started with page A. The indexing program may need to index page B, in order to calculate the PageRank of page A, according to the formula mentioned in the extract from the article in section 1.0
Since he 2 pages mentioned above have reciprocal links, the indexing program might start off with an endless redirection thus calculating the PageRank of 0 pages even after a dime duration of 10 hours!
Suppose if the page B contains a link to page C, which in turn contains a link to page D and so on…, then the indexing progarm might reach the last page and then stop indexing if the last page contains no link. If the last page contains a link to any one of the previously crawled(but not indexed) pages then the indexing program might again get stuck in a process of endless redirection as mentioned above.
2.2-Calculating the PageRank after indexing all the pages for the first time:
Here again the same problem—endless redirection(as described in section 2.1) arises, if the PR is calculated after indexing all the pages once.
So inorder to get the PR system working, we need to manually set the PR of all or atleast a few(some big number that is small to a computer) in each category related(or most linked) sites. When Google was in its development stages there were about 10 million web pages on the Internet(just dream about accurately setting the PR of about 10000 web pages of different categories—Impossible job, so Google might have not done it).
The practical possibility is that the initial PR of each page on the web be set equal to another and then start the indexing process and calculate the PR for each page.
If the above method is applied, then there are equal probabilities that the PR of a web page(may or may not be a broken link), containing just links(about 2000 of them) may get a higher PR while an informative hompage(like that of Yahoo!’s) might get a low PR(may be 1 or even 0!).
3.0-Applying the PR system for reindexing
3.1-Calculating the PR during reindexing:
For a while let us assume that we have calculated the PRs of all the web pages. Now let us consider reindexing the pages and calculate their new PR value.
Let us start with a web page A. If we find a link to another webpage B then we index it (inorder to calculate the PR according to the PR Formula). If we find a link on page B to another page C, which in turn links to another page D and goes on… then we reach the nth page just to calculate the PR of page A. If page n is a broken link then the indexing program might stop working or again as mentioned in section2.1, might go on a cyclic ride.
In order to get this working again we might have to consider the old PR values each page that page A links to or gets links incoming links from, which is as good as never updating th PR.
3.2-Calculating the PR after reindexing:
Now assuming that we have indexed all the webpages on our computer. Again we might have to consider the old PR values of each webpage in order to calculate the new PR values. Thus ending with the same result as in section 3.1
4.0-Conclusion
Thus when the PageRank system is put into practical use we would end up with no PR value for a page or wrong (therefore useless) PR values. The same result is obtained whether the indexing program is run on one computer or on many computers simultaneously.
But we also have a contradiction: Google uses PR to deliver search results.
These results are relating to the keyword searched for.
So the question now is 'Is Google really using PR to deliver search results?'