"The Algorithm That Powers Google's PageRank"

“The Algorithm That Powers Google’s PageRank”


Every time we look up something like ‘Why is 11 not pronounced onety-one’ or ‘Why isn’t there an E grade’ on Google, the underlying algorithm of the search engine scans its database to find the most fitting answers to provide the user. This blog will delve into the Google Pagerank system, in a straightforward manner with keywords and formulas to clarify how the algorithm operates.

Pagerank is an innovative algorithm utilized solely by Google, developed in 1998. Given the immense data it possesses, the search engine confronts issues of space-time complexity. The aim of PageRank is to identify links to and from pages while assessing the authority, reliability, popularity, and significance of each page. To achieve this, the search engine conducts querying and indexing. The Query class executes operations using the user’s input, while the Index class organizes data about each page’s content and assigns scores to those pages.

To illustrate how this algorithm operates, an example of a corpus will be shown to demonstrate the processing of the words involved. Typically, the corpus would encompass an entire website or file, but for simplicity, the example will comprise only phrases.

The phrases are “Dogs, dogs, dogs”, “The running dogs”, “Adopt a dog”, and “Cute video of a dog running”.

There are four primary tasks assigned to the Query class

Using the instance of the running dog, here’s a breakdown of what each function accomplishes.

1. Parsing: extracting text from XML format for processing
2. Tokenising: eliminating punctuation and whitespace
3. Discarding stop words such as but not limited to ‘like’, ‘the’, or ‘that’
4. Stemming words: simplifying inflected words to their base form

e.g: running → run

In the same way, the remaining words would be accounted for as “ ‘Dog’, ‘Dog’, ‘Dog’”, “ ‘Run’, ‘Dog’ “, “ ‘Adopt’, ‘dog’ ”, “ ‘Cute’, ‘Video’, ‘Dog’, ‘Run’”.

The Indexing class performs all necessary calculations to ascertain the significance of a specific word on a page. This figure is then utilized to score the importance of the page in relation to the keyword entered by the user.

The first step the algorithm must take is to calculate the Term Frequency, which is the number of times a word appears in the document and corpus. This is computed with the formula

TF = x/y, where

x = number of times a word appears

y = number of occurrences of the most frequently used word

Taking ‘run’ from the phrase ‘The Running Dog’ as an example, this yields a TF value of ¼ or 0.25.

Next, the algorithm determines the Inverse Document Frequency (IDF), which measures how frequently the word appears in the document and across the entire corpus using the equation

IDF = log(n/n(i))

where

n = total documents in a corpus

i = the word in the corpus

n(i) = document count containing word i in the corpus

Using ‘run’ as the example, this calculation would give us the IDF value of log(4/2) or log(2), which results in an IDF value of 0.301.

After obtaining these two values, the relevance of the word can be scored using the equation

Relevance = TF * IDF

This process is then reiterated with the quantity of keywords present on a page. For the word ‘run’ in the phrase ‘The Running Dog’, we derive a relevance score of 0.075.

After evaluating how crucial a page is concerning a keyword, its relative authority to each other is assessed. This procedure can be streamlined into a series of rules:

1. The more links a page has to other pages, the greater the score it gets
2. If a more authoritative page, page P, links to page X, then page X will receive the same level of authority as page P
3. If page P links to only a few pages, the scores among those pages are elevated
4. If page P is nearer to page X (fewer clicks needed to navigate from one to the other), there is a higher PageRank score

With these points in consideration, we can compute the PageRank score.

Initially, we need to determine the weight of the page under consideration.

w(pm) = weight of page P from page M

n(pm) = unique page links between page P and page M

t = total pages in the corpus

e = a constant given

If page m links to page p, this can be evaluated using the equation

The weight that ‘The running Dog’ receives from ‘Dogs, Dogs, Dogs’ would be

yielding a weight of 0.268. The weight