Every time we seek information such as ‘Why isn’t 11 pronounced onety-one’ or ‘Why isn’t there an E grade’ on Google, the algorithm behind the search engine navigates through its data corpus to identify the most appropriate responses to the user’s queries. This blog will examine the Google PageRank system in a simplified manner with keywords and formulas to elucidate how the algorithm operates.
PageRank is an original algorithm utilized solely by Google, developed in 1998. Given the immense volume of data it manages, the search engine encounters space-time complexity. The aim of PageRank is to discover links to and from pages and to assess the authority, dependability, popularity, and significance of each page. To achieve this, the search engine conducts querying and indexing. The query class undertakes operations based on the user’s input, while the Index class organizes data regarding the content of each page and subsequently scores these pages.
To illustrate the functioning of this algorithm, an example corpus will be presented to demonstrate the handling of the words being transmitted. Typically, the corpus would encompass an entire website or file, but for the sake of simplicity, it will only consist of phrases.
These phrases consist of “Dogs, dogs, dogs”, “The running dogs”, “Adopt a dog” and “Cute video of a dog running”.
There are four primary tasks the Query class must carry out:
Using the example of the running dog, here’s what each function accomplishes.
1. Parsing: extracting text from XML format from the input for processing
2. Tokenizing: eliminating punctuation and whitespace
3. Removing stop words such as but not limited to ‘like’, ‘the’ or ‘that’
4. Stemming words: reducing inflected words to their simplest form
e.g: running → run
In the same way, the other words would be represented as “ ‘Dog’, ‘Dog’, ‘Dog’”, “ ‘Run’, ‘Dog’ “, “ ‘Adopt’, ‘dog’ ”, “ ‘Cute’, ‘Video’, ‘Dog’, ‘Run’”.
The Indexing class performs all calculations necessary to ascertain the significance of a specific word on a page. This figure is then employed to evaluate the relative importance of this page concerning the keyword searched by the user.
The first requirement for the algorithm is to determine the Term Frequency, which indicates how many times a word appears in the document and corpus. This is computed using the equation:
TF = x/y, where
x = specific times a word is present
y = specific number of times the most frequently used word is present
Using the word run as an illustration from the phrase ‘The Running Dog’, this would yield a TF value of ¼ or 0.25.
Next, the algorithm will identify the Inverse Document Frequency (IDF), which denotes how many times the word occurs in the document and the entire corpus through the equation:
IDF = log(n/n(i))
where
n = total number of documents in a corpus
i = word in corpus
n(i) = number of documents containing word i in corpus
For the word ‘run’, this will give us the IDF value of log(4/2) or log(2), resulting in an IDF value of 0.301.
Once these two results are obtained, the relevance of the word can be computed using the equation:
Relevance = TF * IDF
This process is then repeated for the number of keywords present on a page. Utilizing the word ‘run’ from the phrase ‘The Running Dog’ provides us with a relevance score of 0.075.
After assessing the significance of a page in relation to a keyword, the calculated authority of the page in relation to others is determined. This process can be distilled into a set of guidelines:
1. The more links a page has to other pages, the higher the score it attains
2. If a more authoritative page, page P, links to page X, page X will inherit an equivalent authority level as page P
3. If page P links to only a limited number of pages, the scores of these interconnected pages are elevated
4. If page P is nearer to page X (fewer clicks required to navigate from one to the other), there is a higher PageRank score
Considering this, we can compute the PageRank score.
Initially, we need to determine the weight of the page in question.
w(pm) = weight of page P derived from page M
n(pm) = number of unique page links between page P and page M
t = total number of pages in corpus
e = a given constant
If page m links to page p, this can be evaluated using the equation:
The weight assigned to the page ‘The running Dog’ from ‘Dogs, Dogs, Dogs’ would be
resulting in a weight of 0.268. The weight of the Page