Understanding Google’s PageRank Algorithm: A Straightforward Overview
Whenever you enter a query into Google, such as “Why isn’t 11 pronounced onety-one?” or “Why is there no E grade?”, Google’s search engine gets to work. Behind the scene, an advanced algorithm analyzes billions of web pages to present the most pertinent and authoritative results. A key aspect of this ingenious mechanism is the PageRank algorithm—first developed by Google creators Larry Page and Sergey Brin in 1998.
This article seeks to simplify the PageRank concept, providing an insight into how Google evaluates and ranks webpages based on their relevance and authority.
What Is PageRank?
PageRank is an algorithm utilized by Google Search to prioritize web pages in search results. It functions by giving a numerical value to each page, indicating its significance or “authority” within the extensive network of interconnected web pages.
Fundamentally, PageRank operates on the principle that a page is regarded as important if it is linked to by other important pages.
The Two Fundamental Components: Querying and Indexing
Before PageRank can perform its function, two essential processes must occur first:
1. Querying — involves interpreting the user’s search intent.
2. Indexing — concentrates on understanding and organizing web pages according to their content.
Let’s illustrate these with a straightforward dataset (corpus) consisting of four phrases:
– “Dogs, dogs, dogs”
– “The running dogs”
– “Adopt a dog”
– “Cute video of a dog running”
From here, two classes are defined:
1. Query Class
This class processes the user’s input prior to matching it with relevant documents. It carries out:
– Parsing: Extracting raw text from structured formats like XML.
– Tokenizing: Splitting sentences into individual words.
– Removing Stop Words: Words such as “the,” “a,” and “is” are omitted as they carry minimal significance.
– Stemming: Reducing words to their base forms (e.g., “running” → “run”).
For example, “The running dogs” would be parsed and stemmed into the tokens: “run”, “dog”.
2. Index Class
The Index class evaluates how significant a word is within a document and across all documents. This significance is assessed using two primary metrics:
A. Term Frequency (TF)
This quantifies how often a word appears in a single document.
Equation:
TF = x / y
Where:
x = count of a word’s appearances
y = count of the most frequent word in the same document
Example:
In “The running dog”, if “run” occurs once and the most frequent word (“dog”) occurs 4 times, then:
TF(run) = 1 / 4 = 0.25
B. Inverse Document Frequency (IDF)
This measures how uncommon or unique a word is across all documents.
Equation:
IDF = log(n / n(i))
Where:
n = total number of documents
n(i) = count of documents that feature the word
Example:
If “run” is found in 2 out of 4 documents:
IDF(run) = log(4 / 2) = log(2) ≈ 0.301
C. Relevance Score
The multiplication of TF and IDF yields the Relevance Score — a metric of how vital a specific word is in a given document.
Relevance(run) = TF × IDF = 0.25 × 0.301 = 0.075
Grasping Page Authority via PageRank
Once a page is evaluated for relevance, the algorithm assesses its importance by analyzing how web pages interlink.
PageRank Simplified Principles:
– A page with numerous inbound links is viewed as more authoritative.
– If a credible page links to another, it imparts some of its authority.
– If a high-authority page links to very few pages, each link has greater significance.
– Pages that are closer (in click distance) to many others receive a higher PageRank.
Computing PageRank Score
To determine how much “rank” a page derives from another, we apply the formula:
w(pm) = PR(m) / L(m)
Where:
– PR(m) is the PageRank of page ‘m’
– L(m) is the count of unique outbound links from page ‘m’
If page P gains a link from page M:
Weight w(pm) = PR(M) / L(M)
If page M does not link to page P:
A damping factor is introduced to mimic random web exploration, commonly e = 0.15
This assigns a weight of e / total number of pages (e.g., 0.15 / 4 = 0.0375)
Final PageRank Formula:
PR(P)