
How Search Engines Work

Understanding crawling, indexing, and ranking algorithms

To optimise for search engines effectively, you need to understand what they actually do with your pages. The process has three distinct stages — and each one is a potential failure point.

Stage 1 — Crawling

Search engine crawlers (Googlebot, Bingbot) are automated programs that follow links across the web, downloading the pages they discover. Google starts from a seed set of known pages, follows the links those pages contain, and queues newly discovered URLs for crawling (a toy crawler sketch follows the robots.txt example below).

  • Crawling is rate-limited — Google won't hammer your server. Googlebot sets its crawl rate automatically and ignores the crawl-delay directive in robots.txt (Bing and some other crawlers honour it)
  • Pages not linked to from anywhere are unlikely to be crawled (orphan pages)
  • JavaScript-rendered content can be indexed but may be delayed — Google queues pages for a second wave of rendering, so content injected by JavaScript enters the index later

# robots.txt — controls which pages crawlers may visit
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /

Sitemap: https://example.com/sitemap.xml
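
To make the crawl stage concrete, here is a toy breadth-first crawler in TypeScript (Node 18+, which has a global fetch). This is a sketch of the idea, not how Googlebot actually works: the regex link extraction, the hard-coded example.com origin, the disallow list mirroring the robots.txt above, and the one-second politeness delay are all illustrative assumptions.

// toy-crawler.ts: a toy breadth-first crawler (illustrative only)
const ORIGIN = "https://example.com";
const DISALLOWED = ["/admin/", "/api/"]; // mirrors the robots.txt above
const seen = new Set<string>();
const queue: string[] = [ORIGIN + "/"]; // the "seed set"

async function crawl(maxPages = 20): Promise<void> {
  while (queue.length > 0 && seen.size < maxPages) {
    const url = queue.shift()!;
    if (seen.has(url)) continue;
    seen.add(url);

    const res = await fetch(url);
    if (!res.ok || !res.headers.get("content-type")?.includes("text/html")) continue;
    const html = await res.text();

    // Naive link extraction: a real crawler parses the DOM and renders JavaScript
    for (const m of html.matchAll(/href="([^"#]+)"/g)) {
      const next = new URL(m[1], url);
      if (next.origin !== ORIGIN) continue; // stay on one site for this toy
      if (DISALLOWED.some((p) => next.pathname.startsWith(p))) continue;
      if (!seen.has(next.href)) queue.push(next.href);
    }

    await new Promise((r) => setTimeout(r, 1000)); // politeness: rate-limit per host
  }
}

crawl().then(() => console.log(`Crawled ${seen.size} pages`));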

Stage 2 — Indexing

Once crawled, pages go into Google's index — a massive data store mapping words and concepts to pages. Google parses your HTML, extracts text, images, links, and structured data, and stores a representation of the page.

Pages can be crawled but not indexed — this happens when Google finds:

  • noindex meta tag or X-Robots-Tag response header (see the sketch after this list)
  • Soft 404s (200 status with "page not found" content)
  • Duplicate content (the canonical version is indexed instead)
  • Low quality or thin content
  • Content blocked by robots.txt (robots.txt prevents crawling, not indexing: Google can't read the page, but may still index the bare URL from links pointing to it; use noindex to keep a page out of the index)
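
The noindex signal deserves a closer look. A page can opt out of the index with <meta name="robots" content="noindex"> in its HTML, or with an X-Robots-Tag response header (useful for PDFs and other non-HTML files). A minimal Node sketch of the header approach, assuming a hypothetical /drafts/ section you want crawlable but unindexed:

// noindex-header.ts: send a noindex header for pages that should stay out of the index
import { createServer } from "node:http";

const server = createServer((req, res) => {
  if (req.url?.startsWith("/drafts/")) {
    // Crawlers that fetch this URL are told not to index it
    res.setHeader("X-Robots-Tag", "noindex");
  }
  res.setHeader("Content-Type", "text/html");
  res.end("<!doctype html><title>Draft</title><p>Work in progress</p>");
});

server.listen(3000);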

Stage 3 — Ranking

When a user searches, Google retrieves relevant pages from the index and ranks them. Google's ranking algorithm evaluates hundreds of signals, but the most important ones are relevance, quality, and authority (a toy scoring sketch follows this list).

  • Relevance: does the page content match the search query and intent?
  • Quality: is the content comprehensive, accurate, and trustworthy (E-E-A-T)?
  • Authority: how many high-quality sites link to this page?
  • Page experience: Core Web Vitals, mobile-friendliness, HTTPS
  • Freshness: for news and time-sensitive queries, recent content ranks higher
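
Google does not publish how these signals are weighted, but the shape of the idea is simple: each candidate page gets a score combining several signals, and results are sorted by that score. The weights and numbers below are invented purely for illustration.

// toy-ranker.ts: a toy weighted-sum ranker (weights are invented)
interface PageSignals {
  url: string;
  relevance: number;      // 0..1: how well content matches the query
  quality: number;        // 0..1: an E-E-A-T-style quality proxy
  authority: number;      // 0..1: link-based, e.g. normalised PageRank
  pageExperience: number; // 0..1: Core Web Vitals, HTTPS, mobile
}

function score(p: PageSignals): number {
  // Invented weights; Google's real combination is unpublished
  return 0.4 * p.relevance + 0.25 * p.quality + 0.25 * p.authority + 0.1 * p.pageExperience;
}

const results: PageSignals[] = [
  { url: "/a", relevance: 0.9, quality: 0.6, authority: 0.4, pageExperience: 0.8 },
  { url: "/b", relevance: 0.7, quality: 0.9, authority: 0.8, pageExperience: 0.6 },
];

results.sort((a, b) => score(b) - score(a));
console.log(results.map((p) => p.url)); // highest-scoring URL first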

PageRank and Link Signals

PageRank (still a foundational signal) measures a page's authority based on the quantity and quality of links pointing to it. A link from a high-authority page (like a major news site) passes more "link equity" than a link from a new blog.
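
Unlike most ranking signals, the original PageRank algorithm is public (Page and Brin, 1998), so it can be sketched directly: a page's rank models the probability that a "random surfer" lands on it, computed by repeatedly letting every page split its current rank across its outgoing links. The three-page graph below is illustrative, and this sketch skips details such as dangling pages that the production version handles.

// pagerank.ts: classic PageRank by power iteration on a tiny link graph
const links: Record<string, string[]> = {
  a: ["b", "c"], // page a links to b and c
  b: ["c"],
  c: ["a"],
};

function pagerank(damping = 0.85, iterations = 50): Record<string, number> {
  const pages = Object.keys(links);
  const n = pages.length;

  let rank: Record<string, number> = {};
  for (const p of pages) rank[p] = 1 / n; // start uniform

  for (let i = 0; i < iterations; i++) {
    const next: Record<string, number> = {};
    for (const p of pages) next[p] = (1 - damping) / n; // random-jump share
    for (const p of pages) {
      for (const q of links[p]) {
        next[q] += (damping * rank[p]) / links[p].length; // p splits its rank across its links
      }
    }
    rank = next;
  }
  return rank;
}

console.log(pagerank()); // pages with more (and better-ranked) inlinks score higher

In HTML, the rel attribute controls which links should pass this equity: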

<!-- Pass link equity (default) -->
<a href="/article">Read more</a>

<!-- rel="nofollow" — don't pass equity (user-generated, untrusted) -->
<a href="https://example.com" rel="nofollow">External link</a>

<!-- rel="sponsored" — paid/affiliate links -->
<a href="https://partner.com" rel="sponsored">Partner</a>

<!-- rel="ugc" — user-generated content (comments, forums) -->
<a href="https://..." rel="ugc">User link</a>

Sitemaps

An XML sitemap tells Google which pages exist and when they were last updated. Submit it via Google Search Console, or reference it from robots.txt with a Sitemap: line as in the earlier example.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/learn-react/hooks</loc>
    <!-- lastmod is used by Google when it's consistently accurate -->
    <lastmod>2026-04-01</lastmod>
    <!-- Google ignores changefreq and priority; other engines may read them -->
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
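
Hand-writing sitemap XML does not scale, so sitemaps are usually generated from the site's route data. A minimal sketch; the buildSitemap helper and its Route shape are invented for illustration:

// build-sitemap.ts: generate sitemap XML from a route list (hypothetical helper)
interface Route {
  path: string;
  lastmod: string; // ISO date, e.g. "2026-04-01"
}

function buildSitemap(origin: string, routes: Route[]): string {
  const urls = routes
    .map((r) => `  <url>\n    <loc>${origin}${r.path}</loc>\n    <lastmod>${r.lastmod}</lastmod>\n  </url>`)
    .join("\n");
  return `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${urls}\n</urlset>`;
}

console.log(buildSitemap("https://example.com", [
  { path: "/learn-react/hooks", lastmod: "2026-04-01" },
]));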

What You Can Control

  • Crawl budget: fix crawl errors reported in Google Search Console, remove thin/duplicate pages, block unimportant URLs
  • Indexing: use canonical tags, structured data, and rich content — avoid noindex on valuable pages
  • Ranking: create comprehensive, well-structured content; earn quality backlinks; improve Core Web Vitals
