To optimise for search engines effectively, you need to understand what they actually do with your pages. The process has three distinct stages — and each one is a potential failure point.
Stage 1 — Crawling
Search engine crawlers (Googlebot, Bingbot) are automated programs that follow links across the web, downloading the pages they discover. Google starts from a seed set of known pages and follows the links they contain to discover new URLs.
- Crawling is rate-limited: Google won't hammer your server. Bing honours a crawl-delay directive in robots.txt (see the example below); Googlebot ignores it and sets its own crawl rate based on how your server responds
- Pages that nothing links to (orphan pages) are unlikely to be crawled
- JavaScript-rendered content can be crawled, but indexing it may be delayed: Google queues the page for a second rendering pass after the initial HTML is processed
# robots.txt — controls which pages crawlers may visit
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /
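# Crawl-delay is honoured by Bing but ignored by Googlebot
Crawl-delay: 10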
Sitemap: https://example.com/sitemap.xml
Stage 2 — Indexing
Once crawled, pages go into Google's index — a massive data store mapping words and concepts to pages. Google parses your HTML, extracts text, images, links, and structured data, and stores a representation of the page.
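Structured data is usually supplied as JSON-LD in a script tag. A minimal sketch using schema.org's Article type follows; the values are illustrative, not taken from a real page.
<!-- JSON-LD structured data describing the page to search engines -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Understanding React Hooks",
  "datePublished": "2026-04-01",
  "author": { "@type": "Person", "name": "Jane Doe" }
}
</script>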
Pages can be crawled but not indexed — this happens when Google finds:
- A noindex meta tag or X-Robots-Tag header (shown below)
- Soft 404s (200 status with "page not found" content)
- Duplicate content (the canonical version is indexed instead)
- Low quality or thin content
- Content blocked by robots.txt (Google can't crawl it, but the bare URL can still be indexed if other pages link to it)
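The noindex signal from the list above can be set in the page markup or, for non-HTML resources such as PDFs, as an HTTP response header:
<!-- Meta tag: the page can still be crawled, but it is dropped from the index -->
<meta name="robots" content="noindex">
<!-- Equivalent HTTP response header for non-HTML files -->
X-Robots-Tag: noindex
Google has to crawl the page to see either form, so a noindex directive on a URL that robots.txt blocks has no effect.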
Stage 3 — Ranking
When a user searches, Google retrieves relevant pages from the index and ranks them. The ranking algorithm evaluates hundreds of signals, but the most important are relevance, quality, and authority.
- Relevance: does the page content match the search query and intent?
- Quality: is the content comprehensive, accurate, and trustworthy (E-E-A-T: experience, expertise, authoritativeness, trustworthiness)?
- Authority: how many high-quality sites link to this page?
- Page experience: Core Web Vitals, mobile-friendliness, HTTPS (see the example after this list)
- Freshness: for news and time-sensitive queries, recent content ranks higher
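For the page experience signals, the simplest markup-level baseline is a responsive viewport declaration; without one, a page is unlikely to be treated as mobile-friendly.
<!-- Responsive viewport, needed for the page to render properly on mobile -->
<meta name="viewport" content="width=device-width, initial-scale=1">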
PageRank and Link Signals
PageRank (still a foundational signal) measures a page's authority based on the quantity and quality of links pointing to it. A link from a high-authority page (like a major news site) passes more "link equity" than a link from a new blog.
<!-- Pass link equity (default) -->
<a href="/article">Read more</a>
<!-- rel="nofollow" — don't pass equity (user-generated, untrusted) -->
<a href="https://example.com" rel="nofollow">External link</a>
<!-- rel="sponsored" — paid/affiliate links -->
<a href="https://partner.com" rel="sponsored">Partner</a>
<!-- rel="ugc" — user-generated content (comments, forums) -->
<a href="https://..." rel="ugc">User link</a>
Sitemaps
An XML sitemap tells Google which pages exist and when they were last updated. Submit it via Google Search Console.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/learn-react/hooks</loc>
    <lastmod>2026-04-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
What You Can Control
- Crawl budget: fix crawl errors in Google Search Console, remove thin or duplicate pages, block unimportant URLs
- Indexing: use canonical tags (example below), structured data, and rich content; avoid noindex on valuable pages
- Ranking: create comprehensive, well-structured content; earn quality backlinks; improve Core Web Vitals
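As a sketch of the canonical tags mentioned above: duplicate URLs (tracking parameters, print versions, pagination variants) can point Google at the preferred version so that indexing and link equity consolidate on one page. The URL here reuses the sitemap example.
<!-- Placed in the head of every duplicate variant, pointing at the preferred URL -->
<link rel="canonical" href="https://example.com/learn-react/hooks">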