
Crawl Budget: a practical guide for large platforms


Step-by-step crawl budget audit: log analysis, Google Search Console, sources of crawl waste (parametric URLs, facets, temporary pages), remediation measures, Sitemap index setup, rate limiting, KPIs, and a roadmap for Dev/SEO teams.

Crawl budget is the number of pages Googlebot is willing to crawl on your site within a given period. On small sites it is virtually unlimited. On platforms with millions of URLs it becomes a strategic resource: the bot wastes it on filters, pagination, empty templates and staging pages instead of crawling priority content. The result is delayed indexation of new products, weak catalogue coverage, and lost rankings.

Where the crawl budget goes: valuable pages compete with filters, UTM parameters and pagination for the bot's attention.
Google officially distinguishes two concepts: the crawl capacity limit, previously called the crawl rate limit (how fast the bot can fetch without overloading the server), and crawl demand (how many pages Google wants to crawl based on their perceived value). Crawl budget is the combination of the two: the set of URLs Googlebot both can and wants to crawl. Optimisation has a measurable impact mainly on large platforms.

  • 5–10× budget overspend: typical ratio of crawled URLs to genuinely valuable pages on an unoptimised e-commerce platform.
  • 60–80% lost to duplicates: share of budget consumed by parametric URLs and pagination when no mitigation is in place.
  • 2–4 weeks indexation delay: typical delay before a new SKU appears in the index when the budget is saturated.
  • >50k URLs problem threshold: platforms with more than 50,000 URLs already feel the effects of crawl budget constraints.

Step 1: collecting and analysing crawler logs

Web server logs are the primary source of truth about what Googlebot actually crawls. Log analysis reveals the real request frequency, which sections the bot ignores or gets stuck on, and what HTTP responses the server returns. Without logs, any optimisation is guesswork.

Extracting Googlebot requests from Nginx / Apache

BASH
# Top 50 URLs requested by Googlebot in the current access.log
# (this reads a single file; include rotated logs to cover a full 7 days)
grep 'Googlebot' /var/log/nginx/access.log \
  | awk '{print $7}' \
  | sort | uniq -c | sort -rn \
  | head -50

# HTTP status codes seen by Googlebot
grep 'Googlebot' /var/log/nginx/access.log \
  | awk '{print $9}' \
  | sort | uniq -c | sort -rn

# Top URLs returning 404 to Googlebot
grep 'Googlebot' /var/log/nginx/access.log \
  | awk '$9 == "404" {print $7}' \
  | sort | uniq -c | sort -rn | head -30

For structured analysis of large log volumes use dedicated tools: Screaming Frog Log File Analyser, Semrush Log Analyser, or an ELK stack (Elasticsearch, Logstash, Kibana). They group URLs by pattern automatically and plot crawl rate over time.
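Grouping by pattern is also easy to approximate by hand when no dedicated tool is available. The sketch below is a minimal illustration, not a replacement for those tools: `url_pattern` and the sample paths are hypothetical, and it assumes that numeric path segments are IDs and that only parameter names (not values) define a pattern.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def url_pattern(url: str) -> str:
    """Collapse a URL into a pattern: numeric path segments become {id},
    and the query string is reduced to its sorted parameter names."""
    parts = urlsplit(url)
    path = re.sub(r"/\d+(?=/|$)", "/{id}", parts.path)
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return path + ("?" + "&".join(names) if names else "")

# Hypothetical request paths, e.g. the $7 column from the log extract above
sample = [
    "/catalog/shoes?sort=price&page=3",
    "/catalog/shoes?page=7&sort=name",
    "/product/1042",
    "/product/977",
    "/catalog/shoes",
]

pattern_counts = Counter(url_pattern(u) for u in sample)
for pattern, hits in pattern_counts.most_common():
    print(f"{hits:>5}  {pattern}")
```

Patterns with a high hit count and a query string are the first candidates to check against the waste sources described later in this guide.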

| Metric | What it shows | Target range |
| --- | --- | --- |
| Crawl rate (req/day) | Crawl intensity: how many pages Googlebot requests per day | Platform-dependent; track the trend |
| Pages crawled/day | Unique URLs crawled per day; the primary crawl budget KPI | Should rise after optimisation |
| 200 / (301+302) / 404 / 500 ratio | Share of each HTTP response type across all bot requests | 200 > 90%; 404 < 5%; 500 ≈ 0% |
| % duplicate URLs in crawl | Share of parametric and duplicate URLs in the log | < 15% after optimisation |
| New pages / total crawled | Ratio of first-time crawls to repeat visits | Aim to increase the share of new pages |
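The response-code and duplicate shares from the table can be estimated directly from raw log lines. The following is a rough sketch under stated assumptions: the log is in Nginx "combined" format, any URL with a query string counts as parametric (a proxy for duplicates, not an exact measure), and `crawl_kpis` and `log_line` are hypothetical helper names. It matches Googlebot by user-agent substring only, so spoofed bots are not filtered out.

```python
import re
from collections import Counter

# Request line and status code in the Nginx "combined" log format
LINE_RE = re.compile(r'"[A-Z]+ (?P<url>\S+) [^"]*" (?P<status>\d{3}) ')

def crawl_kpis(lines):
    """Return status-class shares and the parametric-URL share among Googlebot hits."""
    statuses, total, parametric = Counter(), 0, 0
    for line in lines:
        if "Googlebot" not in line:
            continue
        m = LINE_RE.search(line)
        if not m:
            continue
        total += 1
        statuses[m["status"][0] + "xx"] += 1
        parametric += "?" in m["url"]
    if total == 0:
        return {}
    shares = {cls: n / total for cls, n in statuses.items()}
    shares["parametric"] = parametric / total
    return shares

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

def log_line(url, status, ua=GOOGLEBOT_UA):
    # Builds a demo line in "combined" format; real lines come from access.log
    return f'66.249.66.1 - - [15/May/2026:10:00:00 +0000] "GET {url} HTTP/1.1" {status} 512 "-" "{ua}"'

sample_lines = [
    log_line("/product/1", 200),
    log_line("/catalog?sort=price", 200),
    log_line("/old-page", 404),
    log_line("/a?utm_source=x", 301),
    log_line("/product/2", 200, ua="Mozilla/5.0 Chrome/124.0"),  # not a bot, ignored
]
kpis = crawl_kpis(sample_lines)
print(kpis)
```

In production the same function can be fed a file handle, for example `crawl_kpis(open("/var/log/nginx/access.log"))`.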

Step 2: Google Search Console and server response analysis

Google Search Console provides aggregated crawl data without requiring access to server logs. The Crawl Stats report is the key entry point. It shows crawl request volume over time, average response time, and the distribution of response codes.

  • GSC → Settings → Crawl Stats: crawl request graph, breakdown by file type and response code. Look for 404/500 spikes and drops after deployments.
  • GSC → Indexing → Pages: number of indexed URLs vs. reasons for non-indexation. "Discovered — currently not indexed" is a direct crawl budget signal.
  • GSC → URL Inspection: manual status check for a specific page — when it was last crawled, what Googlebot saw.
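  • Crawl rate control: Google retired the Search Console crawl rate limiter in January 2024, so GSC no longer offers a throttle setting. If the bot is overloading the server, temporarily return 429, 500 or 503 responses and Googlebot will slow down. Normally not needed: only restrict crawling when you have measurable performance problems.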
"Discovered — currently not indexed" in GSC is the most direct signal of a crawl budget shortage. Google knows the pages exist (found via links or in the sitemap) but cannot crawl them in time. The immediate priority is freeing budget from junk URLs.

Sources of crawl budget waste

On a large platform the bot loses budget to six typical sources. Identify each in your logs and estimate the volume — this gives you the priority order for optimisation.

Comparison of wasted vs optimized crawl budget allocation.

Parametric URLs

?sort=price&filter=red&page=3 generates an exponential number of unique URLs with identical or near-identical content.

Infinite pagination

/catalog?page=N — thousands of URLs with diminishing value. The bot can endlessly crawl old catalogue pages.

Duplicate content

Identical pages on different URLs: /category/ and /category?utm_source=email, HTTP vs. HTTPS, www vs. non-www.

Temporary pages

Webinar landing pages, demo mockups, ?preview=1 pages, drafts — they appear and disappear but the bot keeps spending budget on them.

Empty templates

Category pages with no products, profiles with no content, auto-generated tag pages — low-quality content generated by the thousands.

Session parameters

?sessionid=abc123, ?token=xyz — unique URLs per session. Each such URL is a "new page" to the bot.
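To estimate the volume per source, each logged URL can be tagged with the waste source it most likely belongs to. This is a simplified sketch: the parameter names are taken from the examples above and are assumptions about your platform, `waste_source` is a hypothetical helper, and empty templates cannot be detected from the URL alone, so they are not covered here.

```python
from urllib.parse import urlsplit, parse_qsl

# Parameter names are assumptions based on the examples above; adjust to your platform.
# Dict order sets the priority when a URL carries several parameter types.
WASTE_PARAMS = {
    "session": {"sessionid", "token"},
    "temporary": {"preview", "draft"},
    "parametric": {"sort", "filter"},
    "pagination": {"page"},
}

def waste_source(url: str) -> str:
    """Return the most likely waste source for a URL, or 'clean'."""
    query = urlsplit(url).query
    names = {k.lower() for k, _ in parse_qsl(query, keep_blank_values=True)}
    if any(n.startswith("utm_") for n in names):
        return "duplicate (tracking)"
    for source, params in WASTE_PARAMS.items():
        if names & params:
            return source
    return "clean" if not names else "other parameters"

for url in ["/catalog?sort=price&filter=red&page=3", "/catalog?page=12",
            "/category?utm_source=email", "/cart?sessionid=abc123", "/category/"]:
    print(f"{waste_source(url):<22} {url}")
```

Counting the labels over a week of Googlebot requests gives the per-source volume needed to set the priority order.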

URL parameters and duplicates: remediation

Parametric URLs are the biggest source of waste on e-commerce platforms. There is no single silver bullet — the right solution depends on the nature of the parameter.

| Parameter type | Example | Recommended action |
| --- | --- | --- |
| Sorting / ordering | ?sort=price_asc | Disallow in robots.txt or canonical pointing to the parameter-free URL |
| Filters with no unique content | ?color=red&size=XL | Canonical to clean category page + noindex via meta robots |
| Pagination | ?page=5 | Canonical on pages 2+ to page 1 (or Disallow for deep pagination) |
| Tracking / UTM | ?utm_source=email | Canonical to the UTM-free URL on every page (the GSC URL Parameters tool was retired in 2022 and is no longer an option) |
| Sessions / tokens | ?sessionid=abc | Disallow: /*?sessionid= in robots.txt; fix URL generation in code |
| Preview / draft | ?preview=1&draft=true | Disallow: /*?preview= and /*?draft= in robots.txt |
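For the rows that call for a canonical pointing to a clean URL, the target can be computed in the template layer. The sketch below is one possible approach, not a prescribed implementation: `canonical_url` and the two strip lists are hypothetical, and it assumes the listed parameters never change page content. Filter parameters are kept and sorted so that equivalent combinations collapse to one URL.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these parameters never change the content of a page on this site
STRIP_EXACT = {"sort", "order", "sessionid", "token", "ref", "preview", "draft"}
STRIP_PREFIXES = ("utm_",)

def canonical_url(url: str) -> str:
    """Drop non-content parameters and sort the rest into a stable order."""
    parts = urlsplit(url)
    kept = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in STRIP_EXACT and not k.lower().startswith(STRIP_PREFIXES)
    ]
    kept.sort()
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonical_url("https://example.com/catalog/shoes?utm_source=email&sort=price_asc"))
# https://example.com/catalog/shoes
print(canonical_url("https://example.com/catalog/shoes?size=XL&color=red&sort=price"))
# https://example.com/catalog/shoes?color=red&size=XL
```

The returned value would go into the page's `<link rel="canonical">` element.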
Example robots.txt rules for blocking parametric URLs:
TEXT
# robots.txt — blocking parametric URLs
User-agent: *

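# Sort and filter parameters
# "/*?name=" matches only when the parameter comes first in the query string;
# the "/*&name=" form catches it in any later position.
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?order=
Disallow: /*&order=
Disallow: /*?filter=
Disallow: /*&filter=
Disallow: /*?color=
Disallow: /*&color=
Disallow: /*?size=
Disallow: /*&size=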

# Pagination deeper than page 3 (Allow/Disallow approach)
# "$" anchors the end of the URL; without it, page=1 would also allow page=10, 11, ...
Disallow: /*?page=
Allow: /*?page=1$
Allow: /*?page=2$
Allow: /*?page=3$

# Session and tracking parameters
Disallow: /*?sessionid=
Disallow: /*?token=
Disallow: /*?ref=
Disallow: /*?utm_

# Preview and drafts
Disallow: /*?preview=
Disallow: /*?draft=
For parameters with partially unique content (e.g. a brand filter with a unique brand description) canonical is preferable to Disallow: the bot can visit the page, read the canonical, and pass link equity to the main page. Disallow stops crawling entirely, so the bot never sees the canonical and link signals are not consolidated; a blocked URL that is linked from elsewhere can still appear in the index without its content.
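Wildcard rules are easy to get subtly wrong, so it helps to test them before deploying. Python's built-in `urllib.robotparser` does not evaluate `*` and `$`, so the sketch below implements a simplified matcher based on Google's documented behaviour: `*` matches any sequence, a trailing `$` anchors the end of the URL, the longest matching pattern wins, and Allow wins a tie. `is_allowed` is a hypothetical helper and it ignores user-agent groups.

```python
import re

def _to_regex(pattern: str):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(chunk) for chunk in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(rules, path: str) -> bool:
    """rules: list of ("allow" | "disallow", pattern) pairs.
    The longest matching pattern wins; on a tie, allow wins; no match means allowed."""
    best = (-1, True)  # (pattern length, allowed)
    for kind, pattern in rules:
        if _to_regex(pattern).match(path):
            best = max(best, (len(pattern), kind == "allow"))
    return best[1]

loose = [("disallow", "/*?page="), ("allow", "/*?page=1")]
anchored = [("disallow", "/*?page="), ("allow", "/*?page=1$")]

print(is_allowed(loose, "/catalog?page=10"))     # True: unanchored Allow also matches page=10
print(is_allowed(anchored, "/catalog?page=10"))  # False
print(is_allowed(anchored, "/catalog?page=1"))   # True
# A "?sort=" rule does not match when sort is not the first parameter
print(is_allowed([("disallow", "/*?sort=")], "/catalog?color=red&sort=price"))  # True
```

Running a sample of real URLs from the logs through such a check shows which ones a new rule set would actually block.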

Filters, facets and pagination

Faceted navigation is the most complex case. It creates a combinatorial explosion of URLs: 10 filters with 5 values each potentially generate over a million combinations. The goal is to keep valuable facets (brand + category) and block useless combinations.
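The size of the explosion is simple to work out. The short calculation below uses the figures from the paragraph above (10 filters, 5 values each) and assumes each filter takes at most one value at a time and that parameter order is normalised; if parameter order varies, the URL count is higher still.

```python
filters, values_per_filter = 10, 5

# Every filter set to exactly one of its values
all_filters_set = values_per_filter ** filters

# Each filter is either unset or set to one value; subtract the unfiltered page
any_subset_set = (values_per_filter + 1) ** filters - 1

print(f"{all_filters_set:,}")  # 9,765,625
print(f"{any_subset_set:,}")   # 60,466,175
```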

  1. Identify valuable facets: pages with high organic traffic or unique SEO content (e.g. /catalog/nike/ with brand-specific text) — keep in the index with a self-referencing canonical.
  2. Add noindex + follow to low-value combinations: the page renders (links are followed) but does not enter the index. Use <meta name="robots" content="noindex, follow"> in the template when more than one facet is active.
  3. Disallow purely technical parameters: sorting, colour, size — if they carry no SEO value — block via robots.txt.
  4. AJAX-loaded facets: if facets load via JavaScript without reflecting in the URL, the bot sees no additional URLs. This solves the problem at the architecture level.
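The first two rules can be expressed as a small decision function in the category template. This is a sketch under assumptions, not a definitive policy: `VALUABLE_FACETS` is a hypothetical whitelist, `robots_meta` is a hypothetical helper, and it treats any single facet outside the whitelist as low-value, which is stricter than the "more than one facet" rule above.

```python
# Hypothetical whitelist of facets that have their own SEO content
VALUABLE_FACETS = {"brand"}

def robots_meta(active_facets: dict) -> str:
    """Return the meta robots content value for a category page.

    active_facets maps facet name to selected value, e.g. {"brand": "nike"}.
    """
    if not active_facets:
        return "index, follow"       # plain category page
    if len(active_facets) == 1 and set(active_facets) <= VALUABLE_FACETS:
        return "index, follow"       # single valuable facet, self-canonical
    return "noindex, follow"         # low-value combination

print(robots_meta({}))                                  # index, follow
print(robots_meta({"brand": "nike"}))                   # index, follow
print(robots_meta({"brand": "nike", "color": "red"}))   # noindex, follow
```

Keeping the whitelist in one place makes it easy to promote a facet once it gains unique content or organic traffic.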

Pagination: canonical vs Disallow

The optimal pagination strategy depends on scale. With up to 10 pages, allow crawling and add a canonical pointing to page 1 on pages 2+. With hundreds of pagination pages, Disallow everything beyond page 3–5. Infinite scroll with no URL parameters is treated as a single page by the bot, which is good for budget, provided that items loaded on scroll remain discoverable through the sitemap or internal links.

Temporary pages: webinars, mockups, demos

Temporary pages are a hidden budget drain on platforms with active marketing teams. Webinar landing pages, demo mockups, client preview pages, A/B test variants — they appear fast and often stay forever, silently consuming crawl budget.

  • meta robots noindex: add <meta name="robots" content="noindex, nofollow"> to the template of all temporary pages. The bot visits, reads the directive, and excludes the URL from the index. The simplest method when template access is available.
  • X-Robots-Tag HTTP header: if temporary pages have no dedicated HTML template (e.g. generated PDFs or files), use the HTTP response header: X-Robots-Tag: noindex, nofollow.
  • Disallow by path prefix: if temporary pages live under predictable path prefixes, block them in robots.txt. This does not remove URLs that are already indexed, but new URLs under those prefixes will not be crawled.
  • HTTP authentication: for staging and preview pages this is the only reliable full-isolation method. Googlebot has no credentials and will not crawl the page.
Example robots.txt patterns for temporary pages:
TEXT
# robots.txt — patterns for temporary pages
User-agent: *

# Webinars and events
Disallow: /webinars/
Disallow: /events/draft-
Disallow: /landing/test-

# Mockups and previews
Disallow: /_preview/
Disallow: /mockups/
Disallow: /demo/

# A/B tests and experiments
Disallow: /*?variant=
Disallow: /*?experiment=
Disallow: /*?ab_test=

# Template preview parameters
Disallow: /*?preview=
Disallow: /*?draft=
Disallow: /*?revision=
Disallow does not remove a page from the index if it was already indexed. To delist already-indexed temporary pages, use noindex (the bot must have access to read the directive). Once the page disappears from the index, apply Disallow or set up HTTP authentication.
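During the noindex phase, the directive can be applied centrally instead of template by template. The sketch below shows one way to do it as WSGI middleware that adds an X-Robots-Tag header; `noindex_temporary`, `TEMP_PREFIXES` and the demo app are hypothetical names, and the path prefixes are assumptions. The paths must stay crawlable (not Disallowed) while this is in effect, otherwise the bot never sees the header.

```python
# Assumption: temporary pages live under these hypothetical path prefixes
TEMP_PREFIXES = ("/webinars/", "/mockups/", "/demo/", "/_preview/")

def noindex_temporary(app):
    """WSGI middleware: add X-Robots-Tag: noindex, nofollow on temporary paths."""
    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start(status, headers, exc_info=None):
            if path.startswith(TEMP_PREFIXES):
                headers = list(headers) + [("X-Robots-Tag", "noindex, nofollow")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)
    return wrapped

# Minimal demo application and a helper that captures response headers
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<h1>Page</h1>"]

def response_headers(app, path):
    captured = {}
    def start_response(status, headers, exc_info=None):
        captured["headers"] = dict(headers)
    list(app({"PATH_INFO": path}, start_response))
    return captured["headers"]

app = noindex_temporary(demo_app)
print(response_headers(app, "/webinars/launch"))
print(response_headers(app, "/catalog/shoes"))
```

Once the pages have dropped out of the index, the same prefixes can move to robots.txt or behind HTTP authentication, as described above.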

Sitemap index at scale

With millions of URLs a single XML sitemap is impractical: it is capped at 50,000 URLs and 50 MB uncompressed. A Sitemap Index is a table-of-contents file that points to child sitemaps. It lets you structure priorities and direct budget where it is needed most.

XML
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Priority sections (listed first for maintainers; file order does not set crawl order) -->
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-05-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-categories.xml</loc>
    <lastmod>2026-05-15</lastmod>
  </sitemap>
  <!-- Lower-priority sections -->
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-05-10</lastmod>
  </sitemap>
  <!-- Do NOT include noindex pages or duplicates! -->
</sitemapindex>
Recommendations for sitemap index structure:
  • Split sitemaps by content type: products, categories, blog, static pages — each in a separate file. This lets you see per-section indexation stats in GSC.
  • Keep lastmod honest: lastmod signals freshness to the bot. Set the date of the last real change, not today's date — otherwise the bot stops trusting the signal.
  • Exclude blocked URLs: pages with noindex or Disallow must not appear in the sitemap — they send contradictory signals to the bot.
  • Child sitemap limit: each child file holds at most 50,000 URLs or 50 MB (uncompressed). Split into parts when exceeded: sitemap-products-1.xml, sitemap-products-2.xml.
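The splitting rule can be automated when the index is generated. The sketch below is a minimal illustration using only the standard library: `build_sitemap_index` is a hypothetical helper, the file naming follows the sitemap-products-N.xml pattern above, and for brevity it applies one lastmod to every child file, whereas in practice each child should carry the date of its own last real change.

```python
import xml.etree.ElementTree as ET
from math import ceil

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit from the sitemap protocol

def build_sitemap_index(base_url: str, section: str, url_count: int, lastmod: str) -> str:
    """Return sitemap index XML listing enough child files to hold url_count URLs."""
    root = ET.Element("sitemapindex", xmlns=NS)
    for part in range(1, ceil(url_count / MAX_URLS) + 1):
        node = ET.SubElement(root, "sitemap")
        ET.SubElement(node, "loc").text = f"{base_url}/sitemap-{section}-{part}.xml"
        ET.SubElement(node, "lastmod").text = lastmod  # simplification: shared date
    return ET.tostring(root, encoding="unicode")

# 120,000 product URLs need three child sitemaps
index_xml = build_sitemap_index("https://example.com", "products", 120_000, "2026-05-15")
print(index_xml)
```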

Server configuration and rate limiting

The server side affects crawl budget in both directions: slow responses force Googlebot to reduce crawl intensity, while 5xx errors make the bot wait and waste time on retries. Server optimisation directly improves crawl efficiency.

| HTTP code | Googlebot interpretation | Recommended action |
| --- | --- | --- |
| 200 OK | Page exists; bot crawls and indexes | The norm for all public pages |
| 301/302 | Redirect: bot follows; budget is spent on both URLs | Minimise redirect chains; fix dead links |
| 404 Not Found | Page does not exist; bot stops crawling that URL | Acceptable for deleted pages; > 5% 404 in Googlebot logs = problem |
| 429 Too Many Requests | Temporary throttle: bot slows crawl rate | Use under genuine server load; Googlebot respects 429 |
| 503 Service Unavailable | Temporary unavailability: bot will retry later | Correct response during maintenance; do not abuse |
| 500 Server Error | Server error: bot retries, wasting budget | Eliminate 500 causes; every 5xx burns budget |
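Redirect chains from the 301/302 row can be found offline if the redirect rules are available as a simple mapping. The sketch below assumes such a map has been exported from server config or a crawl; `redirect_chain` and the sample paths are hypothetical. Any chain longer than two entries means more than one hop and is worth collapsing into a direct redirect.

```python
def redirect_chain(redirects: dict, url: str, max_hops: int = 10) -> list:
    """Follow url through a {source: target} map and return every URL visited."""
    chain = [url]
    while chain[-1] in redirects and len(chain) <= max_hops:
        target = redirects[chain[-1]]
        looped = target in chain
        chain.append(target)
        if looped:  # redirect loop: stop after recording the repeated URL
            break
    return chain

# Hypothetical redirect map exported from server config or a crawl
redirects = {
    "/old-shoes": "/shoes-2024",
    "/shoes-2024": "/catalog/shoes",
}

chains = {src: redirect_chain(redirects, src) for src in redirects}
long_chains = {src: c for src, c in chains.items() if len(c) > 2}
print(long_chains)
# {'/old-shoes': ['/old-shoes', '/shoes-2024', '/catalog/shoes']}
```

Here /old-shoes should be pointed straight at /catalog/shoes so that the bot spends one request instead of two before reaching the final page.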
Example Nginx configuration for crawl management:
NGINX
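# Nginx: return 429 when a known crawler sends too many requests.
# Keying the zone on the raw User-Agent would also throttle regular visitors
# who share a browser UA string, so only crawler UAs get a key; requests
# with an empty key are not rate-limited. The crawler list is an example.
map $http_user_agent $crawler_key {
    default "";
    ~*(googlebot|bingbot|yandex) $http_user_agent;
}
limit_req_zone $crawler_key zone=crawlers:10m rate=10r/s;

location / {
    limit_req zone=crawlers burst=20 nodelay;
    limit_req_status 429;
    # ... other settings
}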
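To slow Googlebot down, do not rely on Crawl-delay in robots.txt: Google officially ignores it. The Search Console crawl rate limiter was retired in January 2024; Googlebot now adapts to how the server responds, and temporary 429, 500 or 503 responses reduce its crawl rate. Crawl-delay is still honoured by some other crawlers, such as Bingbot.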

KPIs and monitoring

Crawl budget requires continuous monitoring: any deployment can accidentally expose thousands of new parametric URLs. Set up a weekly dashboard and a monthly review of the following metrics.

Key crawl budget KPIs

Target values for an optimised large platform

  • Share of 200 OK among Googlebot requests: > 90%
  • Share of 404 among Googlebot requests: < 5%
  • Share of duplicate URLs in crawl: < 15%
  • Pages indexed / pages submitted in Sitemap: > 85%

  • Weekly: compare GSC crawl requests for current vs. previous week. A sudden spike signals new duplicates introduced by a recent deployment.
  • Monthly: check GSC → Pages → "Discovered — currently not indexed". Growth with an unchanged site size = crawl budget problem.
  • After every major deployment: run Screaming Frog over the site, check the unique URL count. Compare to the previous snapshot.
  • Quarterly: full log analysis over 30 days. Look for new parametric URL patterns introduced by updated code.
  • Alerts: set a notification if 5xx responses exceed 1% of all Googlebot requests in a single day.
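The alert and the weekly comparison are straightforward to script once the daily counts are available. The sketch below is illustrative only: `crawl_alerts` is a hypothetical helper, the 1% threshold comes from the list above, and the 1.5× week-over-week spike ratio is an assumed default that should be tuned to your own crawl history.

```python
def crawl_alerts(today: dict, last_week_requests: int, this_week_requests: int,
                 spike_ratio: float = 1.5) -> list:
    """Return a list of alert messages.

    today: Googlebot hits per status class for one day, e.g. {"2xx": 9500, "5xx": 120}.
    spike_ratio: assumed threshold for a week-over-week spike.
    """
    alerts = []
    total = sum(today.values())
    if total and today.get("5xx", 0) / total > 0.01:
        alerts.append("5xx share above 1% of Googlebot requests")
    if last_week_requests and this_week_requests / last_week_requests > spike_ratio:
        alerts.append("crawl requests spiked week over week")
    return alerts

print(crawl_alerts({"2xx": 9800, "5xx": 200}, 70_000, 72_000))
# ['5xx share above 1% of Googlebot requests']
print(crawl_alerts({"2xx": 9990, "5xx": 10}, 50_000, 90_000))
# ['crawl requests spiked week over week']
```

The inputs can come from the log analysis in Step 1 or from a manual export of the GSC Crawl Stats report.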

Quick wins and roadmap

Crawl budget optimisation is not a one-off task — it is a process. Divide the work into quick wins (effect in 1–2 weeks) and systemic changes (1–3 months).

Week 1: Block temporary pages and preview parameters

Add noindex to webinar, mockup, and demo templates. Add Disallow for ?preview=, ?draft= in robots.txt.

Impact: 10–30% less budget wasted

Week 2: Disallow for tracking parameters and sessions

Add ?utm_, ?sessionid= and ?ref= to robots.txt. Set a canonical for UTM variants on all templates.

Impact: 5–15% less budget wasted

Month 1: Sitemap index optimisation

Split into thematic sitemaps. Remove noindex pages from the sitemap. Add honest lastmod.

Impact: improved indexation coverage

Months 1–2: Canonical and noindex for facets

Identify valuable vs. junk facet combinations. Implement noindex + follow for low-value combinations.

Impact: 20–50% less budget wasted

Months 2–3: Server response optimisation

Eliminate redirect chains. Reduce 404 count from logs. Optimise TTFB for crawled sections.

Impact: higher crawl rate

Ongoing: Monitoring and regular review

GSC dashboard, 5xx alerts, quarterly log analysis. Check after every major deployment.

Impact: regression prevention

Quick wins checklist

  • noindex added to all temporary page templates (webinars, mockups, demos)
  • Disallow for ?preview=, ?draft=, ?sessionid=, ?utm_ in robots.txt
  • Canonical pointing to clean URLs on all parametric pages (sorting, pagination 2+)
  • noindex and Disallow pages removed from the Sitemap
  • Sitemap index split by section with accurate lastmod
  • Redirect chains reduced to a single hop
  • 404 in Googlebot logs < 5% — dead links removed
  • Weekly monitoring dashboard configured in GSC

FAQ

At what site size does crawl budget start to matter?
Google recommends paying attention when the site exceeds 1 million URLs. In practice, issues appear earlier, at 50–100k URLs with active parametric page generation. If "Discovered – currently not indexed" is growing in GSC, it is time for an audit regardless of site size.

Which matters more for crawling: the Sitemap or internal links?
Both matter and complement each other. The Sitemap ensures the bot learns about page existence. Internal links pass PageRank and signal relative importance. A page with no internal links and only a Sitemap entry will be crawled less often than a page with hundreds of internal links.

Does a CDN affect crawl budget?
Yes, indirectly. A CDN speeds up server responses, allowing Googlebot to crawl more pages per unit of time at the same crawl rate limit. Fast TTFB is one of the factors that influences crawl intensity.

Does canonical save crawl budget?
Canonical does not block page crawling: the bot may still visit the page. However, after reading the canonical the bot understands the page is a duplicate and reduces its re-crawl frequency. Canonical solves the indexation duplicate problem but does not fully solve the crawl problem. Disallow is the directive that actually stops crawling; noindex still requires a crawl to be read, although noindexed pages tend to be revisited less often over time.

Can indexation of new pages be sped up after optimisation?
Yes. After freeing up budget, update the Sitemap with correct lastmod for new pages. Create internal links from authoritative sections to new pages. Google's Indexing API allows direct URL submission, but it is officially supported only for pages with JobPosting or BroadcastEvent (livestream) structured data; some practitioners report results for other content types, which Google does not support.