Crawl Budget: a practical guide for large platforms

Step-by-step crawl budget audit: log analysis, Google Search Console, sources of crawl waste (parametric URLs, facets, temporary pages), remediation measures, Sitemap index setup, rate limiting, KPIs, and a roadmap for Dev/SEO teams.

Crawl budget is the number of pages Googlebot is willing to crawl on your site within a given period. On small sites it is rarely a constraint. On platforms with millions of URLs it becomes a strategic resource: the bot spends it on filters, pagination, empty templates and staging pages instead of crawling priority content. The result is delayed indexation of new products, weak catalogue coverage, and lost rankings.

Diagram: Googlebot distributes crawl budget between valuable URLs and budget-wasting pages — Where the crawl budget goes: valuable pages compete with filters, UTM parameters and pagination for the bot's attention.

Google officially distinguishes two concepts: crawl rate limit, called the crawl capacity limit in current documentation (how often the bot can fetch without overloading the server), and crawl demand (how many pages Google wants to crawl based on their perceived value). Crawl budget is the set of URLs Googlebot both can and wants to crawl. Optimisation pays off mainly on large platforms.

Value	Indicator	What it means
5–10×	Budget overspend	Typical ratio of crawled URLs to genuinely valuable pages on an unoptimised e-commerce platform
60–80%	Lost to duplicates	Share of budget consumed by parametric URLs and pagination when no mitigation is in place
2–4 wks	Indexation delay	Typical delay before a new SKU appears in the index when the budget is saturated
>50k	Problem threshold	Platforms with more than 50,000 URLs already feel the effects of crawl budget constraints

Step 1: collecting and analysing crawler logs

Web server logs are the primary source of truth about what Googlebot actually crawls. Log analysis reveals the real request frequency, which sections the bot ignores or gets stuck on, and what HTTP responses the server returns. Without logs, any optimisation is guesswork.

Extracting Googlebot requests from Nginx / Apache

BASH

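# Top 50 URLs requested by Googlebot. Assumes the default "combined" log format,
# where field 7 is the request path and field 9 is the HTTP status code.
# This reads the current log file only; to cover the last 7 days, include rotated logs too.
# The user-agent string can be spoofed: verify suspicious IPs via reverse DNS.
grep 'Googlebot' /var/log/nginx/access.log \
  | awk '{print $7}' \
  | sort | uniq -c | sort -rn \
  | head -50

# HTTP status codes seen by Googlebot
grep 'Googlebot' /var/log/nginx/access.log \
  | awk '{print $9}' \
  | sort | uniq -c | sort -rn

# Top 30 URLs returning 404 to Googlebot
grep 'Googlebot' /var/log/nginx/access.log \
  | awk '$9 == "404" {print $7}' \
  | sort | uniq -c | sort -rn | head -30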

For structured analysis of large log volumes use dedicated tools: Screaming Frog Log File Analyser, Semrush Log File Analyzer, or an ELK stack (Elasticsearch, Logstash, Kibana). They group URLs by pattern and plot crawl rate over time.

Metric	What it shows	Target range
Crawl rate (req/day)	Crawl intensity — how many pages Googlebot requests per day	Platform-dependent; track the trend
Pages crawled/day	Unique URLs crawled per day — the primary crawl budget KPI	Should rise after optimisation
200 / (301+302) / 404 / 500 ratio	Share of each HTTP response type across all bot requests	200 > 90%; 404 < 5%; 500 ≈ 0%
% duplicate URLs in crawl	Share of parametric and duplicate URLs in the log	< 15% after optimisation
New pages / total crawled	Ratio of first-time crawls to repeat visits	Aim to increase the share of new pages
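
The metrics in the table can be approximated straight from raw logs before committing to a dedicated tool. The sketch below is a minimal illustration, assuming the combined log format and treating any URL containing `?` as parametric; the function name `crawl_kpis` and the regular expression are illustrative, not part of any tool mentioned above.

```python
import re
from collections import Counter

# Combined log format: ... "GET /path HTTP/1.1" 200 1234 "referer" "user-agent"
LOG_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def crawl_kpis(lines):
    """Summarise Googlebot requests: totals, unique URLs, status and parametric shares."""
    statuses = Counter()
    urls = set()
    parametric = 0
    for line in lines:
        match = LOG_RE.search(line)
        # Skip unparseable lines and anything that is not Googlebot by user agent.
        if not match or "Googlebot" not in match.group("ua"):
            continue
        statuses[match.group("status")] += 1
        urls.add(match.group("url"))
        if "?" in match.group("url"):  # assumption: any query string counts as parametric
            parametric += 1
    total = sum(statuses.values())
    if total == 0:
        return {"requests": 0, "unique_urls": 0, "status_share": {}, "parametric_share": 0.0}
    return {
        "requests": total,
        "unique_urls": len(urls),
        "status_share": {code: n / total for code, n in statuses.items()},
        "parametric_share": parametric / total,
    }
```

Feeding it one day of log lines gives the request count, the unique-URL count and the response-code split for that day; running it per day and comparing results gives the trend.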

Step 2: Google Search Console and server response analysis

Google Search Console provides aggregated crawl data without requiring access to server logs. The Crawl Stats report is the key entry point. It shows total crawl requests, total download size, average response time, and the distribution of response codes.

GSC → Settings → Crawl Stats: crawl request graph, breakdown by file type and response code. Look for 404/500 spikes and drops after deployments.
GSC → Indexing → Pages: number of indexed URLs vs. reasons for non-indexation. "Discovered — currently not indexed" is a direct crawl budget signal.
GSC → URL Inspection: manual status check for a specific page — when it was last crawled, what Googlebot saw.
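Crawl rate control: the legacy crawl rate limiter in GSC was retired in early 2024. If the bot is overloading the server, temporarily return 429 or 503 instead; Googlebot slows down automatically. Normally not needed: only throttle when you have measurable performance problems.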

"Discovered — currently not indexed" in GSC is the most direct signal of a crawl budget shortage. Google knows the pages exist (found via links or in the sitemap) but cannot crawl them in time. The immediate priority is freeing budget from junk URLs.

Sources of crawl budget waste

On a large platform the bot loses budget to six typical sources. Identify each in your logs and estimate the volume — this gives you the priority order for optimisation.

Crawl budget: how it's spent and how to save it — Comparison of wasted vs optimized crawl budget allocation.

Parametric URLs

?sort=price&filter=red&page=3 generates an exponential number of unique URLs with identical or near-identical content.

Infinite pagination

/catalog?page=N — thousands of URLs with diminishing value. The bot can endlessly crawl old catalogue pages.

Duplicate content

Identical pages on different URLs: /category/ and /category?utm_source=email, HTTP vs. HTTPS, www vs. non-www.

Temporary pages

Webinar landing pages, demo mockups, ?preview=1 pages, drafts — they appear and disappear but the bot keeps spending budget on them.

Empty templates

Category pages with no products, profiles with no content, auto-generated tag pages — low-quality content generated by the thousands.

Session parameters

?sessionid=abc123, ?token=xyz — unique URLs per session. Each such URL is a "new page" to the bot.
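
To estimate the volume of each source, crawled URLs from the logs can be bucketed by pattern. The sketch below is one possible classifier, assuming the parameter names used as examples in this section; the sets and the function name `classify_url` are illustrative and would need adapting to a real platform. Empty templates cannot be recognised from the URL alone and are left out.

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: parameter names taken from the examples in this guide.
SESSION_PARAMS = {"sessionid", "token"}
TEMP_PARAMS = {"preview", "draft"}
FACET_PARAMS = {"sort", "filter", "color", "size"}


def classify_url(url):
    """Return the waste category of a crawled URL, or 'clean' if none applies."""
    names = set(parse_qs(urlsplit(url).query, keep_blank_values=True))
    if names & SESSION_PARAMS:
        return "session"
    if names & TEMP_PARAMS:
        return "temporary"
    if any(name.startswith("utm_") for name in names):
        return "tracking-duplicate"
    if names & FACET_PARAMS:
        return "parametric"
    if "page" in names:
        return "pagination"
    return "clean"
```

Counting the categories over a week of Googlebot requests, for example with `collections.Counter(classify_url(u) for u in urls)`, gives the priority order for remediation.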

URL parameters and duplicates: remediation

Parametric URLs are the biggest source of waste on e-commerce platforms. There is no single silver bullet — the right solution depends on the nature of the parameter.

Parameter type	Example	Recommended action
Sorting / ordering	?sort=price_asc	Disallow in robots.txt or canonical pointing to the parameter-free URL
Filters with no unique content	?color=red&size=XL	Canonical to the clean category page, or noindex via meta robots (pick one: combined, they send conflicting signals)
Pagination	?page=5	Self-referencing canonical on pages 2+ (Google's pagination guidance advises against pointing them at page 1); Disallow for deep pagination
Tracking / UTM	?utm_source=email	Canonical to the UTM-free URL on every page (the GSC URL Parameters tool was retired in 2022 and is no longer an option)
Sessions / tokens	?sessionid=abc	Disallow: /*?sessionid= in robots.txt; fix URL generation in code
Preview / draft	?preview=1&draft=true	Disallow: /*?preview= and /*?draft= in robots.txt

Example robots.txt rules for blocking parametric URLs:

TEXT

# robots.txt — blocking parametric URLs
User-agent: *

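# Sort and filter parameters
# (each pattern matches only when the parameter comes first; add /*&sort= style rules for later positions)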
Disallow: /*?sort=
Disallow: /*?order=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /*?size=

# Pagination deeper than page 3 (Allow/Disallow approach)
# The trailing $ stops "page=1" from also matching page=10, page=11, ...
Disallow: /*?page=
Allow: /*?page=1$
Allow: /*?page=2$
Allow: /*?page=3$

# Session and tracking parameters
Disallow: /*?sessionid=
Disallow: /*?token=
Disallow: /*?ref=
Disallow: /*?utm_

# Preview and drafts
Disallow: /*?preview=
Disallow: /*?draft=
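
Wildcard rules are easy to get subtly wrong, so it helps to test them against sample URLs before deploying. The sketch below is a simplified matcher, assuming the precedence Google documents for robots.txt (the longest matching pattern wins, and Allow wins a tie); it is not a full robots.txt parser, and `is_allowed` is a hypothetical helper name.

```python
import re


def _to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, trailing '$' anchor) to a regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(rules, path):
    """rules: list of ('allow' | 'disallow', pattern) pairs for one user agent."""
    best = None  # (pattern length, is_allow); tuple order makes allow win ties
    for kind, pattern in rules:
        if _to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


# An unanchored Allow for page=1 also lets page=10 through.
loose = [("disallow", "/*?page="), ("allow", "/*?page=1")]
strict = [("disallow", "/*?page="), ("allow", "/*?page=1$")]
print(is_allowed(loose, "/catalog?page=10"))   # True
print(is_allowed(strict, "/catalog?page=10"))  # False
```

The same helper shows that a pattern such as `/*?sort=` does not match `/catalog?page=2&sort=price`, because the parameter is not the first one in the query string.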

For parameters with partially unique content (e.g. a brand filter with a unique brand description) canonical is preferable to Disallow: the bot can visit the page, read the canonical, and consolidate link signals on the main page. Disallow blocks crawling entirely, so the bot never sees the canonical, and a blocked URL that is linked from elsewhere can still appear in the index without its content.
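
The canonical target itself is usually computed in the page template. A minimal sketch, assuming the parameter names from the table above are the ones to drop and that every other parameter (such as a brand filter) carries content worth keeping; `STRIP_PARAMS` and `canonical_url` are illustrative names.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: parameters that never change page content on this hypothetical site.
STRIP_PARAMS = {"sort", "order", "sessionid", "token", "ref", "preview", "draft"}


def canonical_url(url):
    """Drop tracking, session and sort parameters; sort the rest for a stable URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in STRIP_PARAMS and not key.startswith("utm_")
    ]
    kept.sort()  # same parameters in a different order map to one canonical URL
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_url("https://example.com/catalog/?utm_source=email&sort=price_asc"))
# https://example.com/catalog/
```

Sorting the remaining parameters means `?size=XL&brand=nike` and `?brand=nike&size=XL` resolve to a single canonical URL instead of two.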

Filters, facets and pagination

Faceted navigation is the most complex case. It creates a combinatorial explosion of URLs: 10 filters with 5 values each potentially generate over a million combinations. The goal is to keep valuable facets (brand + category) and block useless combinations.

Identify valuable facets: pages with high organic traffic or unique SEO content (e.g. /catalog/nike/ with brand-specific text) — keep in the index with a self-referencing canonical.
Add noindex + follow to low-value combinations: the page renders (links are followed) but does not enter the index. Use <meta name="robots" content="noindex, follow"> in the template when more than one facet is active.
Disallow purely technical parameters: sorting, colour, size — if they carry no SEO value — block via robots.txt.
AJAX-loaded facets: if facets load via JavaScript without reflecting in the URL, the bot sees no additional URLs. This solves the problem at the architecture level.
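
The scale of the problem and the noindex rule can be sketched in a few lines. The example assumes each filter is either unset or set to exactly one value, and that `brand` is the only facet judged valuable; both are illustrative assumptions, as are the function names.

```python
def facet_combinations(filters, values_per_filter):
    """Count URL states when each filter is either unset or set to one of its values."""
    return (values_per_filter + 1) ** filters


def robots_meta(active_facets, valuable=frozenset({"brand"})):
    """Index pages with no facets or a single valuable facet; noindex, follow the rest."""
    active = set(active_facets)
    if not active or (len(active) == 1 and active <= valuable):
        return "index, follow"
    return "noindex, follow"


print(facet_combinations(10, 5))        # 60466176
print(robots_meta(["brand"]))           # index, follow
print(robots_meta(["brand", "color"]))  # noindex, follow
```

Ten filters with five values each give roughly 60 million URL states under these assumptions, which is why the rule has to live in the template rather than in a hand-maintained list.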

Pagination: canonical vs Disallow

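The optimal pagination strategy depends on scale. Up to about 10 pages: allow crawling and give each page a self-referencing canonical; Google's pagination guidance advises against canonicalising pages 2+ to page 1, because products linked only from deeper pages then risk being overlooked. For hundreds of pagination pages: Disallow beyond page 3–5, and keep deep products reachable through the sitemap and internal links. Infinite scroll with no URL parameters is treated as a single page by the bot, which is good for budget.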

Temporary pages: webinars, mockups, demos

Temporary pages are a hidden budget drain on platforms with active marketing teams. Webinar landing pages, demo mockups, client preview pages, A/B test variants — they appear fast and often stay forever, silently consuming crawl budget.

meta robots noindex: add <meta name="robots" content="noindex, nofollow"> to the template of all temporary pages. The bot visits, reads the directive, and excludes the URL from the index. The simplest method when template access is available.
X-Robots-Tag HTTP header: if temporary pages have no dedicated HTML template (e.g. generated PDFs or files), use the HTTP response header: X-Robots-Tag: noindex, nofollow.
Disallow by path prefix: if temporary pages live under predictable path prefixes, block them in robots.txt. This does not remove URLs that are already indexed, but new URLs will not be crawled.
HTTP authentication: for staging and preview pages this is the only reliable full-isolation method. Googlebot has no credentials and will not crawl the page.
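
When the header route is chosen, it is often easiest to add it in one place at the application edge. The sketch below is a minimal WSGI middleware, assuming temporary pages live under the hypothetical prefixes listed in `TEMP_PREFIXES`; it illustrates the idea and is not tied to any particular framework.

```python
# Assumption: illustrative path prefixes for temporary pages.
TEMP_PREFIXES = ("/webinars/", "/_preview/", "/mockups/", "/demo/")


def robots_headers(path):
    """Return extra response headers for a request path; empty for regular pages."""
    if path.startswith(TEMP_PREFIXES):
        return {"X-Robots-Tag": "noindex, nofollow"}
    return {}


def noindex_middleware(app):
    """Wrap a WSGI app so responses for temporary paths carry X-Robots-Tag."""
    def wrapped(environ, start_response):
        extra = list(robots_headers(environ.get("PATH_INFO", "")).items())

        def patched_start(status, headers, exc_info=None):
            # Append the header to whatever the wrapped app already set.
            return start_response(status, list(headers) + extra, exc_info)

        return app(environ, patched_start)
    return wrapped
```

Because the header is set on the response rather than in HTML, the same rule also covers PDFs and other generated files served under those prefixes.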

Example robots.txt patterns for temporary pages:

TEXT

# robots.txt — patterns for temporary pages
User-agent: *

# Webinars and events
Disallow: /webinars/
Disallow: /events/draft-
Disallow: /landing/test-

# Mockups and previews
Disallow: /_preview/
Disallow: /mockups/
Disallow: /demo/

# A/B tests and experiments
Disallow: /*?variant=
Disallow: /*?experiment=
Disallow: /*?ab_test=

# Template preview parameters
Disallow: /*?preview=
Disallow: /*?draft=
Disallow: /*?revision=

Disallow does not remove a page from the index if it was already indexed. To delist already-indexed temporary pages, use noindex (the bot must have access to read the directive). Once the page disappears from the index, apply Disallow or set up HTTP authentication.

Sitemap index at scale

With millions of URLs a single XML sitemap is impractical: it is capped at 50,000 URLs and 50 MB. A Sitemap Index is a table-of-contents file that points to child sitemaps. It lets you structure priorities and direct budget where it is needed most.

XML

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Priority sections (listed first for readability; file order does not set crawl priority) -->
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-05-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-categories.xml</loc>
    <lastmod>2026-05-15</lastmod>
  </sitemap>
  <!-- Lower-priority sections -->
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-05-10</lastmod>
  </sitemap>
  <!-- Do NOT include noindex pages or duplicates! -->
</sitemapindex>

Recommendations for sitemap index structure:

Split sitemaps by content type: products, categories, blog, static pages — each in a separate file. This lets you see per-section indexation stats in GSC.
Keep lastmod honest: lastmod signals freshness to the bot. Set the date of the last real change, not today's date — otherwise the bot stops trusting the signal.
Exclude blocked URLs: pages with noindex or Disallow must not appear in the sitemap — they send contradictory signals to the bot.
Child sitemap limit: each child file holds at most 50,000 URLs or 50 MB (uncompressed). Split into parts when exceeded: sitemap-products-1.xml, sitemap-products-2.xml.
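
Splitting by the 50,000-URL limit is straightforward to automate. The sketch below builds a sitemap index for one section using only the standard library; the function names and the single `lastmod` argument are simplifying assumptions (a real generator would compute `lastmod` per child file from the latest genuine change) and the child files themselves are not written.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit from the sitemap protocol


def chunk(urls, size=MAX_URLS):
    """Split a URL list into consecutive parts of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(base_url, section, urls, lastmod):
    """Return (index_xml, child_file_names) for one section of the site."""
    root = ET.Element("sitemapindex", xmlns=NS)
    names = []
    for number, _part in enumerate(chunk(urls), start=1):
        name = f"sitemap-{section}-{number}.xml"
        node = ET.SubElement(root, "sitemap")
        ET.SubElement(node, "loc").text = f"{base_url}/{name}"
        ET.SubElement(node, "lastmod").text = lastmod
        names.append(name)
    return ET.tostring(root, encoding="unicode"), names
```

For 120,000 product URLs this yields three child entries, `sitemap-products-1.xml` through `sitemap-products-3.xml`, each within the limit.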

Server configuration and rate limiting

The server side affects crawl budget in both directions: slow responses force Googlebot to reduce crawl intensity, while 5xx errors make the bot wait and waste time on retries. Server optimisation directly improves crawl efficiency.

HTTP code	Googlebot interpretation	Recommended action
200 OK	Page exists; bot crawls and indexes	The norm for all public pages
301/302	Redirect — bot follows; budget is spent on both URLs	Minimise redirect chains; fix dead links
404 Not Found	Page does not exist; bot recrawls that URL less and less often	Acceptable for deleted pages; > 5% 404 in Googlebot logs = problem
429 Too Many Requests	Temporary throttle — bot slows crawl rate	Use under genuine server load; Googlebot respects 429
503 Service Unavailable	Temporary unavailability — bot will retry later	Correct response during maintenance; do not abuse
500 Server Error	Server error — bot retries, wasting budget	Eliminate 500 causes; every 5xx burns budget
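
Redirect chains are the easiest item in this table to fix mechanically. The sketch below assumes a redirect map exported from the server or CMS configuration as a plain `{source: target}` dictionary (a hypothetical input format) and rewrites every source to point directly at its final destination.

```python
def redirect_chain(redirects, url, max_hops=10):
    """Follow a {source: target} redirect map; return the list of URLs visited."""
    chain = [url]
    while url in redirects and len(chain) <= max_hops:
        url = redirects[url]
        looped = url in chain
        chain.append(url)
        if looped:  # redirect loop: stop instead of cycling forever
            break
    return chain


def flatten(redirects):
    """Point every source directly at its final target so each redirect is one hop."""
    return {src: redirect_chain(redirects, src)[-1] for src in redirects}


rules = {"/a": "/b", "/b": "/c", "/old": "/new"}
print(redirect_chain(rules, "/a"))  # ['/a', '/b', '/c']
print(flatten(rules))               # {'/a': '/c', '/b': '/c', '/old': '/new'}
```

Any chain longer than two entries is a candidate for flattening; loops surface as a chain whose last URL already appears earlier in the list and need a manual fix.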

Example Nginx configuration for crawl management:

NGINX

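# Nginx: return 429 when a known crawler sends too many requests.
# Keying the zone on $http_user_agent alone would also throttle every visitor
# who shares a browser UA string, so non-crawlers get an empty key (not limited).
# Place map and limit_req_zone in the http context.
map $http_user_agent $crawler_key {
    default                        "";
    ~*(googlebot|bingbot|yandex)   $http_user_agent;  # illustrative crawler list
}

limit_req_zone $crawler_key zone=crawlers:10m rate=10r/s;

location / {
    limit_req zone=crawlers burst=20 nodelay;
    limit_req_status 429;
    # ... other settings
}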

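Google ignores the Crawl-delay directive in robots.txt, and the crawl rate limiter setting in Search Console was retired in early 2024. To slow Googlebot during genuine overload, temporarily return 429 or 503; Googlebot lowers its crawl rate automatically and recovers once responses are healthy again. Crawl-delay is still honoured by Bing; Yandex announced in 2018 that it no longer follows the directive.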

KPIs and monitoring

Crawl budget requires continuous monitoring: any deployment can accidentally expose thousands of new parametric URLs. Set up a weekly dashboard and a monthly review of the following metrics.

Key crawl budget KPIs

Target values for an optimised large platform

Weekly: compare GSC crawl requests for current vs. previous week. A sudden spike signals new duplicates introduced by a recent deployment.
Monthly: check GSC → Pages → "Discovered — currently not indexed". Growth with an unchanged site size = crawl budget problem.
After every major deployment: run Screaming Frog over the site, check the unique URL count. Compare to the previous snapshot.
Quarterly: full log analysis over 30 days. Look for new parametric URL patterns introduced by updated code.
Alerts: set a notification if 5xx responses exceed 1% of all Googlebot requests in a single day.
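
The 5xx alert and the weekly comparison can share one small check. The sketch below assumes daily status counts and weekly request totals are already extracted from logs or GSC exports; the 1% limit comes from the list above, while the 1.5× spike ratio is an illustrative default, since no threshold for a sudden spike is specified here.

```python
def crawl_alerts(day_counts, this_week, last_week, spike_ratio=1.5, max_5xx_share=0.01):
    """Return alert messages for one day of Googlebot traffic.

    day_counts: {status_code: request_count}; this_week / last_week: total requests.
    """
    found = []
    total = sum(day_counts.values())
    errors = sum(n for code, n in day_counts.items() if 500 <= int(code) <= 599)
    if total and errors / total > max_5xx_share:
        found.append(f"5xx share {errors / total:.1%} exceeds {max_5xx_share:.0%}")
    # Assumption: a week-over-week rise above spike_ratio counts as a sudden spike.
    if last_week and this_week / last_week > spike_ratio:
        found.append(f"crawl requests up {this_week / last_week:.1f}x week over week")
    return found


print(crawl_alerts({"200": 980, "503": 20}, 70_000, 65_000))
# ['5xx share 2.0% exceeds 1%']
```

An empty list means nothing crossed a threshold, so the function can run from a daily scheduled job and notify only when it returns messages.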

Quick wins and roadmap

Crawl budget optimisation is not a one-off task — it is a process. Divide the work into quick wins (effect in 1–2 weeks) and systemic changes (1–3 months).

Week 1: Block temporary pages and preview parameters

Add noindex to webinar, mockup, and demo templates. Add Disallow for ?preview=, ?draft= in robots.txt.

Impact: −10–30% of budget

Week 2: Disallow for tracking parameters and sessions

?utm_, ?sessionid=, ?ref= — to robots.txt. Canonical for UTM variants on all templates.

Impact: −5–15% of budget

Month 1: Sitemap index optimisation

Split into thematic sitemaps. Remove noindex pages from the sitemap. Add honest lastmod.

Impact: improved indexation coverage

Months 1–2: Canonical and noindex for facets

Identify valuable vs. junk facet combinations. Implement noindex + follow for low-value combinations.

Impact: −20–50% of budget

Months 2–3: Server response optimisation

Eliminate redirect chains. Reduce 404 count from logs. Optimise TTFB for crawled sections.

Impact: higher crawl rate

Ongoing: Monitoring and regular review

GSC dashboard, 5xx alerts, quarterly log analysis. Check after every major deployment.

Impact: regression prevention

Quick wins checklist

noindex added to all temporary page templates (webinars, mockups, demos)
Disallow for ?preview=, ?draft=, ?sessionid=, ?utm_ in robots.txt
Canonical pointing to clean URLs on parametric pages (sorting, tracking); self-referencing canonical on pagination pages 2+
noindex and Disallow pages removed from the Sitemap
Sitemap index split by section with accurate lastmod
Redirect chains reduced to a single hop
404 in Googlebot logs < 5% — dead links removed
Weekly monitoring dashboard configured in GSC

FAQ

At what site size does crawl budget become a problem?

Google recommends paying attention when the site exceeds 1 million URLs. In practice, issues appear earlier, at 50–100k URLs with active parametric page generation. If "Discovered – currently not indexed" is growing in GSC, it is time for an audit regardless of site size.

Which matters more for crawling: the Sitemap or internal links?

Both matter and complement each other. The Sitemap ensures the bot learns about page existence. Internal links pass PageRank and signal relative importance. A page with no internal links and only a Sitemap entry will be crawled less often than a page with hundreds of internal links.

Does a CDN affect crawl budget?

Yes, indirectly. A CDN speeds up server responses, allowing Googlebot to crawl more pages per unit of time at the same crawl rate limit. Fast TTFB is one of the factors that influences crawl intensity.

Does a canonical tag save crawl budget?

Canonical does not block page crawling: the bot may still visit the page. After reading the canonical it treats the page as a duplicate and tends to recrawl it less often. Canonical solves the duplicate indexation problem but only partly solves the crawl problem. Disallow is the only directive that stops crawling outright; noindex pages must still be fetched for the directive to be read, though they are recrawled less often over time.

Can indexation of new pages be sped up after optimisation?

Yes. After freeing up budget, update the Sitemap with correct lastmod for new pages and create internal links from authoritative sections to them. Google's Indexing API allows direct URL submission, but Google documents it only for job posting and livestream pages and does not support it for other content types, so do not build a product indexation workflow around it.