Technical SEO
Crawl Budget: a practical guide for large platforms

Step-by-step crawl budget audit: log analysis, Google Search Console, sources of crawl waste (parametric URLs, facets, temporary pages), remediation measures, Sitemap index setup, rate limiting, KPIs, and a roadmap for Dev/SEO teams.
Crawl budget is the number of pages Googlebot is willing to crawl on your site within a given period. On small sites it is virtually unlimited. On platforms with millions of URLs it becomes a strategic resource: the bot wastes it on filters, pagination, empty templates and staging pages instead of crawling priority content. The result is delayed indexation of new products, weak catalogue coverage, and lost rankings.
- Budget overspend: typical ratio of crawled URLs to genuinely valuable pages on an unoptimised e-commerce platform
- Lost to duplicates: share of budget consumed by parametric URLs and pagination when no mitigation is in place
- Indexation delay: typical delay before a new SKU appears in the index when the budget is saturated
- Problem threshold: platforms with more than 50,000 URLs already feel the effects of crawl budget constraints
Step 1: collecting and analysing crawler logs
Web server logs are the primary source of truth about what Googlebot actually crawls. Log analysis reveals the real request frequency, which sections the bot ignores or gets stuck on, and what HTTP responses the server returns. Without logs, any optimisation is guesswork.
Extracting Googlebot requests from Nginx / Apache
# Top 50 URLs requested by Googlebot (run against logs covering the last 7 days)
grep 'Googlebot' /var/log/nginx/access.log \
| awk '{print $7}' \
| sort | uniq -c | sort -rn \
| head -50
# HTTP status codes seen by Googlebot
grep 'Googlebot' /var/log/nginx/access.log \
| awk '{print $9}' \
| sort | uniq -c | sort -rn
# Top URLs returning 404 to Googlebot
grep 'Googlebot' /var/log/nginx/access.log \
| awk '$9 == "404" {print $7}' \
| sort | uniq -c | sort -rn | head -30
For structured analysis of large log volumes, use dedicated tools: Screaming Frog Log File Analyser, Semrush Log Analyser, or an ELK stack (Elasticsearch + Kibana). They automatically group URLs by pattern and plot crawl rate over time.
| Metric | What it shows | Target range |
|---|---|---|
| Crawl rate (req/day) | Crawl intensity — how many pages Googlebot requests per day | Platform-dependent; track the trend |
| Pages crawled/day | Unique URLs crawled per day — the primary crawl budget KPI | Should rise after optimisation |
| 200 / (301+302) / 404 / 500 ratio | Share of each HTTP response type across all bot requests | 200 > 90%; 404 < 5%; 500 ≈ 0% |
| % duplicate URLs in crawl | Share of parametric and duplicate URLs in the log | < 15% after optimisation |
| New pages / total crawled | Ratio of first-time crawls to repeat visits | Aim to increase the share of new pages |
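As a rough illustration of the response-code metric in the table, the sketch below computes the share of each status class among Googlebot requests. It assumes the default combined log format; the `LOG_LINE` pattern, the function name and the sample lines are illustrative and not taken from any particular tool. Matching on the User-Agent string alone also counts spoofed bots, so treat the result as an approximation.

```python
import re
from collections import Counter

# Assumed combined log format:
# ip - - [time] "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_status_shares(lines):
    """Return {status_class: share} for requests whose User-Agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            status = match.group("status")
            # Group redirects and server errors; keep 200 and 404 as separate keys.
            key = {"3": "3xx", "5": "5xx"}.get(status[0], status)
            counts[key] += 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE = [
    f'66.249.66.1 - - [10/May/2026:10:00:00 +0000] "GET /catalog/nike/ HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/May/2026:10:00:01 +0000] "GET /catalog/ HTTP/1.1" 200 7300 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/May/2026:10:00:02 +0000] "GET /old-page HTTP/1.1" 404 312 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/May/2026:10:00:03 +0000] "GET /category HTTP/1.1" 301 0 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.7 - - [10/May/2026:10:00:04 +0000] "GET /cart HTTP/1.1" 500 0 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

shares = googlebot_status_shares(SAMPLE)
print(shares)
```

In practice the input would be the lines of the access log rather than a hard-coded sample, and the resulting shares can be compared against the target ranges in the table.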
Step 2: Google Search Console and server response analysis
Google Search Console provides aggregated crawl data without requiring access to server logs. The Crawl Stats report is the key entry point: it shows total crawl requests over time, average response time, and the distribution of response codes.
- GSC → Settings → Crawl Stats: crawl request graph, breakdown by file type and response code. Look for 404/500 spikes and drops after deployments.
- GSC → Indexing → Pages: number of indexed URLs vs. reasons for non-indexation. "Discovered — currently not indexed" is a direct crawl budget signal.
- GSC → URL Inspection: manual status check for a specific page — when it was last crawled, what Googlebot saw.
- Crawl rate: the legacy Crawl Rate setting in Search Console was retired in January 2024. If the bot is overloading the server, temporarily return 503 or 429 instead. Normally this is not needed; only restrict crawling when you have measurable performance problems.
Sources of crawl budget waste
On a large platform the bot loses budget to six typical sources. Identify each in your logs and estimate the volume — this gives you the priority order for optimisation.
- Parametric URLs: ?sort=price&filter=red&page=3 generates an exponential number of unique URLs with identical or near-identical content.
- Infinite pagination: /catalog?page=N produces thousands of URLs with diminishing value. The bot can endlessly crawl old catalogue pages.
- Duplicate content: identical pages on different URLs, such as /category/ and /category?utm_source=email, HTTP vs. HTTPS, www vs. non-www.
- Temporary pages: webinar landing pages, demo mockups, ?preview=1 pages, drafts. They appear and disappear, but the bot keeps spending budget on them.
- Empty templates: category pages with no products, profiles with no content, auto-generated tag pages. Low-quality content generated by the thousands.
- Session parameters: ?sessionid=abc123, ?token=xyz create unique URLs per session. Each such URL is a "new page" to the bot.
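To estimate the volume of each source from a list of crawled URLs, a simple pattern-based classifier is usually enough for a first pass. The sketch below is one possible approach; the parameter names are the examples used in this guide and should be replaced with the ones that appear in your own logs. Empty templates cannot be recognised from the URL alone, so they are not covered here.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Illustrative parameter names; adjust to your platform.
SESSION_PARAMS = {"sessionid", "token"}
TEMPORARY_PARAMS = {"preview", "draft"}
SORT_FILTER_PARAMS = {"sort", "filter"}

def waste_source(url):
    """Return a waste category for the URL, or None if no known pattern matches."""
    params = set(parse_qs(urlsplit(url).query, keep_blank_values=True))
    if params & SESSION_PARAMS:
        return "session"
    if params & TEMPORARY_PARAMS:
        return "temporary"
    if any(name.startswith("utm_") for name in params):
        return "duplicate"
    if params & SORT_FILTER_PARAMS:
        return "parametric"
    if "page" in params:
        return "pagination"
    return None

crawled = [
    "/catalog?sort=price&filter=red&page=3",
    "/catalog?page=7",
    "/category?utm_source=email",
    "/product/42?sessionid=abc123",
    "/landing?preview=1",
    "/catalog/nike/",
]
volume = Counter(waste_source(url) for url in crawled)
print(volume)
```

The order of the checks decides which category wins when a URL matches several patterns; here session and preview parameters take priority because they are the cheapest to eliminate.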
URL parameters and duplicates: remediation
Parametric URLs are the biggest source of waste on e-commerce platforms. There is no single silver bullet — the right solution depends on the nature of the parameter.
| Parameter type | Example | Recommended action |
|---|---|---|
| Sorting / ordering | ?sort=price_asc | Disallow in robots.txt or canonical pointing to the parameter-free URL |
| Filters with no unique content | ?color=red&size=XL | Canonical to the clean category page, or noindex via meta robots; avoid combining both on one page, since the signals conflict |
| Pagination | ?page=5 | Self-referencing canonical on each page; Google advises against pointing pages 2+ at page 1 (or Disallow for deep pagination) |
| Tracking / UTM | ?utm_source=email | Canonical to the UTM-free URL on every page (the URL Parameters tool in GSC was retired in 2022) |
| Sessions / tokens | ?sessionid=abc | Disallow: /*?sessionid= in robots.txt; fix URL generation in code |
| Preview / draft | ?preview=1&draft=true | Disallow: /*?preview= and /*?draft= in robots.txt |
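For the rows that rely on a canonical, the clean URL has to be computed somewhere in the template layer. A minimal sketch, assuming that tracking, session and sorting parameters never change the content; the parameter sets and the function name are illustrative:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed content-neutral parameters; review this list for your own platform.
STRIP_EXACT = {"sessionid", "token", "ref", "sort", "order"}
STRIP_PREFIXES = ("utm_",)

def canonical_url(url):
    """Drop content-neutral parameters and sort the rest for a stable canonical URL."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in STRIP_EXACT and not name.startswith(STRIP_PREFIXES)
    ]
    kept.sort()  # same parameters in a different order map to one URL
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonical_url("https://example.com/category/?utm_source=email&sort=price_asc"))
print(canonical_url("https://example.com/catalog?size=XL&color=red&sessionid=abc"))
```

Filter parameters are deliberately kept in this sketch: whether a filtered page canonicalises to the clean category is a content decision, not something a generic helper should assume.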
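# robots.txt: blocking parametric URLs
# Note: /*?name= matches only when the parameter comes first in the query string;
# add a /*&name= rule for parameters that can appear later.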
User-agent: *
# Sort and filter parameters
Disallow: /*?sort=
Disallow: /*?order=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /*?size=
# Pagination deeper than page 3 (Allow/Disallow approach)
# The $ anchor stops page=1 from also matching page=10, page=11, ...
Disallow: /*?page=
Allow: /*?page=1$
Allow: /*?page=2$
Allow: /*?page=3$
# Session and tracking parameters
Disallow: /*?sessionid=
Disallow: /*?token=
Disallow: /*?ref=
Disallow: /*?utm_
# Preview and drafts
Disallow: /*?preview=
Disallow: /*?draft=
Filters, facets and pagination
Faceted navigation is the most complex case. It creates a combinatorial explosion of URLs: 10 filters with 5 values each give 6^10 − 1, roughly 60 million, filtered combinations if each filter is either unset or set to a single value. The goal is to keep valuable facets (brand + category) and block useless combinations.
- Identify valuable facets: pages with high organic traffic or unique SEO content (e.g. /catalog/nike/ with brand-specific text) — keep in the index with a self-referencing canonical.
- Add noindex + follow to low-value combinations: the page renders (links are followed) but does not enter the index. Use <meta name="robots" content="noindex, follow"> in the template when more than one facet is active.
- Disallow purely technical parameters: sorting, colour, size. If they carry no SEO value, block them via robots.txt.
- AJAX-loaded facets: if facets load via JavaScript without reflecting in the URL, the bot sees no additional URLs. This solves the problem at the architecture level.
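The "more than one facet is active" rule from the list can be expressed as a small template helper. This is a sketch under the assumption that facets are passed as query parameters; the facet names and the function name are hypothetical:

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical facet parameter names.
FACET_PARAMS = {"brand", "color", "size"}

def robots_meta(url):
    """Return the robots meta content for a catalogue URL based on active facets."""
    query = parse_qsl(urlsplit(url).query)  # blank values are dropped by default
    active = {name for name, _ in query if name in FACET_PARAMS}
    if len(active) > 1:
        return "noindex, follow"
    return "index, follow"

print(robots_meta("/catalog?brand=nike"))
print(robots_meta("/catalog?brand=nike&color=red"))
```

A real implementation would likely consult a whitelist of valuable combinations instead of a plain count, so that pairs with proven organic demand stay indexable.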
Pagination: canonical vs Disallow
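The optimal pagination strategy depends on scale. For up to about 10 pages, allow crawling and give each page a self-referencing canonical; Google's pagination guidance advises against pointing pages 2+ at page 1, because items linked only from deeper pages may then go undiscovered. For hundreds of pagination pages, Disallow beyond page 3–5. Infinite scroll with no URL parameters is treated as a single page by the bot, which is good for budget, but items that load only on scroll need another crawlable path, such as the sitemap.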
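Disallowing deep pagination while allowing the first few pages depends on how Google resolves conflicting rules: the longest matching pattern wins, and Allow wins a tie. The sketch below is a simplified re-implementation of that matching logic for testing rules offline, not Google's own parser; function names are illustrative. It also shows why an unanchored `Allow: /*?page=1` is risky: it matches page=12 as well.

```python
import re

def _to_regex(pattern):
    """Translate a robots.txt path pattern (* wildcard, optional $ anchor) to a regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(chunk) for chunk in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(path, rules):
    """rules: list of ("allow" | "disallow", pattern). Longest match wins; Allow wins ties."""
    best = None
    for kind, pattern in rules:
        if _to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

anchored = [
    ("disallow", "/*?page="),
    ("allow", "/*?page=1$"),
    ("allow", "/*?page=2$"),
    ("allow", "/*?page=3$"),
]
unanchored = [("disallow", "/*?page="), ("allow", "/*?page=1")]

print(is_allowed("/catalog?page=2", anchored))     # page 2 stays crawlable
print(is_allowed("/catalog?page=12", anchored))    # deep page is blocked
print(is_allowed("/catalog?page=12", unanchored))  # prefix match lets page 12 through
```

Running candidate rules against a sample of real URLs from the logs before deploying robots.txt is a cheap way to catch this kind of over-matching.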
Temporary pages: webinars, mockups, demos
Temporary pages are a hidden budget drain on platforms with active marketing teams. Webinar landing pages, demo mockups, client preview pages, A/B test variants — they appear fast and often stay forever, silently consuming crawl budget.
- meta robots noindex: add <meta name="robots" content="noindex, nofollow"> to the template of all temporary pages. The bot visits, reads the directive, and excludes the URL from the index. The simplest method when template access is available.
- X-Robots-Tag HTTP header: if temporary pages have no dedicated HTML template (e.g. generated PDFs or files), use the HTTP response header X-Robots-Tag: noindex, nofollow.
- Disallow by path prefix: if temporary pages live under predictable path prefixes, block them in robots.txt. No guarantee of removal from an already-indexed URL, but new URLs will not be crawled. A blocked URL's noindex is never read, so pick one method per URL rather than combining them.
- HTTP authentication: for staging and preview pages this is the only reliable full-isolation method. Googlebot has no credentials and will not crawl the page.
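Where the header has to be added in application code rather than in the web server, a thin middleware can do it. The sketch below assumes a Python WSGI application; the `/exports/` prefix, the middleware name and the demo app are hypothetical. The prefixes must not also be disallowed in robots.txt, otherwise the bot never receives the header.

```python
# Hypothetical path prefixes for generated files that must stay out of the index.
TEMPORARY_PREFIXES = ("/exports/",)

def noindex_temporary(app):
    """WSGI middleware: add X-Robots-Tag to responses under temporary prefixes."""
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start(status, headers, exc_info=None):
            if path.startswith(TEMPORARY_PREFIXES):
                headers = list(headers) + [("X-Robots-Tag", "noindex, nofollow")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)
    return middleware

def demo_app(environ, start_response):
    """Stand-in application that serves a generated file."""
    start_response("200 OK", [("Content-Type", "application/pdf")])
    return [b"%PDF"]

def response_headers(app, path):
    """Call a WSGI app with a minimal environ and return its headers as a dict."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured.update(headers)

    app({"PATH_INFO": path}, start_response)
    return captured

wrapped = noindex_temporary(demo_app)
print(response_headers(wrapped, "/exports/report.pdf"))
print(response_headers(wrapped, "/catalog/nike/"))
```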
noindex markup in tag and HTTP header:
<meta name="robots" content="noindex, nofollow">
X-Robots-Tag: noindex, nofollow
# robots.txt: patterns for temporary pages
User-agent: *
# Webinars and events
Disallow: /webinars/
Disallow: /events/draft-
Disallow: /landing/test-
# Mockups and previews
Disallow: /_preview/
Disallow: /mockups/
Disallow: /demo/
# A/B tests and experiments
Disallow: /*?variant=
Disallow: /*?experiment=
Disallow: /*?ab_test=
# Template preview parameters
Disallow: /*?preview=
Disallow: /*?draft=
Disallow: /*?revision=
Sitemap index at scale
With millions of URLs a single XML sitemap is impractical: it is capped at 50,000 URLs and 50 MB. A Sitemap Index is a table-of-contents file that points to child sitemaps. It lets you structure priorities and direct budget where it is needed most.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Priority sections (listed first for readability; listing order is not a documented crawl-priority signal) -->
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-05-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-categories.xml</loc>
    <lastmod>2026-05-15</lastmod>
  </sitemap>
  <!-- Lower-priority sections -->
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-05-10</lastmod>
  </sitemap>
  <!-- Do NOT include noindex pages or duplicates! -->
</sitemapindex>
- Split sitemaps by content type: products, categories, blog, static pages, each in a separate file. This lets you see per-section indexation stats in GSC.
- Keep lastmod honest: lastmod signals freshness to the bot. Set the date of the last real change, not today's date — otherwise the bot stops trusting the signal.
- Exclude blocked URLs: pages with noindex or Disallow must not appear in the sitemap — they send contradictory signals to the bot.
- Child sitemap limit: each child file holds a maximum of 50,000 URLs or 50 MB (uncompressed). Split into parts when exceeded: sitemap-products-1.xml, sitemap-products-2.xml.
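The splitting and lastmod rules above can be automated when the sitemap is generated. The following sketch builds the index for one section from a list of (URL, last change date) pairs; the function names and the file naming scheme are assumptions that follow the sitemap-products-1.xml pattern, and writing the child files themselves is left out.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file limit from the sitemap protocol

def chunk(items, size=MAX_URLS):
    """Split a list into consecutive parts of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_index(base_url, section, pages, size=MAX_URLS):
    """pages: list of (url, lastmod) with lastmod as ISO 'YYYY-MM-DD' strings.

    Returns (index_xml, child_file_names). Each child's lastmod is the most
    recent real change among its pages, never the generation date.
    """
    root = ET.Element("sitemapindex", xmlns=NS)
    names = []
    for number, part in enumerate(chunk(pages, size), start=1):
        name = f"sitemap-{section}-{number}.xml"
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url}/{name}"
        # ISO dates sort correctly as plain strings.
        ET.SubElement(entry, "lastmod").text = max(lastmod for _, lastmod in part)
        names.append(name)
    return ET.tostring(root, encoding="unicode"), names

pages = [
    ("https://example.com/p/1", "2026-05-01"),
    ("https://example.com/p/2", "2026-05-15"),
    ("https://example.com/p/3", "2026-05-10"),
]
# size=2 only to keep the demonstration small; use the default in production.
index_xml, names = build_index("https://example.com", "products", pages, size=2)
print(names)
print(index_xml)
```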
Server configuration and rate limiting
The server side affects crawl budget in both directions: slow responses force Googlebot to reduce crawl intensity, while 5xx errors make the bot wait and waste time on retries. Server optimisation directly improves crawl efficiency.
| HTTP code | Googlebot interpretation | Recommended action |
|---|---|---|
| 200 OK | Page exists; bot crawls and indexes | The norm for all public pages |
| 301/302 | Redirect — bot follows; budget is spent on both URLs | Minimise redirect chains; fix dead links |
| 404 Not Found | Page does not exist; bot revisits the URL less and less often | Acceptable for deleted pages; > 5% 404 in Googlebot logs = problem |
| 429 Too Many Requests | Temporary throttle — bot slows crawl rate | Use under genuine server load; Googlebot respects 429 |
| 503 Service Unavailable | Temporary unavailability — bot will retry later | Correct response during maintenance; do not abuse |
| 500 Server Error | Server error — bot retries, wasting budget | Eliminate 500 causes; every 5xx burns budget |
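# Nginx: return 429 when a crawler sends too many requests.
# Key the limit on crawlers only: requests with an empty key are not limited,
# so ordinary visitors who share a browser User-Agent are unaffected.
# The User-Agent pattern is an example; adjust it to the bots in your logs.
map $http_user_agent $crawler_key {
    default                 "";
    ~*(googlebot|bingbot)   $http_user_agent;
}
limit_req_zone $crawler_key zone=crawlers:10m rate=10r/s;
location / {
    limit_req zone=crawlers burst=20 nodelay;
    limit_req_status 429;
    # ... other settings
}
KPIs and monitoring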
Crawl budget requires continuous monitoring: any deployment can accidentally expose thousands of new parametric URLs. Set up a weekly dashboard and a monthly review of the following metrics.
- Weekly: compare GSC crawl requests for current vs. previous week. A sudden spike signals new duplicates introduced by a recent deployment.
- Monthly: check GSC → Pages → "Discovered — currently not indexed". Growth with an unchanged site size = crawl budget problem.
- After every major deployment: run Screaming Frog over the site, check the unique URL count. Compare to the previous snapshot.
- Quarterly: full log analysis over 30 days. Look for new parametric URL patterns introduced by updated code.
- Alerts: set a notification if 5xx responses exceed 1% of all Googlebot requests in a single day.
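The weekly comparison and the 5xx alert are simple enough to script on top of whatever aggregates the logs. A minimal sketch, assuming daily Googlebot counts are already available as a dictionary; the data structure, function names and figures are illustrative:

```python
def days_over_5xx_threshold(daily_counts, threshold=0.01):
    """daily_counts: {date: {"total": int, "5xx": int}}. Return dates above the threshold."""
    return sorted(
        day
        for day, counts in daily_counts.items()
        if counts["total"] and counts["5xx"] / counts["total"] > threshold
    )

def week_over_week_change(current_requests, previous_requests):
    """Relative change in crawl requests; None when there is no baseline."""
    if previous_requests == 0:
        return None
    return (current_requests - previous_requests) / previous_requests

daily = {
    "2026-05-11": {"total": 12000, "5xx": 60},   # 0.5%: fine
    "2026-05-12": {"total": 11800, "5xx": 240},  # about 2%: alert
}
alert_days = days_over_5xx_threshold(daily)
change = week_over_week_change(90000, 60000)
print(alert_days, change)
```

What counts as a "sudden spike" in the weekly figure is a judgment call; a fixed percentage is a reasonable starting point until there is enough history to derive one from normal variation.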
Quick wins and roadmap
Crawl budget optimisation is not a one-off task — it is a process. Divide the work into quick wins (effect in 1–2 weeks) and systemic changes (1–3 months).
- Add noindex to webinar, mockup, and demo templates. Add Disallow for ?preview=, ?draft= in robots.txt. Impact: −10–30% of budget
- Add ?utm_, ?sessionid=, ?ref= to robots.txt. Canonical for UTM variants on all templates. Impact: −5–15% of budget
- Split into thematic sitemaps. Remove noindex pages from the sitemap. Add honest lastmod. Impact: improved indexation coverage
- Identify valuable vs. junk facet combinations. Implement noindex + follow for low-value combinations. Impact: −20–50% of budget
- Eliminate redirect chains. Reduce 404 count from logs. Optimise TTFB for crawled sections. Impact: higher crawl rate
- GSC dashboard, 5xx alerts, quarterly log analysis. Check after every major deployment. Impact: regression prevention
Quick wins checklist
- noindex added to all temporary page templates (webinars, mockups, demos)
- Disallow for ?preview=, ?draft=, ?sessionid=, ?utm_ in robots.txt
- Canonical pointing to clean URLs on parametric pages (sorting, tracking); self-referencing canonical on pagination pages 2+
- noindex and Disallow pages removed from the Sitemap
- Sitemap index split by section with accurate lastmod
- Redirect chains reduced to a single hop
- 404 in Googlebot logs < 5% — dead links removed
- Weekly monitoring dashboard configured in GSC