Crawling
The first stage of search: how bots discover and fetch pages, how crawl budget works, and how it affects indexing.
Crawling is the process by which a search engine bot (Googlebot, Bingbot, etc.) discovers and fetches pages by following links, collecting data for later indexing.
How crawling works
A search bot starts from URLs it already knows (from sitemap.xml files and links on other sites), fetches them, and follows the internal and external links it finds. Each newly discovered URL joins a queue to be fetched, and fetched pages are passed on for indexing.
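The discover-and-queue loop described above can be sketched as a breadth-first traversal. This is only an illustration, not how any particular search engine is implemented: the `LINKS` dictionary is a hypothetical stand-in for real HTTP fetches, and the `crawl` function and `max_pages` parameter are names invented for this example.

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP fetches:
# each key is a URL, each value lists the links found on that page.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def crawl(seeds, max_pages=100):
    """Breadth-first discovery: return URLs in the order they were fetched."""
    queue = deque(seeds)  # frontier of URLs waiting to be fetched
    seen = set(seeds)     # URLs already queued, so nothing is fetched twice
    fetched = []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        fetched.append(url)  # a real bot would download and store the page here
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


order = crawl(["https://example.com/"])
print(order)
```

The `max_pages` cap is a crude stand-in for the bot's limited resources: once it is reached, URLs still in the queue are simply not fetched in this run.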
Crawl budget
A bot's resources are finite, which matters most on large sites (10k+ pages). If it spends time crawling junk (duplicates, session URLs, filter pages), it may miss important new pages or reach them late. Crawl budget combines crawl frequency (how often the bot visits) and the number of URLs it is willing to fetch per visit.
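One rough way to see how much crawling goes to junk is to classify the URLs a bot requested by their query parameters. The sketch below assumes the list of crawled URLs has already been extracted (for example from server logs). The parameter names in `SESSION_PARAMS` and `FILTER_PARAMS` are assumptions for illustration and would need to match what the site actually uses.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed parameter names; adjust to the site's real URL scheme.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}
FILTER_PARAMS = {"sort", "color", "size"}


def classify(url):
    """Label a URL as 'session', 'filter', or 'clean' from its query string."""
    query = urlsplit(url).query
    params = {name.lower() for name in parse_qs(query, keep_blank_values=True)}
    if params & SESSION_PARAMS:
        return "session"
    if params & FILTER_PARAMS:
        return "filter"
    return "clean"


def wasted_share(crawled_urls):
    """Fraction of bot requests that went to session or filter URLs."""
    if not crawled_urls:
        return 0.0
    junk = sum(1 for url in crawled_urls if classify(url) != "clean")
    return junk / len(crawled_urls)


# Hypothetical sample of URLs a bot requested.
crawled = [
    "https://example.com/shoes",
    "https://example.com/shoes?sort=price",
    "https://example.com/shoes?color=red&size=42",
    "https://example.com/cart?sid=abc123",
]
share = wasted_share(crawled)
print(f"{share:.0%} of crawled URLs look like junk")
```

A high share suggests that parameterized URLs are worth blocking or consolidating before worrying about other factors.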
Controlling crawling
Key levers:
- robots.txt: block crawling of unnecessary sections.
- Clean response codes: 200 for pages that should be crawled, 301 for pages that have moved, 404 or 410 for pages that no longer exist.
- Sitemap.xml: list the URLs you want the bot to discover and crawl.
- Canonical: indicate the preferred URL so the bot consolidates duplicates and, over time, spends less effort on them.
- Log analysis: study which pages the bot actually visits.
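Before relying on robots.txt rules, it can help to test them against sample URLs. The sketch below uses Python's standard `urllib.robotparser` on an in-memory rule set. The rules and URLs are hypothetical, and the standard-library parser does simple prefix matching, so it may not mirror every extension a given search engine supports.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, parsed from memory instead of fetched over HTTP.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /search
""".splitlines()

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)


def allowed(url, agent="Googlebot"):
    """Return True if the parsed rules let the given agent fetch the URL."""
    return parser.can_fetch(agent, url)


for url in (
    "https://example.com/blog/post",
    "https://example.com/cart/checkout",
    "https://example.com/search?q=shoes",
):
    print(url, "allowed" if allowed(url) else "blocked")
```

Note that robots.txt controls crawling only. A blocked URL can still appear in an index if other pages link to it, so this check says nothing about indexing.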