Crawling

The first step of the search algorithm: how bots discover and crawl pages. Managing crawl budget and its effect on indexing.

In brief

Crawling is the process by which a search engine bot (Googlebot, Bingbot, etc.) discovers and fetches pages by following links, collecting data for later indexing.

How crawling works

A search bot starts from URLs it already knows (entries in sitemap.xml, links from other sites) and follows the internal and external links it finds on each page. Every newly discovered URL is queued for fetching, and each fetched page is then passed on for indexing.
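The discovery loop described above can be sketched as a breadth-first traversal. This is only an illustration, not how any particular search engine is implemented: the `LINKS` dictionary is a made-up link graph standing in for real HTTP fetches, and `crawl`, `fetch_links` and `max_pages` are hypothetical names.

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP fetches:
# each URL maps to the links found on that page.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first discovery: return URLs in the order they were fetched."""
    frontier = deque(seeds)  # URLs waiting to be fetched
    seen = set(seeds)        # URLs already queued, so nothing is fetched twice
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)  # a real bot would pass the page on for indexing here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched


order = crawl(["https://example.com/"], lambda url: LINKS.get(url, []))
print(order)
```

The `seen` set is what keeps the bot from looping forever on pages that link back to each other, and `max_pages` plays the role of a simple per-visit limit.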

Crawl budget

A bot's resources are finite, and on large sites (10k+ pages) that limit starts to matter. If the bot spends its visits on junk (duplicates, session URLs, filter pages), it may miss important new pages. Crawl budget combines crawl frequency (how often the bot visits) and the number of URLs it is willing to fetch per visit.
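As a rough way to see how much of a crawl goes to duplicates, one could normalize the fetched URLs and count how many collapse into the same page. The sketch below assumes that the parameters listed in `NOISE_PARAMS` never change page content; that list, the sample URLs and the function names are illustrative and would need adjusting for a real site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of parameters that do not change page content; adjust per site.
NOISE_PARAMS = {"sessionid", "sid", "sort", "utm_source", "utm_medium", "utm_campaign"}


def normalize(url):
    """Drop noise parameters and the fragment so duplicate URLs collapse to one."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
    ]
    query = urlencode(sorted(kept))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def wasted_share(crawled_urls):
    """Fraction of fetches spent on URLs that duplicate an already-seen page."""
    if not crawled_urls:
        return 0.0
    unique = {normalize(url) for url in crawled_urls}
    return 1 - len(unique) / len(crawled_urls)


# Made-up sample: three variants of one page plus a genuinely different page.
crawled = [
    "https://example.com/shoes",
    "https://example.com/shoes?sessionid=abc",
    "https://example.com/shoes?sort=price",
    "https://example.com/shoes?page=2",
]
print(wasted_share(crawled))
```

In this sample, half of the fetches land on URLs that duplicate a page the bot already has.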

Controlling crawling

Key levers:

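robots.txt: block crawling of sections that do not belong in search (filter pages, session URLs).
Clean response codes: 200 for pages that should be crawled, 301 for pages that have moved, 404 for pages that no longer exist.
Sitemap.xml: list the pages you want crawled so the bot can find and prioritize them.
Canonical: indicate the preferred URL so the bot doesn't waste time on duplicates.
Log analysis: study server logs to see which pages the bot actually requests.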
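For the log analysis lever, a minimal sketch could count bot requests per path. It assumes the server writes the common "combined" access log format; the sample lines are invented, and `bot_hits` and `bot_token` are hypothetical names. User-agent strings can be spoofed, so a production check would also need to verify that the requests really come from the search engine.

```python
import re
from collections import Counter

# Matches the request path and user agent in a combined-format access log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bot_hits(lines, bot_token="Googlebot"):
    """Count requests per path for user agents containing bot_token."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and bot_token in match.group("agent"):
            hits[match.group("path")] += 1
    return hits


GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Invented sample log lines for illustration only.
sample = [
    f'66.249.66.1 - - [10/May/2024:10:00:00 +0000] "GET /catalog HTTP/1.1" 200 5120 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/May/2024:10:00:05 +0000] "GET /catalog?sort=price HTTP/1.1" 200 5120 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/May/2024:10:00:09 +0000] "GET /catalog?sort=price HTTP/1.1" 200 5120 "-" "{GOOGLEBOT}"',
    f'203.0.113.7 - - [10/May/2024:10:01:00 +0000] "GET /catalog HTTP/1.1" 200 5120 "-" "{BROWSER}"',
]

hits = bot_hits(sample)
for path, count in hits.most_common():
    print(count, path)
```

Sorting by count makes it easy to spot cases where filter or session URLs receive more bot visits than the pages that actually matter.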

Crawling does not guarantee indexing. After fetching, the page undergoes evaluation and may not be included in the index.

FAQ

Common questions

Crawling is fetching the page; indexing is analyzing it and adding it to the searchable database.

How often a bot revisits a site depends on its authority, update frequency and size: anywhere from several times a day to once every few weeks.

To get new pages crawled sooner, submit a sitemap in Search Console, request indexing through the URL Inspection tool, and earn a few external links.

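To stop the bot from wasting crawl budget, check robots.txt, keep parameter-based URLs out of the crawl with robots.txt rules (Google Search Console's URL Parameters tool was retired in 2022), and set up canonicals.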
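A quick way to check robots.txt rules before deploying them is Python's standard `urllib.robotparser`. The rules and URLs below are made up for illustration, and `is_crawlable` is a hypothetical helper. Python's parser uses simple prefix matching, so results may differ from how a given search engine interprets wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; the paths are made up.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /cart/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules from text, no network request


def is_crawlable(url, agent="Googlebot"):
    """Return True if the parsed rules allow the agent to fetch the URL."""
    return parser.can_fetch(agent, url)


for url in (
    "https://example.com/catalog",
    "https://example.com/search?q=shoes",
    "https://example.com/cart/checkout",
):
    print(url, is_crawlable(url))
```

Running such a check against a list of important URLs helps catch a rule that accidentally blocks pages meant to be crawled.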
