Robots.txt

A root-level text file that tells crawlers which paths may be fetched and where sitemaps live. It steers crawling—it is not a reliable way to hide URLs from search results.

In brief

Robots.txt implements the Robots Exclusion Protocol—User-agent groups, Disallow/Allow prefix rules, optional Sitemap lines, and occasional vendor extensions. One mis-scoped rule can block large parts of a site.

Purpose

Crawlers fetch robots.txt to learn path-level fetch permissions before large-scale crawling. Each hostname (and scheme) needs its own file at the web root; CDNs and staging hosts are easy to misconfigure.
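
To make the per-host scoping concrete, here is a minimal Python sketch that derives which robots.txt governs a given page URL. The function name `robots_url` and the example hostnames are illustrative assumptions, not part of any crawler's API.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and netloc (host plus any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Different hosts or schemes resolve to different files.
print(robots_url("https://www.example.com/blog/post?id=1"))
print(robots_url("https://staging.example.com/blog/post"))
print(robots_url("http://www.example.com/blog/post"))
```

Running a check like this across production, CDN, and staging hostnames is one way to confirm that each host serves the policy you intend.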

Blocking a URL with robots.txt prevents crawling but not necessarily indexing—Google may still list the URL if links exist, sometimes without a snippet.

Directives

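Disallow prefixes block fetches; Allow can reopen nested paths. Under RFC 9309 and Google's documentation, the most specific (longest) matching rule wins and Allow wins ties, but other crawlers may apply different precedence. List absolute sitemap URLs with repeated Sitemap lines when needed. Wildcards are supported in limited ways: Google recognizes * (any sequence of characters) and $ (end of URL). Verify against Google's spec before relying on them.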

  • Order rules thoughtfully when multiple prefixes overlap; some parsers apply the first matching rule rather than the most specific one.
  • Never rely on robots.txt for secrets—use authentication.
  • Treat staging domains as first-class citizens with explicit policies.
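
How an Allow rule reopens a branch under a Disallow can be sketched in Python. This is a simplified illustration of the longest-match precedence described in RFC 9309, not any crawler's actual implementation: the function `is_allowed` and its `(directive, prefix)` rule format are assumptions made for this example, and wildcards are deliberately left out.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide whether path may be fetched under longest-match precedence.

    rules is a list of (directive, prefix) pairs from one User-agent group,
    e.g. ("disallow", "/admin/"). Wildcards are not handled in this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty values and non-matching prefixes are ignored
        is_allow = directive.lower() == "allow"
        # A longer prefix wins; on equal length, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
print(is_allowed("/admin/settings", rules))     # False
print(is_allowed("/admin/public/help", rules))  # True
print(is_allowed("/blog/", rules))              # True
```

Because the decision depends on prefix length rather than position, reordering the two rules gives the same result here. A crawler that uses first-match semantics could behave differently, which is why overlapping rules deserve a test.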

Example file

Below is a teaching snippet with comments and a Sitemap line. This site's live rules are always at /robots.txt on the origin you are browsing: the path is identical on local and production hosts, and only the hostname in the address bar changes. That file shows what is disallowed from crawling (here, /api/). For absolute URLs elsewhere (e.g. the Sitemap line emitted by Next.js), set NEXT_PUBLIC_SITE_URL in your deploy environment.

User-agent: *
Allow: /
Disallow: /api/

# Teaching: block a tree but reopen a branch
# Disallow: /admin/
# Allow: /admin/public/

Sitemap: https://www.example.com/sitemap.xml

Common mistakes

  • Blocking CSS/JS so Googlebot cannot render the page faithfully.
  • Typos in paths that silently fail to match intended URLs.
  • Post-migration robots regressions that block entire sections.
  • Confusing robots.txt with meta robots or X-Robots-Tag semantics.

Practice

  • Keep rendering assets crawlable unless you accept partial rendering.
  • Reference sitemap indexes from robots.txt and retest after releases.
  • Track robots.txt changes in version control alongside deploy notes.
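
Retesting after releases can be automated. The sketch below uses Python's standard-library `urllib.robotparser` to check a robots.txt body against a table of expectations. The `EXPECTED` URLs, the `check_robots` helper, and the inline file contents are hypothetical. The rules are kept non-overlapping on purpose, because this parser may not resolve overlapping Allow/Disallow rules the same way Google does.

```python
from urllib.robotparser import RobotFileParser

# Inline copy for illustration; in practice, load the deployed file's text.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/

Sitemap: https://www.example.com/sitemap.xml
"""

# Hypothetical expectations: URL -> should a generic crawler be able to fetch it?
EXPECTED = {
    "https://www.example.com/": True,
    "https://www.example.com/static/app.css": True,
    "https://www.example.com/api/users": False,
}


def check_robots(robots_txt: str, expected: dict[str, bool]) -> list[str]:
    """Return human-readable failures; an empty list means all checks passed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = []
    for url, want in expected.items():
        got = parser.can_fetch("*", url)
        if got != want:
            failures.append(f"{url}: expected allowed={want}, got {got}")
    # site_maps() returns None when the file has no Sitemap lines.
    if not parser.site_maps():
        failures.append("no Sitemap line found")
    return failures


failures = check_robots(ROBOTS_TXT, EXPECTED)
print(failures or "robots.txt checks passed")
```

A check like this could run in CI against the file that is about to ship, so a stray `Disallow: /` or a dropped Sitemap line fails the build instead of reaching production.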

Common questions

  • Robots.txt does not keep pages out of search results; it manages crawling. Use noindex, appropriate HTTP statuses, authenticated walls, or Search Console removals depending on intent.
  • Each hostname needs its own robots.txt at its root; subdomains are separate hosts.
  • Core REP syntax aligns across major crawlers, but always verify vendor-specific notes for wildcards and crawl-delay style directives.
  • Prefix matching is literal: /admin blocks /administration unless you scope rules carefully. Test in crawl reports.
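
The literal-prefix behavior is easy to see with plain string matching. The short Python sketch below compares what `Disallow: /admin` and `Disallow: /admin/` would each match; the helper `blocked_by` and the sample paths are made up for illustration, and wildcards are ignored.

```python
def blocked_by(prefix: str, paths: list[str]) -> list[str]:
    """Return the paths a plain 'Disallow: <prefix>' rule would match."""
    return [p for p in paths if p.startswith(prefix)]


paths = ["/admin", "/admin/", "/admin/users", "/administration", "/admin.html"]

# Without the trailing slash, every path beginning with "/admin" matches.
print(blocked_by("/admin", paths))

# With the trailing slash, only the directory and its children match.
print(blocked_by("/admin/", paths))  # ['/admin/', '/admin/users']
```

Note that the trailing-slash form no longer covers the bare `/admin` URL, so a site that serves content there may need both rules.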