Robots.txt: complete setup guide for SEO

Robots.txt syntax breakdown, ready-made templates for public sites, e-commerce and staging pages, comparison with meta robots and X-Robots-Tag, testing tools and a final checklist.

Robots.txt is a plain text file at the root of your site that tells search crawlers which sections they may visit and which they should skip. It is the first thing a crawler reads when it arrives at your site. The file has existed since 1994 and remains the de facto standard for crawl control.

Robots.txt is a recommendation, not an enforcement mechanism. Reputable bots (Googlebot, Yandex) honour the directives. Malicious crawlers and scrapers ignore the file entirely. For protecting sensitive data, use server-side access controls and authentication — not robots.txt.

The key limitation: a page blocked via Disallow does not disappear from the index automatically. If other sites link to it, Google can index the URL without visiting it — and may show an empty snippet in search results. To fully remove a page from the index, you need different tools.

Googlebot decision tree: robots.txt controls crawling, meta noindex controls indexing. Combined, they yield four distinct outcomes.

What robots.txt is and what it cannot do

The robots.txt file must be placed at the domain root: https://example.com/robots.txt. Subdomains have their own files: robots.txt for blog.example.com lives at https://blog.example.com/robots.txt. The Robots Exclusion Protocol (REP) defines the core directives; Google and Yandex have extended it with additional capabilities.
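
The placement rule can be sketched in Python: given any page URL, the governing robots.txt sits at the root of the same scheme and host. The helper name robots_txt_url is an illustrative assumption, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    # Path, query and fragment are dropped: the file always sits at the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/catalog/shoes?size=42"))
print(robots_txt_url("https://blog.example.com/2024/post"))
```

The second call shows why a subdomain needs its own file: the host is part of the result.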

Crawl control

Prevents bots from visiting certain sections — reduces pointless server load and preserves crawl budget for valuable pages.

Points to Sitemap

The Sitemap directive helps search engines find your XML sitemap immediately without any additional setup.

Does not protect content

Robots.txt is publicly visible. It neither encrypts nor hides page content — it only asks bots not to visit.

Format and syntax

The file consists of blocks called records. Each record starts with one or more User-agent directives and can contain any number of Disallow, Allow, Crawl-delay, and Sitemap lines. Records are separated by a blank line. Comments start with #.
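
To make the record structure concrete, here is a minimal Python sketch that splits a file into groups. It is a simplification: a new group starts at the first User-agent line that follows a rule, comments are stripped, and Sitemap lines are skipped because they belong to no group. The function name parse_groups and the sample file are assumptions made for illustration.

```python
def parse_groups(text: str) -> list[dict]:
    """Split robots.txt text into groups of {'agents': [...], 'rules': [(field, value), ...]}."""
    groups = []
    current = None
    reading_agents = False  # True while consecutive User-agent lines are being collected
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current = {"agents": [], "rules": []}
                groups.append(current)
                reading_agents = True
            current["agents"].append(value)
        elif field == "sitemap":
            continue  # Sitemap is file-wide and belongs to no group
        else:
            reading_agents = False
            if current is not None:  # rules before any User-agent are dropped
                current["rules"].append((field, value))
    return groups


sample = """\
User-agent: AhrefsBot
User-agent: SemrushBot
Disallow: /

User-agent: *
Disallow: /admin/   # admin panel
Allow: /admin/help
"""

groups = parse_groups(sample)
for group in groups:
    print(group)
```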

Directive | Support | Description
User-agent | All | Bot identifier. * means "all crawlers". A single block can have multiple User-agent lines.
Disallow | All | Path that is off-limits for crawling. An empty value means "nothing is blocked" (allow everything).
Allow | Google, Yandex | Explicitly permits a path within a disallowed section. It overrides a conflicting Disallow only when its path is at least as long; a longer Disallow still wins.
Crawl-delay | Bing | Minimum delay in seconds between bot requests. Ignored by Google. Yandex stopped honouring it in 2018; its crawl rate is set in Yandex Webmaster instead.
Sitemap | Google, Yandex, Bing | Absolute URL of the XML sitemap. Multiple Sitemap lines are allowed.

Wildcards: * and $

Google and Yandex support two wildcard characters in Disallow and Allow paths. An asterisk * matches any sequence of characters (including an empty string). A dollar sign $ anchors the end of a URL — an exact match up to the last character.

TEXT
# Block all URLs with the ?sort= parameter
Disallow: /*?sort=

# Block only /page (exact match, not /page/child)
Disallow: /page$

# Block all PDFs across the entire site
Disallow: /*.pdf$
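
When Disallow and Allow conflict, the more specific rule wins, meaning the one with the longer path (more characters). If lengths are equal, Allow wins. This is the behaviour Google documents; other bots may implement precedence differently, for example by applying the first matching rule in file order.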
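
The wildcard and precedence rules can be sketched in a few lines of Python. This assumes the Google-style behaviour described above (longest matching pattern wins, Allow wins ties); the names pattern_to_regex and is_allowed are illustrative, not a library API.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern: * matches any characters, a trailing $ anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules are ('allow' | 'disallow', pattern) pairs taken from one User-agent group."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern:  # an empty Disallow blocks nothing
            continue
        if pattern_to_regex(pattern).match(path):  # match() anchors at the start of the path
            is_allow = kind == "allow"
            # The longer pattern wins; on equal length Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


rules = [
    ("disallow", "/*?sort="),
    ("disallow", "/page$"),
    ("disallow", "/*.pdf$"),
    ("disallow", "/api/"),
    ("allow", "/api/og-image/"),
]

for path in ["/catalog?sort=price", "/catalog?page=2&sort=price", "/page",
             "/page/child", "/files/report.pdf", "/api/users", "/api/og-image/home.png"]:
    print(f"{path:32} {'allowed' if is_allowed(path, rules) else 'blocked'}")
```

One thing the sketch makes visible: /*?sort= only matches when sort is the first query parameter, so /catalog?page=2&sort=price is not blocked. A rule set that needs to catch both positions may want a second pattern such as /*&sort=.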

The Sitemap directive can go anywhere in the file — it does not need to be inside a specific User-agent block. Always use a full absolute URL:

TEXT
Sitemap: https://example.com/sitemap.xml

If you use a Sitemap Index, one Sitemap line pointing to the index file is enough; you do not need to list every child sitemap. The search engine discovers them all through the index.
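
To check which sitemaps a file actually declares, Python's standard urllib.robotparser can read them back (site_maps() is available from Python 3.8). The sample file below, including the second sitemap URL, is a made-up example.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns a list of URLs, or None when the file declares no sitemap.
sitemaps = parser.site_maps() or []

# Flag any entry that is not a full absolute URL.
not_absolute = [url for url in sitemaps if urlsplit(url).scheme not in ("http", "https")]

print(sitemaps)
print(not_absolute)
```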

Ready-made examples for different scenarios

Scenario 1: simple public site

For a small site with no private sections a minimal file is enough: allow everything, block the admin panel and technical paths, declare the sitemap.

TEXT
# robots.txt for example.com — simple public site
User-agent: *
Disallow: /admin/
Disallow: /login
Disallow: /dashboard/

Sitemap: https://example.com/sitemap.xml

Scenario 2: large e-commerce with catalogue and parameter pages

An online store generates hundreds of thousands of URLs through filters, sorting, and pagination. Most are duplicates or pages with no standalone value. Block parametric URLs using Disallow with wildcards while keeping clean category and product pages accessible.

TEXT
# robots.txt for e-commerce example.com
User-agent: *

# Technical and admin sections
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart
Disallow: /account/
Disallow: /api/

# Parametric catalogue duplicates
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Disallow: /*?ref=
Disallow: /search

# Explicitly allow a useful parameter in one section
Allow: /catalog/shoes?size=

Crawl-delay: 1

Sitemap: https://example.com/sitemap.xml

For parametric pages with unique content (e.g. a product with a specific size or colour), prefer a canonical tag over Disallow: the page stays crawlable and its signals are consolidated onto the main URL instead of being cut off completely.

Scenario 3: site with staging and temporary pages

Webinar landing pages, drafts, template pages, and staging sections should be blocked from crawling. Use Disallow by path prefix and, if needed, a separate block for specific bots.

TEXT
# robots.txt for example.com — main site with staging sections
User-agent: *
Disallow: /webinar-drafts/
Disallow: /templates/
Disallow: /staging/
Disallow: /_preview/
Disallow: /thank-you-test

# Block the entire site for a specific bot (e.g. an aggressive crawler)
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

Sitemap: https://example.com/sitemap.xml

Blocking a section via robots.txt does not mean its content is confidential. URLs of staging pages can still be indexed if external sites link to them. For true isolation, use HTTP Basic Auth or IP allowlisting at the server level.
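
A quick way to see how per-bot blocks are selected is Python's standard urllib.robotparser, sketched here on a shortened version of the file above. Two caveats apply: this parser does not understand the * and $ wildcards, and it applies the first matching rule rather than the longest one, so treat it as an approximation of how a real crawler behaves. The test URLs are invented.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /webinar-drafts/
Disallow: /staging/

User-agent: AhrefsBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

checks = [
    ("Googlebot", "https://example.com/blog/post"),
    ("Googlebot", "https://example.com/staging/new-home"),
    ("AhrefsBot", "https://example.com/blog/post"),
]

# A bot with a dedicated block uses only that block; everyone else falls back to *.
results = {(agent, url): parser.can_fetch(agent, url) for agent, url in checks}
for (agent, url), ok in results.items():
    print(f"{agent:10} {'allowed' if ok else 'blocked'}  {url}")
```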

Disallow vs meta robots noindex vs X-Robots-Tag

These are three different tools with different jobs. Mixing them up leads to a classic mistake: a page is blocked in robots.txt and also has a noindex meta tag — Google cannot read the meta tag because the page is blocked from crawling, so the noindex directive is never executed.

Tool | Where it lives | What it does | When to use
Disallow in robots.txt | robots.txt | Prevents the bot from visiting the URL. Does not remove from the index if the URL is already there. | Block service sections, reduce unnecessary crawl budget consumption.
meta robots noindex | HTML <head> | Allows the bot to visit but forbids adding to the index. The bot reads the tag and follows the directive. | Remove a specific page from the index (thank-you pages, pagination, filter pages).
X-Robots-Tag | HTTP response header | Equivalent to meta robots, but works for any file type: PDFs, images, documents. | Control indexation of non-HTML resources (PDF, DOCX, images).

Never combine Disallow and noindex for the same page. If a page is blocked in robots.txt, Google cannot read its noindex and may keep the URL in the index with an empty snippet. Use noindex only on pages the crawler is allowed to visit.
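
The interaction between crawl control and index control comes down to four cases, sketched here as a small Python function. The function name and the wording of the outcomes are illustrative; they restate the behaviour described in this section rather than quoting any search engine.

```python
def indexing_outcome(crawl_allowed: bool, has_noindex: bool) -> str:
    """Combine the robots.txt decision with a noindex signal (meta robots or X-Robots-Tag)."""
    if crawl_allowed and has_noindex:
        return "crawled; noindex is read and the page is kept out of the index"
    if crawl_allowed:
        return "crawled; the page can be indexed normally"
    if has_noindex:
        return "not crawled; noindex is never seen, so the URL may still be indexed from links"
    return "not crawled; the URL may still be indexed from links, without a snippet"


for crawl_allowed in (True, False):
    for has_noindex in (False, True):
        print(crawl_allowed, has_noindex, "->", indexing_outcome(crawl_allowed, has_noindex))
```

The third case (blocked and noindex) is the classic mistake: the directive exists but the crawler never gets to read it.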

Example of X-Robots-Tag in an HTTP response header (configured in Nginx or application code):

TEXT
# Nginx: prevent indexation of all PDFs
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex, nofollow";
}

How to test robots.txt

Always validate changes before deploying. A single typo can block your entire site from crawlers.

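  • Google Search Console → robots.txt report (under Settings): shows which robots.txt files Google found, when they were last fetched, and any parsing problems. The older standalone robots.txt Tester has been retired; to check whether a single URL is blocked for Googlebot, use URL Inspection.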
  • Yandex Webmaster → Robots.txt analysis: the equivalent tool for Yandex. Shows how Yandex Bot interprets your file.
  • Screaming Frog SEO Spider → Configuration → Robots.txt: scans the site respecting robots.txt and lists blocked URLs in the report.
  • curl: quick command-line check to read the file and verify the HTTP status.

Example commands for the curl check:
BASH
# Fetch robots.txt and inspect its contents
curl -s https://example.com/robots.txt

# Check HTTP status (must be 200)
curl -I https://example.com/robots.txt
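
The same status check can be scripted in Python with the standard library. This is a sketch: fetch_status and interpret_status are hypothetical helper names, and the interpretation follows the 200 / 404 / 5xx behaviour this guide describes for Google; other status codes are left for manual review.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request (like curl -I) and return the final HTTP status code."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urlopen raises on 4xx/5xx; the code is still what we want


def interpret_status(status: int) -> str:
    """Map a robots.txt status code to its effect on crawling."""
    if 200 <= status < 300:
        return "ok: the rules in the file apply"
    if status == 404:
        return "missing: treated as no restrictions"
    if 500 <= status < 600:
        return "server error: crawling is paused temporarily"
    return "other: check manually"


# Usage (requires network access):
#   print(interpret_status(fetch_status("https://example.com/robots.txt")))
print(interpret_status(200))
print(interpret_status(404))
print(interpret_status(503))
```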

Common mistakes and how to fix them

Mistake | Symptom | Fix
Disallow: / for the whole site | Googlebot stops crawling everything. GSC shows pages as "blocked by robots.txt". | Remove the line or replace with a specific path. This often sneaks in during deployments.
Accidental space in path | Disallow: /admin / (with a space): the bot reads the path literally and nothing is blocked. | Remove spaces. Paths are case-sensitive and whitespace-sensitive.
noindex + Disallow on the same page | Page stays in the index; noindex directive is never executed. | Remove Disallow, keep only noindex. The bot must be able to visit the page to read the tag.
Missing User-agent before rules | Syntax error: rules apply to no bot. | Add User-agent: * before the Disallow/Allow block.
robots.txt returns 404 or 500 | 404 is treated as "no restrictions"; 500 halts crawling temporarily. | Ensure HTTP 200. An empty file equals "allow everything".
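
Several of these mistakes can be caught automatically before deployment. The following Python sketch checks for three of them: rules that appear before any User-agent line, a space inside a path, and Disallow: / in the group for all bots. The function name lint_robots and the message wording are assumptions; it is a starting point for a pre-deploy check, not a complete validator.

```python
def lint_robots(text: str) -> list[str]:
    """Return human-readable warnings for a few common robots.txt mistakes."""
    problems = []
    current_agents: list[str] = []
    reading_agents = False  # True while consecutive User-agent lines are being collected
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: no 'field: value' pair")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current_agents = []
                reading_agents = True
            current_agents.append(value)
            continue
        if field == "sitemap":
            continue  # file-wide directive, valid anywhere
        reading_agents = False
        if field in ("allow", "disallow"):
            if not current_agents:
                problems.append(f"line {number}: rule before any User-agent")
            if " " in value:
                problems.append(f"line {number}: space inside path {value!r}")
            if field == "disallow" and value == "/" and "*" in current_agents:
                problems.append(f"line {number}: Disallow: / blocks the whole site for all bots")
    return problems


bad = "Disallow: /tmp/\n\nUser-agent: *\nDisallow: /admin /\nDisallow: /\n"
problems = lint_robots(bad)
for problem in problems:
    print(problem)
```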

Security recommendations

Robots.txt is public and indexed by search engines — anyone can read it. Publishing paths to admin panels, backups, internal APIs, or config files in robots.txt effectively hands attackers a map of your site.

Do not rely on robots.txt as the only protection for sensitive sections. A path like Disallow: /internal-api/v2/keys/ tells an attacker exactly where to look.
  • Do not expose sensitive paths — backups (.zip, .sql), config files (.env, config.php), internal APIs.
  • Protect confidential sections at the server level — HTTP Basic Auth, IP allowlist, OAuth — not via robots.txt.
  • Keep the Disallow list minimal — only paths that genuinely need to be excluded from crawling for SEO reasons.
  • Regularly review robots.txt in GSC — make sure deployments haven't accidentally introduced Disallow: /.
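
A simple review aid is to scan the file for paths that look revealing. The Python sketch below uses a hypothetical keyword list (SENSITIVE_HINTS) chosen for illustration; it will produce false positives (for example, "key" also matches /keyboard/), so treat the output as a list of lines to review by hand.

```python
import re

# Hypothetical hint list for illustration only; tune it to your own stack.
SENSITIVE_HINTS = re.compile(r"\.env|\.sql|\.zip|backup|config|secret|key|internal", re.IGNORECASE)


def revealing_rules(text: str) -> list[str]:
    """Return Allow/Disallow paths that look like they point at sensitive locations."""
    flagged = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments
        field, _, value = line.partition(":")
        if field.strip().lower() in ("allow", "disallow") and SENSITIVE_HINTS.search(value):
            flagged.append(value.strip())
    return flagged


sample = """\
User-agent: *
Disallow: /admin/
Disallow: /internal-api/v2/keys/
Disallow: /backups/db.sql
"""

flagged = revealing_rules(sample)
print(flagged)
```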

Checklist and final example

  • File is accessible at https://example.com/robots.txt and returns HTTP 200
  • UTF-8 encoding, no BOM (Byte Order Mark)
  • Every block starts with User-agent
  • Blank line separates blocks for different User-agents
  • No combination of Disallow + noindex on the same page
  • No sensitive paths exposed publicly
  • Crawl-delay set only for bots that honour it (e.g. Bing) if server load is high; for Yandex, crawl rate is configured in Yandex Webmaster
  • Sitemap directive points to the root index file
  • File validated in GSC and Yandex Webmaster
  • Screaming Frog shows the expected number of blocked URLs
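
The encoding item on this checklist is easy to verify on the raw bytes of the file. A minimal Python sketch, with check_encoding as an illustrative helper name:

```python
import codecs


def check_encoding(raw: bytes) -> list[str]:
    """Check raw robots.txt bytes for a UTF-8 BOM and for invalid UTF-8."""
    problems = []
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    return problems


print(check_encoding(b"User-agent: *\nDisallow: /admin/\n"))
print(check_encoding(codecs.BOM_UTF8 + b"User-agent: *\n"))
```

In practice the bytes would come from reading the deployed file in binary mode, for example open("robots.txt", "rb").read().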

Final example robots.txt for example.com with comments — a universal template covering most scenarios:

TEXT
# robots.txt for example.com
# Updated: 2026-05-15

# Rules for all crawlers
User-agent: *

# Admin and auth sections
Disallow: /admin/
Disallow: /login
Disallow: /dashboard/
Disallow: /account/

# Technical paths
# /_next/ is left crawlable: on Next.js sites it serves the JS and CSS that Googlebot needs to render pages
Disallow: /api/
Disallow: /cdn-cgi/

# Parametric duplicates (e-commerce)
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?ref=
Disallow: /search

# Staging and temporary pages
Disallow: /staging/
Disallow: /webinar-drafts/
Disallow: /templates/

# Explicit allow for a useful API path
Allow: /api/og-image/

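# Crawl rate for bots that honour Crawl-delay (Google ignores it)
# Kept in the * group: a bot with its own User-agent block would ignore every rule above
Crawl-delay: 1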

# Aggressive SEO bots (optional)
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

# Sitemap
Sitemap: https://example.com/sitemap.xml

FAQ

Does robots.txt affect rankings?
Indirectly. The file itself is not a ranking signal, but blocking pages with Disallow reduces crawling of those pages. Block important pages and they will lose visibility. Block duplicate parametric URLs and you free up crawl budget for valuable pages.

What happens if robots.txt is missing or returns an error?
Google treats a missing robots.txt (404) as "no restrictions" and crawls the entire site. A 500 error or timeout is treated as a temporary block: Google pauses crawling and keeps retrying the file. If the error persists for weeks, Google may fall back to its last cached copy of the file.

Can I set different rules for Google and Yandex?
Yes. Create two separate User-agent blocks. Googlebot reads only its block, Yandex reads only its own. The User-agent: * block applies to all bots that have no dedicated block.

How do I reopen a site that was blocked with Disallow: /?
Replace or remove the Disallow: / line in the User-agent: * block. After deployment verify in GSC URL Inspection. Normal crawling resumes within a few days of the fix.

How do I block a single file or a file type?
Use an exact path without a trailing slash: Disallow: /private-doc.pdf. To block a file extension across the whole site, use a wildcard: Disallow: /*.pdf$