Robots.txt: complete setup guide for SEO

Robots.txt syntax breakdown, ready-made templates for public sites, e-commerce and staging pages, comparison with meta robots and X-Robots-Tag, testing tools and a final checklist.
Robots.txt is a plain text file at the root of your site that tells search crawlers which sections they may visit and which they should skip. It is the first thing a crawler reads when it arrives at your site. The file has existed since 1994 and remains the de facto standard for crawl control.
The key limitation: a page blocked via Disallow does not disappear from the index automatically. If other sites link to it, Google can index the URL without visiting it — and may show an empty snippet in search results. To fully remove a page from the index, you need different tools.
What is robots.txt and its limitations
The robots.txt file must be placed at the root of the host: https://example.com/robots.txt. Each subdomain needs its own file: the one for blog.example.com lives at https://blog.example.com/robots.txt. The Robots Exclusion Protocol (REP), published as RFC 9309 in 2022, defines the core directives; Google and Yandex support additional ones on top of it.
Crawl control
Prevents bots from visiting certain sections — reduces pointless server load and preserves crawl budget for valuable pages.
Points to Sitemap
The Sitemap directive helps search engines find your XML sitemap immediately without any additional setup.
Does not protect content
Robots.txt is publicly visible. It neither encrypts nor hides page content — it only asks bots not to visit.
Format and syntax
The file consists of blocks called records (RFC 9309 calls them groups). Each record starts with one or more User-agent lines, followed by any number of Disallow, Allow, and Crawl-delay lines. Sitemap lines do not belong to any record and apply to the whole file. Records are conventionally separated by a blank line. Comments start with #.
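As a rough illustration of how records are read, the sketch below feeds a small file to Python's standard `urllib.robotparser`. The bot names `ExampleBot` and `OtherBot` and the paths are made up for the example, and this parser only does plain prefix matching, so it is not a stand-in for how Google or Yandex evaluate rules. What it does show is that a crawler matching a specific User-agent record uses only that record and does not inherit rules from the `*` record. It assumes Python 3.8 or newer for `site_maps()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one record for all crawlers, one for "ExampleBot".
ROBOTS_TXT = """\
# Rules for all crawlers
User-agent: *
Disallow: /admin/

# Record for one specific (made-up) bot
User-agent: ExampleBot
Crawl-delay: 1

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A bot without its own record falls back to the "*" record.
other_bot_admin = parser.can_fetch("OtherBot", "https://example.com/admin/")

# ExampleBot has its own record, so the "*" rules do not apply to it.
example_bot_admin = parser.can_fetch("ExampleBot", "https://example.com/admin/")

print(other_bot_admin)                   # False
print(example_bot_admin)                 # True
print(parser.crawl_delay("ExampleBot"))  # 1
print(parser.site_maps())                # ['https://example.com/sitemap.xml']
```

The practical consequence: a record written for one bot has to repeat every Disallow rule that bot should still follow.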
| Directive | Support | Description |
|---|---|---|
| User-agent | All | Bot identifier. * means "all crawlers". A single block can have multiple User-agent lines. |
| Disallow | All | Path that is off-limits for crawling. An empty value means "nothing is blocked" (allow everything). |
| Allow | Google, Yandex, Bing | Explicitly permits a path within a disallowed section. When both Allow and Disallow match a URL, the longer (more specific) path wins; on a tie, Google applies Allow. |
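| Crawl-delay | Bing; Yandex historically | Minimum delay in seconds between bot requests. Ignored by Google. Yandex announced in 2018 that it no longer honors the directive and points to the crawl rate setting in Yandex Webmaster instead. |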
| Sitemap | Google, Yandex, Bing | Absolute URL of the XML sitemap. Multiple Sitemap lines are allowed. |
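How Allow and Disallow interact is easier to see in code. The following is a minimal sketch of the longest-match precedence described in RFC 9309 and in Google's documentation: the most specific (longest) matching path decides, and on a tie the Allow rule is used. It handles plain prefix matching only, with no wildcards, and the function name `is_allowed` and the rule list are illustrative rather than part of any library.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path), e.g. ("disallow", "/api/")


def is_allowed(url_path: str, rules: List[Rule]) -> bool:
    """Return True if url_path may be crawled under longest-match precedence."""
    best_len = -1
    allowed = True  # no matching rule means the URL is allowed
    for directive, path in rules:
        if not path:
            continue  # an empty Disallow/Allow value matches nothing
        if not url_path.startswith(path):
            continue
        if len(path) > best_len:
            best_len = len(path)
            allowed = directive == "allow"
        elif len(path) == best_len and directive == "allow":
            allowed = True  # tie: the less restrictive rule wins
    return allowed


rules = [
    ("disallow", "/api/"),
    ("allow", "/api/og-image/"),
]

print(is_allowed("/api/users", rules))           # False
print(is_allowed("/api/og-image/a.png", rules))  # True
print(is_allowed("/blog/", rules))               # True
```

Because precedence depends on path length and not on position in the file, the order of Allow and Disallow lines inside a record does not matter under this model.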
Wildcards: * and $
Google and Yandex support two wildcard characters in Disallow and Allow paths. An asterisk * matches any sequence of characters (including an empty string). A dollar sign $ anchors the end of a URL — an exact match up to the last character.
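The two wildcard rules above can be sketched as a translation into a regular expression. This is an illustrative approximation, not the implementation any search engine uses; the helper names `pattern_to_regex` and `matches` are made up for the example.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is matched literally, from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    # re.match anchors at the start, mirroring prefix matching of paths.
    return pattern_to_regex(pattern).match(url_path) is not None


print(matches("/*?sort=", "/catalog?sort=price"))  # True
print(matches("/page$", "/page"))                  # True
print(matches("/page$", "/page/child"))            # False
print(matches("/*.pdf$", "/docs/manual.pdf"))      # True
print(matches("/*.pdf$", "/docs/manual.pdf?v=2"))  # False
```

The last line shows a side effect of `$`: a PDF URL with a query string no longer matches, because the path no longer ends in `.pdf`.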
# Block all URLs with the ?sort= parameter
Disallow: /*?sort=
# Block only /page (exact match, not /page/child)
Disallow: /page$
# Block all PDFs across the entire site
Disallow: /*.pdf$
The Sitemap directive can go anywhere in the file; it does not need to be inside a specific User-agent block. Always use a full absolute URL:
Sitemap: https://example.com/sitemap.xml
Ready-made examples for different scenarios
Scenario 1: simple public site
For a small site with no private sections a minimal file is enough: allow everything, block the admin panel and technical paths, declare the sitemap.
# robots.txt for example.com — simple public site
User-agent: *
Disallow: /admin/
Disallow: /login
Disallow: /dashboard/
Sitemap: https://example.com/sitemap.xml
Scenario 2: large e-commerce with catalogue and parameter pages
An online store generates hundreds of thousands of URLs through filters, sorting, and pagination. Most are duplicates or pages with no standalone value. Block parametric URLs using Disallow with wildcards while keeping clean category and product pages accessible.
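One detail worth checking when blocking parameters: a pattern such as `/*?sort=` contains a literal `?`, so under the wildcard rules it matches only when `sort` is the first query parameter. The sketch below shows the gap and one possible way to cover both positions. The helper names and sample URLs are hypothetical, and the matcher is a simplified approximation of wildcard matching.

```python
import re
from typing import List


def matches(pattern: str, url_path: str) -> bool:
    """Wildcard match: '*' is any run of characters, trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in core.split("*"))
    return re.match(regex + (r"\Z" if anchored else ""), url_path) is not None


def param_patterns(name: str) -> List[str]:
    """Patterns that catch a query parameter in first or later position."""
    return [f"/*?{name}=", f"/*&{name}="]


def is_blocked(url_path: str, patterns: List[str]) -> bool:
    return any(matches(p, url_path) for p in patterns)


first = "/catalog/shoes?sort=price"
later = "/catalog/shoes?color=red&sort=price"

print(is_blocked(first, ["/*?sort="]))            # True
print(is_blocked(later, ["/*?sort="]))            # False: 'sort' is not first
print(is_blocked(later, param_patterns("sort")))  # True
```

Whether the second pattern is needed depends on how the store builds its URLs; if parameters always appear in a fixed order, a single rule may be enough.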
# robots.txt for e-commerce example.com
User-agent: *
# Technical and admin sections
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart
Disallow: /account/
Disallow: /api/
# Parametric catalogue duplicates
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Disallow: /*?ref=
Disallow: /search
# Explicitly allow a useful parameter in one section
# (redundant while no Disallow above matches ?size=; needed only once a broader rule does)
Allow: /catalog/shoes?size=
Crawl-delay: 1
Sitemap: https://example.com/sitemap.xml
Scenario 3: site with staging and temporary pages
Webinar landing pages, drafts, template pages, and staging sections should be blocked from crawling. Use Disallow by path prefix and, if needed, a separate block for specific bots.
# robots.txt for example.com — main site with staging sections
User-agent: *
Disallow: /webinar-drafts/
Disallow: /templates/
Disallow: /staging/
Disallow: /_preview/
Disallow: /thank-you-test

# Block the entire site for a specific bot (e.g. an aggressive crawler)
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
Disallow vs meta robots noindex vs X-Robots-Tag
These are three different tools with different jobs. Mixing them up leads to a classic mistake: a page is blocked in robots.txt and also has a noindex meta tag — Google cannot read the meta tag because the page is blocked from crawling, so the noindex directive is never executed.
| Tool | Where it lives | What it does | When to use |
|---|---|---|---|
| Disallow in robots.txt | robots.txt | Prevents the bot from visiting the URL. Does not remove from the index if the URL is already there. | Block service sections, reduce unnecessary crawl budget consumption. |
| meta robots noindex | HTML <head> | Allows the bot to visit but forbids adding to the index. The bot reads the tag and follows the directive. | Remove a specific page from the index (thank-you pages, pagination, filter pages). |
| X-Robots-Tag | HTTP response header | Equivalent to meta robots, but works for any file type: PDFs, images, documents. | Control indexation of non-HTML resources (PDF, DOCX, images). |
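To audit which indexing directives a page actually sends, the header and the meta tag can be read together. The sketch below is one way to do that with the standard library only. It takes already-fetched headers and HTML as input, the function name `indexing_directives` is illustrative, and it deliberately ignores bot-specific forms such as `googlebot: noindex` or `<meta name="googlebot">`.

```python
from html.parser import HTMLParser
from typing import Dict, Set


def _split(value: str) -> Set[str]:
    """Split a comma-separated directive list into lowercase tokens."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class _MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= _split(attr.get("content", ""))


def indexing_directives(headers: Dict[str, str], html: str) -> Set[str]:
    """Merge directives from the X-Robots-Tag header and the meta robots tag."""
    found: Set[str] = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            found |= _split(value)
    parser = _MetaRobotsParser()
    parser.feed(html)
    return found | parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(sorted(indexing_directives({}, page)))
# ['follow', 'noindex']
print(sorted(indexing_directives({"X-Robots-Tag": "noindex, nofollow"}, "")))
# ['nofollow', 'noindex']
```

Neither source is visible to a crawler if the URL is blocked in robots.txt, which is why such a check is only meaningful for crawlable pages.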
Example of X-Robots-Tag in an HTTP response header (configured in Nginx or application code):
# Nginx: prevent indexation of all PDFs
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex, nofollow";
}
How to test robots.txt
Always validate changes before deploying. A single typo can block your entire site from crawlers.
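- Google Search Console → Settings → robots.txt report: shows which robots.txt files Google fetched, when they were last crawled, and any parsing warnings or errors. It replaced the standalone robots.txt Tester, which Google retired in 2023. To check whether a specific URL is blocked for Googlebot, use the URL Inspection tool.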
- Yandex Webmaster → Robots.txt analysis: the equivalent tool for Yandex. Shows how Yandex Bot interprets your file.
- Screaming Frog SEO Spider → Configuration → Robots.txt: scans the site respecting robots.txt and lists blocked URLs in the report.
- curl: quick command-line check to read the file and verify the HTTP status.
# Fetch robots.txt and inspect its contents
curl -s https://example.com/robots.txt
# Check HTTP status (must be 200)
curl -I https://example.com/robots.txt
Common mistakes and how to fix them
| Mistake | Symptom | Fix |
|---|---|---|
| Disallow: / for the whole site | Googlebot stops crawling everything. GSC shows pages as "blocked by robots.txt". | Remove the line or replace with a specific path. This often sneaks in during deployments. |
| Accidental space in path | Disallow: /admin / (with a space) — the bot reads the path literally and nothing is blocked. | Remove spaces. Paths are case-sensitive and whitespace-sensitive. |
| noindex + Disallow on the same page | Page stays in the index; noindex directive is never executed. | Remove Disallow, keep only noindex. The bot must be able to visit the page to read the tag. |
| Missing User-agent before rules | Syntax error: rules apply to no bot. | Add User-agent: * before the Disallow/Allow block. |
| robots.txt returns 404 or 500 | 404 is treated as "no restrictions"; 500 halts crawling temporarily. | Ensure HTTP 200. An empty file equals "allow everything". |
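Several of the mistakes in the table can be caught automatically before a deployment. The following is a minimal lint sketch, not a full validator: it flags a rule that appears before any User-agent line, whitespace inside a path, and `Disallow: /` in a record that applies to all crawlers. The function name `lint_robots` and the warning texts are made up for the example.

```python
from typing import List


def lint_robots(text: str) -> List[str]:
    """Return human-readable warnings for a few common robots.txt mistakes."""
    warnings: List[str] = []
    agents: List[str] = []  # User-agent values of the current record
    in_rules = False        # True once the current record has a rule line
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent after rules starts a new record
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("disallow", "allow"):
            in_rules = True
            if not agents:
                warnings.append(f"line {number}: rule before any User-agent")
            if " " in value:
                warnings.append(f"line {number}: whitespace inside path {value!r}")
            if field == "disallow" and value == "/" and "*" in agents:
                warnings.append(f"line {number}: 'Disallow: /' blocks the whole site")
    return warnings


sample = """\
Disallow: /tmp/
User-agent: *
Disallow: /admin /
Disallow: /
"""
for warning in lint_robots(sample):
    print(warning)
```

Run against the sample, the function reports one warning each for lines 1, 3, and 4. A check like this could run in CI so that a stray `Disallow: /` fails the build instead of reaching production.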
Security recommendations
Robots.txt is public and indexed by search engines — anyone can read it. Publishing paths to admin panels, backups, internal APIs, or config files in robots.txt effectively hands attackers a map of your site.
- Do not expose sensitive paths — backups (.zip, .sql), config files (.env, config.php), internal APIs.
- Protect confidential sections at the server level — HTTP Basic Auth, IP allowlist, OAuth — not via robots.txt.
- Keep the Disallow list minimal — only paths that genuinely need to be excluded from crawling for SEO reasons.
- Regularly review robots.txt in GSC — make sure deployments haven't accidentally introduced Disallow: /.
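The first recommendation can be partly automated by scanning the file for paths that look sensitive. The sketch below is only a starting point: the `SENSITIVE` pattern list is an assumption chosen for illustration and should be adapted to the site, and the function name `sensitive_rules` is hypothetical.

```python
import re
from typing import List

# Illustrative patterns only; adjust to what counts as sensitive on your site.
SENSITIVE = re.compile(r"\.(zip|sql|env|bak)\b|config\.php|backup", re.IGNORECASE)


def sensitive_rules(text: str) -> List[str]:
    """Return Disallow/Allow paths that appear to reveal sensitive locations."""
    hits: List[str] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() in ("disallow", "allow"):
            path = value.strip()
            if SENSITIVE.search(path):
                hits.append(path)
    return hits


sample = """\
User-agent: *
Disallow: /admin/
Disallow: /backups/site-2024.zip
Disallow: /.env
Disallow: /search
"""
print(sensitive_rules(sample))  # ['/backups/site-2024.zip', '/.env']
```

Any path this flags is a candidate for removal from robots.txt and for protection at the server level instead.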
Checklist and final example
- File is accessible at https://example.com/robots.txt and returns HTTP 200
- UTF-8 encoding, no BOM (Byte Order Mark)
- Every block starts with User-agent
- Blank line separates blocks for different User-agents
- No combination of Disallow + noindex on the same page
- No sensitive paths exposed publicly
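- Crawl rate limited if server load is high: Crawl-delay for crawlers that honor it (such as Bing), the crawl rate setting in Yandex Webmaster for Yandex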
- Sitemap directive uses an absolute URL and points to the sitemap or sitemap index file
- File validated in GSC and Yandex Webmaster
- Screaming Frog shows the expected number of blocked URLs
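A few checklist items (encoding, BOM, Sitemap URL) can be verified on the raw bytes of the file. The sketch below assumes the file content has already been fetched, for example with the curl command shown earlier; the function name `check_file` and the messages are illustrative, and it covers only these three checks.

```python
import codecs
from typing import List
from urllib.parse import urlparse


def check_file(raw: bytes) -> List[str]:
    """Check encoding, BOM, and Sitemap URLs of a robots.txt given as bytes."""
    problems: List[str] = []
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    sitemaps: List[str] = []
    for line in text.splitlines():
        # partition splits at the first colon, so the URL's own colon survives
        field, sep, value = line.split("#", 1)[0].partition(":")
        if sep and field.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    if not sitemaps:
        problems.append("no Sitemap directive found")
    for url in sitemaps:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"Sitemap is not an absolute URL: {url}")
    return problems


good = b"User-agent: *\nDisallow: /admin/\nSitemap: https://example.com/sitemap.xml\n"
bad = codecs.BOM_UTF8 + b"User-agent: *\nDisallow:\nSitemap: /sitemap.xml\n"

print(check_file(good))  # []
print(check_file(bad))
```

For the `bad` sample the function returns two problems: the BOM and the relative Sitemap URL.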
Final example robots.txt for example.com with comments — a universal template covering most scenarios:
# robots.txt for example.com
# Updated: 2026-05-15
# Rules for all crawlers
User-agent: *
# Admin and auth sections
Disallow: /admin/
Disallow: /login
Disallow: /dashboard/
Disallow: /account/
# Technical paths
Disallow: /api/
Disallow: /_next/
Disallow: /cdn-cgi/
# Parametric duplicates (e-commerce)
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?ref=
Disallow: /search
# Staging and temporary pages
Disallow: /staging/
Disallow: /webinar-drafts/
Disallow: /templates/
# Explicit allow for a useful API path
Allow: /api/og-image/
# Crawl rate hint for crawlers that honor it (ignored by Google).
# Kept in the * record: a bot with its own record ignores all * rules above.
Crawl-delay: 1

# Aggressive SEO bots (optional)
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

# Sitemap
Sitemap: https://example.com/sitemap.xml