XML Sitemap: complete guide — attributes, types, localisation and robots.txt

Complete XML sitemap guide: required and optional attributes, sitemap index, specialised maps for images, video and news, hreflang for localised sites, the single-line robots.txt rule — with links to Google and Yandex documentation.

An XML sitemap is a file you hand directly to the search crawler instead of waiting for it to discover pages by following links. You explicitly say: "here is a list of URLs I want indexed, here is when they were last updated". It is not a guarantee of indexing, but it significantly speeds up discovery and reduces crawl budget waste.

Google introduced the Sitemaps protocol in 2005. In November 2006 Google, Yahoo and Microsoft announced joint support for it under sitemaps.org. Today all major search engines, including Yandex, support the standard.

How Googlebot uses a sitemap: it gets the URL list directly and crawls pages without wandering through internal links.

What is an XML sitemap and why you need it

A search bot discovers pages in two ways: by following internal links and by reading sitemaps. Link-based crawling works well for pages with many inbound links. But pages without any links, recently published content, or sections with sparse internal linking may be missed or discovered too slowly. That is exactly where a sitemap helps.

Speeds up crawling

The bot receives a URL list directly and doesn't have to traverse internal links page by page.

Signals updates

The lastmod attribute tells the search engine a page has changed and needs to be re-crawled.

Supports multilingual sites

Through xhtml:link you can declare hreflang relationships directly in the sitemap, without duplicating tags in HTML.

Sitemaps are especially valuable for: new sites without inbound links; large sites (thousands of pages); pages with rich media content (images, video); multilingual sites with hreflang.

Basic file structure

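A minimal valid sitemap is an XML file with an encoding declaration, a root urlset element, and url elements inside it. Each url must contain the one required child element, loc. (Strictly speaking loc, lastmod, changefreq and priority are child elements; this guide follows common usage and calls them attributes.)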

XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-05-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2026-04-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://example.com/blog/article-slug</loc>
    <lastmod>2026-05-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.75</priority>
  </url>
</urlset>

The file must be UTF-8 encoded. Special characters in URLs must be XML-escaped: & becomes &amp;, < becomes &lt;, > becomes &gt;, " becomes &quot; and ' becomes &apos;. A single sitemap file is limited to 50,000 URLs and 50 MB (52,428,800 bytes) uncompressed.
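
As an illustration of the structure and escaping rules above, here is a minimal Python sketch that builds a sitemap with the standard library. The function name build_sitemap and the sample URLs are assumptions made for this example; the serializer takes care of the UTF-8 declaration and of escaping & in URLs.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML as bytes for (loc, lastmod) pairs; lastmod may be None.

    build_sitemap is a hypothetical helper for this article, not a library API.
    """
    ET.register_namespace("", SITEMAP_NS)  # emit xmlns="..." without a prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # Text is escaped on output, so a raw & in the URL becomes &amp;
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


xml_bytes = build_sitemap([
    ("https://example.com/", "2026-05-15"),
    ("https://example.com/search?q=a&page=2", None),  # sample URL with an ampersand
])
print(xml_bytes.decode("utf-8"))
```

The xml_declaration argument of ET.tostring requires Python 3.8 or newer.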

Attributes: loc, lastmod, changefreq, priority

Attribute | Required | Format | Description
loc | Yes | Absolute URL | Full page address including protocol and domain. Maximum 2,048 characters.
lastmod | No | W3C Datetime (YYYY-MM-DD or a full timestamp) | Last modification date. Google uses it only when the value is consistently and verifiably accurate.
changefreq | No | always / hourly / daily / weekly / monthly / yearly / never | Hint about update frequency. Google's documentation states that it ignores this value.
priority | No | 0.0 to 1.0 | Relative importance of the URL within your site (default 0.5). Google's documentation states that it ignores this value; it does not affect rankings in search results.

loc — the only required attribute

The URL must be absolute and match what the server actually returns: if the page is served over HTTPS, loc must use HTTPS. If the site uses www, loc must include www. When loc does not match the real address (for example, it redirects to another URL), the search engine has to resolve the mismatch itself and may ignore the entry.

lastmod — the most valuable optional attribute

The date must reflect real content changes, not template updates or sitemap regeneration. If you update lastmod on every deploy without changing actual content, Google stops trusting the field and ignores it. Accepted formats: 2026-05-15, 2026-05-15T10:30:00+03:00, 2026-05-15T07:30:00Z.

Do not automatically set today's date on every build. The date should only change when the main content of the page actually changes.
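
One way to follow this rule is to tie lastmod to a hash of the page's main content rather than to the build time. The sketch below assumes a simple in-memory dictionary as the state store and a hypothetical update_lastmod helper; a real site would persist the state between builds.

```python
import hashlib
from datetime import date


def update_lastmod(state, url, main_content, today=None):
    """Return lastmod for url, changing it only when the content hash changes.

    state maps url -> {"hash": ..., "lastmod": ...} and is updated in place.
    Both the helper and the state layout are assumptions for this example.
    """
    today = today or date.today().isoformat()
    digest = hashlib.sha256(main_content.encode("utf-8")).hexdigest()
    record = state.get(url)
    if record is None or record["hash"] != digest:
        # New page or changed content: record the new hash and date
        state[url] = {"hash": digest, "lastmod": today}
    return state[url]["lastmod"]


state = {}
url = "https://example.com/about"
first = update_lastmod(state, url, "About us", today="2026-04-01")
second = update_lastmod(state, url, "About us", today="2026-05-15")  # rebuild, same content
third = update_lastmod(state, url, "About us, updated", today="2026-05-16")
print(first, second, third)  # 2026-04-01 2026-04-01 2026-05-16
```

Hashing only the main content (not the full rendered HTML) keeps template or footer changes from moving the date.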

changefreq — a hint, not a directive

changefreq does not control the crawler's schedule; it is only a hint. Google's documentation states that it ignores the changefreq value. Yandex treats the field as informational too. If you still emit it for other consumers, practical defaults are: homepage daily, blog posts monthly, legal pages yearly.

priority — relative importance within your site

Priority from 0.0 to 1.0 tells a search engine which pages you consider more important relative to your own site, not compared to other sites. Google's documentation states that it ignores priority, so the value matters only for engines that read it. Setting 1.0 for every page is meaningless: it amounts to no prioritisation at all. A sensible scheme: homepage 1.0, section hubs 0.85, articles and products 0.7 to 0.75, utility pages 0.5.

Sitemap Index: when one file isn't enough

If you have more than 50,000 pages or the file exceeds 50 MB, you need a Sitemap Index. This is an XML file that references child sitemaps rather than URLs directly. Each child file is a regular urlset, capped at 50,000 entries, and one index can list up to 50,000 child sitemaps.

XML
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap/blog.xml</loc>
    <lastmod>2026-05-15T00:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap/products.xml</loc>
    <lastmod>2026-05-14T00:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap/static.xml</loc>
    <lastmod>2026-04-01T00:00:00Z</lastmod>
  </sitemap>
</sitemapindex>

You can split by content type (blog, products, static pages) or by locale (ru.xml, en.xml). Both approaches are valid. Splitting by locale makes it easy to monitor indexation status per language separately.

In robots.txt it's enough to list only the root Sitemap Index — you don't need to list every child file separately. The search engine discovers them through the index.
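
A rough sketch of the splitting step: chunk the URL list into groups of at most 50,000 and list one child file per chunk in the index. The helper names and the pages-N.xml file naming are assumptions for this example, and writing the child files themselves is left out.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit from the sitemaps.org protocol


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(child_locs):
    """Return a sitemapindex document (bytes) listing the child sitemap URLs."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for loc in child_locs:
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = loc
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


# 120,000 sample URLs need three child files
urls = [f"https://example.com/p/{n}" for n in range(120_000)]
chunks = chunk_urls(urls)
# Hypothetical naming scheme for the child files
child_locs = [f"https://example.com/sitemap/pages-{i + 1}.xml" for i in range(len(chunks))]
index_xml = build_index(child_locs)
print([len(chunk) for chunk in chunks])  # [50000, 50000, 20000]
```

This splits by count only; a production generator would also check the 50 MB size limit for each child file.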

Image sitemap

Images can appear in Google Images and bring additional traffic. To help the crawler discover and understand them, add the image extension to your regular urlset. Each url entry can contain up to 1,000 images.

XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/gallery/city</loc>
    <!-- Google deprecated image:title, image:caption, image:geo_location and
         image:license in 2022 and ignores them, so only image:loc is listed. -->
    <image:image>
      <image:loc>https://example.com/images/city-skyline.jpg</image:loc>
    </image:image>
  </url>
</urlset>

Child tags of image:image:

Tag | Required | Description
image:loc | Yes | Absolute image URL. Can be on a different domain (e.g. CDN).
image:title | No | Image title. Deprecated by Google in 2022 and now ignored.
image:caption | No | Image caption. Deprecated by Google in 2022 and now ignored.
image:geo_location | No | Geographic location of the subject in the image. Deprecated by Google in 2022 and now ignored.
image:license | No | URL of the image license. Deprecated by Google in 2022 and now ignored.

Images in the sitemap do not replace the alt attribute in HTML — both work together. The sitemap helps the crawler discover images; alt describes their meaning.

Video sitemap

A video sitemap enables your content to appear in Google Video search and in rich snippets with video previews. The namespace is video.

XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://example.com/tutorials/getting-started</loc>
    <video:video>
      <video:thumbnail_loc>https://example.com/thumbnails/tutorial-1.jpg</video:thumbnail_loc>
      <video:title>Getting started with the product</video:title>
      <video:description>Step-by-step guide for first-time users</video:description>
      <video:content_loc>https://example.com/videos/tutorial-1.mp4</video:content_loc>
      <video:duration>183</video:duration>
      <video:publication_date>2026-03-10T12:00:00+00:00</video:publication_date>
      <video:family_friendly>yes</video:family_friendly>
    </video:video>
  </url>
</urlset>

video:duration is in seconds. video:content_loc must point to a playable file (mp4, webm), not a player page. For YouTube or Vimeo hosted videos, use video:player_loc instead of video:content_loc.

News sitemap

Google News Sitemap is a special format for news publishers. It should include only articles published within the last 48 hours, and Google's documentation caps a news sitemap at 1,000 URLs. Once an article is older than that, remove it from the news sitemap (or drop its news:news metadata); it can stay in a regular sitemap.

XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/news/tech-breakthrough-2026</loc>
    <news:news>
      <news:publication>
        <news:name>Example Daily</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2026-05-15T09:00:00+00:00</news:publication_date>
      <news:title>Technology breakthrough set to transform the industry</news:title>
    </news:news>
  </url>
</urlset>

The News Sitemap format is not suitable for regular blogs. It is designed for news publishers whose content is eligible for Google News; on other sites it gives no benefit.
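
The 48-hour window can be enforced when the news sitemap is generated. The sketch below is one possible filter; the recent_articles helper and the sample articles are assumptions for this example, and timestamps are expected to be timezone-aware.

```python
from datetime import datetime, timedelta, timezone

NEWS_WINDOW = timedelta(hours=48)  # the 48-hour window described above


def recent_articles(articles, now=None):
    """Keep only (url, published_at) pairs published within the last 48 hours.

    recent_articles is a hypothetical helper; published_at must be timezone-aware.
    """
    now = now or datetime.now(timezone.utc)
    return [
        (url, published_at)
        for url, published_at in articles
        if timedelta(0) <= now - published_at <= NEWS_WINDOW
    ]


now = datetime(2026, 5, 15, 12, 0, tzinfo=timezone.utc)
articles = [
    ("https://example.com/news/fresh", datetime(2026, 5, 15, 9, 0, tzinfo=timezone.utc)),
    ("https://example.com/news/old", datetime(2026, 5, 12, 9, 0, tzinfo=timezone.utc)),
]
fresh = recent_articles(articles, now)
print([url for url, _ in fresh])  # ['https://example.com/news/fresh']
```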

Localisation: hreflang in sitemaps

If your site exists in multiple languages or for multiple regions, you need to connect the corresponding pages through hreflang. This can be done three ways: via link rel tags in HTML, via HTTP Link headers, or directly in the sitemap. The sitemap approach is most practical for large sites because it requires no changes to page templates.

In the sitemap, hreflang is declared through the xhtml namespace. Each page must appear as a url entry, and each entry must list all language variants — including itself.

XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <!-- Russian version -->
  <url>
    <loc>https://example.com/blog/seo-guide</loc>
    <xhtml:link rel="alternate" hreflang="ru" href="https://example.com/blog/seo-guide"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/blog/seo-guide"/>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/blog/seo-guide"/>
  </url>
  <!-- English version -->
  <url>
    <loc>https://example.com/en/blog/seo-guide</loc>
    <xhtml:link rel="alternate" hreflang="ru" href="https://example.com/blog/seo-guide"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/blog/seo-guide"/>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/blog/seo-guide"/>
  </url>
</urlset>

The x-default value indicates the fallback page — the one shown to users who don't match any explicit language setting. This is typically the site's primary language version.

Reciprocity rule: if page A references page B through hreflang, page B must reference page A in return. One-sided hreflang declarations are treated as errors by Google and ignored.

hreflang code | Use case
ru | Russian language, any region
en | English language, any region
en-US | English language, US region
ru-RU | Russian language, Russia region
x-default | Default page for undetermined language/region
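
The reciprocity rule is easy to break when language versions are generated separately, so it can be worth checking automatically. The sketch below parses a sitemap and reports every alternate that does not link back; the missing_return_links name and the deliberately broken sample are assumptions for this example.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def missing_return_links(sitemap_xml):
    """Return sorted (source, target) pairs where target does not link back to source."""
    root = ET.fromstring(sitemap_xml)
    alternates = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS).strip()
        alternates[loc] = {link.get("href") for link in url.findall("xhtml:link", NS)}
    return sorted(
        (src, dst)
        for src, targets in alternates.items()
        for dst in targets
        # A target missing from the sitemap also counts as a missing return link
        if src not in alternates.get(dst, set())
    )


# Sample with a deliberate error: the English page does not reference the Russian one
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/blog/seo-guide</loc>
    <xhtml:link rel="alternate" hreflang="ru" href="https://example.com/blog/seo-guide"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/blog/seo-guide"/>
  </url>
  <url>
    <loc>https://example.com/en/blog/seo-guide</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/blog/seo-guide"/>
  </url>
</urlset>"""

problems = missing_return_links(sample)
print(problems)
```

For the sample above the check reports one pair: the Russian page points to the English page, but not the other way round.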

Robots.txt: one line is enough

To let search engines find your sitemap, add a single Sitemap directive to robots.txt. If you use a Sitemap Index, point to it. The crawler discovers all child files through the index automatically.

TEXT
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /

Sitemap: https://example.com/sitemap.xml

If you have multiple root index files (e.g. separate ones for ru and en), you can list multiple Sitemap directives:

TEXT
Sitemap: https://example.com/sitemap/index-ru.xml
Sitemap: https://example.com/sitemap/index-en.xml

Beyond robots.txt, submit the sitemap in Google Search Console (Sitemaps section) and Yandex Webmaster (Indexing → Sitemap files). This speeds up initial processing and gives you statistics on discovered and indexed URLs.
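
To confirm that the Sitemap directives in robots.txt are readable by a parser, Python's standard urllib.robotparser can be used. This is a small sketch that parses the file from a string; the robots.txt content combines the two examples above, and site_maps() requires Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content, combining the two examples above
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /

Sitemap: https://example.com/sitemap/index-ru.xml
Sitemap: https://example.com/sitemap/index-en.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse from memory instead of fetching a URL

sitemaps = parser.site_maps()  # list of Sitemap URLs, or None if there are none
print(sitemaps)
print(parser.can_fetch("*", "https://example.com/admin/users"))  # False
```

Note that this parser applies the first matching rule, which can differ from how individual search engines resolve conflicting Allow and Disallow lines.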

Google and Yandex documentation

Official Google sitemap documentation

Official Yandex Webmaster Sitemap documentation

Yandex supports the sitemaps.org standard and additionally supports the news extension. Yandex reads changefreq and priority but recommends focusing on accurate lastmod — it matters more for re-crawling updated pages.

Checklist

  • File is UTF-8 encoded with <?xml version="1.0" encoding="UTF-8"?> on the first line
  • All URLs are absolute (with protocol and domain) and match the page's canonical URL
  • HTTPS in loc when the site serves HTTPS
  • lastmod reflects the real content change date — not the build date
  • File does not exceed 50,000 URLs or 50 MB uncompressed
  • For large sites — Sitemap Index with child files
  • For multilingual pages — xhtml:link with hreflang including x-default
  • For media content — appropriate extensions (image, video, news)
  • robots.txt contains a Sitemap directive with an absolute URL
  • Sitemap submitted in Google Search Console and Yandex Webmaster
  • Pages with noindex are not included in the sitemap
  • Pages with canonical pointing to a different URL are not included
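
Several of the checklist items can be verified mechanically before a sitemap is published. The sketch below covers only the size limits and the absolute-URL, scheme and host checks; check_sitemap and its expected_scheme and expected_host parameters are assumptions for this example.

```python
from urllib.parse import urlparse

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB uncompressed
MAX_LOC_LENGTH = 2048


def check_sitemap(locs, size_bytes, expected_scheme="https", expected_host="example.com"):
    """Return a list of problems; an empty list means every check passed.

    check_sitemap is a hypothetical helper covering part of the checklist only.
    """
    problems = []
    if len(locs) > MAX_URLS:
        problems.append(f"too many URLs: {len(locs)}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    for loc in locs:
        parts = urlparse(loc)
        if not parts.scheme or not parts.netloc:
            problems.append(f"not absolute: {loc}")
        elif parts.scheme != expected_scheme or parts.netloc != expected_host:
            problems.append(f"scheme or host mismatch: {loc}")
        if len(loc) > MAX_LOC_LENGTH:
            problems.append(f"loc too long: {loc[:60]}")
    return problems


report = check_sitemap(
    [
        "https://example.com/",
        "/about",                     # relative URL
        "http://example.com/old",     # wrong scheme
        "https://www.example.com/x",  # wrong host
    ],
    size_bytes=1024,
)
for problem in report:
    print(problem)
```

The noindex and canonical items need the pages themselves to be fetched, so they are outside the scope of this sketch.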

FAQ

Is a sitemap required?
Technically no. If the site is well linked internally and all pages are reachable from the homepage, the bot will find them on its own. But a sitemap is still beneficial: it speeds up discovery and lets you tell search engines which URLs you want indexed.

Should pages with noindex be included in the sitemap?
No. Pages with a noindex meta tag or X-Robots-Tag: noindex must not appear in the sitemap, because the contradiction confuses search engines. Include only pages you want indexed.

Does priority affect rankings?
No. Priority is a hint about relative importance within your site, not a ranking signal. Google ignores it; other engines may use it when deciding which pages to crawl first. It has no effect on position in search results.

How often should the sitemap be updated?
Whenever pages are added or removed. For dynamic sites (blog, e-commerce) it is best to regenerate the sitemap automatically on build, keeping lastmod tied to real content changes. Static sites can update it manually when the structure changes.

How many Sitemap directives can robots.txt contain?
The standard doesn't limit the number of Sitemap directives in robots.txt. But best practice is one Sitemap Index, from which Google and Yandex discover all other files.