SEO · 18 years of practice · updated June 2026

Robots.txt in 2026: syntax, AI crawlers, and how to test the file | SEOquick

One robots.txt file controls how Google and AI crawlers browse your site. Here is the syntax, the common mistakes, and the new rules for 2026.

Author

SEOquick Team

CEO · SEO Strategy · ~9 min read

Fact-check

Anatolii Ulitovskyi

Founder · AI & GEO · June 2026

Robots.txt is a text file in the root of your site that controls crawling: it tells search and AI crawlers which sections to visit and which to skip. But remember the main rule of 2026: robots.txt controls crawling, not indexing. To remove a page from search, you need noindex, not Disallow.

Anyone who works on website promotion should understand this file and know how to write the most common directives. A well-built robots.txt helps you save crawl budget and is a basic technical SEO tool. A mistake in a single line, however, can block your entire site from Google or break page rendering.

To understand how robots.txt works, recall how search engines operate. A search engine does two jobs: it crawls the web for new content, and it indexes that content so users can find it. Following billions of links, a bot behaves like a spider on a web: it moves from page to page and looks for what is new.

When a bot arrives at a site, it looks for the robots.txt file before crawling anything. If the file exists, the bot reads the instructions and acts accordingly. If there is no file, or it contains no restrictions, the bot crawls everything.

First look at Robots.txt

Robots.txt is a plain text file created by a webmaster to instruct crawlers. It holds recommendations on how to crawl the pages of a site. In simple terms: the file tells the bot where not to go, what to crawl for search, and what not to.

The file lives in the root directory of the site. Every time a crawler arrives, it looks for it in one specific place — the main directory of the domain. If there is no file at example.com/robots.txt, the bot assumes there are no instructions at all and crawls everything.

Key technical details for 2026:

The file name is case-sensitive: it must be named exactly "robots.txt" (not Robots.txt or robots.TXT).
It is a public file — any user can see it at /robots.txt. So never use it to hide confidential data.
Each subdomain must have its own robots.txt: both blog.example.com and example.com are crawled via separate files.
The encoding is standard UTF-8, otherwise crawlers may read the content incorrectly.
The size limit for Google is 500 KiB; content beyond that limit is ignored.

Note: if robots.txt is not in the root (for example, example.com/index/robots.txt), it will not be taken into account.
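
The location and size rules above can be sketched in a few lines of Python. This is a minimal illustration, not an official checker: the function names robots_url and check_robots_body are hypothetical, and the 500 KiB constant reflects Google's documented limit.

```python
from urllib.parse import urlsplit

# Google's documented limit; other crawlers may use a different one.
MAX_ROBOTS_BYTES = 500 * 1024


def robots_url(page_url):
    """Return the robots.txt URL that governs page_url.

    The file always sits in the root of the same scheme and host,
    so each subdomain resolves to its own file.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def check_robots_body(body):
    """Return a list of problems found in the raw bytes of a robots.txt file."""
    problems = []
    if len(body) > MAX_ROBOTS_BYTES:
        problems.append("file exceeds 500 KiB; rules past the limit are ignored")
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    return problems
```

For example, robots_url("https://blog.example.com/post/1") points to the blog subdomain's own file, not to the one on example.com.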

Why does this matter? Mostly to save crawl budget and keep the index tidy: so the crawler does not waste time on service sections, filters, and parameters, but focuses on important pages. A solid robots.txt is a mandatory part of a technical site audit.

What robots.txt can and cannot do

Robots.txt controls crawler access to certain areas of a site. This is useful but dangerous: a single line can accidentally forbid Googlebot from crawling the whole resource. To avoid confusion, keep a clear task table in mind.

Job	Use robots.txt?	Note
Reduce crawl waste	yes	filters, parameters, technical folders
Block CSS/JS	usually no	Google needs these files to render the page
Remove a page from index	no	use noindex or remove the URL
Point to sitemap	yes	useful for search engines
Hide private data	no	use authentication, not robots

Where robots.txt is appropriate:

Crawl budget savings. Block crawling of filters, sorting parameters (?sort=, ?color=), internal search results, and endless URL combinations.
Service sections. Admin, cart, user account, technical folders.
Pointing to the Sitemap. It is useful to specify the path to the XML sitemap in the file.
Reducing server load from too-frequent crawler requests to heavy sections.

Where robots.txt is useless or harmful:

Removing a page from the index. Disallow does not remove a URL from search — you need noindex or the URL removal tool in Search Console.
Hiding private data. Use authentication and a password, not robots.txt.
Blocking CSS and JS. If you block resources needed for rendering, Google sees a "broken" page. According to a 2026 audit, about 63% of large sites accidentally block important CSS/JS due to careless wildcard rules.

Note: a page blocked in robots.txt can still appear in search results if it is linked to from this site or elsewhere — just without a description (snippet).

To check whether the file exists, type the root domain into the address bar and add /robots.txt.

Robots.txt syntax: the core directives

Robots.txt syntax is simple. Each line is a field, a colon, and a value. Field names are case-insensitive, but path values (after Disallow/Allow) are case-sensitive. In its simplest form the file looks like this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml

Let us cover the key directives that matter in 2026:

User-agent — the name of the crawler the rules are addressed to. An asterisk (*) means "for all bots." Rule blocks for different User-agents are separated by an empty line.
Disallow — a ban on crawling the specified path. One Disallow line per path.
Allow — permission to crawl a page or subfolder, even if the parent folder is blocked. Supported by Google and Bing.
Sitemap — points to the location of the XML sitemap. Must be a full URL with protocol. You can specify several sitemaps.
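
How Allow can override a blocked parent folder is easier to see in code. Google resolves conflicts by picking the most specific (longest) matching rule, and Allow wins a tie. The sketch below assumes plain path prefixes only (no wildcards), and the function name is_allowed is hypothetical.

```python
def is_allowed(path, rules):
    """Decide whether path may be crawled under one User-agent group.

    rules is a list of (directive, prefix) pairs. The longest matching
    prefix wins; if an Allow and a Disallow match with equal length,
    Allow wins. Wildcards are deliberately left out of this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value matches nothing
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The two rules from the sample file above.
rules = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]
```

With these rules, /wp-admin/admin-ajax.php stays crawlable while the rest of /wp-admin/ is blocked, regardless of the order of the two lines.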

If the file contains rules for several User-agents, a crawler applies only the group addressed specifically to it and ignores the others, including User-agent: *. Bots without a group of their own follow the general directives in the User-agent: * group.
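
Group selection can be checked with Python's standard urllib.robotparser module. The file contents and the helper may_crawl below are made up for illustration. Note that this module applies Allow and Disallow lines in file order rather than by longest match, so the sketch relies on it only for choosing the group.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one general group and one group for Googlebot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /example-subfolder/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(agent, path):
    """True if the named crawler may fetch the path on example.com."""
    return parser.can_fetch(agent, "https://example.com" + path)
```

Here may_crawl("Googlebot", "/private/page") is allowed, because Googlebot reads only its own group, while a bot with no group of its own is kept out of /private/ by the general rules.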

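An important nuance: the Crawl-delay directive is not supported by Googlebot. Google retired the crawl rate setting in Search Console in early 2024, so robots.txt and Search Console are both out. To slow Googlebot down temporarily, have the server return 500, 503, or 429 status codes.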

Special characters help when working with pages and subfolders. They are simple wildcards, not full regular expressions:

* — a wildcard, replaces any sequence of characters;
$ — matches the end of a URL;
# — a comment, the crawler ignores everything after it.
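
The three special characters can be sketched as a small matcher. This is an illustration under stated assumptions, not any crawler's actual implementation: the names strip_comment, pattern_to_regex, and matches are hypothetical, and the sample patterns (/*?sort= and /*.pdf$) are examples rather than rules from a real site.

```python
import re


def strip_comment(line):
    """Drop everything from '#' onward, as crawlers do."""
    return line.split("#", 1)[0].strip()


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the
    end of the URL. Every other character is matched literally,
    starting from the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """True if the rule pattern applies to the given URL path."""
    return pattern_to_regex(pattern).match(path) is not None
```

With this matcher, /*?sort= catches any URL that carries a sort parameter, and /*.pdf$ catches URLs that end in .pdf but not /files/report.pdf?download=1.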

A few practical examples. Block the entire site from all crawlers (useful for a site in development):

User-agent: *
Disallow: /

Open the entire site for crawling — an empty Disallow means "everything is allowed":

User-agent: *
Disallow:

Block a specific folder for Googlebot only:

User-agent: Googlebot
Disallow: /example-subfolder/

AI crawlers in robots.txt: the big new chapter of 2026

The most important change of recent years is AI crawlers. Today robots.txt manages not only Google but also bots of large language models. Here it is critical to understand the difference between two types of AI bots:

Training crawlers collect content to train models: GPTBot (OpenAI), Google-Extended (Gemini), ClaudeBot (Anthropic), CCBot (Common Crawl). Blocking these bots prevents your content from being used for training.
Search / RAG crawlers hit the site at the moment of a user query and provide citation with a link: OAI-SearchBot and ChatGPT-User (OpenAI), PerplexityBot, Claude-SearchBot. Blocking these bots removes you from AI-search impressions and traffic.

The recommended strategy for most businesses in 2026: block training crawlers but allow search crawlers. That way your content appears in AI-search answers with attribution and brings referral visits, but is not used to train someone else's models. Example block:

# Block model training
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow AI search with citation
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Important caveats. The old Claude-Web and anthropic-ai tokens are no longer active — sites that block only those are not actually blocking the current ClaudeBot. And remember: aggressive scrapers (for example, Bytespider or "stealth" crawlers) can ignore robots.txt and spoof their User-Agent. The only real way to protect crawl budget from such bots is at the server or WAF level. If you are bringing AI tools into your promotion, it is worth planning an access policy in advance — we help with this as part of AI tool development.
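
A quick audit for the outdated-token problem might look like the sketch below. The token lists are taken from this article and may need updating over time. The function names are hypothetical, and the check only confirms that a group is declared for each token, not what its rules say.

```python
# Token lists as named in this article; review them periodically.
TRAINING_TOKENS = ["GPTBot", "Google-Extended", "ClaudeBot", "CCBot"]
RETIRED_TOKENS = ["Claude-Web", "anthropic-ai"]


def declared_agents(robots_txt):
    """Return the lowercase set of User-agent values declared in the file."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def audit_ai_tokens(robots_txt):
    """Report training tokens with no group and retired tokens still listed."""
    agents = declared_agents(robots_txt)
    return {
        "missing_training_rules": [
            t for t in TRAINING_TOKENS if t.lower() not in agents
        ],
        "retired_tokens_present": [
            t for t in RETIRED_TOKENS if t.lower() in agents
        ],
    }
```

A file that lists only anthropic-ai and GPTBot would be reported as missing rules for Google-Extended, ClaudeBot, and CCBot, with one retired token still present.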

Robots.txt vs noindex: the key difference

This is the most common and most expensive mistake. Remember the formula: robots.txt controls crawling, noindex controls indexing.

Disallow in robots.txt forbids the crawler from visiting a page. But if external links point to the page, it can still appear in search — without a snippet.
noindex (the <meta name="robots" content="noindex"> tag or the X-Robots-Tag HTTP header) forbids adding the page to the index.

The main trap: do not combine Disallow and noindex on the same page. If you block a URL in robots.txt, the crawler cannot fetch the page and never sees the noindex tag, so the page can stay in the index. The rule of thumb: to remove a page from search, allow it to be crawled and add noindex; to save crawl budget on a section that should not be crawled at all, use Disallow.
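
The trap can be expressed as a small check. This is a rough sketch using only the Python standard library. The function names has_noindex and noindex_is_visible are hypothetical, and the Disallow rules are simplified to plain path prefixes.

```python
from html.parser import HTMLParser


class _RobotsMetaFinder(HTMLParser):
    """Collects the content of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.values.append((attr.get("content") or "").lower())


def has_noindex(headers, html):
    """True if noindex is set via the X-Robots-Tag header or the robots meta tag."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    finder = _RobotsMetaFinder()
    finder.feed(html)
    return any("noindex" in v for v in finder.values)


def noindex_is_visible(path, disallowed_prefixes, headers, html):
    """noindex only takes effect if crawlers are allowed to fetch the page."""
    blocked = any(path.startswith(p) for p in disallowed_prefixes)
    return has_noindex(headers, html) and not blocked
```

If noindex_is_visible returns False for a page that carries noindex, the likely cause is a Disallow rule hiding the tag from the crawler.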

Testing robots.txt and common mistakes

A broken robots.txt is a problem that takes time to detect. Before you publish the file, test it. Google provides a robots.txt report right in Search Console (Settings → robots.txt report): it shows the last fetch date, errors, and warnings.

The most common mistakes we see during search promotion of sites:

blocking the whole site with Disallow: / and forgetting to remove it after moving from staging;
blocking CSS and JS, so Google sees a broken, "non-mobile" page;
blocking URLs in robots.txt and expecting the page to disappear from the index (noindex is needed instead);
forgetting to specify the Sitemap;
not re-checking robots.txt after a redesign or migration;
blocking only outdated AI tokens, leaving current bots without rules.
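
Some of these mistakes can be caught automatically before the file goes live. The sketch below is a hypothetical pre-publish check, not a replacement for the Search Console report. The function name lint_robots and the warning texts are made up, and it covers only three items from the list: a site-wide block, blocked CSS/JS patterns, and a missing Sitemap line.

```python
def lint_robots(robots_txt):
    """Return warnings for a few common robots.txt mistakes (not exhaustive)."""
    warnings = []
    current_agents = []
    in_rules = False  # True once the current group has at least one rule
    has_sitemap = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if not sep:
            continue
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents, in_rules = [], False
            current_agents.append(value)
        elif field == "sitemap":
            has_sitemap = True
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow":
                if value == "/" and "*" in current_agents:
                    warnings.append("whole site blocked for all crawlers")
                if value.lower().rstrip("$").endswith((".css", ".js")):
                    warnings.append("rendering resource blocked: " + value)
    if not has_sitemap:
        warnings.append("no Sitemap line")
    return warnings
```

Running it on a leftover staging file (User-agent: * followed by Disallow: /) returns both the site-wide block warning and the missing Sitemap warning.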

If a page still shows in search after being blocked, check in Search Console whether Google has re-crawled the page, and whether there are external links to the blocked page. Timely analysis helps avoid trouble and saves time. Always verify the rules against the official Google docs: the robots.txt introduction and the guide on how Google interprets robots.txt.

FAQ: common questions about robots.txt

Will Disallow remove a page from Google search?

No. Disallow blocks crawling only. If external links point to the page, it can stay in search without a description. To remove it, use noindex or the URL removal tool in Search Console.

Can I put noindex directly in robots.txt?

No. Google has not officially supported the noindex directive in robots.txt since 2019. Use the robots meta tag or the X-Robots-Tag HTTP header on the page itself, without blocking it in robots.txt.

Should I block AI crawlers?

It depends on your strategy. If you do not want your content used for model training, block GPTBot, Google-Extended, ClaudeBot, CCBot. But allow search crawlers (OAI-SearchBot, PerplexityBot) to stay in AI search and earn referral visits.

Why should I not block CSS and JS?

Without these files, Googlebot cannot render the page correctly and sees it as "broken" — which hurts the mobile assessment and ranking. Always keep resources needed for rendering open.

Does Googlebot support the Crawl-delay directive?

No. Googlebot ignores Crawl-delay. The crawl rate setting in Search Console was retired in early 2024; to slow Googlebot down temporarily, have the server return 500, 503, or 429 status codes.

Does each subdomain need its own robots.txt?

Yes. Each subdomain is crawled via its own file. blog.example.com and example.com must have separate robots.txt files in the root.
