What is Robots.txt? Controlling Search Engine Crawling

Direct Answer

Robots.txt is a plain text file placed at the root of a website (yourdomain.com/robots.txt) that gives instructions to search engine crawlers about which pages and directories they should or should not access. When Googlebot visits a site, it reads robots.txt first and follows its instructions before crawling. Robots.txt can block specific directories, specific file types, or specific bot user agents. It is also used to declare the location of XML sitemaps. Critically, robots.txt controls crawling, not indexing — blocked pages can still appear in search results if they have external links pointing to them.

Robots.txt is one of the most commonly misunderstood technical SEO elements. Many site owners assume that blocking a page in robots.txt prevents it from appearing in search results, but this is incorrect. Robots.txt only prevents Googlebot from crawling the page; if external links point to the blocked page, Google can still index the URL based on the link signals alone, usually without a description. To keep a page out of search results, use a noindex directive on the page itself (a meta robots tag or an X-Robots-Tag HTTP header) and leave the page crawlable: if robots.txt blocks the URL, Googlebot never fetches the page and never sees the noindex.
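
The crawl-versus-index distinction can be checked mechanically. The sketch below uses Python's standard urllib.robotparser module to parse a hypothetical robots.txt (the example.com domain, the /private/ path, and the helper name are assumptions for illustration) and reports whether a crawler could fetch a page, which is the precondition for it ever reading a noindex directive there.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def noindex_would_be_seen(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the crawler may fetch the URL and could therefore
    read a noindex directive (meta tag or X-Robots-Tag header) on it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

blocked = noindex_would_be_seen(
    ROBOTS_TXT, "Googlebot", "https://example.com/private/report.html"
)
open_page = noindex_would_be_seen(
    ROBOTS_TXT, "Googlebot", "https://example.com/blog/post.html"
)
print(blocked)    # False: the page is not crawled, so a noindex on it is never read
print(open_page)  # True: the page is crawled, so a noindex on it can take effect
```

A page that returns False here can still be indexed from external links; removing the Disallow rule and adding noindex is the combination that keeps it out of results.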

Common robots.txt use cases

Block admin and staging areas: /wp-admin/, /admin/, and /staging/ are normally disallowed for crawlers (robots.txt is not access control, so sensitive areas still need authentication)
Block duplicate parameter URLs: filtering and sorting parameters that generate duplicate content
Block low-value areas: user profile pages, login pages, and internal search results pages
Manage specific crawlers: separate rule groups per user agent, for example allowing AI crawlers (GPTBot, PerplexityBot) while restricting other bots
Sitemap declaration: pointing bots to your XML sitemap location ('Sitemap: https://yourdomain.com/sitemap.xml')
Staging and test environments: blocking all bots from staging sites to reduce the risk of duplicate indexation
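
Several of these use cases can be combined in one file. The following is a minimal sketch, not a recommended configuration: the paths and the yourdomain.com host are placeholders, and the rules are checked with Python's standard urllib.robotparser. That parser does plain prefix matching, so wildcard patterns for parameter URLs are left out of this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical production rules; paths and domain are placeholders.
PRODUCTION_ROBOTS = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /admin/
Disallow: /login/
Disallow: /search/

Sitemap: https://yourdomain.com/sitemap.xml
"""

# Hypothetical staging rules: keep every compliant bot out.
STAGING_ROBOTS = """\
User-agent: *
Disallow: /
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

production = build_parser(PRODUCTION_ROBOTS)
staging = build_parser(STAGING_ROBOTS)

admin_allowed = production.can_fetch("Googlebot", "https://yourdomain.com/wp-admin/options.php")
search_allowed = production.can_fetch("Googlebot", "https://yourdomain.com/search/?q=shoes")
product_allowed = production.can_fetch("Googlebot", "https://yourdomain.com/products/blue-shoes")
staging_allowed = staging.can_fetch("Googlebot", "https://yourdomain.com/products/blue-shoes")
sitemaps = production.site_maps()  # available in Python 3.8+

print(admin_allowed, search_allowed, product_allowed, staging_allowed)
print(sitemaps)
```

Only the product URL on the production rules is fetchable; the staging rules block everything, and the sitemap URL is returned as declared.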

Can robots.txt harm SEO if configured incorrectly?

Yes. An incorrectly configured robots.txt is one of the most damaging technical SEO mistakes. Accidentally blocking Googlebot from the entire site (a common error after migrations, when a staging 'Disallow: /' rule is carried into production) stops Google from recrawling any page, and rankings and traffic can fall sharply, often within days. Blocking CSS and JavaScript files prevents Google from rendering pages as users see them, so it may misjudge their layout and content. Always test robots.txt changes before deploying, confirm how Google fetched and parsed the live file in the robots.txt report in Google Search Console (which replaced the older Robots.txt Tester), and monitor indexation immediately after any robots.txt change.
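
A pre-deployment check can catch the site-wide block before it ships. This is a sketch under stated assumptions: the URL list, the example.com host, and the function name are hypothetical, and Python's urllib.robotparser only approximates Googlebot's matching (it does not handle wildcards), so it complements rather than replaces checking in Search Console.

```python
from urllib.robotparser import RobotFileParser

def blocked_urls(robots_txt: str, must_crawl: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the URLs from must_crawl that the given agent may NOT fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in must_crawl if not parser.can_fetch(agent, url)]

# Hypothetical URLs that must stay crawlable: key pages plus render assets.
MUST_CRAWL = [
    "https://example.com/",
    "https://example.com/products/",
    "https://example.com/assets/site.css",
    "https://example.com/assets/app.js",
]

# A leftover staging rule, the kind of mistake that slips through a migration.
bad_candidate = "User-agent: *\nDisallow: /\n"
# A narrower rule that leaves pages and assets crawlable.
good_candidate = "User-agent: *\nDisallow: /admin/\n"

bad_result = blocked_urls(bad_candidate, MUST_CRAWL)
good_result = blocked_urls(good_candidate, MUST_CRAWL)
print(len(bad_result))  # 4
print(good_result)      # []
```

Run in a deployment pipeline, a non-empty result would be a reason to stop the release and review the file.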

Should robots.txt explicitly allow AI crawlers?

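Yes, if AI visibility is a goal: explicitly allowing AI crawlers by name makes your intent unambiguous, which is useful for AEO/GEO. The default User-agent: * group already allows all crawlers, including AI bots, but many sites have Disallow rules for subpaths in that group that may inadvertently block AI crawlers. Listing GPTBot, ClaudeBot, PerplexityBot, and other major AI crawlers in their own groups with Allow: / rules signals active cooperation with AI indexing; Google-Extended can be listed the same way, although it is a control token that governs use of content by Google's AI products rather than a separate crawler. Two details matter here. A crawler that matches a named group follows only that group and ignores the User-agent: * group, so an Allow: / group for GPTBot also lifts the * group's Disallow rules for GPTBot; repeat any Disallow rules that should still apply to it. Rules in a named group never change what other bots are allowed to do.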
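
The way a named group takes precedence over the User-agent: * group is easy to miss, so here is a small sketch of it. The rules, the /drafts/ path, and the SomeOtherBot name are hypothetical, and Python's urllib.robotparser stands in for a real crawler's parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a general Disallow plus a named group for GPTBot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/

User-agent: GPTBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/drafts/unpublished.html"

# A bot with no group of its own falls back to the * group.
generic_bot_allowed = parser.can_fetch("SomeOtherBot", url)
# GPTBot matches its own group, so the * group's Disallow does not apply to it.
gptbot_allowed = parser.can_fetch("GPTBot", url)

print(generic_bot_allowed)  # False
print(gptbot_allowed)       # True
```

If /drafts/ should stay off limits for GPTBot as well, the Disallow: /drafts/ line has to be repeated inside the GPTBot group.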
