Robots.txt is a plain text file placed at the root of a website (yourdomain.com/robots.txt) that gives instructions to search engine crawlers about which pages and directories they should or should not access. When Googlebot visits a site, it reads robots.txt first and follows its instructions before crawling. Robots.txt can block specific directories, specific file types, or specific bot user agents. It is also used to declare the location of XML sitemaps. Critically, robots.txt controls crawling, not indexing — blocked pages can still appear in search results if they have external links pointing to them.
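To make the directory, user-agent, and sitemap rules described above concrete, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `ExampleBot` name, and the `load_rules` helper are hypothetical and exist only for illustration. Python's parser also uses simple prefix matching, so it may not agree with Googlebot on every edge case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for yourdomain.com (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search

User-agent: ExampleBot
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A crawler without its own group falls back to the "*" group.
print(rules.can_fetch("Googlebot", "https://yourdomain.com/blog/post"))    # True
print(rules.can_fetch("Googlebot", "https://yourdomain.com/admin/users"))  # False

# ExampleBot has its own group, which blocks the whole site for it.
print(rules.can_fetch("ExampleBot", "https://yourdomain.com/blog/post"))   # False

# Sitemap lines are collected separately from the crawl rules (Python 3.8+).
print(rules.site_maps())  # ['https://yourdomain.com/sitemap.xml']
```

A result of `True` here means only that the URL may be crawled. It says nothing about whether the page will be indexed.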
Robots.txt is one of the most commonly misunderstood technical SEO elements. Many site owners assume that blocking a page in robots.txt prevents it from appearing in search results, but this is incorrect. Robots.txt only prevents Googlebot from crawling the page; if external links point to the blocked page, Google can still index the URL based on the link signals alone. To keep a page out of search results, a noindex directive on the page itself is required, and the page must not be blocked in robots.txt, because Google has to crawl the page in order to see the directive.
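As a rough illustration of what "a noindex directive on the page itself" looks like to a crawler, the sketch below checks a page's HTML and response headers for it. The `has_noindex` function and the `RobotsMetaParser` class are hypothetical helpers, not part of any SEO tool. The sketch assumes the common `<meta name="robots">` tag and `X-Robots-Tag` header forms, and it ignores user-agent-scoped header values.

```python
from html.parser import HTMLParser
from typing import List, Mapping, Optional


def _split_directives(value: str) -> List[str]:
    """Turn 'noindex, follow' into ['noindex', 'follow']."""
    return [part.strip().lower() for part in value.split(",") if part.strip()]


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> style tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            self.directives += _split_directives(attr.get("content", ""))


def has_noindex(html: str, headers: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if the page asks not to be indexed (simplified check)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = list(parser.directives)
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            directives += _split_directives(value)
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" in directives or "none" in directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))                                     # True
print(has_noindex("<p>PDF stub</p>", {"X-Robots-Tag": "noindex"}))  # True
```

A crawler can only read these directives if it is allowed to fetch the page, which is why a noindexed page should not also be disallowed in robots.txt.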
Common robots.txt use cases
- Block admin and staging areas — /wp-admin/, /admin/, /staging/ should always be blocked from crawlers
- Block duplicate parameter URLs — filtering and sorting parameters that generate content duplicates
- Block low-value areas — user profile pages, login pages, and search results pages
- Allow specific crawlers — useful for allowing AI crawlers (GPTBot, PerplexityBot) while managing other bots
- Sitemap declaration — pointing bots to your XML sitemap location ('Sitemap: https://yourdomain.com/sitemap.xml')
- Test staging environments — blocking all bots from staging sites to prevent duplicate indexation
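The use cases above can be sketched as a small generator that produces one robots.txt for production and another for staging. The `build_robots_txt` function, its parameters, and the `/login` and `/search` paths are assumptions made for illustration. Parameter-based blocking usually relies on wildcard patterns, which Python's standard parser does not evaluate, so it is left out of this sketch.

```python
from typing import Iterable, Optional
from urllib.robotparser import RobotFileParser


def build_robots_txt(
    disallow_paths: Iterable[str] = (),
    sitemap_url: Optional[str] = None,
    block_all: bool = False,
) -> str:
    """Build robots.txt text for a single 'User-agent: *' group."""
    lines = ["User-agent: *"]
    if block_all:
        # Staging: keep every compliant crawler out of the whole site.
        lines.append("Disallow: /")
    else:
        paths = list(disallow_paths)
        # An empty Disallow value means "nothing is blocked".
        lines += [f"Disallow: {path}" for path in paths] or ["Disallow:"]
        if sitemap_url:
            lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


production = build_robots_txt(
    ["/wp-admin/", "/admin/", "/staging/", "/login", "/search"],
    sitemap_url="https://yourdomain.com/sitemap.xml",
)
staging = build_robots_txt(block_all=True)


def parse(text: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


print(production)
print(parse(production).can_fetch("Googlebot", "https://yourdomain.com/wp-admin/options.php"))  # False
print(parse(production).can_fetch("Googlebot", "https://yourdomain.com/blog/"))                 # True
print(parse(staging).can_fetch("Googlebot", "https://staging.yourdomain.com/"))                 # False
```

Generating the staging and production files from the same function makes it harder for a staging `Disallow: /` rule to reach production unnoticed.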
An incorrectly configured robots.txt is one of the most damaging technical SEO mistakes. Accidentally blocking Googlebot from the entire site (a common error after migrations) stops Google from recrawling any page, and rankings and indexed pages can start to drop within days. Blocking CSS and JavaScript files prevents Google from rendering pages correctly, so it may misjudge their layout and content. Always test robots.txt changes before deploying them, then confirm in Google Search Console that Google has fetched the new file without errors (the robots.txt report has replaced the older Robots.txt Tester), and monitor indexation closely after any robots.txt change.
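One way to catch these mistakes before deployment is a small check that runs a candidate robots.txt against a list of URLs that must stay crawlable. The sketch below assumes a hypothetical `blocked_critical_urls` helper and made-up asset paths. It uses Python's `urllib.robotparser`, which approximates Googlebot's matching rules but does not replicate them exactly.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


def blocked_critical_urls(
    robots_txt: str,
    critical_urls: Iterable[str],
    user_agent: str = "Googlebot",
) -> List[str]:
    """Return the critical URLs that the candidate robots.txt would block."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in critical_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical URLs that must remain crawlable, including render-critical assets.
CRITICAL = [
    "https://yourdomain.com/",
    "https://yourdomain.com/assets/site.css",
    "https://yourdomain.com/assets/app.js",
]

site_wide_block = "User-agent: *\nDisallow: /\n"        # leftover from staging
assets_blocked = "User-agent: *\nDisallow: /assets/\n"  # breaks rendering
safe = "User-agent: *\nDisallow: /admin/\n"

print(len(blocked_critical_urls(site_wide_block, CRITICAL)))  # 3
print(blocked_critical_urls(assets_blocked, CRITICAL))
print(blocked_critical_urls(safe, CRITICAL))                  # []
```

A check like this could run in a deployment pipeline and fail the build whenever the returned list is not empty.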
Explicitly allowing AI crawlers by name is generally treated as a positive AEO/GEO practice. A User-agent: * group with no Disallow rules allows all crawlers, including AI bots, but many sites have disallow rules for subpaths that may inadvertently block AI crawlers. Explicitly listing GPTBot, Google-Extended, ClaudeBot, PerplexityBot, and other major AI crawlers with Allow: / rules makes the intent to cooperate with AI indexing unambiguous. A named group applies only to the crawler it names and leaves the rules for other bots unchanged. A crawler with its own group follows that group instead of the User-agent: * group, so any disallow rules it should still respect must be repeated there.
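The interaction between a named AI crawler group and the general User-agent: * group can be checked with a short sketch. The robots.txt variants, the `/private/` path, and the `can_fetch` wrapper are illustrative assumptions. The behavior shown is that of Python's `urllib.robotparser`, which picks the named group over the `*` group, in line with the group-selection rule in the Robots Exclusion Protocol.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and check one user agent against one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Variant 1: GPTBot gets a blanket Allow, so it no longer inherits "*" rules.
ALLOW_ONLY = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /private/
"""

# Variant 2: the disallow rule is repeated inside the GPTBot group.
# (Disallow is listed first because Python's parser uses the first match.)
ALLOW_WITH_REPEATED_RULE = """\
User-agent: GPTBot
Disallow: /private/
Allow: /

User-agent: *
Disallow: /private/
"""

PRIVATE = "https://yourdomain.com/private/report"
BLOG = "https://yourdomain.com/blog/post"

print(can_fetch(ALLOW_ONLY, "GPTBot", PRIVATE))                # True
print(can_fetch(ALLOW_ONLY, "Googlebot", PRIVATE))             # False
print(can_fetch(ALLOW_WITH_REPEATED_RULE, "GPTBot", PRIVATE))  # False
print(can_fetch(ALLOW_WITH_REPEATED_RULE, "GPTBot", BLOG))     # True
```

In the first variant the named group opens `/private/` to GPTBot while other bots remain blocked, which is often not what the site owner intended.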