Crawl budget is the number of pages Googlebot (and other search engine bots) will crawl on your site within a given time period. It is determined by two factors: crawl rate limit (how fast Google can crawl without overloading your server) and crawl demand (how often Google wants to recrawl your pages, based on their popularity and how often they change). Small sites with fast, well-structured pages rarely need to worry about crawl budget. Large sites with many thousands of pages, particularly ecommerce sites with faceted navigation or news sites with extensive archives, can benefit significantly from crawl budget optimisation.
Crawl budget becomes critical when a site has pages that waste crawl allocation — low-quality pages, duplicate content, parameterised URLs, and infinite scroll pagination can all consume crawl budget without contributing to search visibility. When these pages absorb Googlebot's attention, important new content and key commercial pages are crawled less frequently, slowing their indexation.
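To illustrate how parameterised URLs multiply the number of addresses a bot has to fetch, here is a minimal Python sketch that groups crawled URLs by a normalised key. The parameter names in IGNORED_PARAMS and the example.com URLs are assumptions made for illustration; on a real site the list would come from an audit of which parameters actually change page content.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names assumed not to change page content.
IGNORED_PARAMS = {"sessionid", "sort", "utm_source", "utm_medium", "utm_campaign"}


def normalise(url: str) -> str:
    """Drop ignored query parameters and sort the rest for a stable key."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # The fragment is discarded because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_duplicates(urls):
    """Map each normalised URL to the list of crawled variants behind it."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalise(url)].append(url)
    return dict(groups)


# Illustrative sample of URLs, as might be pulled from a server log.
crawled = [
    "https://example.com/shoes?sort=price&sessionid=abc",
    "https://example.com/shoes?sessionid=xyz",
    "https://example.com/shoes",
    "https://example.com/shoes?colour=red&sort=name",
]
groups = group_duplicates(crawled)
```

In this sample, three of the four crawled URLs collapse to the same page, which is the kind of duplication that absorbs crawl allocation without adding search visibility.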
How to optimise crawl budget
- Block low-value URLs in robots.txt: internal search results pages, user profile pages, print versions, and URLs carrying session ID parameters
- Use noindex on thin or duplicate content: Google must still crawl a page to see the directive, but pages kept out of the index tend to be recrawled less often over time
- Implement canonical tags: these identify the preferred version of duplicate pages so bots focus on the right URL
- Fix crawl errors: broken links returning 404s, redirect chains, and server errors all waste crawl allocation
- Optimise internal linking: orphan pages (those with no internal links pointing to them) are harder for bots to discover and receive less crawl attention
- Improve server response time: faster responses allow more pages to be crawled per visit
- Submit XML sitemaps: these help bots identify which pages are most important to crawl
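As one example of the fixes listed above, redirect chains can be flattened so that every old URL points straight at its final destination. The sketch below is a minimal illustration in Python, assuming the redirects are available as a simple source-to-target mapping; the paths and the max_hops limit are hypothetical.

```python
def resolve_chain(start, redirects, max_hops=10):
    """Follow redirects from start and return the list of URLs visited."""
    chain = [start]
    seen = {start}
    while chain[-1] in redirects and len(chain) <= max_hops:
        nxt = redirects[chain[-1]]
        chain.append(nxt)
        if nxt in seen:  # redirect loop detected, stop following
            break
        seen.add(nxt)
    return chain


def flatten(redirects):
    """Map every source directly to its final destination, skipping loops."""
    flat = {}
    for src in redirects:
        final = resolve_chain(src, redirects)[-1]
        if final in redirects:  # ended in a loop or hit the hop limit
            continue
        flat[src] = final
    return flat


# Hypothetical redirect table: one two-hop chain and one loop.
redirects = {
    "/old": "/interim",
    "/interim": "/new",
    "/a": "/b",
    "/b": "/a",
}
flat = flatten(redirects)
```

Here /old is rewritten to point directly at /new, so a bot needs one request instead of two, while the looping pair /a and /b is left out for manual review.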
Sites with over 10,000 pages should actively manage crawl budget. These typically include large ecommerce sites whose faceted navigation generates millions of URL combinations, news and publishing sites with extensive content archives, user-generated content platforms with pages of variable quality, and sites whose multiple language or regional versions duplicate content. For small sites (under 1,000 pages) with clean architecture, crawl budget is rarely a limiting factor.
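A rough calculation shows how faceted navigation reaches millions of URL combinations. The Python sketch below assumes each facet is either unset or set to exactly one value; the facet names, the option counts, and the figure of 50 categories are all hypothetical.

```python
from math import prod

# Hypothetical facets for one category page: name -> number of selectable values.
facets = {"colour": 12, "size": 15, "brand": 40, "price_band": 8, "sort": 5}


def url_variants(facet_sizes):
    """Count URLs when each facet is either unset or set to one value."""
    return prod(n + 1 for n in facet_sizes.values())


per_category = url_variants(facets)
site_total = per_category * 50  # assuming 50 categories share the same facets
```

Under these assumptions a single category page yields 460,512 crawlable variants, and 50 such categories yield about 23 million, far more than the number of distinct products behind them.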
Robots.txt is a plain-text file in a website's root directory that tells search engine crawlers which pages and directories they should not access. Blocking low-value sections of a site through robots.txt frees up crawl budget for important pages. However, pages disallowed in robots.txt can still be indexed if other sites link to them. To prevent indexation, a noindex meta tag on the page itself is more reliable, but the page must then remain crawlable, because a crawler that is blocked from a URL never sees the tag. Robots.txt controls crawling; noindex controls indexing.
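Before deploying robots.txt rules, it can help to test them against sample URLs. The sketch below uses Python's standard urllib.robotparser module; the rules and example.com paths are hypothetical. Note that this parser does simple prefix matching and does not implement the wildcard extensions that Google supports, so the example sticks to plain path prefixes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules blocking low-value sections for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /print/
Disallow: /users/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(path, agent="Googlebot"):
    """Return True if the rules above allow the agent to fetch the path."""
    return parser.can_fetch(agent, "https://example.com" + path)


# Check one blocked URL per rule plus one page that should stay crawlable.
results = {
    path: is_crawlable(path)
    for path in ["/search?q=shoes", "/print/page-1", "/users/42", "/products/red-shoes"]
}
```

A check like this only confirms what a crawler is allowed to fetch. It says nothing about whether a blocked URL is indexed, which is governed by noindex as described above.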