Articles
GEO Fundamentals

The Dual robots.txt Strategy: Block Training Bots, Allow Search Bots

AI engines operate both training bots and search bots. Blocking training bots protects your IP, while allowing search bots preserves live-citation visibility.

Visibility Team

In 2026, every major AI platform operates two distinct types of crawlers: training bots that ingest content into model weights, and search bots that retrieve pages for real-time answers. The standard practice for publishers is to block the former and allow the latter.

Why the Dual Strategy Matters

When a training bot crawls your site, your content may be incorporated into the model’s weights with no attribution. When a search bot crawls your site, your brand is cited as a source in real-time answers. The distinction determines whether you get visibility or get consumed.

The Two Types of AI Crawlers

Training Bots (Block These)

Training bots ingest content to train or update foundation models. Your content becomes part of the model’s parameters without live attribution.

Bot               Platform       Purpose
GPTBot            OpenAI         Training data collection
ClaudeBot         Anthropic      Model training ingestion
anthropic-ai      Anthropic      Legacy training crawler
Google-Extended   Google         Gemini training data
CCBot             Common Crawl   Open training corpus

Search / Live-Citation Bots (Allow These)

Search bots retrieve pages in real time to generate answers with live citations.

Bot                Platform     Purpose
OAI-SearchBot      OpenAI       ChatGPT search index
ChatGPT-User       OpenAI       User-triggered retrieval
Claude-SearchBot   Anthropic    Claude live search
Claude-Web         Anthropic    Claude web retrieval
PerplexityBot      Perplexity   Search and live retrieval
A complete robots.txt implementing the dual strategy:

# ===== TRAINING BOTS (Block) =====
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# ===== SEARCH BOTS (Allow) =====
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

# ===== TRADITIONAL SEARCH =====
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
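Before deploying, you can sanity-check rules like the ones above with Python's standard-library robots.txt parser. This is a sketch using an abbreviated two-bot version of the template; the domain is a placeholder:

```python
from urllib.robotparser import RobotFileParser

# Abbreviated dual-strategy rules: block one training bot, allow one search bot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Training bot should be blocked site-wide.
print(parser.can_fetch("GPTBot", "https://yourdomain.com/articles/geo"))        # False
# Search bot should be allowed.
print(parser.can_fetch("OAI-SearchBot", "https://yourdomain.com/articles/geo")) # True
```

Running this against your full robots.txt for every bot in both tables catches typos in user-agent names before a crawler does.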

Important Implementation Notes

  • PerplexityBot serves both roles: There is no separate training bot for Perplexity. If you want Perplexity citations, you must allow it.
  • Google-Extended is distinct from Googlebot: Blocking Google-Extended does not affect Google Search rankings.
  • Keep robots.txt under 500KB: Large files may be truncated by crawlers.
  • Return HTTP 200: Ensure the file returns a 200 status code, not a redirect.
  • Use explicit Allow rules: Do not rely on inference from wildcard rules.
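The size and status-code notes above are easy to automate. A minimal sketch of such a check, written as a pure function so it can run against any fetched response (the helper name and 500KB threshold reflect the notes, not any crawler's documented API):

```python
MAX_ROBOTS_BYTES = 500 * 1024  # 500KB truncation limit noted above

def check_robots_response(status: int, body: bytes) -> list[str]:
    """Return a list of problems with a robots.txt HTTP response."""
    problems = []
    if status != 200:
        # Redirects (3xx) and errors violate the "Return HTTP 200" note.
        problems.append(f"expected HTTP 200, got {status}")
    if len(body) > MAX_ROBOTS_BYTES:
        problems.append(f"{len(body)} bytes; crawlers may truncate past 500KB")
    return problems

print(check_robots_response(200, b"User-agent: *\nAllow: /\n"))  # []
print(check_robots_response(301, b""))  # flags the redirect
```

Wire this to whatever HTTP client you already use and run it in CI so a misconfigured redirect never ships.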

Beyond robots.txt

robots.txt controls access. To maximize citation value, pair it with:

  • llms.txt: A machine-readable description of your business for AI systems
  • Structured data: JSON-LD schemas that help AI parsers extract meaning
  • Semantic HTML: Proper landmarks and heading hierarchy for content structure
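As a sketch of the structured-data piece, here is a minimal JSON-LD Organization block; every value is a placeholder, and your site may need a different schema.org type:

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Your Company",
  "url": "https://yourdomain.com",
  "description": "A one-sentence description AI parsers can extract directly."
}
```

Embed it in a <script type="application/ld+json"> tag in the page head so both search bots and AI parsers can read it.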

Visibility automates all three as part of our GEO platform.

#robots.txt #AI crawlers #GPTBot #ClaudeBot #GEO #SEO