Tools

robots.txt Generator

Block AI crawlers, control search engine access, and generate a ready-to-use robots.txt file. Toggle bots on and off or write custom rules.

Place robots.txt in the root of your site so it's accessible at https://yoursite.com/robots.txt

Which AI Bots Are Crawling Your Site?

These are the known AI crawler user-agents and what they're used for. Blocking them via robots.txt tells them not to scrape your content for training or retrieval.

| Bot Name | Company | Purpose | User-Agent String |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training data collection for GPT models | GPTBot |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT (user-initiated) | ChatGPT-User |
| OAI-SearchBot | OpenAI | ChatGPT search results (SearchGPT) | OAI-SearchBot |
| ClaudeBot | Anthropic | Training data collection for Claude models | ClaudeBot |
| Claude-Web | Anthropic | Real-time browsing in Claude (user-initiated) | Claude-Web |
| Google-Extended | Google | Training data for Gemini (separate from search indexing) | Google-Extended |
| CCBot | Common Crawl | Open web archive used by many AI companies for training | CCBot |
| Meta-ExternalAgent | Meta | Training data collection for Meta AI / Llama | Meta-ExternalAgent |
| FacebookBot | Meta | AI features on Facebook and Instagram | FacebookBot |
| Bytespider | ByteDance | Training data for ByteDance AI products | Bytespider |
| PerplexityBot | Perplexity | AI search engine indexing and answer generation | PerplexityBot |
| Amazonbot | Amazon | Alexa AI and Amazon product search | Amazonbot |
| Applebot-Extended | Apple | Training data for Apple Intelligence features | Applebot-Extended |
| cohere-ai | Cohere | Training data for Cohere language models | cohere-ai |
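As a worked example, a robots.txt that opts out of every crawler in the table can be written as a single group: the Robots Exclusion Protocol (RFC 9309) lets one rule set apply to several User-agent lines at once. The user-agent strings below are taken verbatim from the table:

```
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: Google-Extended
User-agent: CCBot
User-agent: Meta-ExternalAgent
User-agent: FacebookBot
User-agent: Bytespider
User-agent: PerplexityBot
User-agent: Amazonbot
User-agent: Applebot-Extended
User-agent: cohere-ai
Disallow: /
```

If you're worried about a crawler mishandling grouped User-agent lines, the more verbose alternative is to repeat Disallow: / under each bot in its own group.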

robots.txt Syntax Reference

User-agent

User-agent: Googlebot

Specifies which crawler the following rules apply to. Use * to target all crawlers.

Disallow

Disallow: /private/

Tells the bot not to crawl this path. An empty value (Disallow: with nothing after it) means everything is allowed.

Allow

Allow: /public/

Explicitly allows crawling a path, useful for overriding a broader Disallow rule. Not supported by all bots.

Sitemap

Sitemap: https://example.com/sitemap.xml

Points crawlers to your XML sitemap. Must be a full URL. You can list multiple sitemaps.

Crawl-delay

Crawl-delay: 10

Requests a delay (in seconds) between requests. Respected by Bing and Yandex, ignored by Google.

Wildcards

Disallow: /*.pdf$

Use * to match any sequence and $ to match end of URL. Supported by Google and Bing.
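Combined, the directives above form a complete file. A sketch, where the paths and sitemap URL are placeholders:

```
# Keep all crawlers out of /private/, except the docs subfolder
User-agent: *
Disallow: /private/
Allow: /private/docs/
Disallow: /*.pdf$

# Block one specific crawler entirely
User-agent: GPTBot
Disallow: /

# Sitemap lines sit outside any group and must be absolute URLs
Sitemap: https://example.com/sitemap.xml
```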

Frequently Asked Questions

Does robots.txt actually block AI training?

robots.txt is a voluntary standard — it relies on crawlers choosing to respect it. Major AI companies (OpenAI, Anthropic, Google, Meta) have committed to honoring robots.txt for their AI training crawlers. However, it does not provide a legal or technical guarantee. Content already crawled before you added the block may still be in training datasets.

Which AI companies respect robots.txt?

OpenAI (GPTBot, ChatGPT-User, OAI-SearchBot), Anthropic (ClaudeBot, Claude-Web), Google (Google-Extended), Apple (Applebot-Extended), and Meta (Meta-ExternalAgent) all publicly respect robots.txt. Common Crawl (CCBot) also honors it. Smaller players vary — PerplexityBot has been called out for inconsistent compliance but has since improved.

Where do I put robots.txt?

Place the file in the root directory of your website so it's accessible at https://yoursite.com/robots.txt. For static site generators, put it in your public or static folder. For ZeroDeploy, Netlify, and Cloudflare Pages, put it in your build output directory alongside index.html.

Can I block AI crawlers without hurting my SEO?

Yes. AI training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) are completely separate from search engine indexing crawlers (Googlebot, Bingbot). Blocking AI crawlers has no effect on your search rankings. Just make sure you don't accidentally block Googlebot or Bingbot.
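One way to sanity-check this separation is Python's built-in urllib.robotparser, which implements the standard matching rules. Given a file that blocks GPTBot but leaves the default group open, the AI crawler is denied while Googlebot is unaffected (example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks the GPTBot training crawler,
# while the default group (which Googlebot falls into) allows everything.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# AI training crawler is blocked...
print(rp.can_fetch("GPTBot", "https://example.com/page"))     # False
# ...while the search engine crawler is unaffected.
print(rp.can_fetch("Googlebot", "https://example.com/page"))  # True
```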

What's the difference between blocking GPTBot and ChatGPT-User?

GPTBot collects training data for OpenAI's models — blocking it prevents your content from being used in future model training. ChatGPT-User is the agent ChatGPT uses when a user asks it to visit a URL in real time. Blocking ChatGPT-User prevents ChatGPT from fetching your pages during conversations but doesn't affect training.
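In practice, opting out of training while keeping user-initiated browsing means listing only GPTBot:

```
# Opt out of model training
User-agent: GPTBot
Disallow: /

# No rule for ChatGPT-User, so real-time fetches on a user's behalf still work
```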

Should I block Common Crawl (CCBot)?

Common Crawl is a nonprofit that builds an open web archive. Many AI companies use this archive for training data, so blocking CCBot reduces your content's availability to a wide range of AI systems at once. However, Common Crawl data is also used for academic research and web analytics.

Deploy your site with zero configuration

ZeroDeploy serves your robots.txt automatically from the root of your site. Deploy in seconds with built-in forms, analytics, and custom domains.

Get Started Free