For media organizations and publishers looking to block ChatGPT and other AI products from crawling and using their content, there is a straightforward solution to put in place.
Several major companies have announced this intent recently, and the UK's Independent Publishers Alliance has warned its members to block ChatGPT and other bot crawling on their websites as soon as possible. If you feel similarly about protecting your intellectual property from being used to train AI tools, read on:
Several types of crawlers visit most websites and use the content they discover to train large language models (LLMs). I will explain the most common ones here to help you decide how to handle each one.
Common AI Bots Crawling the Web
User-agent: anthropic-ai: Anthropic-ai is an unconfirmed agent possibly used by Anthropic to download training data for its LLMs (Large Language Models) that power AI products like Claude.
User-agent: Bytespider: Bytespider is a web crawler operated by ByteDance, the Chinese owner of TikTok. It's allegedly used to download training data for its LLMs (Large Language Models), including those powering ChatGPT competitor Doubao.
User-agent: ClaudeBot: ClaudeBot is a web crawler operated by Anthropic that downloads training data for its LLMs (Large Language Models) that power AI products like Claude.
User-agent: Claude-Web: Claude-Web is an AI-related agent operated by Anthropic. It's currently unclear exactly what it's used for since there's no official documentation.
User-agent: CCBot: CCBot is a web crawler operated by the Common Crawl project, which collects and processes web data to create a freely accessible web archive used for research and analysis, including the training of AI models.
User-agent: GPTBot: GPTBot is a web crawler OpenAI operates to download training data for its LLMs (Large Language Models) that power AI products like ChatGPT.
User-agent: ChatGPT-User: ChatGPT-User is dispatched by OpenAI's ChatGPT in response to user prompts. Its AI-generated answers will usually contain a summary of the content on the website, along with a reference link.
User-agent: cohere-ai: Cohere-ai is an unconfirmed agent possibly dispatched by Cohere's AI chat products in response to user prompts when it needs to retrieve content on the internet.
User-agent: Diffbot: Diffbot is an intelligent web crawler used to understand, aggregate, and ultimately sell structured website data for real-time monitoring and training other AI models.
User-agent: FacebookBot: FacebookBot is a web crawler used by Meta to download training data for its AI speech recognition technology.
User-agent: Omgilibot: Omgilibot is a web crawler likely used to gather data for training large language models (LLMs) or other AI applications, though its exact purpose and origin are not fully documented.
User-agent: Omgili: Omgili is a web crawler used by Webz.io to maintain a repository of web crawl data that it sells to other companies, including those using it to train AI models.
User-agent: PerplexityBot: PerplexityBot is a web crawler used by Perplexity to index search results that allow their AI Assistant to answer user questions. The assistant's answers normally contain references to the website as inline sources.
User-agent: Google-Extended: Google-Extended is not a separate crawler but a control token that Google checks in robots.txt. Disallowing it tells Google not to use content crawled by its regular Googlebot to train and ground its AI products, such as the Gemini assistant and its Vertex AI generative APIs; it does not affect how your site appears in Google Search.
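Before deciding which of these agents to block, it can help to see which ones are already hitting your site. The short Python sketch below is one way to do that, assuming a typical Nginx or Apache access log that records the user-agent string; the log path is a placeholder you should adjust for your own server.

from collections import Counter

# AI crawler user agents described above
AI_BOTS = [
    "anthropic-ai", "Bytespider", "ClaudeBot", "Claude-Web", "CCBot",
    "GPTBot", "ChatGPT-User", "cohere-ai", "Diffbot", "FacebookBot",
    "Omgilibot", "Omgili", "PerplexityBot", "Google-Extended",
]

# Placeholder path: point this at your own web server's access log
LOG_PATH = "/var/log/nginx/access.log"

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        lowered = line.lower()
        # Check longer names first so Omgilibot isn't also counted as Omgili
        for bot in sorted(AI_BOTS, key=len, reverse=True):
            if bot.lower() in lowered:
                hits[bot] += 1
                break

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")

If a crawler shows up frequently in your logs, that is a good signal to include its user agent in the robots.txt rules described next.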
How to Block Common AI Bots from Crawling Your Content
If you want to block all of the AI bots detailed above, add all of the text below to your robots.txt file. Or copy only the user-agent entries for the specific bots you want to block from accessing your content.
User-agent: anthropic-ai
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
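Once the file is published, it is worth confirming that the rules actually apply. The sketch below, assuming Python 3 and your own domain in place of example.com, fetches your live robots.txt with the standard-library urllib.robotparser and reports whether each user agent above is blocked.

from urllib.robotparser import RobotFileParser

# Placeholder domain: replace example.com with your own site
SITE = "https://www.example.com"

AI_BOTS = [
    "anthropic-ai", "Bytespider", "ClaudeBot", "Claude-Web", "CCBot",
    "GPTBot", "ChatGPT-User", "cohere-ai", "Diffbot", "FacebookBot",
    "Omgilibot", "Omgili", "PerplexityBot", "Google-Extended",
]

parser = RobotFileParser()
parser.set_url(SITE + "/robots.txt")
parser.read()  # downloads and parses the live robots.txt

for bot in AI_BOTS:
    # can_fetch() applies your Disallow rules for this user agent to the site root
    blocked = not parser.can_fetch(bot, SITE + "/")
    print(f"{bot}: {'blocked' if blocked else 'still allowed - check your rules'}")

Keep in mind that robots.txt is a request rather than an enforcement mechanism: well-behaved crawlers honor it, but it does not technically prevent access on its own.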
Drop us a line if you need help taking the next step here.