Understanding the AI Crawlers that are Crawling your Website

For media organizations and publishers looking to block ChatGPT and other products from crawling and accessing their content, there is a simple solution: a few lines in your site's robots.txt file.

Several major publishers have announced plans to do so, and the UK's Independent Publishers Alliance has warned its members to block ChatGPT and similar crawlers on their websites as soon as possible. If you feel similarly about protecting your intellectual property from being used to train AI tools, read on:

A variety of crawlers visit most websites and use the content they discover to train large language models (LLMs). We explain the most common ones below to help you decide how to handle each.

Common AI Bots Crawling the Web

User-agent: anthropic-ai: Anthropic-ai is an unconfirmed agent possibly used by Anthropic to download training data for its LLMs that power AI products like Claude.

User-agent: Amazonbot: Amazonbot is a crawler Amazon uses to support Alexa's AI answers. Amazon says the collected data “enables Alexa to answer questions,” but its documentation does not claim the data is used to train general-purpose LLMs.

User-agent: Applebot: Apple's web crawler, used to gather information from the internet to power various search features within Apple's ecosystem.

User-agent: Applebot-Extended: Applebot-Extended is a secondary user agent created by Apple to provide website owners with more control over how their content is used to train Apple’s AI models. Blocking only Applebot-Extended keeps you in Spotlight/Siri results while telling Apple not to use your content for model training.
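For example, to opt out of Apple's model training while staying in Spotlight and Siri results, you would block only the secondary agent:

User-agent: Applebot-Extended
Disallow: /

Applebot itself is left unblocked, so Apple's search features can still index the site.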

User-agent: Bytespider: Bytespider is a web crawler operated by ByteDance, the Chinese owner of TikTok. It is allegedly used to download training data for ByteDance's LLMs, including those powering the ChatGPT competitor Doubao.

User-agent: ClaudeBot: ClaudeBot is a web crawler operated by Anthropic that downloads training data for its LLMs that power AI products like Claude.

User-agent: Claude-Web: This crawler is activated when a user's query prompts Claude to retrieve real-time information from the web. Blocking this bot on your site would prevent Claude from retrieving your content in response to user requests.

User-agent: Claude-SearchBot: This crawler evaluates web pages to improve the quality of search results within Claude's internal search feature. To have your site appear in Claude's embedded results, you need to allow this bot access. 
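The three Anthropic agents above support different trade-offs. To block Anthropic's training crawler while remaining reachable through Claude's search and retrieval features, disallow only ClaudeBot:

User-agent: ClaudeBot
Disallow: /

Claude-Web and Claude-SearchBot are simply omitted from the file, which leaves them allowed by default.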

User-agent: CCBot: A web crawler commonly associated with the Common Crawl project, which collects and processes web data to create a freely accessible web archive for use in research and analysis.

User-agent: GPTBot: GPTBot is a web crawler that OpenAI operates to download training data for the LLMs that power AI products like ChatGPT.

User-agent: ChatGPT-User: ChatGPT-User is dispatched by OpenAI's ChatGPT in response to user prompts. The resulting AI-generated answers usually contain a summary of the website's content, along with a reference link.

User-agent: cohere-ai: Cohere-ai is an unconfirmed agent possibly dispatched by Cohere's AI chat products in response to user prompts when it needs to retrieve content on the internet.

User-agent: Diffbot: Diffbot is an intelligent web crawler used to understand, aggregate, and ultimately sell structured website data for real-time monitoring and training other AI models.

User-agent: FacebookBot: FacebookBot is a web crawler used by Meta to download training data for its AI speech recognition technology.

User-agent: Omgilibot: A web crawler likely used to gather data for training large language models (LLMs) or other AI applications, though its exact purpose and origin are not fully documented.

User-agent: Omgili: Omgili is a web crawler used by Webz.io to maintain a repository of web crawl data that it sells to other companies, including those using it to train AI models.

User-agent: PerplexityBot: PerplexityBot is a web crawler Perplexity uses to index pages so that its AI assistant can answer user questions. The assistant's answers normally cite the website as an inline source.

User-agent: Perplexity-User: A user agent operated by Perplexity AI that facilitates real-time information retrieval during user interactions with their AI assistant.

User-agent: Google-Extended: Google-Extended is a web crawler used by Google to download AI training content for its AI products like Gemini.

How to Block AI Bots from Ingesting Your Content

To block any of the AI bots listed above, add a User-agent line and a Disallow rule for each one to your site's robots.txt file. Using GPTBot as an example:

User-agent: GPTBot
Disallow: /
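You can repeat this pattern for as many bots as you like, one block per user agent. If you want to confirm the rules behave as intended before deploying them, Python's standard urllib.robotparser module can evaluate a robots.txt policy locally; a minimal sketch (the rules and URLs here are illustrative):

```python
# Sanity-check a robots.txt policy with Python's standard library,
# without deploying it to a live site.
from urllib import robotparser

# Illustrative policy: block GPTBot everywhere, allow everyone else.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# GPTBot is disallowed everywhere...
print(rp.can_fetch("GPTBot", "https://example.com/article"))     # False
# ...while other crawlers are still allowed.
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # True
```

Note that robots.txt is only a request: well-behaved crawlers honor it, but nothing technically prevents a bot from ignoring it.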

Drop us a line if you need help taking the next step here.