Which AI crawlers and bots visit your site? A 2026 guide to AI user agents

Dozens of AI bots now request publisher pages, but for practical purposes they do only three jobs, and the job is what should drive your response. A training crawler gathers content in bulk to train a model. A retrieval crawler builds the index an assistant searches. A live agent fetches a single page in real time to answer one user's question. The directory below lists the named user agents in each group, from GPTBot and ClaudeBot to OAI-SearchBot, PerplexityBot, ChatGPT-User and Perplexity-User, and sets out whether to block, allow, or monetise each one.

The three types of AI bot, and why the difference matters

A training crawler collects content in bulk so it can become part of a model's long-term knowledge. The value to you is indirect, which is why training access is the usual subject of licensing and pay-per-crawl conversations.

A retrieval crawler builds and refreshes a searchable index that an assistant queries at the moment of a question. Allowing these helps keep your content eligible to be cited in AI search results.

A live agent fetches a single page in real time because a user just asked something that needs current information. This is the category closest to purchase intent, and the one most worth monetising, because the read happens at the moment of a decision.

The same company often runs more than one of these under separate user-agent names, so you can set a different rule for each.

Training crawlers

GPTBot is OpenAI's training crawler. ClaudeBot is Anthropic's. Google-Extended is not a crawler that fetches pages but a control token used in robots.txt to opt out of content being used to improve Google's generative models. Applebot-Extended likewise governs whether Apple may use your content for training. Meta-ExternalAgent is Meta's training and ingestion crawler. CCBot is the crawler operated by Common Crawl, whose open datasets are widely used to train models. Disallowing these in robots.txt tells compliant operators to keep your content out of model training, and it has little effect on whether you are cited in AI search.

Retrieval and search crawlers

OAI-SearchBot is OpenAI's crawler for its search index, distinct from GPTBot, so you can allow ChatGPT search presence while keeping content out of training. PerplexityBot indexes content for Perplexity. Claude-SearchBot is Anthropic's retrieval crawler, controllable separately from ClaudeBot. Amazonbot supports Amazon's services including retrieval. Allowing these tends to preserve your eligibility to appear and be cited in AI search answers.
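
The split between training and retrieval names can be written down as a robots.txt policy. The sketch below is one possible arrangement, not a recommended default: it disallows the training names listed earlier, allows three of the retrieval crawlers, and uses Python's standard urllib.robotparser to check how a compliant crawler would read the rules. The URL https://example.com/article is a placeholder, and robots.txt remains a request that only well-behaved bots honour.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: opt out of training, stay eligible for AI search.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: Claude-SearchBot
Allow: /
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "https://example.com/article"  # placeholder URL
for agent in ("GPTBot", "ClaudeBot", "OAI-SearchBot", "Claude-SearchBot"):
    verdict = "allowed" if parser.can_fetch(agent, page) else "disallowed"
    print(f"{agent}: {verdict}")
```

Grouping several User-agent lines above one rule keeps the file short, and listing each bot by name means a change of policy for one crawler does not touch the others.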

Live agents

ChatGPT-User is the agent OpenAI dispatches when a ChatGPT user's prompt requires fetching a live page. Perplexity-User is Perplexity's equivalent, and Claude-User is Anthropic's. These activate per user question and retrieve the specific page needed to compose an up-to-date answer. Blocking them prevents your content from being used in live answers, which takes you out of the picture at the moment a user is actively researching. Monetising them, rather than blocking, captures value from that high-intent read.
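
The three groups above amount to a lookup from a declared name to a job. A minimal sketch of that lookup follows, assuming a case-insensitive substring match against the User-Agent header. The function name classify_user_agent and the sample header strings are illustrative, and Google-Extended and Applebot-Extended are left out because they are treated here as robots.txt control tokens rather than names that arrive in a request header.

```python
from typing import Optional

# Names come from the directory above, grouped by the job each bot does.
BOT_GROUPS = {
    "training": ["GPTBot", "ClaudeBot", "Meta-ExternalAgent", "CCBot"],
    "retrieval": ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "Amazonbot"],
    "live": ["ChatGPT-User", "Perplexity-User", "Claude-User"],
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return 'training', 'retrieval', 'live', or None for an unlisted agent.

    This reads only the declared name; it does not prove who sent the request.
    """
    ua = user_agent.lower()
    for group, names in BOT_GROUPS.items():
        for name in names:
            if name.lower() in ua:
                return group
    return None


# Sample header values are illustrative, not copied from vendor documentation.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.1)"))
print(classify_user_agent("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))
print(classify_user_agent("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"))
```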

Why the user-agent name alone is not enough

Every name above is a string a bot declares about itself, and a string can be spoofed. A scraper can present itself as a browser, or even as a reputable AI crawler, to slip past rules. So the user-agent is a starting signal, not proof. Reliable identification checks the request against the bot owner's published IP ranges, confirms reverse DNS, and, where supported, validates cryptographic signatures. This is why accurate bot classification is done at the server or CDN edge rather than by trusting the header, and why edge-level analytics can attribute traffic to the right owner when a log of user-agent strings alone cannot.
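
Two of the checks named above, published IP ranges and reverse DNS, can be sketched with the Python standard library. Everything specific here is an assumption for illustration: the CIDR blocks are documentation ranges rather than any vendor's real list, the hostname suffix .crawler.example is made up, and the function names are hypothetical. Real ranges and hostnames would come from each bot owner's published documentation, and signature verification is not shown.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List

# Placeholder CIDR blocks (RFC 5737 documentation ranges), NOT real bot ranges.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def ip_in_published_ranges(bot_name: str, remote_ip: str) -> bool:
    """True if remote_ip falls inside a range listed for the claimed bot."""
    address = ipaddress.ip_address(remote_ip)
    return any(
        address in ipaddress.ip_network(cidr)
        for cidr in PUBLISHED_RANGES.get(bot_name, [])
    )


def _reverse(ip: str) -> str:
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    return socket.gethostbyname_ex(hostname)[2]


def forward_confirmed_rdns(
    remote_ip: str,
    allowed_suffixes: Iterable[str],
    reverse_lookup: Callable[[str], str] = _reverse,
    forward_lookup: Callable[[str], List[str]] = _forward,
) -> bool:
    """Reverse-resolve the IP, check the hostname suffix, then resolve the
    hostname forward and confirm it maps back to the same IP."""
    try:
        hostname = reverse_lookup(remote_ip)
        if not hostname.endswith(tuple(allowed_suffixes)):
            return False
        return remote_ip in forward_lookup(hostname)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

The lookups are passed in as parameters so the logic can be exercised without network access; in production the defaults call the system resolver, and results would normally be cached because a DNS round trip per request is expensive.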

How to see which bots are visiting you

Because most of these bots never run JavaScript, they do not appear in client-side analytics such as Google Analytics. To see them you need measurement below the browser, at the server or CDN edge, where every request is visible and can be verified and classified. blankspace provides this as its analytics layer, classifying each AI agent as a live search agent, training crawler, search crawler, or scraper, attributing it to its owner, and showing the volume and page-level breakdown, all before any JavaScript would have run. That visibility is the basis for deciding, per bot and per page, whether to block, allow, or monetise.
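
For a rough first look without any edge tooling, server access logs can be tallied directly. The sketch below assumes logs in the common "combined" format, where the user agent is the last quoted field; the sample lines, IP addresses and paths are invented for illustration, and count_bot_hits is a hypothetical helper. It counts declared names only, so it is subject to the spoofing caveat described earlier.

```python
import re
from collections import Counter
from typing import Iterable

BOT_NAMES = [
    "GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot",
    "ChatGPT-User", "Perplexity-User", "Claude-User",
]

# Combined log format: ip ident user [time] "METHOD path proto" status bytes "referer" "ua"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_bot_hits(lines: Iterable[str]) -> Counter:
    """Count requests per (declared bot name, path); unparsable lines are skipped."""
    hits: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        ua = match.group("ua").lower()
        for name in BOT_NAMES:
            if name.lower() in ua:
                hits[(name, match.group("path"))] += 1
                break
    return hits


# Invented sample lines for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2026:00:00:01 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '192.0.2.11 - - [01/Jan/2026:00:00:02 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '198.51.100.7 - - [01/Jan/2026:00:00:03 +0000] "GET /docs HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '203.0.113.5 - - [01/Jan/2026:00:00:04 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]

for (bot, path), count in sorted(count_bot_hits(SAMPLE_LOG).items()):
    print(f"{bot} {path} {count}")
```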

Frequently asked questions

What is the difference between GPTBot and ChatGPT-User?

GPTBot is OpenAI's training crawler, which gathers content in bulk for model training. ChatGPT-User is the live agent that fetches a specific page in real time to answer a particular user's question. They have different purposes and can be controlled separately.

Is Google-Extended a crawler?

No. Google-Extended is a control token you use in robots.txt to opt out of having your content used to improve Google's generative models. It is not a separate bot that fetches pages; Google's normal crawling is done by Googlebot.

Can I allow AI search but block AI training?

Yes. Several major AI companies separate the two, for example OpenAI with OAI-SearchBot versus GPTBot and Anthropic with Claude-SearchBot versus ClaudeBot, so you can allow the retrieval crawler while disallowing the training crawler.

Why don't these bots show up in Google Analytics?

Because the large majority do not execute JavaScript, and Google Analytics depends on JavaScript running in the browser. Seeing them requires server-side or CDN-edge measurement.

How do I know a bot is really who it claims to be?

By verifying the request against the owner's published IP ranges, reverse DNS, and cryptographic signatures where available, rather than trusting the user-agent string, which can be spoofed. This verification is typically done at the edge.