Part 2 of 4: the practical plan we're recommending to our publisher partners.
TL;DR
- Existing training-bot policies (GPTBot, ClaudeBot, et al.) cover the first AI dependency. They do not cover the second.
- Information Agents and the rest of Google's user-triggered agent stack ignore robots.txt by design.
- The instrumentation work is not technically hard. It's a four-week sequence: see the traffic, classify it, decide your policy, open the commercial conversation.
- The single biggest gap in publishing right now isn't engineering. It's that most publishers don't have a counterparty named for the commercial conversation that Q4 is going to force.
- The Q4 commercial conversations are being framed this summer. Not in autumn. This summer.
In part 1, we said the question stopped being theoretical at Google I/O on 19 May.
This is the practical follow-up.
Almost every publisher we've spoken to in the last six months already has a training-bot policy.
They've blocked GPTBot in robots.txt.
They've made noise about ClaudeBot.
The 1% have had the licensing conversation with OpenAI. The other 99% wish they could but have no leverage.
All of that work matters.
None of it covers the second dependency.
Information Agents are not training crawlers.
They are user-triggered fetchers.
They ignore robots.txt by design, because Google considers them an extension of the user's intent, not an autonomous bot.
The Google-Agent fetcher that surfaces this traffic in your logs was originally documented for Project Mariner, which Google shut down on 4 May*. The underlying technology was absorbed into Gemini Agent, Chrome Auto Browse, AI Mode in Search, and now Information Agents.
The same user agent now covers all of it.
That makes Google-Agent the highest-leverage single instrumentation target on the publisher side. Get this one right and you can see most of Google's live retrieval traffic at the request layer.
Here is the four-week plan.
Week 1: See the traffic
You cannot manage what you cannot see.
The work in week one is making live search agent traffic visible in your existing tooling.
This is server log work, not browser analytics.
GA4 will not give you what you need.
The traffic doesn't render JavaScript and won't fire the tag.
The bots you want to identify, at minimum:
- Google-Agent (Information Agents, Gemini Agent, Chrome Auto Browse, AI Mode)
- GPTBot, OAI-SearchBot, ChatGPT-User (OpenAI)
- ClaudeBot, Claude-User, Claude-SearchBot (Anthropic)
- PerplexityBot, Perplexity-User
- Meta-ExternalAgent, Meta-ExternalFetcher
- Bytespider, Amazonbot, Applebot-Extended
For each, you want: hit count, hit rate over time, pages hit, status codes returned, time-of-day distribution.
Tools that work for this depend on your stack.
Cloudflare Logpush into BigQuery. Datadog log management. Splunk. The ELK stack. A custom pipeline pointed at your access logs. Any of them.
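If you take the custom pipeline route, a minimal sketch of the week-one counting might look like this. It assumes access logs in the common "combined" format and matches the user-agent tokens from the list above by case-insensitive substring. The sample lines, IP addresses, and full user-agent strings are invented for illustration, and `summarise` is a hypothetical helper name, not part of any tool mentioned here.

```python
import re
from collections import Counter

# Tokens from the list above; matched as case-insensitive substrings.
BOT_TOKENS = [
    "Google-Agent", "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "Meta-ExternalAgent", "Meta-ExternalFetcher",
    "Bytespider", "Amazonbot", "Applebot-Extended",
]

# Assumption: "combined" access log layout. Adjust the pattern for your server.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def match_bot(user_agent):
    """Return the first listed token found in the user agent, or None."""
    ua = user_agent.lower()
    for token in BOT_TOKENS:
        if token.lower() in ua:
            return token
    return None


def summarise(lines):
    """Count hits, status codes, pages, and hour of day per bot token."""
    hits, statuses, pages, hours = Counter(), Counter(), Counter(), Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that do not fit the assumed layout
        bot = match_bot(m["ua"])
        if bot is None:
            continue  # not one of the listed agents
        hits[bot] += 1
        statuses[(bot, m["status"])] += 1
        pages[(bot, m["path"])] += 1
        hours[(bot, m["ts"][12:14])] += 1  # "dd/Mon/yyyy:HH:..." -> "HH"
    return hits, statuses, pages, hours


# Invented sample lines; real user-agent strings will differ.
SAMPLE_LINES = [
    '203.0.113.7 - - [01/Jan/2000:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "ExampleUA (compatible; Google-Agent)"',
    '203.0.113.7 - - [01/Jan/2000:10:00:01 +0000] "GET /members HTTP/1.1" 403 310 "-" "ExampleUA (compatible; Google-Agent)"',
    '198.51.100.9 - - [01/Jan/2000:10:00:02 +0000] "GET /news HTTP/1.1" 200 9000 "-" "ExampleUA (compatible; GPTBot)"',
    '192.0.2.4 - - [01/Jan/2000:10:00:03 +0000] "GET /news HTTP/1.1" 200 9000 "https://example.com/" "Mozilla/5.0 (X11; Linux x86_64)"',
]

hits, statuses, pages, hours = summarise(SAMPLE_LINES)
```

The four counters map directly onto the dashboard dimensions: hit count, status codes, pages hit, and time-of-day distribution. Hit rate over time falls out of running the same function per day or per hour of log data.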
By the end of week one, you should have a single dashboard view of bot traffic by category.
That dashboard is the artefact.
Week 2: Classify and segment
Volume is not the question.
The question is what these agents are doing on your site.
A training crawler that hits ten thousand pages a month is a different problem from a live retrieval agent that fetches a hundred pages per user query, in real time, against a prompt that's about to surface inside an AI response.
The classification you want, per category:
- Are they hitting paywalled or gated content? If yes, by what mechanism?
- Are they fetching at human-comparable rates, or at 10x to 60x that rate per query, as Circle's research suggests AI agents do?
- Are they concentrated on specific verticals - shopping, pricing, sports, finance - where Information Agent use cases overlap your content?
- Are they following your sitemap, or only the URLs cited by other AI surfaces?
This is the segment that goes into your audience model.
Live search agents are a third audience category, alongside humans and indexed crawlers, and they need to be measured as one.
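One way to make the third category concrete is a small lookup from user-agent token to population. The sketch below is illustrative only: which population each token belongs to is our assumption for the example and should be checked against each vendor's documentation, and `POPULATION_BY_TOKEN` and `classify` are hypothetical names.

```python
# Illustrative mapping only. The assignment of each token to a population is
# an assumption and should be verified against vendor documentation.
POPULATION_BY_TOKEN = {
    "Google-Agent": "live_retrieval_agent",
    "ChatGPT-User": "live_retrieval_agent",
    "Claude-User": "live_retrieval_agent",
    "Perplexity-User": "live_retrieval_agent",
    "Meta-ExternalFetcher": "live_retrieval_agent",
    "GPTBot": "training_crawler",
    "ClaudeBot": "training_crawler",
    "Meta-ExternalAgent": "training_crawler",
    "Bytespider": "training_crawler",
}


def classify(user_agent):
    """Return the population for a raw user-agent string."""
    ua = user_agent.lower()
    for token, population in POPULATION_BY_TOKEN.items():
        if token.lower() in ua:
            return population
    # Catch-all: anything unmatched is counted as human, so unlisted bots
    # inflate this bucket until the map is extended.
    return "human"
```

User-agent strings can be spoofed, so treat this as a first pass for measurement rather than as a basis for access decisions.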
By the end of week two, you should have a publisher-side AI traffic dashboard with three populations cleanly separated: human, training crawler, live retrieval agent.
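For the fetch-rate question in the list above, a rough sketch is to measure each client's peak request count inside a sliding time window and compare agent clients against your human baseline. The function name, the 60-second window, and the shape of the input (timestamp in epoch seconds plus a client identifier of your choosing) are all assumptions made for the example.

```python
from collections import defaultdict, deque


def peak_requests_per_window(events, window_seconds=60):
    """Return {client_id: highest request count seen in any sliding window}.

    events: iterable of (epoch_seconds, client_id) pairs, in any order.
    """
    recent = defaultdict(deque)  # timestamps still inside the window, per client
    peak = defaultdict(int)
    for ts, client in sorted(events):
        window = recent[client]
        window.append(ts)
        # Drop timestamps that have aged out of the window.
        while ts - window[0] >= window_seconds:
            window.popleft()
        peak[client] = max(peak[client], len(window))
    return dict(peak)


# Invented events: one bursty client, one slow client.
EVENTS = [
    (0, "agent"), (10, "agent"), (20, "agent"), (100, "agent"),
    (0, "reader"), (90, "reader"),
]
peaks = peak_requests_per_window(EVENTS)
```

Dividing an agent's peak by the median human peak gives a multiple you can hold up against the 10x to 60x range.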
Week 3: Decide the policy
This is where most publishers stall.
It's also where the work gets interesting.
The robots.txt trap: it does not apply to user-triggered fetchers. Blocking Google-Agent in robots.txt does nothing. The agent will ignore it, because Google's documentation explicitly says user-triggered fetchers act on user intent, not crawler schedule.
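A short sketch makes the trap visible. Python's standard `urllib.robotparser` answers the question "what would a client that honours this file conclude?", and nothing more. The robots.txt content and the example URL are invented, and `robots_allows` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt for illustration.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Agent
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def robots_allows(agent, url="https://example.com/article"):
    """What a robots.txt-compliant client with this agent name would conclude."""
    return parser.can_fetch(agent, url)


gptbot_allowed = robots_allows("GPTBot")
google_agent_allowed = robots_allows("Google-Agent")
unlisted_allowed = robots_allows("ChatGPT-User")
```

The file says no to Google-Agent, but that answer only binds clients that ask the question. A fetcher that never consults robots.txt is unaffected by it, which is why the rule has no practical effect here.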
The WAF trap: aggressive bot challenges - JavaScript injection, CAPTCHA, rate-limiting at the IP level - will block Google-Agent. But they will also break the experience of the user who delegated the task to that agent. That user is in your audience already. Blocking their agent means blocking their access to your content via the surface they chose.
The Web Bot Auth opportunity: Google is now signing every Google-Agent request cryptographically, using the agent.bot.goog identity. Cloudflare, Akamai, and AWS WAF are already implementing verification*.
This is the substantive shift.
The decision space is no longer "block or allow this user agent."
It's "verify the signature, classify the verified agent, and route the request according to commercial policy."
By the end of week three, you should have a documented policy that answers three questions:
- Which AI agents do we verify and allow at full content access?
- Which do we serve a degraded response to - excerpt, snippet, paywall?
- Which do we block at the WAF?
The answers will be different for different parts of your site.
They will also change.
The point of the artefact is not permanence.
The point is having the conversation in a structured way before someone else forces it.
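The three questions above can be written down as a small routing table keyed by agent and site section. The sketch below is one possible shape, not a recommended policy: the table entries, the `ExampleBot` agent, the section names, and the choice to block claimed agents whose signature did not verify are all assumptions made for illustration. Signature verification itself is not implemented here; `signature_verified` stands in for whatever result your CDN or WAF supplies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentRequest:
    agent: Optional[str]       # matched user-agent token, or None for ordinary visitors
    signature_verified: bool   # supplied by your CDN or WAF; not computed here
    section: str               # site section, e.g. "news" or "premium"


# Hypothetical policy table: (agent, section) -> "full" | "degraded" | "block".
POLICY = {
    ("Google-Agent", "news"): "full",
    ("Google-Agent", "premium"): "degraded",
    ("ExampleBot", "news"): "block",
    ("ExampleBot", "premium"): "block",
}


def decide(request, default="degraded"):
    """Route a request according to the documented policy."""
    if request.agent is None:
        return "full"      # ordinary visitors follow the normal site rules
    if not request.signature_verified:
        return "block"     # assumption: unverified claims of agent identity are refused
    return POLICY.get((request.agent, request.section), default)
```

Keying the table by section is what lets the answers differ across the site, and keeping it as data rather than code makes it easy to change when the policy does.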
Week 4: Open the commercial conversation
This is the gap.
The instrumentation work is engineering. The policy work is operations. The commercial work is the hardest, because for most publishers there is no counterparty named for it yet.
The conversation that needs to happen in the next four to eight weeks is about who in the publishing organisation owns AI traffic monetisation - and what they are empowered to negotiate.
Two things to negotiate.
First, the commercial framework for content access. If verified Google-Agent traffic is going to fetch your content live, at scale, every quarter from this summer onwards, what's the rev-share model? Not "should we block Google" - that conversation is over. The conversation is what the contractual layer looks like.
Second, the data exchange. You will be sitting on increasingly valuable telemetry about how live search agents interact with your content. What's that data worth, and to whom?
It's not zero.
By the end of week four, you should have a one-pager you can take to commercial leadership. Two paragraphs of context, three numbers from the dashboard, two named decisions.
Not a strategy doc. A trigger for a meeting.
The window
The Q4 commercial conversations are being framed this summer.
Not in autumn.
This summer.
Information Agents ship to AI Pro and Ultra subscribers in the next ten weeks. The first contractual conversations between Google and the major publishers will follow within the quarter.
The publishers in those rooms with their own instrumentation, their own data, and a clearly owned commercial mandate will set the terms.
The ones without will accept them.
We built blankspace to be the counterparty for the commercial conversation in week four.
More on that in the rest of this series.
Part 3 in this series: the technical breakdown. Google-Agent, Web Bot Auth, and the cryptographic identity layer changing how publishers verify AI traffic.
Sources:
* https://www.theverge.com/tech/925559/google-project-mariner-shut-down
* https://developers.google.com/crawling/docs/crawlers-fetchers/google-user-triggered-fetchers
* https://developers.cloudflare.com/bots/reference/bot-verification/web-bot-auth/
