Two things that happened this week that every publisher needs to understand
Both involve bots.
Both change how publishers should be thinking about their AI strategy.
And both happened quietly enough that most publisher commercial teams won't have heard about either.
Google created a new type of bot that bypasses robots.txt by design
On March 20, 2026, Google added a new entry to its official fetcher documentation: Google-Agent.
This is not Googlebot.
Googlebot crawls the web continuously, indexing pages for search.
Google-Agent only activates when a human user asks a Google AI assistant to do something on their behalf.
Research a product.
Compare options.
Complete a task.
The agent goes to your site. The user doesn't.
The critical detail:
Google classifies Google-Agent as a user-triggered fetcher, not a crawler. Because the visit is initiated by a real human action, Google treats it like a browser visit.
User-triggered fetchers typically ignore robots.txt rules entirely.
So if a publisher has blocked all AI bots in their robots.txt, Google-Agent visits can still happen. The human user asked for it, so the agent carries it out.
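To make that concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content and the URL are hypothetical. It shows that a blanket block nominally covers a Google-Agent user agent, but the rule only has an effect if the fetcher chooses to consult it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a blanket block on every user agent.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""


def robots_verdict(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt would allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical article URL on a publisher's site.
allowed = robots_verdict(BLANKET_BLOCK, "Google-Agent", "https://example.com/story")
print("robots.txt allows Google-Agent:", allowed)
# The verdict is advisory. Nothing in robots.txt is enforced by the server,
# so a fetcher that never runs this check is unaffected by the rule.
```

The design point is that robots.txt is a request, not an access control. Enforcement has to happen elsewhere, for example at the server or CDN.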
Project Mariner, Google's experimental AI browsing tool, is the first product using Google-Agent. It's currently limited to the US. But Google has also published an IP range file (user-triggered-agents.json) that publishers can use to at least identify this traffic in server logs.
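As a rough sketch of how that identification could work, the snippet below checks whether a visitor IP from a server log falls inside a set of published ranges. It assumes user-triggered-agents.json follows the same layout as Google's other published range files (a "prefixes" list with ipv4Prefix and ipv6Prefix entries), which should be verified against the real file. The addresses are reserved documentation ranges, not real Google ranges, and the function names are illustrative.

```python
import ipaddress
import json

# Hypothetical sample. Assumed layout: a "prefixes" list of CIDR entries.
# These are reserved documentation ranges, NOT real Google ranges.
SAMPLE_RANGES = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_networks(raw_json: str) -> list:
    """Parse the range file into a list of ip_network objects."""
    data = json.loads(raw_json)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_in_ranges(ip: str, networks: list) -> bool:
    """Return True if the IP address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


networks = load_networks(SAMPLE_RANGES)
for ip in ["192.0.2.17", "198.51.100.5", "2001:db8::1"]:
    print(ip, is_in_ranges(ip, networks))
```

In practice the sample string would be replaced by the downloaded file, refreshed on a schedule, since published ranges can change.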
What this means in practice is that user-triggered agent visits are a fundamentally different category from training crawlers.
They're harder to block, they're growing as AI assistants become habitual, and they don't show up in standard analytics.
A real person asked for that visit,
your infrastructure served it,
and you see nothing.
The emerging question is commercial.
If a user delegates a research task to an AI agent, and the agent reads your content to complete that task, what is that visit worth? And who captures that value?
No one has a clean answer yet. But Google just made the infrastructure question much more concrete.
Publishers are blocking the Wayback Machine despite their journalists using it
Twenty-three major news publishers have now blocked or limited the Internet Archive's crawler, ia_archiverbot, the tool that populates the Wayback Machine. Among them are The New York Times, The Guardian, USA Today, and the Financial Times.
The stated reason is AI.
Publishers are worried that AI companies are accessing their content through the Wayback Machine's archive without permission, using it as a structured backdoor into years of journalism.
The NYT's position:
"Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us."
The concern is legit.
AI companies have used publicly accessible repositories to gather training data.
The Wayback Machine holds over a trillion webpage snapshots.
It is a significant source.
But the collateral damage is significant and massively overlooked.
USA Today is one of the publishers blocking the Wayback Machine. USA Today also recently published an important investigative piece about ICE detention statistics that relied on the Wayback Machine to reconstruct historical records that had been deleted.
Their journalists needed the archive.
Their lawyers blocked it.
The Guardian has taken a more nuanced position.
It limits the archive's API access and filters content from the Wayback Machine interface, but hasn't issued a blanket block.
The FT blocks any bot that tries to scrape paywalled content, including the Internet Archive, but most FT stories are paywalled anyway.
The Wayback Machine director's response is worth reading:
"Libraries are not the problem, and blocking access to web archives is not the solution. Doing so risks serious harm to the public record."
He's right.
And the publishers aren't entirely wrong either.
This is a genuinely hard problem, and the blunt tools available (block everything, block nothing) don't fit the nuance of the situation.
What both stories have in common
They're both examples of publishers making high-stakes decisions about AI bot access without granular visibility into what's actually happening.
Blocking the Wayback Machine to prevent AI scraping, while your own journalists depend on it, is a decision made without good data on how often AI companies actually use it versus how often journalists do.
Failing to identify Google-Agent traffic is a decision made by default, because most publishers don't have the tooling to distinguish user-triggered agent visits from other traffic categories.
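A first pass at that tooling does not need to be elaborate. The sketch below buckets user-agent strings from a server log into traffic categories and counts them. The bot names come from this piece; the substring matching rule, the category labels, and the sample user-agent strings are all assumptions for illustration.

```python
from collections import Counter

# (token, category) pairs. Tokens are the bot names mentioned in this piece.
# Matching by case-insensitive substring is an assumption, not a standard.
CATEGORIES = [
    ("Google-Agent", "user-triggered agent"),
    ("Googlebot", "search crawler"),
    ("ia_archiverbot", "archive crawler"),
]


def classify(user_agent: str) -> str:
    """Return the first matching category, or 'other' if none match."""
    ua = user_agent.lower()
    for token, category in CATEGORIES:
        if token.lower() in ua:
            return category
    return "other"


def tally(user_agents: list) -> Counter:
    """Count visits per traffic category."""
    return Counter(classify(ua) for ua in user_agents)


# Hypothetical user-agent strings, as they might appear in an access log.
sample_log = [
    "Mozilla/5.0 (compatible; Google-Agent)",
    "Mozilla/5.0 (compatible; Googlebot/2.1)",
    "ia_archiverbot",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (compatible; Google-Agent)",
]

for category, count in tally(sample_log).items():
    print(category, count)
```

User-agent strings can be spoofed, so a count like this is a starting point. Checking source IPs against published ranges is a sensible companion step.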
In both cases, the underlying problem is the same.
Publishers don't have clear visibility into who is visiting their sites, why, and what the right response to each category of visit is.
That visibility is the prerequisite for any strategy, aggressive or cooperative, to be based on evidence rather than fear.
