What is Cloudflare's Content Signals Policy, and should publishers use it?

Think of it as a vocabulary bolted onto the file the web already reads. The Content Signals Policy lets a publisher attach three explicit instructions to robots.txt - one for search indexing, one for feeding live AI answers, and one for AI model training - so that a crawler can tell the difference between being allowed to index content, to quote it in a chatbot answer, and to train a model on it. Cloudflare launched it on 24 September 2025, released it under a CC0 public-domain licence so anyone can adopt it, and switched it on by default across more than 3.8 million domains that use its managed robots.txt feature. It is a genuine improvement on the blunt allow-or-disallow of classic robots.txt, but it shares the same fundamental limit: it expresses what you want, it does not enforce it, and it does not monetise the access it grants.

What the Content Signals Policy actually is

The Content Signals Policy is a standardised set of directives that sit inside robots.txt and describe how a crawler may use content after it has fetched it. Where ordinary robots.txt only governs whether a bot may access a path, content signals govern the purpose of the access. They are published at contentsignals.org under a CC0 licence, which means the standard is free for any site, CDN, or vendor to implement without permission. Cloudflare created the policy in response to a structural gap: until then, publishers had no scalable, machine-readable way to say "you may read this to answer a question, but not to train a model".

What are the three content signals?

The policy defines three signals, each set to yes or no. The search signal covers building a traditional search index that links back to the source. The ai-input signal covers feeding content into an AI model to generate a real-time answer, the retrieval that powers tools like AI Overviews and chatbot responses. The ai-train signal covers using content to train or fine-tune an AI model. A publisher expresses preferences with a single comma-delimited line, for example "Content-Signal: search=yes, ai-train=no", which allows search indexing while reserving content against model training. Leaving a signal out states no preference rather than a yes or a no.
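To make the format concrete, here is a minimal Python sketch of how a crawler might read one Content-Signal line. The function name parse_content_signal and the use of None for "no preference" are illustrative assumptions, not part of the published policy, and the sketch handles a single line rather than a whole robots.txt file.

```python
from typing import Dict, Optional

# The three signal names described in the text above.
SIGNALS = ("search", "ai-input", "ai-train")


def parse_content_signal(line: str) -> Dict[str, Optional[bool]]:
    """Parse one Content-Signal line (hypothetical helper).

    Returns True for yes, False for no, and None when a signal is
    omitted, which the policy treats as stating no preference.
    """
    prefs: Dict[str, Optional[bool]] = {name: None for name in SIGNALS}
    field, sep, value = line.partition(":")
    if not sep or field.strip().lower() != "content-signal":
        return prefs  # not a Content-Signal line: no preferences stated
    for item in value.split(","):
        key, eq, val = item.partition("=")
        key, val = key.strip().lower(), val.strip().lower()
        # Ignore unknown signal names and values other than yes/no.
        if eq and key in prefs and val in ("yes", "no"):
            prefs[key] = (val == "yes")
    return prefs


print(parse_content_signal("Content-Signal: search=yes, ai-train=no"))
# {'search': True, 'ai-input': None, 'ai-train': False}
```

Note that ai-input comes back as None, not False: the example line says nothing about live AI answers either way.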

What did Cloudflare set as the default?

For the domains where Cloudflare manages robots.txt automatically, the default configuration is search=yes and ai-train=no. In plain terms, those sites tell the world they are happy to be indexed for search but do not want their content used to train AI models. Cloudflare deliberately left the ai-input signal neutral, saying it did not want to guess customers' preferences on whether their content should fuel live AI answers. That neutrality is the most consequential choice in the whole rollout, because ai-input - the real-time read that feeds an AI answer - is exactly the traffic publishers most need a deliberate position on.
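As a rough illustration of that default, the Python sketch below renders a robots.txt group from a set of preferences and leaves out any signal with no stated position. The render_group helper is hypothetical, and the surrounding User-agent and Allow lines are assumptions about a typical file layout, not a copy of Cloudflare's managed file.

```python
from typing import Mapping, Optional


def render_group(prefs: Mapping[str, Optional[bool]], user_agent: str = "*") -> str:
    """Render one robots.txt group with a Content-Signal line (hypothetical helper)."""
    # Signals set to None are omitted, which states no preference.
    parts = [
        f"{name}={'yes' if value else 'no'}"
        for name, value in prefs.items()
        if value is not None
    ]
    lines = [f"User-agent: {user_agent}"]
    if parts:
        lines.append("Content-Signal: " + ", ".join(parts))
    lines.append("Allow: /")  # assumed access rule, for illustration only
    return "\n".join(lines)


# The default described above: search allowed, training refused,
# ai-input deliberately left unset.
DEFAULT_PREFS = {"search": True, "ai-input": None, "ai-train": False}

print(render_group(DEFAULT_PREFS))
# User-agent: *
# Content-Signal: search=yes, ai-train=no
# Allow: /
```

A publisher who wants a deliberate position on live AI answers would set "ai-input" to True or False here instead of leaving it as None.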

Does the Content Signals Policy have legal weight?

This is where Cloudflare went further than a normal robots.txt convention. The managed file includes a notice stating that any restriction expressed via content signals is an express reservation of rights under Article 4 of the European Union's 2019 Copyright Directive (Directive 2019/790 on copyright in the Digital Single Market). The intent is to convert a polite request into a documented legal declaration: a crawler that trains on content marked ai-train=no can no longer claim it had no notice. Lawyers caution that this is untested - courts and regulators may yet decide that robots.txt imposes no binding obligation - but combined with site terms of service it strengthens a potential breach-of-contract or copyright claim. It is a notice mechanism, not a guarantee of victory.

What the policy cannot do

Content signals express preferences; they are not technical countermeasures against scraping. A crawler still has to choose to read the file, identify itself honestly, and comply. A bot that ignores robots.txt, or spoofs its user-agent to look like a browser, walks straight past a content signal exactly as it walks past a disallow rule. The standard also depends on the largest AI companies choosing to honour it, and for it to bite at scale, Google in particular would need to support it, which is not guaranteed. And like all of robots.txt, it is purely defensive: a content signal can ask an AI company not to use your work, but it cannot charge the ones you allow. It governs permission, never payment.
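The voluntary nature of compliance is easier to see in code. In the hypothetical Python sketch below, honouring a signal is just a check inside the crawler's own pipeline, and nothing on the publisher's side forces that check to run. Treating a missing signal as "not refused" is an assumption made for this example; the policy only says an omitted signal states no preference.

```python
from typing import Mapping, Optional


def compliant_use_allowed(prefs: Mapping[str, Optional[bool]], purpose: str) -> bool:
    """Return False only when the publisher explicitly said no (hypothetical helper).

    Assumption: a missing or unset signal is treated as not refused.
    """
    return prefs.get(purpose) is not False


# Preferences as a well-behaved crawler might hold them after reading robots.txt.
site_prefs = {"search": True, "ai-input": None, "ai-train": False}

print(compliant_use_allowed(site_prefs, "search"))    # True
print(compliant_use_allowed(site_prefs, "ai-train"))  # False

# A crawler that never calls this function, or never fetches robots.txt,
# is not stopped by anything shown here.
```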

Should publishers use it?

For most publishers, yes - with clear expectations. Adding content signals costs nothing, states your position in a machine-readable form the reputable AI companies can act on, and creates the documented reservation of rights that may matter in a future dispute. The decision that needs real thought is the ai-input signal, because that is the live-answer read, and setting it to no can reduce your visibility in AI answers while setting it to yes hands that read over for free. Treat content signals as the policy layer: the place you declare intent. Just do not mistake declaring intent for enforcing it, or for getting paid.

Where enforcement and monetisation actually happen

A content signal is a request made in a file; what happens to an AI request is decided at the network edge, where the request actually arrives. At the CDN edge a crawler can be verified against its owner's published IP ranges and signatures rather than trusted on its user-agent string, so a spoofed bot is caught rather than waved through. There, a publisher can enforce a real decision - block it, charge for access, or allow the read and monetise it. blankspace operates at this layer: it detects and verifies Live Search Agent traffic at the edge and turns the AI read into revenue through contextual brand mentions in the answer, rather than only asking the crawler to behave. Content signals and edge enforcement are complementary. The signal states the rule; the edge is where the rule is actually applied, and where an allowed read can earn something instead of nothing.
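The verification step described above can be sketched with Python's standard ipaddress module. Everything specific here is an assumption for illustration: the crawler name ExampleAIBot, the address ranges (reserved documentation ranges), and the edge_decision function are hypothetical, and a real edge would also check signatures and handle traffic that does not claim any crawler identity.

```python
import ipaddress
from typing import Dict, List

# Hypothetical published ranges, keyed by the identity a crawler claims.
# The addresses are documentation-only ranges, not any real operator's.
PUBLISHED_RANGES: Dict[str, List[str]] = {
    "ExampleAIBot": ["192.0.2.0/24", "198.51.100.0/24"],
}


def is_verified(claimed_agent: str, remote_ip: str) -> bool:
    """True if the source IP falls inside a range published for the claimed agent."""
    ip = ipaddress.ip_address(remote_ip)
    return any(
        ip in ipaddress.ip_network(cidr)
        for cidr in PUBLISHED_RANGES.get(claimed_agent, [])
    )


def edge_decision(claimed_agent: str, remote_ip: str, policy: str = "charge") -> str:
    """Decide what to do with a request: 'pass', 'block', or the publisher's policy.

    `policy` is the publisher's choice for verified crawlers:
    'block', 'charge', or 'allow'.
    """
    if claimed_agent not in PUBLISHED_RANGES:
        return "pass"  # not claiming a known crawler identity; out of scope here
    if not is_verified(claimed_agent, remote_ip):
        return "block"  # claims the identity but comes from an unlisted address
    return policy


print(edge_decision("ExampleAIBot", "192.0.2.10"))   # charge
print(edge_decision("ExampleAIBot", "203.0.113.5"))  # block
```

The design point is that the decision keys on where the request comes from, not on what the user-agent string says, so a request that merely borrows a crawler's name does not inherit its permissions.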

Frequently asked questions

Is the Content Signals Policy the same as robots.txt?

No. Content signals are an extension that lives inside robots.txt. Classic robots.txt only says whether a crawler may access a path. Content signals add a layer describing how the content may be used once accessed - for search, for AI answers, or for AI training - using a Content-Signal line with yes or no values.

When did Cloudflare launch the Content Signals Policy?

Cloudflare introduced it on 24 September 2025 and published it under a CC0 public-domain licence at contentsignals.org so any site or vendor can adopt it freely. It enabled the policy by default across more than 3.8 million domains that use its managed robots.txt feature.

What is the difference between ai-input and ai-train?

The ai-input signal covers using content to generate a real-time AI answer, the live retrieval behind AI Overviews and chatbot responses. The ai-train signal covers using content to train or fine-tune an AI model. They are separate so a publisher can allow being quoted in live answers while refusing to be used as training data, or the reverse.

Will AI companies obey content signals?

The major reputable AI companies can read and act on them, but compliance is voluntary and uneven, and the standard only matters at scale if the largest players, Google among them, choose to support it. Content signals are advisory: a bot that ignores robots.txt or disguises its identity is unaffected by them.

Do content signals help publishers make money from AI traffic?

No. Content signals only govern permission - whether a given use is allowed or disallowed. They cannot charge for the access they grant. Earning revenue from AI reads requires acting on the request at the edge, through access charges or by placing contextual advertising inside the AI response, neither of which a robots.txt directive can do.