The Invisible Audience: How AI Bots Redefine Web Traffic in 2025

Alex Taylor, Co-founder of blankspace

Sep 26, 2025

Executive Summary

Digital publishers are facing a growing blind spot in their audience analytics. A new wave of AI-driven bots from major AI companies such as OpenAI, Anthropic, Google, and Perplexity is consuming web content in ways that traditional measurement tools simply cannot see. These Large Language Model (LLM) agents crawl and fetch content directly from the HTML/DOM without executing JavaScript, bypassing popular analytics platforms (e.g. Google Analytics) that rely on in-browser scripts. The result is a severely undercounted segment of traffic that can constitute a significant, and in many cases majority, portion of a website's audience (Brightspot, 2025). Key points this study addresses:

Hidden Majority of Web Traffic

Automated bots now account for over half of all web traffic. Recent studies show total non-human traffic surpassed human traffic for the first time, reaching ~50-51% of Internet visits (The Register, 2025). Security firms report that malicious bots alone represent ~37% of all traffic (Hov AI, 2025), and combined “good” and “bad” bots exceed human users. Much of this surge is driven by AI content scrapers harvesting data for AI models.

Analytics Blind Spot

Traditional web analytics (which rely on JavaScript tags executing in a user's browser) fail to detect these AI agents. LLM-based bots typically do not run front-end scripts or render pages; they fetch content directly via HTTP. Consequently, visits by AI crawlers do not trigger Google Analytics or similar tools, leaving publishers oblivious to a sizeable audience interacting with their content. Even Google Analytics 4, which auto-filters known bots, cannot catch what isn't executing its tracking code. As a Cloudflare researcher put it, "You can't secure what you can't see": without proper tracking, AI-driven automation becomes a blind spot for digital teams (The Register, 2025).

Content Consumption Without Clicks

AI services (chatbots, generative search answers, etc.) often present information directly to users without redirecting them to publisher sites. Gartner predicts 25% of search traffic will be lost by 2026 as users turn to AI chatbots instead of traditional search clicks (Search Engine Land, 2025). This means even human attention is increasingly captured by AI intermediaries, with users getting the information they need while the publisher's pageviews and ad impressions vanish. Publishers may see declining traffic and engagement metrics not because interest in their content fell, but because AI platforms deliver that content in "zero-click" fashion.

Distorted Audience & ROI Metrics

The rise of invisible AI traffic skews all kinds of performance metrics. Web analytics dashboards underreport total visits, leading publishers to underestimate their true reach. Traffic sources are misattributed: what appears as a drop in search referral or a rise in “direct” traffic might actually be AI-driven users or bots accessing content. Marketing ROI calculations and ad revenue projections based on these incomplete analytics can be wildly misaligned with reality. In short, decisions are being made on incomplete data.

Infrastructure Costs with No Visible Return

Unlike human visitors, AI bots contribute to server load and bandwidth costs without contributing to ad revenue or subscriptions. Publishers are effectively serving a large invisible audience for free. Case in point: the Wikimedia Foundation observed a 50% surge in bandwidth usage (early 2024) from bots scraping images for AI training, straining their infrastructure (Wikimedia, 2024). They found that 65% of their most resource-intensive traffic came from bots, much higher than the bots' share of pageviews (35%), because scrapers "bulk read" vast swaths of content that normal users never would. Many sites now report similar issues of elevated server costs and even outages due to heavy AI crawler activity (EFF, 2025). blankspace estimates that over $240m will be spent this year by US publishers on unwanted bandwidth costs.

Real World Impacts Already Felt

Fastly observed an AI "fetcher" bot hitting a site with 39,000 requests per minute during tests, an extreme case of automated load that would never appear in Google Analytics, yet could cripple an unprepared site (The Register, 2025). Meanwhile, Google has acknowledged that display ad purchases have dropped from 40% in 2019 to 11% in 2025. Not only are publishers being robbed of their content and left to foot the bill; advertisers are now pulling out (Samuel Gregory, 2025).

In summary, publishers need to recognise that a large “invisible” audience of AI bots is now accessing content. Failing to account for this hidden segment leads to undercounted audiences, overstated ROI on marketing spend, phantom losses of traffic, and unanticipated infrastructure costs. The following sections delve into how and why this happens, provide data on the scope of the issue, and highlight tangible examples of its impact. A concluding section outlines the business risks and strategic considerations, and a technical appendix offers details on identifying AI bots (user agents, IPs) and why existing analytics fall short.


The Rise of the Invisible AI Audience

It's no longer humans alone who make up a site's readership; increasingly, machines are the biggest readers on the web. Multiple independent analyses confirm that automated bots make up the majority share of Internet traffic:

Data security firm Imperva found that over 50% of all web traffic in 2024 was non-human, with malicious "bad bots" comprising 32-37% of total traffic. This marks the first time automated traffic likely surpassed human traffic on the Internet. In other words, bots are no longer just a fringe nuisance; they are, on average, half of your audience.

These figures have been backed up by Cloudflare and Fastly, to the point that 1 million websites have adopted Cloudflare's new tools for blocking AI crawlers, a testament to how widespread and unwelcome this bot surge has become among content providers.

Fastly's 2025 web application firewall data highlights that a few big AI players dominate this automated traffic. Meta, Google, and OpenAI together generate 95% of all AI crawler hits (The Register, 2025). Meta alone was responsible for ~52% of AI crawling load (likely via its LLM training efforts), while Google and OpenAI contributed 23% and 20% respectively. On the on-demand "fetch" side (AI bots fetching specific pages for answers in real time), OpenAI's share was a whopping 98%, reflecting ChatGPT's immense usage. These figures show that a handful of AI companies are behind an outsized portion of invisible traffic, effectively creating a shadow audience that can dwarf human visitors on many sites.

This rise of AI-driven traffic is not just a volume story; it's qualitatively different from traditional bot activity. Unlike old-school search engine spiders (Googlebot, etc.), which indexed content to send human readers back to websites, LLM bots serve the content to end users directly. An AI like ChatGPT or Google's Gemini might read a publisher's article and then present a summary or answer to a user without the user ever clicking through. In essence, the AI becomes the "reader" of the content on behalf of millions of humans, creating an invisible intermediary audience. This has profound implications for how publishers measure reach and engagement.

Publishers are waking up to these implications. The surge in AI scraping has been described as a "structural disruption" to the web's traffic patterns, and it prompted swift industry reactions in 2025. For instance, an alliance of major media companies (Condé Nast, The Atlantic, Reddit, AP, and others) backed Cloudflare's initiative to require permission and partnership for AI access to content. Their CEOs publicly stated that AI firms "can no longer take anything they want for free" and that content access should be limited to partners willing to provide fair value in return. Such statements underscore that the invisible AI audience is now a front-and-centre business issue: the open web's content economy is being reshaped by bots that don't pay for content, don't show up in ads or analytics, and yet consume enormous amounts of information.


Why Traditional Analytics Miss the Mark

The core of the problem lies in how traditional web analytics tools collect data versus how AI bots access websites. Nearly all standard analytics platforms (Google Analytics, Adobe Analytics, etc.) rely on client-side execution, typically a JavaScript snippet embedded in web pages. When a human visitor loads a page in a browser, this script runs and sends a tracking hit (pageview, event, etc.) back to the analytics servers. This process has two key assumptions:

(1) the visitor’s user agent can execute JavaScript, and

(2) the visitor isn’t deliberately filtered out as a known bot.

AI bots break both assumptions.

Most AI crawlers are essentially headless: they request the HTML of pages directly, often using an HTTP library, and do not execute JavaScript or load secondary resources unless absolutely necessary. In fact, OpenAI's crawlers are known to struggle with or skip JavaScript execution entirely, likely to conserve processing power (SeerInteractive, 2025). Anthropic's bots and others behave similarly. As a result, when an AI agent fetches a page, it never triggers the analytics tracking code. The page's content might be parsed and used by the bot, but the visit is completely invisible to Google Analytics (GA) or Adobe because no GA JavaScript ran and no tracking pixel was fired.
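To make the mechanics concrete, here is a minimal sketch (in Python, not any vendor's actual code) of the kind of request an LLM-style crawler makes: one HTTP GET with a declared bot user agent and no JavaScript engine, so the analytics snippet embedded in the page never runs. The URL is a placeholder; the user-agent string follows the format OpenAI publishes for GPTBot.

```python
import requests

# Hypothetical article URL, purely for illustration.
URL = "https://example.com/some-article"

# A declared AI-crawler user agent (format published by OpenAI for GPTBot).
HEADERS = {"User-Agent": "GPTBot/1.0 (+https://openai.com/gptbot)"}

# One plain HTTP GET returns the raw HTML. There is no browser and no JavaScript
# engine here, so any GA/gtag snippet in the page comes back as inert text
# and never executes -- no pageview is ever recorded.
response = requests.get(URL, headers=HEADERS, timeout=10)

print(response.status_code)                    # e.g. 200
print("googletagmanager" in response.text)     # the analytics tag may be present in the HTML...
# ...but nothing runs it, which is why the visit never appears in GA.
```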

Furthermore, many analytics suites intentionally exclude “known bots” by default. GA4, for example, automatically filters out hits that match the IAB’s known spiders/bots list (Google Blog, 2025). Even if an AI scraper did load some resources that ping analytics (which is uncommon), it might identify itself via a user agent that is flagged as a bot, causing GA to drop that data. This means even partial or accidental interactions by bots that reach analytics are likely discarded to avoid inflating metrics with non-human traffic.

The net effect: AI bot visits don’t show up in the charts. They are ghosts in your audience data.

JavaScript-based analytics therefore has inherent limitations. The heavy reliance on JavaScript is an Achilles' heel for visibility into non-human traffic. Many modern bots deliberately fetch just the raw HTML and perhaps API endpoints, ignoring any client-side scripting. For instance, LLM scrapers often behave like basic cURL or Python requests: they grab the content and move on. Some might use headless browsers for complex sites, but even headless browser frameworks (like Puppeteer in headless mode) can be configured to disable scripts for speed. From an analytics standpoint, such visits are completely dark.

It’s worth noting that traditional search engine bots have always been invisible to GA for the same reason (Googlebot never showed up as a “visitor” on your site analytics). However, publishers didn’t mind because search bots ultimately led human users to the site, which would show up in analytics.

In short, JavaScript-centric measurement provides a false sense of accuracy by focusing on human behaviour and filtering out everything else. It misses the new reality that half your "audience" might be non-human and not running any of that JavaScript. Unless a publisher adapts by incorporating server-side analytics or log analysis, the official traffic numbers will remain grievously incomplete.


How AI Bots Consume Content (and Evade Detection)

Understanding how these AI agents interact with your site clarifies why they slip under the radar. The modus operandi of modern AI bots is fundamentally different from a normal user’s web browsing session:


  1. Direct HTML/DOM Retrieval

AI bots typically send an HTTP GET request for a page's URL and retrieve the raw HTML. This is often done using a generic user agent string (or sometimes a specific identifier like GPTBot or ClaudeBot). The bot then parses the HTML, perhaps using an HTML parser or even a lightweight headless browser to construct the DOM internally. Importantly, they focus on textual content and links, not on rendering the page visually. For instance, OpenAI's GPTBot is known as a "proprietary web crawler" that collects page content for training (Medium, 2025), while the ChatGPT-User agent fetches live data to answer user queries (Cloudflare, 2025). Both operate by pulling HTML directly and scanning for useful text.
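A minimal sketch of that parse step, using BeautifulSoup purely as a stand-in for whatever parser a given crawler actually uses: the bot keeps the body text plus the outbound links (to feed its crawl frontier) and discards everything else.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_text_and_links(html: str, base_url: str):
    """Keep what a text-oriented crawler cares about: body text and outbound links."""
    soup = BeautifulSoup(html, "html.parser")

    # Markup a text-focused bot has no use for is simply discarded.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    text = soup.get_text(separator=" ", strip=True)

    # Outbound links feed the crawl frontier so the bot can keep expanding coverage.
    links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    return text, links
```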


  2. Little or No JavaScript Execution

As noted, these bots often do not run page scripts. Cloudflare’s research observed that many newer LLM bots crawl “without fully rendering JavaScript driven content” (SiteBulb, 2025). This is partly a resource choice - executing JavaScript for every page is slow and costly at scale - and partly intentional to avoid interactive elements or blockers. Google’s crawler (Googlebot) spent years evolving to execute some JS for indexing modern sites, but LLM scrapers haven’t reached that sophistication or don’t see the need (SeerInteractive, 2025). The consequence is that any content loaded dynamically via JS (like client-side templates or data fetched after load) might be missed by these bots. Conversely, content present in the raw HTML is easily picked up, which is why many SEO experts advise ensuring critical content is in HTML for both Google and AI bots.


  3. Skipping Assets and Ads

AI bots generally skip fetching images, ads, and other media unless those are specifically part of what they need (e.g. an image bot scraping a media library). For text-oriented crawlers, an HTML page's text is the target, not the images or JS-driven ad frameworks around it. This means they don't trigger ad impressions or viewability, and they also reduce load on themselves by ignoring heavy files. A side effect is that ad analytics and third-party trackers also miss these "visits" entirely; from the ad server's perspective, a bot never "saw" the page, so no ad was loaded or counted.

Example: a human visitor arriving on site, as registered by GA, versus an LLM's AI agent scraping hundreds of pages within minutes:

Figure 1: human visitor being registered by GA in real-time.

Figure 2: AI agent visitor collecting data from hundreds of pages, and registering 0 visits by GA in real-time.


  4. High-Frequency, Broad Crawling

Unlike a human user who might read a couple of articles or browse a few pages, bots often crawl large swaths of a site systematically. They may follow links and sitemaps to fetch hundreds or thousands of pages in a session. This aggressive crawl rate can spike bandwidth usage. Bots also tend to access less popular pages that humans rarely visit (since they are exhaustive in coverage). Wikimedia noted that bots "bulk read" even the long tail of content, causing more requests to hit the origin servers (bypassing caches) and consuming disproportionate resources (Wikimedia, 2024). Such patterns (rapid-fire requests, deep page fetches) are telltale signs in server logs of a bot versus a human. Yet, from an analytics viewpoint, none of those hundreds of fetched pages count toward pageviews if no tracking code executes.
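Those request-rate patterns are straightforward to surface from raw access logs. The following is a rough sketch that assumes a standard combined-format log and an arbitrary requests-per-minute threshold; both are assumptions to adapt to your own setup.

```python
import re
from collections import Counter

# Matches the client IP, the [timestamp], and the final "User-Agent" field of a
# combined-format access log line (an assumption about your log format).
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\].*?"[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

REQUESTS_PER_MINUTE_THRESHOLD = 300   # arbitrary cut-off for "not plausibly human"

def flag_heavy_clients(log_path: str):
    per_client_minute = Counter()
    user_agents = {}

    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = LOG_LINE.match(line)
            if not match:
                continue
            ip, timestamp, user_agent = match.groups()
            minute = timestamp[:17]            # e.g. "12/Sep/2025:14:03"
            per_client_minute[(ip, minute)] += 1
            user_agents[ip] = user_agent

    # Report any client/minute pair that exceeds the threshold.
    for (ip, minute), hits in per_client_minute.most_common():
        if hits < REQUESTS_PER_MINUTE_THRESHOLD:
            break
        print(f"{ip} made {hits} requests in minute {minute} as '{user_agents[ip]}'")
```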


  5. Identifying (or Hiding) Themselves

Some AI bots are transparent about who they are – e.g. GPTBot/1.0 +https://openai.com/gptbot is a clear user agent string disclosed by OpenAI (Search Engine Land, 2024). Others might use generic strings or even mimic real browsers to evade blocking. For instance, Perplexity.ai's crawler was accused of using IP addresses outside its reported range and ignoring robots.txt (Cloudflare, 2025). Many scrapers distribute their requests across cloud servers, making it hard to block by IP alone. Ideally, well-behaved bots should provide a unique user agent and respect robots.txt rules. In practice, compliance varies. As of 2025, industry pressure is mounting for AI companies to publish their IP ranges and use identifiable agents, to help site operators manage and measure bot access. When bots hide as regular browsers, they can even trick analytics into counting them (e.g. if a bot pretends to be Chrome and loads the site including JS, it might slip through GA's filters). However, this is not common for the major AI crawlers today, as they have little incentive to execute tracking scripts.
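One practical check a site operator can run is to audit what their own robots.txt currently asks of the declared AI agents. A small sketch using Python's standard urllib.robotparser (the agent list is illustrative, and of course this only reflects what your robots.txt requests; it cannot confirm that any crawler actually obeys it):

```python
from urllib import robotparser

# A few publicly documented AI-crawler tokens (not exhaustive; verify against
# each vendor's current documentation before relying on this list).
DECLARED_AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "CCBot"]

def check_ai_access(site: str, path: str = "/"):
    """Report which declared AI agents your robots.txt currently allows in."""
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{site.rstrip('/')}/robots.txt")
    parser.read()

    for agent in DECLARED_AI_AGENTS:
        allowed = parser.can_fetch(agent, f"{site.rstrip('/')}{path}")
        print(f"{agent:15s} {'allowed' if allowed else 'disallowed'} for {path}")

# Example (hypothetical domain):
# check_ai_access("https://example.com", "/articles/")
```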


  6. API-Based Fetchers vs. Crawlers

There's a distinction between bulk crawlers (which gather content for training or indexing) and real-time fetchers (which retrieve specific pages to answer user queries on the fly). Bulk crawlers (like GPTBot, Claude's crawler, Common Crawl) operate continuously and broadly, whereas fetchers (like ChatGPT's browsing mode or Perplexity's on-demand fetch) operate reactively but can spike traffic. Fastly's data shows about 80% of AI bot traffic is crawler activity and ~20% is fetch-on-demand (The Register, 2025). The fetchers can hit a site extremely hard in short bursts; one example was an OpenAI fetcher generating 39k requests per minute to a single site during tests. This likely occurs when an AI service gets a task involving that site (perhaps due to many users querying something from it, or a loop/crawler error). These fetcher bursts are even less likely to be measured by traditional analytics, since they are basically a blitz of headless hits, none of which run client code. They can, however, wreak havoc on infrastructure.
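In practice, burst protection is usually handled at the CDN or WAF layer (Cloudflare and Fastly both offer rule-based controls), but the underlying idea is simply per-client rate limiting. A minimal in-memory sliding-window sketch, with an illustrative threshold and key:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 600   # illustrative ceiling per client key

_recent = defaultdict(deque)    # client key -> timestamps of recent requests

def allow_request(client_key: str) -> bool:
    """Sliding-window limiter keyed by user agent + IP (or whatever key you choose)."""
    now = time.monotonic()
    window = _recent[client_key]

    # Discard timestamps that have fallen out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()

    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        return False          # caller should respond with HTTP 429
    window.append(now)
    return True
```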


  7. No User Interaction (Zero Engagement)

Because bots don't interact with the page in a human way, metrics like bounce rate, time on page, scrolling, clicks, etc. are meaningless for them. If by chance a bot triggers an analytics event, it often appears as a bounce (one-page session) with 0 seconds duration. Sophisticated analytics users might spot unusual patterns (e.g. a spike of visits with 0-second duration and 100% bounce could indicate a bot scraping a landing page repeatedly). But many will just see averaged metrics degrade (e.g. site-wide bounce rate might rise if a lot of bot "visits" are counted as bounces). Overall, though, since most bots aren't counted at all, the bigger problem is absence, not the skew from their inclusion. The audience that spent zero seconds but consumed 100 pages leaves no trace in GA.

In essence, AI bots act like ultra-fast, script-averse, voracious readers of your site. They come in, vacuum up content, and disappear, all before the typical tools even know they arrived. For publishers, it's like having an invisible legion of readers who read everything but never show up to be counted or monetised.


Blind Spots in Audience Metrics and ROI

The rise of this invisible AI audience introduces serious blind spots into the key metrics that publishers & advertisers rely on:


Undercounted Pageviews & Visits

The most direct impact is that your raw traffic numbers are much lower than reality. If bots make up, say, 30-60% of your actual server requests but none of those are in GA, your true audience (human + AI) is much larger than reported. A publisher might think their content is attracting 1 million monthly views (based on analytics), when in fact the content is being fetched 2.1 million times, with the extra 1.1M being AI scrapers. In extreme cases blankspace has seen AI agent visits balloon to ~98% of a publisher's traffic due to extremely high demand for specific articles that are feeding into live AI answers.
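The arithmetic behind that example is simple once you have a log-derived estimate of the bot share of requests. A quick illustration with placeholder numbers (both inputs are assumptions, not measured figures):

```python
# Illustrative numbers only -- substitute your own GA and log-derived figures.
ga_reported_views = 1_000_000          # what the analytics dashboard shows (human, JS-executing visits)
bot_share_of_requests = 0.52           # share of server requests attributable to AI bots (from logs)

# If bots are 52% of requests, humans are 48%, so total fetches are:
total_fetches = ga_reported_views / (1 - bot_share_of_requests)
invisible_fetches = total_fetches - ga_reported_views

print(f"Total fetches ≈ {total_fetches:,.0f}, of which ≈ {invisible_fetches:,.0f} never appear in GA")
# With these placeholder inputs: ≈ 2,083,333 total fetches, ≈ 1,083,333 invisible.
```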


Misleading Traffic Sources

AI-driven content consumption often shows up indirectly or not at all in referrer analytics. For example, if Bing's AI chat or ChatGPT answers a question using your site's info, a user might never click the link (zero referral), or if they do, it might appear as a direct hit (since the AI interface might not pass a referrer). Publishers have observed that some traffic labelled "Direct" or "Typed/Bookmarked" in analytics actually originates from AI recommendations or summaries. This misattribution wreaks havoc on marketing analytics: you might funnel budget into what you think is a successful direct traffic strategy or referral partnership, when those visitors were really coming because an AI tool surfaced your content. Conversely, if AI answers are siphoning off what used to be organic search clicks, your search traffic drop might be incorrectly blamed on SEO issues or algorithm changes, rather than the reality of AI competition. Gartner's projection of a 25% drop in search engine traffic by 2026 due to AI answers underlines this shift: publishers may see a decline in Google Analytics "organic search" visits and not immediately connect it to users getting answers from Bing Chat, Bard, or other LLMs without clicking through (Search Engine Land, 2025).


Conversion and Engagement Metrics

If a significant portion of your would-be audience is consuming content via AI, you'll also notice changes in engagement metrics among the humans that do visit. People coming from an AI summary might behave differently (perhaps they only clicked through for a specific detail, then left quickly). But more starkly, AI consumption means fewer humans are coming at all for certain content, which can drop conversions (newsletter signups, e-commerce purchases, etc.) that would have happened on site.

For example, an educational site might normally convert 5% of its visitors into trial sign-ups; if students get answers from ChatGPT instead of visiting, those potential sign-ups evaporate. Yet the site's analytics might just show a traffic drop and lower sign-ups, without a clear cause. This is exactly what Chegg, the online learning platform, experienced: they saw a sharp decline in student questions and traffic once ChatGPT became readily available.

In 2023-2024, Chegg's homework help usage dropped so much that the company blamed AI for its user loss and even filed a lawsuit when Google started providing AI answers (so-called "Overviews") that reduced clicks to Chegg.

Chegg reported that new subscriber growth stalled and that they had to cut revenue forecasts, largely attributing it to students getting answers from AI without visiting their site. This kind of impact (users served by AI instead of your platform) means your funnel metrics and ROI on content can nosedive unexpectedly (Ronin Legal Consulting, 2024).


Advertising and Viewability Gaps

For ad supported publishers, invisible traffic is especially problematic. Advertisers only pay for human impressions (and increasingly demand verification of human viewability). When a sizeable chunk of content consumption is via bots who don’t load ads, it translates to lost ad opportunities and revenue. It’s as if half your readership suddenly installed ad blockers (but even more extreme, because they don’t even appear as pageviews).

Standard ad analytics will just show fewer impressions. If a publisher isn’t aware of the AI factor, they might attribute this to general audience decline or seasonality, rather than a structural shift. Moreover, some ad metrics could be skewed: AI bots might trigger ad calls in rare cases and ads could be served to non-human eyes, inflating apparent ad inventory but with no real view.


Bandwidth and Infrastructure Costs

Invisible traffic still incurs very visible costs on the backend. When bot traffic goes unseen in analytics, a danger is that infrastructure may be under-provisioned or caught by surprise. For example, if your analytics show 1M monthly users but in reality scrapers are hitting 2M pages, your bandwidth bills and server load will correspond to 2M. Publishers have been "shocked" to find server bills skyrocketing while real (human) traffic stayed flat (Whole Whale, 2025). blankspace has estimated that up to $240m is being spent on bandwidth costs alone by US publishers per year, and this cost is only going to rise. If these costs aren't mapped to revenue (since the traffic isn't in the monetisation pipeline), it can make a project or site seem less profitable than it actually is in human terms. Essentially, ROI per user drops because you're counting only human users in the denominator, but the numerator (cost) includes servicing a bunch of bots.
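A back-of-envelope way to put a number on this for your own site: multiply the invisible fetch count from your logs by an average transfer size and your egress rate. All three inputs below are placeholder assumptions, not measured figures:

```python
# Back-of-envelope cost of serving invisible bot traffic.
# All inputs are assumptions -- replace them with your own log and billing data.
bot_page_fetches_per_month = 1_100_000      # fetches that never appear in analytics
avg_transfer_mb = 1.5                       # HTML plus whatever assets the bot actually pulls
egress_cost_per_gb = 0.08                   # USD; varies widely by host/CDN

bot_bandwidth_gb = bot_page_fetches_per_month * avg_transfer_mb / 1024
monthly_cost = bot_bandwidth_gb * egress_cost_per_gb

print(f"≈ {bot_bandwidth_gb:,.0f} GB/month of bot egress, ≈ ${monthly_cost:,.0f}/month")
# With these placeholders: ≈ 1,611 GB and ≈ $129 per month for one mid-sized site --
# modest on its own, but it compounds across properties and rises with crawl volume.
```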


Content Strategy Misalignment

Another subtle impact is on content strategy and editorial decisions. Many publishers obsess over analytics - which articles get the most views, which topics drive engagement, etc. If AI bots are heavily accessing certain content, but those visits aren't counted, you might undervalue content that is actually highly "consumed" (albeit by AI). Conversely, a piece that gets moderate human traffic might be extremely influential in AI answers, meaning it reaches audiences in ways not captured by traditional metrics. Publishers not tracking this could mis-prioritise what to produce. Remember when Facebook (now Meta) misrepresented video completion metrics, and publishers piled cash into video editorial teams when in fact it wasn't commercially viable for them to do so because the Facebook data was wrong. Competitors who do analyse bot traffic and AI citations can gain an edge. As noted in one analysis, companies tracking AI-driven traffic can see which content AI systems favour and adjust strategy accordingly, while those "in the dark" could miss opportunities or fail to notice a 64% drop in traffic due to being omitted from AI answers. In other words, AI is a new referral channel, one that doesn't show up as a line item in Google Analytics. Ignoring it is like ignoring SEO or social media referrals; it could mean falling behind in visibility where it increasingly matters.

In aggregate, these blind spots mean that many digital publishers are effectively operating with an outdated picture of their audience. As one industry CEO bluntly put it, companies are "following outdated mental models because this change is invisible to standard measurement". The danger is making big decisions - budgeting, staffing, acquisitions, content investments - based on metrics that omit what is now the "majority audience" (bots and AI-driven consumers). A site might have far more influence (via AI platforms) than its ad impressions indicate, or conversely it might be suffering AI-driven losses not evident in top-line metrics.

To address these blind spots, some forward-looking teams have started treating server logs as the new source of truth for "AI impressions." By analysing raw access logs, they can count hits from known AI agents and estimate how often content is being seen by AI users. It's an emerging field, but the overarching message is clear: publishers need new measurement approaches to capture the invisible audience. Ignoring it will only widen the gap between perceived and actual performance.
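As a starting point, counting "AI impressions" from logs can be as simple as matching user agents against the known tokens listed in Appendix B. A rough sketch, again assuming a combined-format access log and using a deliberately short, non-exhaustive token map:

```python
import re
from collections import Counter

# Substring -> agent family. Illustrative, not exhaustive; see Appendix B.
AI_AGENT_TOKENS = {
    "GPTBot": "OpenAI GPTBot",
    "ChatGPT-User": "OpenAI ChatGPT-User",
    "ClaudeBot": "Anthropic ClaudeBot",
    "PerplexityBot": "Perplexity",
    "Google-Extended": "Google (AI training)",
    "CCBot": "Common Crawl",
}

# Assumes a combined-format log: request path in the quoted request line,
# user agent in the final quoted field.
LOG_LINE = re.compile(r'"(?:GET|POST|HEAD) (\S+)[^"]*".*"([^"]*)"$')

def count_ai_impressions(log_path: str) -> Counter:
    impressions = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = LOG_LINE.search(line.rstrip())
            if not match:
                continue
            path, user_agent = match.groups()
            for token, family in AI_AGENT_TOKENS.items():
                if token in user_agent:
                    impressions[(family, path)] += 1
                    break
    return impressions

# Example usage:
# for (family, path), hits in count_ai_impressions("access.log").most_common(20):
#     print(f"{hits:6d}  {family:22s} {path}")
```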


Conclusion: Making the Invisible Visible

The evidence is overwhelming: bots are no longer a rounding error; they are the majority of internet traffic. AI agents fetch, parse, and reuse your content without firing analytics tags, without loading ads, and without leaving a trace in your dashboards. The result is an audience you're serving - at real cost - but cannot see or monetise.

Traditional analytics was never designed for this reality. Client-side JavaScript can’t catch headless bots; default bot filters deliberately erase what little signal might exist. The gap between what publishers think is happening and what’s actually happening has become a chasm.

Closing that gap requires a new generation of tools purpose-built for AI visibility. Platforms like blankspace's Monitor take log-level signals, identify AI crawlers by user agent, IP, and behavioural pattern, and surface them as first-class audience metrics. It shows you:

  • How often each AI agent is consuming your content

  • Which sections of your site are most affected

  • How often your content is cited back in AI answers

  • The bandwidth and cost impact of invisible traffic

Without this visibility, you are undercounting your reach, underestimating your costs, and undervaluing your inventory. With it, you can start treating machine consumption as part of your strategy - whether that means renegotiating content access, protecting infrastructure, or creating new monetisation paths.

The invisible AI audience is already here. The only question is whether you measure it. Tools like Monitor give publishers the missing lens, turning your invisible audience into actionable data.

This study was put together by analysing a number of key case studies & from blankspace’s proprietary data. We’d like to thank everyone involved in making this study possible.

This white paper has drawn on data and insights from industry research and reports by Cloudflare, Imperva, Brightspot, Fastly, and others, as well as case specific reporting (e.g. Reuters, Wikimedia Foundation, etc.). The citations included throughout provide direct links to these sources for further reading and verification. By examining both the quantitative scope (percentages of traffic, growth rates of bot activity) and qualitative impact (case studies and quotes from industry leaders), we hope to have illuminated the often unseen, yet critical, phenomenon of the invisible AI audience. Publishers who learn from this & adapt will be better positioned to protect & monetise their content in the AI driven web ecosystem, while those who ignore it risk falling behind.


Appendix A

Case Studies: Impacts of Invisible AI Traffic

Below are a number of case studies that support our wider study of how AI has impacted the publishing industry in 2025 and, more broadly, the internet.

Case Study 1: Chegg vs. The AI Homework Helper - Traffic Loss and Legal Action.

Chegg, a popular education tech company, provides textbook solutions and study help. In early 2023, Chegg noticed a sudden stagnation and decline in student questions and site usage. The culprit became apparent: students had discovered that AI tools (like ChatGPT) could answer homework questions instantly, bypassing Chegg. Chegg's CEO publicly warned that AI answer engines were hurting their growth. By 2024, the situation escalated when Google introduced AI "overview" answers on search results, often satisfying the query without any click to Chegg. Chegg reported that their traffic and new sign-ups dropped dramatically; they cited a 30% direct traffic loss due to these AI answers.

The impact was severe enough that in 2025 Chegg filed a lawsuit against Google, alleging that the AI-generated snippets in search results were siphoning users and content value from their platform. Financially, Chegg had to cut its revenue outlook and even explored "strategic alternatives" for its business (Yahoo Finance, 2024).

This case illustrates the extreme end of invisible AI audience impact: the human audience simply stopped showing up because an AI intermediary fulfilled the need. In analytics terms, Chegg saw big declines in organic traffic and engagement, but those metrics alone didn't capture how many students were still accessing Chegg's content via AI. Only through user surveys and external observation (students openly using ChatGPT for answers) did it become clear what was happening (Ronin Legal Consulting, 2024).


Case Study 2: Wikimedia vs. Unprecedented Scraping, Bandwidth Surge and Site Strain.

The Wikimedia Foundation (which runs Wikipedia and related projects) observed a dramatic uptick in scraping activity starting in late 2023. By early 2024, bandwidth used for Wikimedia's media content had grown 50% beyond normal, and staff determined this was "not coming from human readers, but largely from automated programs" scraping their images and articles (Wikimedia, 2024).

A specific incident highlighted the issue: when a high-profile event (the death of Jimmy Carter) occurred, Wikipedia expected a spike in traffic and was prepared for it. Human pageviews did spike moderately (2.8M views on Carter's page in a day), which was manageable. But simultaneously, many users (or bots) accessed a long video of a Carter debate. This caused network traffic to double, saturating some connections and slowing the site. What confused the team was that the human interest, while high, shouldn't have stressed their systems; they had handled bigger news events before. The post-mortem revealed that the baseline load was already elevated due to continuous bot scraping, leaving less headroom for the human surge. Wikimedia engineers then dug deeper and found that 65% of the expensive, uncached traffic hitting core servers was from bots, which was disproportionate since bots accounted for ~35% of total page hits by count. The "cheap" traffic (from human readers hitting cached pages) was being overshadowed in resource usage by bots doing full reads of less-popular pages and media.

This led Wikimedia to publicly address the issue: they warned that even a well-funded nonprofit with robust infrastructure was feeling the strain of AI scraper traffic. They started implementing more aggressive bot filters and considered options like preferential APIs or data dumps for AI to use instead of hammering the public site. For the Wikimedia community, an additional concern was that this invisible use of content provided no credit or new contributors ("attribution" was insufficient, so AI companies were "poisoning the well" by not driving users back to the wiki).

In terms of metrics, Wikipedia doesn't run ads, but it does track pageviews for community purposes. The rise in bot hits meant their pageview counts (which attempt to filter bots) might under-report the true access count of their content. It also forced them to allocate budget for bandwidth upgrades, money that, in essence, subsidises commercial AI model training. This case is telling because Wikipedia is one of the most scraped sites for AI, and it illustrates the macro-scale impact: even Wikipedia, a top-10 website, struggled when the AI boom unleashed scrapers on it (Wikimedia, 2024).


Case Study 3: Penske Media vs. Google, AI Overviews and the Battle for Clicks.

In September 2025, Penske Media Corporation (PMC), the parent company of Rolling Stone, Variety, Billboard, and The Hollywood Reporter, filed a federal lawsuit against Google over its AI-powered "Overviews" in search. These generative summaries appear at the top of Google results, pulling in key content from publishers and often answering user queries outright without requiring a click-through. PMC's lawsuit argues that this practice is siphoning away their traffic, ad revenue, and affiliate income by substituting Google's AI summaries for direct visits to their sites (Reuters, 2025).

PMC claims that Overviews have “drastically reduced” the incentive for users to click, leading to steep declines in search referrals. The lawsuit cites internal data showing affiliate revenue had fallen by more than one third since 2024, and argues that Google is unjustly enriching itself by using publisher content to power its AI without fair compensation (The Verge, 2025).

Unlike many earlier lawsuits that framed the issue as copyright infringement, PMC’s legal strategy leans heavily on antitrust law. The complaint accuses Google of monopoly leveraging and coercive tying, effectively forcing publishers to allow their content into AI Overviews if they want to remain visible in search at all. This, PMC argues, leaves them with no practical option to “opt out” of AI usage without sacrificing essential search traffic (MoginLaw, 2025).

Google has defended the feature, saying AI Overviews make search “more useful” and help users discover content more efficiently. The company insists it still drives billions of clicks to publishers and denies coercing participation, but PMC disputes this, noting that Overviews increasingly provide enough information to satisfy users without the need for a click.

The broader implication of this case is clear: zero-click AI answers are no longer just a theoretical risk; they are actively reshaping publisher economics. PMC's lawsuit may set a precedent for how courts view AI-generated summaries within dominant platforms. If the claims succeed, it could force Google (and by extension, other AI companies) to license publisher content or redesign how AI outputs display in search. If it fails, publishers may find themselves further marginalised, watching high-value content re-routed through AI interfaces while their own analytics show only the hollow shell of what used to be user traffic (TechCrunch, 2025).


Case Study 4: Perplexity & Cloudflare – Stealth Crawling Exposed

In August 2025, Cloudflare published a detailed exposé accusing Perplexity, the AI answer engine, of engaging in “stealth crawling” behaviour that deliberately evades block rules and disguises crawler identity. According to Cloudflare, Perplexity initially uses its declared user agents (e.g. PerplexityBot, Perplexity-User) when possible, but when blocked it switches tactics: rotating IP addresses and even autonomous system networks (ASNs), modifying user-agent strings to mimic mainstream browsers (e.g. Chrome on macOS), and bypassing robots.txt and Web Application Firewall (WAF) rules designed to deny access. Cloudflare’s tests showed that Perplexity successfully retrieved content from domains explicitly configured to disallow all bots, including ones with no public backlinks or prior indexing. The implication is that Perplexity’s crawlers dynamically shift identity to continue harvesting content despite defensive measures. To counter this, Cloudflare removed Perplexity from its “verified bot” list and began deploying new heuristics to block its stealth crawling. (Cloudflare, 2025), (Search Engine Land, 2025).

Cloudflare’s follow-up telemetry also revealed a massive acceleration in Perplexity’s crawl-to-click ratio: over the course of 2025, bot crawling by Perplexity rose by ~256.7% relative to human traffic, indicating that Perplexity was increasingly fetching pages well beyond its click-derived referrals. In other words, for every human hit Perplexity referred, it was sending hundreds of bot requests. (Cloudflare blog, 2025).

Perplexity publicly pushed back, calling Cloudflare's analysis misguided and accusing it of conflating unrelated traffic (for example from a third-party service called BrowserBase) with its own. Nevertheless, the technical behaviour observed (user-agent manipulation, IP/ASN rotation, access despite blocks) underscores a new frontier of AI crawler behaviour that seeks to evade detection and control (TechRadar, 2025).

From Chegg’s precipitous traffic decline to Wikimedia’s infrastructure stress, to PMC’s legal challenge over click diversion, to Perplexity’s stealth crawling tactics, these case studies form a mosaic of how the invisible AI audience is rewriting the rules of content consumption, traffic attribution, and publisher control.

  1. AI intermediaries are actively replacing user visits. In Chegg's case, students stopped going to the site because ChatGPT and AI search summaries delivered the answer directly. That meant Chegg's analytics showed falling traffic, even though the content was still being consumed (just invisibly).

  2. Servers, not dashboards, often show the real load. Wikimedia’s experience revealed that bot consumption can dominate resource use even when human traffic appears modest. The invisible layer stresses infrastructure, and the gap between dashboards and origin logs becomes a liability.

  3. Search platforms are weaponising summarisation. PMC’s lawsuit highlights a critical turn: search engines now integrate AI Overviews that cannibalise clicks and divert value upstream. Even if users don’t come to your site, your content continues to power the answer, without monetisation or attribution.

  4. Some AI agents adapt to evade control. Perplexity's evolving tactics illustrate that it's not enough to block known bots or rely solely on robots.txt. When blocked, some crawlers can morph their identity, adopt browser-like UAs, rotate IPs/ASNs, and continue harvesting.

Taken together, these stories converge on three harsh truths:

  • You cannot trust front-end analytics alone. AI-driven content consumers may never trigger your tracking tags.

  • Control is eroding. Where publishers once governed indexing via robots.txt or bot policies, some AI actors are circumventing those norms.

  • New visibility tools are non-optional. Without instrumentation that spans logs, fingerprinting, and agent attribution, publishers risk building strategy on an illusion.

If AI traffic is now a material share of content consumption, publishers must treat it as a first-class audience. That means adopting tools capable of surfacing invisible fetches, attributing them to agents or families, quantifying their cost, and deciding how to govern access, whether via blockade, metering, or licensing. The race is no longer about who builds the best content; it's about who measures and controls how that content is used in an AI era.


Appendix B

Technical Summary of LLM Bots and Analytics Limitations

Common AI/LLM Web Crawler User Agents: A number of web crawler user-agent strings have been publicly identified for major AI services. Webmasters can use these identifiers in server logs or robots.txt to recognise AI bot traffic (a short script for turning this list into robots.txt rules follows the table):

| Bot / User-Agent | Purpose & Behaviour | Key Traits | Analytics Impact |
| --- | --- | --- | --- |
| OpenAI GPTBot (GPTBot/1.0 +https://openai.com/gptbot) | Bulk crawler for training ChatGPT/LLMs. | +305% growth in 2024–25; ~7.7% of crawler traffic. Does not execute JS. No published IP ranges. | Invisible in GA (no JS execution, likely filtered if detected). |
| OpenAI ChatGPT-User (ChatGPT-User/1.0) | On-demand fetcher when ChatGPT queries live web data. | +2,825% growth (small base). Headless browser-like, but ignores most JS. | Indicates direct AI citation of content, but not tracked in GA. |
| OpenAI OAI-SearchBot | Indexing/search agent powering some ChatGPT features. | Similar to GPTBot; HTML-first fetch. | GA-blind, often filtered as bot. |
| Anthropic ClaudeBot / anthropic-ai | Fetches docs for Claude answers & training. | Dropped from 11.7% → 5.4% share of crawler traffic. | Non-JS; invisible in analytics. |
| PerplexityBot / Perplexity-User | Crawler + live fetches for Perplexity AI search. | Perplexity-User often bypasses robots.txt. ~1% of AI traffic (high growth). | Crawler blocked in GA, but user-driven fetches still hit sites unseen. |
| Googlebot / Google-Extended / GoogleOther | Search indexing + AI data (SGE). | Executes some JS. Google-Extended flag signals AI training use. ~50% of crawler share. | Always invisible in GA (filtered as known bot). |
| Bingbot / BingPreview | Indexing + Bing AI (Copilot). | Dual-purpose bot (search + AI). BingPreview fetches page snapshots. | Filtered out; never in analytics. |
| Meta-ExternalAgent / FacebookBot | Used for FB/IG previews and AI training. | Fastly: Meta = 52% of AI crawler traffic. May overlap with previews. | Risk of over-blocking (breaks FB previews). Invisible in GA. |
| Amazonbot / Applebot / ByteSpider / CCBot (Common Crawl) | Product search, Siri suggestions, TikTok AI, or open data crawlers. | ByteSpider large in 2024, fell off in 2025. Common Crawl feeds many AI labs. | Typically blocked or filtered; not seen in GA. |
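As a worked example of the robots.txt route, the snippet below generates blanket disallow rules for a handful of these documented tokens. The list is illustrative, not exhaustive (Google-Extended opts content out of Google's AI training use without affecting Googlebot search crawling), and as the Perplexity case study shows, robots.txt is only a request: pair it with server- or CDN-level enforcement.

```python
# A minimal sketch: write a robots.txt that asks documented AI crawlers to stay out.
# The token list is illustrative -- check each vendor's current documentation,
# and remember robots.txt is advisory; enforcement needs WAF/CDN rules as well.
AI_CRAWLERS_TO_BLOCK = [
    "GPTBot",           # OpenAI training crawler
    "ChatGPT-User",     # OpenAI on-demand fetcher
    "ClaudeBot",        # Anthropic
    "PerplexityBot",    # Perplexity
    "Google-Extended",  # opts content out of Google AI training, not Search indexing
    "CCBot",            # Common Crawl
]

rules = [f"User-agent: {agent}\nDisallow: /\n" for agent in AI_CRAWLERS_TO_BLOCK]

with open("robots.txt", "w", encoding="utf-8") as handle:
    handle.write("\n".join(rules))

print("\n".join(rules))
```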