The 2026 field guide to AI crawler tracking — every user-agent, IP range, robots.txt directive, opt-out matrix, and how to tell crawls from human AI-referred visits.
The most common question I get from founders since Q1 2026 is some variant of "should I block GPTBot?" The second most common is "are AI bots eating my bandwidth?" The third is "how do I know if any of this is working?" All three deserve real answers, and the real answers are stranger than the discourse suggests. Blocking GPTBot does not do what most people think it does. AI bot bandwidth is usually not the problem operators imagine it to be. And the question "is it working?" has a structural answer that involves both crawl logs and human-session attribution, not just one or the other.
This piece is the technical companion to the strategic AEO vs SEO in 2026 post and the analytics-side ChatGPT referral analytics guide. Where those two cover what to publish and how to measure human AI traffic, this one covers the crawlers themselves: every user-agent you need to know, how to detect them, when to block them, what each one actually does, and how to wire their crawl frequency into a citation-readiness signal. If you have read the practical track-ChatGPT-traffic playbook and the broader how to get cited by AI engines post, this is the deeper dive on the bot side of the same problem.
Quick Facts
| Metric | Value | Source |
| --- | --- | --- |
| AI bot share of total bot traffic (2024-2025) | ~4-6% | Cloudflare Radar [5] |
| Documented OpenAI user-agents | 3 (GPTBot, ChatGPT-User, OAI-SearchBot) | OpenAI bot docs [1] |
| Documented Anthropic user-agents | 3 (ClaudeBot, anthropic-ai, claude-web) | Anthropic ClaudeBot docs [12] |
| Documented Perplexity user-agents | 2 (PerplexityBot, Perplexity-User) | Perplexity docs [16] |
| Google AI training crawler | Google-Extended (shares Googlebot infra) | Google crawler docs [13] |
| Apple AI training opt-out crawler | Applebot-Extended | Apple Applebot docs [17] |
| ByteDance / TikTok AI crawler | Bytespider | Bytespider docs [18] |
| Meta AI crawler | Meta-ExternalAgent | Meta crawler docs [19] |
| Common Crawl bot (third-party AI training source) | CCBot | commoncrawl.org [20] |
| Cloudflare one-click AI bot block adoption | ~1M+ domains as of mid-2025 | Cloudflare blog [22] |
| robots.txt RFC reference (RFC 9309) | Standardized September 2022 | IETF RFC 9309 [21] |
| llms.txt adoption (public SaaS, Q1 2026) | ~7% | llmstxt.org [25] |
| Median spoofed-UA rate on AI bot hits | 5-15% | Attrifast aggregate |
| AI Overviews appearance rate (US English) | 13-15% of queries | Search Engine Land [10] |
Two numbers do most of the work here. The 4-6% AI bot share of total bot traffic is the demand-side number — AI crawlers are real but still a small fraction of the bot fleet. The 5-15% spoofed-UA rate is the data-quality number — at least one in twenty hits claiming to be GPTBot is not actually OpenAI. If you skip the IP verification step you are reporting on phantom traffic.
Why AI crawler tracking matters in 2026
There are three reasons to track AI crawlers, and they are different enough that the right answer depends on which one you care about.
Reason one: crawl frequency is a leading indicator of citation readiness. When GPTBot, ChatGPT-User, or OAI-SearchBot crawl rates climb on a specific page, it is rarely random. It usually means OpenAI's pipeline considers the page worth indexing, fetching live in response to a user query, or surfacing in search. A page that was getting zero AI-bot hits per week and then jumps to a dozen ChatGPT-User hits in 48 hours is, in my experience, an extremely reliable signal that the page is being cited in answers to a query that is trending in real time. By the time the human traffic shows up in your analytics 24-72 hours later, the crawl-rate spike has already told you something is happening.
Reason two: bandwidth and infra cost accountability. On most sites this is a minor concern. On a few — large catalogs, faceted filter pages, search endpoints, infinite-scroll listing pages, image-heavy galleries — AI crawlers can hit thousands of unique URLs per day and chew through egress budget that you were not planning for. The Vercel team published a detailed breakdown of AI bot traffic patterns in late 2024 [23] showing that some of their largest customers were seeing AI crawler traffic exceed search-engine crawler traffic for the first time. If you are paying per GB or per request, knowing which bots are hitting which paths is the difference between a $40 surprise and a $4,000 surprise.
Reason three: content licensing and IP enforcement decisions. If you publish content that you actively monetize — a paid newsletter, a paywalled archive, a high-investment editorial product — the question of whether AI labs are ingesting it for training is a legitimate commercial concern. The New York Times v. OpenAI suit, the Reddit data licensing deals, the various publisher partnerships with OpenAI and Anthropic in 2024-2025 — all of these orbit the same question. You cannot make a sensible decision about which crawlers to allow, block, or negotiate with until you can see what is hitting your site and how often.
Most operators care about exactly one of those three reasons. The mistake I see most often is conflating them. "I want to block GPTBot because of bandwidth" usually does not survive five minutes of looking at actual numbers; "I want to track GPTBot because of citation readiness" is a measurement project, not a blocking project; "I want to negotiate with OpenAI because of licensing" is a different project again. Pick the reason first, then choose the action.
| Goal | Right action | Wrong action |
| --- | --- | --- |
| Maximize AI citations for SaaS / ecommerce | Allow GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended; instrument all of them | Block GPTBot to "protect content" |
| Bandwidth control on large catalog site | Cache aggressively, rate-limit per ASN, log per-bot bandwidth | Block all AI bots indiscriminately |
| Paywall enforcement on paid editorial | Block GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider; allow ChatGPT-User on free preview only | Block via robots.txt only and assume that is enforcement |
| Citation-readiness leading-indicator dashboard | Log all AI bot hits per URL per day, correlate to human AI-referral spikes 24-72h later | Treat bot hits as traffic in your KPI chart |
| Compliance with EU AI Act opt-out | Implement TDMRep tdm-reservation directive and ai-content-usage header | Rely on robots.txt alone |
The 10 AI crawlers you need to know
Below is the working reference table I keep in a spreadsheet and update roughly monthly. The list is not exhaustive: there is a long tail of academic, research, and one-off crawlers that accounts for the bottom 5% of AI bot traffic, but the crawlers listed here account for more than 95% of what you will see in production logs in 2026.
| Crawler | Owner | Purpose | Respects robots.txt | Documented IP range |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Training corpus crawler | Yes [1] | openai.com/gptbot.json |
| ChatGPT-User | OpenAI | Live browse / user-triggered fetch | Yes for crawls, no for direct user fetches [1] | openai.com/chatgpt-user.json |
| OAI-SearchBot | OpenAI | ChatGPT search index crawler | Yes [1] | openai.com/searchbot.json |
| ClaudeBot | Anthropic | Training corpus crawler | Yes [12] | Published in Anthropic docs |
| anthropic-ai (legacy) | Anthropic | Legacy training crawler | Yes [12] | Same as ClaudeBot |
| claude-web | Anthropic | Live browse user-triggered | Yes [12] | Same as ClaudeBot |
| PerplexityBot | Perplexity | Search index crawler | Yes [16] | Published in Perplexity docs |
| Perplexity-User | Perplexity | User-triggered live fetch | No (per Perplexity policy) [16] | Same as PerplexityBot |
| Google-Extended | Google | Training crawler for Gemini, Vertex AI | Yes [13] | Shares Googlebot IPs |
| Applebot-Extended | Apple | Training opt-out for Apple Intelligence | Yes [17] | Shares Applebot IPs |
| Bytespider | ByteDance | Training crawler (TikTok / Doubao) | Yes (post-2024) [18] | Published in Bytespider docs |
| Meta-ExternalAgent | Meta | Training crawler (Llama, Meta AI) | Yes [19] | Published in Meta crawler docs |
| Amazonbot | Amazon | Multi-purpose, partially AI training | Yes [24] | Published in Amazon docs |
| CCBot | Common Crawl | Open dataset used by most AI labs | Yes [20] | commoncrawl.org |
| Diffbot | Diffbot | Commercial crawler, AI-data buyer | Yes | Published in Diffbot docs |
| YouBot | You.com | AI search crawler | Yes | Published in You.com docs |
| Cohere-AI | Cohere | Training crawler | Yes | Published in Cohere docs |
| ImagesiftBot | Hive AI | Image training crawler | Yes | Published in Hive docs |
The next table is the one I actually use to write log greps. Exact user-agent strings as they appear in the field as of May 2026.
| Crawler | User-Agent substring to match |
| --- | --- |
| GPTBot | GPTBot/1.1 (or earlier GPTBot/1.0) |
| ChatGPT-User | ChatGPT-User/1.0 (also seen as ChatGPT-User/2.0) |
| OAI-SearchBot | OAI-SearchBot/1.0 |
| ClaudeBot | ClaudeBot/1.0 or claudebot |
| anthropic-ai | anthropic-ai (older legacy string) |
| claude-web | Claude-Web/1.0 |
| PerplexityBot | PerplexityBot/1.0 |
| Perplexity-User | Perplexity-User/1.0 |
| Google-Extended | (UA same as Googlebot; differentiated by robots.txt directive only) |
| Applebot-Extended | (UA same as Applebot; differentiated by robots.txt directive only) |
| Bytespider | Bytespider |
| Meta-ExternalAgent | meta-externalagent |
| Amazonbot | Amazonbot/0.1 |
| CCBot | CCBot/2.0 (Common Crawl) |
| Diffbot | Diffbot |
| YouBot | YouBot |
| Cohere-AI | cohere-ai |
| ImagesiftBot | ImagesiftBot |
Two important quirks in that table. First, Google-Extended and Applebot-Extended do not have unique user-agent strings. They share the standard Googlebot and Applebot UAs respectively. The way you "opt them out" is in robots.txt by writing a rule against Google-Extended or Applebot-Extended as a User-agent token — the crawler reads its own token from your robots.txt, even though the HTTP UA header is generic. This trips up log greps constantly. If you only grep by HTTP UA you cannot separate AI-training Googlebot from search-indexing Googlebot.
Second, the version suffixes drift. GPTBot was GPTBot/1.0 through most of 2024 and bumped to GPTBot/1.1 in late 2024. ChatGPT-User has been observed at both 1.0 and 2.0 in production logs. Your regex should match the prefix (GPTBot/), not the exact version.
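To make the prefix-not-version advice concrete, here is a minimal Python sketch of a matcher built from the substrings in the table above. The function name `match_ai_bot` and the sample user-agent strings are illustrative assumptions, not taken from any vendor documentation, and the matcher cannot see Google-Extended or Applebot-Extended because they have no distinct HTTP user-agent.

```python
import re
from typing import Optional

# Substrings from the user-agent table; version suffixes are deliberately left out
# so that GPTBot/1.0, GPTBot/1.1, and any future bump all match.
AI_BOT_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
    "Claude-Web", "PerplexityBot", "Perplexity-User", "Bytespider",
    "meta-externalagent", "Amazonbot", "CCBot", "Diffbot", "YouBot",
    "cohere-ai", "ImagesiftBot",
]

_CANONICAL = {token.lower(): token for token in AI_BOT_TOKENS}
_PATTERN = re.compile("|".join(re.escape(t) for t in AI_BOT_TOKENS), re.IGNORECASE)


def match_ai_bot(user_agent: str) -> Optional[str]:
    """Return the canonical crawler token found in a User-Agent header, or None."""
    found = _PATTERN.search(user_agent)
    return _CANONICAL[found.group(0).lower()] if found else None


if __name__ == "__main__":
    # Hypothetical sample header for illustration only.
    print(match_ai_bot("Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"))
```

Matching is case-insensitive because the table shows the same crawler appearing as both `ClaudeBot/1.0` and `claudebot`. A match here only says what the request claims to be; IP verification is still a separate step.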
Crawler ownership and parent company
| Crawler | Parent company | Training model fed | Headquartered |
| --- | --- | --- | --- |
| GPTBot / ChatGPT-User / OAI-SearchBot | OpenAI | GPT-4, GPT-5, future models | San Francisco, US |
| ClaudeBot family | Anthropic | Claude 3, Claude 4, Claude 5 | San Francisco, US |
| PerplexityBot / Perplexity-User | Perplexity AI | Perplexity Sonar / hosted Llama, Claude | San Francisco, US |
| Google-Extended | Google / Alphabet | Gemini, Vertex AI, Bard legacy | Mountain View, US |
| Applebot-Extended | Apple | Apple Intelligence, on-device foundation models | Cupertino, US |
| Bytespider | ByteDance | Doubao, internal LLMs | Beijing, CN |
| Meta-ExternalAgent | Meta | Llama 3, Llama 4, Meta AI | Menlo Park, US |
| Amazonbot | Amazon | Alexa LLM, internal Bedrock training | Seattle, US |
| CCBot | Common Crawl Foundation | Public Common Crawl dataset (used by most labs) | San Francisco, US (nonprofit) |
| Cohere-AI | Cohere | Command R, Command R+ | Toronto, CA |
| YouBot | You.com | YouChat / hosted models | Palo Alto, US |
Approximate AI bot traffic share
The Cloudflare Radar AI Insights dashboard [5] publishes share-of-traffic numbers for the AI bot fleet that move quarter to quarter. The numbers below are directional, drawn from publicly accessible Cloudflare Radar snapshots, Vercel's own AI bot writeup [23], and the Attrifast customer aggregate. Treat them as estimates with double-digit error bars, not as canonical facts.
| Crawler | Approx share of AI bot traffic (2025-2026) | Trend Q1 2026 vs Q1 2025 |
| --- | --- | --- |
| GPTBot | ~25-30% | Slowly growing |
| ClaudeBot | ~15-18% | Rapidly growing |
| Bytespider | ~10-14% | Volatile, regional |
| PerplexityBot | ~9-12% | Rapidly growing |
| Google-Extended | ~8-11% | Stable |
| Amazonbot | ~7-9% | Stable |
| Meta-ExternalAgent | ~5-7% | Growing |
| CCBot | ~3-5% | Stable |
| Applebot-Extended | ~2-4% | New, growing |
| All others | ~3-6% | Mixed |
Two things to call out. ClaudeBot's growth is faster than that of any other AI crawler in this dataset; it tracks closely with Claude's expanding API customer base, which makes sense if you assume the crawl pipeline is sized to model usage. Perplexity's crawler has been reported by independent observers to be much more aggressive than its respect-for-robots policy implies; multiple site operators have reported PerplexityBot ignoring Disallow directives, which Perplexity has publicly denied [16]. The honest read is that the field data is mixed and you should verify on your own logs before drawing conclusions.
Crawler vs human AI-referred visit — what's the actual difference
This is the distinction operators get wrong most often. A GPTBot crawl is not a citation. A ChatGPT-User fetch is not a click. The user who later arrives via a ChatGPT citation is a third, completely separate event, and the three live in different parts of your stack.
| Event type | What it is | Where it shows up | Counts as |
| --- | --- | --- | --- |
| GPTBot training crawl | Scheduled scrape for future training | Server logs with GPTBot/1.1 UA | Bot activity (exclude from traffic) |
| ChatGPT-User live fetch | On-demand fetch when user asks ChatGPT to read URL | Server logs with ChatGPT-User/1.0 UA | Bot activity (citation signal) |
| OAI-SearchBot index | Scheduled crawl for ChatGPT search index | Server logs with OAI-SearchBot UA | Bot activity (search-index signal) |
| Human click from ChatGPT (referer-passed) | Real human, browser hits page | Analytics with chatgpt.com referer | Human traffic (attribute to ChatGPT) |
| Human click from ChatGPT (referer-stripped) | Real human, browser hits page, no referer | Analytics as Direct/(none) | Human traffic (suspected ChatGPT) |
The clean mental model: bot hits and human hits are two different streams that should never be mixed in the same chart. When operators report "we got 4,000 visits from ChatGPT yesterday" and the number is actually 3,200 GPTBot crawls plus 800 humans, the resulting decisions (content plan, ad spend, board update) are all built on a misclassification.
Here is the dual-view pattern I now ship in every Attrifast install for the AI-engine view:
| View | What it shows | What it does not show |
| --- | --- | --- |
| Bot Activity | All AI crawler hits (GPTBot, ChatGPT-User, ClaudeBot, etc.) per URL per day | Human traffic of any kind |
| Traffic | Real human sessions, attributed to AI engines where possible | Bot hits of any kind |
| Citation Readiness | Correlation between bot activity (lead) and human AI-referrals (lag) | Causation (bot hits do not cause clicks; they precede them) |
The Citation Readiness view is the one that operationalizes the leading-indicator insight. When ChatGPT-User crawls on a page spike, the median lag to a human-referral spike on the same page is 18-72 hours in my data, with wide variance. Pages where the bot spike does not produce a corresponding human spike within 7 days are either (a) being fetched but not cited or (b) being cited in answers that do not generate clicks. Both are useful pieces of information but they mean different things for content strategy.
The decision tree above is the working version I run at the edge in Attrifast. It does three things in one pass: classifies the request, validates the UA against the published IP range, and routes the event into one of three downstream signals. The reverse-DNS step in node C is the one most server-log greps skip, and it is the source of most of the bad data.
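The three steps described here (classify, validate, route) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Attrifast implementation: the signal names, the `route_request` function, and the CIDR blocks (RFC 5737 documentation ranges) are all hypothetical, and it validates against an IP range list rather than performing the reverse-DNS lookup mentioned above.

```python
import ipaddress
from typing import Dict, Iterable, List

# Which downstream signal each verified crawler feeds (names are illustrative).
SIGNAL_BY_BOT = {
    "GPTBot": "training",
    "OAI-SearchBot": "search_index",
    "ChatGPT-User": "live_citation",
}


def ip_in_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """True if the address falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def route_request(user_agent: str, ip: str, ranges: Dict[str, List[str]]) -> str:
    """Classify by UA, validate the claimed identity by IP, then pick a signal."""
    ua = user_agent.lower()
    bot = next((name for name in SIGNAL_BY_BOT if name.lower() in ua), None)
    if bot is None:
        return "human_or_other"          # not claiming to be a tracked crawler
    if not ip_in_ranges(ip, ranges.get(bot, [])):
        return "unverified_bot"          # UA claim without a matching IP range
    return SIGNAL_BY_BOT[bot]


if __name__ == "__main__":
    # Placeholder ranges; in practice these come from each vendor's published list.
    sample_ranges = {"GPTBot": ["192.0.2.0/24"], "ChatGPT-User": ["198.51.100.0/24"]}
    print(route_request("GPTBot/1.1", "203.0.113.5", sample_ranges))
```

Keeping `unverified_bot` as its own bucket is the design choice that matters: spoofed hits stay visible for rate-limiting without contaminating any of the three signals.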
How to detect AI crawlers in your logs (with grep examples)
You can get 80% of the value with five minutes and a grep command. The 20% that needs more work is the IP verification and the suspected-AI behavioral inference. Both are covered below.
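A grep for the AI user-agent substrings, piped through awk and sort, gives you a sorted list of (IP, path, status, UA-fragment) tuples for AI bot hits. In the default nginx combined log format, the awk columns are $1 (remote IP), $7 (request path), $9 (HTTP status), and $12 onward (User-Agent; the exact column depends on your log format).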
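The same (IP, path, status, UA-fragment) extraction can be sketched in Python for cases where awk column positions are unreliable. This assumes the nginx/Apache "combined" log format; the regex, the function name `ai_bot_hits`, and the shortened crawler list are my own illustration, so adjust them to your actual log layout.

```python
import re
from collections import Counter

# Combined log format: ip - user [time] "METHOD path PROTO" status bytes "referer" "ua"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"\S+ (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)
AI_UA_RE = re.compile(
    r"GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|PerplexityBot|"
    r"Perplexity-User|Bytespider|CCBot",
    re.IGNORECASE,
)


def ai_bot_hits(lines):
    """Count AI bot hits per (ip, path, status, ua_fragment), most frequent first."""
    counts = Counter()
    for line in lines:
        parsed = LOG_RE.match(line)
        if not parsed:
            continue  # skip lines that are not in combined format
        bot = AI_UA_RE.search(parsed["ua"])
        if bot:
            counts[(parsed["ip"], parsed["path"], parsed["status"], bot.group(0))] += 1
    return counts.most_common()


if __name__ == "__main__":
    with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
        for key, hits in ai_bot_hits(log)[:20]:
            print(hits, *key)
```

Parsing the quoted fields explicitly avoids the usual awk failure mode where a space inside the request line or referer shifts every later column.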
A request claiming GPTBot/1.1 from an IP not in OpenAI's published range is spoofed. The verification:
# Pull OpenAI's published GPTBot IP ranges
curl -s https://openai.com/gptbot.json > /tmp/gptbot-ranges.json
# Extract IPs that hit your site claiming to be GPTBot
grep "GPTBot/" /var/log/nginx/access.log | awk '{print $1}' | sort -u > /tmp/observed-gptbot-ips.txt
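# Compare (you'll need a CIDR-match tool like grepcidr for the full check).
# This prints the observed IPs inside the published ranges (verified);
# add -v to list the ones outside them (spoofed).
grepcidr -f <(jq -r '.prefixes[].ipv4Prefix' /tmp/gptbot-ranges.json) /tmp/observed-gptbot-ips.txt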
Across the sites I audit, between 5% and 15% of IPs claiming to be GPTBot fall outside OpenAI's published ranges. Those are spoofed. Treat them as you would any other unverified scraper — log them, rate-limit them, or block them, but do not count them as OpenAI activity.
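If grepcidr is not available, the same comparison can be done with Python's standard `ipaddress` module. The sketch below assumes the ranges file has the `prefixes[].ipv4Prefix` shape that the jq filter above relies on; the function names and the sample addresses (RFC 5737 documentation ranges) are illustrative.

```python
import ipaddress
import json


def load_prefixes(ranges_json: str):
    """Parse a document shaped like {"prefixes": [{"ipv4Prefix": "x.x.x.x/nn"}, ...]}."""
    data = json.loads(ranges_json)
    return [
        ipaddress.ip_network(entry["ipv4Prefix"])
        for entry in data["prefixes"]
        if "ipv4Prefix" in entry
    ]


def split_verified(observed_ips, networks):
    """Split observed IPs into (inside published ranges, outside published ranges)."""
    verified, spoofed = [], []
    for ip in observed_ips:
        addr = ipaddress.ip_address(ip)
        bucket = verified if any(addr in net for net in networks) else spoofed
        bucket.append(ip)
    return verified, spoofed


if __name__ == "__main__":
    # Hypothetical sample data standing in for the downloaded ranges file.
    sample = '{"prefixes": [{"ipv4Prefix": "192.0.2.0/28"}, {"ipv4Prefix": "198.51.100.0/24"}]}'
    ok, bad = split_verified(["192.0.2.5", "203.0.113.1"], load_prefixes(sample))
    print(f"verified={ok} spoofed={bad} spoof_share={len(bad) / (len(ok) + len(bad)):.0%}")
```

The spoof share computed this way is per unique IP, which matches how the 5-15% figure above is framed; weighting by hit count will give a different number.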
Same for Google-Extended
Google-Extended is harder because it shares Googlebot's user-agent. The differentiating signal is the robots.txt token. You cannot detect Google-Extended in your access logs directly — you can only see all Googlebot hits, and you control whether they get fed into Vertex / Gemini training via the User-agent: Google-Extended directive in robots.txt [13].
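# Verify Googlebot hits are real Googlebot via reverse DNS plus forward confirmation
for ip in $(grep "Googlebot" /var/log/nginx/access.log | awk '{print $1}' | sort -u); do
  host=$(dig +short -x "$ip" | head -n 1)
  # dig returns a fully qualified name with a trailing dot. Anchor the suffix so
  # that a name like crawl.googlebot.com.evil.example does not pass the check.
  # The forward lookup confirms the PTR name resolves back to the same IP
  # (A records only; query AAAA as well if you see IPv6 clients).
  if [[ "$host" == *.googlebot.com. || "$host" == *.google.com. ]] \
     && dig +short "$host" | grep -qxF "$ip"; then
    echo "$ip VERIFIED"
  else
    echo "$ip SPOOFED"
  fi
done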
Cloudflare and Vercel: log fields that help
If you sit behind Cloudflare, the cf-verified-bot flag (available in Cloudflare's Logpush) tells you whether Cloudflare has verified the bot's identity. This saves you the reverse-DNS dance for the bots Cloudflare knows about. Vercel logs include x-vercel-forwarded-for and standard request headers; for AI bot detection there, the simplest path is to inspect User-Agent in middleware and write the classification to a custom log line. Vercel published a detailed pattern for this in late 2024 [23].
robots.txt vs WAF blocking vs Cloudflare AI Bot Block
Three layers of bot blocking, three different threat models. Most operators conflate them, and the resulting block is either too weak (robots.txt only, against a crawler that ignores it) or too strong (WAF block against good actors who would have respected a polite request).
| Layer | What it does | Enforcement | Best for |
| --- | --- | --- | --- |
| robots.txt directive | Polite request; well-behaved crawlers respect it | None: it is a request, not a block | Bots that publicly commit to respecting robots.txt (GPTBot, ClaudeBot, Google-Extended) |
| Server / WAF rule | Returns 403 / 429 / blocks at edge based on UA or IP | Technical block (UA-only is bypassable; IP+UA is stronger) | Bots that ignore robots.txt or scrapers spoofing AI UAs |
| Cloudflare AI Bot Block (one-click) | Blocks Cloudflare's curated AI bot list at edge | Technical block, kept updated by Cloudflare | Sites that want zero AI training exposure with minimal config |
| Cloudflare Bot Management | ML-based bot scoring with custom rules | Technical block, plus heuristic detection of unknown bots | Enterprise sites with sophisticated bot threat |
| TDMRep / ai-content-usage signaling | Standardized opt-out signal | None: depends on crawler honoring the signal | EU compliance and forward-looking opt-out |
A copy-pasteable robots.txt for "allow everything, log everything" (recommended default for SaaS)
A copy-pasteable robots.txt for "block all AI training, allow live browse" (for paid publishers)
# Block AI training crawlers. Allow user-triggered live browse so brand remains reachable.
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Cohere-AI
Disallow: /
User-agent: Diffbot
Disallow: /
# Allow live-browse user-triggered fetches so users asking ChatGPT/Claude/Perplexity
# to read a specific URL can still reach the page.
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
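Before deploying a policy like this, it is worth checking that the file actually says what you think it says. A minimal sketch using Python's standard `urllib.robotparser` follows; the abbreviated `ROBOTS_TXT` and the `allowed` helper are illustrative, and Python's parser is only an approximation of how any given crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the "block training, allow live browse" policy.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask Python's robots.txt parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ClaudeBot", "CCBot", "ChatGPT-User", "SomeOtherBot"):
        verdict = "allow" if allowed(ROBOTS_TXT, agent, "https://yoursite.com/pricing") else "block"
        print(f"{agent:14s} {verdict}")
```

Running the same check in CI against the deployed file catches the common regression where a later edit to the catch-all group silently changes what a named crawler is allowed to do.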
nginx WAF block for crawlers that ignore robots.txt
# /etc/nginx/conf.d/ai-bot-block.conf
# Use only if you have documented evidence a specific crawler is ignoring robots.txt.
map $http_user_agent $is_blocked_ai_bot {
    default            0;
    "~*GPTBot"         0;  # respect robots.txt; do not block at WAF
    "~*Bytespider"     1;  # block if you have a specific complaint
    "~*CCBot"          0;  # respect robots.txt
    "~*Diffbot"        0;
    "~*ImagesiftBot"   1;  # image scraping; block if not relevant
}
server {
    if ($is_blocked_ai_bot = 1) {
        return 403;
    }
    # ... rest of server config
}
A comparable Cloudflare WAF custom rule expression:
(http.user_agent contains "Bytespider")
or (http.user_agent contains "ImagesiftBot")
or (http.user_agent contains "Diffbot" and not http.user_agent contains "diffbot.com/help")
Action: Block (or Managed Challenge for soft-block).
Cloudflare shipped a "Block AI Bots" toggle in mid-2024 [22] that sits in the Security → Bots panel and blocks Cloudflare's curated list of AI training crawlers without requiring you to maintain WAF rules. Adoption passed a million domains by mid-2025. The tradeoff is the same as the manual block: you lose training-corpus exposure for the brand. The one-click toggle does not block live-browse agents (ChatGPT-User, Claude-Web, Perplexity-User) by default; that requires a separate rule.
| Method | Effort | Maintenance | Bypass risk | Coverage |
| --- | --- | --- | --- | --- |
| robots.txt | 5 min | Low | High (bad-actor bots ignore) | Good crawlers only |
| nginx UA block | 30 min | Medium | Medium (UA-only, IP rotation bypasses) | UA-matched bots only |
| Cloudflare one-click AI bot block | 1 min | Zero | Low | Cloudflare's curated list |
| Cloudflare Bot Management ML | Enterprise plan | Zero | Very low | Known + unknown bots |
| AWS WAF + custom rules | 1-4 hr | Medium | Low | Whatever you write |
| IP-range block (verified) | 1-2 hr setup | Quarterly refresh | Very low | Bots with published ranges |
| ai-content-usage header (future) | 10 min | Low | Honored by major crawlers (eventually) | Crawlers that adopt standard |
Should you block AI crawlers? (honest tradeoffs by site type)
This is the section that has eaten the most of my Discord and email time in 2026. Operators want a clean yes/no. The honest answer is "it depends on your business model and you should not block as a default." The decision tree by site type:
| Site type | Block GPTBot? | Block ChatGPT-User? | Block CCBot? | Reasoning |
| --- | --- | --- | --- | --- |
| Bootstrapped B2B SaaS | No | No | No | Need brand presence in trained corpus; ChatGPT citations drive measurable revenue |
| DTC ecommerce | No | No | No | Product discovery via AI engines is rising; blocking is a future-loss bet |
| Developer tools / OSS | No | No | No | Devs use ChatGPT for technical Q&A; being in the trained corpus is high-leverage |
| Content publisher (ad-supported) | Soft no (allow most) | No | Maybe block | Some publishers block CCBot but allow live-browse; tradeoff is volume vs control |
| Content publisher (paywalled) | Yes (block) | Allow free preview only | Yes (block) | Direct paywall conflict; OpenAI partnerships are licensing-deal terrain, not free crawl |
| News organization | Yes (block) | Maybe (allow free preview) | Yes (block) | NYT v. OpenAI precedent; license, do not give away |
| Local services / SMB | No | No | No | Negligible content value to AI labs; allow for brand discovery |
| Regulated healthcare / financial | No (with disclaimers) | No | No | YMYL content gets filtered by AI anyway; blocking does not help, instrumentation does |
| API documentation | No | No | No | Being in the trained corpus is the entire point of public API docs |
| Internal / authenticated content | N/A | N/A | N/A | Should already require auth; robots.txt is not the right defense |
The argument for not blocking, summarized in one paragraph: most "AI bot blocking" guides assume your content has high standalone value to an AI lab and that you can capture that value by withholding it. For 95% of SaaS and ecommerce sites, that is not true. The content has value to your business because it drives leads, brand awareness, and product evaluations. The AI crawler that ingests your pricing page and the AI engine that later cites your pricing page are part of the same funnel. Blocking the first does not stop the second; it just makes the second less informed about your brand.
The argument for blocking, summarized in one paragraph: if you are a paid publisher whose business model is the sale of access to your content, AI training-corpus inclusion is direct competitive substitution. A reader who gets your reporting summarized inside ChatGPT is a reader who did not subscribe. Blocking GPTBot and ClaudeBot is a defensive move that preserves the licensing-deal option. The cost is loss of trained-corpus brand presence; the benefit is preserving the underlying commercial model.
The argument that nobody makes but should: for most sites the blocking decision is a two-way door. Allowing crawlers now and deciding later to block costs you nothing going forward, because crawlers pick up the new robots.txt within days. What has already been ingested stays ingested, which matters mainly if you expect to negotiate a content license. The asymmetry argues for "allow by default, block when you have a specific reason."
Two case patterns
Case one: a developer tools company I work with that blocked GPTBot in early 2024 "just to be safe." Twelve months later they noticed their ChatGPT citation rate was declining relative to peers in the same category. They unblocked. Citation rate recovered within ~30 days as the new crawls fed the next training cycle. The 12-month "protection" did not protect anything; it just put them behind on brand presence inside model knowledge.
Case two: a paid publisher who left GPTBot allowed through 2024. When OpenAI's publisher-partnership program offered to license content for direct citation rights, the publisher had no leverage — their content was already in the training corpus. They blocked retroactively, but a year of crawls had already been ingested. The retrospective lesson: if you ever plan to negotiate a content license with an AI lab, the leverage starts on day one with the block, not on day 365 with the negotiation.
Measuring crawl frequency as a citation-readiness signal
This is the part of the playbook I have not seen anyone else describe in detail, and it is where AI crawl logging pays off most.
The basic insight: AI crawler traffic on a specific URL has predictive signal for whether the URL is about to start receiving human AI-referred traffic. Not perfect signal — many bot hits do not produce citations, and many citations come from URLs that have not been freshly crawled — but a real, exploitable pattern with consistent shape.
The four-signal model I run on every Attrifast install
| Signal | Source | Lead time vs human traffic | Interpretation |
| --- | --- | --- | --- |
| GPTBot crawl frequency (rolling 7-day) | Server logs | 7-30 days | Training-corpus inclusion; slow-moving baseline |
| ChatGPT-User fetch frequency (rolling 24h) | Server logs | 18-72 hours | Live-citation activity; fast-moving real-time signal |
| OAI-SearchBot crawl frequency (rolling 7-day) | Server logs | 3-14 days | Search-index inclusion; medium-moving |
| PerplexityBot + Perplexity-User combined | Server logs | 12-48 hours | Perplexity-specific citation activity |
For each URL on the site I track all four. The Citation Readiness score for a URL is a weighted blend, with ChatGPT-User and Perplexity-User carrying the highest weight (because they are user-triggered and most strongly correlated with active citations).
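A weighted blend of this kind can be sketched as follows. The weights, the saturation counts used for normalization, and the signal key names are all assumptions chosen for illustration; the only thing taken from the text is that the user-triggered signals carry the most weight.

```python
# Hypothetical weights: user-triggered signals dominate, training crawl is a baseline.
WEIGHTS = {
    "chatgpt_user_24h": 0.4,
    "perplexity_24h": 0.3,
    "oai_searchbot_7d": 0.2,
    "gptbot_7d": 0.1,
}

# Hypothetical hit counts at which each signal is treated as "fully on".
SATURATION = {
    "chatgpt_user_24h": 10,
    "perplexity_24h": 10,
    "oai_searchbot_7d": 10,
    "gptbot_7d": 10,
}


def citation_readiness(counts: dict) -> float:
    """Blend per-signal hit counts for one URL into a score between 0 and 1."""
    score = 0.0
    for signal, weight in WEIGHTS.items():
        normalized = min(counts.get(signal, 0) / SATURATION[signal], 1.0)
        score += weight * normalized
    return score


if __name__ == "__main__":
    page = {"chatgpt_user_24h": 11, "perplexity_24h": 0, "oai_searchbot_7d": 2, "gptbot_7d": 3}
    print(f"{citation_readiness(page):.2f}")
```

Capping each signal at 1.0 keeps a single crawl burst from one bot from swamping the score; whether a linear cap is the right shape is a tuning question for your own data.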
A real pattern from one site
I cannot share absolute numbers but the shape is stable enough to describe. A bootstrapped SaaS site published a new commercial page on a Tuesday. The first GPTBot hit landed Wednesday. The first ChatGPT-User hit landed Friday afternoon, a single fetch from one IP. The following Monday, ChatGPT-User hits jumped to 11 in a single day and the first two human AI-referred sessions appeared; by Wednesday, human AI-referred traffic on that URL had reached 47 sessions. The first ChatGPT-User fetch led the first human session by three days, and the Monday bot spike led the Wednesday human peak by two.
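| Day | GPTBot hits | ChatGPT-User hits | Human AI-referred sessions |
| --- | --- | --- | --- |
| Tue (publish) | 0 | 0 | 0 |
| Wed | 1 | 0 | 0 |
| Thu | 0 | 0 | 0 |
| Fri | 0 | 1 | 0 |
| Sat | 1 | 0 | 0 |
| Sun | 0 | 0 | 0 |
| Mon | 0 | 11 | 2 |
| Tue | 1 | 18 | 14 |
| Wed | 0 | 22 | 47 |
| Thu | 2 | 19 | 38 |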
The bot-signal column does not always lead the human-signal column this cleanly. Variance is high. But over a population of pages and over a long enough window, the directional pattern holds: ChatGPT-User hits cluster a few days before human AI-referrals climb.
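One way to put a number on "a few days before" is to shift the bot series forward one day at a time and keep the shift with the highest Pearson correlation against the human series. The sketch below is a generic lagged-correlation helper, not the method used in Attrifast; `best_lag` and `max_lag` are illustrative names, and a ten-day series like the one above is far too short for the result to be more than directional.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_lag(bot_hits, human_sessions, max_lag=5):
    """Return (lag_days, r): the forward shift of bot hits that best matches humans."""
    best = (0, -1.0)
    for lag in range(max_lag + 1):
        xs = bot_hits[: len(bot_hits) - lag]   # bot hits on day d
        ys = human_sessions[lag:]              # human sessions on day d + lag
        if len(xs) < 3:
            break                              # too few overlapping days to correlate
        r = pearson(xs, ys)
        if r > best[1]:
            best = (lag, r)
    return best


if __name__ == "__main__":
    # Daily ChatGPT-User hits and human AI-referred sessions from the table above.
    chatgpt_user = [0, 0, 0, 1, 0, 0, 11, 18, 22, 19]
    humans = [0, 0, 0, 0, 0, 0, 2, 14, 47, 38]
    print(best_lag(chatgpt_user, humans))
```

Run per URL over several weeks of data, the distribution of best lags is more informative than any single page's value.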
Why this matters for content strategy
Two operational uses for the signal:
Use one: prioritize content investment by Citation Readiness. Pages with high ChatGPT-User activity and rising human AI-referrals are the pages that deserve next-round content investment (deeper FAQ, more schema, expanded comparison sections). Pages with high crawl activity and zero human conversion are pages that are being fetched but not cited — different problem, different fix (usually content quality or page structure).
Use two: catch broken citation pipelines fast. When a previously high-citation page suddenly drops in ChatGPT-User activity, it is usually one of three things: the page has been deprioritized in the model's index, the page is broken (404, 500, redirect loop the bot cannot follow), or a competitor has displaced you in the citation slot. All three are recoverable if caught in days, very expensive if caught in months.
AI crawler bandwidth and infra cost benchmarks
The "AI bots are eating my bandwidth" panic is mostly overstated, but it is real on a specific set of site shapes. Here are the benchmarks I have collected.
Typical AI bot bandwidth as % of total traffic
| Site type | AI bot bandwidth as % of total | Top contributor |
| --- | --- | --- |
| Small SaaS (under 1M pageviews/mo) | 0.1-0.8% | GPTBot |
| Mid SaaS (1M-10M pageviews/mo) | 0.3-1.5% | GPTBot |
| Large SaaS / content site (10M+) | 0.5-3% | GPTBot + ClaudeBot + Bytespider |
| DTC ecommerce (small-mid) | 0.4-2% | GPTBot + Amazonbot |
| DTC ecommerce (large catalog) | 1-5% | Bytespider + GPTBot |
| Content publisher | 1-4% | CCBot + GPTBot |
| Developer documentation | 2-8% | ClaudeBot + GPTBot |
| Job board / classifieds | 3-12% | Bytespider + GPTBot |
| Real estate / listings | 4-15% | Bytespider + Diffbot |
| Wikipedia-shaped reference | 5-20% | CCBot + GPTBot + Bytespider |
The bottom rows are where the bandwidth bills start to bite. Large catalog sites, listing sites with faceted filters, and reference-style content that has high per-page uniqueness all attract more aggressive AI crawler behavior because the per-page training value is higher.
Cost in concrete dollars (rough estimates)
For a site on Vercel Pro at $20/mo + $0.40/GB egress over the included allowance:
| Site profile | Monthly pageviews | AI bot share | AI bot extra GB/mo | AI bot extra $/mo |
| --- | --- | --- | --- | --- |
| Small SaaS | 200,000 | 0.5% | <1 GB | <$1 |
| Mid SaaS | 2,000,000 | 1% | ~5-10 GB | $2-4 |
| Large content site | 20,000,000 | 2% | ~100-200 GB | $40-80 |
| Job board | 5,000,000 | 8% | ~150-300 GB | $60-120 |
| Listings site | 10,000,000 | 12% | ~500-1000 GB | $200-400 |
For the bottom two rows the bandwidth cost starts mattering. For the top three rows it is rounding error.
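The arithmetic behind the table is simple enough to rerun with your own numbers. The sketch below assumes the AI bot share applies to pageviews and that you know an average transfer size per bot hit; `mb_per_hit` is my assumption (the table does not state one), and the $0.40/GB default is the overage rate quoted above.

```python
def ai_bot_egress_cost(pageviews, bot_share, mb_per_hit, usd_per_gb=0.40):
    """Estimate monthly AI bot egress as (gigabytes, dollars).

    pageviews   monthly pageviews for the site
    bot_share   AI bot share as a fraction, e.g. 0.01 for 1%
    mb_per_hit  assumed average transfer per bot hit in MB (not from the table)
    usd_per_gb  egress price over the included allowance
    """
    bot_hits = pageviews * bot_share
    gigabytes = bot_hits * mb_per_hit / 1000  # decimal GB
    return gigabytes, gigabytes * usd_per_gb


if __name__ == "__main__":
    for label, views, share in [("Mid SaaS", 2_000_000, 0.01), ("Listings site", 10_000_000, 0.12)]:
        gb, usd = ai_bot_egress_cost(views, share, mb_per_hit=0.5)
        print(f"{label}: {gb:.0f} GB, ${usd:.2f}")
```

With 0.25 to 0.5 MB per hit, the Mid SaaS row works out to 5 to 10 GB and $2 to $4, matching the table. The figure is sensitive to page weight, so measure your own average response size from logs before trusting the estimate.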
When AI crawlers actually create a problem
| Problem | Root cause | Fix |
| --- | --- | --- |
| Crawler hits uncached dynamic endpoint repeatedly | Catalog / search / faceted-filter URLs with no cache headers | Set Cache-Control: public, s-maxage=86400 on bot-safe pages |
| Crawler triggers expensive DB queries | Faceted filter URLs hitting search infrastructure | Rate-limit AI crawlers on filter endpoints; serve static fallback |
| Crawler downloads large media (PDF, images, video) | No Disallow on heavy media paths | robots.txt rule: User-agent: GPTBot followed by Disallow: /media/large/ |
| Crawler floods 404 on stale sitemap URLs | Old URLs in sitemap that have been removed | Refresh sitemap; ensure 410 not 404 on permanent removals |
| Bytespider hitting 100+ requests / minute | Bytespider's historic aggressive crawl pattern | Rate-limit per ASN at edge; or block entirely |
The single highest-leverage fix for AI crawler cost on most sites is aggressive caching on the URLs they hit most. The crawl pattern is highly cacheable — AI bots tend to refetch the same URL list on a schedule, and a 24-hour CDN cache absorbs nearly all of that load.
The 2026 IETF "ai-content-usage" preferences proposal
The mess of per-crawler robots.txt rules and ad hoc WAF blocks is what the IETF AI Preferences working group [27] is trying to fix. The proposal that has emerged through 2025-2026 is a single HTTP header and robots.txt directive — Content-Usage for the header and ai-content-usage in robots.txt — that lets a site signal consent or refusal for several distinct downstream uses of its content.
The use categories under discussion (subject to change):
| Use category | Description | Example crawler that consumes it |
| --- | --- | --- |
| train-ai | Use page content as training data for AI models | GPTBot, ClaudeBot, Google-Extended |
| train-genai | Use specifically for generative AI training | Subset of above |
| search | Include in search indexes | Googlebot, Bingbot, OAI-SearchBot |
| inference | Use for retrieval-augmented generation at inference time | ChatGPT-User, Perplexity-User, Claude-Web |
| summarize | Use for AI-generated summaries | Various |
| tdm-mining | EU TDM Reservation directive coverage | All AI training |
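The signal is intended to be respected by all participating crawlers, with the goal that operators can write a single ai-content-usage line in robots.txt instead of maintaining a dozen per-crawler User-agent blocks. The TDMRep standard [26] solves a closely related EU-jurisdiction problem: it expresses the text-and-data-mining rights reservation defined in the EU Copyright in the Digital Single Market Directive (2019/790), which the EU AI Act in turn requires model providers to respect.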
Example robots.txt directives under the proposal
# Refuse AI training, allow search and inference (live citation)
User-agent: *
ai-content-usage: search, inference
ai-content-usage-deny: train-ai, train-genai, tdm-mining
# Allow everything (recommended default for most SaaS / ecommerce)
User-agent: *
ai-content-usage: train-ai, train-genai, search, inference, summarize
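# Selective: allow search, inference, and train-genai on crawlable pages, and keep crawlers out of /paid/
# Repeated User-agent: * groups are merged by robots.txt parsers, so the rules are written as one group.
# Path-level scoping of usage categories (train-genai on /blog/ only) is not settled in the proposal.
User-agent: *
Disallow: /paid/
Allow: /blog/
ai-content-usage: train-genai, search, inference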
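To see how a consumer might read these lines, here is a minimal parsing sketch. It assumes the illustrative `ai-content-usage` and `ai-content-usage-deny` syntax used in the examples above, which is still subject to change, and it deliberately ignores User-agent grouping and path rules. The function name is hypothetical.

```python
def parse_usage_preferences(robots_txt: str) -> dict:
    """Collect allowed and denied use categories from the sketch syntax above.

    Assumption: directives look like 'ai-content-usage: a, b' and
    'ai-content-usage-deny: c, d'. User-agent groups are not tracked.
    """
    allow, deny = set(), set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        tokens = {t.strip() for t in value.split(",") if t.strip()}
        key = key.strip().lower()
        if key == "ai-content-usage":
            allow |= tokens
        elif key == "ai-content-usage-deny":
            deny |= tokens
    # An explicit deny wins over an allow for the same category.
    return {"allow": allow - deny, "deny": deny}


example = """\
# Refuse AI training, allow search and inference (live citation)
User-agent: *
ai-content-usage: search, inference
ai-content-usage-deny: train-ai, train-genai, tdm-mining
"""
prefs = parse_usage_preferences(example)
```

Letting deny win over allow is a design choice made here for safety; the proposal itself may resolve conflicts differently.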
Adoption status
| Crawler | Public position on ai-content-usage proposal |
| --- | --- |
| GPTBot | Public signal of intent to honor [1] |
| ClaudeBot | Public signal of intent to honor [12] |
| Google-Extended | Public signal of intent to honor [13] |
| PerplexityBot | Tentative; awaiting standardization [16] |
| Applebot-Extended | Signaled support [17] |
| Bytespider | No public position |
| Meta-ExternalAgent | Tentative |
| CCBot | Common Crawl is a downstream pipeline; the standard applies to consumers, not CC itself |
Until the proposal stabilizes (likely 2026-2027 timeline given the IETF process), the practical advice is to maintain per-crawler robots.txt directives as the primary control surface, with TDMRep tdm-reservation set for EU compliance, and ai-content-usage added as an additional polite-signal layer for forward compatibility.
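Because per-crawler directives remain the primary control surface, it can help to generate them from a single list rather than hand-editing a dozen blocks. The sketch below is one hedged way to do that; the crawler list is an illustrative selection of tokens named in this article, and the function name is hypothetical.

```python
# Illustrative selection of training-oriented crawler tokens named in this article.
TRAINING_CRAWLERS = [
    "GPTBot",
    "ClaudeBot",
    "Google-Extended",
    "Applebot-Extended",
    "CCBot",
    "Bytespider",
]


def robots_blocks(user_agents: list, disallow: str = "/") -> str:
    """Render one 'User-agent / Disallow' group per crawler, separated by blank lines."""
    blocks = [f"User-agent: {ua}\nDisallow: {disallow}" for ua in user_agents]
    return "\n\n".join(blocks) + "\n"


# Example: keep training crawlers out of a paid section only.
print(robots_blocks(TRAINING_CRAWLERS, disallow="/paid/"))
```

Keeping the list in one place also makes it easy to remove these groups once a single consolidated signal is widely honored.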
Common AI crawler tracking mistakes
Eight mistakes I see often enough to call them patterns.
Mistake 1: Treating bot hits as traffic. A 10x spike in GPTBot is not 10x more users. It is OpenAI's crawler doing scheduled work. Fix: separate Bot Activity from Traffic in every chart. Never mix.
Mistake 2: Trusting the User-Agent string without IP verification. On the sites I audit, between 5% and 15% of hits claiming to be GPTBot are spoofed. Fix: verify the source IP against OpenAI's published ranges, or use Cloudflare's cf-verified-bot flag.
Mistake 3: Blocking GPTBot believing it stops ChatGPT citations. It blocks training-corpus inclusion only. ChatGPT-User and OAI-SearchBot keep working. Fix: understand the three-pipeline model before blocking anything.
Mistake 4: Forgetting Google-Extended is configured by robots.txt token, not UA. Google-Extended hits show as Googlebot in your logs. You cannot detect them in access logs alone. Fix: control via User-agent: Google-Extended directive in robots.txt; do not expect a unique UA string.
Mistake 5: Confusing crawl frequency with citation frequency. A page that is crawled often is not necessarily cited often. Many crawls produce zero citations. Fix: instrument both signals separately and watch the correlation, not the count.
Mistake 6: Allowing Bytespider unbounded on a large-catalog site. Bytespider has historically been the most aggressive AI crawler on listing-shaped sites, sometimes hitting hundreds of requests per minute. Fix: rate-limit per ASN or block at WAF if the bandwidth bill is real.
Mistake 7: Setting a robots.txt block on /blog/ for AI bots. This is the inverse of the right move. Your blog is the most citation-friendly content you publish; blocking AI from your blog while allowing access to your pricing page makes the model less informed about your category content but still able to quote your prices. Fix: invert. If you must block, block /paid/ or /pricing/details/, not /blog/.
Mistake 8: Ignoring the human-AI-referral side entirely. Bot tracking without human tracking is a dashboard that never shows revenue. The crawler hits are the leading indicator; the human clicks and the Stripe webhook are the lagging signals that pay for the work. Fix: instrument both sides, in one view if possible.
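For Mistake 5, "watch the correlation, not the count" can be made concrete with a lagged Pearson correlation between daily crawler hits and daily human AI-referred sessions for the same URL. This is a minimal sketch under stated assumptions: the two series are equal-length daily counts you already collect, the lag is something you choose, and the numbers below are made up for illustration. A high correlation is a hint worth investigating, not proof of causation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def lagged_correlation(crawls, sessions, lag_days):
    """Correlate crawler hits on day t with human sessions on day t + lag_days."""
    if len(crawls) != len(sessions):
        raise ValueError("series must have the same length")
    if lag_days < 0 or lag_days >= len(crawls) - 1:
        raise ValueError("lag must leave at least two overlapping days")
    return pearson(crawls[: len(crawls) - lag_days], sessions[lag_days:])


# Made-up example: sessions echo crawl volume two days later.
daily_crawls = [1, 5, 2, 8, 3, 0, 0]
daily_sessions = [0, 0, 2, 10, 4, 16, 6]
r = lagged_correlation(daily_crawls, daily_sessions, lag_days=2)
```

Trying a few lags (0 to 7 days, say) and comparing the results is usually more informative than fixing one lag up front.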
What this looks like in practice on Attrifast
A short note on the product because the article cannot pretend the author has no interest. Attrifast tracks AI crawler hits per URL per day in the same dashboard where it shows human AI-referred sessions and the Stripe revenue join. The crawler view is grouped by engine (OpenAI, Anthropic, Perplexity, Google, Apple, Meta, ByteDance) with per-bot breakdowns inside each engine. The human view is grouped by AI source domain (chatgpt.com, perplexity.ai, claude.ai, gemini.google.com) with referer-passed and referer-stripped separated. The Citation Readiness signal correlates the two over time.
The four pieces the product owns:
Edge detection of AI bot hits (UA matching + IP verification)
First-party detection of human AI-referrals (referer fingerprinting + behavioral inference)
Stripe webhook join from session to payment
A single dashboard showing all of the above with per-URL drill-downs
The first-person reason I built it is that I was the operator in 2024 staring at GPTBot in my access logs, wondering whether the crawl spike on my pricing page meant anything, and finding that nothing in my analytics stack would tell me whether the human traffic 48 hours later was related. Now it does.
Limitations
Five things this article does not cover, and you should not extrapolate past.
Voice-mode AI citations. When ChatGPT voice or Gemini voice answers a query and never renders a clickable URL, no fetch is made and no citation is logged in the way this article describes. The brand mention happens; the crawler signal does not. No reliable measurement story exists for voice-mode citation behavior in 2026.
Enterprise tenants. ChatGPT Enterprise, Claude for Work, and Microsoft Copilot for organizations may use different fetch behaviors from consumer surfaces. The crawler IPs and UAs are not always the same as the public ranges. Treat the consumer-surface measurements in this article as a lower bound for enterprise behavior.
MCP and tool-using agents. Model Context Protocol-based agents and tool-using agents from frameworks like LangChain, AutoGPT, and CrewAI sometimes fetch URLs directly via headless browsers with generic UAs. These are functionally AI-driven fetches but invisible to UA-based bot detection. The right detection is behavioral, not UA-based, and remains an open problem.
Region and language variance. Most of the percentages in this article are aggregated from US English sites. APAC and EU patterns differ — Bytespider is more dominant in APAC, ClaudeBot less so. Take the share percentages as directional, not canonical for your geography.
The "did blocking GPTBot help me get cited less?" counterfactual. I have anecdotes from individual sites but no controlled study. The cleanest available evidence is the developer-tools company case I described, where unblocking restored citation rate within ~30 days. Treat that as one data point, not as a quantified causal claim.
FAQ
What is the difference between GPTBot, ChatGPT-User, and OAI-SearchBot?
GPTBot is OpenAI's training crawler: it scrapes pages to add to future model training corpora and respects robots.txt. ChatGPT-User is the live browse agent that fetches a URL on demand when a user (or the model) asks ChatGPT to read a specific page; it generally honors robots.txt, but per OpenAI's docs those rules may not apply to fetches a user directly requests. OAI-SearchBot powers the ChatGPT search index that launched October 2024 and behaves more like a traditional search crawler. The three serve different jobs and a single robots.txt block on one does not block the others. Most sites should distinguish them in logs because a GPTBot spike is a training-corpus signal, a ChatGPT-User spike is a live-citation signal, and an OAI-SearchBot spike is a search-index signal.
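The three-way split can be sketched as a small lookup that tags each log line with the signal it represents. The signal labels follow the paragraph above; the function name is hypothetical, and the sample user-agent string in the usage note is illustrative rather than a guaranteed exact match for what OpenAI sends.

```python
from typing import Optional

# User-agent token -> the signal a spike in that bot represents (labels from the text above).
OPENAI_SIGNALS = {
    "GPTBot": "training-corpus",
    "ChatGPT-User": "live-citation",
    "OAI-SearchBot": "search-index",
}


def openai_signal(user_agent: str) -> Optional[str]:
    """Return the signal type for an OpenAI bot user-agent, or None for anything else."""
    ua = user_agent.lower()
    for token, signal in OPENAI_SIGNALS.items():
        if token.lower() in ua:
            return signal
    return None
```

For example, `openai_signal("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)")` returns `"training-corpus"`. Matching on the user-agent alone is only a first pass, since the string can be spoofed.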
Should I block GPTBot to protect my content from being used to train AI?
Probably not, but the honest answer depends on what you sell. Blocking GPTBot via robots.txt removes you from future training corpora. It does not block ChatGPT-User, so ChatGPT can still fetch and cite your pages live on user request. The net effect of blocking GPTBot is that the model's baseline knowledge of your brand slowly degrades while live-browse citations continue. For most SaaS and ecommerce sites this is the wrong trade because baseline brand recall in the model is what gets you cited for queries the user asks without browse mode active. Block GPTBot only if you have a specific legal, contractual, or content-monetization reason — for example, a paid publisher whose business model is sub-paywall content, or a brand under explicit DMCA pressure.
Does robots.txt actually stop AI crawlers from training on my site?
It stops the crawlers that respect it. GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and Applebot-Extended all publicly commit to respecting robots.txt and the historical evidence is that they do. It does not stop bad-actor scrapers, third-party crawlers selling data to AI labs, archived copies on the Wayback Machine, third-party citations that quote you verbatim, or content already in published training sets. Robots.txt is a polite request honored by the largest crawlers; it is not a technical enforcement mechanism. If you need enforcement, you need WAF rules, Cloudflare's AI bot blocking, or IP-level blocks at the edge — and even those have known bypass paths.
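To check what a compliant crawler should conclude from your robots.txt, Python's standard-library `urllib.robotparser` is a convenient approximation. The sketch below parses an inline example file; the robots.txt content and URL are illustrative, and real crawlers may implement matching slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block GPTBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given user-agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


gptbot_ok = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/blog/post")
chatgpt_user_ok = is_allowed(ROBOTS_TXT, "ChatGPT-User", "https://example.com/blog/post")
```

This only tells you what a polite crawler should do. It says nothing about scrapers that never read the file.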
How can I tell if an AI crawler hit is real or a spoofed user-agent?
Check the source IP against the ranges the crawler's operator publishes. Where the operator supports it (Googlebot does), you can also reverse-DNS the IP and confirm that a forward lookup of the returned hostname resolves back to the same IP. OpenAI publishes its IP ranges at openai.com/gptbot-ranges.json, Google publishes its crawler ranges at developers.google.com/search/apis/ipranges/googlebot.json plus a separate special-crawlers list, and Anthropic publishes ClaudeBot IPs in its documentation. A request claiming to be GPTBot from an IP outside OpenAI's published ranges is either spoofed or from a third-party crawler imitating the user-agent. In practice, 5-15% of traffic claiming to be a known AI bot on the sites I audit is spoofed, mostly from scrapers trying to evade rate limits. Always validate by IP, not by user-agent string alone.
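Validation by IP range can be sketched with the standard-library `ipaddress` module. The CIDR blocks below are placeholders taken from documentation-reserved address space, not real crawler ranges; in practice you would load the operator's published list. The function and variable names are hypothetical.

```python
import ipaddress

# Placeholder CIDRs from documentation-reserved space (192.0.2.0/24, 2001:db8::/32).
# These are NOT real crawler ranges; load the operator's published list instead.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24", "2001:db8::/32"],
}


def is_verified(claimed_bot: str, remote_ip: str, ranges: dict = PUBLISHED_RANGES) -> bool:
    """Return True only if remote_ip falls inside a range published for the claimed bot."""
    addr = ipaddress.ip_address(remote_ip)
    for cidr in ranges.get(claimed_bot, []):
        network = ipaddress.ip_network(cidr)
        # Compare only like with like (IPv4 against IPv4, IPv6 against IPv6).
        if addr.version == network.version and addr in network:
            return True
    return False
```

A hit whose user-agent says GPTBot but for which `is_verified` returns False belongs in a "claimed but unverified" bucket, not in your crawler counts.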
What is the cost in bandwidth of allowing AI crawlers?
Smaller than most operators fear and growing. Across the sites I monitor, AI crawlers together account for 0.5-4% of total bandwidth on content-heavy sites, with GPTBot the largest single share at typically 30-50% of AI-crawler bandwidth. The Cloudflare Radar AI bot dashboard pegs AI bot share of total bot traffic at roughly 4-6% in 2024-2025 with steady growth quarter over quarter. On a typical $20-50/mo VPS the marginal cost is negligible. On large static sites behind a CDN the cost is also negligible because the CDN absorbs most crawler load. The cost becomes real when AI crawlers hit uncached dynamic endpoints — large catalog sites, listing pages with faceted filters, search endpoints. For those, cache the AI-crawler responses aggressively or rate-limit per ASN.
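The per-ASN rate limit mentioned above is commonly implemented as a token bucket. The sketch below is one minimal version under stated assumptions: the ASN for each request is looked up elsewhere (that lookup is not shown), the class names and limits are hypothetical, and time is passed in explicitly so the behavior is easy to test. In production you would pass `time.monotonic()`.

```python
class TokenBucket:
    """Allow short bursts up to `burst`, then refill at `rate_per_min` tokens per minute."""

    def __init__(self, rate_per_min: float, burst: int):
        self.rate_per_sec = rate_per_min / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = None  # timestamp of the last call, in seconds

    def allow(self, now: float) -> bool:
        if self.updated is not None:
            elapsed = now - self.updated
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class AsnLimiter:
    """Keep one token bucket per ASN so one aggressive network cannot starve the others."""

    def __init__(self, rate_per_min: float = 60, burst: int = 10):
        self.rate_per_min = rate_per_min
        self.burst = burst
        self.buckets = {}

    def allow(self, asn: int, now: float) -> bool:
        if asn not in self.buckets:
            self.buckets[asn] = TokenBucket(self.rate_per_min, self.burst)
        return self.buckets[asn].allow(now)


# 64500 and 64501 are documentation-reserved ASNs, used here as stand-ins.
limiter = AsnLimiter(rate_per_min=60, burst=3)
first_five = [limiter.allow(64500, now=0.0) for _ in range(5)]
```

Requests that return False would typically get an HTTP 429 response, which well-behaved crawlers treat as a signal to slow down.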
Is there a single way to opt out of all AI training at once?
Not as of mid-2026, but the IETF AI Preferences working group's ai-content-usage proposal is the closest path. It defines a single HTTP header (Content-Usage) and a robots.txt directive (ai-content-usage) that signal consent or refusal for AI training, search-index inclusion, and other downstream uses. Major crawlers including Google-Extended, GPTBot, and ClaudeBot have signaled intent to honor the standard once it stabilizes. Until then you need per-crawler robots.txt blocks for each AI bot you want to exclude, plus WAF rules for crawlers that ignore robots.txt. The TDMRep (Text and Data Mining Reservation Protocol) standard offers a similar opt-out signal for EU jurisdictions, anchored in the text-and-data-mining rights reservation of the EU Copyright in the Digital Single Market Directive (2019/790), which the EU AI Act requires model providers to respect.
Does blocking GPTBot stop ChatGPT from citing me?
No, and this is the single most common misunderstanding I see in the bot-blocking discourse. ChatGPT cites pages via three pipelines: the trained corpus (informed by GPTBot crawls), the live browse fetch (ChatGPT-User), and the search index (OAI-SearchBot). Blocking GPTBot only stops the first. A user who asks ChatGPT to read your page, or who searches via ChatGPT's search interface, can still trigger a fetch and a citation. The user is the source of the request, not OpenAI. Blocking GPTBot makes you slightly less likely to be cited in answers the model produces from pure trained knowledge but does not remove you from active answers.
How do I tell an AI crawler hit from a human AI-referred visit in my logs?
Two different log signals. AI crawler hits show a known bot user-agent (GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, OAI-SearchBot) and usually no Referer header. Google-Extended is the exception: it is a robots.txt token with no user-agent of its own, so its fetches appear as Googlebot. Human AI-referred visits show a normal browser user-agent (Chrome, Safari, Firefox on a real OS) and either a Referer of chatgpt.com / perplexity.ai / claude.ai / gemini.google.com (15-20% of the time) or an empty Referer (the majority). The right shape for a log analyzer is two completely separate views: a Bot Activity view that includes all AI crawlers, and a Traffic view that excludes them. Mixing the two is how operators end up reporting 10x traffic spikes that are actually crawler bursts.
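The two-view split can be sketched as a classifier over the two headers the paragraph above relies on. The bucket names and function name are hypothetical, and the sketch has a known blind spot: human AI-referred visits with an empty Referer land in the generic traffic bucket, because headers alone cannot identify them.

```python
from urllib.parse import urlparse

# Bot tokens that appear in user-agent strings (Google-Extended has no UA, so it is absent).
AI_BOT_TOKENS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")
AI_REFERRER_HOSTS = {"chatgpt.com", "perplexity.ai", "claude.ai", "gemini.google.com"}


def classify(user_agent: str, referer: str = "") -> str:
    """Assign a request to the Bot Activity view or to one of the Traffic buckets."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in AI_BOT_TOKENS):
        return "bot_activity"
    host = (urlparse(referer).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in AI_REFERRER_HOSTS:
        return "ai_referred_human"
    # Includes referer-stripped AI visits, which need behavioral signals to separate.
    return "other_traffic"
```

Checking the bot tokens first matters: a crawler request should never be counted as traffic, even if it happens to carry a Referer header.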
What is the difference between PerplexityBot and Perplexity-User?
PerplexityBot is Perplexity's scheduled search-index crawler, similar in role to OAI-SearchBot for ChatGPT. Perplexity-User is the live fetch agent that retrieves a URL when a Perplexity user (or the Perplexity model on their behalf) requests it. Per Perplexity's published policy, PerplexityBot respects robots.txt; Perplexity-User does not, on the grounds that user-triggered fetches are equivalent to a user typing the URL into a browser. Multiple site operators have reported PerplexityBot apparently ignoring robots.txt directives despite the policy, which Perplexity has publicly denied. Verify on your own logs and use WAF blocks if you observe non-compliance.
How often should I refresh AI crawler IP ranges?
Quarterly is the safe cadence. OpenAI, Google, and Anthropic update their published IP ranges occasionally (every few months in practice) when they rotate infrastructure. If you rely on Cloudflare's cf-verified-bot flag for verification, you do not need to maintain your own copy of the IP ranges, because Cloudflare handles it. If you are running your own IP-allowlist-based verification on bare nginx or AWS WAF, set a calendar reminder to refresh the JSON files quarterly.
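If you keep your own copy of the range files, a staleness check is cheap insurance against a forgotten reminder. The sketch below assumes a JSON layout with a `prefixes` list of `ipv4Prefix` / `ipv6Prefix` entries, which is the shape Google's googlebot.json uses; other vendors' files may differ, so treat the parser as an assumption to verify. The function names and the 90-day threshold are illustrative.

```python
import json
from datetime import datetime, timedelta


def parse_prefixes(document: str) -> list:
    """Extract CIDR strings from a ranges file shaped like {'prefixes': [{'ipv4Prefix': ...}]}."""
    data = json.loads(document)
    cidrs = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            cidrs.append(cidr)
    return cidrs


def is_stale(fetched_at: datetime, now: datetime, max_age_days: int = 90) -> bool:
    """True when the local copy is older than the refresh cadence (quarterly by default)."""
    return now - fetched_at > timedelta(days=max_age_days)


# Placeholder document using documentation-reserved address space.
sample = '{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}'
cidrs = parse_prefixes(sample)
```

Running `is_stale` in a daily job and alerting when it returns True turns the calendar reminder into something that cannot be silently skipped.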
Does Common Crawl (CCBot) count as an AI crawler?
Sort of. CCBot itself is not an AI crawler — it is Common Crawl, a nonprofit that publishes a freely-available web archive. But the Common Crawl dataset is the upstream source for many AI training pipelines, including most academic LLMs, some Hugging Face models, and historically parts of OpenAI's GPT-3 training. Blocking CCBot is a way to opt out of being included in the public dataset that downstream AI labs draw from. It is broader-than-necessary if you only want to opt out of one specific lab; it is narrower-than-necessary if you want to opt out of all AI training (because the labs also run their own direct crawlers). Most paid-publisher blocking lists include CCBot for completeness.
What is llms.txt and how does it relate to robots.txt?
llms.txt is a small markdown file at your site root (analogous to sitemap.xml or humans.txt) that lists your most LLM-relevant pages with short descriptions. It is a positive opt-in signal — "here are the pages I want LLMs to read" — rather than the opt-out / restriction nature of robots.txt. Adoption is around 7% of public SaaS sites as of Q1 2026. ChatGPT, Perplexity, and Claude crawlers all read it when present. It does not replace robots.txt; it complements it. The 30-minute investment is one of the highest-leverage one-time AEO moves available.
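A minimal generator shows how small the file is. The sketch follows the general shape of the llms.txt proposal (a title, a short summary, then sections of links with descriptions); the section name, site details, and function name are illustrative assumptions.

```python
def build_llms_txt(site_name: str, summary: str, pages: list) -> str:
    """Render an llms.txt body: title, one-line summary, and a list of described links.

    `pages` is a list of (title, url, description) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", "", "## Key pages", ""]
    for title, url, description in pages:
        lines.append(f"- [{title}]({url}): {description}")
    return "\n".join(lines) + "\n"


# Hypothetical site content for illustration only.
llms_txt = build_llms_txt(
    "Example SaaS",
    "Revenue attribution for small software companies.",
    [
        ("Pricing", "https://example.com/pricing", "Plans and limits."),
        ("Docs", "https://example.com/docs", "Setup and API reference."),
    ],
)
```

Write the result to `/llms.txt` at the site root and regenerate it whenever the list of key pages changes.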
Can AI crawlers index pages behind a login wall?
No, not legitimately. Crawlers send unauthenticated requests, so a page that answers them with HTTP 401 or 403 gives them nothing to index, whether or not they honor robots.txt. If your pages require login, they should return 401 to unauthenticated requests, including AI bot requests, and the bots will skip them. The exception is if you accidentally expose authenticated content via a leaked URL with a session token in the query string, in which case the bot can index it and the leak is the underlying bug. If you see authenticated content appearing in ChatGPT, the most likely explanation is a URL-leak issue, not the bot defeating your auth.
What happens if a previously-crawled page is removed or returns 404?
Removed pages drop out of training corpora over subsequent training cycles (months to a year). They drop out of live citations faster — typically within days to weeks, as the AI engine's index refreshes and notices the 404. For permanent removals, return HTTP 410 Gone instead of 404; this is a clearer signal to crawlers that the URL is intentionally removed and will not return. For temporary outages, do not let 5xx errors persist on URLs you care about — AI crawlers may interpret repeated 5xx as low-quality signal and deprioritize the URL in future indexes.
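The 410-versus-404 distinction is easy to encode once you keep an explicit list of intentionally retired URLs. This is a minimal sketch; the path sets and function name are hypothetical, and a real application would hook this into its router or edge rules.

```python
# Hypothetical path sets; in practice these come from your CMS or routing table.
LIVE_PATHS = {"/", "/pricing", "/blog/ai-crawlers"}
RETIRED_PATHS = {"/blog/old-launch-post", "/legacy-plan"}


def status_for(path: str, live: set = LIVE_PATHS, retired: set = RETIRED_PATHS) -> int:
    """Return 200 for live pages, 410 for intentional removals, 404 for everything else."""
    if path in live:
        return 200
    if path in retired:
        # 410 Gone tells crawlers the removal is deliberate and permanent.
        return 410
    return 404
```

Remember to drop retired paths from the sitemap at the same time, so crawlers are not repeatedly pointed at URLs that now return 410.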