AI Crawler & Agent Tracking 2026: GPTBot, ClaudeBot, PerplexityBot Explained

Q: What is the difference between GPTBot, ChatGPT-User, and OAI-SearchBot?

GPTBot is OpenAI's training crawler: it scrapes pages to add to future model training corpora and respects robots.txt. ChatGPT-User is the live browse agent that fetches a URL on demand when a user (or the model) asks ChatGPT to read a specific page; per OpenAI's docs it respects robots.txt for crawls but not for direct user-initiated fetches. OAI-SearchBot powers the ChatGPT search index that launched in October 2024 and behaves more like a traditional search crawler. The three serve different jobs, and a robots.txt block on one does not block the others. Most sites should distinguish them in logs because a GPTBot spike is a training-corpus signal, a ChatGPT-User spike is a live-citation signal, and an OAI-SearchBot spike is a search-index signal.

Q: Should I block GPTBot to protect my content from being used to train AI?

Probably not, but the honest answer depends on what you sell. Blocking GPTBot via robots.txt removes you from future training corpora. It does not block ChatGPT-User, so ChatGPT can still fetch and cite your pages live on user request. The net effect of blocking GPTBot is that the model's baseline knowledge of your brand slowly degrades while live-browse citations continue. For most SaaS and ecommerce sites this is the wrong trade, because baseline brand recall in the model is what gets you cited for queries the user asks without browse mode active. Block GPTBot only if you have a specific legal, contractual, or content-monetization reason: for example, a paid publisher whose business model depends on paywalled content, or a brand under explicit DMCA pressure.

Q: Does robots.txt actually stop AI crawlers from training on my site?

It stops the crawlers that respect it. GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and Applebot-Extended all publicly commit to respecting robots.txt and the historical evidence is that they do. It does not stop bad-actor scrapers, third-party crawlers selling data to AI labs, archived copies on the Wayback Machine, third-party citations that quote you verbatim, or content already in published training sets. Robots.txt is a polite request honored by the largest crawlers; it is not a technical enforcement mechanism. If you need enforcement, you need WAF rules, Cloudflare's AI bot blocking, or IP-level blocks at the edge — and even those have known bypass paths.

Q: How can I tell if an AI crawler hit is real or a spoofed user-agent?

Check the source IP against the operator's published ranges; for crawlers verified by DNS, such as Googlebot, reverse-DNS the IP and confirm that the returned hostname resolves forward to the same IP. OpenAI publishes its GPTBot IP ranges at openai.com/gptbot.json, Google publishes its crawler ranges at developers.google.com/search/apis/ipranges/googlebot.json plus a separate special-crawlers list, and Anthropic publishes ClaudeBot IPs in its documentation. A request claiming to be GPTBot from an IP outside OpenAI's published ranges is either spoofed or from a third-party crawler imitating the user-agent. In practice, 5-15% of traffic claiming to be a known AI bot on the sites I audit is spoofed, mostly from scrapers trying to evade rate limits. Always validate by IP, not by user-agent string alone.

Q: What is the cost in bandwidth of allowing AI crawlers?

Smaller than most operators fear and growing. Across the sites I monitor, AI crawlers together account for 0.5-4% of total bandwidth on content-heavy sites, with GPTBot the largest single share at typically 30-50% of AI-crawler bandwidth. The Cloudflare Radar AI bot dashboard pegs AI bot share of total bot traffic at roughly 4-6% in 2024-2025 with steady growth quarter over quarter. On a typical $20-50/mo VPS the marginal cost is negligible. On large static sites behind a CDN the cost is also negligible because the CDN absorbs most crawler load. The cost becomes real when AI crawlers hit uncached dynamic endpoints — large catalog sites, listing pages with faceted filters, search endpoints. For those, cache the AI-crawler responses aggressively or rate-limit per ASN.

Q: Is there a single way to opt out of all AI training at once?

Not as of mid-2026, but the IETF AI Preferences working group's ai-content-usage proposal is the closest path. It defines a single HTTP header and robots.txt directive (Content-Usage and ai-content-usage) that signals consent or refusal for AI training, search-index inclusion, and other downstream uses. Major crawlers including Google-Extended, GPTBot, and ClaudeBot have signaled intent to honor the standard once it stabilizes. Until then you need per-crawler robots.txt blocks for each AI bot you want to exclude, plus WAF rules for crawlers that ignore robots.txt. The TDMRep (Text and Data Mining Reservation Protocol) standard offers a similar opt-out signal for EU jurisdictions; it is anchored in the text-and-data-mining rights reservation of the EU Copyright Directive, which the EU AI Act requires model providers to respect.

Q: Does blocking GPTBot stop ChatGPT from citing me?

No, and this is the single most common misunderstanding I see in the bot-blocking discourse. ChatGPT cites pages via three pipelines: the trained corpus (informed by GPTBot crawls), the live browse fetch (ChatGPT-User), and the search index (OAI-SearchBot). Blocking GPTBot only stops the first. A user who asks ChatGPT to read your page, or who searches via ChatGPT's search interface, can still trigger a fetch and a citation. The user is the source of the request, not OpenAI. Blocking GPTBot makes you slightly less likely to be cited in answers the model produces from pure trained knowledge but does not remove you from active answers.

Q: How do I tell an AI crawler hit from a human AI-referred visit in my logs?

Two different log signals. AI crawler hits show a known bot user-agent (GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, PerplexityBot) and usually no Referer header; Google-Extended is the exception, because it crawls under the standard Googlebot user-agent and cannot be told apart in logs. Human AI-referred visits show a normal browser user-agent (Chrome, Safari, Firefox on a real OS) and either a Referer of chatgpt.com / perplexity.ai / claude.ai / gemini.google.com (15-20% of the time) or an empty Referer (the majority). The right shape for a log analyzer is two completely separate views: a Bot Activity view that includes all AI crawlers, and a Traffic view that excludes them. Mixing the two is how operators end up reporting 10x traffic spikes that are actually crawler bursts.

Vincent Ruan, Founder, Attrifast · Updated May 26, 2026 · 22 min read

The 2026 field guide to AI crawler tracking — every user-agent, IP range, robots.txt directive, opt-out matrix, and how to tell crawls from human AI-referred visits.

TL;DR

AI crawler traffic is real, growing, and structurally separate from human AI-referred visits. The 2026 short list of crawlers you need to know is GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, anthropic-ai, claude-web, PerplexityBot, Perplexity-User, Google-Extended, Googlebot (for AI Overviews), Applebot-Extended, Bytespider, Meta-ExternalAgent, Amazonbot, CCBot, and a handful of smaller research crawlers.
Blocking GPTBot does not stop ChatGPT from citing you. It only removes you from future training corpora. ChatGPT-User and OAI-SearchBot keep working. Most "AI bot blocking" advice in 2026 is theatrical — selectively blocking is a legitimate strategy for paid publishers, but it is the wrong default for SaaS and ecommerce sites that need brand presence inside model knowledge.
AI bot share of total bot traffic sits at roughly 4-6% per Cloudflare Radar in 2024-2025, with GPTBot the single largest contributor. On most $20-50/mo VPS hosts the marginal bandwidth cost is negligible. The real cost is uncached dynamic endpoints on catalog and listing pages.
The IETF AI Preferences working group's ai-content-usage proposal is the path toward a single cross-crawler opt-out signal, similar in spirit to TDMRep. Major crawlers including Google-Extended, GPTBot, and ClaudeBot have signaled intent to honor it once it stabilizes.
5-15% of requests claiming to be a known AI bot are spoofed third-party scrapers. Always reverse-DNS verify against published IP ranges before treating a hit as a real GPTBot or ClaudeBot.
Crawl frequency is a leading indicator of citation readiness, but the click and the revenue are the lagging indicators that matter.

The most common question I get from founders since Q1 2026 is some variant of "should I block GPTBot?" The second most common is "are AI bots eating my bandwidth?" The third is "how do I know if any of this is working?" All three deserve real answers, and the real answers are stranger than the discourse suggests. Blocking GPTBot does not do what most people think it does. AI bot bandwidth is usually not the problem operators imagine it to be. And the question "is it working?" has a structural answer that involves both crawl logs and human-session attribution, not just one or the other.

This piece is the technical companion to the strategic AEO vs SEO in 2026 post and the analytics-side ChatGPT referral analytics guide. Where those two cover what to publish and how to measure human AI traffic, this one covers the crawlers themselves: every user-agent you need to know, how to detect them, when to block them, what each one actually does, and how to wire their crawl frequency into a citation-readiness signal. If you have read the practical track-ChatGPT-traffic playbook and the broader how to get cited by AI engines post, this is the deeper dive on the bot side of the same problem.

Cloudflare Radar AI bot traffic share by crawler in 2025: GPTBot ~28%, ClaudeBot ~17%, Bytespider ~13%, PerplexityBot ~11%, Google-Extended ~10%, Amazonbot ~8%, Meta-ExternalAgent ~6%, others ~7%

Quick Facts

Metric	Value	Source
AI bot share of total bot traffic (2024-2025)	~4-6%	Cloudflare Radar [5]
Documented OpenAI user-agents	3 (GPTBot, ChatGPT-User, OAI-SearchBot)	OpenAI bot docs [1]
Documented Anthropic user-agents	3 (ClaudeBot, anthropic-ai, claude-web)	Anthropic ClaudeBot docs [12]
Documented Perplexity user-agents	2 (PerplexityBot, Perplexity-User)	Perplexity docs [16]
Google AI training crawler	Google-Extended (shares Googlebot infra)	Google crawler docs [13]
Apple AI training opt-out crawler	Applebot-Extended	Apple Applebot docs [17]
ByteDance / TikTok AI crawler	Bytespider	Bytespider docs [18]
Meta AI crawler	Meta-ExternalAgent	Meta crawler docs [19]
Common Crawl bot (third-party AI training source)	CCBot	commoncrawl.org [20]
Cloudflare one-click AI bot block adoption	~1M+ domains as of mid-2025	Cloudflare blog [22]
robots.txt RFC reference (RFC 9309)	Standardized September 2022	IETF RFC 9309 [21]
llms.txt adoption (public SaaS, Q1 2026)	~7%	llmstxt.org [25]
Median spoofed-UA rate on AI bot hits	5-15%	Attrifast aggregate
AI Overviews appearance rate (US English)	13-15% of queries	Search Engine Land [10]

Two numbers do most of the work here. The 4-6% AI bot share of total bot traffic is the demand-side number — AI crawlers are real but still a small fraction of the bot fleet. The 5-15% spoofed-UA rate is the data-quality number — at least one in twenty hits claiming to be GPTBot is not actually OpenAI. If you skip the IP verification step you are reporting on phantom traffic.

Why AI crawler tracking matters in 2026

There are three reasons to track AI crawlers, and they are different enough that the right answer depends on which one you care about.

Reason one: crawl frequency is a leading indicator of citation readiness. When GPTBot, ChatGPT-User, or OAI-SearchBot crawl rates climb on a specific page, it is rarely random. It usually means OpenAI's pipeline considers the page worth indexing, fetching live in response to a user query, or surfacing in search. A page that was getting zero AI-bot hits per week and then jumps to a dozen ChatGPT-User hits in 48 hours is, in my experience, an extremely reliable signal that the page is being cited in answers to a query that is trending in real time. By the time the human traffic shows up in your analytics 24-72 hours later, the crawl-rate spike has already told you something is happening.

Reason two: bandwidth and infra cost accountability. On most sites this is a minor concern. On a few — large catalogs, faceted filter pages, search endpoints, infinite-scroll listing pages, image-heavy galleries — AI crawlers can hit thousands of unique URLs per day and chew through egress budget that you were not planning for. The Vercel team published a detailed breakdown of AI bot traffic patterns in late 2024 [23] showing that some of their largest customers were seeing AI crawler traffic exceed search-engine crawler traffic for the first time. If you are paying per GB or per request, knowing which bots are hitting which paths is the difference between a $40 surprise and a $4,000 surprise.

Reason three: content licensing and IP enforcement decisions. If you publish content that you actively monetize — a paid newsletter, a paywalled archive, a high-investment editorial product — the question of whether AI labs are ingesting it for training is a legitimate commercial concern. The New York Times v. OpenAI suit, the Reddit data licensing deals, the various publisher partnerships with OpenAI and Anthropic in 2024-2025 — all of these orbit the same question. You cannot make a sensible decision about which crawlers to allow, block, or negotiate with until you can see what is hitting your site and how often.

Most operators care about exactly one of those three reasons. The mistake I see most often is conflating them. "I want to block GPTBot because of bandwidth" usually does not survive five minutes of looking at actual numbers; "I want to track GPTBot because of citation readiness" is a measurement project, not a blocking project; "I want to negotiate with OpenAI because of licensing" is a different project again. Pick the reason first, then choose the action.

Goal	Right action	Wrong action
Maximize AI citations for SaaS / ecommerce	Allow GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended; instrument all of them	Block GPTBot to "protect content"
Bandwidth control on large catalog site	Cache aggressively, rate-limit per ASN, log per-bot bandwidth	Block all AI bots indiscriminately
Paywall enforcement on paid editorial	Block GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider; allow ChatGPT-User on free preview only	Block via robots.txt only and assume that is enforcement
Citation-readiness leading-indicator dashboard	Log all AI bot hits per URL per day, correlate to human AI-referral spikes 24-72h later	Treat bot hits as traffic in your KPI chart
Compliance with EU AI Act opt-out	Implement TDMRep tdm-reservation directive and ai-content-usage header	Rely on robots.txt alone

The AI crawlers you need to know

Below is the working reference table I keep in a spreadsheet and update roughly monthly. The list is not exhaustive (there is a long tail of academic, research, and one-off crawlers that accounts for the bottom 5% of AI bot traffic), but these are the ones that account for more than 95% of what you will see in production logs in 2026.

Crawler	Owner	Purpose	Respects robots.txt	Documented IP range
GPTBot	OpenAI	Training corpus crawler	Yes [1]	openai.com/gptbot.json
ChatGPT-User	OpenAI	Live browse / user-triggered fetch	Yes for crawls, no for direct user fetches [1]	openai.com/chatgpt-user.json
OAI-SearchBot	OpenAI	ChatGPT search index crawler	Yes [1]	openai.com/searchbot.json
ClaudeBot	Anthropic	Training corpus crawler	Yes [12]	Published in Anthropic docs
anthropic-ai (legacy)	Anthropic	Legacy training crawler	Yes [12]	Same as ClaudeBot
claude-web	Anthropic	Live browse user-triggered	Yes [12]	Same as ClaudeBot
PerplexityBot	Perplexity	Search index crawler	Yes [16]	Published in Perplexity docs
Perplexity-User	Perplexity	User-triggered live fetch	No (per Perplexity policy) [16]	Same as PerplexityBot
Google-Extended	Google	Training crawler for Gemini, Vertex AI	Yes [13]	Shares Googlebot IPs
Applebot-Extended	Apple	Training opt-out for Apple Intelligence	Yes [17]	Shares Applebot IPs
Bytespider	ByteDance	Training crawler (TikTok / Doubao)	Yes (post-2024) [18]	Published in Bytespider docs
Meta-ExternalAgent	Meta	Training crawler (Llama, Meta AI)	Yes [19]	Published in Meta crawler docs
Amazonbot	Amazon	Multi-purpose, partially AI training	Yes [24]	Published in Amazon docs
CCBot	Common Crawl	Open dataset used by most AI labs	Yes [20]	commoncrawl.org
Diffbot	Diffbot	Commercial crawler, AI-data buyer	Yes	Published in Diffbot docs
YouBot	You.com	AI search crawler	Yes	Published in You.com docs
Cohere-AI	Cohere	Training crawler	Yes	Published in Cohere docs
ImagesiftBot	Hive AI	Image training crawler	Yes	Published in Hive docs

The next table is the one I actually use to write log greps. Exact user-agent strings as they appear in the field as of May 2026.

Crawler	User-Agent substring to match
GPTBot	`GPTBot/1.1` (or earlier `GPTBot/1.0`)
ChatGPT-User	`ChatGPT-User/1.0` (also seen as `ChatGPT-User/2.0`)
OAI-SearchBot	`OAI-SearchBot/1.0`
ClaudeBot	`ClaudeBot/1.0` or `claudebot`
anthropic-ai	`anthropic-ai` (older legacy string)
claude-web	`Claude-Web/1.0`
PerplexityBot	`PerplexityBot/1.0`
Perplexity-User	`Perplexity-User/1.0`
Google-Extended	(UA same as Googlebot; differentiated by robots.txt directive only)
Applebot-Extended	(UA same as Applebot; differentiated by robots.txt directive only)
Bytespider	`Bytespider`
Meta-ExternalAgent	`meta-externalagent`
Amazonbot	`Amazonbot/0.1`
CCBot	`CCBot/2.0` (Common Crawl)
Diffbot	`Diffbot`
YouBot	`YouBot`
Cohere-AI	`cohere-ai`
ImagesiftBot	`ImagesiftBot`

Two important quirks in that table. First, Google-Extended and Applebot-Extended do not have unique user-agent strings. They share the standard Googlebot and Applebot UAs respectively. The way you "opt them out" is in robots.txt by writing a rule against Google-Extended or Applebot-Extended as a User-agent token — the crawler reads its own token from your robots.txt, even though the HTTP UA header is generic. This trips up log greps constantly. If you only grep by HTTP UA you cannot separate AI-training Googlebot from search-indexing Googlebot.

Second, the version suffixes drift. GPTBot was GPTBot/1.0 through most of 2024 and bumped to GPTBot/1.1 in late 2024. ChatGPT-User has been observed at both 1.0 and 2.0 in production logs. Your regex should match the prefix (GPTBot/), not the exact version.
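
A minimal Python sketch of that prefix-matching advice, assuming you classify user-agent strings after the log line has been parsed. The token list mirrors the table above; the function name `classify_user_agent` is hypothetical. Google-Extended and Applebot-Extended are left out on purpose, since they never appear in the HTTP User-Agent.

```python
import re
from typing import Optional

# Crawler tokens from the table above. Version suffixes are deliberately not matched,
# and matching is case-insensitive because some crawlers send lowercase tokens.
AI_BOT_PATTERN = re.compile(
    r"(GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|Claude-Web|"
    r"PerplexityBot|Perplexity-User|Bytespider|Meta-ExternalAgent|Amazonbot|"
    r"CCBot|Diffbot|YouBot|cohere-ai|ImagesiftBot)",
    re.IGNORECASE,
)


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the matched crawler token in lowercase, or None if no token matches."""
    match = AI_BOT_PATTERN.search(user_agent)
    return match.group(1).lower() if match else None
```

Because the version is ignored, a future bump in a crawler's version suffix would not require a regex change under this approach.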

Crawler ownership and parent company

Crawler	Parent company	Training model fed	Headquartered
GPTBot / ChatGPT-User / OAI-SearchBot	OpenAI	GPT-4, GPT-5, future models	San Francisco, US
ClaudeBot family	Anthropic	Claude 3, Claude 4, Claude 5	San Francisco, US
PerplexityBot / Perplexity-User	Perplexity AI	Perplexity Sonar / hosted Llama, Claude	San Francisco, US
Google-Extended	Google / Alphabet	Gemini, Vertex AI, Bard legacy	Mountain View, US
Applebot-Extended	Apple	Apple Intelligence, on-device foundation models	Cupertino, US
Bytespider	ByteDance	Doubao, internal LLMs	Beijing, CN
Meta-ExternalAgent	Meta	Llama 3, Llama 4, Meta AI	Menlo Park, US
Amazonbot	Amazon	Alexa LLM, internal Bedrock training	Seattle, US
CCBot	Common Crawl Foundation	Public Common Crawl dataset (used by most labs)	San Francisco, US (nonprofit)
Cohere-AI	Cohere	Command R, Command R+	Toronto, CA
YouBot	You.com	YouChat / hosted models	Palo Alto, US

Approximate AI bot traffic share

The Cloudflare Radar AI Insights dashboard [5] publishes share-of-traffic numbers for the AI bot fleet that move quarter to quarter. The numbers below are directional, drawn from publicly accessible Cloudflare Radar snapshots, Vercel's own AI bot writeup [23], and the Attrifast customer aggregate. Treat them as estimates with double-digit error bars, not as canonical facts.

Crawler	Approx share of AI bot traffic (2025-2026)	Trend Q1 2026 vs Q1 2025
GPTBot	~25-30%	Slowly growing
ClaudeBot	~15-18%	Rapidly growing
Bytespider	~10-14%	Volatile, regional
PerplexityBot	~9-12%	Rapidly growing
Google-Extended	~8-11%	Stable
Amazonbot	~7-9%	Stable
Meta-ExternalAgent	~5-7%	Growing
CCBot	~3-5%	Stable
Applebot-Extended	~2-4%	New, growing
All others	~3-6%	Mixed

Two things to call out. ClaudeBot's growth is faster than that of any other AI crawler in this dataset; it tracks closely with Claude's expanding API customer base, which makes sense if you assume the crawl pipeline is sized to model usage. Independent observers have reported Perplexity's crawler to be much more aggressive than its respect-for-robots policy implies: multiple site operators have reported PerplexityBot ignoring Disallow directives, which Perplexity has publicly denied [16]. The honest read is that the field data is mixed and you should verify on your own logs before drawing conclusions.

Crawler vs human AI-referred visit — what's the actual difference

This is the distinction operators get wrong most often. A GPTBot crawl is not a citation. A ChatGPT-User fetch is not a click. The user who later arrives via a ChatGPT citation is a third, completely separate event, and the three live in different parts of your stack.

Event type	What it is	Where it shows up	Counts as
GPTBot training crawl	Scheduled scrape for future training	Server logs with `GPTBot/1.1` UA	Bot activity (exclude from traffic)
ChatGPT-User live fetch	On-demand fetch when user asks ChatGPT to read URL	Server logs with `ChatGPT-User/1.0` UA	Bot activity (citation signal)
OAI-SearchBot index	Scheduled crawl for ChatGPT search index	Server logs with `OAI-SearchBot` UA	Bot activity (search-index signal)
Human click from ChatGPT (referer-passed)	Real human, browser hits page	Analytics with `chatgpt.com` referer	Human traffic (attribute to ChatGPT)
Human click from ChatGPT (referer-stripped)	Real human, browser hits page, no referer	Analytics as Direct/(none)	Human traffic (suspected ChatGPT)

The clean mental model: bot hits and human hits are two different streams that should never be mixed in the same chart. When operators report "we got 4,000 visits from ChatGPT yesterday" and the number is actually 3,200 GPTBot crawls plus 800 humans, the resulting decisions (content plan, ad spend, board update) are all built on a misclassification.
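
One way to keep the two streams apart is to label every request before it reaches a chart. The sketch below is an assumption about how such a labeller might look, not the Attrifast implementation: the function name, the label strings, and the shortened token list are all hypothetical, and the referrer hosts are the ones named earlier in this article.

```python
from urllib.parse import urlparse

# Shortened, illustrative token list; extend it from the user-agent table.
BOT_TOKENS = ("gptbot", "chatgpt-user", "oai-searchbot", "claudebot", "perplexitybot")
AI_REFERRER_HOSTS = ("chatgpt.com", "perplexity.ai", "claude.ai", "gemini.google.com")


def classify_event(user_agent: str, referer: str) -> str:
    """Assign a request to the bot stream or to one of the human streams."""
    ua = user_agent.lower()
    if any(token in ua for token in BOT_TOKENS):
        return "bot_activity"  # never counted as traffic
    host = urlparse(referer).hostname or ""
    if any(host == h or host.endswith("." + h) for h in AI_REFERRER_HOSTS):
        return "human_ai_referred"
    if not referer or referer == "-":
        return "human_direct"  # may include referer-stripped AI visits
    return "human_other"
```

Only the last three labels would feed a traffic KPI; `bot_activity` events belong in their own view.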

Here is the dual-view pattern I now ship in every Attrifast install for the AI-engine view:

View	What it shows	What it does not show
Bot Activity	All AI crawler hits (GPTBot, ChatGPT-User, ClaudeBot, etc.) per URL per day	Human traffic of any kind
Traffic	Real human sessions, attributed to AI engines where possible	Bot hits of any kind
Citation Readiness	Correlation between bot activity (lead) and human AI-referrals (lag)	Causation — bot hits do not cause clicks; they precede them

The Citation Readiness view is the one that operationalizes the leading-indicator insight. When ChatGPT-User fetches of a page spike, the lag to a human-referral spike on the same page typically falls between 18 and 72 hours in my data, with wide variance. Pages where the bot spike does not produce a corresponding human spike within 7 days are either (a) being fetched but not cited or (b) being cited in answers that do not generate clicks. Both are useful pieces of information but they mean different things for content strategy.
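
To make the lead/lag idea concrete, here is a rough sketch that estimates the lag for one URL with an unnormalized cross-correlation over daily counts. It assumes you already have two aligned daily series for the page; the function name and the sample numbers are invented for illustration and are not measurements. Daily buckets only resolve the lag to whole days, so hourly buckets would be needed to see an 18-hour lag.

```python
def best_lag_days(bot_hits, human_visits, max_lag=7):
    """Return the day offset at which bot hits best line up with later human visits."""

    def score(lag):
        # Pair each day's bot hits with the human visits `lag` days later.
        return sum(b * h for b, h in zip(bot_hits, human_visits[lag:]))

    return max(range(max_lag + 1), key=score)


# Invented example: a ChatGPT-User burst on day 2, human referrals peaking on day 4.
bot_hits = [0, 0, 12, 3, 0, 0, 0, 0, 0, 0]
human_visits = [1, 1, 1, 1, 9, 4, 1, 1, 1, 1]
lag = best_lag_days(bot_hits, human_visits)
```

A flat score across all lags would suggest the bot burst never turned into clicks, which corresponds to cases (a) and (b) above.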

The request-handling decision tree is the working version I run at the edge in Attrifast. It does three things in one pass: classifies the request, validates the UA against the published IP range, and routes the event into one of three downstream signals. The reverse-DNS step in node C is the one most server-log greps skip, and it is the source of most of the bad data.

How to detect AI crawlers in your logs (with grep examples)

You can get 80% of the value with five minutes and a grep command. The 20% that needs more work is the IP verification and the suspected-AI behavioral inference. Both are covered below.

The minimal log grep

For nginx logs in the default format:

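# -i because several crawlers (for example meta-externalagent) send lowercase tokens.
# Google-Extended and Applebot-Extended are omitted: they never appear in the HTTP User-Agent.
grep -iE "(GPTBot/|ChatGPT-User/|OAI-SearchBot|ClaudeBot|anthropic-ai|Claude-Web|PerplexityBot|Perplexity-User|Bytespider|Meta-ExternalAgent|Amazonbot|CCBot/|Diffbot|YouBot|cohere-ai|ImagesiftBot)" \
  /var/log/nginx/access.log \
  | awk -F'"' '{split($1, a, " "); split($2, r, " "); split($3, s, " "); print a[1], r[2], s[1], $6}' \
  | sort | uniq -c | sort -rn | head -50

That gives you a sorted list of (IP, path, status, User-Agent) tuples for AI bot hits. Splitting on double quotes keeps the whole User-Agent together as field $6; the remote IP, request path, and HTTP status are the first token of $1, the second token of $2, and the first token of $3. This assumes the default combined log format, so adjust the field numbers if your log_format differs.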

For Apache combined log format:

grep -iE "(GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|Claude-Web|PerplexityBot|Perplexity-User|Bytespider|Meta-ExternalAgent|Amazonbot|CCBot)" \
  /var/log/apache2/access.log \
  | awk -F'"' '{split($1, a, " "); print a[1], $2, $6}' \
  | sort | uniq -c | sort -rn | head -50

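Per-crawler hit counts in the last 24 hours

# Counts the current access.log, which covers roughly the last 24 hours only if logs rotate daily.
# -i catches lowercase tokens such as meta-externalagent; the *-Extended tokens never appear in logs.
for bot in "GPTBot/" "ChatGPT-User/" "OAI-SearchBot" "ClaudeBot" "PerplexityBot" "Bytespider" "Meta-ExternalAgent" "Amazonbot" "CCBot/" "Diffbot"; do
  count=$(grep -ci "$bot" /var/log/nginx/access.log)
  printf "%-25s %d\n" "$bot" "$count"
done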
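
If your logs do not rotate daily, a timestamp filter gives a true 24-hour window. The following is a sketch only, assuming the combined log format with a `[26/May/2026:10:15:32 +0000]` style timestamp and the User-Agent as the last quoted field; the function name and the shortened bot list are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Captures the bracketed timestamp and the last quoted field (the User-Agent).
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\].*"(?P<ua>[^"]*)"\s*$')
BOTS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "PerplexityBot")


def count_recent_bot_hits(lines, now, window=timedelta(hours=24)):
    """Count hits per crawler whose log timestamp falls within `window` before `now`."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if not (now - window <= ts <= now):
            continue
        ua = m.group("ua").lower()
        for bot in BOTS:
            if bot.lower() in ua:
                counts[bot] += 1
                break
    return counts
```

`now` must be timezone-aware (for example `datetime.now(timezone.utc)`), because the parsed log timestamps carry an offset.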

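Top paths hit by GPTBot in the last 7 days

# zcat -f reads rotated .gz files and uncompressed logs alike, including the current access.log.
# How many days this covers depends on your logrotate retention.
zcat -f /var/log/nginx/access.log* | grep "GPTBot/" \
  | awk '{print $7}' \
  | sort | uniq -c | sort -rn | head -30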
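
For the bandwidth question raised earlier, hit counts are less useful than bytes. As a rough sketch, the same logs can be totalled by response size per crawler. This assumes the combined log format, where the byte count follows the status code; the function name and the shortened bot list are hypothetical.

```python
import re
from collections import defaultdict

# Matches the status and body-bytes fields, then the final quoted User-Agent.
LOG_RE = re.compile(r'" (?P<status>\d{3}) (?P<bytes>\d+|-) .*"(?P<ua>[^"]*)"\s*$')
BOTS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Bytespider")


def bandwidth_by_bot(lines):
    """Sum response bytes per crawler token; everything else is grouped as 'other'."""
    totals = defaultdict(int)
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        sent = 0 if m.group("bytes") == "-" else int(m.group("bytes"))
        ua = m.group("ua").lower()
        bot = next((b for b in BOTS if b.lower() in ua), "other")
        totals[bot] += sent
    return dict(totals)
```

Dividing each crawler's total by the sum of all totals gives the share-of-bandwidth figure discussed in the FAQ, measured on your own traffic instead of an aggregate.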

Reverse-DNS verification (the step most ops skip)

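A request claiming GPTBot/1.1 from an IP outside OpenAI's published ranges is spoofed or a third-party imitation. For GPTBot the check is a CIDR match against the published JSON:

# Pull OpenAI's published GPTBot IP ranges
curl -s https://openai.com/gptbot.json > /tmp/gptbot-ranges.json

# Extract IPs that hit your site claiming to be GPTBot
grep "GPTBot/" /var/log/nginx/access.log | awk '{print $1}' | sort -u > /tmp/observed-gptbot-ips.txt

# Write the published IPv4 prefixes to a pattern file (requires jq and grepcidr)
jq -r '.prefixes[].ipv4Prefix' /tmp/gptbot-ranges.json > /tmp/gptbot-cidrs.txt

# IPs inside the published ranges (verified)
grepcidr -f /tmp/gptbot-cidrs.txt /tmp/observed-gptbot-ips.txt

# IPs outside the published ranges (spoofed or imitating the UA)
grepcidr -v -f /tmp/gptbot-cidrs.txt /tmp/observed-gptbot-ips.txt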

Across the sites I audit, between 5% and 15% of IPs claiming to be GPTBot fall outside OpenAI's published ranges. Those are spoofed. Treat them as you would any other unverified scraper — log them, rate-limit them, or block them, but do not count them as OpenAI activity.
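
If grepcidr is not available, the same range check can be sketched with Python's standard `ipaddress` module. This assumes the ranges file has the `prefixes[].ipv4Prefix` shape used in the jq command above. The function names are hypothetical, and the sample prefixes in the usage note are reserved documentation ranges, not OpenAI's.

```python
import ipaddress
import json


def load_prefixes(ranges_json: str):
    """Parse a ranges document shaped like {"prefixes": [{"ipv4Prefix": "..."}]}."""
    doc = json.loads(ranges_json)
    return [
        ipaddress.ip_network(entry["ipv4Prefix"])
        for entry in doc.get("prefixes", [])
        if "ipv4Prefix" in entry
    ]


def split_verified(observed_ips, networks):
    """Return (verified, spoofed) lists based on membership in the published ranges."""
    verified, spoofed = [], []
    for ip in observed_ips:
        addr = ipaddress.ip_address(ip)
        in_range = any(addr in net for net in networks)
        (verified if in_range else spoofed).append(ip)
    return verified, spoofed
```

For example, with a made-up ranges document containing only `192.0.2.0/24`, an observed IP of `192.0.2.44` lands in the verified list and `203.0.113.5` lands in the spoofed list.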

Same for Google-Extended

Google-Extended is harder because it shares Googlebot's user-agent. The differentiating signal is the robots.txt token. You cannot detect Google-Extended in your access logs directly — you can only see all Googlebot hits, and you control whether they get fed into Vertex / Gemini training via the User-agent: Google-Extended directive in robots.txt [13].

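# Verify Googlebot hits via reverse DNS plus forward confirmation (IPv4 only)
for ip in $(grep "Googlebot" /var/log/nginx/access.log | awk '{print $1}' | sort -u); do
  host=$(dig +short -x "$ip" | head -n1)
  host=${host%.}  # strip the trailing dot that dig appends
  # The PTR name must end in googlebot.com or google.com AND resolve back to the same IP;
  # a loose substring match would accept names like google.com.attacker.example.
  if [[ "$host" =~ \.(googlebot|google)\.com$ ]] && dig +short "$host" | grep -qxF "$ip"; then
    echo "$ip VERIFIED"
  else
    echo "$ip SPOOFED"
  fi
done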
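
The same check, forward-confirmed reverse DNS, can be sketched in Python so it is testable without network access. The resolver parameters are an assumption added for that purpose; by default the function uses the standard `socket` lookups, which cover IPv4 only. The function name is hypothetical and the allowed suffixes are the two used in the shell loop.

```python
import socket

ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS: the PTR name must be Google's and resolve back to ip."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        if not host.endswith(ALLOWED_SUFFIXES):
            return False  # the hostname does not belong to an allowed domain
        return ip in forward_lookup(host)  # must resolve back to the same IP
    except OSError:
        return False  # lookup failures (herror, gaierror) count as unverified
```

Matching on the hostname suffix instead of a substring is the design choice that matters here, since a PTR record is controlled by whoever owns the IP block and can be set to any name.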

Cloudflare and Vercel: log fields that help

If you sit behind Cloudflare, the cf-verified-bot flag (available in Cloudflare's Logpush) tells you whether Cloudflare has verified the bot's identity. This saves you the reverse-DNS dance for the bots Cloudflare knows about. Vercel logs include x-vercel-forwarded-for and standard request headers; for AI bot detection there, the simplest path is to inspect User-Agent in middleware and write the classification to a custom log line. Vercel published a detailed pattern for this in late 2024 [23].

robots.txt vs WAF blocking vs Cloudflare AI Bot Block

Three layers of bot blocking, three different threat models. Most operators conflate them, and the resulting block is either too weak (robots.txt only, against a crawler that ignores it) or too strong (WAF block against good actors who would have respected a polite request).

Layer	What it does	Enforcement	Best for
robots.txt directive	Polite request; well-behaved crawlers respect it	None — it is a request, not a block	Bots that publicly commit to respecting robots.txt (GPTBot, ClaudeBot, Google-Extended)
Server / WAF rule	Returns 403 / 429 / blocks at edge based on UA or IP	Technical block (UA-only is bypassable; IP+UA is stronger)	Bots that ignore robots.txt or scrapers spoofing AI UAs
Cloudflare AI Bot Block (one-click)	Blocks Cloudflare's curated AI bot list at edge	Technical block, kept updated by Cloudflare	Sites that want zero AI training exposure with minimal config
Cloudflare Bot Management	ML-based bot scoring with custom rules	Technical block, plus heuristic detection of unknown bots	Enterprise sites with sophisticated bot threat
TDMRep / ai-content-usage signaling	Standardized opt-out signal	None — depends on crawler honoring the signal	EU compliance and forward-looking opt-out

A copy-pasteable robots.txt for "allow everything, log everything" (recommended default for SaaS)

# Default: allow all AI crawlers. We measure and instrument; we do not block.
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: CCBot
Allow: /

User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml

A copy-pasteable robots.txt for "block all AI training, allow live browse" (for paid publishers)

# Block AI training crawlers. Allow user-triggered live browse so brand remains reachable.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Cohere-AI
Disallow: /

User-agent: Diffbot
Disallow: /

# Allow live-browse user-triggered fetches so users asking ChatGPT/Claude/Perplexity
# to read a specific URL can still reach the page.
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
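
Before deploying either file, it can help to check that the rules parse the way you intend. A small sketch using Python's standard `urllib.robotparser` follows, with a shortened version of the publisher file. The helper name and the `/pricing` path are hypothetical, and real crawlers run their own parsers, so treat this as a sanity check and not as proof of crawler behavior.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the "block training, allow live browse" file above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url="https://yoursite.com/pricing"):
    """Map each user-agent token to whether the parsed rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(ROBOTS_TXT, ["GPTBot", "Google-Extended", "ChatGPT-User", "SomeOtherBot"])
```

Under these rules the training tokens come back as disallowed while ChatGPT-User and unlisted agents remain allowed.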

nginx WAF block for crawlers that ignore robots.txt

# /etc/nginx/conf.d/ai-bot-block.conf
# Use only if you have documented evidence a specific crawler is ignoring robots.txt.
map $http_user_agent $is_blocked_ai_bot {
    default 0;
    "~*GPTBot"           0;  # respect robots.txt; do not block at WAF
    "~*Bytespider"       1;  # block if you have a specific complaint
    "~*CCBot"            0;  # respect robots.txt
    "~*Diffbot"          0;
    "~*ImagesiftBot"     1;  # image scraping; block if not relevant
}

server {
    if ($is_blocked_ai_bot = 1) {
        return 403;
    }
    # ... rest of server config
}

Cloudflare WAF custom rule (UI-equivalent expression)

(http.user_agent contains "Bytespider")
or (http.user_agent contains "ImagesiftBot")
or (http.user_agent contains "Diffbot" and not http.user_agent contains "diffbot.com/help")

Action: Block (or Managed Challenge for soft-block).

AWS WAF rule (CloudFormation snippet)

- Name: BlockAggressiveAIBots
  Priority: 10
  Statement:
    OrStatement:
      Statements:
        - ByteMatchStatement:
            SearchString: "Bytespider"
            FieldToMatch:
              SingleHeader:
                Name: user-agent
            TextTransformations:
              - Priority: 0
                Type: NONE
            PositionalConstraint: CONTAINS
        - ByteMatchStatement:
            SearchString: "ImagesiftBot"
            FieldToMatch:
              SingleHeader:
                Name: user-agent
            TextTransformations:
              - Priority: 0
                Type: NONE
            PositionalConstraint: CONTAINS
  Action:
    Block: {}
  VisibilityConfig:
    SampledRequestsEnabled: true
    CloudWatchMetricsEnabled: true
    MetricName: BlockAggressiveAIBots

Cloudflare's one-click AI bot block

Cloudflare shipped a "Block AI Bots" toggle in mid-2024 [22] that sits in the Security → Bots panel and blocks Cloudflare's curated list of AI training crawlers without requiring you to maintain WAF rules. Adoption hit over a million domains by late 2024. The tradeoff is the same as the manual block: you lose training-corpus exposure for the brand. The one-click toggle does not block live-browse agents (ChatGPT-User, Claude-Web, Perplexity-User) by default — that requires a separate rule.

Method	Effort	Maintenance	Bypass risk	Coverage
robots.txt	5 min	Low	High (bad-actor bots ignore)	Good crawlers only
nginx UA block	30 min	Medium	Medium (UA-only, IP rotation bypasses)	UA-matched bots only
Cloudflare one-click AI bot block	1 min	Zero	Low	Cloudflare's curated list
Cloudflare Bot Management ML	Enterprise plan	Zero	Very low	Known + unknown bots
AWS WAF + custom rules	1-4 hr	Medium	Low	Whatever you write
IP-range block (verified)	1-2 hr setup	Quarterly refresh	Very low	Bots with published ranges
ai-content-usage header (future)	10 min	Low	Honored by major crawlers (eventually)	Crawlers that adopt standard

Should you block AI crawlers? (honest tradeoffs by site type)

This is the section that has eaten the most of my Discord and email time in 2026. Operators want a clean yes/no. The honest answer is "it depends on your business model and you should not block as a default." The decision tree by site type:

Site type	Block GPTBot?	Block ChatGPT-User?	Block CCBot?	Reasoning
Bootstrapped B2B SaaS	No	No	No	Need brand presence in trained corpus; ChatGPT citations drive measurable revenue
DTC ecommerce	No	No	No	Product discovery via AI engines is rising; blocking is a future-loss bet
Developer tools / OSS	No	No	No	Devs use ChatGPT for technical Q&A; being in the trained corpus is high-leverage
Content publisher (ad-supported)	Soft no (allow most)	No	Maybe block	Some publishers block CCBot but allow live-browse; tradeoff is volume vs control
Content publisher (paywalled)	Yes (block)	Allow free preview only	Yes (block)	Direct paywall conflict; OpenAI partnerships are licensing-deal terrain, not free crawl
News organization	Yes (block)	Maybe (allow free preview)	Yes (block)	NYT v. OpenAI precedent; license, do not give away
Local services / SMB	No	No	No	Negligible content value to AI labs; allow for brand discovery
Regulated healthcare / financial	No (with disclaimers)	No	No	YMYL content gets filtered by AI anyway; blocking does not help, instrumentation does
API documentation	No	No	No	Being in the trained corpus is the entire point of public API docs
Internal / authenticated content	N/A	N/A	N/A	Should already require auth; robots.txt is not the right defense

The argument for not blocking, summarized in one paragraph: most "AI bot blocking" guides assume your content has high standalone value to an AI lab and that you can capture that value by withholding it. For 95% of SaaS and ecommerce sites, that is not true. The content has value to your business because it drives leads, brand awareness, and product evaluations. The AI crawler that ingests your pricing page and the AI engine that later cites your pricing page are part of the same funnel. Blocking the first does not stop the second; it just makes the second less informed about your brand.

The argument for blocking, summarized in one paragraph: if you are a paid publisher whose business model is the sale of access to your content, AI training-corpus inclusion is direct competitive substitution. A reader who gets your reporting summarized inside ChatGPT is a reader who did not subscribe. Blocking GPTBot and ClaudeBot is a defensive move that preserves the licensing-deal option. The cost is loss of trained-corpus brand presence; the benefit is preserving the underlying commercial model.

The argument that nobody makes but should: blocking AI crawlers is not a one-way door; the decision is easy to reverse in either direction. Allowing them and then deciding later to block costs most sites nothing, because crawlers pick up the new robots.txt within days. Blocking first and unblocking later also works, but the brand presence lost while you were blocked takes time to rebuild. The asymmetry argues for "allow by default, block when you have a specific reason."

Two case patterns

Case one: a developer tools company I work with blocked GPTBot in early 2024 "just to be safe." Twelve months later they noticed their ChatGPT citation rate was declining relative to peers in the same category. They unblocked. Citation rate recovered within ~30 days as the new crawls fed the next training cycle. The 12-month "protection" did not protect anything; it just put them behind on brand presence inside model knowledge.

Case two: a paid publisher who left GPTBot allowed through 2024. When OpenAI's publisher-partnership program offered to license content for direct citation rights, the publisher had no leverage — their content was already in the training corpus. They blocked retroactively, but a year of crawls had already been ingested. The retrospective lesson: if you ever plan to negotiate a content license with an AI lab, the leverage starts on day one with the block, not on day 365 with the negotiation.

Measuring crawl frequency as a citation-readiness signal

This is the part of the playbook I have not seen anyone else describe in detail, and it is where AI crawl logging pays off with the highest leverage.

The basic insight: AI crawler traffic on a specific URL has predictive signal for whether the URL is about to start receiving human AI-referred traffic. Not perfect signal — many bot hits do not produce citations, and many citations come from URLs that have not been freshly crawled — but a real, exploitable pattern with consistent shape.

The four-signal model I run on every Attrifast install

Signal	Source	Lead time vs human traffic	Interpretation
GPTBot crawl frequency (rolling 7-day)	Server logs	7-30 days	Training-corpus inclusion; slow-moving baseline
ChatGPT-User fetch frequency (rolling 24h)	Server logs	18-72 hours	Live-citation activity; fast-moving real-time signal
OAI-SearchBot crawl frequency (rolling 7-day)	Server logs	3-14 days	Search-index inclusion; medium-moving
PerplexityBot + Perplexity-User combined	Server logs	12-48 hours	Perplexity-specific citation activity

For each URL on the site I track all four. The Citation Readiness score for a URL is a weighted blend, with ChatGPT-User and Perplexity-User carrying the highest weight (because they are user-triggered and most strongly correlated with active citations).
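To make the "weighted blend" concrete, here is a minimal sketch in Python. The weights and saturation points are hypothetical: the only thing stated above is that the user-triggered agents carry the highest weight, so treat every number in this block as an assumption to tune against your own logs, not as the production formula.

```python
# Hypothetical weights (sum to 1.0). The user-triggered signals carry the
# most weight; the exact values are illustrative assumptions.
WEIGHTS = {
    "gptbot_7d": 0.15,
    "chatgpt_user_24h": 0.40,
    "oai_searchbot_7d": 0.15,
    "perplexity_combined": 0.30,
}

# Hypothetical saturation points: a hit count at or above this value
# contributes the signal's full weight.
SATURATION = {
    "gptbot_7d": 20,
    "chatgpt_user_24h": 25,
    "oai_searchbot_7d": 20,
    "perplexity_combined": 25,
}


def citation_readiness(counts: dict) -> float:
    """Blend per-URL hit counts into a 0..1 score."""
    score = 0.0
    for signal, weight in WEIGHTS.items():
        # Cap each signal at 1.0 so one noisy crawler cannot dominate.
        normalized = min(counts.get(signal, 0) / SATURATION[signal], 1.0)
        score += weight * normalized
    return round(score, 3)


if __name__ == "__main__":
    print(citation_readiness({"gptbot_7d": 4, "chatgpt_user_24h": 11}))
```

Capping each signal before weighting is a design choice: it keeps a single aggressive crawler from pushing a URL to the top of the list on its own.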

A real pattern from one site

I cannot share absolute numbers but the shape is stable enough to describe. A bootstrapped SaaS site published a new commercial page on a Tuesday. The first GPTBot hit landed Wednesday. The first ChatGPT-User hit landed Friday afternoon: a single fetch from one IP. The following Monday, ChatGPT-User hits jumped to 11 in a single day and the first 2 human AI-referred sessions appeared; by Wednesday, human AI-referred traffic on that URL had reached 47 sessions. The first ChatGPT-User fetch led the first human session by 3 days, and the Monday ChatGPT-User jump led the Wednesday human peak by 2 days.

Day	GPTBot hits	ChatGPT-User hits	Human AI-referred sessions
Tue (publish)	0	0	0
Wed	1	0	0
Thu	0	0	0
Fri	0	1	0
Sat	1	0	0
Sun	0	0	0
Mon	0	11	2
Tue	1	18	14
Wed	0	22	47
Thu	2	19	38

The bot-signal column does not always lead the human-signal column this cleanly. Variance is high. But over a population of pages and over a long enough window, the directional pattern holds: ChatGPT-User hits cluster a few days before human AI-referrals climb.
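One simple way to put a number on "a few days before" is to compare the day each series first crosses a threshold. The sketch below uses the ChatGPT-User and human columns from the table above; the function names and the thresholds are my own illustrative choices, and threshold crossing is only one of several reasonable lead-time definitions.

```python
def first_day_at_or_above(series, threshold):
    """Return the index of the first day whose value reaches the threshold."""
    for day, value in enumerate(series):
        if value >= threshold:
            return day
    return None


def lead_days(bot_hits, human_sessions, bot_threshold=1, human_threshold=1):
    """Days by which the bot series crosses its threshold before the human one."""
    bot_day = first_day_at_or_above(bot_hits, bot_threshold)
    human_day = first_day_at_or_above(human_sessions, human_threshold)
    if bot_day is None or human_day is None:
        return None  # one of the series never crossed its threshold
    return human_day - bot_day


# Columns from the table, Tuesday (publish) through the second Thursday.
chatgpt_user = [0, 0, 0, 1, 0, 0, 11, 18, 22, 19]
human = [0, 0, 0, 0, 0, 0, 2, 14, 47, 38]

print(lead_days(chatgpt_user, human))          # first fetch vs first session -> 3
print(lead_days(chatgpt_user, human, 10, 10))  # 10+ hits vs 10+ sessions -> 1
```

The answer moves with the threshold, which is the practical version of the "variance is high" caveat: pick one definition and apply it consistently across pages rather than reading too much into any single URL.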

Why this matters for content strategy

Two operational uses for the signal:

Use one: prioritize content investment by Citation Readiness. Pages with high ChatGPT-User activity and rising human AI-referrals are the pages that deserve next-round content investment (deeper FAQ, more schema, expanded comparison sections). Pages with high crawl activity and zero human conversion are pages that are being fetched but not cited — different problem, different fix (usually content quality or page structure).

Use two: catch broken citation pipelines fast. When a previously high-citation page suddenly drops in ChatGPT-User activity, it is usually one of three things: the page has been deprioritized in the model's index, the page is broken (404, 500, redirect loop the bot cannot follow), or a competitor has displaced you in the citation slot. All three are recoverable if caught in days, very expensive if caught in months.
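Catching a drop "in days" needs some kind of automated check. A minimal sketch, assuming you already have a per-URL list of daily ChatGPT-User hit counts: compare the most recent week to the four weeks before it. The window sizes, the minimum baseline, and the 50% drop ratio are hypothetical defaults, not values from any product.

```python
from statistics import mean


def activity_dropped(daily_hits, recent_days=7, baseline_days=28,
                     min_baseline=5.0, drop_ratio=0.5):
    """Flag a URL whose recent mean fell below drop_ratio x its baseline mean.

    daily_hits is ordered oldest to newest, one count per day.
    """
    if len(daily_hits) < recent_days + baseline_days:
        return False  # not enough history to judge
    recent = daily_hits[-recent_days:]
    baseline = daily_hits[-(recent_days + baseline_days):-recent_days]
    baseline_mean = mean(baseline)
    if baseline_mean < min_baseline:
        return False  # the page was never a high-citation page
    return mean(recent) < drop_ratio * baseline_mean


if __name__ == "__main__":
    healthy = [20] * 35
    broken = [20] * 28 + [3] * 7
    print(activity_dropped(healthy), activity_dropped(broken))
```

A flag from this check says nothing about which of the three causes applies; it only tells you which URL to go and fetch by hand first.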

AI crawler bandwidth and infra cost benchmarks

The "AI bots are eating my bandwidth" panic is mostly overstated, but it is real on a specific set of site shapes. Here are the benchmarks I have collected.

Typical AI bot bandwidth as % of total traffic

Site type	AI bot bandwidth as % of total	Top contributor
Small SaaS (under 1M pageviews/mo)	0.1-0.8%	GPTBot
Mid SaaS (1M-10M pageviews/mo)	0.3-1.5%	GPTBot
Large SaaS / content site (10M+)	0.5-3%	GPTBot + ClaudeBot + Bytespider
DTC ecommerce (small-mid)	0.4-2%	GPTBot + Amazonbot
DTC ecommerce (large catalog)	1-5%	Bytespider + GPTBot
Content publisher	1-4%	CCBot + GPTBot
Developer documentation	2-8%	ClaudeBot + GPTBot
Job board / classifieds	3-12%	Bytespider + GPTBot
Real estate / listings	4-15%	Bytespider + Diffbot
Wikipedia-shaped reference	5-20%	CCBot + GPTBot + Bytespider

The bottom rows are where the bandwidth bills start to bite. Large catalog sites, listing sites with faceted filters, and reference-style content that has high per-page uniqueness all attract more aggressive AI crawler behavior because the per-page training value is higher.

Cost in concrete dollars (rough estimates)

For a site on Vercel Pro at $20/mo + $0.40/GB egress over the included allowance:

Site profile	Monthly pageviews	AI bot share	AI bot extra GB/mo	AI bot extra $/mo
Small SaaS	200,000	0.5%	<1 GB	<$1
Mid SaaS	2,000,000	1%	~5-10 GB	$2-4
Large content site	20,000,000	2%	~100-200 GB	$40-80
Job board	5,000,000	8%	~150-300 GB	$60-120
Listings site	10,000,000	12%	~500-1000 GB	$200-400

For the bottom two rows the bandwidth cost starts mattering. For the top three rows it is rounding error.
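The arithmetic behind the table is easy to rerun with your own numbers. This sketch makes three assumptions that are not in the table: it treats the AI bot share as a share of requests, it takes an average response size (`avg_response_kb`) that you supply, and it bills every bot gigabyte at the overage price, which is the worst case.

```python
def ai_bot_egress_cost(monthly_pageviews, ai_bot_share, avg_response_kb,
                       price_per_gb=0.40):
    """Return (extra GB per month, extra dollars per month) from AI bots."""
    bot_requests = monthly_pageviews * ai_bot_share
    gigabytes = bot_requests * avg_response_kb / 1_000_000  # KB -> GB (decimal)
    return gigabytes, gigabytes * price_per_gb


if __name__ == "__main__":
    # Listings-site row, with an assumed 500 KB average response.
    gb, dollars = ai_bot_egress_cost(10_000_000, 0.12, 500)
    print(f"{gb:.0f} GB, ${dollars:.0f}")  # about 600 GB, about $240
```

With a 500 KB average response the listings row lands at roughly 600 GB and $240, inside the table's range. Halve the response size with compression or caching and the bill halves with it.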

When AI crawlers actually create a problem

Problem	Root cause	Fix
Crawler hits uncached dynamic endpoint repeatedly	Catalog / search / faceted-filter URLs with no cache headers	Set `Cache-Control: public, s-maxage=86400` on bot-safe pages
Crawler triggers expensive DB queries	Faceted filter URLs hitting search infrastructure	Rate-limit AI crawlers on filter endpoints; serve static fallback
Crawler downloads large media (PDF, images, video)	No `Disallow` on heavy media paths	`User-agent: GPTBot \n Disallow: /media/large/`
Crawler floods 404 on stale sitemap URLs	Old URLs in sitemap that have been removed	Refresh sitemap; ensure 410 not 404 on permanent removals
Bytespider hitting 100+ requests / minute	Bytespider's historic aggressive crawl pattern	Rate-limit per ASN at edge; or block entirely

The single highest-leverage fix for AI crawler cost on most sites is aggressive caching on the URLs they hit most. The crawl pattern is highly cacheable — AI bots tend to refetch the same URL list on a schedule, and a 24-hour CDN cache absorbs nearly all of that load.
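Before rate-limiting or blocking anything, it is worth measuring whether a crawler really is hitting the "100+ requests / minute" pattern from the table. A minimal sketch, assuming you have already parsed your access log into `(timestamp, user_agent)` pairs; the parsing step and the function name are left as assumptions because log formats vary.

```python
from collections import Counter
from datetime import datetime


def peak_requests_per_minute(hits, bot_token):
    """Highest number of requests in any single clock minute for one bot.

    hits is an iterable of (datetime, user_agent) pairs.
    """
    per_minute = Counter()
    token = bot_token.lower()
    for timestamp, user_agent in hits:
        if token in user_agent.lower():
            # Truncate to the minute so hits bucket together.
            per_minute[timestamp.replace(second=0, microsecond=0)] += 1
    return max(per_minute.values(), default=0)


if __name__ == "__main__":
    sample = [
        (datetime(2026, 1, 1, 12, 0, s), "Mozilla/5.0 (compatible; Bytespider)")
        for s in range(5)
    ]
    print(peak_requests_per_minute(sample, "Bytespider"))
```

This matches on the user-agent string only, so spoofed hits are counted too. For a bandwidth decision that is usually what you want, since spoofed traffic costs the same as real traffic.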

The 2026 IETF "ai-content-usage" preferences proposal

The mess of per-crawler robots.txt rules and ad hoc WAF blocks is what the IETF AI Preferences working group [27] is trying to fix. The proposal that has emerged through 2025-2026 is a single HTTP header and robots.txt directive — Content-Usage for the header and ai-content-usage in robots.txt — that lets a site signal consent or refusal for several distinct downstream uses of its content.

The use categories under discussion (subject to change):

Use category	Description	Example crawler that consumes it
`train-ai`	Use page content as training data for AI models	GPTBot, ClaudeBot, Google-Extended
`train-genai`	Use specifically for generative AI training	Subset of above
`search`	Include in search indexes	Googlebot, Bingbot, OAI-SearchBot
`inference`	Use for retrieval-augmented generation at inference time	ChatGPT-User, Perplexity-User, Claude-Web
`summarize`	Use for AI-generated summaries	Various
`tdm-mining`	EU TDM Reservation directive coverage	All AI training

The signal is intended to be respected by all participating crawlers, with the goal that operators can write a single ai-content-usage line in robots.txt instead of maintaining a dozen per-crawler User-agent blocks. The TDMRep standard [26] solves a closely related EU-jurisdiction problem: it expresses the text-and-data-mining rights reservation defined in EU copyright law, which the EU AI Act in turn requires model providers to respect.

Example robots.txt directives under the proposal

# Refuse AI training, allow search and inference (live citation)
User-agent: *
ai-content-usage: search, inference
ai-content-usage-deny: train-ai, train-genai, tdm-mining

# Allow everything (recommended default for most SaaS / ecommerce)
User-agent: *
ai-content-usage: train-ai, train-genai, search, inference, summarize

# Selective: allow search and inference, but train-genai only on /blog/, not on /paid/
User-agent: *
ai-content-usage: search, inference

User-agent: *
Disallow: /paid/

User-agent: *
Allow: /blog/
ai-content-usage: train-genai, search, inference

Adoption status

Crawler	Public position on ai-content-usage proposal
GPTBot	Public signal of intent to honor [1]
ClaudeBot	Public signal of intent to honor [12]
Google-Extended	Public signal of intent to honor [13]
PerplexityBot	Tentative; awaiting standardization [16]
Applebot-Extended	Signaled support [17]
Bytespider	No public position
Meta-ExternalAgent	Tentative
CCBot	Common Crawl is a downstream pipeline; the standard applies to consumers, not CC itself

Until the proposal stabilizes (likely 2026-2027 timeline given the IETF process), the practical advice is to maintain per-crawler robots.txt directives as the primary control surface, with TDMRep tdm-reservation set for EU compliance, and ai-content-usage added as an additional polite-signal layer for forward compatibility.

Common AI crawler tracking mistakes

Eight mistakes I see often enough to call them patterns.

Mistake 1: Treating bot hits as traffic. A 10x spike in GPTBot is not 10x more users. It is OpenAI's crawler doing scheduled work. Fix: separate Bot Activity from Traffic in every chart. Never mix.

Mistake 2: Trusting the User-Agent string without IP verification. Between 5% and 15% of UA-claimed GPTBot hits are spoofed. Fix: verify the source IP against OpenAI's published ranges, or use Cloudflare's cf-verified-bot flag.

Mistake 3: Blocking GPTBot believing it stops ChatGPT citations. It blocks training-corpus inclusion only. ChatGPT-User and OAI-SearchBot keep working. Fix: understand the three-pipeline model before blocking anything.

Mistake 4: Forgetting Google-Extended is configured by robots.txt token, not UA. Google-Extended hits show as Googlebot in your logs. You cannot detect them in access logs alone. Fix: control via User-agent: Google-Extended directive in robots.txt; do not expect a unique UA string.

Mistake 5: Confusing crawl frequency with citation frequency. A page that is crawled often is not necessarily cited often. Many crawls produce zero citations. Fix: instrument both signals separately and watch the correlation, not the count.

Mistake 6: Allowing Bytespider unbounded on a large-catalog site. Bytespider has historically been the most aggressive AI crawler on listing-shaped sites, sometimes hitting hundreds of requests per minute. Fix: rate-limit per ASN or block at WAF if the bandwidth bill is real.

Mistake 7: Setting a robots.txt block on /blog/ for AI bots. This is the inverse of the right move. Your blog is the most citation-friendly content you publish; blocking AI from your blog while allowing access to your pricing page makes the model less informed about your category content but still able to quote your prices. Fix: invert. If you must block, block /paid/ or /pricing/details/, not /blog/.

Mistake 8: Ignoring the human-AI-referral side entirely. Bot tracking without human tracking is a dashboard that never shows revenue. The crawler hits are the leading indicator; the human clicks and the Stripe webhook are the lagging signals that pay for the work. Fix: instrument both sides, in one view if possible.

What this looks like in practice on Attrifast

A short note on the product because the article cannot pretend the author has no interest. Attrifast tracks AI crawler hits per URL per day in the same dashboard where it shows human AI-referred sessions and the Stripe revenue join. The crawler view is grouped by engine (OpenAI, Anthropic, Perplexity, Google, Apple, Meta, ByteDance) with per-bot breakdowns inside each engine. The human view is grouped by AI source domain (chatgpt.com, perplexity.ai, claude.ai, gemini.google.com) with referer-passed and referer-stripped separated. The Citation Readiness signal correlates the two over time.

The four pieces the product owns:

Edge detection of AI bot hits (UA matching + IP verification)
First-party detection of human AI-referrals (referer fingerprinting + behavioral inference)
Stripe webhook join from session to payment
A single dashboard showing all of the above with per-URL drill-downs

Cost: $15/mo for the Pro tier, which covers up to a stated session and bot-hit volume. The Stripe connection is OAuth, not API key. The tracking script is 4 KB, cookieless, ships without a consent banner under most jurisdictions (verify per your privacy review). The whole pitch is detailed on the cookieless revenue analytics page and the per-engine breakouts at track ChatGPT traffic, track Perplexity traffic, track Claude traffic, track Gemini traffic, and track AI Overviews.

The first-person reason I built it is that I was the operator in 2024 staring at GPTBot in my access logs, wondering whether the crawl spike on my pricing page meant anything, and finding that nothing in my analytics stack would tell me whether the human traffic 48 hours later was related. Now it does.

Limitations

Five things this article does not cover, and you should not extrapolate past.

Voice-mode AI citations. When ChatGPT voice or Gemini voice answers a query and never renders a clickable URL, no fetch is made and no citation is logged in the way this article describes. The brand mention happens; the crawler signal does not. No reliable measurement story exists for voice-mode citation behavior in 2026.
Enterprise tenants. ChatGPT Enterprise, Claude for Work, and Microsoft Copilot for organizations may use different fetch behaviors from consumer surfaces. The crawler IPs and UAs are not always the same as the public ranges. Treat the consumer-surface measurements in this article as a lower bound for enterprise behavior.
MCP and tool-using agents. Model Context Protocol-based agents and tool-using agents from frameworks like LangChain, AutoGPT, and CrewAI sometimes fetch URLs directly via headless browsers with generic UAs. These are functionally AI-driven fetches but invisible to UA-based bot detection. The right detection is behavioral, not UA-based, and remains an open problem.
Region and language variance. Most of the percentages in this article are aggregated from US English sites. APAC and EU patterns differ — Bytespider is more dominant in APAC, ClaudeBot less so. Take the share percentages as directional, not canonical for your geography.
The "did blocking GPTBot get me cited less?" counterfactual. I have anecdotes from individual sites but no controlled study. The cleanest available evidence is the developer-tools company case I described, where unblocking restored citation rate within ~30 days. Treat that as one data point, not as a quantified causal claim.

FAQ

What is the difference between GPTBot, ChatGPT-User, and OAI-SearchBot?

GPTBot is OpenAI's training crawler: it scrapes pages to add to future model training corpora and respects robots.txt. ChatGPT-User is the live browse agent that fetches a URL on demand when a user (or the model) asks ChatGPT to read a specific page; because those fetches are user-initiated, OpenAI's docs indicate that robots.txt rules may not apply to them. OAI-SearchBot powers the ChatGPT search index that launched October 2024 and behaves more like a traditional search crawler. The three serve different jobs, and a robots.txt block on one does not block the others. Most sites should distinguish them in logs because a GPTBot spike is a training-corpus signal, a ChatGPT-User spike is a live-citation signal, and an OAI-SearchBot spike is a search-index signal.

Should I block GPTBot to protect my content from being used to train AI?

Probably not, but the honest answer depends on what you sell. Blocking GPTBot via robots.txt removes you from future training corpora. It does not block ChatGPT-User, so ChatGPT can still fetch and cite your pages live on user request. The net effect of blocking GPTBot is that the model's baseline knowledge of your brand slowly degrades while live-browse citations continue. For most SaaS and ecommerce sites this is the wrong trade because baseline brand recall in the model is what gets you cited for queries the user asks without browse mode active. Block GPTBot only if you have a specific legal, contractual, or content-monetization reason: for example, a paid publisher whose business model is selling access to paywalled content, or a brand under explicit DMCA pressure.

Does robots.txt actually stop AI crawlers from training on my site?

It stops the crawlers that respect it. GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and Applebot-Extended all publicly commit to respecting robots.txt and the historical evidence is that they do. It does not stop bad-actor scrapers, third-party crawlers selling data to AI labs, archived copies on the Wayback Machine, third-party citations that quote you verbatim, or content already in published training sets. Robots.txt is a polite request honored by the largest crawlers; it is not a technical enforcement mechanism. If you need enforcement, you need WAF rules, Cloudflare's AI bot blocking, or IP-level blocks at the edge — and even those have known bypass paths.

How can I tell if an AI crawler hit is real or a spoofed user-agent?

Check the source IP against the operator's published ranges and, where the operator supports it (Googlebot does), reverse-DNS the IP and confirm the hostname with a forward lookup. OpenAI publishes its GPTBot IP ranges at openai.com/gptbot.json, Google publishes its crawler ranges at developers.google.com/search/apis/ipranges/googlebot.json plus a separate special-crawlers list, and Anthropic publishes ClaudeBot IPs in its documentation. A request claiming to be GPTBot from an IP outside OpenAI's published ranges is either spoofed or from a third-party crawler imitating the user-agent. In practice, 5-15% of traffic claiming to be a known AI bot on the sites I audit is spoofed, mostly from scrapers trying to evade rate limits. Always validate by IP, not by user-agent string alone.
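The IP check itself is a few lines with Python's standard `ipaddress` module. This is a sketch under stated assumptions: the CIDR list below uses reserved documentation ranges as placeholders, not real crawler ranges, and downloading and parsing each operator's published file is left out because the file formats differ between operators.

```python
import ipaddress

# Placeholder CIDRs from reserved documentation blocks. These are NOT real
# crawler ranges; load the operator's published list in practice.
EXAMPLE_RANGES = ["192.0.2.0/24", "198.51.100.0/28"]


def ip_in_ranges(ip, cidrs):
    """Return True if the address falls inside any of the CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        # Skip ranges of the other IP version, then test membership.
        if addr.version == network.version and addr in network:
            return True
    return False


if __name__ == "__main__":
    print(ip_in_ranges("192.0.2.17", EXAMPLE_RANGES))   # inside the first range
    print(ip_in_ranges("203.0.113.5", EXAMPLE_RANGES))  # outside both ranges
```

A hit whose user-agent claims a known bot but whose IP fails this check belongs in a separate "spoofed" bucket rather than being dropped, because the size of that bucket is itself worth watching.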

What is the cost in bandwidth of allowing AI crawlers?

Smaller than most operators fear and growing. Across the sites I monitor, AI crawlers together account for 0.5-4% of total bandwidth on content-heavy sites, with GPTBot the largest single share at typically 30-50% of AI-crawler bandwidth. The Cloudflare Radar AI bot dashboard pegs AI bot share of total bot traffic at roughly 4-6% in 2024-2025 with steady growth quarter over quarter. On a typical $20-50/mo VPS the marginal cost is negligible. On large static sites behind a CDN the cost is also negligible because the CDN absorbs most crawler load. The cost becomes real when AI crawlers hit uncached dynamic endpoints — large catalog sites, listing pages with faceted filters, search endpoints. For those, cache the AI-crawler responses aggressively or rate-limit per ASN.

Is there a single way to opt out of all AI training at once?

Not as of mid-2026, but the IETF AI Preferences working group's ai-content-usage proposal is the closest path. It defines a single HTTP header and robots.txt directive (Content-Usage and ai-content-usage) that signals consent or refusal for AI training, search-index inclusion, and other downstream uses. Major crawlers including Google-Extended, GPTBot, and ClaudeBot have signaled intent to honor the standard once it stabilizes. Until then you need per-crawler robots.txt blocks for each AI bot you want to exclude, plus WAF rules for crawlers that ignore robots.txt. The TDMRep (Text and Data Mining Reservation Protocol) standard offers a similar opt-out signal for EU jurisdictions, anchored in the text-and-data-mining rights reservation of EU copyright law that the EU AI Act references.

Does blocking GPTBot stop ChatGPT from citing me?

No, and this is the single most common misunderstanding I see in the bot-blocking discourse. ChatGPT cites pages via three pipelines: the trained corpus (informed by GPTBot crawls), the live browse fetch (ChatGPT-User), and the search index (OAI-SearchBot). Blocking GPTBot only stops the first. A user who asks ChatGPT to read your page, or who searches via ChatGPT's search interface, can still trigger a fetch and a citation. The user is the source of the request, not OpenAI. Blocking GPTBot makes you slightly less likely to be cited in answers the model produces from pure trained knowledge but does not remove you from active answers.

How do I tell an AI crawler hit from a human AI-referred visit in my logs?

Two different log signals. AI crawler hits show a known bot user-agent (GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, OAI-SearchBot) and usually no Referer header. Google-Extended is the exception: it is a robots.txt token with no user-agent of its own, so it never appears as a distinct bot in access logs. Human AI-referred visits show a normal browser user-agent (Chrome, Safari, Firefox on a real OS) and either a Referer of chatgpt.com / perplexity.ai / claude.ai / gemini.google.com (15-20% of the time) or an empty Referer (the majority). The right shape for a log analyzer is two completely separate views: a Bot Activity view that includes all AI crawlers, and a Traffic view that excludes them. Mixing the two is how operators end up reporting 10x traffic spikes that are actually crawler bursts.
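The two-view split can be sketched as a single classification function. The bot tokens and referrer hosts come from the lists in this article; the function name and the three bucket labels are hypothetical, and the user-agent match here is unverified, so pair it with an IP check before trusting the bot bucket.

```python
from urllib.parse import urlparse

AI_BOT_TOKENS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot",
                 "ClaudeBot", "PerplexityBot", "Perplexity-User")
AI_REFERRER_HOSTS = ("chatgpt.com", "perplexity.ai", "claude.ai",
                     "gemini.google.com")


def classify_hit(user_agent, referer=""):
    """Sort one log line into bot_activity, human_ai_referred, or other_traffic."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in AI_BOT_TOKENS):
        return "bot_activity"
    host = (urlparse(referer).hostname or "").lower()
    # Accept the bare host and any subdomain such as www.perplexity.ai.
    if any(host == h or host.endswith("." + h) for h in AI_REFERRER_HOSTS):
        return "human_ai_referred"
    return "other_traffic"


if __name__ == "__main__":
    print(classify_hit("Mozilla/5.0 (compatible; GPTBot/1.1)"))
    print(classify_hit("Mozilla/5.0 Chrome/120", "https://chatgpt.com/"))
```

Human AI-referred visits that arrive with an empty Referer land in `other_traffic` here. Recovering those needs behavioral signals that a log-line classifier like this cannot see.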

What is the difference between PerplexityBot and Perplexity-User?

PerplexityBot is Perplexity's scheduled search-index crawler, similar in role to OAI-SearchBot for ChatGPT. Perplexity-User is the live fetch agent that retrieves a URL when a Perplexity user (or the Perplexity model on their behalf) requests it. Per Perplexity's published policy, PerplexityBot respects robots.txt; Perplexity-User does not, on the grounds that user-triggered fetches are equivalent to a user typing the URL into a browser. Multiple site operators have reported PerplexityBot apparently ignoring robots.txt directives despite the policy, which Perplexity has publicly denied. Verify on your own logs and use WAF blocks if you observe non-compliance.

How often should I refresh AI crawler IP ranges?

Quarterly is the safe cadence. OpenAI, Google, and Anthropic update their published IP ranges occasionally (every few months in practice) when they rotate infrastructure. If you rely on Cloudflare's cf-verified-bot flag for verification, you do not need to maintain your own copy of the IP ranges; Cloudflare handles it. If you are running your own IP-allowlist-based verification on bare nginx or AWS WAF, set a calendar reminder to refresh the JSON files quarterly.

Does Common Crawl (CCBot) count as an AI crawler?

Sort of. CCBot itself is not an AI crawler — it is Common Crawl, a nonprofit that publishes a freely-available web archive. But the Common Crawl dataset is the upstream source for many AI training pipelines, including most academic LLMs, some Hugging Face models, and historically parts of OpenAI's GPT-3 training. Blocking CCBot is a way to opt out of being included in the public dataset that downstream AI labs draw from. It is broader-than-necessary if you only want to opt out of one specific lab; it is narrower-than-necessary if you want to opt out of all AI training (because the labs also run their own direct crawlers). Most paid-publisher blocking lists include CCBot for completeness.

What is llms.txt and how does it relate to robots.txt?

llms.txt is a small markdown file at your site root (analogous to sitemap.xml or humans.txt) that lists your most LLM-relevant pages with short descriptions. It is a positive opt-in signal — "here are the pages I want LLMs to read" — rather than the opt-out / restriction nature of robots.txt. Adoption is around 7% of public SaaS sites as of Q1 2026. ChatGPT, Perplexity, and Claude crawlers all read it when present. It does not replace robots.txt; it complements it. The 30-minute investment is one of the highest-leverage one-time AEO moves available.

Can AI crawlers index pages behind a login wall?

No, not legitimately. AI crawlers that respect robots.txt also respect HTTP 401 / 403 responses on authenticated content. If your pages require login, they should return 401 to unauthenticated requests, including AI bot requests, and the bots will skip them. The exception is if you accidentally expose authenticated content via a leaked URL with a session token in the query string — in which case the bot can index it and the leak is the underlying bug. If you see authenticated content appearing in ChatGPT, the most likely explanation is a URL-leak issue, not the bot defeating your auth.

What happens if a previously-crawled page is removed or returns 404?

Removed pages drop out of training corpora over subsequent training cycles (months to a year). They drop out of live citations faster — typically within days to weeks, as the AI engine's index refreshes and notices the 404. For permanent removals, return HTTP 410 Gone instead of 404; this is a clearer signal to crawlers that the URL is intentionally removed and will not return. For temporary outages, do not let 5xx errors persist on URLs you care about — AI crawlers may interpret repeated 5xx as low-quality signal and deprioritize the URL in future indexes.

References

For the analytics-side companion that covers human AI-referred visits (where the bot signal leads to actual revenue), see the ChatGPT referral analytics guide and the per-engine deep dives at track Perplexity traffic, track Claude traffic, track Gemini traffic, and track AI Overviews. For the strategic framing of how to split SEO and AEO effort given the crawler ecosystem above, AEO vs SEO in 2026 is the companion. For the broader question of how Google sources its AI Overviews content (which interacts with Google-Extended and the AI crawler decisions you make here), see where does Google AI get its information. For the multi-engine citation monitoring side, the AI visibility tracker walkthrough covers how to combine bot crawl data with human-referral data into a single citation-readiness signal. For the practical guide to getting cited in the first place, how to get cited by AI engines is the playbook. The product surfaces are at cookieless revenue analytics, track ChatGPT traffic, and the main dashboard.
