AI Crawler & Agent Tracking 2026: GPTBot, ClaudeBot, PerplexityBot Explained

The 2026 field guide to AI crawler tracking — every user-agent, IP range, robots.txt directive, opt-out matrix, and how to tell crawls from human AI-referred visits.

The most common question I get from founders since Q1 2026 is some variant of "should I block GPTBot?" The second most common is "are AI bots eating my bandwidth?" The third is "how do I know if any of this is working?" All three deserve real answers, and the real answers are stranger than the discourse suggests. Blocking GPTBot does not do what most people think it does. AI bot bandwidth is usually not the problem operators imagine it to be. And the question "is it working?" has a structural answer that involves both crawl logs and human-session attribution, not just one or the other.

This piece is the technical companion to the strategic AEO vs SEO in 2026 post and the analytics-side ChatGPT referral analytics guide. Where those two cover what to publish and how to measure human AI traffic, this one covers the crawlers themselves: every user-agent you need to know, how to detect them, when to block them, what each one actually does, and how to wire their crawl frequency into a citation-readiness signal. If you have read the practical track-ChatGPT-traffic playbook and the broader how to get cited by AI engines post, this is the deeper dive on the bot side of the same problem.

Figure: Cloudflare Radar AI bot traffic share by crawler in 2025. GPTBot ~28%, ClaudeBot ~17%, Bytespider ~13%, PerplexityBot ~11%, Google-Extended ~10%, Amazonbot ~8%, Meta-ExternalAgent ~6%, others ~7%.

Quick Facts

Metric | Value | Source
AI bot share of total bot traffic (2024-2025) | ~4-6% | Cloudflare Radar [5]
Documented OpenAI user-agents | 3 (GPTBot, ChatGPT-User, OAI-SearchBot) | OpenAI bot docs [1]
Documented Anthropic user-agents | 3 (ClaudeBot, anthropic-ai, claude-web) | Anthropic ClaudeBot docs [12]
Documented Perplexity user-agents | 2 (PerplexityBot, Perplexity-User) | Perplexity docs [16]
Google AI training crawler | Google-Extended (shares Googlebot infra) | Google crawler docs [13]
Apple AI training opt-out crawler | Applebot-Extended | Apple Applebot docs [17]
ByteDance / TikTok AI crawler | Bytespider | Bytespider docs [18]
Meta AI crawler | Meta-ExternalAgent | Meta crawler docs [19]
Common Crawl bot (third-party AI training source) | CCBot | commoncrawl.org [20]
Cloudflare one-click AI bot block adoption | ~1M+ domains as of mid-2025 | Cloudflare blog [22]
robots.txt RFC reference (RFC 9309) | Standardized September 2022 | IETF RFC 9309 [21]
llms.txt adoption (public SaaS, Q1 2026) | ~7% | llmstxt.org [25]
Median spoofed-UA rate on AI bot hits | 5-15% | Attrifast aggregate
AI Overviews appearance rate (US English) | 13-15% of queries | Search Engine Land [10]

Two numbers do most of the work here. The 4-6% AI bot share of total bot traffic is the demand-side number — AI crawlers are real but still a small fraction of the bot fleet. The 5-15% spoofed-UA rate is the data-quality number — at least one in twenty hits claiming to be GPTBot is not actually OpenAI. If you skip the IP verification step you are reporting on phantom traffic.

Why AI crawler tracking matters in 2026

There are three reasons to track AI crawlers, and they are different enough that the right answer depends on which one you care about.

Reason one: crawl frequency is a leading indicator of citation readiness. When GPTBot, ChatGPT-User, or OAI-SearchBot crawl rates climb on a specific page, it is rarely random. It usually means OpenAI's pipeline considers the page worth indexing, fetching live in response to a user query, or surfacing in search. A page that was getting zero AI-bot hits per week and then jumps to a dozen ChatGPT-User hits in 48 hours is, in my experience, an extremely reliable signal that the page is being cited in answers to a query that is trending in real time. By the time the human traffic shows up in your analytics 24-72 hours later, the crawl-rate spike has already told you something is happening.

Reason two: bandwidth and infra cost accountability. On most sites this is a minor concern. On a few — large catalogs, faceted filter pages, search endpoints, infinite-scroll listing pages, image-heavy galleries — AI crawlers can hit thousands of unique URLs per day and chew through egress budget that you were not planning for. The Vercel team published a detailed breakdown of AI bot traffic patterns in late 2024 [23] showing that some of their largest customers were seeing AI crawler traffic exceed search-engine crawler traffic for the first time. If you are paying per GB or per request, knowing which bots are hitting which paths is the difference between a $40 surprise and a $4,000 surprise.

Reason three: content licensing and IP enforcement decisions. If you publish content that you actively monetize — a paid newsletter, a paywalled archive, a high-investment editorial product — the question of whether AI labs are ingesting it for training is a legitimate commercial concern. The New York Times v. OpenAI suit, the Reddit data licensing deals, the various publisher partnerships with OpenAI and Anthropic in 2024-2025 — all of these orbit the same question. You cannot make a sensible decision about which crawlers to allow, block, or negotiate with until you can see what is hitting your site and how often.

Most operators care about exactly one of those three reasons. The mistake I see most often is conflating them. "I want to block GPTBot because of bandwidth" usually does not survive five minutes of looking at actual numbers; "I want to track GPTBot because of citation readiness" is a measurement project, not a blocking project; "I want to negotiate with OpenAI because of licensing" is a different project again. Pick the reason first, then choose the action.

Goal | Right action | Wrong action
Maximize AI citations for SaaS / ecommerce | Allow GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended; instrument all of them | Block GPTBot to "protect content"
Bandwidth control on large catalog site | Cache aggressively, rate-limit per ASN, log per-bot bandwidth | Block all AI bots indiscriminately
Paywall enforcement on paid editorial | Block GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider; allow ChatGPT-User on free preview only | Block via robots.txt only and assume that is enforcement
Citation-readiness leading-indicator dashboard | Log all AI bot hits per URL per day, correlate to human AI-referral spikes 24-72h later | Treat bot hits as traffic in your KPI chart
Compliance with EU AI Act opt-out | Implement TDMRep tdm-reservation directive and ai-content-usage header | Rely on robots.txt alone

The AI crawlers you need to know

Below is the working reference table I keep in a spreadsheet and update roughly monthly. The list is not exhaustive (there is a long tail of academic, research, and one-off crawlers that accounts for the bottom 5% of AI bot traffic), but these are the ones that account for roughly 95% of what you will see in production logs in 2026.

Crawler | Owner | Purpose | Respects robots.txt | Documented IP range
GPTBot | OpenAI | Training corpus crawler | Yes [1] | openai.com/gptbot.json
ChatGPT-User | OpenAI | Live browse / user-triggered fetch | Yes for crawls, no for direct user fetches [1] | openai.com/chatgpt-user.json
OAI-SearchBot | OpenAI | ChatGPT search index crawler | Yes [1] | openai.com/searchbot.json
ClaudeBot | Anthropic | Training corpus crawler | Yes [12] | Published in Anthropic docs
anthropic-ai (legacy) | Anthropic | Legacy training crawler | Yes [12] | Same as ClaudeBot
claude-web | Anthropic | Live browse, user-triggered | Yes [12] | Same as ClaudeBot
PerplexityBot | Perplexity | Search index crawler | Yes [16] | Published in Perplexity docs
Perplexity-User | Perplexity | User-triggered live fetch | No (per Perplexity policy) [16] | Same as PerplexityBot
Google-Extended | Google | Training crawler for Gemini, Vertex AI | Yes [13] | Shares Googlebot IPs
Applebot-Extended | Apple | Training opt-out for Apple Intelligence | Yes [17] | Shares Applebot IPs
Bytespider | ByteDance | Training crawler (TikTok / Doubao) | Yes (post-2024) [18] | Published in Bytespider docs
Meta-ExternalAgent | Meta | Training crawler (Llama, Meta AI) | Yes [19] | Published in Meta crawler docs
Amazonbot | Amazon | Multi-purpose, partially AI training | Yes [24] | Published in Amazon docs
CCBot | Common Crawl | Open dataset used by most AI labs | Yes [20] | commoncrawl.org
Diffbot | Diffbot | Commercial crawler, AI-data buyer | Yes | Published in Diffbot docs
YouBot | You.com | AI search crawler | Yes | Published in You.com docs
Cohere-AI | Cohere | Training crawler | Yes | Published in Cohere docs
ImagesiftBot | Hive AI | Image training crawler | Yes | Published in Hive docs

The next table is the one I actually use to write log greps. Exact user-agent strings as they appear in the field as of May 2026.

Crawler | User-Agent substring to match
GPTBot | GPTBot/1.1 (or earlier GPTBot/1.0)
ChatGPT-User | ChatGPT-User/1.0 (also seen as ChatGPT-User/2.0)
OAI-SearchBot | OAI-SearchBot/1.0
ClaudeBot | ClaudeBot/1.0 or claudebot
anthropic-ai | anthropic-ai (older legacy string)
claude-web | Claude-Web/1.0
PerplexityBot | PerplexityBot/1.0
Perplexity-User | Perplexity-User/1.0
Google-Extended | (UA same as Googlebot; differentiated by robots.txt directive only)
Applebot-Extended | (UA same as Applebot; differentiated by robots.txt directive only)
Bytespider | Bytespider
Meta-ExternalAgent | meta-externalagent
Amazonbot | Amazonbot/0.1
CCBot | CCBot/2.0 (Common Crawl)
Diffbot | Diffbot
YouBot | YouBot
Cohere-AI | cohere-ai
ImagesiftBot | ImagesiftBot

Two important quirks in that table. First, Google-Extended and Applebot-Extended do not have unique user-agent strings. They share the standard Googlebot and Applebot UAs respectively. The way you "opt them out" is in robots.txt by writing a rule against Google-Extended or Applebot-Extended as a User-agent token — the crawler reads its own token from your robots.txt, even though the HTTP UA header is generic. This trips up log greps constantly. If you only grep by HTTP UA you cannot separate AI-training Googlebot from search-indexing Googlebot.

Second, the version suffixes drift. GPTBot was GPTBot/1.0 through most of 2024 and bumped to GPTBot/1.1 in late 2024. ChatGPT-User has been observed at both 1.0 and 2.0 in production logs. Your regex should match the prefix (GPTBot/), not the exact version.
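
A minimal sketch of version-agnostic matching, assuming the tokens from the table above and case-insensitive comparison (some crawlers send lowercase strings). The function name `classify_user_agent` is hypothetical, and the token list deliberately excludes Google-Extended and Applebot-Extended because they never appear in the HTTP header.

```python
import re
from typing import Optional

# Tokens from the user-agent table; version suffixes are deliberately not matched.
AI_CRAWLER_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
    "Claude-Web", "PerplexityBot", "Perplexity-User", "Bytespider",
    "Meta-ExternalAgent", "Amazonbot", "CCBot", "Diffbot", "YouBot",
    "cohere-ai", "ImagesiftBot",
]

# One case-insensitive alternation, longest tokens first.
_PATTERN = re.compile(
    "|".join(re.escape(t) for t in sorted(AI_CRAWLER_TOKENS, key=len, reverse=True)),
    re.IGNORECASE,
)
_CANONICAL = {t.lower(): t for t in AI_CRAWLER_TOKENS}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the canonical crawler name found in a User-Agent header, or None."""
    match = _PATTERN.search(user_agent)
    return _CANONICAL[match.group(0).lower()] if match else None


print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_user_agent("meta-externalagent/1.1"))
```

Both GPTBot/1.0 and GPTBot/1.1 map to the same label, so a version bump does not silently drop hits from a report. A match here only says what the request claims to be; it is not proof of origin.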

Crawler ownership and parent company

Crawler | Parent company | Training model fed | Headquartered
GPTBot / ChatGPT-User / OAI-SearchBot | OpenAI | GPT-4, GPT-5, future models | San Francisco, US
ClaudeBot family | Anthropic | Claude 3, Claude 4, Claude 5 | San Francisco, US
PerplexityBot / Perplexity-User | Perplexity AI | Perplexity Sonar / hosted Llama, Claude | San Francisco, US
Google-Extended | Google / Alphabet | Gemini, Vertex AI, Bard legacy | Mountain View, US
Applebot-Extended | Apple | Apple Intelligence, on-device foundation models | Cupertino, US
Bytespider | ByteDance | Doubao, internal LLMs | Beijing, CN
Meta-ExternalAgent | Meta | Llama 3, Llama 4, Meta AI | Menlo Park, US
Amazonbot | Amazon | Alexa LLM, internal Bedrock training | Seattle, US
CCBot | Common Crawl Foundation | Public Common Crawl dataset (used by most labs) | San Francisco, US (nonprofit)
Cohere-AI | Cohere | Command R, Command R+ | Toronto, CA
YouBot | You.com | YouChat / hosted models | Palo Alto, US

Approximate AI bot traffic share

The Cloudflare Radar AI Insights dashboard [5] publishes share-of-traffic numbers for the AI bot fleet that move quarter to quarter. The numbers below are directional, drawn from publicly accessible Cloudflare Radar snapshots, Vercel's own AI bot writeup [23], and the Attrifast customer aggregate. Treat them as estimates with double-digit error bars, not as canonical facts.

Crawler | Approx share of AI bot traffic (2025-2026) | Trend Q1 2026 vs Q1 2025
GPTBot | ~25-30% | Slowly growing
ClaudeBot | ~15-18% | Rapidly growing
Bytespider | ~10-14% | Volatile, regional
PerplexityBot | ~9-12% | Rapidly growing
Google-Extended | ~8-11% | Stable
Amazonbot | ~7-9% | Stable
Meta-ExternalAgent | ~5-7% | Growing
CCBot | ~3-5% | Stable
Applebot-Extended | ~2-4% | New, growing
All others | ~3-6% | Mixed

Two things to call out. ClaudeBot's growth is faster than any other AI crawler in this dataset; it tracks closely with Claude's expanding API customer base, which makes sense if you assume the crawl pipeline is sized to model usage. Independent observers have reported Perplexity's crawler to be much more aggressive than its stated respect-for-robots policy implies; multiple site operators have reported PerplexityBot ignoring Disallow directives, which Perplexity has publicly denied [16]. The honest read is that the field data is mixed and you should verify on your own logs before drawing conclusions.

Crawler vs human AI-referred visit — what's the actual difference

This is the distinction operators get wrong most often. A GPTBot crawl is not a citation. A ChatGPT-User fetch is not a click. The user who later arrives via a ChatGPT citation is a third, completely separate event, and the three live in different parts of your stack.

Event type | What it is | Where it shows up | Counts as
GPTBot training crawl | Scheduled scrape for future training | Server logs with GPTBot/1.1 UA | Bot activity (exclude from traffic)
ChatGPT-User live fetch | On-demand fetch when user asks ChatGPT to read URL | Server logs with ChatGPT-User/1.0 UA | Bot activity (citation signal)
OAI-SearchBot index | Scheduled crawl for ChatGPT search index | Server logs with OAI-SearchBot UA | Bot activity (search-index signal)
Human click from ChatGPT (referer-passed) | Real human, browser hits page | Analytics with chatgpt.com referer | Human traffic (attribute to ChatGPT)
Human click from ChatGPT (referer-stripped) | Real human, browser hits page, no referer | Analytics as Direct/(none) | Human traffic (suspected ChatGPT)

The clean mental model: bot hits and human hits are two different streams that should never be mixed in the same chart. When operators report "we got 4,000 visits from ChatGPT yesterday" and the number is actually 3,200 GPTBot crawls plus 800 humans, the resulting decisions (content plan, ad spend, board update) are all built on a misclassification.

Here is the dual-view pattern I now ship in every Attrifast install for the AI-engine view, with a third view derived from the two:

View | What it shows | What it does not show
Bot Activity | All AI crawler hits (GPTBot, ChatGPT-User, ClaudeBot, etc.) per URL per day | Human traffic of any kind
Traffic | Real human sessions, attributed to AI engines where possible | Bot hits of any kind
Citation Readiness | Correlation between bot activity (lead) and human AI-referrals (lag) | Causation: bot hits do not cause clicks; they precede them

The Citation Readiness view is the one that operationalizes the leading-indicator insight. When ChatGPT-User crawls on a page spike, the median lag to a human-referral spike on the same page is 18-72 hours in my data, with wide variance. Pages where the bot spike does not produce a corresponding human spike within 7 days are either (a) being fetched but not cited or (b) being cited in answers that do not generate clicks. Both are useful pieces of information but they mean different things for content strategy.

The decision tree I run at the edge in Attrifast does three things in one pass: it classifies the request, validates the UA against the published IP range, and routes the event into one of three downstream signals. The verification step (published IP range or reverse DNS) is the one most server-log greps skip, and it is the source of most of the bad data.

How to detect AI crawlers in your logs (with grep examples)

You can get 80% of the value with five minutes and a grep command. The 20% that needs more work is the IP verification and the suspected-AI behavioral inference. Both are covered below.

The minimal log grep

For nginx logs in the default format:

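grep -iE "(GPTBot/|ChatGPT-User/|OAI-SearchBot|ClaudeBot|anthropic-ai|Claude-Web|PerplexityBot|Perplexity-User|Bytespider|Meta-ExternalAgent|Amazonbot|CCBot/|Diffbot|YouBot|cohere-ai|ImagesiftBot)" \
  /var/log/nginx/access.log \
  | awk -F'"' '{split($1, a, " "); split($2, r, " "); split($3, s, " "); print a[1], r[2], s[1], $6}' \
  | sort | uniq -c | sort -rn | head -50

That gives you a sorted list of (IP, path, status, User-Agent) tuples for AI bot hits. Splitting each line on double quotes keeps the full User-Agent intact: a[1] is the remote IP, r[2] the request path, s[1] the HTTP status, and $6 the User-Agent. The -i flag matters because some crawlers send lowercase tokens (meta-externalagent, claudebot). Google-Extended and Applebot-Extended are left out of the pattern because they never appear in the User-Agent header. Adjust the field positions if your log_format differs from the default combined format.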

For Apache combined log format:

grep -iE "(GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|anthropic-ai|Claude-Web|PerplexityBot|Perplexity-User|Bytespider|Meta-ExternalAgent|Amazonbot|CCBot)" \
  /var/log/apache2/access.log \
  | awk -F'"' '{print $1, $2, $6}' \
  | sort | uniq -c | sort -rn | head -50

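Per-crawler hit counts in the current log file (about 24 hours with daily rotation)

# Google-Extended and Applebot-Extended are omitted: they never appear in the User-Agent header.
for bot in "GPTBot/" "ChatGPT-User/" "OAI-SearchBot" "ClaudeBot" "PerplexityBot" "Bytespider" "Meta-ExternalAgent" "Amazonbot" "CCBot/" "Diffbot"; do
  # -i because some crawlers send lowercase tokens (for example meta-externalagent)
  count=$(grep -ci "$bot" /var/log/nginx/access.log)
  printf "%-25s %d\n" "$bot" "$count"
done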

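Top paths hit by GPTBot in the last 7 days

# -f lets zcat pass uncompressed files through, so the current and rotated logs are all read.
# The window is 7 days only if logrotate keeps 7 daily files; narrow the glob to match your retention.
zcat -f /var/log/nginx/access.log* | grep "GPTBot/" \
  | awk '{print $7}' \
  | sort | uniq -c | sort -rn | head -30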

IP-range verification (the step most ops skip)

A request claiming GPTBot/1.1 from an IP outside OpenAI's published ranges is spoofed. The verification:

# Pull OpenAI's published GPTBot IP ranges
curl -s https://openai.com/gptbot.json > /tmp/gptbot-ranges.json

# Extract IPs that hit your site claiming to be GPTBot
grep "GPTBot/" /var/log/nginx/access.log | awk '{print $1}' | sort -u > /tmp/observed-gptbot-ips.txt

# Compare (you'll need a CIDR-match tool like grepcidr for the full check).
# As written, this prints the observed IPs that fall inside the published ranges (verified).
# Add -v to print the ones that fall outside them (spoofed).
grepcidr -f <(jq -r '.prefixes[].ipv4Prefix // empty' /tmp/gptbot-ranges.json) /tmp/observed-gptbot-ips.txt

Across the sites I audit, between 5% and 15% of IPs claiming to be GPTBot fall outside OpenAI's published ranges. Those are spoofed. Treat them as you would any other unverified scraper — log them, rate-limit them, or block them, but do not count them as OpenAI activity.
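
If grepcidr is not available, the same comparison can be sketched with Python's standard `ipaddress` module. The sample prefixes and IPs below are hypothetical (documentation-reserved ranges, not OpenAI's real ones), and the `{"prefixes": [{"ipv4Prefix": ...}]}` shape is assumed from the jq expression above.

```python
import ipaddress
from typing import Iterable, List, Tuple


def parse_prefixes(ranges_doc: dict) -> List[ipaddress.IPv4Network]:
    """Extract IPv4 networks from a document shaped like {"prefixes": [{"ipv4Prefix": ...}]}."""
    return [
        ipaddress.ip_network(entry["ipv4Prefix"])
        for entry in ranges_doc.get("prefixes", [])
        if "ipv4Prefix" in entry
    ]


def split_verified(observed_ips: Iterable[str], networks) -> Tuple[list, list]:
    """Partition observed IPs into (verified, spoofed) by membership in the published networks."""
    verified, spoofed = [], []
    for raw in observed_ips:
        ip = ipaddress.ip_address(raw)
        (verified if any(ip in net for net in networks) else spoofed).append(raw)
    return verified, spoofed


# Hypothetical sample data: documentation-reserved ranges, not real crawler prefixes.
sample_ranges = {"prefixes": [{"ipv4Prefix": "192.0.2.0/28"}, {"ipv4Prefix": "198.51.100.0/24"}]}
sample_ips = ["192.0.2.5", "192.0.2.77", "198.51.100.200", "203.0.113.9"]

verified, spoofed = split_verified(sample_ips, parse_prefixes(sample_ranges))
spoof_rate = len(spoofed) / len(sample_ips)
print(verified, spoofed, spoof_rate)
```

In practice you would load the real ranges file with `json.load` and feed in the deduplicated IP list from your logs; the spoof rate is then the share of claimed-GPTBot IPs that fall outside every published prefix.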

Same for Google-Extended

Google-Extended is harder because it shares Googlebot's user-agent. The differentiating signal is the robots.txt token. You cannot detect Google-Extended in your access logs directly — you can only see all Googlebot hits, and you control whether they get fed into Vertex / Gemini training via the User-agent: Google-Extended directive in robots.txt [13].

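# Verify Googlebot hits are real Googlebot via forward-confirmed reverse DNS (IPv4; query AAAA too if you log IPv6)
for ip in $(grep "Googlebot" /var/log/nginx/access.log | awk '{print $1}' | sort -u); do
  host=$(dig +short -x "$ip" | head -n1)
  # dig returns the PTR name with a trailing dot; anchoring on it rejects names
  # such as crawl.googlebot.com.attacker.example
  if [[ "$host" == *.googlebot.com. || "$host" == *.google.com. ]] \
     && dig +short "$host" | grep -qxF "$ip"; then
    echo "$ip VERIFIED"   # PTR is in a Google domain and resolves back to the same IP
  else
    echo "$ip SPOOFED"
  fi
done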
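
The same check can be sketched in Python with a forward-confirmation step, since a PTR record on its own is controlled by whoever owns the IP. This is an illustration under assumptions: the function names are hypothetical, the suffix list mirrors the two domains used in the shell loop, and the resolvers are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List

ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")  # same two domains as the shell loop


def reverse_lookup(ip: str) -> str:
    """PTR lookup via the system resolver."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> List[str]:
    """IPv4 address lookup via the system resolver."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """True only if the PTR name ends in an allowed domain and resolves back to the same IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(ALLOWED_SUFFIXES):
            return False
        return ip in forward(host)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
```

Called as `is_verified_googlebot("66.249.66.1")` it uses live DNS; passing fake `reverse` and `forward` callables makes it testable offline.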

Cloudflare and Vercel: log fields that help

If you sit behind Cloudflare, the cf-verified-bot flag (available in Cloudflare's Logpush) tells you whether Cloudflare has verified the bot's identity. This saves you the reverse-DNS dance for the bots Cloudflare knows about. Vercel logs include x-vercel-forwarded-for and standard request headers; for AI bot detection there, the simplest path is to inspect User-Agent in middleware and write the classification to a custom log line. Vercel published a detailed pattern for this in late 2024 [23].

robots.txt vs WAF blocking vs Cloudflare AI Bot Block

Three layers of bot blocking, three different threat models. Most operators conflate them, and the resulting block is either too weak (robots.txt only, against a crawler that ignores it) or too strong (WAF block against good actors who would have respected a polite request).

Layer | What it does | Enforcement | Best for
robots.txt directive | Polite request; well-behaved crawlers respect it | None; it is a request, not a block | Bots that publicly commit to respecting robots.txt (GPTBot, ClaudeBot, Google-Extended)
Server / WAF rule | Returns 403 / 429 / blocks at edge based on UA or IP | Technical block (UA-only is bypassable; IP+UA is stronger) | Bots that ignore robots.txt or scrapers spoofing AI UAs
Cloudflare AI Bot Block (one-click) | Blocks Cloudflare's curated AI bot list at edge | Technical block, kept updated by Cloudflare | Sites that want zero AI training exposure with minimal config
Cloudflare Bot Management | ML-based bot scoring with custom rules | Technical block, plus heuristic detection of unknown bots | Enterprise sites with sophisticated bot threat
TDMRep / ai-content-usage signaling | Standardized opt-out signal | None; depends on crawler honoring the signal | EU compliance and forward-looking opt-out

A copy-pasteable robots.txt for "allow everything, log everything" (recommended default for SaaS)

# Default: allow all AI crawlers. We measure and instrument; we do not block.
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: CCBot
Allow: /

User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml

A copy-pasteable robots.txt for "block all AI training, allow live browse" (for paid publishers)

# Block AI training crawlers. Allow user-triggered live browse so brand remains reachable.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Cohere-AI
Disallow: /

User-agent: Diffbot
Disallow: /

# Allow live-browse user-triggered fetches so users asking ChatGPT/Claude/Perplexity
# to read a specific URL can still reach the page.
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
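
Before deploying either file, it can help to check that the rules say what you think they say. The sketch below uses Python's standard `urllib.robotparser` against an abbreviated copy of the paid-publisher policy; the helper name `allowed_agents` and the test URL are hypothetical. Python's parser is only an approximation of how any given crawler interprets the file, so treat the result as a sanity check, not a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the paid-publisher policy above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, agents, url: str) -> dict:
    """Map each user-agent token to whether this robots.txt lets it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "CCBot", "ChatGPT-User", "PerplexityBot"],
    "https://yoursite.com/articles/example",  # hypothetical URL
)
print(result)
```

Here the two training crawlers come back blocked, ChatGPT-User is allowed by its own group, and PerplexityBot falls through to the wildcard group.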

nginx WAF block for crawlers that ignore robots.txt

# /etc/nginx/conf.d/ai-bot-block.conf
# Use only if you have documented evidence a specific crawler is ignoring robots.txt.
map $http_user_agent $is_blocked_ai_bot {
    default 0;
    "~*GPTBot"           0;  # respect robots.txt; do not block at WAF
    "~*Bytespider"       1;  # block if you have a specific complaint
    "~*CCBot"            0;  # respect robots.txt
    "~*Diffbot"          0;
    "~*ImagesiftBot"     1;  # image scraping; block if not relevant
}

server {
    if ($is_blocked_ai_bot = 1) {
        return 403;
    }
    # ... rest of server config
}

Cloudflare WAF custom rule (UI-equivalent expression)

(http.user_agent contains "Bytespider")
or (http.user_agent contains "ImagesiftBot")
or (http.user_agent contains "Diffbot" and not http.user_agent contains "diffbot.com/help")

Action: Block (or Managed Challenge for soft-block).

AWS WAF rule (CloudFormation snippet)

- Name: BlockAggressiveAIBots
  Priority: 10
  Statement:
    OrStatement:
      Statements:
        - ByteMatchStatement:
            SearchString: "Bytespider"
            FieldToMatch:
              SingleHeader:
                Name: user-agent
            TextTransformations:
              - Priority: 0
                Type: NONE
            PositionalConstraint: CONTAINS
        - ByteMatchStatement:
            SearchString: "ImagesiftBot"
            FieldToMatch:
              SingleHeader:
                Name: user-agent
            TextTransformations:
              - Priority: 0
                Type: NONE
            PositionalConstraint: CONTAINS
  Action:
    Block: {}
  VisibilityConfig:
    SampledRequestsEnabled: true
    CloudWatchMetricsEnabled: true
    MetricName: BlockAggressiveAIBots

Cloudflare's one-click AI bot block

Cloudflare shipped a "Block AI Bots" toggle in mid-2024 [22] that sits in the Security → Bots panel and blocks Cloudflare's curated list of AI training crawlers without requiring you to maintain WAF rules. Adoption passed a million domains by mid-2025. The tradeoff is the same as the manual block: you lose training-corpus exposure for the brand. The one-click toggle does not block live-browse agents (ChatGPT-User, Claude-Web, Perplexity-User) by default; that requires a separate rule.

Method | Effort | Maintenance | Bypass risk | Coverage
robots.txt | 5 min | Low | High (bad-actor bots ignore) | Good crawlers only
nginx UA block | 30 min | Medium | Medium (UA-only, IP rotation bypasses) | UA-matched bots only
Cloudflare one-click AI bot block | 1 min | Zero | Low | Cloudflare's curated list
Cloudflare Bot Management ML | Enterprise plan | Zero | Very low | Known + unknown bots
AWS WAF + custom rules | 1-4 hr | Medium | Low | Whatever you write
IP-range block (verified) | 1-2 hr setup | Quarterly refresh | Very low | Bots with published ranges
ai-content-usage header (future) | 10 min | Low | Honored by major crawlers (eventually) | Crawlers that adopt standard

Should you block AI crawlers? (honest tradeoffs by site type)

This is the section that has eaten the most of my Discord and email time in 2026. Operators want a clean yes/no. The honest answer is "it depends on your business model and you should not block as a default." The decision tree by site type:

Site type | Block GPTBot? | Block ChatGPT-User? | Block CCBot? | Reasoning
Bootstrapped B2B SaaS | No | No | No | Need brand presence in trained corpus; ChatGPT citations drive measurable revenue
DTC ecommerce | No | No | No | Product discovery via AI engines is rising; blocking is a future-loss bet
Developer tools / OSS | No | No | No | Devs use ChatGPT for technical Q&A; being in the trained corpus is high-leverage
Content publisher (ad-supported) | Soft no (allow most) | No | Maybe block | Some publishers block CCBot but allow live-browse; tradeoff is volume vs control
Content publisher (paywalled) | Yes (block) | Allow free preview only | Yes (block) | Direct paywall conflict; OpenAI partnerships are licensing-deal terrain, not free crawl
News organization | Yes (block) | Maybe (allow free preview) | Yes (block) | NYT v. OpenAI precedent; license, do not give away
Local services / SMB | No | No | No | Negligible content value to AI labs; allow for brand discovery
Regulated healthcare / financial | No (with disclaimers) | No | No | YMYL content gets filtered by AI anyway; blocking does not help, instrumentation does
API documentation | No | No | No | Being in the trained corpus is the entire point of public API docs
Internal / authenticated content | N/A | N/A | N/A | Should already require auth; robots.txt is not the right defense

The argument for not blocking, summarized in one paragraph: most "AI bot blocking" guides assume your content has high standalone value to an AI lab and that you can capture that value by withholding it. For 95% of SaaS and ecommerce sites, that is not true. The content has value to your business because it drives leads, brand awareness, and product evaluations. The AI crawler that ingests your pricing page and the AI engine that later cites your pricing page are part of the same funnel. Blocking the first does not stop the second; it just makes the second less informed about your brand.

The argument for blocking, summarized in one paragraph: if you are a paid publisher whose business model is the sale of access to your content, AI training-corpus inclusion is direct competitive substitution. A reader who gets your reporting summarized inside ChatGPT is a reader who did not subscribe. Blocking GPTBot and ClaudeBot is a defensive move that preserves the licensing-deal option. The cost is loss of trained-corpus brand presence; the benefit is preserving the underlying commercial model.

The argument that nobody makes but should: allowing AI crawlers is a two-way door, not a one-way door. Allowing them now and deciding later to block costs almost nothing operationally; crawlers pick up the new robots.txt within days. What a later block cannot do is recall content that has already been ingested, which matters mainly if you expect to negotiate a license. For everyone else, the asymmetry argues for "allow by default, block when you have a specific reason."

Two case patterns

Case one: a developer tools company I work with that blocked GPTBot in early 2024 "just to be safe." Twelve months later they noticed their ChatGPT citation rate was declining relative to peers in the same category. They unblocked. Citation rate recovered within ~30 days as the new crawls fed the next training cycle. The 12-month "protection" did not protect anything; it just put them behind on brand presence inside model knowledge.

Case two: a paid publisher who left GPTBot allowed through 2024. When OpenAI's publisher-partnership program offered to license content for direct citation rights, the publisher had no leverage — their content was already in the training corpus. They blocked retroactively, but a year of crawls had already been ingested. The retrospective lesson: if you ever plan to negotiate a content license with an AI lab, the leverage starts on day one with the block, not on day 365 with the negotiation.

Measuring crawl frequency as a citation-readiness signal

This is the part of the playbook I have not seen anyone else describe in detail, and it is where AI crawl logging pays off most.

The basic insight: AI crawler traffic on a specific URL has predictive signal for whether the URL is about to start receiving human AI-referred traffic. Not perfect signal — many bot hits do not produce citations, and many citations come from URLs that have not been freshly crawled — but a real, exploitable pattern with consistent shape.

The four-signal model I run on every Attrifast install

Signal | Source | Lead time vs human traffic | Interpretation
GPTBot crawl frequency (rolling 7-day) | Server logs | 7-30 days | Training-corpus inclusion; slow-moving baseline
ChatGPT-User fetch frequency (rolling 24h) | Server logs | 18-72 hours | Live-citation activity; fast-moving real-time signal
OAI-SearchBot crawl frequency (rolling 7-day) | Server logs | 3-14 days | Search-index inclusion; medium-moving
PerplexityBot + Perplexity-User combined | Server logs | 12-48 hours | Perplexity-specific citation activity

For each URL on the site I track all four. The Citation Readiness score for a URL is a weighted blend, with ChatGPT-User and Perplexity-User carrying the highest weight (because they are user-triggered and most strongly correlated with active citations).
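
One way such a weighted blend could look, as a sketch only: the weights and saturation caps below are hypothetical placeholders, not the values Attrifast uses. The only property taken from the text is that the user-triggered signals carry the most weight.

```python
from dataclasses import dataclass


@dataclass
class UrlSignals:
    gptbot_7d: int            # GPTBot hits, rolling 7 days
    chatgpt_user_24h: int     # ChatGPT-User fetches, rolling 24 hours
    oai_searchbot_7d: int     # OAI-SearchBot hits, rolling 7 days
    perplexity_combined: int  # PerplexityBot + Perplexity-User hits


# Hypothetical weights (sum to 100); user-triggered signals weighted highest.
WEIGHTS = {
    "gptbot_7d": 10,
    "chatgpt_user_24h": 40,
    "oai_searchbot_7d": 20,
    "perplexity_combined": 30,
}
# Hypothetical saturation points: counts at or above the cap earn the full weight.
CAPS = {
    "gptbot_7d": 20,
    "chatgpt_user_24h": 20,
    "oai_searchbot_7d": 20,
    "perplexity_combined": 20,
}


def citation_readiness(signals: UrlSignals) -> float:
    """Blend the four signals into a 0-100 score."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        raw = getattr(signals, name)
        score += weight * min(raw / CAPS[name], 1.0)
    return round(score, 1)


print(citation_readiness(UrlSignals(gptbot_7d=2, chatgpt_user_24h=10,
                                    oai_searchbot_7d=0, perplexity_combined=5)))
```

Capping each signal keeps one noisy crawler from dominating the score; the right caps and weights depend on your own traffic and should be tuned against observed human AI-referral data.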

A real pattern from one site

I cannot share absolute numbers but the shape is stable enough to describe. A bootstrapped SaaS site published a new commercial page on a Tuesday. The first GPTBot hit landed Wednesday. The first ChatGPT-User hit landed Friday afternoon: a single fetch from one IP. The following Monday, ChatGPT-User hits jumped to 11 in a single day and the first 2 human AI-referred sessions appeared; by Wednesday, human AI-referred traffic on that URL had reached 47 sessions. The first ChatGPT-User fetch led the first human sessions by 3 days and the human peak by 5.

Day | GPTBot hits | ChatGPT-User hits | Human AI-referred sessions
Tue (publish) | 0 | 0 | 0
Wed | 1 | 0 | 0
Thu | 0 | 0 | 0
Fri | 0 | 1 | 0
Sat | 1 | 0 | 0
Sun | 0 | 0 | 0
Mon | 0 | 11 | 2
Tue | 1 | 18 | 14
Wed | 0 | 22 | 47
Thu | 2 | 19 | 38

The bot-signal column does not always lead the human-signal column this cleanly. Variance is high. But over a population of pages and over a long enough window, the directional pattern holds: ChatGPT-User hits cluster a few days before human AI-referrals climb.
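
To make "lead" concrete, here is a small sketch that measures it on the table above as the gap between the days on which each series first reaches a threshold. The threshold definition and function names are assumptions for illustration; other definitions (peak-to-peak, cross-correlation) would give different numbers on the same data.

```python
from typing import Optional, Sequence

# Daily counts from the table above, publish Tuesday through the following Thursday.
CHATGPT_USER_HITS = [0, 0, 0, 1, 0, 0, 11, 18, 22, 19]
HUMAN_AI_SESSIONS = [0, 0, 0, 0, 0, 0, 2, 14, 47, 38]


def first_day_at_or_above(series: Sequence[int], threshold: int) -> Optional[int]:
    """Index of the first day whose count reaches the threshold, or None if it never does."""
    for day_index, count in enumerate(series):
        if count >= threshold:
            return day_index
    return None


def lead_days(bot: Sequence[int], human: Sequence[int], threshold: int) -> Optional[int]:
    """Days by which the bot series reaches the threshold before the human series does."""
    bot_day = first_day_at_or_above(bot, threshold)
    human_day = first_day_at_or_above(human, threshold)
    if bot_day is None or human_day is None:
        return None
    return human_day - bot_day


first_touch_lead = lead_days(CHATGPT_USER_HITS, HUMAN_AI_SESSIONS, threshold=1)
spike_lead = lead_days(CHATGPT_USER_HITS, HUMAN_AI_SESSIONS, threshold=10)
print(first_touch_lead, spike_lead)
```

Running the same function per URL across a population of pages gives a distribution of lead times rather than a single anecdote, which is the form in which the pattern is actually usable.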

Why this matters for content strategy

Two operational uses for the signal:

Use one: prioritize content investment by Citation Readiness. Pages with high ChatGPT-User activity and rising human AI-referrals are the pages that deserve next-round content investment (deeper FAQ, more schema, expanded comparison sections). Pages with high crawl activity and zero human conversion are pages that are being fetched but not cited — different problem, different fix (usually content quality or page structure).

Use two: catch broken citation pipelines fast. When a previously high-citation page suddenly drops in ChatGPT-User activity, it is usually one of three things: the page has been deprioritized in the model's index, the page is broken (404, 500, redirect loop the bot cannot follow), or a competitor has displaced you in the citation slot. All three are recoverable if caught in days, very expensive if caught in months.

AI crawler bandwidth and infra cost benchmarks

The "AI bots are eating my bandwidth" panic is mostly overstated, but it is real on a specific set of site shapes. Here are the benchmarks I have collected.

Typical AI bot bandwidth as % of total traffic

| Site type | AI bot bandwidth as % of total | Top contributor |
| --- | --- | --- |
| Small SaaS (under 1M pageviews/mo) | 0.1-0.8% | GPTBot |
| Mid SaaS (1M-10M pageviews/mo) | 0.3-1.5% | GPTBot |
| Large SaaS / content site (10M+) | 0.5-3% | GPTBot + ClaudeBot + Bytespider |
| DTC ecommerce (small-mid) | 0.4-2% | GPTBot + Amazonbot |
| DTC ecommerce (large catalog) | 1-5% | Bytespider + GPTBot |
| Content publisher | 1-4% | CCBot + GPTBot |
| Developer documentation | 2-8% | ClaudeBot + GPTBot |
| Job board / classifieds | 3-12% | Bytespider + GPTBot |
| Real estate / listings | 4-15% | Bytespider + Diffbot |
| Wikipedia-shaped reference | 5-20% | CCBot + GPTBot + Bytespider |

The bottom rows are where the bandwidth bills start to bite. Large catalog sites, listing sites with faceted filters, and reference-style content that has high per-page uniqueness all attract more aggressive AI crawler behavior because the per-page training value is higher.

Cost in concrete dollars (rough estimates)

For a site on Vercel Pro at $20/mo + $0.40/GB egress over the included allowance:

| Site profile | Monthly pageviews | AI bot share | AI bot extra GB/mo | AI bot extra $/mo |
| --- | --- | --- | --- | --- |
| Small SaaS | 200,000 | 0.5% | <1 GB | <$1 |
| Mid SaaS | 2,000,000 | 1% | ~5-10 GB | $2-4 |
| Large content site | 20,000,000 | 2% | ~100-200 GB | $40-80 |
| Job board | 5,000,000 | 8% | ~150-300 GB | $60-120 |
| Listings site | 10,000,000 | 12% | ~500-1000 GB | $200-400 |

For the bottom two rows the bandwidth cost starts mattering. For the top three rows it is rounding error.
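The dollar column is just extra gigabytes multiplied by the overage rate. A small sketch that reproduces the arithmetic, so you can substitute your own host's rate; the $0.40/GB figure is the one quoted above, and the function name is illustrative.

```python
EGRESS_USD_PER_GB = 0.40  # overage rate quoted in the text above


def ai_bot_egress_cost(extra_gb_low, extra_gb_high, rate=EGRESS_USD_PER_GB):
    """Return the (low, high) monthly overage cost in USD for a GB range."""
    return (round(extra_gb_low * rate, 2), round(extra_gb_high * rate, 2))


mid_saas = ai_bot_egress_cost(5, 10)          # Mid SaaS row
job_board = ai_bot_egress_cost(150, 300)      # Job board row
listings = ai_bot_egress_cost(500, 1000)      # Listings site row
```

Pass a different `rate` to model another provider; the GB estimates themselves remain rough.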

When AI crawlers actually create a problem

| Problem | Root cause | Fix |
| --- | --- | --- |
| Crawler hits uncached dynamic endpoint repeatedly | Catalog / search / faceted-filter URLs with no cache headers | Set Cache-Control: public, s-maxage=86400 on bot-safe pages |
| Crawler triggers expensive DB queries | Faceted filter URLs hitting search infrastructure | Rate-limit AI crawlers on filter endpoints; serve static fallback |
| Crawler downloads large media (PDF, images, video) | No Disallow on heavy media paths | Add a robots.txt group: User-agent: GPTBot, then on the next line Disallow: /media/large/ |
| Crawler floods 404 on stale sitemap URLs | Old URLs in sitemap that have been removed | Refresh sitemap; ensure 410 not 404 on permanent removals |
| Bytespider hitting 100+ requests / minute | Bytespider's historic aggressive crawl pattern | Rate-limit per ASN at edge; or block entirely |

The single highest-leverage fix for AI crawler cost on most sites is aggressive caching on the URLs they hit most. The crawl pattern is highly cacheable — AI bots tend to refetch the same URL list on a schedule, and a 24-hour CDN cache absorbs nearly all of that load.
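As a rough illustration of the caching fix, the helper below decides which responses get the 24-hour shared-cache header from the table above. The path prefixes are hypothetical; "bot-safe" here means pages whose content does not vary per user, which you would need to define for your own site.

```python
# Hypothetical prefixes for pages that are identical for every visitor.
BOT_SAFE_PREFIXES = ("/blog/", "/docs/", "/catalog/")


def cache_headers(path, status=200, prefixes=BOT_SAFE_PREFIXES):
    """Return extra response headers for a request path.

    Successful responses under a bot-safe prefix get a 24-hour shared-cache
    lifetime (s-maxage applies to CDNs, not browsers). Everything else is
    left untouched so existing cache policy still applies.
    """
    if status == 200 and path.startswith(prefixes):
        return {"Cache-Control": "public, s-maxage=86400"}
    return {}


blog_headers = cache_headers("/blog/ai-crawlers")
account_headers = cache_headers("/account/billing")
```

In a real deployment the same rule usually lives in framework middleware or CDN configuration; the function only shows the decision.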

The 2026 IETF "ai-content-usage" preferences proposal

The mess of per-crawler robots.txt rules and ad hoc WAF blocks is what the IETF AI Preferences working group [27] is trying to fix. The proposal that has emerged through 2025-2026 is a single HTTP header and robots.txt directive — Content-Usage for the header and ai-content-usage in robots.txt — that lets a site signal consent or refusal for several distinct downstream uses of its content.

The use categories under discussion (subject to change):

| Use category | Description | Example crawler that consumes it |
| --- | --- | --- |
| train-ai | Use page content as training data for AI models | GPTBot, ClaudeBot, Google-Extended |
| train-genai | Use specifically for generative AI training | Subset of above |
| search | Include in search indexes | Googlebot, Bingbot, OAI-SearchBot |
| inference | Use for retrieval-augmented generation at inference time | ChatGPT-User, Perplexity-User, Claude-Web |
| summarize | Use for AI-generated summaries | Various |
| tdm-mining | EU TDM Reservation directive coverage | All AI training |

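The signal is intended to be respected by all participating crawlers, with the goal that operators can write a single ai-content-usage line in robots.txt instead of maintaining a dozen per-crawler User-agent blocks. The TDMRep protocol [26] addresses a closely related EU-jurisdiction problem: it expresses the text-and-data-mining rights reservation defined in Article 4 of the EU Directive on Copyright in the Digital Single Market, which the EU AI Act in turn requires general-purpose model providers to respect.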

Example robots.txt directives under the proposal

# Three alternative configurations. Use one; do not combine them in a single file.

# Option A: refuse AI training, allow search and inference (live citation)
User-agent: *
ai-content-usage: search, inference
ai-content-usage-deny: train-ai, train-genai, tdm-mining

# Option B: allow everything (recommended default for most SaaS / ecommerce)
User-agent: *
ai-content-usage: train-ai, train-genai, search, inference, summarize

# Option C (selective): allow search and inference, but train-genai only on /blog/, not on /paid/
# Path-level scoping is not settled in the proposal, and RFC 9309 [21] parsers
# merge repeated User-agent: * groups, so treat this layout as illustrative only.
User-agent: *
ai-content-usage: search, inference

User-agent: *
Disallow: /paid/

User-agent: *
Allow: /blog/
ai-content-usage: train-genai, search, inference

Adoption status

| Crawler | Public position on ai-content-usage proposal |
| --- | --- |
| GPTBot | Public signal of intent to honor [1] |
| ClaudeBot | Public signal of intent to honor [12] |
| Google-Extended | Public signal of intent to honor [13] |
| PerplexityBot | Tentative; awaiting standardization [16] |
| Applebot-Extended | Signaled support [17] |
| Bytespider | No public position |
| Meta-ExternalAgent | Tentative |
| CCBot | Common Crawl is an upstream dataset; the standard applies to the labs that consume it, not CC itself |

Until the proposal stabilizes (likely 2026-2027 timeline given the IETF process), the practical advice is to maintain per-crawler robots.txt directives as the primary control surface, with TDMRep tdm-reservation set for EU compliance, and ai-content-usage added as an additional polite-signal layer for forward compatibility.
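If per-crawler directives remain the primary control surface, generating robots.txt from one small mapping keeps the blocks consistent as the crawler list grows. A minimal sketch: the crawler tokens are ones named in this article, while the paths, the mapping, and the function name are illustrative. It deliberately leaves out the TDMRep and ai-content-usage layers, whose syntax is less settled.

```python
def render_robots(rules, sitemap=None):
    """Render robots.txt text from {user_agent_token: [disallowed paths]}.

    An empty path list means the crawler is explicitly allowed everywhere.
    """
    lines = []
    for user_agent, disallowed in rules.items():
        lines.append(f"User-agent: {user_agent}")
        if disallowed:
            lines.extend(f"Disallow: {path}" for path in disallowed)
        else:
            lines.append("Allow: /")
        lines.append("")  # blank line separates groups
    if sitemap:
        lines.append(f"Sitemap: {sitemap}")
    return "\n".join(lines).rstrip("\n") + "\n"


# Illustrative policy: keep training crawlers out of /paid/, allow the rest.
policy = {
    "GPTBot": ["/paid/"],
    "ClaudeBot": ["/paid/"],
    "CCBot": ["/paid/"],
    "OAI-SearchBot": [],
}
robots_txt = render_robots(policy, sitemap="https://example.com/sitemap.xml")
```

Keeping the policy in one mapping also makes it easy to diff changes in version control before they reach production.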

Common AI crawler tracking mistakes

Eight mistakes I see often enough to call them patterns.

Mistake 1: Treating bot hits as traffic. A 10x spike in GPTBot is not 10x more users. It is OpenAI's crawler doing scheduled work. Fix: separate Bot Activity from Traffic in every chart. Never mix.

Mistake 2: Trusting the User-Agent string without IP verification. On the sites I audit, between 5% and 15% of UA-claimed GPTBot hits are spoofed. Fix: check the source IP against OpenAI's published ranges, or rely on Cloudflare's verified-bot signal (exposed as cf.client.bot in rule expressions).

Mistake 3: Blocking GPTBot believing it stops ChatGPT citations. It blocks training-corpus inclusion only. ChatGPT-User and OAI-SearchBot keep working. Fix: understand the three-pipeline model before blocking anything.

Mistake 4: Forgetting Google-Extended is configured by robots.txt token, not UA. Google-Extended hits show as Googlebot in your logs. You cannot detect them in access logs alone. Fix: control via User-agent: Google-Extended directive in robots.txt; do not expect a unique UA string.

Mistake 5: Confusing crawl frequency with citation frequency. A page that is crawled often is not necessarily cited often. Many crawls produce zero citations. Fix: instrument both signals separately and watch the correlation, not the count.

Mistake 6: Allowing Bytespider unbounded on a large-catalog site. Bytespider has historically been the most aggressive AI crawler on listing-shaped sites, sometimes hitting hundreds of requests per minute. Fix: rate-limit per ASN or block at WAF if the bandwidth bill is real.

Mistake 7: Setting a robots.txt block on /blog/ for AI bots. This is the inverse of the right move. Your blog is the most citation-friendly content you publish; blocking AI from your blog while allowing access to your pricing page makes the model less informed about your category content but still able to quote your prices. Fix: invert. If you must block, block /paid/ or /pricing/details/, not /blog/.

Mistake 8: Ignoring the human-AI-referral side entirely. Bot tracking without human tracking is a dashboard that never shows revenue. The crawler hits are the leading indicator; the human clicks and the Stripe webhook are the lagging signals that pay for the work. Fix: instrument both sides, in one view if possible.

What this looks like in practice on Attrifast

A short note on the product because the article cannot pretend the author has no interest. Attrifast tracks AI crawler hits per URL per day in the same dashboard where it shows human AI-referred sessions and the Stripe revenue join. The crawler view is grouped by engine (OpenAI, Anthropic, Perplexity, Google, Apple, Meta, ByteDance) with per-bot breakdowns inside each engine. The human view is grouped by AI source domain (chatgpt.com, perplexity.ai, claude.ai, gemini.google.com) with referer-passed and referer-stripped separated. The Citation Readiness signal correlates the two over time.

The four pieces the product owns:

  1. Edge detection of AI bot hits (UA matching + IP verification)
  2. First-party detection of human AI-referrals (referer fingerprinting + behavioral inference)
  3. Stripe webhook join from session to payment
  4. A single dashboard showing all of the above with per-URL drill-downs

Cost: $29/mo for the base tier, which covers up to a stated session and bot-hit volume. The Stripe connection is OAuth, not API key. The tracking script is 4 KB, cookieless, ships without a consent banner under most jurisdictions (verify per your privacy review). The whole pitch is detailed on the cookieless revenue analytics page and the per-engine breakouts at track ChatGPT traffic, track Perplexity traffic, track Claude traffic, track Gemini traffic, and track AI Overviews.

The first-person reason I built it is that I was the operator in 2024 staring at GPTBot in my access logs, wondering whether the crawl spike on my pricing page meant anything, and finding that nothing in my analytics stack would tell me whether the human traffic 48 hours later was related. Now it does.

Limitations

Five things this article does not cover, and that you should not extrapolate to.

  • Voice-mode AI citations. When ChatGPT voice or Gemini voice answers a query and never renders a clickable URL, no fetch is made and no citation is logged in the way this article describes. The brand mention happens; the crawler signal does not. No reliable measurement story exists for voice-mode citation behavior in 2026.
  • Enterprise tenants. ChatGPT Enterprise, Claude for Work, and Microsoft Copilot for organizations may use different fetch behaviors from consumer surfaces. The crawler IPs and UAs are not always the same as the public ranges. Treat the consumer-surface measurements in this article as a lower bound for enterprise behavior.
  • MCP and tool-using agents. Model Context Protocol-based agents and tool-using agents from frameworks like LangChain, AutoGPT, and CrewAI sometimes fetch URLs directly via headless browsers with generic UAs. These are functionally AI-driven fetches but invisible to UA-based bot detection. The right detection is behavioral, not UA-based, and remains an open problem.
  • Region and language variance. Most of the percentages in this article are aggregated from US English sites. APAC and EU patterns differ — Bytespider is more dominant in APAC, ClaudeBot less so. Take the share percentages as directional, not canonical for your geography.
  • The "did blocking GPTBot get me cited less?" counterfactual. I have anecdotes from individual sites but no controlled study. The cleanest available evidence is the developer-tools company case I described, where unblocking restored citation rate within ~30 days. Treat that as one data point, not as a quantified causal claim.

FAQ

What is the difference between GPTBot, ChatGPT-User, and OAI-SearchBot?

GPTBot is OpenAI's training crawler: it scrapes pages to add to future model training corpora and respects robots.txt. ChatGPT-User is the live browse agent that fetches a URL on demand when a user (or the model on the user's behalf) asks ChatGPT to read a specific page; because those fetches are user-initiated, OpenAI's docs say robots.txt rules may not apply to them. OAI-SearchBot powers the ChatGPT search index that launched October 2024 and behaves more like a traditional search crawler. The three serve different jobs and a single robots.txt block on one does not block the others. Most sites should distinguish them in logs because a GPTBot spike is a training-corpus signal, a ChatGPT-User spike is a live-citation signal, and an OAI-SearchBot spike is a search-index signal.

Should I block GPTBot to protect my content from being used to train AI?

Probably not, but the honest answer depends on what you sell. Blocking GPTBot via robots.txt removes you from future training corpora. It does not block ChatGPT-User, so ChatGPT can still fetch and cite your pages live on user request. The net effect of blocking GPTBot is that the model's baseline knowledge of your brand slowly degrades while live-browse citations continue. For most SaaS and ecommerce sites this is the wrong trade because baseline brand recall in the model is what gets you cited for queries the user asks without browse mode active. Block GPTBot only if you have a specific legal, contractual, or content-monetization reason: for example, a publisher whose business model depends on paywalled content, or a brand under explicit DMCA pressure.

Does robots.txt actually stop AI crawlers from training on my site?

It stops the crawlers that respect it. GPTBot, ClaudeBot, Google-Extended, PerplexityBot, and Applebot-Extended all publicly commit to respecting robots.txt and the historical evidence is that they do. It does not stop bad-actor scrapers, third-party crawlers selling data to AI labs, archived copies on the Wayback Machine, third-party citations that quote you verbatim, or content already in published training sets. Robots.txt is a polite request honored by the largest crawlers; it is not a technical enforcement mechanism. If you need enforcement, you need WAF rules, Cloudflare's AI bot blocking, or IP-level blocks at the edge — and even those have known bypass paths.

How can I tell if an AI crawler hit is real or a spoofed user-agent?

Check the source IP against the operator's published ranges and, where the operator supports it, confirm with a reverse-DNS lookup followed by a forward-DNS lookup on the returned hostname. OpenAI publishes its GPTBot IP ranges at openai.com/gptbot.json, Google publishes its crawler ranges at developers.google.com/search/apis/ipranges/googlebot.json plus a separate special-crawlers list, and Anthropic publishes ClaudeBot IPs in its documentation. A request claiming to be GPTBot from an IP outside OpenAI's published ranges is either spoofed or from a third-party crawler imitating the user-agent. In practice, 5-15% of traffic claiming to be a known AI bot on the sites I audit is spoofed, mostly from scrapers trying to evade rate limits. Always validate by IP, not by user-agent string alone.
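The IP check itself is a few lines with Python's standard ipaddress module. This sketch assumes the published file has already been downloaded and parsed, and that it uses the `{"prefixes": [{"ipv4Prefix": ...}, {"ipv6Prefix": ...}]}` shape of Google's googlebot.json; confirm the shape for each vendor before relying on it. The CIDR blocks below are reserved documentation ranges, not real crawler ranges.

```python
import ipaddress


def load_prefixes(ranges_doc):
    """Turn a parsed ranges document into a list of ip_network objects."""
    networks = []
    for entry in ranges_doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_verified(ip, networks):
    """True if the request IP falls inside any published range."""
    addr = ipaddress.ip_address(ip)
    # Membership across IP versions is simply False, so mixed lists are safe.
    return any(addr in network for network in networks)


# Placeholder document using reserved documentation ranges (RFC 5737 / RFC 3849).
sample_doc = {
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/24"},
        {"ipv6Prefix": "2001:db8::/32"},
    ]
}
published = load_prefixes(sample_doc)
claimed_gptbot_ok = is_verified("192.0.2.10", published)
claimed_gptbot_spoofed = is_verified("198.51.100.7", published)
```

A request whose user-agent claims to be a known bot but fails this check is best logged separately, since it is neither real crawler activity nor human traffic.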

What is the cost in bandwidth of allowing AI crawlers?

Smaller than most operators fear and growing. Across the sites I monitor, AI crawlers together account for 0.5-4% of total bandwidth on content-heavy sites, with GPTBot the largest single share at typically 30-50% of AI-crawler bandwidth. The Cloudflare Radar AI bot dashboard pegs AI bot share of total bot traffic at roughly 4-6% in 2024-2025 with steady growth quarter over quarter. On a typical $20-50/mo VPS the marginal cost is negligible. On large static sites behind a CDN the cost is also negligible because the CDN absorbs most crawler load. The cost becomes real when AI crawlers hit uncached dynamic endpoints — large catalog sites, listing pages with faceted filters, search endpoints. For those, cache the AI-crawler responses aggressively or rate-limit per ASN.

Is there a single way to opt out of all AI training at once?

Not as of mid-2026, but the IETF AI Preferences working group's ai-content-usage proposal is the closest path. It defines a single HTTP header and robots.txt directive, Content-Usage and ai-content-usage respectively, that signals consent or refusal for AI training, search-index inclusion, and other downstream uses. Major crawlers including Google-Extended, GPTBot, and ClaudeBot have signaled intent to honor the standard once it stabilizes. Until then you need per-crawler robots.txt blocks for each AI bot you want to exclude, plus WAF rules for crawlers that ignore robots.txt. TDMRep (the Text and Data Mining Reservation Protocol) offers a similar opt-out signal for EU jurisdictions, anchored in the text-and-data-mining exception of the EU Copyright in the Digital Single Market Directive, which the EU AI Act requires model providers to respect.

Does blocking GPTBot stop ChatGPT from citing me?

No, and this is the single most common misunderstanding I see in the bot-blocking discourse. ChatGPT cites pages via three pipelines: the trained corpus (informed by GPTBot crawls), the live browse fetch (ChatGPT-User), and the search index (OAI-SearchBot). Blocking GPTBot only stops the first. A user who asks ChatGPT to read your page, or who searches via ChatGPT's search interface, can still trigger a fetch and a citation. The user is the source of the request, not OpenAI. Blocking GPTBot makes you slightly less likely to be cited in answers the model produces from pure trained knowledge but does not remove you from active answers.

How do I tell an AI crawler hit from a human AI-referred visit in my logs?

Two different log signals. AI crawler hits show a known bot user-agent (GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, OAI-SearchBot) and usually no Referer header. Google-Extended is missing from that list on purpose: it is a robots.txt token with no user-agent string of its own, so it never appears in access logs. Human AI-referred visits show a normal browser user-agent (Chrome, Safari, Firefox on a real OS) and either a Referer of chatgpt.com / perplexity.ai / claude.ai / gemini.google.com (15-20% of the time) or an empty Referer (the majority). The right shape for a log analyzer is two completely separate views: a Bot Activity view that includes all AI crawlers, and a Traffic view that excludes them. Mixing the two is how operators end up reporting 10x traffic spikes that are actually crawler bursts.
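The two-view split can be sketched as a single classification step applied to each log line. This is a minimal illustration: the label strings and function name are made up for the example, matching is on the user-agent string only (so it should be paired with IP verification), and human AI-referred visits that arrive with an empty Referer will land in the generic bucket.

```python
from urllib.parse import urlparse

# Crawler tokens that appear as user-agent substrings in access logs.
AI_BOT_TOKENS = (
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "PerplexityBot", "Perplexity-User",
)
AI_REFERRER_HOSTS = {"chatgpt.com", "perplexity.ai", "claude.ai", "gemini.google.com"}


def classify_hit(user_agent, referer=""):
    """Label one log line as bot activity, AI-referred human, or other."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in AI_BOT_TOKENS):
        return "bot_activity"
    host = (urlparse(referer).hostname or "").removeprefix("www.")
    if host in AI_REFERRER_HOSTS:
        return "human_ai_referred"
    return "other_traffic"


bot = classify_hit("Mozilla/5.0 AppleWebKit/537.36; compatible; GPTBot/1.2; +https://openai.com/gptbot")
human = classify_hit(
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
    "https://www.perplexity.ai/search?q=example",
)
```

Checking the bot tokens first means a crawler that happens to send a Referer still lands in the Bot Activity view, never in Traffic.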

What is the difference between PerplexityBot and Perplexity-User?

PerplexityBot is Perplexity's scheduled search-index crawler, similar in role to OAI-SearchBot for ChatGPT. Perplexity-User is the live fetch agent that retrieves a URL when a Perplexity user (or the Perplexity model on their behalf) requests it. Per Perplexity's published policy, PerplexityBot respects robots.txt; Perplexity-User does not, on the grounds that user-triggered fetches are equivalent to a user typing the URL into a browser. Multiple site operators have reported PerplexityBot apparently ignoring robots.txt directives despite the policy, which Perplexity has publicly denied. Verify on your own logs and use WAF blocks if you observe non-compliance.

How often should I refresh AI crawler IP ranges?

Quarterly is the safe cadence. OpenAI, Google, and Anthropic update their published IP ranges occasionally (every few months in practice) when they rotate infrastructure. If you rely on Cloudflare's verified-bot signal for verification, you do not need to maintain your own copy of the IP ranges because Cloudflare maintains them for you. If you are running your own IP-allowlist-based verification on bare nginx or AWS WAF, set a calendar reminder to refresh the JSON files quarterly.

Does Common Crawl (CCBot) count as an AI crawler?

Sort of. CCBot itself is not an AI crawler — it is Common Crawl, a nonprofit that publishes a freely-available web archive. But the Common Crawl dataset is the upstream source for many AI training pipelines, including most academic LLMs, some Hugging Face models, and historically parts of OpenAI's GPT-3 training. Blocking CCBot is a way to opt out of being included in the public dataset that downstream AI labs draw from. It is broader-than-necessary if you only want to opt out of one specific lab; it is narrower-than-necessary if you want to opt out of all AI training (because the labs also run their own direct crawlers). Most paid-publisher blocking lists include CCBot for completeness.

What is llms.txt and how does it relate to robots.txt?

llms.txt is a small markdown file at your site root (analogous to sitemap.xml or humans.txt) that lists your most LLM-relevant pages with short descriptions. It is a positive opt-in signal — "here are the pages I want LLMs to read" — rather than the opt-out / restriction nature of robots.txt. Adoption is around 7% of public SaaS sites as of Q1 2026. ChatGPT, Perplexity, and Claude crawlers all read it when present. It does not replace robots.txt; it complements it. The 30-minute investment is one of the highest-leverage one-time AEO moves available.
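For a sense of what the file contains, the sketch below renders a minimal llms.txt in the layout described at llmstxt.org [25]: an H1 with the site name, a blockquote summary, then H2 sections holding link lists with short notes. The site name, URLs, and descriptions are placeholders.

```python
def render_llms_txt(site_name, summary, sections):
    """Render llms.txt markdown.

    sections maps a heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip("\n") + "\n"


# Placeholder content for illustration only.
llms_txt = render_llms_txt(
    "Example SaaS",
    "Revenue analytics for small SaaS teams.",
    {
        "Docs": [
            ("Quickstart", "https://example.com/docs/quickstart", "Install and first report"),
        ],
        "Guides": [
            ("AI crawler tracking", "https://example.com/blog/ai-crawlers", "User-agents and robots.txt"),
        ],
    },
)
```

Serve the result as plain text at /llms.txt on the site root, alongside robots.txt.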

Can AI crawlers index pages behind a login wall?

No, not legitimately. AI crawlers that respect robots.txt also respect HTTP 401 / 403 responses on authenticated content. If your pages require login, they should return 401 to unauthenticated requests, including AI bot requests, and the bots will skip them. The exception is if you accidentally expose authenticated content via a leaked URL with a session token in the query string — in which case the bot can index it and the leak is the underlying bug. If you see authenticated content appearing in ChatGPT, the most likely explanation is a URL-leak issue, not the bot defeating your auth.

What happens if a previously-crawled page is removed or returns 404?

Removed pages drop out of training corpora over subsequent training cycles (months to a year). They drop out of live citations faster — typically within days to weeks, as the AI engine's index refreshes and notices the 404. For permanent removals, return HTTP 410 Gone instead of 404; this is a clearer signal to crawlers that the URL is intentionally removed and will not return. For temporary outages, do not let 5xx errors persist on URLs you care about — AI crawlers may interpret repeated 5xx as low-quality signal and deprioritize the URL in future indexes.

Related reading from the Attrifast research stack

For a deeper dive on related ground, see How to Verify AI Crawlers.

References

  1. OpenAI. "Overview of OpenAI's bots and how to control them." https://platform.openai.com/docs/bots
  2. OpenAI. "GPTBot - Frequently asked questions." https://platform.openai.com/docs/gptbot
  3. Plausible Analytics. "How to track ChatGPT and AI search traffic." April 2024. https://plausible.io/blog/chatgpt-traffic
  4. OpenAI. "Introducing ChatGPT search." October 31, 2024. https://openai.com/index/introducing-chatgpt-search/
  5. Cloudflare Radar. "AI Insights and bot traffic dashboard." https://radar.cloudflare.com/ai-insights
  6. Search Engine Land. "AI Overviews coverage tracking, 2024-2026." https://searchengineland.com/library/google/google-ai-overviews
  7. Stripe Docs. "Checkout Session metadata field." https://docs.stripe.com/api/checkout/sessions/object#checkout_session_object-metadata
  8. MDN Web Docs. "Referer header reference." https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer
  9. The Verge. "ChatGPT hits 1 billion daily messages." December 2024. https://www.theverge.com/2024/12/04/24313097/chatgpt-1-billion-messages-daily-openai
  10. Search Engine Land. "Google AI Overviews coverage tracking, 2024-2026." https://searchengineland.com/library/google/google-ai-overviews
  11. The Verge. "Cloudflare lets sites block AI bots with one click." July 2024. https://www.theverge.com/2024/7/3/24191671/cloudflare-block-ai-bots-scraper-crawler-one-click
  12. Anthropic. "Does Anthropic crawl data from the web and how can site owners block the crawler?" https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
  13. Google Developers. "Overview of Google crawlers and fetchers (Google-Extended documentation)." https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
  14. Google Search Central. "An updated commitment to web publisher control and Google-Extended." https://blog.google/technology/ai/an-update-on-web-publisher-controls/
  15. Profound. "Profound research: how AI engines cite content." https://www.tryprofound.com/blog
  16. Perplexity. "PerplexityBot documentation." https://docs.perplexity.ai/guides/bots
  17. Apple. "About Applebot and Applebot-Extended." https://support.apple.com/en-us/119829
  18. ByteDance. "Bytespider crawler documentation." https://bytedance.com/bytespider
  19. Meta. "Meta web crawlers documentation." https://developers.facebook.com/docs/sharing/webmasters/crawler/
  20. Common Crawl Foundation. "CCBot documentation and robots.txt guidance." https://commoncrawl.org/ccbot
  21. IETF. "RFC 9309 - Robots Exclusion Protocol." September 2022. https://www.rfc-editor.org/rfc/rfc9309.html
  22. Cloudflare. "Declare your AIndependence: block AI bots, scrapers and crawlers with a single click." July 2024. https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/
  23. Vercel. "The rise of the AI crawler." Late 2024. https://vercel.com/blog/the-rise-of-the-ai-crawler
  24. Amazon. "Amazonbot - Help Amazon Alexa." https://developer.amazon.com/amazonbot
  25. llms.txt specification. https://llmstxt.org/
  26. W3C / TDMRep. "Text and Data Mining Reservation Protocol." https://www.w3.org/community/tdmrep/
  27. IETF. "AI Preferences (aipref) Working Group." https://datatracker.ietf.org/wg/aipref/about/
  28. AWS. "AWS WAF Bot Control managed rule group." https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-bot.html
  29. nginx. "ngx_http_access_module documentation." https://nginx.org/en/docs/http/ngx_http_access_module.html
  30. Apache. "mod_rewrite documentation for UA-based blocking." https://httpd.apache.org/docs/current/mod/mod_rewrite.html

For the analytics-side companion that covers human AI-referred visits (where the bot signal leads to actual revenue), see the ChatGPT referral analytics guide and the per-engine deep dives at track Perplexity traffic, track Claude traffic, track Gemini traffic, and track AI Overviews. For the strategic framing of how to split SEO and AEO effort given the crawler ecosystem above, AEO vs SEO in 2026 is the companion. For the broader question of how Google sources its AI Overviews content (which interacts with Google-Extended and the AI crawler decisions you make here), see where does Google AI get its information. For the multi-engine citation monitoring side, the AI visibility tracker walkthrough covers how to combine bot crawl data with human-referral data into a single citation-readiness signal. For the practical guide to getting cited in the first place, how to get cited by AI engines is the playbook. The product surfaces are at cookieless revenue analytics, track ChatGPT traffic, and the main dashboard.
