A 2026 engineering guide to verifying AI crawlers — reverse-DNS the IP, cross-check published ranges, and catch spoofed GPTBot, ClaudeBot, and PerplexityBot user-agents before they pollute your AI analytics.
The question that started this article arrived in my inbox in February 2026, from a founder running a mid-size content SaaS: "Our AI-bot dashboard says GPTBot hit us 9,000 times last month. Is that real, or am I being scraped by something pretending to be OpenAI?" The honest answer was: you cannot know from the user-agent, and you are probably being scraped by something pretending to be OpenAI. When we ran the IP check on his logs, 12% of the self-identified GPTBot hits came from IP ranges OpenAI does not own. They were scrapers. His dashboard had been counting them as AI activity for months, which meant every conclusion he had drawn about which pages OpenAI was interested in was contaminated by noise from someone else's crawler.
This is the technical companion to the AI crawler tracking field guide. That piece covers which crawlers exist, what each one does, and how to detect them. This one covers the harder problem that comes immediately after detection: proving the detected crawler is who it says it is. Detection asks "does the user-agent match GPTBot?" Verification asks "is this actually OpenAI, or someone wearing OpenAI's name?" The gap between those two questions is where most AI-analytics data quality goes to die. If you have read the dark AI traffic in GA4 breakdown and the track-ChatGPT-traffic playbook, this is the layer underneath both: the verification step that decides whether the numbers in those pieces are measuring real AI traffic or phantom scraper traffic wearing a costume.
Quick Facts
| Metric | Value | Source |
| --- | --- | --- |
| Self-identified AI-bot hits that are spoofed (audited sites) | 5-15% typical | Attrifast aggregate |
| Highest single-site spoofed-GPTBot rate I have measured | | |
Two numbers anchor everything below. The 5-15% spoof rate is the data-quality problem: a meaningful fraction of what your analytics calls "AI bot traffic" is not. The fact that source IPs cannot be forged over TCP is the solution: the one part of the request a spoofer cannot freely fake is the very thing the published-range cross-check verifies. Everything in this article is built on those two facts.
Why user-agent strings prove nothing
The quotable version: the HTTP User-Agent header is a free-text field the client sets on every request, with no signature, no secret, and no challenge-response. A scraper can copy OpenAI's exact GPTBot string in one line of code. So a user-agent that says GPTBot/1.1 is a claim, not a credential. The only request attribute a bot cannot freely forge is its source IP — which is why every major crawler operator tells you to verify by IP, not by UA.
Here is the part operators underestimate. Setting the user-agent to impersonate GPTBot is not a sophisticated attack. It is the default behavior of anyone who reads OpenAI's public docs and decides that wearing OpenAI's name makes their scraper look benign. The string is published. Copying it is trivial:
# A scraper impersonating GPTBot in one line. Nothing stops this.
curl -A "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot" \
https://yoursite.com/pricing
That request will appear in your logs identical to a real GPTBot crawl in every field except one: the source IP. The real GPTBot connects from OpenAI's published infrastructure. The impersonator connects from wherever the scraper is running — a datacenter VPS, a residential proxy, a compromised host. The user-agent is the same. The IP is not, and the IP is the one thing the attacker cannot change without losing the ability to actually receive your page content.
Why can't they forge the IP too? Because HTTP runs over TCP, and TCP requires a completed three-way handshake before any data flows. To complete the handshake the client must receive the SYN-ACK your server sends to the source IP. If the spoofer forges a source IP they do not control, the SYN-ACK goes to that other host, not to them, and the connection never completes, so they never get your page. For a crawler that actually wants your content, the source IP must be real and routable. That asymmetry is the entire foundation of IP-based verification: verify the thing the attacker cannot freely forge.
| Request attribute | Can the client set it freely? | Useful for verification? |
| --- | --- | --- |
| User-Agent header | Yes: arbitrary free text | No: trivially spoofed |
| Referer header | Yes: arbitrary free text | No: trivially spoofed |
| Custom headers (X-*) | Yes: arbitrary free text | No |
| TLS fingerprint (JA3/JA4) | Partially: harder to forge perfectly | Weak signal, evadable |
| Source IP (completed TCP) | No: constrained by handshake | Yes: the canonical anchor |
| Reverse-DNS (PTR) of source IP | Only if attacker controls the IP's reverse zone | Yes, when forward-confirmed |
| Published-range CIDR match | No: attacker cannot get into operator's range | Yes: the strongest single check |
The bottom three rows are the verifiable ones. Everything above them is set by the client and means nothing on its own. This is not an AI-bot-specific insight: it is the same logic Google has documented for verifying Googlebot for over a decade. What is new in 2026 is that the AI-crawler boom has multiplied both the number of legitimate crawlers worth verifying and the number of scrapers impersonating them.
The canonical method: forward-confirmed reverse DNS
The quotable version: forward-confirmed reverse DNS (FCrDNS) is a two-step check. First, reverse-resolve the source IP to a hostname via its PTR record. Second, forward-resolve that hostname back to an IP and confirm it matches the original. A spoofer might point a PTR record at googlebot.com, but they cannot make googlebot.com forward-resolve to their own IP, because they do not control Google's DNS. The forward confirmation closes the loophole.
Let me walk the mechanism precisely, because the order of operations is where people get it wrong. DNS terminology is standardized in RFC 8499, and the PTR (reverse) record for an IPv4 address lives in the in-addr.arpa zone defined back in RFC 1035 (IPv6 addresses use ip6.arpa).
Step one — reverse lookup (PTR). You have an IP, say 203.0.113.7. You query its PTR record. A real Googlebot IP returns something like crawl-203-0-113-7.googlebot.com. A spoofer's datacenter IP typically returns the hosting provider's generic PTR (ec2-203-0-113-7.compute-1.amazonaws.com) or nothing at all.
Step two — forward lookup (A/AAAA), the confirmation. You take the hostname from step one and resolve it forward. If the PTR claimed crawl-203-0-113-7.googlebot.com, you resolve that hostname and confirm it returns 203.0.113.7. If it does, the IP is forward-confirmed: only the holder of the googlebot.com zone could make both records line up. If the forward lookup returns a different IP, or fails, the PTR was a lie.
The reason both steps are mandatory: PTR records are controlled by whoever owns the reverse zone for an IP block. A bad actor who controls their own IP block can set its PTR to claim any hostname they like, including crawl-malicious.googlebot.com. What they cannot do is make googlebot.com's forward zone — which Google controls — resolve that hostname back to their IP. The forward confirmation is the step that requires control of the legitimate operator's DNS, which the attacker does not have.
The contrast between a real crawler and an impostor is clearest as a sequence. The same two DNS questions get asked of both; only the real crawler's answers line up.
Here is the FCrDNS check as a copy-pasteable shell function. It is deliberately simple so you can read every step; the production version caches verdicts per IP.
# Forward-confirmed reverse DNS for a single IP and an expected domain suffix.
# Usage: fcrdns 66.249.66.1 .googlebot.com
fcrdns() {
  local ip="$1" suffix="$2"
  # Step 1: reverse lookup (PTR). Take the first PTR if several are returned.
  local host
  host=$(dig +short -x "$ip" | head -n1 | sed 's/\.$//')
  if [[ -z "$host" || "$host" != *"$suffix" ]]; then
    echo "$ip NOT_VERIFIED (ptr=$host)"
    return 1
  fi
  # Step 2: forward lookup, confirm it resolves back to the same IP
  local fwd
  fwd=$(dig +short "$host" A "$host" AAAA | sed 's/\.$//')
  # -F matches the IP literally, so "." is not treated as a regex wildcard.
  if grep -qxF "$ip" <<<"$fwd"; then
    echo "$ip VERIFIED ($host)"
    return 0
  fi
  echo "$ip NOT_VERIFIED (forward=$fwd did not match)"
  return 1
}
And the same logic in Python, which is closer to what you would run inside an attribution pipeline or an edge function. Note the caching comment — in production you verify each unique IP once and reuse the verdict.
import socket
import ipaddress


def fcrdns(ip: str, allowed_suffixes: tuple[str, ...]) -> bool:
    """Forward-confirmed reverse DNS.

    Returns True only if the IP reverse-resolves to a hostname under one of
    allowed_suffixes AND that hostname forward-resolves back to the same IP.
    In production: cache the (ip -> bool) verdict for several hours; crawler
    IPs repeat constantly, so a tiny LRU eliminates almost all DNS calls.
    """
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # step 1: PTR
    except (socket.herror, socket.gaierror):
        return False
    if not any(host.endswith(s) for s in allowed_suffixes):
        return False
    try:
        # step 2: forward-confirm. getaddrinfo returns A and AAAA records.
        forward_ips = {info[4][0] for info in socket.getaddrinfo(host, None)}
    except socket.gaierror:
        return False
    target = ipaddress.ip_address(ip)
    return any(ipaddress.ip_address(f) == target for f in forward_ips)


# Real Googlebot resolves to *.googlebot.com and forward-confirms.
# A scraper copying the Googlebot UA from an AWS box does not.
print(fcrdns("66.249.66.1", (".googlebot.com", ".google.com")))
FCrDNS is the universal fallback — it works for any operator that runs honest reverse DNS on its crawler IPs, even ones that do not publish machine-readable range files. But where the operator publishes ranges, the CIDR match is faster and even harder to spoof, so I run that first.
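The per-IP caching mentioned above can be sketched as a small time-to-live cache wrapped around whatever verifier you use. This is a minimal sketch, not the production implementation: the `VerdictCache` class, its parameters, and the six-hour default are illustrative assumptions, and it is unbounded rather than a true LRU.

```python
import time
from typing import Callable, Dict, Tuple


class VerdictCache:
    """Cache per-IP verification verdicts for a fixed time-to-live.

    Hypothetical helper: the name and the 6-hour default are assumptions.
    """

    def __init__(
        self,
        verify: Callable[[str], bool],
        ttl_seconds: float = 6 * 3600,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._verify = verify  # the slow check, e.g. an FCrDNS function
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[bool, float]] = {}

    def check(self, ip: str) -> bool:
        now = self._clock()
        hit = self._entries.get(ip)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]  # fresh verdict: no DNS call needed
        verdict = self._verify(ip)  # stale or unseen: run the real check once
        self._entries[ip] = (verdict, now)
        return verdict
```

Wrapping the function above would look like `VerdictCache(lambda ip: fcrdns(ip, (".googlebot.com",)))`. Negative verdicts are cached too, which keeps a noisy spoofer from triggering a DNS lookup on every request.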
Published IP ranges: the strongest single check
The quotable version: the strongest verification is a CIDR match against the operator's published IP range file. OpenAI, Google, and Microsoft publish their crawler ranges as JSON specifically so you can do this. It is an in-memory prefix check with no DNS call, it cannot be spoofed because the attacker cannot get their IP into OpenAI's published prefixes, and it is the check I run first because it is the cheapest and the most decisive.
OpenAI publishes three separate range files, one per crawler, because the three crawlers run on separate infrastructure:

| OpenAI crawler | Published range file | Purpose |
| --- | --- | --- |
| GPTBot | https://openai.com/gptbot.json | Training-corpus crawler |
| ChatGPT-User | https://openai.com/chatgpt-user.json | Live user-triggered fetch agent |
| OAI-SearchBot | https://openai.com/searchbot.json | ChatGPT search index crawler |
The critical detail most people miss: you must match a hit against the right file. A real ChatGPT-User fetch comes from ChatGPT-User's prefixes, not GPTBot's. If you only download gptbot.json and check every OpenAI-claiming hit against it, you will mislabel real ChatGPT-User and OAI-SearchBot fetches as spoofed. Match each user-agent against its own range file.
Here is the published-range check in shell, using jq to extract prefixes and grepcidr to match. This is the first-line filter; FCrDNS is the fallback for operators without range files.
# Verify GPTBot hits against OpenAI's published prefixes.
# Requires: jq, grepcidr (apt-get install grepcidr).
# 1. Pull the published range file (cache this; refetch daily via cron).
curl -s https://openai.com/gptbot.json -o /tmp/gptbot.json
# 2. Extract the IPv4 and IPv6 CIDR prefixes.
jq -r '.prefixes[] | (.ipv4Prefix // .ipv6Prefix)' /tmp/gptbot.json \
> /tmp/gptbot-cidrs.txt
# 3. Pull the IPs that claimed GPTBot in your logs.
grep "GPTBot/" /var/log/nginx/access.log | awk '{print $1}' | sort -u \
> /tmp/claimed-gptbot-ips.txt
# 4. Anything NOT inside a published prefix is spoofed.
echo "=== VERIFIED (inside OpenAI prefixes) ==="
grepcidr -f /tmp/gptbot-cidrs.txt /tmp/claimed-gptbot-ips.txt
echo "=== SPOOFED (outside every OpenAI prefix) ==="
grepcidr -v -f /tmp/gptbot-cidrs.txt /tmp/claimed-gptbot-ips.txt
The Python equivalent, which is what I actually wire into an attribution pipeline, builds an in-memory network list once and checks each IP with a fast prefix match:
import ipaddress
import json
import urllib.request


def load_prefixes(url: str) -> list:
    """Fetch a published crawler range file and return parsed networks.

    Cache the result; refetch on a daily schedule, not per request."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        data = json.load(resp)
    nets = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            nets.append(ipaddress.ip_network(cidr))
    return nets


GPTBOT_NETS = load_prefixes("https://openai.com/gptbot.json")
CHATGPT_USER_NETS = load_prefixes("https://openai.com/chatgpt-user.json")
SEARCHBOT_NETS = load_prefixes("https://openai.com/searchbot.json")


def in_ranges(ip: str, nets: list) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in nets)


# Match each UA against ITS OWN range file. Mixing them up mislabels
# real ChatGPT-User fetches as spoofed GPTBot.
def verify_openai(ip: str, ua: str) -> str:
    if "GPTBot/" in ua and in_ranges(ip, GPTBOT_NETS):
        return "verified-gptbot"
    if "ChatGPT-User/" in ua and in_ranges(ip, CHATGPT_USER_NETS):
        return "verified-chatgpt-user"
    if "OAI-SearchBot" in ua and in_ranges(ip, SEARCHBOT_NETS):
        return "verified-oai-searchbot"
    return "unverified"  # UA claims OpenAI but IP is not in any prefix
Google publishes its ranges in a parallel structure, split across several JSON files because Google runs several crawler classes.
Microsoft publishes Bingbot ranges and documents the .search.msn.com reverse-DNS suffix for FCrDNS verification. The pattern is identical across all three vendors: pull the JSON, build a network list, prefix-match. The only vendor-specific part is the URL and the JSON shape.
Crawler-by-verification-method reference table
The quotable version: not every AI crawler is verifiable the same way. GPTBot, ChatGPT-User, OAI-SearchBot, Googlebot, and Bingbot publish ranges and support FCrDNS — verify them rigorously. Google-Extended and Applebot-Extended share their parent's IPs and user-agent, so they are robots.txt tokens with no log-detectable signature. Where a crawler publishes neither ranges nor honest reverse DNS, treat unmatched hits as unverified rather than assuming they are real.
This is the table I keep open while writing verification rules. "Spoofability" is my assessment of how easy it is for an impostor to pass as this crawler given the verification methods available — lower is better for you.
Google-Extended is not a detectable bot. It has no unique user-agent and no separate IP range; it rides on Googlebot's identity. You cannot find it in access logs. The only control surface is the User-agent: Google-Extended token in robots.txt. Any log-analysis tool that claims to show you "Google-Extended hits" is either showing you all Googlebot hits or making something up. Same logic applies to Applebot-Extended relative to Applebot.
ChatGPT-User and PerplexityBot run on separate prefixes from the training crawlers. Verify them against their own range files. This is the single most common verification bug I see: a pipeline that checks every OpenAI-claiming hit against gptbot.json and flags real ChatGPT-User fetches as spoofed because they came from ChatGPT-User's prefixes, which are correctly absent from the GPTBot file.
Where ranges are less mature, degrade gracefully. Not every operator updates published ranges with OpenAI's cadence. For a crawler where the published range is stale or absent, fall back to FCrDNS; where FCrDNS is inconclusive, mark the hit unverified and route it to a suspicious bucket. Do not promote an unverified hit to "real" just because the user-agent matched — that is exactly the mistake that lets spoofers in.
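To make the Google-Extended note above concrete, here is a minimal sketch of what "robots.txt token" means in practice. It uses Python's standard-library `urllib.robotparser` to evaluate a sample robots.txt; the robots.txt content, the `allowed` helper, and the example.com URL are illustrative assumptions, and Python's matcher only approximates how Google interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Google-Extended while leaving
# ordinary crawling (including Googlebot) untouched.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed(token: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return whether the given robots.txt token may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


print(allowed("Google-Extended", "https://example.com/pricing"))  # False
print(allowed("Googlebot", "https://example.com/pricing"))        # True
```

The token changes what Google is permitted to do with the content; it never appears as a user-agent in your access logs, which is why there is nothing to verify by IP.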
A worked example: the e-commerce log where 18.4% of GPTBot was fake
The quotable version: on one e-commerce log I audited in Q1 2026, 18.4% of the unique IPs claiming to be GPTBot (18.5% of the requests) sat outside every published OpenAI prefix, in datacenter ranges and a residential-proxy ASN. They were scrapers wearing OpenAI's name to look benign. The site's AI-bot dashboard had been counting all of them as real OpenAI activity, which meant its read on "which pages OpenAI cares about" was nearly a fifth noise.
Here is the actual investigation, reconstructed with the IPs and ASNs anonymized but the method and proportions intact. The site was a mid-size DTC catalog, the kind of content that attracts price-scrapers, which is exactly the population that benefits from impersonating a benign-looking crawler.
Step 1: extract every IP that claimed GPTBot, from a month of logs, deduplicated by IP.
Step 3: weight by hit volume, not just unique IP count. 192 of 1,043 unique IPs is 18.4% of IPs, but IP share and request share need not agree, because one busy IP can outweigh dozens of quiet ones. Here the spoofers hit only marginally harder per IP, so the request-weighted share came out almost the same:
# Sum hits for spoofed IPs vs total
total_hits=$(awk '{s+=$1} END{print s}' /tmp/gptbot-ip-counts.txt)
# -w stops 1.2.3.4 from also matching 1.2.3.40 or 11.2.3.4 as a substring.
spoofed_hits=$(grep -wF -f /tmp/gptbot-spoofed-ips.txt /tmp/gptbot-ip-counts.txt \
  | awk '{s+=$1} END{print s}')
echo "spoofed=$spoofed_hits total=$total_hits"
# spoofed=11,847 total=64,200 -> 18.5% of requests
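The arithmetic behind the two shares can be checked in a few lines. This sketch only reuses the counts quoted in this worked example; the `share` helper is a hypothetical name introduced for illustration.

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 1)


ip_share = share(192, 1_043)            # share of unique IPs outside the prefixes
request_share = share(11_847, 64_200)   # share of requests from those IPs

# Average requests per IP for each population.
spoofed_per_ip = 11_847 / 192
in_range_per_ip = (64_200 - 11_847) / (1_043 - 192)

print(ip_share, request_share)
print(round(spoofed_per_ip, 1), round(in_range_per_ip, 1))
```

When the two per-IP averages are close, as they are here, the IP share and the request share land within a fraction of a point of each other. On a log where a handful of spoofed IPs generate most of the volume, the two numbers can diverge sharply, which is why both are worth computing.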
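Step 4: reverse-DNS the spoofed IPs to characterize them. A PTR lookup on the 192 outside-range IPs told the story of who they actually were (no forward confirmation is needed here, because these IPs have already failed the range check):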
while read -r ip; do
  host=$(dig +short -x "$ip" | sed 's/\.$//')
  echo "$ip ${host:-NO_PTR}"
done < /tmp/gptbot-spoofed-ips.txt | sort -k2
The breakdown of where the fake GPTBot traffic actually came from:
| Source of spoofed "GPTBot" | Unique IPs | Share of spoofed hits | Reverse-DNS pattern |
| --- | --- | --- | --- |
| Generic datacenter VPS (multiple providers) | 88 | 41% | *.compute.amazonaws.com, *.hetzner.com |
| Residential-proxy ASN | 61 | 38% | NO_PTR or ISP-generic |
| Eastern-European hosting block | 29 | 14% | *.hosting-provider.example |
| Empty/no PTR record | 14 | 7% | NO_PTR |
None of those is OpenAI. The 41% on datacenter VPS were almost certainly competitors and aggregators scraping the catalog. The 38% on a residential-proxy ASN were the more sophisticated scrapers, the ones paying for residential IPs specifically to evade per-ASN rate limits — and dressing in GPTBot's user-agent to look like a crawler the site owner would not want to block. The impersonation was the tell: nobody pays for residential proxies and then honestly identifies as a scraper. They wear the crawler's name precisely because operators have been trained to allow it.
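The bucketing behind a breakdown like the one above can be sketched as a suffix match over the PTR results. This is a rough illustration, not the classifier used in the audit: the suffix list, bucket names, and helper functions are assumptions, and separating a residential ISP from a hosting provider in practice needs an ASN lookup that this sketch does not attempt.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical suffix list; extend it with the providers you actually see.
DATACENTER_SUFFIXES = (".amazonaws.com", ".hetzner.com")


def bucket(ptr: Optional[str]) -> str:
    """Assign a PTR hostname to a coarse bucket."""
    if not ptr or ptr == "NO_PTR":
        return "no-ptr"
    host = ptr.rstrip(".").lower()
    if host.endswith(DATACENTER_SUFFIXES):  # str.endswith accepts a tuple
        return "datacenter"
    return "other"


def tally(rows: Iterable[Tuple[str, Optional[str]]]) -> Counter:
    """Count (ip, ptr) rows per bucket."""
    return Counter(bucket(ptr) for _ip, ptr in rows)
```

Feeding it the `ip host` pairs printed by the reverse-DNS loop gives a per-bucket count of unique IPs; weighting by hits is a matter of adding each IP's request count instead of 1.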
The remediation was two-layer. For analytics, I excluded every unverified GPTBot hit from the AI-bot dashboard, which immediately corrected the "OpenAI is very interested in our catalog pages" reading down to its real, much smaller value. For security and crawl budget, I routed the unverified bucket to a Cloudflare managed challenge rather than a hard block — a managed challenge stops headless scrapers cold while leaving a path open in case a real crawler ever appears from an IP rotation I had not yet picked up in the published file. Six weeks later the catalog-scraping load was down by roughly the spoofed-hit volume, and the AI-bot signal the owner was using to prioritize content was finally measuring OpenAI rather than OpenAI plus a crowd of impostors.
Why this corrupts your revenue attribution
The quotable version: if your AI-traffic numbers count spoofed bots as real AI activity, your attribution is wrong before you ever reach the human-click side. Spoofed crawlers cluster on your highest-value pages — pricing, product, comparison — which are exactly the pages whose AI-citation signal you are trying to read. Inflated crawl signal leads to wrong content prioritization and a per-engine revenue number that is too high. Verification is the data-quality floor under honest attribution.
The connection between bot verification and revenue is not obvious until you have watched it break a dashboard, so let me make it concrete. In the AI crawler tracking guide I described the citation-readiness signal: a spike in ChatGPT-User fetches on a page tends to lead a spike in human AI-referred clicks by 18-72 hours. That signal is only useful if the crawl data feeding it is real. If 15% of your "ChatGPT-User" hits are a scraper impersonating ChatGPT-User, then your citation-readiness signal has a 15% noise floor, and the pages that look hottest might be the pages a competitor's scraper is hammering, not the pages OpenAI is fetching to answer live queries.
It gets worse because spoofers are not random. They target your money pages:
| Page type | Why real AI crawlers visit | Why spoofers also target it | Net effect on signal |
| --- | --- | --- | --- |
| Pricing page | High citation value for buying-intent queries | Competitors scrape your prices constantly | Inflated, hard to separate |
| Product/feature pages | AI engines fetch for comparison answers | Catalog scrapers and aggregators | Inflated |
| Comparison/vs pages | Highest citation value of all | Competitors monitor positioning | Heavily inflated |
| Blog/long-tail content | Training and live-fetch interest | Content scrapers and AI-data resellers | Moderately inflated |
| Login/account pages | Real crawlers skip (401/403) | Credential-stuffing bots wear any UA | False AI signal entirely |
The pricing and comparison rows are the dangerous ones. Those are the pages where a genuine AI-crawl spike is most worth acting on — they are the pages that drive measurable revenue when AI engines cite them. They are also the pages competitors most want to scrape. Without verification, you cannot tell whether the spike on your /vs/competitor page means OpenAI is about to cite you in comparison answers (act on it: invest in that page) or whether a competitor's monitoring scraper found a new way to dress up its requests (ignore it). One reading tells you to double down on content that is winning AI citations; the other is noise. Verification is what lets you tell them apart.
This is why, in Attrifast, verification runs before classification. Every hit that claims a known AI-bot user-agent is checked against the published range and, where needed, FCrDNS. Only verified hits feed the citation-readiness signal and the bot-activity view. Unverified UA-claimed hits go to a separate suspicious bucket that never touches the revenue-adjacent signals. The per-engine revenue numbers on the revenue attribution feature are downstream of that filter — they describe traffic that has been proven to come from the engine it claims, not traffic that merely said so. An attribution number built on unverified bot data is a number built on a foundation a one-line curl command can move.
Using Cloudflare and CDN verified-bot signals
The quotable version: if you sit behind Cloudflare, its Verified Bots program maintains a curated list of known-good crawlers and exposes a verified-bot signal you can read in firewall rules, Workers, and Logpush. For the bots Cloudflare knows and has validated, this replaces your own reverse-DNS work. The limits: it only covers bots in Cloudflare's program, and you are trusting Cloudflare's validation cadence — so for the long tail and for defense in depth I still keep my own published-range cross-check.
Cloudflare's Verified Bots program does the FCrDNS-and-range work for you for the crawlers it knows about. The mechanics: Cloudflare validates a bot's identity against the operator's published infrastructure, and then exposes the result as a signal your rules can read. The most useful field is the verified-bot category, available in firewall expressions and Logpush. A firewall rule that lets verified bots through while challenging unverified UA-claimants looks like this:
# Cloudflare WAF expression (UI rule builder syntax):
# Challenge anything claiming to be an AI crawler that Cloudflare has NOT verified.
(
http.user_agent contains "GPTBot"
or http.user_agent contains "ChatGPT-User"
or http.user_agent contains "OAI-SearchBot"
or http.user_agent contains "ClaudeBot"
or http.user_agent contains "PerplexityBot"
)
and not cf.bot_management.verified_bot
Action: Managed Challenge. The logic reads cleanly: if the request claims to be a major AI crawler but Cloudflare's verification did not confirm it, make it solve a challenge. Real crawlers are verified and pass straight through; impersonators claiming the UA without the verified-bot flag hit the challenge and stop. Cloudflare also publishes AI-specific controls: its AI Audit and bot-management features let you see and govern verified AI crawler activity directly.
In a Cloudflare Worker, you can read the verification verdict per request and route accordingly:
export default {
  async fetch(request, env, ctx) {
    const ua = request.headers.get('user-agent') || '';
    const claimsAiBot = /GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|PerplexityBot/.test(ua);
    // cf.botManagement is present on Bot Management plans.
    const verified = request.cf?.botManagement?.verifiedBot === true;
    if (claimsAiBot && !verified) {
      // Spoofed UA: exclude from AI analytics, do not serve normally.
      ctx.waitUntil(logSuspicious(request, ua)); // your logging
      return new Response('Forbidden', { status: 403 });
    }
    if (claimsAiBot && verified) {
      ctx.waitUntil(logVerifiedBot(request, ua)); // feeds bot-activity view
    }
    return fetch(request);
  },
};
The honest limits of leaning on Cloudflare alone:
| Strength | Limit |
| --- | --- |
| Validates known crawlers against operator infra automatically | Only covers bots in Cloudflare's Verified Bots program |
| Verified-bot signal is fast (no DNS call in your path) | A new or niche crawler may not yet be in the program |
| Kept fresh by Cloudflare as ranges rotate | You are trusting Cloudflare's validation cadence and scope |
| Available in WAF, Workers, and Logpush | Requires being on Cloudflare; not portable to other CDNs |
So Cloudflare's verified-bot signal is my first layer when a site is behind it, but I keep the published-range CIDR check in the application as a second, portable, independent layer. Two independent checks that agree give me high confidence; two that disagree are a flag worth investigating, usually a freshly rotated IP that one layer has picked up and the other has not yet. Other CDNs and WAFs offer analogous bot-verification features (AWS WAF ships a Bot Control managed rule group with similar category signals), but the principle is identical: verified-bot signal first, your own range check as defense in depth.
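The agree/disagree rule above reduces to a three-way decision. A minimal sketch, assuming you already have the two booleans in hand (the CDN's verified-bot flag and your own range-check result); the `Confidence` enum and `combine` function are hypothetical names.

```python
from enum import Enum


class Confidence(Enum):
    VERIFIED = "verified"        # both layers agree the crawler is real
    UNVERIFIED = "unverified"    # both layers agree it is not
    INVESTIGATE = "investigate"  # layers disagree: possibly a rotated IP


def combine(cdn_verified: bool, in_published_range: bool) -> Confidence:
    """Combine two independent verification layers into one verdict."""
    if cdn_verified and in_published_range:
        return Confidence.VERIFIED
    if not cdn_verified and not in_published_range:
        return Confidence.UNVERIFIED
    # Exactly one layer said yes. Do not promote to verified; log it
    # and check whether the cached range file is stale.
    return Confidence.INVESTIGATE
```

Keeping the disagreement case as its own state, rather than folding it into either side, is what turns a stale range file or a lagging CDN list into something you notice.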
A complete edge-verification reference implementation
The quotable version: the production pattern verifies each unique IP once, caches the verdict for hours, checks the published-range CIDR match first because it is free, and falls back to forward-confirmed reverse DNS only when the range match is inconclusive. Verified hits feed the bot-activity and citation-readiness views; unverified UA-claimed hits go to a suspicious bucket and never touch revenue signals.
Here is the consolidated logic as a single Next.js-style middleware function, since the Attrifast stack runs Next.js at the edge. It ties together everything above: range files, FCrDNS fallback, per-IP caching, and routing. Range loading and DNS are abstracted behind helpers so the control flow is readable.
// ai-bot-verify.ts: verify an AI-crawler claim before trusting it.
// Verdict cache: crawler IPs repeat constantly, so verify each (bot, IP)
// pair once and reuse for several hours. This makes per-request cost ~zero.
type Verdict = 'verified' | 'unverified';

interface BotMatch {
  bot: 'gptbot' | 'chatgpt-user' | 'oai-searchbot' | 'claudebot' | 'perplexitybot';
  rangeUrl: string;      // published CIDR file for THIS bot
  dnsSuffixes: string[]; // FCrDNS fallback suffixes
}

const MATCHERS: Array<[RegExp, BotMatch]> = [
  [/GPTBot\//, { bot: 'gptbot', rangeUrl: 'https://openai.com/gptbot.json', dnsSuffixes: ['.openai.com'] }],
  [/ChatGPT-User\//, { bot: 'chatgpt-user', rangeUrl: 'https://openai.com/chatgpt-user.json', dnsSuffixes: ['.openai.com'] }],
  [/OAI-SearchBot/, { bot: 'oai-searchbot', rangeUrl: 'https://openai.com/searchbot.json', dnsSuffixes: ['.openai.com'] }],
  [/ClaudeBot\//, { bot: 'claudebot', rangeUrl: 'https://www.anthropic.com/claudebot.json', dnsSuffixes: ['.anthropic.com'] }],
  [/PerplexityBot\//, { bot: 'perplexitybot', rangeUrl: 'https://www.perplexity.ai/perplexitybot.json', dnsSuffixes: ['.perplexity.ai'] }],
];

const verdictCache = new Map<string, { verdict: Verdict; bot: string; at: number }>();
const CACHE_TTL_MS = 6 * 60 * 60 * 1000; // 6 hours

function matchAiBot(ua: string): BotMatch | null {
  for (const [re, m] of MATCHERS) if (re.test(ua)) return m;
  return null;
}

// inRanges(): in-memory CIDR check against the cached published file.
// fcrdns(): forward-confirmed reverse DNS, the FCrDNS fallback.
// Both implemented elsewhere; signatures shown for clarity.
declare function inRanges(ip: string, rangeUrl: string): Promise<boolean>;
declare function fcrdns(ip: string, suffixes: string[]): Promise<boolean>;

export async function verifyAiBot(ua: string, ip: string): Promise<{ isBot: boolean; verdict: Verdict; bot: string | null }> {
  const match = matchAiBot(ua);
  if (!match) return { isBot: false, verdict: 'unverified', bot: null };

  // Key by bot AND IP: a verdict is only valid for the range file it was
  // checked against, so an IP verified as GPTBot must not be reused for a
  // later request from the same IP that claims ChatGPT-User.
  const cacheKey = `${match.bot}:${ip}`;
  const cached = verdictCache.get(cacheKey);
  if (cached && Date.now() - cached.at < CACHE_TTL_MS) {
    return { isBot: true, verdict: cached.verdict, bot: cached.bot };
  }

  // Cheapest check first: published-range CIDR match (no network call once cached).
  let verdict: Verdict = (await inRanges(ip, match.rangeUrl)) ? 'verified' : 'unverified';
  // Fallback only if the range match failed: maybe a freshly rotated IP.
  if (verdict === 'unverified' && (await fcrdns(ip, match.dnsSuffixes))) {
    verdict = 'verified';
  }
  verdictCache.set(cacheKey, { verdict, bot: match.bot, at: Date.now() });
  return { isBot: true, verdict, bot: match.bot };
}
And the routing that consumes it. The rule is strict: verified hits feed the signals, unverified UA-claimants are quarantined.
import { verifyAiBot } from './ai-bot-verify';

export async function classifyRequest(ua: string, ip: string) {
  const { isBot, verdict, bot } = await verifyAiBot(ua, ip);
  if (isBot && verdict === 'verified') {
    // Real crawler. Route to bot-activity view; user-triggered agents
    // additionally feed the citation-readiness signal.
    const userTriggered = bot === 'chatgpt-user' || bot === 'perplexitybot';
    return { stream: 'bot-activity', bot, feedsCitationSignal: userTriggered };
  }
  if (isBot && verdict === 'unverified') {
    // UA claims an AI bot but IP is not verifiable: SPOOFED.
    // Never counts toward AI analytics. Challenge/rate-limit at edge.
    return { stream: 'suspicious', bot, feedsCitationSignal: false };
  }
  // Not a bot claim at all: human/other attribution path (referer + behavioral).
  return { stream: 'traffic', bot: null, feedsCitationSignal: false };
}
The two pieces I want to highlight. First, the cheapest-check-first ordering: the CIDR match against a cached range file is an in-memory prefix test with no network call, so it runs first; FCrDNS, which does involve DNS, runs only when the range match fails. This keeps the common case (a known crawler from a known prefix) free. Second, the fail-closed routing: a UA that claims an AI bot but cannot be verified goes to suspicious, never to bot-activity, and never feeds the citation signal. The default for unverifiable claims is exclusion, not inclusion. That single design choice is the difference between an AI dashboard that measures real crawlers and one that measures real crawlers plus everyone wearing their clothes.
Keeping IP-range allowlists fresh
The quotable version: OpenAI, Google, and Microsoft rotate infrastructure every few months and update their published JSON range files when they do. If you maintain your own allowlist, refetch those files on a schedule — I run a daily cron that pulls each file, diffs against the cached copy, and alerts on change. Quarterly is the minimum; daily is cheap and eliminates the risk of a stale allowlist silently rejecting a real crawler after a rotation.
A stale allowlist fails in the most confusing way possible: a real crawler from a freshly rotated IP fails your range check, gets routed to suspicious, and now you are excluding real OpenAI traffic and wondering why your AI signal dropped. The fix is to treat the published files as live data, not a one-time copy. Here is the refresh job I run:
#!/usr/bin/env bash
# refresh-ai-ranges.sh: pull published AI-crawler range files daily,
# diff against the cached copy, alert on change. Run from cron.
set -euo pipefail

CACHE_DIR=/var/lib/ai-ranges
mkdir -p "$CACHE_DIR"

declare -A SOURCES=(
  [gptbot]="https://openai.com/gptbot.json"
  [chatgpt-user]="https://openai.com/chatgpt-user.json"
  [oai-searchbot]="https://openai.com/searchbot.json"
  [googlebot]="https://developers.google.com/static/search/apis/ipranges/googlebot.json"
)

for name in "${!SOURCES[@]}"; do
  url="${SOURCES[$name]}"
  new="$CACHE_DIR/$name.json.new"
  cur="$CACHE_DIR/$name.json"
  if ! curl -fsS "$url" -o "$new"; then
    echo "WARN: failed to fetch $name from $url" >&2
    continue # keep the existing cached copy; do not blow away a good allowlist
  fi
  if [[ -f "$cur" ]] && ! diff -q "$cur" "$new" >/dev/null; then
    echo "CHANGED: $name ranges updated $(date -u +%FT%TZ)"
    # hook your alerting here (Slack webhook, email, etc.)
  fi
  mv "$new" "$cur"
done
Two design choices worth copying. First, on fetch failure, keep the existing cached copy rather than overwriting it with nothing — a transient network error must never empty your allowlist, because an empty allowlist rejects every real crawler. Second, alert on diff so a range rotation is a thing you notice and can correlate against any AI-signal change, rather than a silent event you discover three weeks later when someone asks why GPTBot "stopped crawling." If you are on Cloudflare Verified Bots you can skip this entirely — Cloudflare owns the freshness — but if you run your own check this job is the difference between a verification layer that stays correct and one that quietly rots.
| Refresh cadence | Risk of stale allowlist | Effort |
| --- | --- | --- |
| One-time copy, never refreshed | High: silently rejects real crawlers after any rotation | Zero, but wrong |
| Quarterly manual refresh | Moderate: up to a quarter of staleness | Low |
| Daily automated cron + diff alert | Low: at most one day stale, and you are notified | Low (set once) |
| Cloudflare Verified Bots (no self-managed allowlist) | Low: Cloudflare handles freshness | Zero (if already on Cloudflare) |
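Once a range file is cached, it still has to be turned into a prefix list the verifier can match against. The following is a minimal sketch, assuming each file is a JSON object with a prefixes array whose entries carry an ipv4Prefix or ipv6Prefix string (the shape used by Google's googlebot.json); confirm the schema of each operator's file before relying on it. parseRangeFile and ParsedRanges are hypothetical names, and the sample uses documentation ranges.

```typescript
// Sketch only: parse a cached range file into IPv4 and IPv6 prefix lists.
// Assumed schema: { "prefixes": [ { "ipv4Prefix": "..." } | { "ipv6Prefix": "..." } ] }

interface ParsedRanges {
  ipv4: string[];
  ipv6: string[];
}

function parseRangeFile(jsonText: string): ParsedRanges {
  const data = JSON.parse(jsonText) as unknown;
  if (typeof data !== 'object' || data === null) {
    throw new Error('range file is not a JSON object');
  }
  const prefixes = (data as { prefixes?: unknown }).prefixes;
  if (!Array.isArray(prefixes)) {
    throw new Error('range file has no prefixes array');
  }
  const out: ParsedRanges = { ipv4: [], ipv6: [] };
  for (const entry of prefixes) {
    if (typeof entry !== 'object' || entry === null) continue;
    const { ipv4Prefix, ipv6Prefix } = entry as { ipv4Prefix?: unknown; ipv6Prefix?: unknown };
    if (typeof ipv4Prefix === 'string') out.ipv4.push(ipv4Prefix);
    if (typeof ipv6Prefix === 'string') out.ipv6.push(ipv6Prefix);
  }
  // Refuse an empty result: an empty allowlist would reject every real crawler.
  if (out.ipv4.length + out.ipv6.length === 0) {
    throw new Error('range file contains zero prefixes');
  }
  return out;
}

// Example with documentation ranges, not real crawler prefixes.
const sampleFile = JSON.stringify({
  creationTime: '2026-01-01T00:00:00',
  prefixes: [{ ipv4Prefix: '192.0.2.0/24' }, { ipv6Prefix: '2001:db8::/32' }],
});
console.log(parseRangeFile(sampleFile));
```

Throwing on an empty or malformed file follows the same reasoning as keeping the cached copy on fetch failure: the caller can catch the error and keep using the last good prefix list instead of swapping in nothing.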
Common verification mistakes I see operators make
Eight mistakes I have seen often enough to call them patterns, each with the fix.
Mistake 1: Trusting the user-agent string. The whole article in one line. A UA that says GPTBot is a claim, not proof. Fix: verify the source IP against the published range or via FCrDNS before treating any AI-bot hit as real.
Mistake 2: Checking every OpenAI hit against gptbot.json only. ChatGPT-User and OAI-SearchBot run on separate prefixes. Checking them against the GPTBot file flags real fetches as spoofed. Fix: match each user-agent against its own published range file.
Mistake 3: Doing reverse DNS without the forward confirmation. A PTR record alone can be set by whoever controls the IP's reverse zone. A half-check (PTR only) is spoofable. Fix: always forward-confirm — resolve the PTR hostname back to the IP and require a match.
Mistake 4: Treating Google-Extended as a detectable bot. It shares Googlebot's UA and IPs. There is no log signature for it. Fix: control Google-Extended via the robots.txt token, and stop expecting to see it separately in access logs.
Mistake 5: Hard-blocking unverified hits and forgetting IP rotations. A real crawler from a freshly rotated IP can fail a stale allowlist. A hard 403 then blocks real OpenAI traffic. Fix: route unverified hits to a managed challenge or rate-limit, not a hard block, and keep the allowlist fresh with a daily diff job.
Mistake 6: Verifying every request inline with no caching. A PTR lookup per request adds latency and DNS load for no benefit, since crawler IPs repeat constantly. Fix: verify each unique IP once, cache the verdict for hours, short-circuit repeats.
Mistake 7: Letting unverified bot hits feed the citation-readiness signal. This is the one that corrupts revenue attribution. If spoofers feed the signal, your hottest-looking pages might be the ones a competitor is scraping. Fix: fail closed — only verified hits feed any revenue-adjacent signal.
Mistake 8: Counting spoofed bots as AI traffic in board reporting. "GPTBot hit us 9,000 times" is a meaningless number if 15% of it is fake and you have not said so. Fix: report verified AI-bot activity only, and state the spoof rate you filtered out so the number is honest.
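The reporting fix in Mistake 8 comes down to simple arithmetic over classified hits. Here is a minimal sketch, assuming each hit that claimed an AI-bot user-agent has already been given a verified or unverified verdict. The type and function names are hypothetical, and the sample counts are made up for illustration.

```typescript
// Sketch only: per-bot verified counts plus the share of claims that failed verification.

type Verdict = 'verified' | 'unverified';

interface ClaimedBotHit {
  bot: string; // the bot the user-agent claimed to be
  verdict: Verdict;
}

interface BotReport {
  bot: string;
  claimed: number;
  verified: number;
  // Percentage of claimed hits that failed verification, to one decimal place.
  // This treats every unverified claim as spoofed, which is the fail-closed convention.
  spoofRatePct: number;
}

function summarizeBotHits(hits: readonly ClaimedBotHit[]): BotReport[] {
  const byBot = new Map<string, { claimed: number; verified: number }>();
  for (const hit of hits) {
    const row = byBot.get(hit.bot) ?? { claimed: 0, verified: 0 };
    row.claimed += 1;
    if (hit.verdict === 'verified') row.verified += 1;
    byBot.set(hit.bot, row);
  }
  return Array.from(byBot, ([bot, { claimed, verified }]) => ({
    bot,
    claimed,
    verified,
    spoofRatePct: Math.round(((claimed - verified) * 1000) / claimed) / 10,
  }));
}

// Illustrative data: 1,000 claimed GPTBot hits, 120 of which failed verification.
const sampleHits: ClaimedBotHit[] = [];
for (let i = 0; i < 880; i++) sampleHits.push({ bot: 'gptbot', verdict: 'verified' });
for (let i = 0; i < 120; i++) sampleHits.push({ bot: 'gptbot', verdict: 'unverified' });
console.log(summarizeBotHits(sampleHits));
// [ { bot: 'gptbot', claimed: 1000, verified: 880, spoofRatePct: 12 } ]
```

Reporting the verified count next to the filtered-out percentage keeps the headline number honest: the reader sees both how much crawler activity was real and how much was discarded.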
How verification fits the rest of the AI-attribution stack
The quotable version: verification is the bottom layer of the AI-attribution stack. On top of it sits bot-vs-human separation, then referer and behavioral attribution for human AI-referred clicks, then the Stripe revenue join. Each layer trusts the one below it. If the verification layer is wrong — if spoofed bots leak into the bot-activity stream — every layer above inherits the error. Get verification right first, then the rest of the stack measures something real.
The dependency runs strictly upward. Layer 2's bot-activity view is only as clean as Layer 1's verification — feed it spoofed hits and the crawler counts are wrong. Layer 3's citation-readiness signal (the leading indicator that a page is about to get human AI-referred traffic) depends on Layer 1 having verified the ChatGPT-User and PerplexityBot fetches that drive it. And Layer 4, the Stripe revenue join that produces the per-engine RPV numbers, inherits any noise from below. This is why I keep saying verification is a data-quality prerequisite, not a security feature: it is the foundation the revenue numbers stand on.
If you want the layers above this one, the companions cover them in depth. How AI engines choose sources explains why a verified ChatGPT-User fetch on a page is a meaningful citation signal in the first place. The 2026 AI traffic analytics overview walks the full measurement picture from crawl to click to revenue. And the track-ChatGPT-traffic implementation shows the human-side detection code that turns verified crawl signals into attributed revenue.
Limitations
Five things this article does not solve, and that you should not extrapolate past.
Crawlers that publish neither ranges nor honest reverse DNS. For the long tail of research, academic, and niche AI crawlers, there may be no published range and no FCrDNS to confirm against. For those you are left with weaker signals (TLS fingerprint, behavioral) or a default of unverified. I treat them as unverified and rate-limit rather than trust them — but I cannot promise they are real or fake, only that they are unproven.
TLS-fingerprint evasion. JA3/JA4 fingerprints can help flag inconsistencies (a request claiming to be GPTBot with a fingerprint that does not match OpenAI's client), but fingerprints can be mimicked by a determined adversary and rotate across client versions. Treat them as a supporting signal, never as the primary verification. The IP remains the anchor.
Headless-browser and agent fetches with generic UAs. Tool-using AI agents (LangChain, MCP servers, custom browse loops) often fetch via headless Chrome with a generic browser UA and no bot identifier at all. These are AI-driven fetches that this article's UA-based verification will not catch, because they do not claim to be a crawler. Detecting them is a behavioral problem and remains open.
IPv6 and CGNAT edge cases. Some published range files lag on IPv6 coverage, and carrier-grade NAT can put many clients behind one IP, complicating both verification and rate-limiting. The methods here work, but verify your specific operator's IPv6 coverage rather than assuming parity with IPv4.
The spoof-rate numbers are my measurements, not universal constants. The 5-15% range and the 18.4% single-site figure come from the sites I have audited, skewed toward SMB SaaS and DTC e-commerce in US-English markets. Your rate depends on how attractive your content is to scrapers. Run the check on your own logs before quoting a number; the method transfers, the exact percentage does not.
FAQ
How do I verify a GPTBot hit is really from OpenAI?
Take the IP address from the request, run a reverse-DNS (PTR) lookup, and confirm the IP also appears in OpenAI's published GPTBot range file at openai.com/gptbot.json. OpenAI publishes its crawler IP ranges as CIDR blocks specifically so you can do this check. A request whose User-Agent says GPTBot but whose IP is not inside any published OpenAI prefix is spoofed — usually a scraper trying to inherit the trust people extend to OpenAI's crawler. The User-Agent string itself is a plain text header anyone can set, so it proves nothing on its own. Always verify by IP, never by the UA string alone.
Why are user-agent strings not enough to identify an AI crawler?
The User-Agent is an arbitrary HTTP request header. Any client — curl, a Python script, a headless browser, a botnet node — can set it to any value, including an exact copy of OpenAI's GPTBot string. There is no cryptographic signature, no shared secret, and no challenge-response in the user-agent mechanism. It was designed for content negotiation in the 1990s, not for authentication. This is why every major crawler operator (Google, OpenAI, Microsoft, Anthropic) tells you to verify by reverse-DNS or published IP ranges instead. Trusting the UA string is the single most common AI-bot data-quality mistake I see.
What percentage of self-identified GPTBot traffic is actually spoofed?
Across the sites I audit it runs 5% to 15% of requests claiming a known AI-bot user-agent, with outliers far higher on content-heavy sites that attract scrapers. On one e-commerce log I worked through in early 2026, 18.4% of hits claiming to be GPTBot resolved to IPs outside every published OpenAI prefix — datacenter ranges in Eastern Europe and a residential-proxy ASN. Those were scrapers wearing OpenAI's name to look benign. The exact rate depends on how attractive your content is to scrapers, but assume a double-digit fraction of unverified AI-bot hits are fake until you have run the check yourself.
How do I verify Googlebot and Google-Extended with reverse DNS?
Run a reverse-DNS (PTR) lookup on the IP. Real Googlebot IPs resolve to a hostname ending in googlebot.com or google.com. Then run a forward-DNS lookup on that hostname and confirm it resolves back to the original IP. This two-step forward-confirmed reverse DNS is the method Google documents officially. Google also publishes its crawler IP ranges as JSON, so you can CIDR-match instead. Google-Extended (the Gemini/Vertex training opt-out) shares Googlebot's IPs and user-agent — you cannot distinguish it in access logs by IP at all; it is controlled only by a robots.txt token.
Does ClaudeBot support IP-based verification like GPTBot?
Anthropic documents ClaudeBot's behavior and publishes guidance on identifying its crawler, and as of 2026 Anthropic provides published IP ranges that you can CIDR-match against. ClaudeBot requests should also reverse-resolve to Anthropic-controlled infrastructure. The honest caveat is that not every AI crawler publishes ranges with the same rigor or freshness as OpenAI and Google, so for ClaudeBot the most reliable cross-check is the published range plus reverse-DNS, and where neither is conclusive, treating the hit as unverified rather than assuming it is real.
Why does spoofed AI traffic break my revenue attribution?
If your AI-traffic dashboard counts every hit that claims to be GPTBot or ChatGPT-User as real AI activity, and 10-15% of those are spoofed scrapers, then your AI-channel numbers are inflated by that fraction before you even get to the human-click side. Worse, spoofed bots cluster on your highest-value pages — pricing, product, comparison pages — which are exactly the pages whose AI-citation signal you are trying to read. Inflated crawl signal leads to wrong content prioritization and a revenue-per-AI-engine number that is too high. Verification is a data-quality prerequisite for honest attribution, not an optional security nicety.
Can Cloudflare verify AI bots for me automatically?
Yes, partially. If you sit behind Cloudflare, its Verified Bots program maintains a curated list of known good crawlers and exposes a verified-bot signal you can read in Logpush, Workers, and firewall rules. For the bots Cloudflare knows and has validated, this saves you the reverse-DNS dance. The limits: Cloudflare only verifies bots in its program, on some plans the signal identifies only a bot category rather than the specific bot, and you are trusting Cloudflare's validation cadence. It is an excellent first layer, but for the long tail of crawlers and for defense in depth I still run my own published-range cross-check.
What is the difference between a spoofed bot and a real third-party AI crawler?
A spoofed bot copies a known crawler's user-agent (e.g. GPTBot) while operating from infrastructure that crawler does not own — it is impersonation. A legitimate third-party AI crawler (Diffbot, CCBot, a research crawler) uses its own honest user-agent from its own published infrastructure — it is not OpenAI, but it is not pretending to be. Both are non-OpenAI traffic, but only the first is fraudulent. The verification matters because you treat them differently: the honest third-party crawler you can allow, rate-limit, or block on its merits; the impersonator you should always block because the impersonation itself is a bad-faith signal.
How often do AI crawler IP ranges change, and how do I keep my allowlist fresh?
OpenAI, Google, and Microsoft rotate infrastructure every few months and update their published JSON range files when they do. If you maintain your own allowlist you should refetch those files on a schedule — I run a daily cron that pulls each published range file, diffs against the cached copy, and alerts me on change. Quarterly is the minimum safe cadence; daily is cheap and removes the risk of a stale allowlist silently rejecting a real crawler after an IP rotation. If you rely on Cloudflare Verified Bots instead, Cloudflare handles the freshness for you.
Should I block spoofed AI bots or just exclude them from analytics?
Both, depending on the layer. For analytics you must exclude unverified hits from your AI-channel counts — counting them corrupts the numbers. For security and crawl-budget you can additionally block or challenge them at the edge, since a request impersonating OpenAI from a residential-proxy ASN is acting in bad faith and there is no legitimate reason to serve it. My default is: verify on every request, route verified bots to the bot-activity view, route unverified UA-claimed bots to a suspicious bucket, and apply a rate-limit or managed challenge to the suspicious bucket rather than a hard block, so I do not accidentally hard-block a real crawler after an IP rotation I have not yet caught.
Do ChatGPT-User and Perplexity-User fetches come from verifiable IPs?
OpenAI publishes a separate range file for ChatGPT-User (the live user-triggered fetch agent) distinct from GPTBot, so yes, you can CIDR-match ChatGPT-User hits the same way. Perplexity publishes guidance and ranges for PerplexityBot; Perplexity-User is the user-triggered agent and Perplexity's stance is that it behaves like a user visiting on request. The practical implication is that you should verify ChatGPT-User and PerplexityBot against their own published ranges, not against the training-crawler ranges, because they are separate infrastructure with separate prefixes. Mixing them up is a common way to mislabel a real fetch as spoofed.
Is reverse DNS verification expensive to run at scale?
A PTR lookup is a single DNS query, typically 10-50 ms, and the results cache well because crawler IPs repeat constantly. At scale you do not verify every request inline — you verify each unique IP once, cache the verdict for hours, and short-circuit repeat hits from the same IP. On the sites I instrument, the unique-IP count for AI crawlers is a few hundred to a few thousand per day, so the verification cost is negligible. The CIDR-match against a published range file is even cheaper — it is an in-memory prefix check with no network call. Combine cached PTR with CIDR-match and the per-request overhead is effectively zero.
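The verify-once, cache-for-hours pattern can be sketched as a small TTL cache wrapped around whatever verification function you already have. This is a minimal sketch, not a production cache: VerdictCache and verifyWithCache are hypothetical names, the verify callback stands in for your own range or FCrDNS check, and the map is unbounded, so a real deployment would add a size limit or eviction.

```typescript
// Sketch only: cache verification verdicts per (claimed bot, IP) pair with a TTL.

type Verdict = 'verified' | 'unverified';

class VerdictCache {
  private readonly entries = new Map<string, { verdict: Verdict; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so the expiry logic can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): Verdict | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: force a fresh verification
      return undefined;
    }
    return entry.verdict;
  }

  set(key: string, verdict: Verdict): void {
    this.entries.set(key, { verdict, expiresAt: this.now() + this.ttlMs });
  }
}

async function verifyWithCache(
  cache: VerdictCache,
  bot: string,
  ip: string,
  verify: (bot: string, ip: string) => Promise<Verdict>,
): Promise<Verdict> {
  // Key on bot and IP together: an IP verified for one crawler's range
  // is not thereby verified for a different crawler's user-agent.
  const key = `${bot}|${ip}`;
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const verdict = await verify(bot, ip);
  cache.set(key, verdict);
  return verdict;
}
```

A TTL of a few hours keeps repeat hits free. Caching unverified verdicts as well as verified ones means a spoofer hammering the site from one IP does not trigger a DNS lookup per request.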
What is forward-confirmed reverse DNS and why does it matter?
Forward-confirmed reverse DNS (FCrDNS) is a two-step check. Step one: reverse-resolve the IP to a hostname via a PTR record. Step two: forward-resolve that hostname back to an IP and confirm it matches the original. It matters because a PTR record alone can be set by whoever controls the IP's reverse zone, so a spoofer could in theory publish a PTR that names a host under googlebot.com. But they cannot also make that googlebot.com hostname forward-resolve to their IP, because they do not control Google's DNS. The forward confirmation closes that gap. It is the method Google, Microsoft, and the broader anti-spoofing literature all recommend.
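The two steps can be sketched in Node as follows. This is a minimal sketch under stated assumptions: the Resolver interface, nodeResolver, and fcrdns are hypothetical names; the allowed suffixes are passed in by the caller; and addresses are compared as lowercase strings, which is fine for IPv4 but may need normalization for IPv6 written in different forms. The resolver is injectable so the logic can be exercised without live DNS.

```typescript
// Sketch only: forward-confirmed reverse DNS with an injectable resolver.
import { promises as dns } from 'node:dns';

interface Resolver {
  reverse(ip: string): Promise<string[]>;       // PTR lookup
  forward(hostname: string): Promise<string[]>; // A and AAAA lookup
}

// Default resolver backed by Node's dns.promises API.
const nodeResolver: Resolver = {
  reverse: (ip) => dns.reverse(ip),
  forward: async (hostname) => {
    const v4 = await dns.resolve4(hostname).catch(() => [] as string[]);
    const v6 = await dns.resolve6(hostname).catch(() => [] as string[]);
    return [...v4, ...v6];
  },
};

async function fcrdns(
  ip: string,
  allowedSuffixes: readonly string[],
  resolver: Resolver = nodeResolver,
): Promise<boolean> {
  let hostnames: string[];
  try {
    hostnames = await resolver.reverse(ip); // step one: IP to hostname
  } catch {
    return false; // no PTR record: cannot confirm, so fail closed
  }
  for (const raw of hostnames) {
    const host = raw.toLowerCase().replace(/\.$/, '');
    // Match on a label boundary so "evilgooglebot.com" does not pass for "googlebot.com".
    const suffixOk = allowedSuffixes.some((s) => host === s || host.endsWith('.' + s));
    if (!suffixOk) continue;
    try {
      const addresses = await resolver.forward(host); // step two: hostname back to IP
      if (addresses.some((a) => a.toLowerCase() === ip.toLowerCase())) return true;
    } catch {
      // forward lookup failed for this hostname; try the next PTR result
    }
  }
  return false;
}
```

A call such as fcrdns(requestIp, ['googlebot.com', 'google.com']) returns true only when both directions agree. A PTR that merely names a Google host fails at step two, because the forward lookup returns Google's addresses rather than the spoofer's.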
Can a bot spoof the IP address as well as the user-agent?
Not for a useful crawl. Spoofing a source IP on a TCP connection is effectively impossible at scale because TCP requires a completed three-way handshake — the spoofer would have to receive packets sent to the forged IP, which they do not control. HTTP runs over TCP, so any bot that actually loads your pages has a real, routable source IP you can verify. This is the whole reason IP-based verification works and user-agent-based identification does not: the UA is a free-text header the client sets, but the source IP of a completed TCP connection is constrained by the network. Verify the thing the attacker cannot freely forge.
How do I tell verified AI crawler activity apart from human AI-referred clicks in one dashboard?
They are two separate streams that need two separate views. A verified AI crawler hit has a known bot user-agent, a verified source IP, and usually no Referer — it belongs in a bot-activity view. A human AI-referred click has a real browser user-agent, no bot UA, and either an AI-engine Referer (chatgpt.com, perplexity.ai) or an empty Referer with behavioral signals — it belongs in your traffic and revenue view. The mistake is mixing them, which makes a crawler burst look like a traffic spike. Attrifast keeps verified bot activity and human AI-referred revenue in the same dashboard but in structurally separate views so neither contaminates the other.
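On the human side, the Referer check is the simplest part to illustrate. Below is a minimal sketch that maps a Referer header to an AI engine, using only the two hostnames named above. The function name and the engine labels are hypothetical, a real list would be longer, and the empty-Referer case still needs the behavioral signals this sketch does not attempt.

```typescript
// Sketch only: map a Referer header to an AI engine label, or null if none matches.

const AI_REFERER_DOMAINS: ReadonlyArray<[string, string]> = [
  ['chatgpt.com', 'chatgpt'],
  ['perplexity.ai', 'perplexity'],
];

function aiEngineFromReferer(referer: string | undefined): string | null {
  if (!referer) return null; // empty Referer: needs behavioral signals instead
  let host: string;
  try {
    host = new URL(referer).hostname.toLowerCase();
  } catch {
    return null; // not a parseable absolute URL
  }
  for (const [domain, engine] of AI_REFERER_DOMAINS) {
    // Match the domain itself or any subdomain, on a label boundary.
    if (host === domain || host.endsWith('.' + domain)) return engine;
  }
  return null;
}

console.log(aiEngineFromReferer('https://chatgpt.com/'));               // chatgpt
console.log(aiEngineFromReferer('https://www.perplexity.ai/search?q')); // perplexity
console.log(aiEngineFromReferer('https://notchatgpt.com/'));            // null
```

This runs only on requests that did not claim a bot user-agent, so a crawler hit and a human click never pass through the same branch.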