
How AI Engines Choose Which Sources to Cite: The 2026 Mechanics

The actual retrieval and training mechanics behind AI citations — the two pathways (pretraining corpus vs live RAG), how each scores sources, and which one drives your revenue.


How AI engines choose sources via two pathways: training-corpus citation (the model knows you from pretraining) and live-retrieval RAG (rewrite, retrieve, re-rank, cite) — which reward different things

A founder messaged me in March, frustrated. He had spent six weeks adding FAQ schema, tightening his intros, and chasing the standard GEO checklist, and his Perplexity citations had genuinely climbed. But ChatGPT, when his prospects asked it to recommend a tool in his category, still did not mention him. He could not understand why the same "AI optimization" worked on one engine and did nothing on the other.

The answer is that he was optimizing one pathway and measuring a different one. His schema and concision work moved the live-retrieval (RAG) pathway, which is exactly why Perplexity, the most retrieval-heavy engine, responded. But the ChatGPT query he cared about was being answered from frozen training knowledge with no live search, so it ran on the training-corpus pathway, where six weeks of on-page work changes nothing. He needed two different strategies and a way to tell which one was paying.

This is the deeper companion to the AI search ranking factors checklist. That piece lists the 12 signals and grades them. This piece is the layer underneath: the actual retrieval and training mechanics that decide why a signal helps at all, and the single distinction — training corpus versus live retrieval — that most articles never make. If you have read the factors piece, this is where the "why" lives. Everything below is labeled Documented, Inferred, or Speculative, because the vendors publish far less than the confident blog posts imply.

Quick Facts

| Spec | Value |
|---|---|
| Distinct citation pathways | 2 (training-corpus recall + live-retrieval RAG) |
| RAG pipeline stages | 4+ (query rewrite, candidate retrieval, re-rank, synthesis+cite) |
| Time-to-citation, RAG pathway | 24-72 hours on a well-crawled domain (directional) |
| Time-to-citation, training pathway | One model generation (~6-12 month cutoff cycle) |
| Citations per Perplexity answer | 3-7 [12] |
| Citations per ChatGPT search answer | 3-5 [10] |
| Citations per Google AI Overview block | 4-7 [5] |
| Princeton GEO max citation lift from passage edits | Up to 40% on Perplexity / BingChat [1] |
| Documented partner search index for ChatGPT search | Bing (per SearchGPT announcement) [10] |
| Corpora over-represented in pretraining | Wikipedia, Reddit, established publishers [13][15] |
| GA4 default attribution accuracy for either pathway | Roughly 0% (both bucket as Direct/(none)) [11] |

The Princeton GEO paper is the empirical spine of this article. It is the closest thing the field has to a controlled experiment on what moves AI citation rates, and it specifically isolates passage-level interventions, which makes it the right anchor for the re-ranking discussion. Everything sourced to agency studies (Ahrefs, Semrush, Backlinko) is observational at scale — directionally useful, weaker on causation. I weight evidence accordingly and say so inline.

The one distinction that reorganizes everything: two pathways, not one algorithm

The single most useful sentence in this article: an AI engine cites you through one of two mechanically different pathways, and the work that wins one does almost nothing for the other. Get this and the rest follows; miss it and you will keep optimizing blind.

Most "how does AI choose sources" content treats citation as a single black box with a list of factors going in. That framing is wrong in a way that costs real money. There are two systems:

| Pathway | What happens | When it fires | What it rewards |
|---|---|---|---|
| 1. Training-corpus citation | The model recalls your brand/fact from frozen pretraining knowledge. No live fetch. | Browse-off chat: Claude plain, Gemini no-browse, ChatGPT when it skips search | Corpus-wide authority accumulated over years (Wikipedia, Reddit, publisher mentions) |
| 2. Live-retrieval citation (RAG) | The engine runs a real search, retrieves docs, re-ranks them, cites the passages used | Browse-on: ChatGPT search, Perplexity, AI Overviews, Claude with web search | Classic searchability + answer-shaped, schema-marked, fresh passages |

The reason this distinction is invisible to most operators is that the output looks identical. In both cases the model says "I recommend Attrifast" or surfaces a fact about a brand. But in Pathway 1 there is usually no clickable link — the model is recalling, not fetching — and in Pathway 2 there is a numbered footnote pointing at a page that was retrieved seconds earlier. Same-looking answer, completely different machinery, completely different levers.

Here is the honest version of what each pathway costs to influence:

| Dimension | Training-corpus pathway | RAG pathway |
|---|---|---|
| Update latency | 6-12 months (next model) | 24-72 hours (next crawl) |
| Primary lever | Corpus-wide mentions + entity presence | Indexable, ranking, answer-shaped pages |
| Hardest to fake | Yes: accumulated authority | No: passage edits move it (Princeton) |
| Linkable citation? | Usually no (recall, no fetch) | Yes (footnote to retrieved page) |
| Evidence strength | Inferred (corpora are documented; weighting is not) | Mixed: search step Documented, re-rank Inferred |
| Who wins by default | Old, authoritative, heavily-mentioned brands | Anyone who is searchable + concise + fresh |

I will walk each pathway's mechanics in full. But the framing comes first because every later section is "which pathway does this apply to?" If you only remember one thing, remember that freshness, schema, and concision are RAG-pathway levers, and Wikipedia/Reddit/mention-density are training-corpus levers, and you have to decide which one the query you care about actually uses.

Pathway 1: training-corpus citation — the model already knows you (or doesn't)

Training-corpus citation is the model recalling, from frozen pretraining, that your brand exists — and you earned that recall years ago, through the open web, whether you optimized for it or not. No live fetch happens. The model is not reading your site; it is reproducing a statistical shadow of your site left in its weights.

This is the pathway nobody can manipulate quickly, which is exactly why it is the most durable and the most valuable for brand-level recommendations. When a prospect opens Claude and types "what's a good cookieless attribution tool for a small SaaS," and Claude answers without browsing, the brands it names are the brands whose information was dense and trustworthy enough in the pretraining corpus to survive into the model's weights.

What actually goes into the corpus

The pretraining corpus for every major LLM is built primarily from large web crawls. Documented: Common Crawl is a publicly available crawl of billions of pages that underpins a large share of LLM training data [15]. Documented: OpenAI's GPTBot and Anthropic's ClaudeBot are the named training crawlers, Google-Extended is the robots.txt token Google uses to control whether crawled content may be used for Gemini training, and each respects robots.txt [16][17]. Documented: Reddit licensed its content to Google for AI training per Reuters [13]. Beyond those facts, the exact composition and weighting of any frontier model's corpus is not published; that part is Inferred from research, leaks, and behavior.

| Corpus source | Documented? | Why it's weighted heavily (Inferred) |
|---|---|---|
| Wikipedia / Wikidata | Documented presence in nearly all LLM corpora | High edit quality, entity structure, broad coverage |
| Common Crawl (open web) | Documented [15] | The bulk substrate of pretraining text |
| Reddit | Licensing deal Documented [13] | Human consensus, recommendation/experience signal |
| Established news + publishers | Inferred | High editorial trust, frequent citation by others |
| Books / academic | Inferred | Density, formality, factual reliability |
| Brand-owned docs/sites | Inferred (only if crawled before cutoff) | Direct, structured product facts |

The practical reading of that table: the training-corpus pathway over-rewards sources that already had web-scale authority and broad mention density. A brand with a Wikipedia page, a steady drip of Reddit and Hacker News mentions, and coverage in established publications accumulates a presence the model "remembers." A brand-new SaaS with a great site and zero external mentions is, to the training-corpus pathway, nearly invisible until the world talks about it and the next model learns from that conversation.

Which corpus source the model leans on also shifts with the type of question, which is why the same brand can be recalled confidently for one query and not another (all rows below Inferred from behavior, not vendor-documented):

| Query type (browse-off) | Corpus source the model leans on | Who tends to win |
|---|---|---|
| "What is [X]?" (definition) | Wikipedia / encyclopedic | Entities with a wiki/Wikidata page |
| "Best [category] tool?" (recommendation) | Reddit / forums / reviews | Brands with mention density |
| "How do I [task]?" (procedure) | Docs / tutorials / Stack Overflow | Sites with clean how-to content |
| "Is [brand] legit?" (trust) | Reviews + publisher coverage | Brands with earned media |
| "[Brand A] vs [Brand B]" (comparison) | Forums + comparison content | Whoever the web compares openly |

Why corrections are slow on this pathway

A wrong fact in the training pathway is frozen until the next model ships — typically a 6-12 month cutoff cycle. This is the source of the maddening "AI keeps saying my old price" problem. The model learned your $49 price before you dropped to $29, and no on-site edit reaches the frozen weights.

| Symptom | Cause | Lever | Latency |
|---|---|---|---|
| AI recommends a competitor, never you (browse-off) | You're absent/thin in the corpus | Build mentions + entity presence | Next model gen |
| AI cites outdated pricing/features | Old data frozen in weights | Get fresh facts into the live index (helps RAG, not recall) | Next model gen for recall |
| AI confuses your brand with another | Weak entity disambiguation | Wikidata, Organization schema, sameAs | Weeks for KG, next gen for recall |
| AI "knows" you well, recommends confidently | Dense, consistent corpus presence | Maintain it; this is the asset | Already won |

I want to be honest about the limits of optimization here. You cannot make a model recall you faster than its training cycle. What you can do is make sure that by the next training run, the open web's picture of your brand is dense, consistent, and entity-disambiguated. That is a slow, compounding game — and it is the single most under-instrumented one, because the payoff arrives as browse-off recommendations that send no clickable referrer and therefore never show up cleanly in anyone's analytics. We will close that loop at the end. For the entity-presence mechanics specifically, the Wikipedia effect on AI visibility goes deeper than I can here, and Reddit's outsized role in AI citations covers the mention-density side.
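
The entity-disambiguation lever in the table above (Organization schema with sameAs) is the one piece of this pathway you can ship today. Here is a minimal sketch of generating that markup. The brand name, URLs, and the Wikidata ID are placeholders I made up for illustration; swap in your own, and treat the field selection as a starting point rather than a complete schema.

```python
import json


def organization_jsonld(name, url, same_as):
    """Build a schema.org Organization object whose sameAs links tie the brand to its other profiles."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": list(same_as),
    }


# Hypothetical brand and profile URLs, for illustration only.
doc = organization_jsonld(
    name="ExampleCo",
    url="https://www.example.com",
    same_as=[
        "https://www.wikidata.org/wiki/Q00000000",  # placeholder Wikidata ID
        "https://www.linkedin.com/company/exampleco",
    ],
)

# Wrap the JSON in the script tag that carries JSON-LD in a page's HTML.
snippet = '<script type="application/ld+json">\n' + json.dumps(doc, indent=2) + "\n</script>"
print(snippet)
```

This helps the knowledge-graph and RAG side within weeks; per the latency column above, recall still waits for the next model generation.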

Pathway 2: live-retrieval (RAG) — the engine searches, ranks, then cites

RAG citation is mechanically a search engine bolted to a language model: the engine runs a real query against an index, gets a candidate set, re-ranks it, and the model cites the passages it was handed. This is the pathway you can move this week, and it is where almost all "GEO tactics" actually operate, whether or not they say so.

The acronym is retrieval-augmented generation. The key word is retrieval — there is a genuine search step before any generation, and that step behaves a lot like classic search. Documented: OpenAI's SearchGPT announcement names Bing as the partner search index for ChatGPT search [10]. Documented: Perplexity uses its own crawler (PerplexityBot) plus partner search APIs and exposes a Sources tab [12]. Documented: Google AI Overviews synthesize at query time from the live search index, not from Gemini's training data [5]. The exact retrieval and re-ranking algorithms are not published — those steps are Inferred from research and behavior.

Here is the full pipeline, stage by stage, with what's documented at each step:

| Stage | What it does | Evidence | What moves it |
|---|---|---|---|
| 1. Query rewrite | Turns the user's chat into one or more retrieval queries | Inferred (standard RAG practice) | Clear topical targeting; matching real query phrasing |
| 2. Candidate retrieval | A search step pulls N documents from the index | Documented that a search step exists [10][12] | Classic SEO: indexed, ranking, crawlable, relevant |
| 3. Re-ranking | Scores passages on relevance + source + shape | Inferred (Princeton shows passage edits move it) [1] | Concision, schema, freshness, citations/stats |
| 4. Synthesis + cite | Model writes the answer, footnotes used passages | Documented behavior (visible citations) [10][12] | Being the cleanest extractable answer in the set |

The reason this article keeps separating stage 2 from stage 3 is that they are different problems with different levers, and conflating them is the modal GEO mistake. Getting into the candidate set is a searchability problem — if your page is not indexed and ranking for the rewritten query, it never enters the pool, and no amount of schema saves it. Surviving the re-ranker is a passage-quality problem — once you are in the pool, authority matters less and answer-shape, concision, and parseability matter more.
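
To make the stage 2 versus stage 3 split concrete, here is a toy version of the four stages in a few lines of Python. This is a sketch under heavy assumptions, not how any engine works: the stop-word list, the term-overlap scoring, and every URL are invented for illustration, and real systems use proper indexes and learned rankers. What it does show faithfully is the ordering: a page that is not indexed never reaches the re-ranker, no matter how good its passage is.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    indexed: bool = True


def rewrite(chat: str) -> set:
    """Stage 1 (toy): normalize a messy chat message into a set of query terms."""
    stop = {"a", "the", "to", "is", "me", "if", "whats", "what", "for"}
    return {w.strip("?.,!").lower() for w in chat.split()} - stop - {""}


def retrieve(terms, index, n=10):
    """Stage 2 (toy): only indexed pages that share a term enter the candidate pool."""
    pool = [p for p in index if p.indexed and terms & set(p.text.lower().split())]
    return pool[:n]


def rerank(terms, pool, k=3):
    """Stage 3 (toy): rank by term overlap, breaking ties in favor of shorter passages."""
    def score(p):
        words = p.text.lower().split()
        return (len(terms & set(words)), -len(words))
    return sorted(pool, key=score, reverse=True)[:k]


def synthesize(passages):
    """Stage 4 (toy): footnote every passage that was handed to the model."""
    return [f"[{i}] {p.url}" for i, p in enumerate(passages, start=1)]


# Hypothetical index: the third page has the best match but is not indexed.
index = [
    Page("https://example.com/attribution", "cookieless attribution tool for chatgpt referral revenue"),
    Page("https://example.com/home", "welcome to our platform we build many things for teams including a tool"),
    Page("https://example.com/hidden", "chatgpt attribution tool", indexed=False),
]

terms = rewrite("whats a cheap tool to see if chatgpt is sending me sales")
citations = synthesize(rerank(terms, retrieve(terms, index)))
print(citations)
```

The unindexed page is filtered out in `retrieve` before `rerank` ever sees it, which is the whole argument for treating searchability and passage quality as separate problems.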

Stage 1 deep dive: query rewrite is where your keyword targeting still matters

Before anything gets retrieved, the engine rewrites a messy human chat into one or more clean retrieval queries — and the page that matches the rewritten query, not the user's literal words, is the one that enters the pool. This stage is invisible and frequently ignored, but it quietly decides which candidate set even gets assembled.

A user types "whats a cheap tool to see if chatgpt is sending me sales." The engine does not search that verbatim. It rewrites it into something like "cookieless ChatGPT referral revenue attribution tool" and may fan it out into several sub-queries. Inferred (this is standard RAG practice, not vendor-documented per engine): the rewrite step normalizes intent, expands synonyms, and splits multi-part questions.

| Rewrite behavior | Pathway | Evidence | Implication for you |
|---|---|---|---|
| Verbatim chat → normalized query | RAG | Inferred (standard practice) | Optimize for intent, not the literal phrasing |
| Synonym + entity expansion | RAG | Inferred | Entity-clear pages match more rewrites |
| Multi-part question fan-out | RAG | Inferred | One page can match several sub-queries |
| Conversation-context carry-over | RAG | Inferred | Earlier turns shape the rewrite |
| No rewrite (recall path) | Training-corpus | N/A | No query is issued at all |

The practical consequence: you cannot keyword-match your way in by copying the user's exact words, because you never see them — you see the rewrite. What survives the rewrite is clear topical and entity targeting, which is the opposite of keyword stuffing. This is also why the same page can be pulled into the candidate set for queries phrased a dozen different ways: they all rewrite to the same normalized intent.

Stage 2 deep dive: what gets a source into the candidate set

The candidate set is assembled by a search step, so the entry criteria are mostly classic search criteria — which is why "AI search killed SEO" is exactly backwards on this pathway. If you cannot get into the candidate pool, the cleverest passage in the world never gets scored.

| Candidate-set gate | Pathway | Evidence | Notes |
|---|---|---|---|
| Page is indexed | RAG | Documented (search step exists) | Non-indexed = invisible to retrieval |
| Ranks for rewritten query | RAG | Inferred-strong | Top-10 organic correlates with AIO inclusion [5] |
| Crawlable (not blocked) | RAG | Documented (bot docs) [16] | Blocking search bots removes you |
| Topically relevant to intent | RAG | Inferred | Semantic match to the rewritten query |
| Freshness within query window | RAG | Inferred | Time-sensitive queries favor recent pages |
| Domain in partner index | RAG | Documented (Bing for ChatGPT) [10] | If Bing can't find you, ChatGPT search can't either |

This is the stage where I tell people to stop treating AI search as exotic. The candidate-set gate is roughly "would classic search return this page for the query?" If yes, you are eligible for the re-ranker. If no, you are not in the conversation. How to rank in ChatGPT is largely a stage-2 problem, and the classic-SEO foundations that get you there are the same ones that have always mattered: indexability, relevance, and rank.

Stage 3 deep dive: the re-ranker, the least-documented and most-tactical step

The re-ranker is where AI search genuinely diverges from classic search, and it is where passage-level features beat raw domain authority — but it is also the stage with the least vendor documentation, so almost everything here is Inferred. I will not pretend otherwise.

What we have is the Princeton GEO paper [1], which is the strongest evidence in the field that passage edits move citation rates. Aggarwal et al. ran controlled interventions and found that adding citations, statistics, and quotations to a passage lifted its visibility in generative engines by up to 40% on tested queries, while keyword stuffing did roughly nothing. That is a direct measurement of the re-ranker responding to passage shape, not page authority.

| Re-rank signal | Pathway | Evidence | Direction of effect |
|---|---|---|---|
| Semantic relevance to query | RAG | Inferred (core of any re-ranker) | Strong positive |
| Answer-shape / concision (first 80-120 words) | RAG | Inferred-strong [1] | Positive; clean direct answers win |
| Citations + statistics in passage | RAG | Documented-ish (Princeton) [1] | Up to +40% on tested queries |
| Quotations from authorities | RAG | Documented-ish (Princeton) [1] | Positive |
| Schema.org parseability (FAQ/Article) | RAG | Inferred (Backlinko/Semrush correlation) [4][5] | Positive; easier extraction |
| Source authority / trust | RAG | Inferred | Positive but weaker than people think at this stage |
| Freshness | RAG | Inferred | Positive on time-sensitive queries |
| Keyword density / stuffing | RAG | Documented null result (Princeton) [1] | ~No effect |

The headline for operators: at the re-rank stage, being the cleanest, most-extractable, best-evidenced answer in the candidate set beats being the most authoritative domain in it. I have watched a DR-20 niche page get cited over a DR-80 homepage because the niche page answered the exact question in two crisp sentences with a stat and a source, and the homepage buried the answer under a hero section. Authority helps you qualify (stage 2); answer-shape helps you win (stage 3). The full factor-by-factor grading lives in the ranking factors checklist; here I only want you to know which stage each factor acts on.
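
If it helps to see the shape of that argument, here is a toy passage scorer built from the signals in the table above. Everything in it is an assumption: no vendor publishes a scoring function, the weights are made up, and the regexes are crude stand-ins for "has a statistic" and "cites a source". The only behavior it is meant to illustrate is that repeating a keyword adds nothing once the term is present, while a stat and a source do.

```python
import re

# Hypothetical weights; no vendor publishes a real scoring function.
WEIGHTS = {"relevance": 3.0, "has_stat": 1.0, "has_source": 1.0, "concise": 1.0}


def passage_features(query: str, passage: str) -> dict:
    """Extract crude 0..1 features from a passage for a given query."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    words = re.findall(r"[a-z0-9]+", passage.lower())
    return {
        # Share of query terms present; repeating a term does not raise this.
        "relevance": len(q & set(words)) / len(q) if q else 0.0,
        # A percentage figure stands in for "contains a statistic".
        "has_stat": 1.0 if re.search(r"\d+(\.\d+)?\s*%", passage) else 0.0,
        # A footnote marker or "according to" stands in for "cites a source".
        "has_source": 1.0 if re.search(r"\[\d+\]|according to", passage, re.I) else 0.0,
        # 120 words mirrors the upper end of the 80-120 word answer window.
        "concise": 1.0 if len(words) <= 120 else 0.0,
    }


def score(query: str, passage: str) -> float:
    features = passage_features(query, passage)
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


query = "cookieless attribution tool"
crisp = (
    "A cookieless attribution tool joins sessions to payments server-side. "
    "According to one study, passage edits lifted visibility by up to 40% [1]."
)
stuffed = (
    "attribution tool attribution tool cookieless attribution tool "
    "best attribution tool cheap attribution tool"
)

print(score(query, crisp), score(query, stuffed))
```

Use it as a lint for your own intros, not as a model of any engine: it tells you whether a passage has the features the Princeton interventions added, nothing more.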

Stage 4: synthesis and the mechanical nature of the citation

The citation itself is not a reward the model bestows — it is a mechanical instruction: cite the passages you were handed. This is widely misunderstood. People imagine the model "decides" to cite a brand it likes. In a RAG pipeline, the model is prompted to synthesize an answer from the retrieved passages and attach footnotes to the ones it used. The decision of which sources are even available to cite was made upstream, in stages 2 and 3. By the time the model writes, the candidate set is fixed.

| Misconception | Reality (Inferred from RAG architecture) |
|---|---|
| "The model chose to recommend me" | The re-ranker put your passage in front of the model; the model used it |
| "I need the model to like my brand" | You need to survive retrieval + re-rank; generation is downstream |
| "More authority = more citations" | Authority helps stage 2-3; it is not a generation-time lever |
| "Citations are editorial judgment" | Citations are mechanical attribution of used passages |

This reframing is liberating: you are not trying to charm a language model. You are trying to be the page that wins a search-and-rank contest, then be the passage clean enough that the model can lift it verbatim and footnote it. That is engineering, not persuasion.
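
A sketch of what "mechanical attribution" can look like in code, assuming a generic RAG setup rather than any specific engine: the retrieved passages are numbered into the prompt, and the footnotes in the answer are mapped back to URLs afterward. The prompt wording, function names, and URLs are all mine. Note that a marker pointing outside the handed-over set simply cannot resolve to a source.

```python
import re


def build_prompt(question, passages):
    """Number the retrieved passages so the model can footnote the ones it uses."""
    numbered = "\n".join(
        f"[{i}] ({url}) {text}" for i, (url, text) in enumerate(passages, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"{numbered}\n\nQuestion: {question}"
    )


def cited_urls(answer, passages):
    """Map [n] markers in the answer back to retrieved URLs, in first-use order."""
    seen = []
    for marker in re.findall(r"\[(\d+)\]", answer):
        i = int(marker) - 1
        # Markers outside the candidate set resolve to nothing.
        if 0 <= i < len(passages) and passages[i][0] not in seen:
            seen.append(passages[i][0])
    return seen


# Hypothetical candidate set, fixed before generation starts.
passages = [
    ("https://example.com/a", "Cookie-based attribution breaks when referrers are stripped."),
    ("https://example.com/b", "Server-side detection works without cookies."),
]

prompt = build_prompt("How do I track AI referrals?", passages)
answer = "Use server-side detection [2]. It is cookieless [2][5]."
print(cited_urls(answer, passages))
```

The list of citable sources is an input to this step, not an output of it, which is the point of the table above.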

Per-engine source preferences: same skeleton, different weighting

The four major engines share the same two-pathway skeleton, but they weight the pathways and the re-rank signals differently — and the differences are smaller than vendor marketing implies. Roughly 70% of citation variance is the shared mechanics; engine-specific deltas are the other 30%.

| Engine | Default pathway | Citations/answer | Retrieval intensity | Notable lean |
|---|---|---|---|---|
| Perplexity | RAG (almost always searches) [12] | 3-7 | Highest | Recency-weighted; exposes Sources tab |
| ChatGPT search | Mixed (decides per query) [10] | 3-5 | High when triggered | Bing index; concise synthesis |
| Google AI Overviews | RAG, gated by SERP [5] | 4-7 | High but rank-gated | Top-10 organic strongly favored |
| Claude (web search on) | RAG when enabled [17] | Varies | Moderate | Conservative; fewer, higher-trust cites |
| Claude / Gemini (browse off) | Training-corpus only | 0 (no links) | None | Pure recall; entity presence decides |

The operational takeaway from that table:

| If you want citations on... | Optimize primarily... | Because |
|---|---|---|
| Perplexity | RAG: concision, schema, freshness | It searches nearly every query |
| ChatGPT search | RAG + Bing indexability | Bing is the documented index [10] |
| Google AI Overviews | Classic top-10 rank first, then answer-shape | Rank gates the candidate set [5] |
| Browse-off Claude/Gemini | Training-corpus: mentions, Wikidata, entity | No search happens; only recall |

It also helps to know each engine's default posture, because that decides which pathway fires before the user does anything special:

| Engine | Default mode | Pathway you hit by default | How to flip it |
|---|---|---|---|
| Perplexity | Always searches | RAG | Effectively can't avoid RAG |
| ChatGPT (consumer) | Decides per query | Mixed | Explicit "search the web" forces RAG |
| Google AI Overviews | Triggers on ~13-15% of SERPs | RAG (rank-gated) | Only on queries that show an AIO [9] |
| Claude (chat) | Browse off unless enabled | Training-corpus | Turn on web search for RAG |
| Gemini (chat) | Browse off unless prompted | Training-corpus | Ask it to search for RAG |

Notice that the same brand can need opposite strategies on two engines. Winning Perplexity is a stage-2/stage-3 RAG problem solvable this month. Winning browse-off Gemini is a training-corpus problem solvable over quarters. If you only measure one, you will mis-allocate effort toward the engine that happens to be easy to see.

What's documented vs inferred vs speculative: the honesty table

Most "how AI chooses sources" content states inference as fact; here is exactly where the evidence runs out. I would rather you trust less and test more than believe a confident claim I cannot back.

| Claim | Label | Basis |
|---|---|---|
| Two pathways exist (recall vs RAG) | Documented | Vendors document both browse-off and search modes [10][16][17] |
| RAG has a distinct search step | Documented | SearchGPT names Bing; Perplexity names its stack [10][12] |
| ChatGPT search cites 3-5 sources | Documented | OpenAI docs [10] |
| Common Crawl underpins pretraining | Documented | Common Crawl + model docs [15] |
| Reddit is licensed for training | Documented | Reuters [13] |
| Passage edits lift citation rate up to 40% | Documented-ish | Princeton controlled study [1] |
| A re-ranker scores passages on relevance + shape | Inferred-strong | Standard RAG; Princeton consistency [1] |
| Schema improves re-rank odds | Inferred | Backlinko/Semrush correlation, plausible mechanism [4][5] |
| Wikipedia/Reddit over-weighted in corpus | Inferred-strong | Corpus presence documented; weighting inferred |
| Authority matters more at stage 2 than stage 3 | Inferred | Operator measurement + Princeton null on stuffing [1] |
| Engines use a PageRank-like authority signal | Inferred | Link-graph correlates; mechanism not published |
| Exact re-rank scoring function | Speculative | No vendor publishes it |
| "Brand affinity" influences generation | Speculative | No evidence; contradicts mechanical-citation model |
| Specific token-level citation triggers | Speculative | Pattern-matching without rigor |

The line I draw: Documented means a vendor or peer-style paper said it. Inferred means strong, multi-source observational evidence with a plausible mechanism but no confirmation. Speculative means SEO-community pattern-matching — worth testing on your own site, dangerous to bet the budget on. The whole point of grading is that the field is full of stage-3 (re-ranker) claims dressed up as documented fact, when the re-ranker is precisely the stage vendors keep closed.

How the two pathways interact (and feed each other)

The pathways are not isolated — today's RAG citations seed tomorrow's training corpus, which is why the fast pathway is also a long-term investment. This is the most important second-order mechanic and the one most strategies miss.

Walk the loop:

  1. You publish a clean, answer-shaped page (RAG-pathway optimization).
  2. It gets retrieved and cited in Perplexity/ChatGPT search within days (RAG win).
  3. Those citations and the traffic they drive generate mentions, links, and discussion.
  4. The next training crawl ingests your page, the citations of it, and the discussion.
  5. The next model generation "learns" your brand → training-corpus presence grows.
  6. Now browse-off Claude/Gemini start recommending you with no link (training-pathway win).

| Pathway interaction | Direction | Latency | Why it matters |
|---|---|---|---|
| RAG citation → training presence | Forward-feeding | 6-12 months | Fast wins compound into slow durable recall |
| Training presence → RAG candidate odds | Reinforcing (Inferred) | Continuous | Known brands may get retrieved more readily |
| Blocking training crawler | Breaks the loop | Immediate | You keep RAG, lose the compounding to recall |
| Strong recall, weak RAG pages | Divergence | | Model knows you but can't cite a clean page |

So the right mental model is not "pick a pathway." It is "win the RAG pathway now for measurable traffic, and let those wins compound into training-corpus recall over the next few model generations." The mistake is doing RAG work and never checking whether it has started moving the browse-off recommendations — because that is the higher-value, harder-to-measure outcome.

Common myths the mechanics dismantle

Once you hold the two-pathway model, most popular AI-SEO advice sorts cleanly into "true for one pathway" or "false for both." Here is the cleanup:

| Myth | Verdict | Why |
|---|---|---|
| "Schema gets you cited by AI" | True for RAG, irrelevant to recall | Re-ranker parses schema; weights are prose |
| "Just be authoritative and AI will cite you" | Half-true | Authority gates RAG stage 2 and feeds recall; not a stage-3 winner |
| "AI doesn't use SEO anymore" | False for RAG | The candidate set is a search step [10][12] |
| "Block the bots, protect your content, still get cited" | Mostly false | Blocking removes you from retrieval and/or corpus |
| "Fresh content gets cited fastest" | True for RAG, false for recall | RAG = 24-72h; recall = next model |
| "More backlinks = more AI citations" | Weak/indirect | Links help rank (stage 2); re-ranker weights passages [1] |
| "The model decides to recommend brands it likes" | False | Citation is mechanical attribution of used passages |
| "One GEO strategy works across all engines" | False | Perplexity (RAG) vs browse-off Gemini (recall) are opposite |

I have audited enough GEO programs to say the costliest myth is the last one. Teams build a single playbook, see it work on Perplexity, declare victory, and never notice that the browse-off recommendations that actually drive their highest-intent traffic never moved. They optimized the visible pathway and ignored the valuable one.

A practical decision framework: which pathway should you optimize?

Pick your pathway from the query that drives your revenue, not from the engine that is easiest to screenshot. Here is the framework I give founders.

| If your buyers ask AI... | Dominant pathway | Optimize | Realistic timeline |
|---|---|---|---|
| "Best [category] tool for [use case]?" (browse off) | Training-corpus | Mentions, Wikidata, entity, Reddit presence | Quarters |
| "Compare [you] vs [competitor]" (with search) | RAG | Comparison pages, schema, concision | Weeks |
| "How do I do [task]?" (with search) | RAG | Answer-shaped how-to passages | Days-weeks |
| "What is [your brand]?" (either) | Both | Entity presence + clean About/docs | Mixed |
| Time-sensitive / "latest [thing]" | RAG | Freshness + fast indexing | Days |

The sequencing I recommend:

| Step | Action | Pathway served | Why first |
|---|---|---|---|
| 1 | Instrument revenue by AI engine + pathway | Both | You cannot allocate blind |
| 2 | Win RAG: indexable, ranking, answer-shaped, schema | RAG | Fast, measurable, compounds |
| 3 | Build entity + mention density | Training-corpus | Slow, durable, hard to fake |
| 4 | Re-measure: did browse-off recommendations move? | Both | The valuable, invisible outcome |

Step 1 is non-negotiable and it is where almost everyone fails. The tactical get-cited playbook covers steps 2 and 3 in depth. But notice that both steps depend on step 1 being real — and step 1 is exactly what your default analytics cannot do.

The part nobody tells you: knowing the mechanics is useless unless you measure which pathway pays

You can understand the retrieval pipeline perfectly and still set your budget on fire, because the two pathways send revenue through completely different, equally invisible doors — and GA4 closes both. This is where I have to be blunt about the gap that motivated me to build Attrifast in the first place.

Walk what actually reaches you from each pathway:

| Pathway | What the visitor does | What your analytics sees |
|---|---|---|
| RAG citation | Clicks the footnote | Referrer stripped by AI client → GA4 Direct/(none) [11] |
| Training-corpus recall | Reads "I'd use Attrifast," then searches your brand later | Branded search or Direct, days later, no link to AI |

In both cases, GA4 buckets the visit as Direct/(none) and tells you nothing about which pathway, which engine, or which query drove it [11]. So the team that did six weeks of schema work (a RAG-pathway lever) and the team that spent two quarters building Reddit mention density (a training-corpus lever) get the same useless signal: a swelling Direct bucket. Neither can tell whether their pathway is the one paying.

Here is the measurement stack that separates them, and it is cookieless end to end:

| Layer | What it catches | Pathway it reveals |
|---|---|---|
| Server-side referrer fingerprinting (AI-engine domain list) | The RAG footnote click that passed any referrer | RAG, by engine |
| First-party deep-page landing detection | Unreferred deep entries from stripped RAG clicks | RAG (unreferred) |
| Branded-search + Direct lag analysis | The later visit after a browse-off recommendation | Training-corpus (inferred) |
| Stripe webhook join at payment | Which session → which paying customer | Both, by revenue |

The decisive layer is the last one. Until you join the AI-referred session to the Stripe payment, you have citation counts and traffic guesses — you do not have revenue, and revenue is the only number that tells you which pathway to fund. A brand can be cited 500 times a week on Perplexity (RAG) and earn less than a single browse-off Claude recommendation that quietly sends three high-intent buyers a month. You will fund the 500 citations because you can see them, and starve the recommendation because you can't — unless you measure at the money.
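
Here is a minimal sketch of the first, second, and last layers working together: classify each session from its referrer and landing path, then sum payments by that label. It is illustrative only and not Attrifast's implementation. The hostname list is my assumption and certainly incomplete, the "deep landing with no referrer" rule is a heuristic rather than proof of an AI click, and the payment records are plain dictionaries standing in for whatever your payment webhook delivers, with a `session_id` you would have to carry through yourself.

```python
from collections import defaultdict
from typing import Optional
from urllib.parse import urlparse

# Assumed, incomplete hostname list; verify and maintain your own.
AI_REFERRERS = {
    "chatgpt.com": "ChatGPT",
    "chat.openai.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "www.perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
    "claude.ai": "Claude",
}


def classify(referrer: Optional[str], landing_path: str) -> str:
    """Label a session by likely source using only first-party request data."""
    host = urlparse(referrer).hostname if referrer else None
    if host in AI_REFERRERS:
        return "rag:" + AI_REFERRERS[host]
    if not referrer and landing_path.strip("/"):
        # No referrer on a deep page: possibly a stripped AI click, not proof of one.
        return "possible-rag-unreferred"
    return "direct" if not referrer else "other"


def revenue_by_source(sessions, payments):
    """Join payments to sessions on session_id and total the revenue per label."""
    label = {s["session_id"]: classify(s.get("referrer"), s["landing"]) for s in sessions}
    totals = defaultdict(int)
    for p in payments:
        totals[label.get(p["session_id"], "unknown")] += p["amount_cents"]
    return dict(totals)


# Hypothetical data for illustration.
sessions = [
    {"session_id": "s1", "referrer": "https://www.perplexity.ai/search?q=x", "landing": "/blog/post"},
    {"session_id": "s2", "referrer": None, "landing": "/docs/setup"},
    {"session_id": "s3", "referrer": None, "landing": "/"},
]
payments = [
    {"session_id": "s1", "amount_cents": 2900},
    {"session_id": "s2", "amount_cents": 4900},
    {"session_id": "s3", "amount_cents": 2900},
]

revenue = revenue_by_source(sessions, payments)
print(revenue)
```

The training-corpus layer (branded-search and Direct lag analysis) is deliberately missing here; it is an inference over time series, not a per-request lookup, and does not reduce to a few lines.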

That is the architecture Attrifast ships: cookieless, consent-light, server-side referrer detection against the AI-engine domain list, first-party session identity scoped to your own domain, and a Stripe-metadata join that lands the AI-referred visit on the actual paying customer. It is the only way I know to answer "which pathway drives my revenue," and it is the question every section above quietly leads to. If you want the analytics shape of the hidden traffic before you instrument it, the ChatGPT referral analytics guide and track-chatgpt-traffic walk the detection code; revenue attribution is the join.

The mechanics in this article are real and worth knowing. But they are a map, not a destination. The destination is a dashboard that tells you, this quarter, that your RAG schema work added $1,400 of Perplexity-attributed revenue and your slow Reddit-mention grind added $3,100 of browse-off-recommendation revenue — so next quarter you shift the weight. Without that, you are optimizing pathways you have never actually seen.

Freshness and crawl mechanics: why the same edit lands in days on one pathway and never on the other

Freshness is not one thing — it behaves completely differently per pathway, and the same content edit can be live in a citation within 72 hours on RAG and invisible to recall for a year. Operators who do not separate these two clocks chase their tails wondering why "the update didn't take."

On the RAG pathway, freshness is bounded by your crawl-and-index cadence, because the citation is decided against the current index at query time. On the training-corpus pathway, freshness is bounded by the model's training cutoff, which is frozen until the next generation ships. Two clocks, wildly different speeds.

| Content change | RAG pathway latency | Training-corpus latency | Why the gap |
| --- | --- | --- | --- |
| New page published | 24-72h on well-crawled domains | Next model gen (~6-12mo) | RAG reads the live index; recall waits for retraining |
| Price/feature correction | Hours-days once re-indexed | Next model gen | Frozen weights can't re-read your page |
| New competitor comparison | Days | Next model gen | Same retrieval/recall split |
| Brand rename / rebrand | Days for RAG, but recall stays wrong | Next model gen | The hardest correction to propagate |
| Removing outdated content | Drops from RAG on re-crawl | Stays in weights until retrain | You can't unlearn the model |
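The two clocks can be sketched as a small date calculation. Everything here is a planning aid under stated assumptions: the crawl lag, the training cutoff date, and the release lag are hypothetical inputs you would have to estimate yourself, since no vendor publishes them in advance.

```python
from datetime import date, timedelta


def earliest_visibility(published, crawl_lag_days, next_training_cutoff, release_lag_days):
    """Estimate the earliest date a change could surface on each pathway.

    All inputs are the caller's own estimates (hypothetical values below).
    """
    # RAG clock: bounded by how quickly the page is re-crawled and re-indexed.
    rag = published + timedelta(days=crawl_lag_days)
    # Recall clock: bounded by the next training cutoff plus time to release.
    if published <= next_training_cutoff:
        recall = next_training_cutoff + timedelta(days=release_lag_days)
    else:
        recall = None  # missed this cutoff; wait for a later generation
    return {"rag": rag, "training_corpus": recall}


clocks = earliest_visibility(
    published=date(2026, 1, 10),
    crawl_lag_days=3,
    next_training_cutoff=date(2026, 6, 1),
    release_lag_days=90,
)
print(clocks)
```

With these made-up inputs the RAG clock lands three days after publishing, while the recall clock lands months later, which is the whole asymmetry in one function.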

The crawl side has its own per-bot mechanics worth knowing, because the bot that visits determines which pathway your content can even reach:

| Crawler | Owner | Pathway it feeds | Documented? | Block effect |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Training corpus | Documented [16] | Removes you from future ChatGPT recall |
| OAI-SearchBot | OpenAI | RAG search index | Documented [16] | Removes you from ChatGPT search retrieval |
| ChatGPT-User | OpenAI | RAG live fetch | Documented [16] | Removes you from on-demand fetches |
| ClaudeBot | Anthropic | Training corpus | Documented [17] | Removes you from future Claude recall |
| Google-Extended | Google | Gemini training | Documented [19] | Opts you out of Gemini training only |
| Googlebot | Google | Live index → AI Overviews | Documented | Removes you from Search + AIO |
| PerplexityBot | Perplexity | RAG index | Documented [12] | Removes you from Perplexity retrieval |
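If you want to see which pathways are actually reaching your content, one rough approach is to classify server-log user agents by the crawler tokens in the table above. This is a sketch, not a verified detector: substring matching on the User-Agent header is a simplifying assumption, user agents can be spoofed, and the pathway labels are copied from the table rather than from any vendor API. Google-Extended is left out because it is a robots.txt control token and does not appear as a separate user agent in requests.

```python
from collections import Counter

# Token -> pathway, taken from the crawler table; labels are this article's.
CRAWLER_PATHWAYS = {
    "GPTBot": "training-corpus",
    "OAI-SearchBot": "rag-search-index",
    "ChatGPT-User": "rag-live-fetch",
    "ClaudeBot": "training-corpus",
    "PerplexityBot": "rag-index",
    "Googlebot": "live-index",
}


def pathway_for_user_agent(user_agent):
    """Return (crawler_token, pathway) for a User-Agent string, or None."""
    ua = (user_agent or "").lower()
    for token, pathway in CRAWLER_PATHWAYS.items():
        if token.lower() in ua:
            return token, pathway
    return None


def count_pathway_hits(user_agents):
    """Tally crawler visits per pathway from an iterable of User-Agent strings."""
    hits = Counter()
    for ua in user_agents:
        match = pathway_for_user_agent(ua)
        if match:
            hits[match[1]] += 1
    return hits


sample = [
    "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0",
]
print(count_pathway_hits(sample))
```

A site that sees plenty of `rag-*` hits and no `training-corpus` hits is a candidate for checking whether a training crawler is blocked somewhere upstream.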

The honest reading: selective blocking lets you keep the fast pathway while opting out of the slow one, but almost no SMB benefits from blocking anything. The freshness asymmetry is the real lesson: when you ship an important correction, the RAG surfaces will tell the truth within days and the browse-off recall will keep repeating the old version until the next model. Plan your messaging around both clocks. Google is the clearest worked example of this multi-clock behavior, and the guide on where Google AI gets its information breaks its four stitched-together sources down by exactly these update cadences.

Putting it together: a per-pathway optimization scorecard

The clean way to act on all of this is a single scorecard that maps each lever to its pathway, its evidence grade, and its latency — so you stop spending RAG effort hoping for recall results. This is the table I actually use when auditing a site.

| Lever | Pathway | Evidence | Latency | Effort |
| --- | --- | --- | --- | --- |
| Indexability + classic rank | RAG (stage 2) | Documented search step | Days-weeks | Medium |
| Answer-shaped first 80-120 words | RAG (stage 3) | Inferred-strong [1] | Days | Low |
| Citations + stats in passage | RAG (stage 3) | Documented-ish [1] | Days | Low |
| FAQ / Article schema | RAG (stage 3) | Inferred [4][5] | Days | Low |
| Freshness on time-sensitive pages | RAG (stages 2-3) | Inferred | Days | Ongoing |
| Wikidata + Organization schema + sameAs | Both (entity) | Inferred-strong | Weeks (KG) | Medium |
| Reddit / forum mention density | Training-corpus | Inferred-strong [13] | Quarters | High |
| Wikipedia presence | Training-corpus | Inferred-strong | Quarters | Very high |
| Publisher / earned-media mentions | Training-corpus | Inferred | Quarters | High |
| Allowing the training crawlers | Training-corpus | Documented [16][17] | Next gen | Trivial |

And the diagnostic side — given a symptom, which pathway is failing:

| Symptom | Failing pathway | First thing to check |
| --- | --- | --- |
| Cited on Perplexity, ignored browse-off | Training-corpus | Entity presence + mention density |
| Recommended browse-off, never a footnote | RAG | Indexability + answer-shape |
| Cited but wrong/outdated facts | Training-corpus (recall) | Wait for next gen; fix live index for RAG |
| Zero presence anywhere | Both | Are crawlers blocked? Are you indexed at all? |
| Citations rising, revenue flat | Measurement | Join sessions to Stripe by pathway |
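The diagnostic table above can be encoded as a small lookup so an audit script returns the same answer every time. This is only a sketch: the four boolean inputs and the `diagnose` name are my own simplification of the symptoms, and the "citations rising, revenue flat" row is approximated as "some presence, revenue not rising."

```python
def diagnose(cited_in_rag, recommended_browse_off, facts_current=True, revenue_rising=True):
    """Map observed symptoms to (failing pathway, first thing to check) pairs."""
    if not cited_in_rag and not recommended_browse_off:
        # Zero presence anywhere: rule out crawl and index problems first.
        return [("Both", "Are crawlers blocked? Are you indexed at all?")]
    findings = []
    if cited_in_rag and not recommended_browse_off:
        findings.append(("Training-corpus", "Entity presence + mention density"))
    if recommended_browse_off and not cited_in_rag:
        findings.append(("RAG", "Indexability + answer-shape"))
    if not facts_current:
        findings.append(("Training-corpus (recall)", "Wait for next gen; fix live index for RAG"))
    if not revenue_rising:
        findings.append(("Measurement", "Join sessions to Stripe by pathway"))
    return findings


# The founder from the introduction: cited on Perplexity, ignored by ChatGPT.
for pathway, check in diagnose(cited_in_rag=True, recommended_browse_off=False):
    print(pathway, "->", check)
```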

That last row is the one I keep returning to, because it is the failure mode that survives even a perfect understanding of the mechanics. You can grade every lever correctly and still not know whether the work paid. The fix is not more mechanics — it is measurement at the money.

FAQ

How do AI engines choose which sources to cite?

Through one of two mechanically different pathways. In the training-corpus pathway, the model recalls your brand from frozen pretraining knowledge with no live fetch — citation here reflects how densely and trustworthily your information appeared in the crawl, which is why Wikipedia, Reddit, and established publishers dominate. In the live-retrieval (RAG) pathway, the engine runs a real search at query time, retrieves a candidate set, re-ranks it on relevance plus source plus answer-shape, then cites the passages it actually used. The first pathway rewards years of accumulated authority; the second rewards classic searchability plus concise, schema-marked, fresh passages. The fatal mistake is optimizing without naming which pathway the query you care about uses.

What is the difference between training-corpus citation and RAG citation?

Training-corpus citation fires when the engine answers from memory with browsing off (Claude plain chat, Gemini no-browse, ChatGPT when it skips search). It usually produces an unlinked recommendation because nothing was fetched, and it changes only on the model's 6-12 month training cycle. RAG citation fires when the engine searches the live web, and it produces a clickable footnote to a page retrieved seconds earlier; it can change within 24-72 hours of you publishing. They demand opposite work: slow corpus-wide authority building for the first, fast on-page searchability and answer-shape for the second.

Does ChatGPT search the live web or answer from training memory?

Both, decided per query. Per OpenAI's documentation, ChatGPT search runs a retrieval-augmented pipeline against a search index (Bing is the documented partner) and cites 3-5 sources when it browses. But ChatGPT also answers many queries from frozen training knowledge with no search and no citations. The model decides each time whether to invoke the search tool. That is why the same brand can be a footnote on one query (RAG pathway) and an unlinked mention on another (training-corpus pathway), and why a single optimization strategy underperforms.

What gets a source into the candidate set for retrieval?

Mostly classic search criteria, because the candidate set is built by an actual search step before any AI re-ranking. The page has to be indexed, crawlable, topically relevant to the rewritten retrieval query, and ideally already ranking for it. The Princeton GEO research and broad operator measurement both point to top-organic pages being far likelier to enter the pool. Schema and concision do not get you into the candidate set — they help you survive the next stage, the re-ranker. Getting in is a searchability problem; staying in is a passage-quality problem.

How does the re-ranking step decide which passages survive?

This is the least-documented stage, so most of the answer is Inferred. After retrieval returns a candidate set, a re-ranker scores passages on semantic relevance, source signals, and answer-shape. The strongest evidence we have is the Princeton GEO paper, which showed that adding citations, statistics, and quotations to a passage lifted its generative-engine visibility by up to 40% on tested queries, while keyword-stuffing did essentially nothing. That is direct evidence the re-ranker weights passage-level features, not just domain authority. Concision, schema parseability, and freshness all plausibly help, but no vendor publishes the exact scoring function.
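To show what "weights passage-level features" could mean in practice, here is a toy re-ranker. It is purely illustrative and entirely Inferred: the features, the regexes, and every weight are invented for the sketch, and no engine is known to score passages this way.

```python
import re


def passage_score(query, passage, w_relevance=1.0, w_stat=0.3, w_cite=0.3, w_quote=0.3):
    """Toy score: query-term overlap plus bonuses for stats, citations, quotes."""
    q_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    p_terms = set(re.findall(r"[a-z0-9]+", passage.lower()))
    relevance = len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0
    has_stat = bool(re.search(r"\d+(\.\d+)?\s*%", passage))   # e.g. "40%"
    has_cite = bool(re.search(r"\[\d+\]", passage))           # e.g. "[1]"
    has_quote = bool(re.search(r'"[^"]{10,}"', passage))      # a quoted sentence
    return (w_relevance * relevance + w_stat * has_stat
            + w_cite * has_cite + w_quote * has_quote)


def rerank(query, passages, top_k=3):
    """Return the top_k passages by toy score, highest first."""
    return sorted(passages, key=lambda p: passage_score(query, p), reverse=True)[:top_k]


candidates = [
    "Our CRM pricing is great.",
    "CRM pricing starts at $29 per seat; 40% of teams pay less [1].",
]
print(rerank("crm pricing", candidates, top_k=1))
```

Both candidates match the query terms equally, so the only thing separating them is passage shape. That is the behavior the Princeton result points toward, even though the real scoring function is unpublished.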

Why does AI cite Wikipedia and Reddit so often?

Because both are massively over-represented in pretraining corpora and both are treated as high-trust for their query types. Wikipedia seeds the entity data that major models appear to rely on for disambiguation, so it surfaces in both pathways constantly. Reddit has licensed its content to Google for AI training and is heavily weighted for opinion, recommendation, and lived-experience queries where the model wants human consensus. Neither passes you classic backlink equity, but consistent presence on either accumulates training-corpus authority that pure on-site optimization cannot replicate.

Can I get cited if I block the AI crawlers?

Only on the RAG pathway, and only if you block selectively. Blocking the training crawler (GPTBot) while allowing the live-fetch and search agents (ChatGPT-User, OAI-SearchBot) keeps you retrievable at query time but removes you from future pretraining, slowly eroding your training-corpus recall. Blocking everything forfeits both pathways. For most SMB SaaS and ecommerce sites in 2026, allowing the crawlers is the right call, because training-corpus presence — the browse-off recommendation — is the durable asset you cannot buy back on a short timeline.
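Selective blocking is easy to get wrong, so it is worth checking a robots.txt draft before deploying it. The sketch below uses Python's standard `urllib.robotparser` to confirm which agents a draft would admit. The robots.txt content and the `example.com` URL are illustrative assumptions, and the parser reflects the standard library's interpretation of the rules, which may differ in edge cases from how each vendor's crawler reads them.

```python
from urllib.robotparser import RobotFileParser

# Illustrative draft: block the training crawler, keep search and live fetch.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""


def crawler_access(robots_txt, agents, url="https://example.com/pricing"):
    """Return {agent: allowed?} for a robots.txt body and a list of agent tokens."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = crawler_access(
    ROBOTS_TXT, ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]
)
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this draft only GPTBot is blocked, which matches the selective case described above: still retrievable at query time, absent from future pretraining.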

How long does it take to start getting cited by AI?

Pathway-dependent. On RAG, fresh content can be cited within 24-72 hours of indexing on a well-crawled domain, because the decision is made at query time against the current index. On the training-corpus pathway, you are waiting for the next model generation — a 6-12 month cutoff cycle plus the lead time for your content to be crawled and learned before the cutoff. If you need citations this quarter, optimize RAG; if you want the model to recommend you with browsing off, you are playing a multi-quarter game.

Which AI engine is easiest to get cited by, and why?

Perplexity, because it is the most retrieval-heavy major engine — it runs a live search on nearly every query and surfaces 3-7 citations per answer, with a Sources tab exposing the wider retrieved set. That makes the RAG pathway dominant, so classic searchability plus answer-shaped passages get you in quickly. ChatGPT search is next, then Google AI Overviews (gated by top-10 organic rank). The hardest surface is base-model knowledge in a no-browse chat, which is pure training-corpus recall you cannot influence without being in the next training run.

Does structured data (schema) actually affect AI citations?

On the RAG pathway, the evidence says yes and I label it Inferred-strong. Schema.org JSON-LD makes a passage trivially parseable, so the re-ranker does not have to guess what is a question, answer, price, or author. Multiple observational studies find AI-cited pages carry FAQ and Article schema at higher rates, and the mechanism is plausible: a clean, machine-readable answer is easier to extract and trust. On the training-corpus pathway, schema matters far less, because the model learned prose, not markup. So schema is mostly a RAG-pathway lever — another reason naming the pathway first changes what you build.
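For reference, FAQ markup in Schema.org JSON-LD has a small, fixed shape: a `FAQPage` whose `mainEntity` is a list of `Question` items, each with an `acceptedAnswer`. A minimal sketch of generating it follows; the `faq_jsonld` helper and the sample question are hypothetical, and whether any given engine reads the markup remains Inferred.

```python
import json


def faq_jsonld(pairs):
    """Build a Schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


doc = faq_jsonld([
    ("Does structured data affect AI citations?",
     "On the RAG pathway the evidence suggests yes; on training-corpus recall, far less."),
])
# Embed the output in a <script type="application/ld+json"> tag on the page.
print(json.dumps(doc, indent=2))
```

The answer text in the markup should match the visible answer on the page, so the machine-readable version and the prose version never disagree.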

Do AI engines use Google's PageRank to choose sources?

Not directly, but they almost certainly use a similar citation-graph authority signal derived from web crawl data, and I label this Inferred. Pages from high-PageRank-style domains (Wikipedia, government, established publishers) get cited disproportionately, which is consistent with a link-graph authority signal but does not prove the mechanism. Critically, that authority appears to act more at the candidate-set stage (does this page rank well enough to be retrieved) than at the re-rank stage, where the Princeton evidence shows passage shape can beat raw authority.

Is the citation a judgment the model makes, or a mechanical step?

Mechanical, on the RAG pathway. The model is prompted to synthesize an answer from the retrieved passages and attach footnotes to the ones it used. The decision of which sources are even available to cite was made upstream in retrieval and re-ranking; by the time the model writes, the candidate set is fixed. So "the model chose to recommend me" really means "the re-ranker put my passage in front of the model and it was clean enough to use." You are not charming a model — you are winning a search-and-rank contest.

How do the two pathways feed each other?

In one direction: forward, from RAG into the training corpus. A clean page wins RAG citations within days; the resulting traffic, mentions, and discussion can be ingested by the next training crawl; the next model generation learns your brand; and then browse-off engines start recommending you with no link. So fast RAG wins compound into slow, durable training-corpus recall over a few model generations. The strategic implication: do RAG work now for measurable traffic, but keep checking whether it has started moving the higher-value, harder-to-see browse-off recommendations.

Why can't GA4 tell me which pathway drives my AI revenue?

Because both pathways usually arrive looking like Direct/(none). A RAG footnote click often has its referrer stripped by the AI client; when that happens, GA4 sees no referrer and no UTM and files the visit as Direct. A training-corpus recommendation produces a later branded search or Direct visit with no link back to the AI engine at all. GA4 has no built-in rule to separate AI traffic, let alone separate the two pathways, and it cannot join either to revenue. You need server-side referrer fingerprinting against an AI-engine domain list plus a first-party identity and a Stripe-payment join to see which pathway actually pays.

How do I measure which pathway is driving my revenue?

Four layers, all cookieless. First, server-side referrer fingerprinting against a known AI-engine domain list catches the RAG footnote clicks that pass any referrer, broken out by engine. Second, first-party deep-page landing detection catches the unreferred RAG clicks where the referrer was stripped. Third, branded-search and Direct lag analysis approximates the training-corpus recommendations that send delayed visits. Fourth — the decisive one — a Stripe webhook join lands the AI-referred session on the actual paying customer, so you see revenue by pathway, not just citation counts. That join is the entire reason citation-tracking tools alone leave you guessing, and it is what Attrifast was built to close.
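The fourth layer, the join, can be sketched in a few lines. This is an assumption-heavy illustration rather than Attrifast's code: it presumes you stored a first-party session ID in the Stripe Checkout Session's `metadata` at checkout time (the `session_id` key is hypothetical), and it processes already-parsed `checkout.session.completed` webhook payloads as plain dictionaries instead of calling the Stripe API.

```python
from collections import defaultdict


def revenue_by_pathway(sessions, payment_events, metadata_key="session_id"):
    """Sum paid amounts per pathway.

    sessions: {session_id: {"pathway": "rag" | "training-corpus" | ...}}
    payment_events: parsed Stripe webhook events (dicts).
    Amounts stay in the currency's minor unit (e.g. cents), as Stripe reports them.
    """
    totals = defaultdict(int)
    for event in payment_events:
        if event.get("type") != "checkout.session.completed":
            continue
        obj = event["data"]["object"]
        session_id = (obj.get("metadata") or {}).get(metadata_key)
        # Payments with no known session are kept visible, not silently dropped.
        pathway = sessions.get(session_id, {}).get("pathway", "unattributed")
        totals[pathway] += obj.get("amount_total") or 0
    return dict(totals)


sessions = {"s1": {"pathway": "rag"}, "s2": {"pathway": "training-corpus"}}
events = [
    {"type": "checkout.session.completed",
     "data": {"object": {"amount_total": 140000, "metadata": {"session_id": "s1"}}}},
    {"type": "checkout.session.completed",
     "data": {"object": {"amount_total": 310000, "metadata": {"session_id": "s2"}}}},
    {"type": "checkout.session.completed",
     "data": {"object": {"amount_total": 5000, "metadata": {}}}},
]
print(revenue_by_pathway(sessions, events))
```

Keeping an explicit `unattributed` bucket matters: its size tells you how much revenue the first three layers are still failing to tag.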

Related reading from the Attrifast research stack

This article covers the mechanics — the retrieval pipeline and training-corpus weighting that decide which pages get cited. For the factor list the mechanics produce — what observable signals correlate with citation — see AI Search Ranking Factors 2026. For more on connected topics, see ChatGPT Query Fan-Out, Explained for Attribution Operators (2026), How to Submit Content to AI Search Engines for Faster Discovery in 2026, Why Bing SEO Now Matters for ChatGPT and Copilot Visibility in 2026, and Schema Markup for AI Search.
