The actual retrieval and training mechanics behind AI citations — the two pathways (pretraining corpus vs live RAG), how each scores sources, and which one drives your revenue.
A founder messaged me in March, frustrated. He had spent six weeks adding FAQ schema, tightening his intros, and chasing the standard GEO checklist, and his Perplexity citations had genuinely climbed. But ChatGPT, when his prospects asked it to recommend a tool in his category, still did not mention him. He could not understand why the same "AI optimization" worked on one engine and did nothing on the other.
The answer is that he was optimizing one pathway and measuring a different one. His schema and concision work moved the live-retrieval (RAG) pathway, which is exactly why Perplexity — the most retrieval-heavy engine — responded. But the ChatGPT query he cared about was answering from frozen training knowledge with no live search, so it was running on the training-corpus pathway, where six weeks of on-page work changes nothing. He needed two different strategies and a way to tell which one was paying.
This is the deeper companion to the AI search ranking factors checklist. That piece lists the 12 signals and grades them. This piece is the layer underneath: the actual retrieval and training mechanics that decide why a signal helps at all, and the single distinction — training corpus versus live retrieval — that most articles never make. If you have read the factors piece, this is where the "why" lives. Everything below is labeled Documented, Inferred, or Speculative, because the vendors publish far less than the confident blog posts imply.
| Metric | Value |
| --- | --- |
| Time-to-citation, RAG pathway | 24-72 hours on a well-crawled domain (directional) |
| Time-to-citation, training pathway | One model generation (~6-12 month cutoff cycle) |
| Citations per Perplexity answer | 3-7 [12] |
| Citations per ChatGPT search answer | 3-5 [10] |
| Citations per Google AI Overview block | 4-7 [5] |
| Princeton GEO max citation lift from passage edits | Up to 40% on Perplexity / BingChat [1] |
| Documented partner search index for ChatGPT search | Bing (per SearchGPT announcement) [10] |
| Corpora over-represented in pretraining | Wikipedia, Reddit, established publishers [13][15] |
| GA4 default attribution accuracy for either pathway | Roughly 0% (both bucket as Direct/(none)) [11] |
The Princeton GEO paper is the empirical spine of this article. It is the closest thing the field has to a controlled experiment on what moves AI citation rates, and it specifically isolates passage-level interventions, which makes it the right anchor for the re-ranking discussion. Everything sourced to agency studies (Ahrefs, Semrush, Backlinko) is observational at scale — directionally useful, weaker on causation. I weight evidence accordingly and say so inline.
The one distinction that reorganizes everything: two pathways, not one algorithm
The single most useful sentence in this article: an AI engine cites you through one of two mechanically different pathways, and the work that wins one does almost nothing for the other. Get this and the rest follows; miss it and you will keep optimizing blind.
Most "how does AI choose sources" content treats citation as a single black box with a list of factors going in. That framing is wrong in a way that costs real money. There are two systems:
| Pathway | What happens | When it fires | What it rewards |
| --- | --- | --- | --- |
| 1. Training-corpus citation | The model recalls your brand/fact from frozen pretraining knowledge. No live fetch. | Browse-off chat: Claude plain, Gemini no-browse, ChatGPT when it skips search | Corpus-wide authority accumulated over years (Wikipedia, Reddit, publisher mentions) |
| 2. Live-retrieval citation (RAG) | The engine runs a real search, retrieves docs, re-ranks them, cites the passages used | Browse-on: ChatGPT search, Perplexity, AI Overviews, Claude with web search | Classic searchability plus concise, schema-marked, fresh passages |
The reason this distinction is invisible to most operators is that the output looks identical. In both cases the model says "I recommend Attrifast" or surfaces a fact about a brand. But in Pathway 1 there is usually no clickable link — the model is recalling, not fetching — and in Pathway 2 there is a numbered footnote pointing at a page that was retrieved seconds earlier. Same-looking answer, completely different machinery, completely different levers.
Here is the honest version of what each pathway costs to influence:
| Dimension | Training-corpus pathway | RAG pathway |
| --- | --- | --- |
| Update latency | 6-12 months (next model) | 24-72 hours (next crawl) |
| Primary lever | Corpus-wide mentions + entity presence | Indexable, ranking, answer-shaped pages |
| Hardest to fake | Yes: accumulated authority | No: passage edits move it (Princeton) |
| Linkable citation? | Usually no (recall, no fetch) | Yes (footnote to retrieved page) |
| Evidence strength | Inferred (corpora are documented; weighting is not) | Mixed: search step Documented, re-rank Inferred |
| Who wins by default | Old, authoritative, heavily-mentioned brands | Anyone who is searchable + concise + fresh |
I will walk each pathway's mechanics in full. But the framing comes first because every later section is "which pathway does this apply to?" If you only remember one thing, remember that freshness, schema, and concision are RAG-pathway levers, and Wikipedia/Reddit/mention-density are training-corpus levers, and you have to decide which one the query you care about actually uses.
Pathway 1: training-corpus citation — the model already knows you (or doesn't)
Training-corpus citation is the model recalling, from frozen pretraining, that your brand exists — and you earned that recall years ago, through the open web, whether you optimized for it or not. No live fetch happens. The model is not reading your site; it is reproducing a statistical shadow of your site left in its weights.
This is the pathway nobody can manipulate quickly, which is exactly why it is the most durable and the most valuable for brand-level recommendations. When a prospect opens Claude and types "what's a good cookieless attribution tool for a small SaaS," and Claude answers without browsing, the brands it names are the brands whose information was dense and trustworthy enough in the pretraining corpus to survive into the model's weights.
What actually goes into the corpus
The pretraining corpus for every major LLM is built primarily from large web crawls. Documented: Common Crawl is a publicly available crawl of billions of pages that underpins a large share of LLM training data [15]. Documented: OpenAI's GPTBot and Anthropic's ClaudeBot are the named training crawlers, Google-Extended is the robots.txt token that controls whether Google may use your content for Gemini training, and all three are honored through robots.txt [16][17]. Documented: Reddit licensed its content to Google for AI training per Reuters [13]. Beyond those facts, the exact composition and weighting of any frontier model's corpus is not published; that part is Inferred from research, leaks, and behavior.
| Corpus source | Documented? | Why it's weighted heavily (Inferred) |
| --- | --- | --- |
| Wikipedia / Wikidata | Documented presence in nearly all LLM corpora | High edit quality, entity structure, broad coverage |
| Common Crawl (open web) | Documented [15] | The bulk substrate of pretraining text |
| Reddit | Licensing deal Documented [13] | Human consensus, recommendation/experience signal |
| Established news + publishers | Inferred | High editorial trust, frequent citation by others |
| Books / academic | Inferred | Density, formality, factual reliability |
| Brand-owned docs/sites | Inferred (only if crawled before cutoff) | Direct, structured product facts |
The practical reading of that table: the training-corpus pathway over-rewards sources that already had web-scale authority and broad mention density. A brand with a Wikipedia page, a steady drip of Reddit and Hacker News mentions, and coverage in established publications accumulates a presence the model "remembers." A brand-new SaaS with a great site and zero external mentions is, to the training-corpus pathway, nearly invisible until the world talks about it and the next model learns from that conversation.
Which corpus source the model leans on also shifts with the type of question, which is why the same brand can be recalled confidently for one query and not another (all rows below Inferred from behavior, not vendor-documented):
| Query type (browse-off) | Corpus source the model leans on | Who tends to win |
| --- | --- | --- |
| "What is [X]?" (definition) | Wikipedia / encyclopedic | Entities with a wiki/Wikidata page |
| "Best [category] tool?" (recommendation) | Reddit / forums / reviews | Brands with mention density |
| "How do I [task]?" (procedure) | Docs / tutorials / Stack Overflow | Sites with clean how-to content |
| "Is [brand] legit?" (trust) | Reviews + publisher coverage | Brands with earned media |
| "[Brand A] vs [Brand B]" (comparison) | Forums + comparison content | Whoever the web compares openly |
Why corrections are slow on this pathway
A wrong fact in the training pathway is frozen until the next model ships — typically a 6-12 month cutoff cycle. This is the source of the maddening "AI keeps saying my old price" problem. The model learned your $49 price before you dropped to $29, and no on-site edit reaches the frozen weights.
| Symptom | Cause | Lever | Latency |
| --- | --- | --- | --- |
| AI recommends a competitor, never you (browse-off) | You're absent/thin in the corpus | Build mentions + entity presence | Next model gen |
| AI cites outdated pricing/features | Old data frozen in weights | Get fresh facts into the live index (helps RAG, not recall) | Next model gen for recall |
| AI confuses your brand with another | Weak entity disambiguation | Wikidata, Organization schema, sameAs | Weeks for KG, next gen for recall |
| AI "knows" you well, recommends confidently | Dense, consistent corpus presence | Maintain it; this is the asset | Already won |
I want to be honest about the limits of optimization here. You cannot make a model recall you faster than its training cycle. What you can do is make sure that by the next training run, the open web's picture of your brand is dense, consistent, and entity-disambiguated. That is a slow, compounding game — and it is the single most under-instrumented one, because the payoff arrives as browse-off recommendations that send no clickable referrer and therefore never show up cleanly in anyone's analytics. We will close that loop at the end. For the entity-presence mechanics specifically, the Wikipedia effect on AI visibility goes deeper than I can here, and Reddit's outsized role in AI citations covers the mention-density side.
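For the entity-disambiguation lever named in the table above (Wikidata, Organization schema, sameAs), a minimal sketch of the markup could look like this. The brand name, URLs, and Wikidata ID are placeholders, not real identifiers; the property names (`@context`, `@type`, `sameAs`) are standard schema.org vocabulary, and the helper `organization_jsonld` is a hypothetical name used only for illustration.

```python
import json


def organization_jsonld(name, url, same_as):
    """Build a schema.org Organization object as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        # sameAs links the brand to its profiles elsewhere on the web,
        # which is what helps disambiguate it from similarly named entities.
        "sameAs": list(same_as),
    }
    return json.dumps(data, indent=2)


markup = organization_jsonld(
    name="ExampleBrand",  # placeholder brand
    url="https://www.example.com",
    same_as=[
        "https://www.wikidata.org/wiki/Q00000000",  # placeholder Wikidata ID
        "https://www.linkedin.com/company/examplebrand",  # placeholder profile
    ],
)

# Wrap the JSON-LD in the script tag that would go in the page <head>.
print(f'<script type="application/ld+json">\n{markup}\n</script>')
```

Whether and how strongly any engine uses this markup is Inferred, as the table says; the sketch only shows the shape of the data.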
Pathway 2: live-retrieval (RAG) — the engine searches, ranks, then cites
RAG citation is mechanically a search engine bolted to a language model: the engine runs a real query against an index, gets a candidate set, re-ranks it, and the model cites the passages it was handed. This is the pathway you can move this week, and it is where almost all "GEO tactics" actually operate, whether or not they say so.
The acronym is retrieval-augmented generation. The key word is retrieval — there is a genuine search step before any generation, and that step behaves a lot like classic search. Documented: OpenAI's SearchGPT announcement names Bing as the partner search index for ChatGPT search [10]. Documented: Perplexity uses its own crawler (PerplexityBot) plus partner search APIs and exposes a Sources tab [12]. Documented: Google AI Overviews synthesize at query time from the live search index, not from Gemini's training data [5]. The exact retrieval and re-ranking algorithms are not published — those steps are Inferred from research and behavior.
Here is the full pipeline, stage by stage, with what's documented at each step:
| Stage | What it does | Evidence | What moves it |
| --- | --- | --- | --- |
| 1. Query rewrite | Turns the user's chat into one or more retrieval queries | Inferred (standard RAG practice) | Clear topical targeting; matching real query phrasing |
The reason this article keeps separating stage 2 from stage 3 is that they are different problems with different levers, and conflating them is the modal GEO mistake. Getting into the candidate set is a searchability problem — if your page is not indexed and ranking for the rewritten query, it never enters the pool, and no amount of schema saves it. Surviving the re-ranker is a passage-quality problem — once you are in the pool, authority matters less and answer-shape, concision, and parseability matter more.
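To make the stage 2 versus stage 3 split concrete, here is a toy sketch of a retrieve-then-re-rank pipeline. Everything in it is an assumption for illustration: the `Page` type, the word-overlap scoring, and the 40-word lead window are invented, and no engine publishes its actual scoring. The point is only the shape: an unindexed page never enters the pool, and within the pool the opening passage decides the order.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    indexed: bool = True


def tokens(s):
    """Lowercased word set with surrounding punctuation stripped."""
    cleaned = (w.strip(".,?!:;()").lower() for w in s.split())
    return {w for w in cleaned if w}


def candidate_set(query, pages, k=3):
    """Stage 2: only indexed pages that overlap the query enter the pool."""
    q = tokens(query)
    scored = [(len(q & tokens(p.text)), p) for p in pages if p.indexed]
    scored = [(s, p) for s, p in scored if s > 0]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in scored[:k]]


def rerank(query, candidates, lead_words=40):
    """Stage 3: score the opening passage, not the whole page."""
    q = tokens(query)

    def passage_score(p):
        lead = " ".join(p.text.split()[:lead_words])
        has_number = any(ch.isdigit() for ch in lead)
        return len(q & tokens(lead)) + (1 if has_number else 0)

    return sorted(candidates, key=passage_score, reverse=True)


pages = [
    # Perfect answer, but not indexed: it can never be scored.
    Page("https://example.com/unindexed",
         "Cookieless attribution tool pricing starts at 29 dollars.",
         indexed=False),
    # Indexed, but the answer is buried under filler.
    Page("https://example.com/home",
         "Welcome to our platform. " * 20
         + "We offer a cookieless attribution tool."),
    # Indexed, direct answer with a figure up front.
    Page("https://example.com/answer",
         "A cookieless attribution tool tracks revenue without cookies. "
         "Plans start at 29 dollars."),
]

query = "cookieless attribution tool"
pool = candidate_set(query, pages)
ranked = rerank(query, pool)
print([p.url for p in ranked])
```

In this toy run both indexed pages qualify for the pool with the same overlap, and the re-rank step moves the direct-answer page ahead of the homepage that buries it.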
Stage 1 deep dive: query rewrite is where your keyword targeting still matters
Before anything gets retrieved, the engine rewrites a messy human chat into one or more clean retrieval queries — and the page that matches the rewritten query, not the user's literal words, is the one that enters the pool. This stage is invisible and frequently ignored, but it quietly decides which candidate set even gets assembled.
A user types "whats a cheap tool to see if chatgpt is sending me sales." The engine does not search that verbatim. It rewrites it into something like "cookieless ChatGPT referral revenue attribution tool" and may fan it out into several sub-queries. Inferred (this is standard RAG practice, not vendor-documented per engine): the rewrite step normalizes intent, expands synonyms, and splits multi-part questions.
| Rewrite behavior | Pathway | Evidence | Implication for you |
| --- | --- | --- | --- |
| Verbatim chat → normalized query | RAG | Inferred (standard practice) | Optimize for intent, not the literal phrasing |
| Synonym + entity expansion | RAG | Inferred | Entity-clear pages match more rewrites |
| Multi-part question fan-out | RAG | Inferred | One page can match several sub-queries |
| Conversation-context carry-over | RAG | Inferred | Earlier turns shape the rewrite |
| No rewrite (recall path) | Training-corpus | N/A | No query is issued at all |
The practical consequence: you cannot keyword-match your way in by copying the user's exact words, because you never see them — you see the rewrite. What survives the rewrite is clear topical and entity targeting, which is the opposite of keyword stuffing. This is also why the same page can be pulled into the candidate set for queries phrased a dozen different ways: they all rewrite to the same normalized intent.
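The rewrite behaviors above (normalization, synonym expansion, fan-out) can be sketched in a few lines. This is a deliberately crude stand-in: production engines presumably use a model for this step rather than a lookup table, and the `SYNONYMS` map, `normalize`, and `fan_out` are hypothetical names invented for the example.

```python
import re

# Hypothetical synonym map; a real engine would not use a fixed table.
SYNONYMS = {
    "cheap": "low-cost",
    "whats": "what is",
    "sales": "revenue",
}


def normalize(chat):
    """Lowercase, drop punctuation, and swap in canonical terms."""
    words = re.findall(r"[a-z0-9']+", chat.lower())
    return " ".join(SYNONYMS.get(w, w) for w in words)


def fan_out(chat):
    """Split a multi-part question on 'and' into separate retrieval queries."""
    parts = re.split(r"\band\b", chat, flags=re.IGNORECASE)
    return [normalize(p) for p in parts if p.strip()]


queries = fan_out(
    "Whats a cheap attribution tool, and does it track ChatGPT sales?"
)
print(queries)

# Differently phrased chats collapse to the same normalized query.
same = normalize("CHEAP attribution tool?") == normalize("cheap attribution tool")
print(same)
```

The second check is the practical point from the paragraph above: phrasing varies, the normalized intent does not, so the page has to match the intent.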
Stage 2 deep dive: what gets a source into the candidate set
The candidate set is assembled by a search step, so the entry criteria are mostly classic search criteria — which is why "AI search killed SEO" is exactly backwards on this pathway. If you cannot get into the candidate pool, the cleverest passage in the world never gets scored.
| Candidate-set gate | Pathway | Evidence | Notes |
| --- | --- | --- | --- |
| Page is indexed | RAG | Documented (search step exists) | Non-indexed = invisible to retrieval |
| Ranks for rewritten query | RAG | Inferred-strong | Top-10 organic correlates with AIO inclusion [5] |
| Crawlable (not blocked) | RAG | Documented (bot docs) [16] | Blocking search bots removes you |
| Topically relevant to intent | RAG | Inferred | Semantic match to the rewritten query |
| Freshness within query window | RAG | Inferred | Time-sensitive queries favor recent pages |
| Domain in partner index | RAG | Documented (Bing for ChatGPT) [10] | If Bing can't find you, ChatGPT search can't either |
This is the stage where I tell people to stop treating AI search as exotic. The candidate-set gate is roughly "would classic search return this page for the query?" If yes, you are eligible for the re-ranker. If no, you are not in the conversation. How to rank in ChatGPT is largely a stage-2 problem, and the classic-SEO foundations that get you there are the same ones that have always mattered: indexability, relevance, and rank.
Stage 3 deep dive: the re-ranker, the least-documented and most-tactical step
The re-ranker is where AI search genuinely diverges from classic search, and it is where passage-level features beat raw domain authority — but it is also the stage with the least vendor documentation, so almost everything here is Inferred. I will not pretend otherwise.
What we have is the Princeton GEO paper [1], which is the strongest evidence in the field that passage edits move citation rates. Aggarwal et al. ran controlled interventions and found that adding citations, statistics, and quotations to a passage lifted its visibility in generative engines by up to 40% on tested queries, while keyword stuffing did roughly nothing. That is a direct measurement of the re-ranker responding to passage shape, not page authority.
| Re-rank signal | Pathway | Evidence | Direction of effect |
| --- | --- | --- | --- |
| Semantic relevance to query | RAG | Inferred (core of any re-ranker) | Strong positive |
| Answer-shape / concision (first 80-120 words) | RAG | Inferred-strong [1] | Positive; clean direct answers win |
| Citations + statistics in passage | RAG | Documented-ish (Princeton) [1] | Up to +40% on tested queries |
| Quotations from authorities | RAG | Documented-ish (Princeton) [1] | Positive |
| Schema.org parseability (FAQ/Article) | RAG | Inferred (Backlinko/Semrush correlation) [4][5] | Positive; easier extraction |
| Source authority / trust | RAG | Inferred | Positive but weaker than people think at this stage |
| Freshness | RAG | Inferred | Positive on time-sensitive queries |
| Keyword density / stuffing | RAG | Documented null result (Princeton) [1] | ~No effect |
The headline for operators: at the re-rank stage, being the cleanest, most-extractable, best-evidenced answer in the candidate set beats being the most authoritative domain in it. I have watched a DR-20 niche page get cited over a DR-80 homepage because the niche page answered the exact question in two crisp sentences with a stat and a source, and the homepage buried the answer under a hero section. Authority helps you qualify (stage 2); answer-shape helps you win (stage 3). The full factor-by-factor grading lives in the ranking factors checklist; here I only want you to know which stage each factor acts on.
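A quick way to self-audit a page against the passage-level signals in the table is to check its opening words for a statistic, a source marker, and a quotation. The sketch below is a crude heuristic of my own, not a reproduction of any re-ranker or of the Princeton method: the regular expressions are assumptions about what "a stat" or "a source" looks like, and the 120-word window simply mirrors the 80-120 word guideline above.

```python
import re


def audit_lead(text, max_words=120):
    """Check the opening passage for the passage-level features discussed above."""
    words = text.split()
    lead = " ".join(words[:max_words])
    return {
        "lead_word_count": min(len(words), max_words),
        # Crude proxy for a statistic: a percentage or a dollar figure.
        "has_statistic": bool(re.search(r"\d+(?:\.\d+)?\s?%|\$\s?\d", lead)),
        # Crude proxy for a citation: a [n] marker, "according to", or a URL.
        "has_source_marker": bool(
            re.search(r"\[\d+\]|according to|https?://", lead, re.IGNORECASE)
        ),
        # Crude proxy for a quotation: at least 10 characters in double quotes.
        "has_quotation": bool(re.search(r'"[^"]{10,}"', lead)),
    }


sample = (
    "Cookieless attribution ties revenue to a referrer without storing cookies. "
    "According to one study [1], passage edits lifted visibility by up to 40%."
)
report = audit_lead(sample)
weak_report = audit_lead("Welcome to our platform, the best in the world.")
print(report)
print(weak_report)
```

A passing audit does not mean the page will be cited; it only flags leads that are missing the features the table associates with a positive effect.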
Stage 4: synthesis and the mechanical nature of the citation
The citation itself is not a reward the model bestows — it is a mechanical instruction: cite the passages you were handed. This is widely misunderstood. People imagine the model "decides" to cite a brand it likes. In a RAG pipeline, the model is prompted to synthesize an answer from the retrieved passages and attach footnotes to the ones it used. The decision of which sources are even available to cite was made upstream, in stages 2 and 3. By the time the model writes, the candidate set is fixed.
| Misconception | Reality (Inferred from RAG architecture) |
| --- | --- |
| "The model chose to recommend me" | The re-ranker put your passage in front of the model; the model used it |
| "I need the model to like my brand" | You need to survive retrieval + re-rank; generation is downstream |
| "More authority = more citations" | Authority helps stage 2-3; it is not a generation-time lever |
| "Citations are editorial judgment" | Citations are mechanical attribution of used passages |
This reframing is liberating: you are not trying to charm a language model. You are trying to be the page that wins a search-and-rank contest, then be the passage clean enough that the model can lift it verbatim and footnote it. That is engineering, not persuasion.
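The "mechanical instruction" idea can be illustrated with a small sketch: number the retrieved passages, tell the model to cite by number, then map the markers in its answer back to URLs. The prompt wording, the passage dictionaries, and the sample answer are all invented for illustration; no vendor publishes its actual synthesis prompt.

```python
import re


def build_prompt(question, passages):
    """Number the retrieved passages and instruct the model to cite by number."""
    numbered = "\n".join(
        f"[{i}] {p['text']}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the sources below. "
        "Cite each source you use as [n].\n\n"
        f"{numbered}\n\nQuestion: {question}"
    )


def resolve_citations(answer, passages):
    """Map [n] markers in the model's answer back to the retrieved URLs."""
    used = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    return [passages[n - 1]["url"] for n in used if 1 <= n <= len(passages)]


passages = [
    {"url": "https://example.com/a", "text": "Tool A uses third-party cookies."},
    {"url": "https://example.com/b", "text": "Tool B is cookieless."},
    {"url": "https://example.com/c", "text": "Tool B plans start at $29."},
]

prompt = build_prompt("Which tool is cookieless?", passages)

# Hypothetical model output, hard-coded here instead of calling a model.
answer = "Tool B is cookieless [2]. Its plans start at $29 [2][3]."
cited = resolve_citations(answer, passages)
print(cited)
```

Note what the sketch cannot do: a page that was never placed in `passages` has no number, so the model has nothing to footnote. That is the upstream decision the section describes.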
Per-engine source preferences: same skeleton, different weighting
The four major engines share the same two-pathway skeleton, but they weight the pathways and the re-rank signals differently — and the differences are smaller than vendor marketing implies. Roughly 70% of citation variance is the shared mechanics; engine-specific deltas are the other 30%.
| Engine | Default pathway | Citations/answer | Retrieval intensity | Notable lean |
| --- | --- | --- | --- | --- |
| Perplexity | RAG (almost always searches) [12] | 3-7 | Highest | Recency-weighted; exposes Sources tab |
| ChatGPT search | Mixed (decides per query) [10] | 3-5 | High when triggered | Bing index; concise synthesis |
| Google AI Overviews | RAG, gated by SERP [5] | 4-7 | High but rank-gated | Top-10 organic strongly favored |
| Claude (web search on) | RAG when enabled [17] | Varies | Moderate | Conservative; fewer, higher-trust cites |
| Claude / Gemini (browse off) | Training-corpus only | 0 (no links) | None | Pure recall; entity presence decides |
The operational takeaway from that table:
| If you want citations on... | Optimize primarily... | Because |
| --- | --- | --- |
| Perplexity | RAG: concision, schema, freshness | It searches nearly every query |
| ChatGPT search | RAG + Bing indexability | Bing is the documented index [10] |
| Google AI Overviews | Classic top-10 rank first, then answer-shape | Rank gates the candidate set [5] |
| Browse-off Claude/Gemini | Training-corpus: mentions, Wikidata, entity | No search happens; only recall |
It also helps to know each engine's default posture, because that decides which pathway fires before the user does anything special:
| Engine | Default mode | Pathway you hit by default | How to flip it |
| --- | --- | --- | --- |
| Perplexity | Always searches | RAG | Effectively can't avoid RAG |
| ChatGPT (consumer) | Decides per query | Mixed | Explicit "search the web" forces RAG |
| Google AI Overviews | Triggers on ~13-15% of SERPs | RAG (rank-gated) | Only on queries that show an AIO [9] |
| Claude (chat) | Browse off unless enabled | Training-corpus | Turn on web search for RAG |
| Gemini (chat) | Browse off unless prompted | Training-corpus | Ask it to search for RAG |
Notice that the same brand can need opposite strategies on two engines. Winning Perplexity is a stage-2/stage-3 RAG problem solvable this month. Winning browse-off Gemini is a training-corpus problem solvable over quarters. If you only measure one, you will mis-allocate effort toward the engine that happens to be easy to see.
What's documented vs inferred vs speculative: the honesty table
Most "how AI chooses sources" content states inference as fact; here is exactly where the evidence runs out. I would rather you trust less and test more than believe a confident claim I cannot back.
| Claim | Label | Basis |
| --- | --- | --- |
| Two pathways exist (recall vs RAG) | Documented | Vendors document both browse-off and search modes [10][16][17] |
| RAG has a distinct search step | Documented | SearchGPT names Bing; Perplexity names its stack [10][12] |
|  |  | Operator measurement + Princeton null on stuffing [1] |
| Engines use a PageRank-like authority signal | Inferred | Link-graph correlates; mechanism not published |
| Exact re-rank scoring function | Speculative | No vendor publishes it |
| "Brand affinity" influences generation | Speculative | No evidence; contradicts mechanical-citation model |
| Specific token-level citation triggers | Speculative | Pattern-matching without rigor |
The line I draw: Documented means a vendor or a research paper said it. Inferred means strong, multi-source observational evidence with a plausible mechanism but no confirmation. Speculative means SEO-community pattern-matching: worth testing on your own site, dangerous to bet the budget on. The whole point of grading is that the field is full of stage-3 (re-ranker) claims dressed up as documented fact, when the re-ranker is precisely the stage vendors keep closed.
How the two pathways interact (and feed each other)
The pathways are not isolated — today's RAG citations seed tomorrow's training corpus, which is why the fast pathway is also a long-term investment. This is the most important second-order mechanic and the one most strategies miss.
Walk the loop:
1. You publish a clean, answer-shaped page (RAG-pathway optimization).
2. It gets retrieved and cited in Perplexity/ChatGPT search within days (RAG win).
3. Those citations and the traffic they drive generate mentions, links, and discussion.
4. The next training crawl ingests your page, the citations of it, and the discussion.
5. The next model generation "learns" your brand → training-corpus presence grows.
6. Now browse-off Claude/Gemini start recommending you with no link (training-pathway win).
| Pathway interaction | Direction | Latency | Why it matters |
| --- | --- | --- | --- |
| RAG citation → training presence | Forward-feeding | 6-12 months | Fast wins compound into slow durable recall |
| Training presence → RAG candidate odds | Reinforcing (Inferred) | Continuous | Known brands may get retrieved more readily |
| Blocking training crawler | Breaks the loop | Immediate | You keep RAG, lose the compounding to recall |
| Strong recall, weak RAG pages | Divergence | N/A | Model knows you but can't cite a clean page |
So the right mental model is not "pick a pathway." It is "win the RAG pathway now for measurable traffic, and let those wins compound into training-corpus recall over the next few model generations." The mistake is doing RAG work and never checking whether it has started moving the browse-off recommendations — because that is the higher-value, harder-to-measure outcome.
Common myths the mechanics dismantle
Once you hold the two-pathway model, most popular AI-SEO advice sorts cleanly into "true for one pathway" or "false for both." Here is the cleanup:
| Myth | Verdict | Why |
| --- | --- | --- |
| "Schema gets you cited by AI" | True for RAG, irrelevant to recall | Re-ranker parses schema; weights are prose |
| "Just be authoritative and AI will cite you" | Half-true | Authority gates RAG stage 2 and feeds recall; not a stage-3 winner |
| "AI doesn't use SEO anymore" | False for RAG | The candidate set is a search step [10][12] |
| "Block the bots, protect your content, still get cited" | Mostly false | Blocking removes you from retrieval and/or corpus |
| "Fresh content gets cited fastest" | True for RAG, false for recall | RAG = 24-72h; recall = next model |
| "More backlinks = more AI citations" | Weak/indirect | Links help rank (stage 2); re-ranker weights passages [1] |
| "The model decides to recommend brands it likes" | False | Citation is mechanical attribution of used passages |
| "One GEO strategy works across all engines" | False | Perplexity (RAG) vs browse-off Gemini (recall) are opposite |
I have audited enough GEO programs to say the costliest myth is the last one. Teams build a single playbook, see it work on Perplexity, declare victory, and never notice that the browse-off recommendations that actually drive their highest-intent traffic never moved. They optimized the visible pathway and ignored the valuable one.
A practical decision framework: which pathway should you optimize?
Pick your pathway from the query that drives your revenue, not from the engine that is easiest to screenshot. Here is the framework I give founders.
| If your buyers ask AI... | Dominant pathway | Optimize | Realistic timeline |
| --- | --- | --- | --- |
| "Best [category] tool for [use case]?" (browse off) |  |  |  |
Step 1 is non-negotiable and it is where almost everyone fails. The tactical get-cited playbook covers steps 2 and 3 in depth. But notice that both steps depend on step 1 being real — and step 1 is exactly what your default analytics cannot do.
The part nobody tells you: knowing the mechanics is useless unless you measure which pathway pays
You can understand the retrieval pipeline perfectly and still set your budget on fire, because the two pathways send revenue through completely different, equally invisible doors — and GA4 closes both. This is where I have to be blunt about the gap that motivated me to build Attrifast in the first place.
Walk what actually reaches you from each pathway:
| Pathway | What the visitor does | What your analytics sees |
| --- | --- | --- |
| RAG citation | Clicks the footnote | Referrer stripped by AI client → GA4 Direct/(none) [11] |
| Training-corpus recall | Reads "I'd use Attrifast," then searches your brand later | Branded search or Direct, days later, no link to AI |
In both cases, GA4 buckets the visit as Direct/(none) and tells you nothing about which pathway, which engine, or which query drove it [11]. So the team that did six weeks of schema work (a RAG-pathway lever) and the team that spent two quarters building Reddit mention density (a training-corpus lever) get the same useless signal: a swelling Direct bucket. Neither can tell whether their pathway is the one paying.
Here is the measurement stack that separates them, and it is cookieless end to end:
The decisive layer is the last one. Until you join the AI-referred session to the Stripe payment, you have citation counts and traffic guesses — you do not have revenue, and revenue is the only number that tells you which pathway to fund. A brand can be cited 500 times a week on Perplexity (RAG) and earn less than a single browse-off Claude recommendation that quietly sends three high-intent buyers a month. You will fund the 500 citations because you can see them, and starve the recommendation because you can't — unless you measure at the money.
That is the architecture Attrifast ships: cookieless, consent-light, server-side referrer detection against the AI-engine domain list, first-party session identity scoped to your own domain, and a Stripe-metadata join that lands the AI-referred visit on the actual paying customer. It is the only way I know to answer "which pathway drives my revenue," and it is the question every section above quietly leads to. If you want the analytics shape of the hidden traffic before you instrument it, the ChatGPT referral analytics guide and track-chatgpt-traffic walk the detection code; revenue attribution is the join.
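As a rough illustration of what server-side referrer detection against an AI-engine domain list can look like, here is a minimal sketch. It is not Attrifast's implementation: the host list is a short illustrative sample that you would need to extend and verify against your own logs, and `classify_referrer` is a hypothetical helper. It also only catches visits where a Referer header actually survives, which, as noted above, is often not the case.

```python
from urllib.parse import urlparse

# Illustrative host list (an assumption, not an exhaustive or official one).
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "chat.openai.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
    "claude.ai": "Claude",
}


def classify_referrer(referrer):
    """Return the AI engine name for a Referer header value, or None."""
    if not referrer:
        return None
    # hostname is None for values without a scheme, so fall back to "".
    host = (urlparse(referrer).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return AI_REFERRER_HOSTS.get(host)


for ref in (
    "https://chatgpt.com/",
    "https://www.perplexity.ai/search?q=attribution",
    "https://www.google.com/",
    "",
):
    print(repr(ref), "->", classify_referrer(ref))
```

Matching on the exact hostname rather than a substring avoids misclassifying unrelated domains that merely contain an engine's name.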
The mechanics in this article are real and worth knowing. But they are a map, not a destination. The destination is a dashboard that tells you, this quarter, that your RAG schema work added $1,400 of Perplexity-attributed revenue and your slow Reddit-mention grind added $3,100 of browse-off-recommendation revenue — so next quarter you shift the weight. Without that, you are optimizing pathways you have never actually seen.
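The join behind that kind of dashboard is conceptually simple. The sketch below uses plain dictionaries and makes no Stripe API calls: the field names (`session_id`, `source`, `amount_cents`, `metadata`) are hypothetical, and the payment records only mimic the idea of a payment object carrying a session ID in its metadata. The dollar amounts are made up.

```python
from collections import defaultdict


def revenue_by_source(sessions, payments):
    """Join payments to sessions on session_id and total revenue per source."""
    source_by_session = {s["session_id"]: s["source"] for s in sessions}
    totals = defaultdict(int)
    for p in payments:
        session_id = p.get("metadata", {}).get("session_id")
        # Payments with no matching session stay visible as "unattributed".
        source = source_by_session.get(session_id, "unattributed")
        totals[source] += p["amount_cents"]
    return dict(totals)


sessions = [
    {"session_id": "s1", "source": "Perplexity"},      # RAG click-through
    {"session_id": "s2", "source": "branded-search"},  # possible recall effect
]
payments = [
    {"amount_cents": 2900, "metadata": {"session_id": "s1"}},
    {"amount_cents": 4900, "metadata": {"session_id": "s2"}},
    {"amount_cents": 2900, "metadata": {}},  # no session recorded
]

totals = revenue_by_source(sessions, payments)
print(totals)
```

Keeping an explicit "unattributed" bucket matters: it shows how much revenue the join still cannot explain, instead of silently dropping it.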
Freshness and crawl mechanics: why the same edit lands in days on one pathway and never on the other
Freshness is not one thing — it behaves completely differently per pathway, and the same content edit can be live in a citation within 72 hours on RAG and invisible to recall for a year. Operators who do not separate these two clocks chase their tails wondering why "the update didn't take."
On the RAG pathway, freshness is bounded by your crawl-and-index cadence, because the citation is decided against the current index at query time. On the training-corpus pathway, freshness is bounded by the model's training cutoff, which is frozen until the next generation ships. Two clocks, wildly different speeds.
| Content change | RAG pathway latency | Training-corpus latency | Why the gap |
| --- | --- | --- | --- |
| New page published | 24-72h on well-crawled domains | Next model gen (~6-12mo) | RAG reads the live index; recall waits for retraining |
| Price/feature correction | Hours-days once re-indexed | Next model gen | Frozen weights can't re-read your page |
| New competitor comparison | Days | Next model gen | Same retrieval/recall split |
| Brand rename / rebrand | Days for RAG, but recall stays wrong | Next model gen | The hardest correction to propagate |
| Removing outdated content | Drops from RAG on re-crawl | Stays in weights until retrain | You can't unlearn the model |
The crawl side has its own per-bot mechanics worth knowing, because the bot that visits determines which pathway your content can even reach:
| Crawler | Owner | Pathway it feeds | Documented? | Block effect |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Training corpus | Documented [16] | Removes you from future ChatGPT recall |
| OAI-SearchBot | OpenAI | RAG search index | Documented [16] | Removes you from ChatGPT search retrieval |
| ChatGPT-User | OpenAI | RAG live fetch | Documented [16] | Removes you from on-demand fetches |
| ClaudeBot | Anthropic | Training corpus | Documented [17] | Removes you from future Claude recall |
| Google-Extended | Google | Gemini training | Documented [19] | Opts you out of Gemini training only |
| Googlebot | Google | Live index → AI Overviews | Documented | Removes you from Search + AIO |
| PerplexityBot | Perplexity | RAG index | Documented [12] | Removes you from Perplexity retrieval |
The honest reading: selective blocking lets you keep the fast pathway while opting out of the slow one, but almost no SMB benefits from blocking anything. The freshness asymmetry is the real lesson — when you ship an important correction, the RAG surfaces will tell the truth within days and the browse-off recall will keep repeating the old version until the next model. Plan your messaging around both clocks. Google is the clearest worked example of this multi-clock behavior, and where Google AI gets its information breaks its four stitched-together sources down by exactly these update cadences.
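To see which pathways a given robots.txt leaves open, you can test each crawler token from the table with Python's standard-library `urllib.robotparser`. The robots.txt content and the example URL below are made up, and Python's matching rules are only an approximation of how each real crawler interprets the file, so treat the output as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of two training crawlers, allow the rest.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

# Crawler tokens and the pathway each feeds, taken from the table above.
CRAWLERS = {
    "GPTBot": "training corpus",
    "OAI-SearchBot": "RAG search index",
    "ChatGPT-User": "RAG live fetch",
    "ClaudeBot": "training corpus",
    "Google-Extended": "Gemini training",
    "Googlebot": "live index / AI Overviews",
    "PerplexityBot": "RAG index",
}


def crawler_access(robots_txt, url):
    """Return {crawler: allowed?} for one URL under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in CRAWLERS}


access = crawler_access(ROBOTS_TXT, "https://www.example.com/pricing")
for bot, allowed in access.items():
    status = "allowed" if allowed else "blocked"
    print(f"{bot:16} {CRAWLERS[bot]:26} {status}")
```

With this sample file the two training crawlers are blocked and every retrieval crawler stays allowed, which is the "keep the fast pathway, opt out of the slow one" configuration described above.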
Putting it together: a per-pathway optimization scorecard
The clean way to act on all of this is a single scorecard that maps each lever to its pathway, its evidence grade, and its latency — so you stop spending RAG effort hoping for recall results. This is the table I actually use when auditing a site.
| Lever | Pathway | Evidence | Latency | Effort |
| --- | --- | --- | --- | --- |
| Indexability + classic rank | RAG (stage 2) | Documented search step | Days-weeks | Medium |
| Answer-shaped first 80-120 words | RAG (stage 3) | Inferred-strong [1] | Days | Low |
| Citations + stats in passage | RAG (stage 3) | Documented-ish [1] | Days | Low |
| FAQ / Article schema | RAG (stage 3) | Inferred [4][5] | Days | Low |
| Freshness on time-sensitive pages | RAG (stages 2-3) | Inferred | Days | Ongoing |
| Wikidata + Organization schema + sameAs | Both (entity) | Inferred-strong | Weeks (KG) | Medium |
| Reddit / forum mention density | Training-corpus | Inferred-strong [13] | Quarters | High |
| Wikipedia presence | Training-corpus | Inferred-strong | Quarters | Very high |
| Publisher / earned-media mentions | Training-corpus | Inferred | Quarters | High |
| Allowing the training crawlers | Training-corpus | Documented [16][17] | Next gen | Trivial |
And the diagnostic side — given a symptom, which pathway is failing:
| Symptom | Failing pathway | First thing to check |
| --- | --- | --- |
| Cited on Perplexity, ignored browse-off | Training-corpus | Entity presence + mention density |
| Recommended browse-off, never a footnote | RAG | Indexability + answer-shape |
| Cited but wrong/outdated facts | Training-corpus (recall) | Wait for next gen; fix live index for RAG |
| Zero presence anywhere | Both | Are crawlers blocked? Are you indexed at all? |
| Citations rising, revenue flat | Measurement | Join sessions to Stripe by pathway |
That last row is the one I keep returning to, because it is the failure mode that survives even a perfect understanding of the mechanics. You can grade every lever correctly and still not know whether the work paid. The fix is not more mechanics — it is measurement at the money.
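The diagnostic table can also be read as a small decision function. The sketch below is one way to encode it; the function name, the boolean inputs, and especially the order in which the symptoms are checked are assumptions made for illustration, not part of any vendor's tooling.

```python
def diagnose(cited_in_rag: bool, recommended_browse_off: bool,
             facts_current: bool = True, revenue_rising: bool = True):
    """Map observed symptoms to (failing pathway, first thing to check).

    The check order is an assumption: presence problems are ruled out first,
    then accuracy, then measurement.
    """
    if not cited_in_rag and not recommended_browse_off:
        return ("Both", "Are crawlers blocked? Are you indexed at all?")
    if cited_in_rag and not recommended_browse_off:
        return ("Training-corpus", "Entity presence + mention density")
    if recommended_browse_off and not cited_in_rag:
        return ("RAG", "Indexability + answer-shape")
    if not facts_current:
        return ("Training-corpus (recall)",
                "Wait for next gen; fix live index for RAG")
    if not revenue_rising:
        return ("Measurement", "Join sessions to Stripe by pathway")
    return (None, "No symptom from the table applies")


# The founder from the introduction: cited on Perplexity, ignored browse-off.
pathway, first_check = diagnose(cited_in_rag=True, recommended_browse_off=False)
```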
FAQ
How do AI engines choose which sources to cite?
Through one of two mechanically different pathways. In the training-corpus pathway, the model recalls your brand from frozen pretraining knowledge with no live fetch — citation here reflects how densely and trustworthily your information appeared in the crawl, which is why Wikipedia, Reddit, and established publishers dominate. In the live-retrieval (RAG) pathway, the engine runs a real search at query time, retrieves a candidate set, re-ranks it on relevance plus source plus answer-shape, then cites the passages it actually used. The first pathway rewards years of accumulated authority; the second rewards classic searchability plus concise, schema-marked, fresh passages. The fatal mistake is optimizing without naming which pathway the query you care about uses.
What is the difference between training-corpus citation and RAG citation?
Training-corpus citation fires when the engine answers from memory with browsing off (Claude plain chat, Gemini no-browse, ChatGPT when it skips search). It usually produces an unlinked recommendation because nothing was fetched, and it changes only on the model's 6-12 month training cycle. RAG citation fires when the engine searches the live web, and it produces a clickable footnote to a page retrieved seconds earlier; it can change within 24-72 hours of you publishing. They demand opposite work: slow corpus-wide authority building for the first, fast on-page searchability and answer-shape for the second.
Does ChatGPT search the live web or answer from training memory?
Both, decided per query. Per OpenAI's documentation, ChatGPT search runs a retrieval-augmented pipeline against a search index (Bing is the documented partner) and cites 3-5 sources when it browses. But ChatGPT also answers many queries from frozen training knowledge with no search and no citations. The model decides each time whether to invoke the search tool. That is why the same brand can be a footnote on one query (RAG pathway) and an unlinked mention on another (training-corpus pathway), and why a single optimization strategy underperforms.
What gets a source into the candidate set for retrieval?
Mostly classic search criteria, because the candidate set is built by an actual search step before any AI re-ranking. The page has to be indexed, crawlable, topically relevant to the rewritten retrieval query, and ideally already ranking for it. The Princeton GEO research and broad operator measurement both point to top-organic pages being far likelier to enter the pool. Schema and concision do not get you into the candidate set — they help you survive the next stage, the re-ranker. Getting in is a searchability problem; staying in is a passage-quality problem.
How does the re-ranking step decide which passages survive?
This is the least-documented stage, so most of the answer is Inferred. After retrieval returns a candidate set, a re-ranker scores passages on semantic relevance, source signals, and answer-shape. The strongest evidence we have is the Princeton GEO paper, which showed that adding citations, statistics, and quotations to a passage lifted its generative-engine visibility by up to 40% on tested queries, while keyword-stuffing did essentially nothing. That is direct evidence the re-ranker weights passage-level features, not just domain authority. Concision, schema parseability, and freshness all plausibly help, but no vendor publishes the exact scoring function.
Why does AI cite Wikipedia and Reddit so often?
Because both are heavily over-represented in pretraining corpora and both are treated as high-trust for their query types. Wikipedia seeds the entity graphs that models lean on for disambiguation, so it surfaces in both pathways constantly. Reddit was licensed to Google for AI training and is heavily weighted for opinion, recommendation, and lived-experience queries where the model wants human consensus. Neither passes you classic backlink equity, but consistent presence on either accumulates training-corpus authority that pure on-site optimization cannot replicate.
Can I get cited if I block the AI crawlers?
Only on the RAG pathway, and only if you block selectively. Blocking the training crawler (GPTBot) while allowing the live-fetch and search agents (ChatGPT-User, OAI-SearchBot) keeps you retrievable at query time but removes you from future pretraining, slowly eroding your training-corpus recall. Blocking everything forfeits both pathways. For most SMB SaaS and ecommerce sites in 2026, allowing the crawlers is the right call, because training-corpus presence — the browse-off recommendation — is the durable asset you cannot buy back on a short timeline.
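Selective blocking is easy to get wrong, so it is worth checking a robots.txt draft before shipping it. The sketch below uses Python's standard `urllib.robotparser` to test the exact setup described above: GPTBot blocked, OAI-SearchBot and ChatGPT-User allowed. The robots.txt content, the `allowed_bots` helper, and the example.com URL are illustrative assumptions, and the standard-library parser may not match every crawler's own interpretation of the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of OpenAI training, stay retrievable.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Allow: /
"""


def allowed_bots(robots_txt, bots, url="https://example.com/pricing"):
    """Return {bot: True/False} for whether each bot may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


result = allowed_bots(ROBOTS_TXT, ["GPTBot", "OAI-SearchBot", "ChatGPT-User"])
# GPTBot is blocked; the search and live-fetch agents are still allowed.
```

Running the same check with `Disallow: /` under `User-agent: *` and no per-bot entries shows the "blocking everything" case: every agent comes back False.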
How long does it take to start getting cited by AI?
Pathway-dependent. On RAG, fresh content can be cited within 24-72 hours of indexing on a well-crawled domain, because the decision is made at query time against the current index. On the training-corpus pathway, you are waiting for the next model generation — a 6-12 month cutoff cycle plus the lead time for your content to be crawled and learned before the cutoff. If you need citations this quarter, optimize RAG; if you want the model to recommend you with browsing off, you are playing a multi-quarter game.
Which AI engine is easiest to get cited by, and why?
Perplexity, because it is the most retrieval-heavy major engine — it runs a live search on nearly every query and surfaces 3-7 citations per answer, with a Sources tab exposing the wider retrieved set. That makes the RAG pathway dominant, so classic searchability plus answer-shaped passages get you in quickly. ChatGPT search is next, then Google AI Overviews (gated by top-10 organic rank). The hardest surface is base-model knowledge in a no-browse chat, which is pure training-corpus recall you cannot influence without being in the next training run.
Does structured data (schema) actually affect AI citations?
On the RAG pathway, the evidence says yes and I label it Inferred-strong. Schema.org JSON-LD makes a passage trivially parseable, so the re-ranker does not have to guess what is a question, answer, price, or author. Multiple observational studies find AI-cited pages carry FAQ and Article schema at higher rates, and the mechanism is plausible: a clean, machine-readable answer is easier to extract and trust. On the training-corpus pathway, schema matters far less, because the model learned prose, not markup. So schema is mostly a RAG-pathway lever — another reason naming the pathway first changes what you build.
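For reference, this is roughly what "trivially parseable" looks like. The sketch builds a schema.org `FAQPage` object in JSON-LD from question and answer pairs; the `faq_jsonld` helper and the sample question are hypothetical, and nothing here implies a particular engine reads the markup in a particular way.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage dict from (question, answer) tuples."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


# Illustrative content only.
markup = faq_jsonld([
    ("How long does RAG citation take?",
     "Fresh content can be cited within 24-72 hours of indexing."),
])

# Embed in the page head or body as a JSON-LD script tag.
script_tag = (
    '<script type="application/ld+json">'
    + json.dumps(markup, indent=2)
    + "</script>"
)
```

The answer text in the markup should match the visible answer on the page, so the parsed version and the rendered version say the same thing.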
Do AI engines use Google's PageRank to choose sources?
Not directly, but they almost certainly use a similar citation-graph authority signal derived from web crawl data, and I label this Inferred. Pages from high-PageRank-style domains (Wikipedia, government, established publishers) get cited disproportionately, which is consistent with a link-graph authority signal but does not prove the mechanism. Critically, that authority appears to act more at the candidate-set stage (does this page rank well enough to be retrieved) than at the re-rank stage, where the Princeton evidence shows passage shape can beat raw authority.
Is the citation a judgment the model makes, or a mechanical step?
Mechanical, on the RAG pathway. The model is prompted to synthesize an answer from the retrieved passages and attach footnotes to the ones it used. The decision of which sources are even available to cite was made upstream in retrieval and re-ranking; by the time the model writes, the candidate set is fixed. So "the model chose to recommend me" really means "the re-ranker put my passage in front of the model and it was clean enough to use." You are not charming a model — you are winning a search-and-rank contest.
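A toy version of that search-and-rank contest makes the order of operations concrete. Everything in this sketch is an assumption for illustration: no vendor publishes its scoring function, and the `Passage` type, the top-10 retrieval cutoff, and the 1.5 bonus for containing a statistic are invented weights, not measured ones. What it does show faithfully is the sequence: retrieval fixes the candidate set first, the re-ranker orders it, and only then are citations attached.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    url: str
    text: str
    search_rank: int  # position in the classic search results (1 = best)


def retrieve(passages, max_rank=10):
    """Candidate-set stage: only pages that rank well enough get in."""
    return [p for p in passages if p.search_rank <= max_rank]


def rerank_score(passage, query_terms):
    """Toy re-ranker: distinct query-term overlap plus a bonus for a statistic.

    Using a set means repeating a keyword adds nothing to the score.
    """
    words = {w.strip(".,:;!?%") for w in passage.text.lower().split()}
    overlap = len(words & query_terms)
    has_stat = any(ch.isdigit() for ch in passage.text)
    return overlap + (1.5 if has_stat else 0.0)


def cite(passages, query, k=2):
    """Return the URLs that would be footnoted, best first."""
    terms = set(query.lower().split())
    candidates = retrieve(passages)
    ranked = sorted(candidates, key=lambda p: rerank_score(p, terms), reverse=True)
    return [p.url for p in ranked[:k]]


corpus = [
    Passage("https://a.example/guide",
            "Invoice tools compared: the median setup takes 12 minutes.", 3),
    Passage("https://b.example/blog",
            "Our invoice tool is great and we love invoice software.", 1),
    Passage("https://c.example/deep",
            "Best invoice tool for freelancers, with 40% faster payouts.", 25),
]
citations = cite(corpus, "best invoice tool")
```

In this toy run, the c.example passage would score highest on shape but never reaches the re-ranker because it sits at rank 25, while a.example beats the higher-ranking b.example because it carries a statistic. That is the searchability-then-passage-quality split in miniature.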
How do the two pathways feed each other?
Forward. A clean page wins RAG citations within days; the resulting traffic, mentions, and discussion get ingested by the next training crawl; the next model generation learns your brand; and then browse-off engines start recommending you with no link. So fast RAG wins compound into slow, durable training-corpus recall over a few model generations. The strategic implication: do RAG work now for measurable traffic, but keep checking whether it has started moving the higher-value, harder-to-see browse-off recommendations.
Why can't GA4 tell me which pathway drives my AI revenue?
Because traffic from both pathways often arrives looking like Direct/(none). A RAG footnote click frequently has its referrer stripped by the AI client; when that happens, GA4 sees no referrer and no UTM and files the visit as Direct. A training-corpus recommendation produces a later branded search or Direct visit with no link back to the AI engine at all. GA4 has no built-in rule to separate AI traffic, let alone separate the two pathways, and it cannot join either to revenue. You need server-side referrer fingerprinting against an AI-engine domain list plus a first-party identity and a Stripe-payment join to see which pathway actually pays.
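The referrer-fingerprinting step can be sketched in a few lines. This is a minimal illustration, not a complete detector: the domain list is a small, hand-picked sample that would need maintaining, the `classify_referrer` name is hypothetical, and it only catches clicks where a referrer survived.

```python
from urllib.parse import urlparse

# Illustrative, incomplete list of AI-engine referrer domains.
AI_ENGINE_DOMAINS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
    "claude.ai": "Claude",
}


def classify_referrer(referrer):
    """Return the AI engine name, 'direct' if no referrer, or 'other'."""
    if not referrer:
        return "direct"
    host = (urlparse(referrer).hostname or "").lower()
    for domain, engine in AI_ENGINE_DOMAINS.items():
        # Match the domain itself or any subdomain, but not lookalikes.
        if host == domain or host.endswith("." + domain):
            return engine
    return "other"


engine = classify_referrer("https://www.perplexity.ai/search?q=invoice+tools")
```

Sessions that come back as "direct" are where the other layers (deep-page landing detection and lag analysis) have to do the work.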
How do I measure which pathway is driving my revenue?
Four layers, all cookieless. First, server-side referrer fingerprinting against a known AI-engine domain list catches the RAG footnote clicks that do pass a referrer, broken out by engine. Second, first-party deep-page landing detection catches the unreferred RAG clicks where the referrer was stripped. Third, branded-search and Direct lag analysis approximates the training-corpus recommendations that send delayed visits. Fourth, and decisive, a Stripe webhook join ties the AI-referred session to the actual paying customer, so you see revenue by pathway, not just citation counts. That join is the entire reason citation-tracking tools alone leave you guessing, and it is what Attrifast was built to close.
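The fourth layer, the join, is conceptually simple once sessions carry a pathway label and payments carry a customer identifier. The sketch below makes no Stripe API calls: it assumes you have already stored payment records from your webhook handler, and the field names (`customer_id`, `pathway`, `amount_cents`), the first-touch attribution rule, and the sample data are all hypothetical.

```python
from collections import defaultdict


def revenue_by_pathway(sessions, payments):
    """Sum payment amounts per pathway using first-touch attribution.

    sessions: dicts with 'customer_id' and 'pathway', in chronological order.
    payments: dicts with 'customer_id' and 'amount_cents'.
    """
    first_touch = {}
    for session in sessions:
        # setdefault keeps the earliest pathway seen for each customer.
        first_touch.setdefault(session["customer_id"], session["pathway"])

    totals = defaultdict(int)
    for payment in payments:
        pathway = first_touch.get(payment["customer_id"], "unattributed")
        totals[pathway] += payment["amount_cents"]
    return dict(totals)


# Illustrative records only.
sessions = [
    {"customer_id": "c1", "pathway": "rag"},
    {"customer_id": "c2", "pathway": "training"},
    {"customer_id": "c1", "pathway": "training"},  # later visit, ignored
]
payments = [
    {"customer_id": "c1", "amount_cents": 4900},
    {"customer_id": "c2", "amount_cents": 9900},
    {"customer_id": "c3", "amount_cents": 1900},  # no matching session
    {"customer_id": "c1", "amount_cents": 4900},
]
totals = revenue_by_pathway(sessions, payments)
```

First-touch is only one possible rule; swapping it for last-touch or a split changes the totals, so the attribution choice should be stated next to any revenue-by-pathway number.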