Q: Does ChatGPT search the live web or answer from memory?

Both, depending on the query and the mode. Per OpenAI's documentation, ChatGPT search runs a retrieval-augmented pipeline that queries a search index (Bing is the documented partner per the SearchGPT announcement) and cites 3-5 sources per answer. But ChatGPT also answers many queries from its frozen training knowledge without browsing at all, in which case there are no inline citations because nothing was fetched. The model decides per-query whether to invoke the search tool. This is why the same brand can appear as a clickable footnote on one query (RAG pathway) and as an unlinked mention on another (training-corpus pathway), and why you have to optimize for both.

Q: How do I know which pathway is driving my AI traffic and revenue?

You measure it, because you cannot infer it from citation-tracking tools alone. A clickable footnote in a ChatGPT or Perplexity answer is the RAG pathway, and it sends a click, usually with the referrer stripped, that you can still catch server-side. A browse-off mention where the model recommends you with no link is the training-corpus pathway, and it shows up as a later branded search or a Direct visit. The only way to separate them by revenue is first-party, server-side attribution that joins the AI-referred session to the Stripe payment. GA4 buckets all of it as Direct, so most teams optimize a pathway they have never actually measured. That measurement gap is the entire reason this article ends where it does.
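As a rough illustration of the server-side join described above, here is a minimal Python sketch. Everything named in it is an assumption for illustration: the `AI_HOSTS` list, the idea that an AI click can be recognized from either the referrer host or a `utm_source` parameter on the landing URL, and the `session_id`, `referrer`, `landing_url`, and `amount` field names. Payments are plain dictionaries here; no Stripe API is called.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlparse

# Hypothetical host list for illustration; build yours from real server logs.
AI_HOSTS = {"chatgpt.com", "chat.openai.com", "perplexity.ai", "www.perplexity.ai"}


def classify_session(referrer, landing_url):
    """Label one session 'rag', 'possible_recall', or 'other'."""
    ref_host = urlparse(referrer).netloc.lower() if referrer else ""
    params = parse_qs(urlparse(landing_url).query)
    utm_source = params.get("utm_source", [""])[0].lower()
    if ref_host in AI_HOSTS or utm_source in AI_HOSTS:
        return "rag"  # a clickable footnote was followed
    if not ref_host:
        return "possible_recall"  # Direct visit: ambiguous, not proof of recall
    return "other"


def revenue_by_pathway(sessions, payments):
    """Join payments to sessions on session_id and sum revenue per label."""
    label = {
        s["session_id"]: classify_session(s["referrer"], s["landing_url"])
        for s in sessions
    }
    totals = defaultdict(float)
    for p in payments:
        totals[label.get(p["session_id"], "unknown")] += p["amount"]
    return dict(totals)
```

The "possible_recall" label is deliberately cautious: a Direct visit is consistent with a browse-off recommendation but does not prove one, so treat that bucket as an upper bound rather than a measurement.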


How AI Engines Choose Which Sources to Cite: The 2026 Mechanics

Q: How do AI engines choose which sources to cite?

There is not one answer because there are two completely different pathways. In the training-corpus pathway, the model already learned about your brand during pretraining and reproduces it from memory with no live fetch — citation here is a function of how heavily your information appeared in the crawl (Wikipedia, Reddit, established publishers dominate). In the live-retrieval pathway (RAG), the engine runs an actual search at query time, pulls a candidate set of documents, re-ranks them on relevance plus source signals, then cites the handful of passages it passed to the model. The first pathway rewards corpus-wide authority accumulated over years; the second rewards classic searchability plus answer-shaped, schema-marked, fresh passages. Most 'how to get cited' advice optimizes blindly without naming which pathway it targets, which is why so much of it underperforms.

Q: What is the difference between training-corpus citation and RAG citation?

Training-corpus citation happens when the model answers from frozen pretraining knowledge with browsing off — Claude in a plain chat, ChatGPT when it decides not to search, Gemini in no-browse mode. The 'citation' is really the model recalling that a brand or fact exists; it cannot link to a page it did not fetch. RAG citation happens when the engine performs a live search step at query time, retrieves documents, re-ranks them, and attaches inline footnotes to the passages it actually used. The practical consequence: training-corpus presence is slow to earn and slow to change (it moves on the model's 6-12 month training cycle), while RAG citation can flip within 24-72 hours of publishing on a well-crawled domain. They demand different work.

Q: What gets a source into the candidate set for retrieval?

In the RAG pathway, the candidate set is assembled by a search step before any AI re-ranking happens, so the entry criteria are mostly classic search criteria: the page is indexed, it ranks for the rewritten retrieval query, it is crawlable, and it is topically relevant to the intent. The Princeton GEO research (Aggarwal et al., 2024) and operator measurement both suggest that pages already ranking in the top organic results for a query are far more likely to enter the candidate pool. Once in the pool, a second, different scoring step (the re-ranker) decides which 4-8 passages survive to the generation step. Getting into the candidate set is a searchability problem; surviving the re-ranker is a passage-quality problem.

Q: How does the re-ranking step decide which passages survive?

This is the least-documented stage and I label most of it Inferred. After the search step returns a candidate set, a re-ranker scores passages on semantic relevance to the query, source signals (authority, freshness, trust), and answer-shape (does this passage directly answer the question in a few sentences). The Princeton GEO paper showed that adding citations, statistics, and quotations to a passage lifted its generative-engine visibility by up to 40% on tested queries. That is the strongest available evidence that passage-level features, not just page-level authority, change what gets cited; attributing the effect to the re-ranker specifically remains Inferred. Concision, schema parseability, and freshness all appear to help a passage survive the cut, but no vendor publishes the exact scoring function.

Q: Why does AI cite Wikipedia and Reddit so often?

Because both are massively over-represented in the pretraining corpus and both are treated as high-trust for their respective query types. Wikipedia seeds the entity graph every major model uses for disambiguation, so it gets pulled into both pathways constantly. Reddit licensed its content to Google for AI training (per Reuters reporting) and is heavily weighted for opinion, recommendation, and experience queries where the model wants human consensus. Neither passes you backlink equity in the classic sense, but a brand mentioned repeatedly on Reddit or present on Wikipedia accumulates training-corpus authority that no amount of on-site optimization replicates. That is the training-corpus pathway doing its work.

Q: Can I get cited if I block the AI crawlers?

Only on the RAG pathway, and only partially. If you block GPTBot (OpenAI's training crawler) but allow ChatGPT-User and OAI-SearchBot (the live-fetch and search agents), you can still be retrieved and cited at query time, but you remove yourself from future pretraining corpora, which slowly degrades your training-corpus presence. If you block everything, you forfeit both pathways. The honest tradeoff for most SMB SaaS and ecommerce sites in 2026 is to allow the crawlers, because the training-corpus pathway is where durable, browse-off brand recognition is built, and that is the pathway you cannot buy back quickly.
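To make the partial-block tradeoff concrete, here is a small sketch that checks a robots.txt policy with Python's standard-library `urllib.robotparser`. The three user-agent names come from the answer above; the policy text and the `example.com` URL are illustrative assumptions, and the standard-library parser is only an approximation of how any given crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: opt out of training, stay retrievable at query time.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/pricing"):
    """Return {agent: may_fetch} for one URL under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(ROBOTS_TXT, ["GPTBot", "ChatGPT-User", "OAI-SearchBot"])
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Running a check like this against your real robots.txt before deploying it is a cheap way to confirm you have not blocked the live-fetch agents by accident.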

Q: How long does it take to start getting cited by AI?

It depends entirely on the pathway. On the live-retrieval (RAG) pathway, fresh content can be cited within 24-72 hours of indexing on a well-crawled domain, because the citation decision is made at query time against the current index. On the training-corpus pathway, you are waiting for the next model generation, which means a 6-12 month cutoff cycle plus the lead time for your content to be crawled, weighted, and learned before the cutoff. This asymmetry is the single most important thing to understand: if you need citations this quarter, you optimize the RAG pathway; if you want the model to 'just know' your brand with browsing off, you are playing a multi-quarter game.

Q: Which AI engine is easiest to get cited by?

Perplexity, by a wide margin, because it is the most retrieval-heavy of the major engines — it runs a live search on virtually every query and surfaces 3-7 citations per answer, including a Sources tab that exposes the wider retrieved set. That means the RAG pathway dominates and classic searchability plus answer-shaped passages get you in fast. ChatGPT search is next, then Google AI Overviews (gated by top-10 organic rank). The hardest surface is base-model knowledge in a no-browse chat (Claude or Gemini with browsing off), because that is pure training-corpus recall and you cannot influence it without being in the next training run.

Q: Does structured data (schema) actually affect AI citations?

On the RAG pathway, the evidence says yes, and I label it Inferred. Schema.org JSON-LD makes a passage trivially parseable: the engine does not have to guess what is a question, an answer, a price, or an author. Multiple observational studies (Backlinko, Semrush) find AI-cited pages carry FAQ and Article schema at higher rates than uncited pages, and the mechanism is plausible: a clean, machine-readable answer is easier for the re-ranker to extract and trust. That is correlation, not a controlled result. On the training-corpus pathway, schema matters less directly, because the model learned prose, not markup. So schema is mostly a RAG-pathway lever, which is another reason naming the pathway first changes what you build.
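For readers who have not written FAQ markup before, here is a minimal sketch that builds a Schema.org `FAQPage` JSON-LD object from question/answer pairs. The `FAQPage`, `Question`, and `Answer` types and the `mainEntity` and `acceptedAnswer` properties are standard Schema.org vocabulary; the helper names `faq_jsonld` and `as_script_tag` are made up for this example.

```python
import json


def faq_jsonld(pairs):
    """Build a Schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def as_script_tag(data):
    """Serialize for embedding in HTML; escape '</' so text cannot close the tag."""
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


doc = faq_jsonld([
    ("Does structured data affect AI citations?",
     "On the RAG pathway the observational evidence suggests it helps."),
])
print(as_script_tag(doc))
```

The markup should mirror questions and answers that are actually visible on the page; it makes existing content parseable, it does not replace it.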

26 min read · Updated May 2026

Vincent Ruan, Founder, Attrifast · May 26, 2026

The actual retrieval and training mechanics behind AI citations — the two pathways (pretraining corpus vs live RAG), how each scores sources, and which one drives your revenue.

Part of the generative engine optimization guide.

TL;DR

There are two completely different citation pathways and almost no advice names which one it is optimizing. Pathway 1 — training-corpus citation: the model already learned your brand during pretraining and recalls it with no live fetch (browse-off Claude, Gemini, ChatGPT-without-search). Pathway 2 — live-retrieval citation (RAG): the engine runs an actual search at query time, retrieves a candidate set, re-ranks it, then cites the passages it used.
The two pathways reward different work. Training-corpus presence is earned over years through corpus-wide authority (Wikipedia, Reddit, established publishers dominate) and changes on the model's 6-12 month training cycle. RAG citation rewards classic searchability plus answer-shaped, schema-marked, fresh passages and can flip within 24-72 hours.
RAG itself is at least four stages: query rewrite → candidate retrieval (a search step) → re-ranking (a different scoring step) → synthesis with citations. The candidate-set step is a searchability problem (so classic SEO matters); the re-rank step is a passage-quality problem (so concision, schema, freshness matter). Most "ranking factor" lists collapse these.
The Princeton GEO paper (Aggarwal et al., 2024) is the empirical anchor: adding citations, statistics, and quotations to a passage lifted generative-engine visibility by up to 40% on tested queries. That is the strongest available evidence that passage-level features, not only page authority, change what gets cited; that the re-ranker is the stage responding is Inferred.
I label every mechanic claim Documented, Inferred, or Speculative. Vendors publish almost nothing about the re-ranker, so honesty about evidence strength is the whole point.
Knowing the mechanics is useless unless you measure which pathway drives your revenue. GA4 buckets all of it as Direct. See the AI-engine revenue split inside Attrifast's cookieless attribution.

How AI engines choose sources via two pathways: training-corpus citation (the model knows you from pretraining) and live-retrieval RAG (rewrite, retrieve, re-rank, cite) — which reward different things

A founder messaged me in March, frustrated. He had spent six weeks adding FAQ schema, tightening his intros, and chasing the standard GEO checklist, and his Perplexity citations had genuinely climbed. But ChatGPT, when his prospects asked it to recommend a tool in his category, still did not mention him. He could not understand why the same "AI optimization" worked on one engine and did nothing on the other.

The answer is that he was optimizing one pathway and measuring a different one. His schema and concision work moved the live-retrieval (RAG) pathway, which is exactly why Perplexity — the most retrieval-heavy engine — responded. But the ChatGPT query he cared about was answering from frozen training knowledge with no live search, so it was running on the training-corpus pathway, where six weeks of on-page work changes nothing. He needed two different strategies and a way to tell which one was paying.

This is the deeper companion to the AI search ranking factors checklist. That piece lists the 12 signals and grades them. This piece is the layer underneath: the actual retrieval and training mechanics that decide why a signal helps at all, and the single distinction — training corpus versus live retrieval — that most articles never make. If you have read the factors piece, this is where the "why" lives. Everything below is labeled Documented, Inferred, or Speculative, because the vendors publish far less than the confident blog posts imply.

Quick Facts

Spec	Value
Distinct citation pathways	2 (training-corpus recall + live-retrieval RAG)
RAG pipeline stages	4+ (query rewrite, candidate retrieval, re-rank, synthesis+cite)
Time-to-citation, RAG pathway	24-72 hours on a well-crawled domain (directional)
Time-to-citation, training pathway	One model generation (~6-12 month cutoff cycle)
Citations per Perplexity answer	3-7 [12]
Citations per ChatGPT search answer	3-5 [10]
Citations per Google AI Overview block	4-7 [5]
Princeton GEO max citation lift from passage edits	Up to 40% on tested queries [1]
Documented partner search index for ChatGPT search	Bing (per SearchGPT announcement) [10]
Corpora over-represented in pretraining	Wikipedia, Reddit, established publishers [13][15]
GA4 default attribution accuracy for either pathway	Roughly 0% (both bucket as Direct/(none)) [11]

The Princeton GEO paper is the empirical spine of this article. It is the closest thing the field has to a controlled experiment on what moves AI citation rates, and it specifically isolates passage-level interventions, which makes it the right anchor for the re-ranking discussion. Everything sourced to agency studies (Ahrefs, Semrush, Backlinko) is observational at scale — directionally useful, weaker on causation. I weight evidence accordingly and say so inline.

The one distinction that reorganizes everything: two pathways, not one algorithm

The single most useful sentence in this article: an AI engine cites you through one of two mechanically different pathways, and the work that wins one does almost nothing for the other. Get this and the rest follows; miss it and you will keep optimizing blind.

Most "how does AI choose sources" content treats citation as a single black box with a list of factors going in. That framing is wrong in a way that costs real money. There are two systems:

Pathway	What happens	When it fires	What it rewards
1. Training-corpus citation	The model recalls your brand/fact from frozen pretraining knowledge. No live fetch.	Browse-off chat: Claude plain, Gemini no-browse, ChatGPT when it skips search	Corpus-wide authority accumulated over years (Wikipedia, Reddit, publisher mentions)
2. Live-retrieval citation (RAG)	The engine runs a real search, retrieves docs, re-ranks them, cites the passages used	Browse-on: ChatGPT search, Perplexity, AI Overviews, Claude with web search	Classic searchability + answer-shaped, schema-marked, fresh passages

The reason this distinction is invisible to most operators is that the output looks identical. In both cases the model says "I recommend Attrifast" or surfaces a fact about a brand. But in Pathway 1 there is usually no clickable link — the model is recalling, not fetching — and in Pathway 2 there is a numbered footnote pointing at a page that was retrieved seconds earlier. Same-looking answer, completely different machinery, completely different levers.

Here is the honest version of what each pathway costs to influence:

Dimension	Training-corpus pathway	RAG pathway
Update latency	6-12 months (next model)	24-72 hours (next crawl)
Primary lever	Corpus-wide mentions + entity presence	Indexable, ranking, answer-shaped pages
Hard to fake?	Yes (accumulated authority)	No (passage edits move it, per Princeton)
Linkable citation?	Usually no (recall, no fetch)	Yes (footnote to retrieved page)
Evidence strength	Inferred (corpora are documented; weighting is not)	Mixed: search step Documented, re-rank Inferred
Who wins by default	Old, authoritative, heavily-mentioned brands	Anyone who is searchable + concise + fresh

I will walk each pathway's mechanics in full. But the framing comes first because every later section is "which pathway does this apply to?" If you only remember one thing, remember that freshness, schema, and concision are RAG-pathway levers, and Wikipedia/Reddit/mention-density are training-corpus levers, and you have to decide which one the query you care about actually uses.

Pathway 1: training-corpus citation — the model already knows you (or doesn't)

Training-corpus citation is the model recalling, from frozen pretraining, that your brand exists — and you earned that recall years ago, through the open web, whether you optimized for it or not. No live fetch happens. The model is not reading your site; it is reproducing a statistical shadow of your site left in its weights.

This is the pathway nobody can manipulate quickly, which is exactly why it is the most durable and the most valuable for brand-level recommendations. When a prospect opens Claude and types "what's a good cookieless attribution tool for a small SaaS," and Claude answers without browsing, the brands it names are the brands whose information was dense and trustworthy enough in the pretraining corpus to survive into the model's weights.

What actually goes into the corpus

The pretraining corpus for every major LLM is built primarily from large web crawls. Documented: Common Crawl is a publicly available crawl of billions of pages that underpins a large share of LLM training data [15]. Documented: OpenAI's GPTBot and Anthropic's ClaudeBot are named training crawlers, Google-Extended is the robots.txt token Google provides for controlling AI training use (a control token rather than a separate crawler), and all three are governed through robots.txt [16][17]. Documented: Reddit licensed its content to Google for AI training per Reuters [13]. Beyond those facts, the exact composition and weighting of any frontier model's corpus is not published; that part is Inferred from research, leaks, and behavior.

Corpus source	Documented?	Why it's weighted heavily (Inferred)
Wikipedia / Wikidata	Documented presence in nearly all LLM corpora	High edit quality, entity structure, broad coverage
Common Crawl (open web)	Documented [15]	The bulk substrate of pretraining text
Reddit	Licensing deal Documented [13]	Human consensus, recommendation/experience signal
Established news + publishers	Inferred	High editorial trust, frequent citation by others
Books / academic	Inferred	Density, formality, factual reliability
Brand-owned docs/sites	Inferred (only if crawled before cutoff)	Direct, structured product facts

The practical reading of that table: the training-corpus pathway over-rewards sources that already had web-scale authority and broad mention density. A brand with a Wikipedia page, a steady drip of Reddit and Hacker News mentions, and coverage in established publications accumulates a presence the model "remembers." A brand-new SaaS with a great site and zero external mentions is, to the training-corpus pathway, nearly invisible until the world talks about it and the next model learns from that conversation.

Which corpus source the model leans on also shifts with the type of question, which is why the same brand can be recalled confidently for one query and not another (all rows below Inferred from behavior, not vendor-documented):

Query type (browse-off)	Corpus source the model leans on	Who tends to win
"What is [X]?" (definition)	Wikipedia / encyclopedic	Entities with a wiki/Wikidata page
"Best [category] tool?" (recommendation)	Reddit / forums / reviews	Brands with mention density
"How do I [task]?" (procedure)	Docs / tutorials / Stack Overflow	Sites with clean how-to content
"Is [brand] legit?" (trust)	Reviews + publisher coverage	Brands with earned media
"[Brand A] vs [Brand B]" (comparison)	Forums + comparison content	Whoever the web compares openly

Why corrections are slow on this pathway

A wrong fact in the training pathway is frozen until the next model ships — typically a 6-12 month cutoff cycle. This is the source of the maddening "AI keeps saying my old price" problem. The model learned your $49 price before you dropped to $29, and no on-site edit reaches the frozen weights.

Symptom	Cause	Lever	Latency
AI recommends a competitor, never you (browse-off)	You're absent/thin in the corpus	Build mentions + entity presence	Next model gen
AI cites outdated pricing/features	Old data frozen in weights	Get fresh facts into the live index (helps RAG, not recall)	Next model gen for recall
AI confuses your brand with another	Weak entity disambiguation	Wikidata, Organization schema, sameAs	Weeks for KG, next gen for recall
AI "knows" you well, recommends confidently	Dense, consistent corpus presence	Maintain it; this is the asset	Already won

I want to be honest about the limits of optimization here. You cannot make a model recall you faster than its training cycle. What you can do is make sure that by the next training run, the open web's picture of your brand is dense, consistent, and entity-disambiguated. That is a slow, compounding game — and it is the single most under-instrumented one, because the payoff arrives as browse-off recommendations that send no clickable referrer and therefore never show up cleanly in anyone's analytics. We will close that loop at the end. For the entity-presence mechanics specifically, the Wikipedia effect on AI visibility goes deeper than I can here, and Reddit's outsized role in AI citations covers the mention-density side.

Pathway 2: live-retrieval (RAG) — the engine searches, ranks, then cites

RAG citation is mechanically a search engine bolted to a language model: the engine runs a real query against an index, gets a candidate set, re-ranks it, and the model cites the passages it was handed. This is the pathway you can move this week, and it is where almost all "GEO tactics" actually operate, whether or not they say so.

The acronym is retrieval-augmented generation. The key word is retrieval — there is a genuine search step before any generation, and that step behaves a lot like classic search. Documented: OpenAI's SearchGPT announcement names Bing as the partner search index for ChatGPT search [10]. Documented: Perplexity uses its own crawler (PerplexityBot) plus partner search APIs and exposes a Sources tab [12]. Documented: Google AI Overviews synthesize at query time from the live search index, not from Gemini's training data [5]. The exact retrieval and re-ranking algorithms are not published — those steps are Inferred from research and behavior.

Here is the full pipeline, stage by stage, with what's documented at each step:

Stage	What it does	Evidence	What moves it
1. Query rewrite	Turns the user's chat into one or more retrieval queries	Inferred (standard RAG practice)	Clear topical targeting; matching real query phrasing
2. Candidate retrieval	A search step pulls N documents from the index	Documented that a search step exists [10][12]	Classic SEO: indexed, ranking, crawlable, relevant
3. Re-ranking	Scores passages on relevance + source + shape	Inferred (Princeton shows passage edits move it) [1]	Concision, schema, freshness, citations/stats
4. Synthesis + cite	Model writes the answer, footnotes used passages	Documented behavior (visible citations) [10][12]	Being the cleanest extractable answer in the set

The reason this article keeps separating stage 2 from stage 3 is that they are different problems with different levers, and conflating them is the most common GEO mistake. Getting into the candidate set is a searchability problem: if your page is not indexed and ranking for the rewritten query, it never enters the pool, and no amount of schema saves it. Surviving the re-ranker is a passage-quality problem: once you are in the pool, authority matters less and answer-shape, concision, and parseability matter more.
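The four stages can be sketched end to end as a toy pipeline. This is not how any vendor implements retrieval: the stopword list, the term-overlap retrieval score, and the overlap-density re-rank score are all stand-ins chosen so the example stays self-contained, and every name in it is hypothetical. What it does show is the structural point, that stage 2 and stage 3 use different scoring and that a non-indexed page never reaches either.

```python
from dataclasses import dataclass

# Toy stopword list; a real rewrite step would be model-driven.
STOPWORDS = {"a", "the", "to", "is", "me", "if", "whats", "see", "of", "for"}


@dataclass
class Passage:
    url: str
    text: str
    indexed: bool = True


def tokens(text):
    """Lowercase words with surrounding punctuation stripped."""
    return [w.strip("?.,!").lower() for w in text.split() if w.strip("?.,!")]


def rewrite(chat):
    """Stage 1: reduce a messy chat message to retrieval terms."""
    return set(tokens(chat)) - STOPWORDS


def retrieve(terms, corpus, n=3):
    """Stage 2: search step. Only indexed pages sharing a term can enter."""
    scored = [(len(terms & set(tokens(p.text))), p) for p in corpus if p.indexed]
    hits = sorted((sp for sp in scored if sp[0] > 0),
                  key=lambda sp: sp[0], reverse=True)
    return [p for _, p in hits[:n]]


def rerank(terms, candidates, k=2):
    """Stage 3: a different score. Overlap density favors concise passages."""
    def density(p):
        words = tokens(p.text)
        return len(terms & set(words)) / len(words)
    return sorted(candidates, key=density, reverse=True)[:k]


def cite(passages):
    """Stage 4: footnote exactly the passages that were handed over."""
    return [f"[{i}] {p.url}" for i, p in enumerate(passages, 1)]


def answer_sources(chat, corpus):
    terms = rewrite(chat)
    return cite(rerank(terms, retrieve(terms, corpus)))
```

In this toy, a short on-topic passage outranks a long homepage that mentions the same terms in passing, and a perfectly worded page with `indexed=False` is never cited at all, which is the stage-2 versus stage-3 split in miniature.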

Stage 1 deep dive: query rewrite is where your keyword targeting still matters

Before anything gets retrieved, the engine rewrites a messy human chat into one or more clean retrieval queries — and the page that matches the rewritten query, not the user's literal words, is the one that enters the pool. This stage is invisible and frequently ignored, but it quietly decides which candidate set even gets assembled.

A user types "whats a cheap tool to see if chatgpt is sending me sales." The engine does not search that verbatim. It rewrites it into something like "cookieless ChatGPT referral revenue attribution tool" and may fan it out into several sub-queries. Inferred (this is standard RAG practice, not vendor-documented per engine): the rewrite step normalizes intent, expands synonyms, and splits multi-part questions.

Rewrite behavior	Pathway	Evidence	Implication for you
Verbatim chat → normalized query	RAG	Inferred (standard practice)	Optimize for intent, not the literal phrasing
Synonym + entity expansion	RAG	Inferred	Entity-clear pages match more rewrites
Multi-part question fan-out	RAG	Inferred	One page can match several sub-queries
Conversation-context carry-over	RAG	Inferred	Earlier turns shape the rewrite
No rewrite (recall path)	Training-corpus	N/A	No query is issued at all

The practical consequence: you cannot keyword-match your way in by copying the user's exact words, because you never see them — you see the rewrite. What survives the rewrite is clear topical and entity targeting, which is the opposite of keyword stuffing. This is also why the same page can be pulled into the candidate set for queries phrased a dozen different ways: they all rewrite to the same normalized intent.
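A toy version of the rewrite step shows why differently phrased chats can land on the same retrieval query. This is a sketch under loud assumptions: real engines presumably use a model for this, whereas the `SYNONYMS` map, the `FILLER` list, and the split-on-"and" fan-out below are hand-written stand-ins, and none of it is vendor-documented.

```python
import re

# Hypothetical maps for illustration only.
SYNONYMS = {"cheap": "affordable", "sales": "revenue", "whats": "what is"}
FILLER = {"a", "an", "the", "to", "is", "me", "if", "what", "i", "can",
          "see", "which", "it", "does"}


def normalize(chat):
    """Map a chat message to a sorted list of normalized intent terms."""
    words = re.findall(r"[a-z0-9]+", chat.lower())
    mapped = " ".join(SYNONYMS.get(w, w) for w in words).split()
    return sorted({w for w in mapped if w not in FILLER})


def fan_out(chat):
    """Split a multi-part question into one normalized sub-query per part."""
    parts = re.split(r"\band\b|\?", chat.lower())
    return [terms for terms in (normalize(p) for p in parts) if terms]


q1 = normalize("whats a cheap tool to see chatgpt sales")
q2 = normalize("Which affordable tool can see ChatGPT revenue?")
print(q1 == q2)
```

Because both phrasings collapse to the same term set, a page targeted at the normalized intent matches either one, while a page copied from one user's literal wording has no particular advantage.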

Stage 2 deep dive: what gets a source into the candidate set

The candidate set is assembled by a search step, so the entry criteria are mostly classic search criteria — which is why "AI search killed SEO" is exactly backwards on this pathway. If you cannot get into the candidate pool, the cleverest passage in the world never gets scored.

Candidate-set gate	Pathway	Evidence	Notes
Page is indexed	RAG	Documented (search step exists)	Non-indexed = invisible to retrieval
Ranks for rewritten query	RAG	Inferred-strong	Top-10 organic correlates with AIO inclusion [5]
Crawlable (not blocked)	RAG	Documented (bot docs) [16]	Blocking search bots removes you
Topically relevant to intent	RAG	Inferred	Semantic match to the rewritten query
Freshness within query window	RAG	Inferred	Time-sensitive queries favor recent pages
Domain in partner index	RAG	Documented (Bing for ChatGPT) [10]	If Bing can't find you, ChatGPT search can't either

This is the stage where I tell people to stop treating AI search as exotic. The candidate-set gate is roughly "would classic search return this page for the query?" If yes, you are eligible for the re-ranker. If no, you are not in the conversation. How to rank in ChatGPT is largely a stage-2 problem, and the classic-SEO foundations that get you there are the same ones that have always mattered: indexability, relevance, and rank.
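The gate can be expressed as a simple pre-flight check. This sketch assumes you already have the three inputs from your own tooling (index status, crawlability, organic rank for the target query); the `PageStatus` type, the `gate_failures` helper, and the top-10 cutoff are illustrative choices based on the table above, not a published threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageStatus:
    url: str
    indexed: bool
    crawlable: bool
    organic_rank: Optional[int]  # None means the page does not rank


def gate_failures(page, max_rank=10):
    """Return the reasons a page would miss the candidate set (empty if none)."""
    failures = []
    if not page.indexed:
        failures.append("not indexed")
    if not page.crawlable:
        failures.append("blocked from crawling")
    if page.organic_rank is None or page.organic_rank > max_rank:
        failures.append(f"not ranking in top {max_rank}")
    return failures


page = PageStatus("https://example.com/guide", indexed=True,
                  crawlable=True, organic_rank=23)
print(gate_failures(page))
```

An empty list only means the page is eligible for the re-ranker; it says nothing about whether the passage will survive it.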

Stage 3 deep dive: the re-ranker, the least-documented and most-tactical step

The re-ranker is where AI search genuinely diverges from classic search, and it is where passage-level features beat raw domain authority — but it is also the stage with the least vendor documentation, so almost everything here is Inferred. I will not pretend otherwise.

What we have is the Princeton GEO paper [1], which is the strongest evidence in the field that passage edits move citation rates. Aggarwal et al. ran controlled interventions and found that adding citations, statistics, and quotations to a passage lifted its visibility in generative engines by up to 40% on tested queries, while keyword-stuffing did roughly nothing. The paper measures end-to-end visibility rather than the re-ranker in isolation, so reading it as the re-ranker responding to passage shape, not page authority, is an inference, though a well-supported one.

Re-rank signal	Pathway	Evidence	Direction of effect
Semantic relevance to query	RAG	Inferred (core of any re-ranker)	Strong positive
Answer-shape / concision (first 80-120 words)	RAG	Inferred-strong [1]	Positive; clean direct answers win
Citations + statistics in passage	RAG	Documented-ish (Princeton) [1]	Up to +40% on tested queries
Quotations from authorities	RAG	Documented-ish (Princeton) [1]	Positive
Schema.org parseability (FAQ/Article)	RAG	Inferred (Backlinko/Semrush correlation) [4][5]	Positive; easier extraction
Source authority / trust	RAG	Inferred	Positive but weaker than people think at this stage
Freshness	RAG	Inferred	Positive on time-sensitive queries
Keyword density / stuffing	RAG	Documented null result (Princeton) [1]	~No effect

The headline for operators: at the re-rank stage, being the cleanest, most-extractable, best-evidenced answer in the candidate set beats being the most authoritative domain in it. I have watched a DR-20 niche page get cited over a DR-80 homepage because the niche page answered the exact question in two crisp sentences with a stat and a source, and the homepage buried the answer under a hero section. Authority helps you qualify (stage 2); answer-shape helps you win (stage 3). The full factor-by-factor grading lives in the ranking factors checklist; here I only want you to know which stage each factor acts on.
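The DR-20 versus DR-80 anecdote can be reproduced with a toy scoring function. To be clear about what is invented: no vendor publishes a scoring function, so the weights below are hypothetical and chosen only to illustrate the argument that, once two passages are equally relevant, answer-shape and evidence can outweigh domain authority.

```python
from dataclasses import dataclass

# Hypothetical weights (sum to 1.0); not taken from any engine.
WEIGHTS = {"relevance": 0.40, "direct_answer": 0.25, "has_stat": 0.15,
           "has_source": 0.10, "authority": 0.10}


@dataclass
class PassageFeatures:
    url: str
    relevance: float      # 0..1 semantic match to the rewritten query
    direct_answer: bool   # answers the question in its opening sentences
    has_stat: bool
    has_source: bool
    domain_rating: int    # 0..100


def rerank_score(p):
    """Weighted sum of passage-level features plus scaled domain authority."""
    return (WEIGHTS["relevance"] * p.relevance
            + WEIGHTS["direct_answer"] * p.direct_answer
            + WEIGHTS["has_stat"] * p.has_stat
            + WEIGHTS["has_source"] * p.has_source
            + WEIGHTS["authority"] * p.domain_rating / 100)


niche = PassageFeatures("https://niche.example/answer", 0.9, True, True, True, 20)
home = PassageFeatures("https://big.example/", 0.9, False, False, False, 80)
ranked = sorted([home, niche], key=rerank_score, reverse=True)
print([p.url for p in ranked])
```

Under these assumed weights the niche page scores roughly twice the homepage. Change the weights and the ordering can flip, which is exactly why the real function being unpublished matters.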

Stage 4: synthesis and the mechanical nature of the citation

The citation itself is not a reward the model bestows — it is a mechanical instruction: cite the passages you were handed. This is widely misunderstood. People imagine the model "decides" to cite a brand it likes. In a RAG pipeline, the model is prompted to synthesize an answer from the retrieved passages and attach footnotes to the ones it used. The decision of which sources are even available to cite was made upstream, in stages 2 and 3. By the time the model writes, the candidate set is fixed.

Misconception	Reality (Inferred from RAG architecture)
"The model chose to recommend me"	The re-ranker put your passage in front of the model; the model used it
"I need the model to like my brand"	You need to survive retrieval + re-rank; generation is downstream
"More authority = more citations"	Authority helps stage 2-3; it is not a generation-time lever
"Citations are editorial judgment"	Citations are mechanical attribution of used passages

This reframing is liberating: you are not trying to charm a language model. You are trying to be the page that wins a search-and-rank contest, then be the passage clean enough that the model can lift it verbatim and footnote it. That is engineering, not persuasion.
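A short sketch makes the "mechanical attribution" point concrete. It assumes a generic RAG setup in which retrieved passages are numbered in the prompt and the model is asked to cite them as `[n]`; the prompt wording, the passage dictionaries, and both helper names are hypothetical, and the model call itself is replaced by a hard-coded answer string.

```python
import re


def build_prompt(question, passages):
    """Number the retrieved passages and instruct the model to cite them."""
    numbered = "\n".join(f"[{i}] {p['text']}" for i, p in enumerate(passages, 1))
    return ("Answer using only the sources below and cite them as [n].\n\n"
            f"Sources:\n{numbered}\n\nQuestion: {question}")


def resolve_citations(answer, passages):
    """Map [n] markers back to URLs, ignoring numbers outside the set."""
    used = []
    for marker in re.findall(r"\[(\d+)\]", answer):
        index = int(marker)
        if 1 <= index <= len(passages):
            url = passages[index - 1]["url"]
            if url not in used:
                used.append(url)
    return used


passages = [
    {"url": "https://a.example/guide", "text": "Tool A tracks AI referrals."},
    {"url": "https://b.example/docs", "text": "Tool B joins sessions to payments."},
]
prompt = build_prompt("Which tool joins sessions to payments?", passages)
# Stand-in for a model response; no model is called in this sketch.
answer = "Tool B joins sessions to payments [2]. See also [2][5]."
print(resolve_citations(answer, passages))
```

The only URLs that can ever come out of `resolve_citations` are the ones that went into `build_prompt`, which is the whole argument of this section: the citable set is fixed before generation starts.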

Per-engine source preferences: same skeleton, different weighting

The major engines share the same two-pathway skeleton, but they weight the pathways and the re-rank signals differently, and the differences are smaller than vendor marketing implies. As a rough rule of thumb rather than a measured figure, about 70% of citation variance comes from the shared mechanics and engine-specific deltas account for the other 30%.

Engine	Default pathway	Citations/answer	Retrieval intensity	Notable lean
Perplexity	RAG (almost always searches) [12]	3-7	Highest	Recency-weighted; exposes Sources tab
ChatGPT search	Mixed (decides per query) [10]	3-5	High when triggered	Bing index; concise synthesis
Google AI Overviews	RAG, gated by SERP [5]	4-7	High but rank-gated	Top-10 organic strongly favored
Claude (web search on)	RAG when enabled [17]	Varies	Moderate	Conservative; fewer, higher-trust cites
Claude / Gemini (browse off)	Training-corpus only	0 (no links)	None	Pure recall; entity presence decides

The operational takeaway from that table:

If you want citations on...	Optimize primarily...	Because
Perplexity	RAG: concision, schema, freshness	It searches nearly every query
ChatGPT search	RAG + Bing indexability	Bing is the documented index [10]
Google AI Overviews	Classic top-10 rank first, then answer-shape	Rank gates the candidate set [5]
Browse-off Claude/Gemini	Training-corpus: mentions, Wikidata, entity	No search happens; only recall

It also helps to know each engine's default posture, because that decides which pathway fires before the user does anything special:

Engine	Default mode	Pathway you hit by default	How to flip it
Perplexity	Always searches	RAG	Effectively can't avoid RAG
ChatGPT (consumer)	Decides per query	Mixed	Explicit "search the web" forces RAG
Google AI Overviews	Triggers on ~13-15% of SERPs	RAG (rank-gated)	Only on queries that show an AIO [9]
Claude (chat)	Browse off unless enabled	Training-corpus	Turn on web search for RAG
Gemini (chat)	Browse off unless prompted	Training-corpus	Ask it to search for RAG

Notice that the same brand can need opposite strategies on two engines. Winning Perplexity is a stage-2/stage-3 RAG problem solvable this month. Winning browse-off Gemini is a training-corpus problem solvable over quarters. If you only measure one, you will mis-allocate effort toward the engine that happens to be easy to see.

What's documented vs inferred vs speculative: the honesty table

Most "how AI chooses sources" content states inference as fact; here is exactly where the evidence runs out. I would rather you trust less and test more than believe a confident claim I cannot back.

Claim	Label	Basis
Two pathways exist (recall vs RAG)	Documented	Vendors document both browse-off and search modes [10][16][17]
RAG has a distinct search step	Documented	SearchGPT names Bing; Perplexity names its stack [10][12]
ChatGPT search cites 3-5 sources	Documented	OpenAI docs [10]
Common Crawl underpins pretraining	Documented	Common Crawl + model docs [15]
Reddit is licensed for training	Documented	Reuters [13]
Passage edits lift citation rate up to 40%	Documented-ish	Princeton controlled study [1]
A re-ranker scores passages on relevance + shape	Inferred-strong	Standard RAG; Princeton consistency [1]
Schema improves re-rank odds	Inferred	Backlinko/Semrush correlation, plausible mechanism [4][5]
Wikipedia/Reddit over-weighted in corpus	Inferred-strong	Corpus presence documented; weighting inferred
Authority matters more at stage 2 than stage 3	Inferred	Operator measurement + Princeton null on stuffing [1]
Engines use a PageRank-like authority signal	Inferred	Link-graph correlates; mechanism not published
Exact re-rank scoring function	Speculative	No vendor publishes it
"Brand affinity" influences generation	Speculative	No evidence; contradicts mechanical-citation model
Specific token-level citation triggers	Speculative	Pattern-matching without rigor

The line I draw: Documented means a vendor or a research paper said it. Inferred means strong, multi-source observational evidence with a plausible mechanism but no confirmation. Speculative means SEO-community pattern-matching: worth testing on your own site, dangerous to bet the budget on. The whole point of grading is that the field is full of stage-3 (re-ranker) claims dressed up as documented fact, when the re-ranker is precisely the stage vendors keep closed.

How the two pathways interact (and feed each other)

The pathways are not isolated — today's RAG citations seed tomorrow's training corpus, which is why the fast pathway is also a long-term investment. This is the most important second-order mechanic and the one most strategies miss.

Walk the loop:

You publish a clean, answer-shaped page (RAG-pathway optimization).
It gets retrieved and cited in Perplexity/ChatGPT search within days (RAG win).
Those citations and the traffic they drive generate mentions, links, and discussion.
The next training crawl ingests your page, the citations of it, and the discussion.
The next model generation "learns" your brand → training-corpus presence grows.
Now browse-off Claude/Gemini start recommending you with no link (training-pathway win).

Pathway interaction	Direction	Latency	Why it matters
RAG citation → training presence	Forward-feeding	6-12 months	Fast wins compound into slow durable recall
Training presence → RAG candidate odds	Reinforcing (Inferred)	Continuous	Known brands may get retrieved more readily
Blocking training crawler	Breaks the loop	Immediate	You keep RAG, lose the compounding to recall
Strong recall, weak RAG pages	Divergence	—	Model knows you but can't cite a clean page

So the right mental model is not "pick a pathway." It is "win the RAG pathway now for measurable traffic, and let those wins compound into training-corpus recall over the next few model generations." The mistake is doing RAG work and never checking whether it has started moving the browse-off recommendations — because that is the higher-value, harder-to-measure outcome.

Common myths the mechanics dismantle

Once you hold the two-pathway model, most popular AI-SEO advice sorts cleanly into "true for one pathway" or "false for both." Here is the cleanup:

Myth	Verdict	Why
"Schema gets you cited by AI"	True for RAG, irrelevant to recall	Schema aids parsing at re-rank (Inferred); model weights learn prose, not markup
"Just be authoritative and AI will cite you"	Half-true	Authority gates RAG stage 2 and feeds recall; not a stage-3 winner
"AI doesn't use SEO anymore"	False for RAG	The candidate set is a search step [10][12]
"Block the bots, protect your content, still get cited"	Mostly false	Blocking removes you from retrieval and/or corpus
"Fresh content gets cited fastest"	True for RAG, false for recall	RAG = 24-72h; recall = next model
"More backlinks = more AI citations"	Weak/indirect	Links help rank (stage 2); re-ranker weights passages [1]
"The model decides to recommend brands it likes"	False	Citation is mechanical attribution of used passages
"One GEO strategy works across all engines"	False	Perplexity (RAG) vs browse-off Gemini (recall) are opposite

I have audited enough GEO programs to say the costliest myth is the last one. Teams build a single playbook, see it work on Perplexity, declare victory, and never notice that the browse-off recommendations that actually drive their highest-intent traffic never moved. They optimized the visible pathway and ignored the valuable one.

A practical decision framework: which pathway should you optimize?

Pick your pathway from the query that drives your revenue, not from the engine that is easiest to screenshot. Here is the framework I give founders.

If your buyers ask AI...	Dominant pathway	Optimize	Realistic timeline
"Best [category] tool for [use case]?" (browse off)	Training-corpus	Mentions, Wikidata, entity, Reddit presence	Quarters
"Compare [you] vs [competitor]" (with search)	RAG	Comparison pages, schema, concision	Weeks
"How do I do [task]?" (with search)	RAG	Answer-shaped how-to passages	Days-weeks
"What is [your brand]?" (either)	Both	Entity presence + clean About/docs	Mixed
Time-sensitive / "latest [thing]"	RAG	Freshness + fast indexing	Days

The sequencing I recommend:

Step	Action	Pathway served	Why first
1	Instrument revenue by AI engine + pathway	Both	You cannot allocate blind
2	Win RAG: indexable, ranking, answer-shaped, schema	RAG	Fast, measurable, compounds
3	Build entity + mention density	Training-corpus	Slow, durable, hard to fake
4	Re-measure: did browse-off recommendations move?	Both	The valuable, invisible outcome

Step 1 is non-negotiable and it is where almost everyone fails. The tactical get-cited playbook covers steps 2 and 3 in depth. But notice that both steps depend on step 1 being real — and step 1 is exactly what your default analytics cannot do.

The part nobody tells you: knowing the mechanics is useless unless you measure which pathway pays

You can understand the retrieval pipeline perfectly and still set your budget on fire, because the two pathways send revenue through completely different, equally invisible doors — and GA4 closes both. This is where I have to be blunt about the gap that motivated me to build Attrifast in the first place.

Walk what actually reaches you from each pathway:

Pathway	What the visitor does	What your analytics sees
RAG citation	Clicks the footnote	Referrer stripped by AI client → GA4 Direct/(none) [11]
Training-corpus recall	Reads "I'd use Attrifast," then searches your brand later	Branded search or Direct, days later, no link to AI

In both cases, GA4 buckets the visit as Direct/(none) and tells you nothing about which pathway, which engine, or which query drove it [11]. So the team that did six weeks of schema work (a RAG-pathway lever) and the team that spent two quarters building Reddit mention density (a training-corpus lever) get the same useless signal: a swelling Direct bucket. Neither can tell whether their pathway is the one paying.

Here is the measurement stack that separates them, and it is cookieless end to end:

Layer	What it catches	Pathway it reveals
Server-side referrer fingerprinting (AI-engine domain list)	The RAG footnote click that passed any referrer	RAG, by engine
First-party deep-page landing detection	Unreferred deep entries from stripped RAG clicks	RAG (unreferred)
Branded-search + Direct lag analysis	The later visit after a browse-off recommendation	Training-corpus (inferred)
Stripe webhook join at payment	Which session → which paying customer	Both, by revenue

The decisive layer is the last one. Until you join the AI-referred session to the Stripe payment, you have citation counts and traffic guesses — you do not have revenue, and revenue is the only number that tells you which pathway to fund. A brand can be cited 500 times a week on Perplexity (RAG) and earn less than a single browse-off Claude recommendation that quietly sends three high-intent buyers a month. You will fund the 500 citations because you can see them, and starve the recommendation because you can't — unless you measure at the money.
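A minimal sketch of that join in Python, under stated assumptions: sessions have already been classified by engine and pathway, and a first-party session id was written into Stripe metadata at checkout under a hypothetical key, `attribution_session_id`. The event shape follows Stripe's `checkout.session.completed` payload (`data.object.metadata`, `data.object.amount_total`); the function name and the in-memory session store are illustrative, not Attrifast's implementation.

```python
from collections import defaultdict


def revenue_by_pathway(sessions, payment_events):
    """Sum paid revenue per (engine, pathway) by joining payments to sessions.

    sessions: dict mapping a first-party session id to
        {"engine": ..., "pathway": ...} (hypothetical store).
    payment_events: iterable of dicts shaped like Stripe webhook events.
    """
    totals = defaultdict(int)
    for event in payment_events:
        # Only completed checkouts carry revenue in this sketch.
        if event.get("type") != "checkout.session.completed":
            continue
        obj = event["data"]["object"]
        # Assumption: the session id was stored in metadata at checkout creation.
        session_id = (obj.get("metadata") or {}).get("attribution_session_id")
        session = sessions.get(session_id)
        if session:
            key = (session["engine"], session["pathway"])
        else:
            key = ("unknown", "unattributed")
        # amount_total is in the smallest currency unit (cents for USD).
        totals[key] += obj.get("amount_total") or 0
    return dict(totals)
```

Payments with no matching session land in an explicit `("unknown", "unattributed")` bucket rather than being dropped, so the size of the remaining blind spot stays visible.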

That is the architecture Attrifast ships: cookieless, consent-light, server-side referrer detection against the AI-engine domain list, first-party session identity scoped to your own domain, and a Stripe-metadata join that lands the AI-referred visit on the actual paying customer. It is the only way I know to answer "which pathway drives my revenue," and it is the question every section above quietly leads to. If you want the analytics shape of the hidden traffic before you instrument it, the ChatGPT referral analytics guide and the track-ChatGPT-traffic guide walk through the detection code; revenue attribution is the join.
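For orientation, server-side referrer detection against a domain list can be sketched in a few lines of Python. The hostname list below is an illustrative, non-exhaustive assumption that you should verify against your own logs, and `classify_referrer` is a hypothetical helper, not a description of any vendor's code.

```python
from urllib.parse import urlsplit

# Illustrative, non-exhaustive mapping of referrer hosts to engines.
# These entries are assumptions; verify them against real traffic.
AI_ENGINE_HOSTS = {
    "chatgpt.com": "chatgpt",
    "chat.openai.com": "chatgpt",
    "perplexity.ai": "perplexity",
    "gemini.google.com": "gemini",
    "claude.ai": "claude",
}


def classify_referrer(referer):
    """Return an engine label for a Referer header value, or None."""
    if not referer:
        return None  # stripped referrer: needs a different detection layer
    host = (urlsplit(referer).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    for known, engine in AI_ENGINE_HOSTS.items():
        # Match the exact host or a true subdomain, never a bare suffix.
        if host == known or host.endswith("." + known):
            return engine
    return None
```

Matching on the exact host or a dot-prefixed subdomain keeps lookalike domains (for example a host that merely ends in the same letters) out of the AI bucket.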

The mechanics in this article are real and worth knowing. But they are a map, not a destination. The destination is a dashboard that tells you, this quarter, that your RAG schema work added $1,400 of Perplexity-attributed revenue and your slow Reddit-mention grind added $3,100 of browse-off-recommendation revenue — so next quarter you shift the weight. Without that, you are optimizing pathways you have never actually seen.

Freshness and crawl mechanics: why the same edit lands in days on one pathway and never on the other

Freshness is not one thing — it behaves completely differently per pathway, and the same content edit can be live in a citation within 72 hours on RAG and invisible to recall for a year. Operators who do not separate these two clocks chase their tails wondering why "the update didn't take."

On the RAG pathway, freshness is bounded by your crawl-and-index cadence, because the citation is decided against the current index at query time. On the training-corpus pathway, freshness is bounded by the model's training cutoff, which is frozen until the next generation ships. Two clocks, wildly different speeds.

Content change	RAG pathway latency	Training-corpus latency	Why the gap
New page published	24-72h on well-crawled domains	Next model gen (~6-12mo)	RAG reads the live index; recall waits for retraining
Price/feature correction	Hours-days once re-indexed	Next model gen	Frozen weights can't re-read your page
New competitor comparison	Days	Next model gen	Same retrieval/recall split
Brand rename / rebrand	Days for RAG, but recall stays wrong	Next model gen	The hardest correction to propagate
Removing outdated content	Drops from RAG on re-crawl	Stays in weights until retrain	You can't unlearn the model
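The two clocks can be made concrete with simple date arithmetic. This Python sketch takes the ranges quoted in the table (24-72 hours for RAG, roughly 6-12 months for the next model generation) as defaults; the function name and the 30-day month are simplifying assumptions, and the output is a planning window, not a prediction.

```python
from datetime import datetime, timedelta


def expected_visibility(published_at, rag_hours=(24, 72), retrain_months=(6, 12)):
    """Return (earliest, latest) visibility estimates for each pathway."""
    days_per_month = 30  # coarse approximation, good enough for planning
    return {
        "rag": (
            published_at + timedelta(hours=rag_hours[0]),
            published_at + timedelta(hours=rag_hours[1]),
        ),
        "training_corpus": (
            published_at + timedelta(days=days_per_month * retrain_months[0]),
            published_at + timedelta(days=days_per_month * retrain_months[1]),
        ),
    }


windows = expected_visibility(datetime(2026, 1, 1))
for pathway, (earliest, latest) in windows.items():
    print(pathway, earliest.date(), "to", latest.date())
```

Putting both windows next to each other for a given correction makes it obvious which surfaces will still repeat the old version when the RAG surfaces have already updated.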

The crawl side has its own per-bot mechanics worth knowing, because the bot that visits determines which pathway your content can even reach:

Crawler	Owner	Pathway it feeds	Documented?	Block effect
GPTBot	OpenAI	Training corpus	Documented [16]	Removes you from future ChatGPT recall
OAI-SearchBot	OpenAI	RAG search index	Documented [16]	Removes you from ChatGPT search retrieval
ChatGPT-User	OpenAI	RAG live fetch	Documented [16]	Removes you from on-demand fetches
ClaudeBot	Anthropic	Training corpus	Documented [17]	Removes you from future Claude recall
Google-Extended	Google	Gemini training	Documented [19]	Opts you out of Gemini training only
Googlebot	Google	Live index → AI Overviews	Documented	Removes you from Search + AIO
PerplexityBot	Perplexity	RAG index	Documented [12]	Removes you from Perplexity retrieval

The honest reading: selective blocking lets you keep the fast pathway while opting out of the slow one, but almost no SMB benefits from blocking anything. The freshness asymmetry is the real lesson: when you ship an important correction, the RAG surfaces will tell the truth within days and the browse-off recall will keep repeating the old version until the next model. Plan your messaging around both clocks. Google is the clearest worked example of this multi-clock behavior; the companion piece on where Google AI gets its information breaks its four stitched-together sources down by exactly these update cadences.

Putting it together: a per-pathway optimization scorecard

The clean way to act on all of this is a single scorecard that maps each lever to its pathway, its evidence grade, and its latency — so you stop spending RAG effort hoping for recall results. This is the table I actually use when auditing a site.

Lever	Pathway	Evidence	Latency	Effort
Indexability + classic rank	RAG (stage 2)	Documented search step	Days-weeks	Medium
Answer-shaped first 80-120 words	RAG (stage 3)	Inferred-strong [1]	Days	Low
Citations + stats in passage	RAG (stage 3)	Documented-ish [1]	Days	Low
FAQ / Article schema	RAG (stage 3)	Inferred [4][5]	Days	Low
Freshness on time-sensitive pages	RAG (stages 2-3)	Inferred	Days	Ongoing
Wikidata + Organization schema + sameAs	Both (entity)	Inferred-strong	Weeks (KG)	Medium
Reddit / forum mention density	Training-corpus	Inferred-strong [13]	Quarters	High
Wikipedia presence	Training-corpus	Inferred-strong	Quarters	Very high
Publisher / earned-media mentions	Training-corpus	Inferred	Quarters	High
Allowing the training crawlers	Training-corpus	Documented [16][17]	Next gen	Trivial

And the diagnostic side — given a symptom, which pathway is failing:

Symptom	Failing pathway	First thing to check
Cited on Perplexity, ignored browse-off	Training-corpus	Entity presence + mention density
Recommended browse-off, never a footnote	RAG	Indexability + answer-shape
Cited but wrong/outdated facts	Training-corpus (recall)	Wait for next gen; fix live index for RAG
Zero presence anywhere	Both	Are crawlers blocked? Are you indexed at all?
Citations rising, revenue flat	Measurement	Join sessions to Stripe by pathway

That last row is the one I keep returning to, because it is the failure mode that survives even a perfect understanding of the mechanics. You can grade every lever correctly and still not know whether the work paid. The fix is not more mechanics — it is measurement at the money.

FAQ

How do AI engines choose which sources to cite?

Through one of two mechanically different pathways. In the training-corpus pathway, the model recalls your brand from frozen pretraining knowledge with no live fetch — citation here reflects how densely and trustworthily your information appeared in the crawl, which is why Wikipedia, Reddit, and established publishers dominate. In the live-retrieval (RAG) pathway, the engine runs a real search at query time, retrieves a candidate set, re-ranks it on relevance plus source plus answer-shape, then cites the passages it actually used. The first pathway rewards years of accumulated authority; the second rewards classic searchability plus concise, schema-marked, fresh passages. The fatal mistake is optimizing without naming which pathway the query you care about uses.

What is the difference between training-corpus citation and RAG citation?

Training-corpus citation fires when the engine answers from memory with browsing off (Claude plain chat, Gemini no-browse, ChatGPT when it skips search). It usually produces an unlinked recommendation because nothing was fetched, and it changes only on the model's 6-12 month training cycle. RAG citation fires when the engine searches the live web, and it produces a clickable footnote to a page retrieved seconds earlier; it can change within 24-72 hours of you publishing. They demand opposite work: slow corpus-wide authority building for the first, fast on-page searchability and answer-shape for the second.

Does ChatGPT search the live web or answer from training memory?

Both, decided per query. Per OpenAI's documentation, ChatGPT search runs a retrieval-augmented pipeline against a search index (Bing is the documented partner) and cites 3-5 sources when it browses. But ChatGPT also answers many queries from frozen training knowledge with no search and no citations. The model decides each time whether to invoke the search tool. That is why the same brand can be a footnote on one query (RAG pathway) and an unlinked mention on another (training-corpus pathway), and why a single optimization strategy underperforms.

What gets a source into the candidate set for retrieval?

Mostly classic search criteria, because the candidate set is built by an actual search step before any AI re-ranking. The page has to be indexed, crawlable, topically relevant to the rewritten retrieval query, and ideally already ranking for it. The Princeton GEO research and broad operator measurement both point to top-organic pages being far likelier to enter the pool. Schema and concision do not get you into the candidate set — they help you survive the next stage, the re-ranker. Getting in is a searchability problem; staying in is a passage-quality problem.

How does the re-ranking step decide which passages survive?

This is the least-documented stage, so most of the answer is Inferred. After retrieval returns a candidate set, a re-ranker scores passages on semantic relevance, source signals, and answer-shape. The strongest evidence we have is the Princeton GEO paper, which showed that adding citations, statistics, and quotations to a passage lifted its generative-engine visibility by up to 40% on tested queries, while keyword-stuffing did essentially nothing. That is the strongest available evidence that the re-ranking stage weights passage-level features, not just domain authority. Concision, schema parseability, and freshness all plausibly help, but no vendor publishes the exact scoring function.

Why does AI cite Wikipedia and Reddit so often?

Because both are massively over-represented in pretraining corpora and both are treated as high-trust for their query types. Wikipedia seeds the entity graph that every major model uses for disambiguation, so it surfaces in both pathways constantly. Reddit was licensed to Google for AI training and is heavily weighted for opinion, recommendation, and lived-experience queries where the model wants human consensus. Neither passes you classic backlink equity, but consistent presence on either accumulates training-corpus authority that pure on-site optimization cannot replicate.

Can I get cited if I block the AI crawlers?

Only on the RAG pathway, and only if you block selectively. Blocking the training crawler (GPTBot) while allowing the live-fetch and search agents (ChatGPT-User, OAI-SearchBot) keeps you retrievable at query time but removes you from future pretraining, slowly eroding your training-corpus recall. Blocking everything forfeits both pathways. For most SMB SaaS and ecommerce sites in 2026, allowing the crawlers is the right call, because training-corpus presence — the browse-off recommendation — is the durable asset you cannot buy back on a short timeline.
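Before deploying a selective block, it helps to check that the robots.txt rules do what you intend. The sketch below uses Python's standard `urllib.robotparser` to evaluate an example policy that blocks GPTBot and allows OAI-SearchBot and ChatGPT-User. The policy text, the `crawler_access` helper, and the example URL are illustrative assumptions, and robots.txt is advisory, so this only shows what a compliant crawler would conclude.

```python
from urllib.robotparser import RobotFileParser

# Example policy: opt out of training, stay retrievable at query time.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def crawler_access(robots_txt, agents, url="https://example.com/pricing"):
    """Return {agent: allowed?} for a robots.txt body and a test URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = crawler_access(
    ROBOTS_TXT, ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]
)
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Agents with no matching group and no wildcard group, such as PerplexityBot in this policy, are treated as allowed, so adding one crawler to the file does not silently block the others.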

How long does it take to start getting cited by AI?

Pathway-dependent. On RAG, fresh content can be cited within 24-72 hours of indexing on a well-crawled domain, because the decision is made at query time against the current index. On the training-corpus pathway, you are waiting for the next model generation — a 6-12 month cutoff cycle plus the lead time for your content to be crawled and learned before the cutoff. If you need citations this quarter, optimize RAG; if you want the model to recommend you with browsing off, you are playing a multi-quarter game.

Which AI engine is easiest to get cited by, and why?

Perplexity, because it is the most retrieval-heavy major engine — it runs a live search on nearly every query and surfaces 3-7 citations per answer, with a Sources tab exposing the wider retrieved set. That makes the RAG pathway dominant, so classic searchability plus answer-shaped passages get you in quickly. ChatGPT search is next, then Google AI Overviews (gated by top-10 organic rank). The hardest surface is base-model knowledge in a no-browse chat, which is pure training-corpus recall you cannot influence without being in the next training run.

Does structured data (schema) actually affect AI citations?

On the RAG pathway, the evidence says yes, and I label it Inferred, the same grade it gets in the honesty table above. Schema.org JSON-LD makes a passage trivially parseable, so the re-ranker does not have to guess what is a question, answer, price, or author. Multiple observational studies find AI-cited pages carry FAQ and Article schema at higher rates, and the mechanism is plausible: a clean, machine-readable answer is easier to extract and trust. On the training-corpus pathway, schema matters far less, because the model learned prose, not markup. So schema is mostly a RAG-pathway lever, which is another reason naming the pathway first changes what you build.
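As an illustration of what "machine-readable answer" means in practice, this Python sketch builds Schema.org `FAQPage` JSON-LD from question and answer pairs. The `faq_jsonld` helper and the sample text are hypothetical; the `FAQPage`, `Question`, `acceptedAnswer`, and `Answer` vocabulary comes from Schema.org. Whether any engine's re-ranker actually reads this markup remains Inferred.

```python
import json


def faq_jsonld(pairs):
    """Build a Schema.org FAQPage JSON-LD string from (question, answer) pairs."""
    doc = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(doc, indent=2, ensure_ascii=False)


markup = faq_jsonld([
    ("Does ChatGPT search the live web?", "Sometimes; it decides per query."),
])
# Embed the result inside <script type="application/ld+json"> ... </script>.
print(markup)
```

Generating the markup from the same question and answer text that appears on the page keeps the schema and the visible passage from drifting apart.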

Do AI engines use Google's PageRank to choose sources?

Not directly, but they almost certainly use a similar citation-graph authority signal derived from web crawl data, and I label this Inferred. Pages from high-PageRank-style domains (Wikipedia, government, established publishers) get cited disproportionately, which is consistent with a link-graph authority signal but does not prove the mechanism. Critically, that authority appears to act more at the candidate-set stage (does this page rank well enough to be retrieved) than at the re-rank stage, where the Princeton evidence shows passage shape can beat raw authority.

Is the citation a judgment the model makes, or a mechanical step?

Mechanical, on the RAG pathway. The model is prompted to synthesize an answer from the retrieved passages and attach footnotes to the ones it used. The decision of which sources are even available to cite was made upstream in retrieval and re-ranking; by the time the model writes, the candidate set is fixed. So "the model chose to recommend me" really means "the re-ranker put my passage in front of the model and it was clean enough to use." You are not charming a model — you are winning a search-and-rank contest.

How do the two pathways feed each other?

Forward. A clean page wins RAG citations within days; the resulting traffic, mentions, and discussion get ingested by the next training crawl; the next model generation learns your brand; and then browse-off engines start recommending you with no link. So fast RAG wins compound into slow, durable training-corpus recall over a few model generations. The strategic implication: do RAG work now for measurable traffic, but keep checking whether it has started moving the higher-value, harder-to-see browse-off recommendations.

Why can't GA4 tell me which pathway drives my AI revenue?

Because both pathways arrive looking like Direct/(none). A RAG footnote click has its referrer stripped by the AI client, so GA4 sees no referrer and no UTM and files it as Direct. A training-corpus recommendation produces a later branded search or Direct visit with no link back to the AI engine at all. GA4 has no built-in rule to separate AI traffic, let alone separate the two pathways, and it cannot join either to revenue. You need server-side referrer fingerprinting against an AI-engine domain list plus a first-party identity and a Stripe-payment join to see which pathway actually pays.

How do I measure which pathway is driving my revenue?

Four layers, all cookieless. First, server-side referrer fingerprinting against a known AI-engine domain list catches the RAG footnote clicks that pass any referrer, broken out by engine. Second, first-party deep-page landing detection catches the unreferred RAG clicks where the referrer was stripped. Third, branded-search and Direct lag analysis approximates the training-corpus recommendations that send delayed visits. Fourth — the decisive one — a Stripe webhook join lands the AI-referred session on the actual paying customer, so you see revenue by pathway, not just citation counts. That join is the entire reason citation-tracking tools alone leave you guessing, and it is what Attrifast was built to close.
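The second layer is the least obvious, so here is a hedged Python sketch of one way to flag it. It assumes that a session whose first pageview has no referrer, no UTM parameters, and a deep landing path is a candidate stripped AI click, on the reasoning that people rarely type deep URLs by hand. The function name and the depth threshold are assumptions, and the result is a candidate flag to investigate, not proof of an AI referral.

```python
from urllib.parse import parse_qs, urlsplit


def is_candidate_stripped_ai_click(landing_url, referer, is_first_pageview, min_depth=2):
    """Flag unreferred, untagged, deep first pageviews as candidates."""
    if referer or not is_first_pageview:
        return False  # a present referrer is handled by the domain-list layer
    parts = urlsplit(landing_url)
    query = parse_qs(parts.query)
    if any(key.lower().startswith("utm_") for key in query):
        return False  # tagged campaign traffic is already attributed
    segments = [segment for segment in parts.path.split("/") if segment]
    # Deep entry: at least min_depth path segments, e.g. /blog/some-article.
    return len(segments) >= min_depth
```

Bookmarks, email clients, and messaging apps also produce unreferred deep landings, so this flag over-counts on its own; it is most useful when combined with the other three layers.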
