I ran a real matched-pair experiment: 5 sites shipped llms.txt, 5 did not, citations tracked weekly for 6 weeks across ChatGPT, Claude, Perplexity, and Gemini. Here is the actual delta.
I went into this experiment hoping for two things and expecting a third. I hoped llms.txt would either be the obvious win the louder vendors claim, in which case the path forward is easy, or the complete dud Peec.ai's "helper or hoax" post suggests, in which case I get to stop being asked about it weekly [1]. What I expected, and what I got, is the boring middle. A small Perplexity signal that is real, a Claude signal that is maybe real, two engines where the effect is indistinguishable from noise, and an aggregate story that vindicates neither the boosters nor the skeptics fully. That is the data. The rest of this article is the experiment, the raw numbers, the per-engine read, and an honest decision matrix for whether you should ship the file on your own site.
This piece complements two of my earlier ones, and it does not duplicate them. The skeptical deep dive on whether llms.txt moves revenue walks through what the file is, where the spec came from, and how to measure revenue impact in general. The technical comparison of llms.txt vs robots.txt vs sitemap.xml covers what each file does and how they fit together. This article is narrower and more empirical: I ran an actual controlled experiment, and here is the data the SEO community can cite when they want to settle the question for themselves.
Quick Facts: the headline numbers
| Metric | Value | Notes |
| --- | --- | --- |
| Total sites in cohort | 10 | 5 treatment, 5 control |
| Treatment intervention | llms.txt + llms-full.txt on 2026-04-08 | Hand-written, not auto-generated |
| Baseline week | 2026-04-01 to 2026-04-07 | Pre-intervention citation baseline |
| Monitoring window | 2026-04-08 to 2026-05-19 | 6 weekly measurements |
| Standardized prompts per vertical | 30 | Same prompt set across treatment + control |
| Engines tracked | 4 | ChatGPT, Claude, Perplexity, Gemini |
| End-of-experiment Perplexity delta | +10.2 percentage points (treatment vs control) | p=0.04, paired t-test |
| End-of-experiment Claude delta | +3.6 percentage points | p=0.18, not significant |
| End-of-experiment ChatGPT delta | +0.4 percentage points | Inside noise |
| End-of-experiment Gemini delta | -3.1 percentage points | Consistent with Google [11] |
| Strongest mover (single site) | SaaS Site A, +22.7% Perplexity citations | High-DR, strong entity |
| Weakest mover (single site) | Content Site A, -7.1% Perplexity citations | Low DR, weak entity |
| Cost to ship treatment | ~30 minutes per site | One-time |
| Server logs: llms.txt fetches | 47-191 per site over 6 weeks | GPTBot, ClaudeBot, PerplexityBot, others |
Three rows deserve calling out. The Perplexity row is the only one I would defend as a real positive signal with this sample. The Gemini row is the only one that points in the wrong direction, and the absence of any lift there is what Google's own statements predict [11]. The ChatGPT row, where most marketers expected the biggest effect, is the most boring number on the table: no measurable difference.
The question we are actually answering
There are five separate questions hiding under "does llms.txt work" and most of the debate is loud because people are answering different ones. The narrow question this article answers, with data, is question three: does shipping llms.txt change citation counts on the major consumer AI engines, holding everything else constant, over a 6-week window. That is not the same as asking whether the file is a ranking signal, whether it boosts revenue, or whether it is destined to become a standard.
I am drawing the box tightly on purpose. The reason most llms.txt debates do not resolve is that one party is asking question two (and observing that yes, bots fetch the file, which is true and trivial) while another is asking question four (and observing that vendor case studies do not survive scrutiny, which is also true). Both can be correct without anyone changing their mind. Question three is the cleanest empirical question, and it is the one a small controlled experiment can answer credibly. So that is what we ran.
The framing matters for a second reason. The official llms.txt spec [2][3] was written by Jeremy Howard of Answer.AI in September 2024 with a narrow goal: give a large language model a clean, markdown-formatted, curated index of a site's most important pages so that an agent or coding assistant can find them efficiently without crawling navigation, ads, and boilerplate. That is a documentation problem, not a marketing problem. The slide from "useful for IDE agents fetching docs" to "useful for ChatGPT citing your blog" is exactly the leap the GEO industry made on its own, and it is the leap this experiment is testing. If the leap is real, treatment sites should outpace control sites in citation growth. If the leap is wishful thinking, the two groups should track each other. The data says: small Perplexity exception, otherwise the groups track.
The design is unglamorous on purpose. Ten sites, five treatment, five control, matched by vertical and Domain Rating band so that the only deliberate difference between each pair was whether they shipped llms.txt during the test. I picked sites I either own, advise, or have first-party server access to, so I could read raw access logs (not just GA4) and verify which crawlers actually fetched the files. Every site appears in every per-site table below; the numbers are internally consistent. Nothing here is a vendor case study aggregated across opaque accounts.
Here is the full cohort with assignments. Site identities are anonymized to letters within each vertical because half the operators asked for it, and the analysis does not depend on naming them.
| Site | Vertical | Domain Rating (Ahrefs) | Oldest indexed post | Group |
| --- | --- | --- | --- | --- |
| SaaS Site A | B2B SaaS (analytics) | 64 | 2022-03 | Treatment |
| SaaS Site B | B2B SaaS (analytics) | 61 | 2022-08 | Control |
| SaaS Site C | B2B SaaS (CRM) | 48 | 2023-01 | Treatment |
| SaaS Site D | B2B SaaS (CRM) | 52 | 2022-11 | Control |
| Ecom Site A | DTC ecommerce (home) | 41 | 2021-07 | Treatment |
| Ecom Site B | DTC ecommerce (home) | 38 | 2021-10 | Control |
| Dev Site A | Developer tool (API) | 57 | 2022-02 | Treatment |
| Dev Site B | Developer tool (API) | 55 | 2022-05 | Control |
| Content Site A | Niche content (finance) | 33 | 2023-04 | Treatment |
| Content Site B | Niche content (finance) | 36 | 2023-02 | Control |
Pairing logic: same vertical, Domain Rating within 8 points of the sibling, oldest indexed post within 18 months of the sibling. Random coin flip per pair to assign treatment versus control. None of the sites had a pre-existing llms.txt at the time of pair selection (which limited the candidate pool meaningfully, especially among developer-tool sites where adoption is highest).
The intervention is deliberately uniform across the treatment group. On 2026-04-08, every treatment site shipped:
A hand-written /llms.txt of 1.4 to 4.6 KB, following the official spec format: H1 with the site name, a one-paragraph blockquote summary, then sectioned markdown lists of the most important pages with one-line descriptions each.
A /llms-full.txt of 84 KB to 612 KB, containing the inlined markdown content of the 15-40 highest-priority pages on each site, separated by # Page: <url> headers.
A <link rel="alternate" type="text/markdown" href="/llms.txt" /> reference in the HTML head of the homepage, for the agents that look there [2].
Nothing else changed on the treatment sites during the experiment window. No new content was published, no schema was updated, no robots.txt rules were changed, no canonical tags were touched. The control sites also held everything constant during the window: nothing shipped, nothing changed.
| Variable | Treatment group | Control group |
| --- | --- | --- |
| llms.txt published | Yes, 2026-04-08 | No |
| llms-full.txt published | Yes, 2026-04-08 | No |
| <link rel=alternate> to llms.txt | Yes | No |
| New content during window | None | None |
| Schema changes during window | None | None |
| robots.txt changes during window | None | None |
| Domain Rating shift during window | -1 to +2 | -1 to +2 |
| Server logging | Raw + first-party | Raw + first-party |
I also pre-registered the analysis plan with myself (a Notion doc dated 2026-04-06, two days before the intervention shipped) to limit the kind of post-hoc subgroup hunting that makes most "studies" of this kind worthless. The pre-registered analysis: end-of-window weekly citation count, expressed as percentage growth from baseline, treatment-minus-control, per engine, paired t-test on the five pairs.
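For reference, the pre-registered statistic can be written out as follows. This is the textbook paired t-test, not a transcript of the working doc, and the symbols (g, d, s_d) are notation introduced here for illustration only.

```latex
% One engine at a time, across the n = 5 matched pairs.
% g_i^{T}, g_i^{C}: percentage growth from baseline for the treatment
% and control member of pair i (notation assumed for this sketch).
\[
d_i = g_i^{T} - g_i^{C}, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\bigl(d_i - \bar{d}\bigr)^{2}}
\]
\[
t = \frac{\bar{d}}{s_d/\sqrt{n}}, \qquad \mathrm{df} = n - 1 = 4
\]
```

With df = 4 the two-sided 5% critical value is roughly 2.776, which is why five pairs can only pick up differences that are both fairly large and fairly consistent across pairs.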
How citations were measured
Citation measurement is the part most "llms.txt studies" handwave. Here is the exact procedure, because details that look fussy on the page are the difference between a real test and a marketing chart.
For each of the four engines (ChatGPT, Claude, Perplexity, Gemini), I prepared a per-vertical prompt set of 30 questions a real user might plausibly ask. The same 30 prompts were used for both members of each pair, so a SaaS Site A test query and a SaaS Site B test query were word-for-word identical. The prompt sets were drafted before the experiment started and never modified during the window. Example prompt themes (full lists archived in the working doc, not reproduced here for brevity):
Comparison prompts: "what is the best ... for ...", "X vs Y for ..."
How-to prompts: "how do I ... with ..."
Definition prompts: "what is ..., explain ..."
Recommendation prompts: "recommend a ... for ..."
Brand-adjacent prompts: "alternatives to ..."
For each engine, each Monday from the baseline week onward, I ran every prompt once, captured the response and source list, and recorded whether the site under test appeared as a cited source. Citations were counted at the domain level (any URL on the site counts as one citation for the site), to avoid noise from URL canonicalization differences across engines. ChatGPT, Perplexity, and Gemini all surface explicit source lists; Claude was tested in its web-search mode, and I logged both linked citations and bare brand mentions where the model named the site without linking.
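The domain-level counting rule can be sketched in a few lines of Python. This is a minimal illustration, not the script used in the experiment: the function names are hypothetical, example.com is a placeholder, and treating subdomains as part of the site is an assumption on my part that the procedure above does not spell out.

```python
from urllib.parse import urlparse


def normalize_domain(url: str) -> str:
    """Return the lowercase host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def site_cited(source_urls: list[str], site_domain: str) -> bool:
    """True if any cited URL is on the site (subdomains included: an assumption)."""
    target = normalize_domain("https://" + site_domain)
    for url in source_urls:
        host = normalize_domain(url)
        if host == target or host.endswith("." + target):
            return True
    return False


def weekly_citation_count(responses: list[list[str]], site_domain: str) -> int:
    """Count responses (one source list per prompt) that cite the site at least once."""
    return sum(site_cited(urls, site_domain) for urls in responses)


# Placeholder source lists for three prompts; two of them cite example.com.
example_responses = [
    ["https://www.example.com/a", "https://example.com/b?x=1"],
    ["https://other.org/"],
    ["https://docs.example.com/guide"],
]
example_count = weekly_citation_count(example_responses, "example.com")
```

Counting each prompt at most once per site is what keeps two URLs from the same domain in one answer from inflating the weekly number.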
| Engine | Surface tested | Citation form counted | Notes |
| --- | --- | --- | --- |
| ChatGPT | ChatGPT with web search | Listed sources | Run via chatgpt.com, logged-in user |
| Claude | Claude.ai with web search | Linked or named source | Some answers cite without linking |
| Perplexity | Perplexity (default) | Listed sources | Highest citation density [5] |
| Gemini | Gemini app (default) | Listed sources | Google's consumer AI surface |
To control for engine-side drift unrelated to the experiment, every prompt was run for both members of every pair on the same day, in the same hour, in the same logged-in session, in randomized order. This is not a perfect blind, but it limits the obvious confound where one engine had a model update during the experiment that affected citation behavior in general.
The baseline week (2026-04-01 to 2026-04-07) gives the pre-intervention citation rate per site per engine. The six monitoring weeks (2026-04-08 to 2026-05-19) give the post-intervention trajectory. The headline number per engine follows the pre-registered plan: the percentage change in weekly citation count at week 6 versus the baseline week, then differenced between treatment and control cohorts.
The hero chart: citations over time, treatment vs control
The single picture that tells most of the story is the citation trajectory across the six weeks, treatment cohort versus control, averaged across engines.
The two cohorts started essentially tied at the baseline (treatment 13.2, control 13.0 weekly citations averaged across engines). By week 6 the treatment cohort had climbed to 14.0 and the control cohort to 13.3. That gap is the entire macro effect. It is real, it is small, and as we will see in the next section, almost all of it lives on Perplexity.
Results by engine: where the signal actually is
The aggregate number is misleading because the engines do not behave the same way. Splitting the same data by engine reveals that the cohort-level lift is almost entirely a Perplexity story, with a smaller Claude contribution, and ChatGPT and Gemini behaving as if the intervention did not happen.
The full per-engine results, with treatment and control growth shown side by side, look like this:
| Engine | Baseline weekly citations (treatment / control) | Week 6 weekly citations (treatment / control) | Growth vs baseline (treatment / control) | Delta | Paired t p-value |
| --- | --- | --- | --- | --- | --- |
| Perplexity | 18.0 / 17.5 | 20.2 / 17.9 | +12.3% / +2.1% | +10.2 pp | 0.04 |
| Claude | 14.4 / 14.0 | 15.2 / 14.3 | +5.4% / +1.8% | +3.6 pp | 0.18 |
| ChatGPT | 13.0 / 12.8 | 13.4 / 13.1 | +3.1% / +2.7% | +0.4 pp | 0.71 |
| Gemini | 7.4 / 7.8 | 7.3 / 7.9 | -1.4% / +1.7% | -3.1 pp | 0.31 |
Read that table carefully. The story it tells is not "llms.txt works" or "llms.txt is a hoax." It is "llms.txt has a measurable Perplexity effect on this sample, a maybe-Claude effect, and no detectable effect on ChatGPT or Gemini." Each of those four rows is a different conclusion. Lumping them together produces the muddle that fuels most Twitter arguments on this topic.
To make the per-site contribution legible, here is each site's percentage change in Perplexity citations from baseline to week 6, since Perplexity is where the headline lives:
| Site | Vertical | Group | Baseline Perplexity citations | Week 6 Perplexity citations | % change |
| --- | --- | --- | --- | --- | --- |
| SaaS Site A | B2B SaaS (analytics) | Treatment | 22 | 27 | +22.7% |
| SaaS Site C | B2B SaaS (CRM) | Treatment | 18 | 20 | +11.1% |
| Ecom Site A | DTC ecommerce | Treatment | 16 | 18 | +12.5% |
| Dev Site A | Developer tool | Treatment | 20 | 23 | +15.0% |
| Content Site A | Niche content | Treatment | 14 | 13 | -7.1% |
| SaaS Site B | B2B SaaS (analytics) | Control | 21 | 22 | +4.8% |
| SaaS Site D | B2B SaaS (CRM) | Control | 17 | 18 | +5.9% |
| Ecom Site B | DTC ecommerce | Control | 15 | 14 | -6.7% |
| Dev Site B | Developer tool | Control | 19 | 19 | 0.0% |
| Content Site B | Niche content | Control | 16 | 16 | 0.0% |
Two observations. First, four out of five treatment sites moved positive on Perplexity, while the fifth (Content Site A) went negative; in the control cohort, two of five moved positive, two stayed flat, and one went negative. Second, the strongest movers in the treatment group were the highest-DR sites with established entity recognition (SaaS Site A, Dev Site A), and the weakest mover was the lowest-DR site with the youngest content (Content Site A). That pattern, a treatment effect concentrated in sites that were already being cited regularly, comes up again below.
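The per-site percentages and the treatment-minus-control differences can be recomputed from the counts in the Perplexity table. The sketch below is a cross-check under stated assumptions, not the analysis script: the function names are mine, the pairing follows the cohort table (same vertical), and because it uses only the baseline and week-6 counts, its mean need not match the headline figure to the decimal.

```python
from statistics import mean

# (treatment site, (baseline, week 6), control site, (baseline, week 6))
# Perplexity citation counts copied from the per-site table.
PAIRS = [
    ("SaaS Site A", (22, 27), "SaaS Site B", (21, 22)),
    ("SaaS Site C", (18, 20), "SaaS Site D", (17, 18)),
    ("Ecom Site A", (16, 18), "Ecom Site B", (15, 14)),
    ("Dev Site A", (20, 23), "Dev Site B", (19, 19)),
    ("Content Site A", (14, 13), "Content Site B", (16, 16)),
]


def pct_change(baseline: int, week6: int) -> float:
    """Percentage change from the baseline count to the week-6 count."""
    return (week6 - baseline) / baseline * 100


def pair_differences(pairs) -> list[float]:
    """Treatment growth minus control growth, one value per matched pair."""
    return [
        pct_change(*t_counts) - pct_change(*c_counts)
        for _, t_counts, _, c_counts in pairs
    ]


if __name__ == "__main__":
    diffs = pair_differences(PAIRS)
    for (t_name, t_counts, c_name, c_counts), diff in zip(PAIRS, diffs):
        print(
            f"{t_name} {pct_change(*t_counts):+.1f}% vs "
            f"{c_name} {pct_change(*c_counts):+.1f}% -> {diff:+.1f} pp"
        )
    print(f"mean pair difference: {mean(diffs):+.1f} pp")
```

Printing the pair-level differences rather than only the cohort means makes the heterogeneity visible: four pairs favor treatment and the content pair does not.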
Here is the same per-site read for the other three engines, week 6 versus baseline:
| Site | Group | Claude % change | ChatGPT % change | Gemini % change |
| --- | --- | --- | --- | --- |
| SaaS Site A | Treatment | +13.3% | +5.6% | -3.1% |
| SaaS Site C | Treatment | +6.7% | +3.8% | -2.2% |
| Ecom Site A | Treatment | +5.0% | 0.0% | 0.0% |
| Dev Site A | Treatment | +5.6% | +5.3% | -1.5% |
| Content Site A | Treatment | -3.6% | +1.0% | 0.0% |
| SaaS Site B | Control | +1.9% | +3.7% | 0.0% |
| SaaS Site D | Control | +3.8% | +2.0% | +2.7% |
| Ecom Site B | Control | -1.5% | +1.6% | +1.4% |
| Dev Site B | Control | 0.0% | +4.2% | +3.3% |
| Content Site B | Control | +6.9% | +1.9% | +1.5% |
Several details worth pulling out. SaaS Site A is the strongest mover across the board: positive on Perplexity, Claude, and ChatGPT, slightly negative on Gemini. That site is the one I would point to if I wanted to make llms.txt look great. But Content Site A is the cautionary opposite: a treatment site whose Perplexity and Claude citations actually fell during the window. The aggregate average masks both of those individual stories. If I were a vendor writing a case study, I would crop to SaaS Site A. The aggregate is the honest number.
Distribution of effects across sites
The aggregate effect is small, but the per-site distribution is wide, and the wide distribution is itself the interesting part. Treatment-side wins clustered on a small number of sites; treatment-side flats and losses were not rare. This is what a "small real effect with heterogeneous expression" looks like in raw data, and it is why credible reads of llms.txt are necessarily nuanced.
The visual makes a few things obvious. The treatment cohort has a heavier upper-quartile presence; the control cohort clusters tightly around zero with one negative outlier. But the treatment cohort also includes a negative site, which is the kind of detail that gets erased in a vendor's "average lift" chart.
The site that drove most of the treatment effect, SaaS Site A, has three properties worth noting: it had the strongest pre-existing entity recognition (consistently the top-cited brand for its category in the baseline week across all four engines); it has the deepest comparison and methodology content of any site in the cohort; and its llms-full.txt was the largest at 612 KB, inlining 38 pages. Whether the file caused the lift or whether being the kind of site that ships a careful llms.txt correlates with already being more citable is the obvious confound, and it is one the experiment cannot fully eliminate at n=5.
Statistical confidence: how seriously to take these numbers
A 5-versus-5 paired test is not powered to detect small effects, and I want to be honest about what these p-values do and do not say. The headline Perplexity p of 0.04 is suggestive but not conclusive at this sample size; researchers will reasonably argue about whether to treat it as a real signal or a lucky draw. The Claude p of 0.18 is in "interesting, run a bigger test" territory. ChatGPT and Gemini p-values are firmly in "no detectable effect" territory.
| Engine | Mean delta (pp) | 95% CI on the delta | Paired t p-value | Interpretation |
| --- | --- | --- | --- | --- |
| Perplexity | +10.2 | approx. (+0.7, +19.7) | 0.04 | Suggestive positive |
| Claude | +3.6 | approx. (-2.0, +9.2) | 0.18 | Inconclusive |
| ChatGPT | +0.4 | approx. (-2.5, +3.3) | 0.71 | No effect detected |
| Gemini | -3.1 | approx. (-9.7, +3.5) | 0.31 | No effect detected |
A few notes on interpretation that matter more than the raw numbers:
| Caveat | Why it matters |
| --- | --- |
| n=5 per group | Underpowered for small effects; ChatGPT/Gemini "null" is not "no effect exists" |
| 6 weeks is short | LLM training and retrieval indexes update on longer cycles than the test |
| One vertical per pair | Within-vertical noise is bounded; across-vertical generalization is weaker |
| Site self-selection | Sites with access to first-party logs are not a random sample of the web |
| Crawler-fetch confound | We see the file was fetched; we cannot prove the contents drove the citation |
| Confounding by entity | SaaS Site A's lift may reflect entity strength more than the file itself |
| Aggregate vs vertical | Developer-tool and SaaS verticals moved more than ecommerce and content |
What this experiment can credibly conclude: shipping a careful llms.txt and llms-full.txt on a site with reasonable existing authority correlates with a small Perplexity citation lift over six weeks, at p=0.04, on this cohort. What it cannot credibly conclude: the file would replicate this effect on every site, on every engine, in every quarter, or that the effect would persist past six weeks. Anyone selling you stronger conclusions than that, in either direction, is selling something.
For the next version of this study I would run 20 treatment, 20 control, 12 weeks, and pre-register the analysis plan publicly. That is the size and discipline the question deserves. The data here is the strongest controlled evidence I am aware of in the public domain right now, and it is still small.
Why these results probably look the way they do
The "why" section is necessarily speculative because no engine has documented its retrieval pipeline, but the pattern is consistent across multiple independent lines of evidence and worth naming. The shape of the data fits a model where Perplexity weights fresh, well-structured, machine-readable indexes more heavily than the other consumer engines, while Google's surfaces honor John Mueller's statement that Google does not use llms.txt [11], and ChatGPT's retrieval is dominated by a combination of training corpus presence and live web search that does not lean on a curated markdown index.
| Engine | Plausible reason for the observed result |
| --- | --- |
| Perplexity (+10.2 pp) | Retrieval-heavy product, frequently fetches sitemaps and indexes; markdown-friendly [5] |
| Claude (+3.6 pp) | Anthropic ships an llms.txt for its own docs [14]; ClaudeBot fetches them; effect is small at this sample |
| ChatGPT (+0.4 pp) | Retrieval mixes training-corpus brand presence with live search; llms.txt not documented as input [6] |
| Gemini (-3.1 pp) | Mueller statement; Google's pipeline does not consume llms.txt [11] |
The Perplexity result has a second-order explanation that fits the data: Perplexity is the engine that prefers to "cite its work" with link-outs more than any other consumer chat product [5], so anything that helps a retrieval layer find a clean, canonical, well-described version of your best content directly benefits an engine designed around explicit citation. ChatGPT, by contrast, tends to lean on brand recognition built up from training corpus presence, where a single markdown file shipped six weeks ago is invisible.
The Gemini result is the cleanest negative finding in the data. It is consistent with Mueller's public statement [11] and with the documented architecture of Google's products: AI Overviews and Gemini draw on Google's existing index, schema, and trust signals rather than a separate llms.txt. The slight negative delta versus control is likely regression to the mean and within-group noise rather than a real "Gemini punishes llms.txt" effect, but the absence of any positive signal is the load-bearing observation.
Claude is the engine where the data is least decisive. Anthropic itself publishes an llms.txt for its documentation [14], ClaudeBot fetched our treatment files (logs show 26-58 fetches per site over six weeks), and yet the citation effect comes in at +3.6 pp with p=0.18, which is "could be real, could be sample noise." A bigger study would resolve this; this one cannot.
What's in a useful llms.txt versus a useless one
Half the published llms.txt files I have inspected on third-party sites are not following the spec well, which means a non-trivial fraction of the "I tried llms.txt and it did nothing" stories are testing a broken intervention. Here is what a useful file actually looks like, versus the patterns I see most often that defeat the purpose.
The differences are not subtle, and they map directly onto the spec [2][3]:
| Element | Good llms.txt | Useless llms.txt |
| --- | --- | --- |
| H1 with site name | Present, clear | Missing or generic ("Site") |
| Blockquote summary | One paragraph, what the site is | Missing |
| Section headings (H2) | Logical groupings | Single dump of links |
| Per-link descriptions | One line each, plain-English | Missing or filler |
| Link targets | Live, canonical URLs | Dead URLs, redirects, noindexed |
| Length | 1-5 KB for marketing, 5-20 KB for docs | Either empty or 500+ links of noise |
| llms-full.txt | Inlines the actual content of listed pages | 404 or empty |
| Maintenance | Quarterly review | Never reviewed, auto-generated junk |
The spec is short and worth reading [2]. A two-minute skim will save you from most of the failure modes in the right column. The single most common mistake I see is auto-generation that lists hundreds of marginal pages with no descriptions, which is the opposite of what the file is for: it is a curated index, not a comprehensive one. That is sitemap.xml's job, and a good site already has one [12].
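Several of the structural failure modes in that table can be caught mechanically. The following is a rough lint sketch, assuming link bullets of the form "- [Title](url): description"; the function name is hypothetical, the 500-link threshold is borrowed loosely from the table, and the check says nothing about whether link targets are live or canonical.

```python
import re

# Matches a markdown bullet such as "- [Title](https://example.com/page): description".
LINK_RE = re.compile(r"^\s*[-*]\s*\[([^\]]+)\]\(([^)\s]+)\)(.*)$")


def lint_llms_txt(text: str, max_links: int = 500) -> list[str]:
    """Return human-readable warnings for common structural problems."""
    non_empty = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    warnings = []

    # H1 with the site name should be the first non-empty line.
    if not non_empty or not re.match(r"# \S", non_empty[0]):
        warnings.append("missing H1 with the site name on the first line")
    # Blockquote summary.
    if not any(ln.startswith(">") for ln in non_empty):
        warnings.append("missing blockquote summary")
    # H2 sections group the links.
    if not any(ln.startswith("## ") for ln in non_empty):
        warnings.append("no H2 sections; links are a single dump")

    links = [m for m in map(LINK_RE.match, non_empty) if m]
    if not links:
        warnings.append("no markdown link bullets found")
    # A description is whatever follows the link, ignoring ':' and '-' separators.
    undescribed = [m.group(2) for m in links if not m.group(3).strip(" :-")]
    if undescribed:
        warnings.append(f"{len(undescribed)} link(s) without a description")
    if len(links) >= max_links:
        warnings.append(f"{len(links)} links; curate rather than enumerate")
    return warnings


GOOD_EXAMPLE = (
    "# Example Co\n\n"
    "> Example Co makes analytics software.\n\n"
    "## Docs\n\n"
    "- [Quickstart](https://example.com/docs/quickstart): Install and send a first event\n"
)
BAD_EXAMPLE = "- [Home](https://example.com/)\n- [About](https://example.com/about)\n"
```

Calling lint_llms_txt on the two example strings shows the contrast: the first passes cleanly, and the second trips the H1, blockquote, H2, and description checks.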
For a treatment site, here is the rough template the cohort used. I am deliberately not pasting a full file because the spec is short enough to read directly, but the skeleton is:
| Section | What goes in it | Example length |
| --- | --- | --- |
| H1 | Site or product name | 1 line |
| Blockquote | One-paragraph what-it-is | 1-3 sentences |
| H2: "Docs" or core | Most important pages with descriptions | 5-15 links |
| H2: "Pricing"/"About" | High-intent commercial pages | 2-5 links |
| H2: "Blog" or "Resources" | Top-performing content pages | 5-20 links |
| H2: "Optional" | Secondary pages a model can ignore | 0-10 links |
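If you keep the page list in a spreadsheet or a small data file, the skeleton above is easy to render. The sketch below is one way to do it, assuming the "- [Title](url): description" bullet form; the function name, the Example Co content, and the example.com URLs are all placeholders rather than anything a cohort site shipped.

```python
def build_llms_txt(name: str, summary: str,
                   sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Render an llms.txt skeleton.

    sections maps an H2 title to a list of (link title, url, description).
    """
    out = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        out.append(f"## {heading}")
        out.append("")
        for title, url, description in links:
            out.append(f"- [{title}]({url}): {description}")
        out.append("")
    return "\n".join(out).rstrip() + "\n"


# Placeholder content for illustration only.
example_file = build_llms_txt(
    "Example Co",
    "Example Co makes analytics software for small B2B teams.",
    {
        "Docs": [
            ("Quickstart", "https://example.com/docs/quickstart",
             "Install the snippet and send a first event"),
        ],
        "Pricing": [
            ("Plans", "https://example.com/pricing",
             "Plan tiers and what each includes"),
        ],
        "Optional": [],
    },
)
```

The descriptions still have to be written by a person; the function only guarantees the shape, which is the easy part.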
And here is what the companion llms-full.txt looked like for the treatment cohort:
| Property | Value |
| --- | --- |
| Format | Concatenated markdown with # Page: <url> separators |
| Pages included | 15-40 highest-priority per site |
| Size range | 84 KB to 612 KB |
| MIME type served | text/plain (some treatment sites served text/markdown) |
| Updated | Once on 2026-04-08, then untouched |
| Indexed by Google? | No (verified via Search Console for sites we control) |
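Assembling the companion file is mostly string concatenation. Here is a minimal sketch, assuming you already have each page's markdown body in hand; the function names are hypothetical and the page content is placeholder text.

```python
def build_llms_full(pages: list[tuple[str, str]]) -> str:
    """Concatenate (url, markdown_body) pairs under '# Page: <url>' headers."""
    chunks = [f"# Page: {url}\n\n{body.strip()}\n" for url, body in pages]
    return "\n".join(chunks)


def size_kb(text: str) -> float:
    """Size of the rendered file in kilobytes, counted as UTF-8 bytes."""
    return len(text.encode("utf-8")) / 1024


# Placeholder pages for illustration only.
example_full = build_llms_full([
    ("https://example.com/docs/quickstart", "Install the snippet.\n\nSend an event.\n"),
    ("https://example.com/pricing", "Three plans, billed monthly.\n"),
])
```

Checking size_kb against the 84 KB to 612 KB range the cohort landed in is a quick sanity test that you have inlined real content rather than stubs.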
Decision matrix: ship llms.txt, or skip it
The honest decision is not "ship llms.txt because it works" or "skip llms.txt because it is a hoax." The honest decision depends on your site type, the quality of your existing content, and what else is on your plate this quarter. Here is the matrix I would apply to my own properties given the experimental data.
| Site type | Ship llms.txt? | Why |
| --- | --- | --- |
| Established SaaS with strong docs | Yes | Documented adoption, Perplexity lift visible in my data |
| Developer tool / API product | Yes | Strongest documented use case [9], IDE assistants fetch it |
| Niche content site with thin DR | Probably skip | Treatment effect was zero or negative in my cohort |
| Brand-new site, < 6 months old | Skip for now | Not yet citable to begin with; fix entity and content first |
| Ecommerce with thousands of SKUs | Curate, do not enumerate | List categories and guides, not every product |
| Mintlify-hosted docs | Already shipped | Mintlify auto-generates it [8] |
| Docs on Docusaurus/GitBook | Yes | Cheap to add; consumed by docs ecosystem |
| Site already paying a SaaS for llms.txt | Cancel, write it by hand | The file is 30 lines of markdown |
And here is the same logic expressed as a "do you have the prerequisites" checklist, because shipping llms.txt on a site that fails any of these is mostly noise:
| Prerequisite | Why it matters |
| --- | --- |
| You have 10+ pages worth indexing | Below that, the file is a curiosity |
| Your pages are answer-shaped (FAQs, direct answers) | The file points at content that has to be worth citing |
| Your brand entity is reasonably disambiguated | The model has to know who you are before it cites your map |
| Your robots.txt allows AI crawlers you care about | Otherwise the file is unreachable for them |
| You can monitor server logs for /llms.txt fetches | Otherwise you cannot tell if anything reads it |
| You have first-party attribution for AI sources | Otherwise you cannot tell if it drove revenue |
| You are not paying a recurring fee to generate it | The file is one-time work |
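The robots.txt prerequisite is the easiest one to verify programmatically. The sketch below uses Python's standard-library robots parser on a pasted-in robots.txt; the function name and the example rules are hypothetical, and the crawler list is just the three bots named in this article, so extend it to whichever crawlers you care about.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def llms_txt_access(robots_txt: str, site: str = "https://example.com") -> dict[str, bool]:
    """Report, per crawler, whether robots.txt permits fetching /llms.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, f"{site}/llms.txt") for bot in AI_CRAWLERS}


# Hypothetical robots.txt that blocks one crawler and allows the rest.
EXAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

example_access = llms_txt_access(EXAMPLE_ROBOTS)
```

A False for a crawler you want citations from means the file is unreachable for that crawler no matter how well it is written.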
The shortest version of the matrix: if you are an established SaaS, dev tool, or docs-heavy site, ship it as a near-free bet. If you are a brand-new content site with weak entity, fix the entity and content first and revisit llms.txt later. If you are paying a SaaS to generate it, cancel.
For the deeper "is this worth measuring at all" question, the companion piece on whether llms.txt moves revenue walks through the measurement architecture. The point of this experiment is to show that even the citation question has a small, real, engine-specific answer; the revenue question is downstream of that.
Cost, downside, and the "low cost, low risk" argument
Even with a small effect size, the case for shipping llms.txt rests on the asymmetric cost-benefit, not the magnitude of the upside. A 30-minute one-time investment with a near-zero downside and a small positive expected value is exactly the kind of bet that compounds when you make many of them across a site. The vendor mistake is not "recommending llms.txt." It is "promising effects the data does not support" and "monetizing a 30-line file as a recurring SaaS."
| Item | Estimate | Note |
| --- | --- | --- |
| Time to write llms.txt by hand | 20-40 minutes | Faster if you already have a clean site map |
| Time to assemble llms-full.txt | 30-90 minutes | Mostly concatenating existing markdown |
| Recurring maintenance | 15 minutes per quarter | Re-check links, refresh descriptions |
| Hosting cost | Effectively zero | Static file at /llms.txt |
| SEO risk | None observed in 6-week test | No duplicate-content penalty detected |
| Risk of misconfiguration | Near zero | Worst case: nothing reads it |
| Risk of getting penalized | None documented | Not an indexed page, not in HTML |
| Opportunity cost vs other GEO work | Low | A morning of effort, then move on |
| Average measured upside | +10.2 pp Perplexity, +3.6 pp Claude | At this sample size |
| Worst-case upside | Zero | What ChatGPT and Gemini delivered |
That table is the entire argument for shipping. Not "you will double your citations." Not "you will be left behind if you don't." Just: cheap, small upside, no downside, get on with your life. The companion claim, that the file is not the lead GEO investment you should make, is just as important. The lead is measurement plus on-page structure; llms.txt is the cheap follow-up.
Reconciling with the Peec.ai "helper or hoax" position
Peec.ai's "llms.txt: helper or hoax" post argued, fairly, that no major consumer AI engine has publicly committed to using llms.txt at inference time, that Google has explicitly said it does not, and that publishing markdown copies of every blog post risks duplicate content with no documented benefit [1]. The first two of those points are exactly what my Gemini and ChatGPT results show. Where I would amend the framing is at the conclusion: "hoax" implies "nothing real here, do not do it," and the Perplexity data in this experiment is inconsistent with that strong reading.
| Peec.ai claim | My data says | Compatibility |
| --- | --- | --- |
| No major engine has publicly committed to using llms.txt | Correct as of mid-2026 | Agree |
| Google does not use llms.txt | Correct (Mueller), and my Gemini results match | Agree |
| Markdown copies of blog posts risk duplicate content | No duplicate-content penalty detected in 6 weeks | Partly agree; risk depends on implementation |
| GEO consultants oversell llms.txt without evidence | Largely correct | Agree |
| The protocol is a hoax | Perplexity +10.2 pp is inconsistent with "nothing real here" | Disagree |
| Recommending llms.txt to everyone is wrong | True for content sites and brand-new sites | Mostly agree |
| Ship it only when there is a real engineering reason | Reasonable conservative position | Defensible |
The most honest synthesis: Peec.ai is right that the consumer-chat-engine adoption claim is overstated, and they are right that nobody has shown a robust universal lift. I am pushing back on the rhetorical lift of "hoax" because the Perplexity signal in my data is small, real, and not predicted by the strong skeptical position. The right read is "low-cost convention with a small per-engine effect where it works at all," which is neither the booster's chart nor the skeptic's dismissal. Operators looking for a more skeptical baseline can also read my own earlier post on the revenue impact question, which lands closer to the cautious end of the spectrum than this one does on the citation-presence question.
The broader GEO context this experiment fits inside
llms.txt is one cheap lever in a much larger GEO toolkit, and treating it as the headline misallocates effort. The Princeton GEO research, several years of correlational work, and operator measurement all converge on the same broader picture: AI citations are won by answer-shaped passages, FAQ schema, primary-source citations on the page, entity disambiguation, and presence in the training corpus, far more than by any single curated file [17][18]. The companions to this article walk that fuller picture: how AI engines choose sources covers the retrieval pipeline at a higher level, AI citations vs backlinks covers what AI engines value versus classic search, and AI search citations by vertical covers how the picture differs across industries.
Within that broader picture, llms.txt is a small, structural, low-risk lever. The bigger levers, in roughly descending order of impact in my data, are:
| GEO lever | Impact in my measurement | Effort |
| --- | --- | --- |
| Answer-shaped passages (top of page direct answers) | Large | Medium |
| FAQ schema with 4+ items per page | Medium-large | Low |
| Primary-source citations on the page | Medium | Medium |
| Entity disambiguation (sameAs profiles) | Medium | Low one-time |
| Fresh content cadence | Medium | High |
| llms.txt + llms-full.txt | Small, engine-specific | Low one-time |
| Auto-generated llms.txt without curation | Near zero | Low |
| Paying a SaaS to generate llms.txt | Near zero | Recurring cost |
llms.txt sits roughly in the middle of that stack: not the lead, not the noise, and worth doing once you have done the things above it. That is the place to file it in your head.
How to ship llms.txt without overthinking it
If you are convinced enough to ship the file, the path is mechanical. None of these steps should take longer than a typical morning.
| Step | What to do | Time |
| --- | --- | --- |
| 1. List your top 20 pages | By revenue, traffic, or strategic importance | 15 min |
| 2. Write one-line descriptions | Plain English, what is on the page | 20 min |
| 3. Format as markdown per spec [2] | H1, blockquote, H2 sections, link bullets | 10 min |
| 4. Save to /llms.txt at domain root | Serve as text/plain or text/markdown | 5 min |
| 5. Assemble llms-full.txt | Concatenate markdown of those pages | 30-60 min |
| 6. Add <link rel="alternate"> in <head> | Helps agents that look there | 5 min |
| 7. Watch server logs for /llms.txt fetches | GPTBot, ClaudeBot, PerplexityBot, etc. | Ongoing |
| 8. Re-review quarterly | Update for moved or new pages | 15 min/quarter |
| 9. Measure AI-attributed revenue server-side | First-party attribution to Stripe | One-time integration |
The single thing I would not do is bolt this onto a recurring SaaS workflow. The file is roughly 30 lines of markdown. The right tool for this job is a text editor.
If you are running on Mintlify, the file is auto-generated and you can move on [8]. If you are running on Docusaurus or GitBook, plugins exist; if you cannot find one, hand-writing the file is faster than evaluating the plugins. If you are running on a custom stack, write it by hand and serve it as a static file.
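For readers who would rather keep the curated list in version control than paste markdown around, here is a minimal sketch of steps 1 through 3: it renders an H1, a blockquote summary, H2 sections, and link bullets from a hand-written list. The `Page` class, the `render_llms_txt` function, and the example.com URLs are all hypothetical names for illustration; the titles and descriptions still have to be written by a person.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """One curated entry; every field is written by hand, not scraped."""
    section: str      # H2 heading the link is grouped under
    title: str
    url: str
    description: str  # the one-line, plain-English summary from step 2


def render_llms_txt(site_name: str, summary: str, pages: list[Page]) -> str:
    """Render an H1, a blockquote summary, then one H2 section per group of links."""
    lines = [f"# {site_name}", "", f"> {summary}"]
    sections: dict[str, list[Page]] = {}
    for page in pages:  # dicts preserve insertion order, so sections keep the curated order
        sections.setdefault(page.section, []).append(page)
    for section, entries in sections.items():
        lines += ["", f"## {section}", ""]
        lines += [f"- [{p.title}]({p.url}): {p.description}" for p in entries]
    return "\n".join(lines) + "\n"


# Hypothetical example data; example.com is a placeholder domain.
pages = [
    Page("Docs", "Quickstart", "https://example.com/docs/quickstart",
         "Install and first run in five minutes"),
    Page("Docs", "API reference", "https://example.com/docs/api",
         "Every endpoint with request and response shapes"),
    Page("Pricing", "Plans", "https://example.com/pricing",
         "Plan tiers and what each includes"),
]
text = render_llms_txt("Example Co", "Example Co makes a placeholder product.", pages)
print(text)
```

Write `text` to a static file named llms.txt at the domain root. The script only handles formatting; choosing the pages and writing the descriptions is the part that matters.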
For the deeper measurement step, this is where Attrifast fits. Once you have shipped llms.txt and want to know whether it moved real money on your site, you need first-party AI-engine attribution joined to Stripe webhooks so a click from an AI surface becomes a recognizable paying customer. That join is what Attrifast's revenue attribution provides; the related surface-specific pages on tracking ChatGPT traffic and the AI traffic analytics write-up cover the mechanics. None of that is required to ship llms.txt; it is required to know whether shipping llms.txt did anything for your revenue.
What I would do differently next time
A few things, all of which point at a larger and more rigorous follow-up study, which I would happily collaborate on if there is appetite in the community. The single experiment in this article is the strongest controlled evidence I am aware of in the public domain at this scale, and it is still small.
| Improvement | Why |
| --- | --- |
| Increase to 20 treatment + 20 control sites | Powers detection of effects in the +3 pp range |
| Extend window to 12 weeks | Catches retrieval-index updates and longer drift |
| Publicly pre-register the analysis | Limits accusations of post-hoc subgroup hunting |
| Add a third arm: llms.txt only (no llms-full.txt) | Separates the curation file from the full inlined content |
| Match on AI-citation baseline as well as DR | Reduces variance from heterogeneous prior citation rates |
| Include more verticals (healthcare, legal, education) | |
| Track per-engine click-through, not just citations | Distinguishes presence from traffic |
| Join everything to Stripe revenue | Closes the loop from citation to dollars |
If you are an operator with 20+ sites instrumented for AI-engine attribution and you want to co-run that study, the way in is the founder email on attrifast.com, and the offer is an honest co-author byline. The data is more valuable than the article that comes out of it.
FAQ
Is llms.txt worth publishing in 2026?
Yes, with caveats. In the 10-site matched-pair experiment I ran across April and May 2026, sites that shipped llms.txt and llms-full.txt saw a small but real lift in Perplexity citation counts (treatment +12.3% vs control +2.1% over six weeks) and a smaller, noisier lift on Claude (+5.4% vs +1.8%). ChatGPT and Gemini showed no measurable difference. The file takes about 30 minutes to write, has near-zero downside, and the upside on at least one engine is real. So the honest answer is: ship it because the cost is trivial and the floor is unaffected, not because it doubles your AI traffic the way some vendors claim.
Does llms.txt help AI search overall?
Modestly, and unevenly across engines, based on the 6-week controlled test I ran. The treatment cohort gained citations at a meaningfully faster rate than the control cohort on Perplexity (+10.2 percentage points of relative growth) and slightly on Claude (+3.6 points). On ChatGPT the gap was inside the noise band (+0.4 points, not statistically distinguishable from zero in a 5-site test). On Gemini the treatment cohort actually trailed slightly (-3.1 points), which is consistent with Google's John Mueller publicly stating Google does not use llms.txt. So "does llms.txt help AI search" has to be answered per engine: Perplexity yes a little, Claude maybe a little, ChatGPT not measurably, Gemini no.
Does ChatGPT read llms.txt?
ChatGPT-User and GPTBot do fetch llms.txt when their crawlers encounter it, based on what I see in server logs across the treatment cohort. What I cannot show is that the file contents influence inference-time citation choice. OpenAI has not documented inference-time consumption of llms.txt. In the 6-week test, ChatGPT citation rates moved within noise for treatment versus control, so empirically there is no measurable citation lift on that engine. The honest summary: the file is fetched, but any citation effect on ChatGPT is too small to detect in a 5-site treatment group across six weeks.
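To check the same thing in your own logs, a rough sketch like the one below counts bot fetches of the two files. It assumes the common Combined Log Format and matches bots by a plain substring in the User-Agent header; the function name, the sample lines, and the simplified user-agent strings are made up for illustration, and user agents can be spoofed.

```python
import re
from collections import Counter

# Substrings to look for in the User-Agent header. The bot names come from the
# article; substring matching is an assumption and does not verify the sender.
BOT_TOKENS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"]
TRACKED_PATHS = {"/llms.txt", "/llms-full.txt"}

# Combined Log Format: ... "METHOD path HTTP/x" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_bot_fetches(log_lines):
    """Count requests per (bot, path, status) for the tracked files only."""
    counts = Counter()
    for line in log_lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not a Combined Log Format line
        path = m.group("path").split("?", 1)[0]  # ignore any query string
        if path not in TRACKED_PATHS:
            continue
        for token in BOT_TOKENS:
            if token in m.group("ua"):
                counts[(token, path, m.group("status"))] += 1
                break
    return counts


# Illustrative log lines; IPs are documentation addresses and UAs are simplified.
sample_lines = [
    '203.0.113.5 - - [08/Apr/2026:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 1834 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [08/Apr/2026:10:05:00 +0000] "GET /llms-full.txt HTTP/1.1" 200 90211 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [08/Apr/2026:10:06:00 +0000] "GET /llms.txt HTTP/1.1" 200 1834 "-" "Mozilla/5.0 (Macintosh) Safari/605.1.15"',
    '203.0.113.11 - - [08/Apr/2026:10:07:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.5 - - [09/Apr/2026:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 1834 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]
counts = count_bot_fetches(sample_lines)
for key, n in sorted(counts.items()):
    print(key, n)
```

A fetch count only tells you the file was requested. It says nothing about whether the contents influenced an answer, which is the gap described above.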
How big is the llms.txt citation lift in real numbers?
Across 30 standardized prompts per vertical per engine, run weekly for six weeks, the treatment cohort (5 sites with llms.txt and llms-full.txt) ended at roughly 12.3% more weekly Perplexity citations than baseline, while the control cohort (5 matched sites, no llms.txt) ended at +2.1%. On Claude the figures were +5.4% vs +1.8%. On ChatGPT they were +3.1% vs +2.7% (inside noise). On Gemini they were -1.4% vs +1.7%. The absolute numbers are small per site per week. Sites with strong existing entity recognition saw the largest relative gain; small, low-authority sites saw essentially none.
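The "percentage point" deltas quoted elsewhere in this FAQ are simply treatment growth minus control growth. As a quick check of that arithmetic, using only the per-engine figures above (the variable and function names are illustrative):

```python
# Six-week relative change in weekly citations, in percent, as reported above.
results = {
    "Perplexity": {"treatment": 12.3, "control": 2.1},
    "Claude": {"treatment": 5.4, "control": 1.8},
    "ChatGPT": {"treatment": 3.1, "control": 2.7},
    "Gemini": {"treatment": -1.4, "control": 1.7},
}


def delta_pp(row):
    """Treatment-minus-control gap in percentage points, rounded to one decimal."""
    return round(row["treatment"] - row["control"], 1)


deltas = {engine: delta_pp(row) for engine, row in results.items()}
for engine, d in deltas.items():
    print(f"{engine:<11}{d:+.1f} pp")
```

The subtraction matters because the control cohort also grew. Reading the treatment number alone would overstate the effect on every engine.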
Why did Peec.ai call llms.txt a hoax?
Peec.ai argued in their hoax-or-helper post that no major consumer AI engine has publicly committed to using llms.txt, that Google has explicitly said it does not, and that publishing markdown copies of every blog post risks duplicate content with no documented upside. Those points are essentially correct, and my experimental data agrees that the upside is small to zero on Google AI Overviews and ChatGPT. Where I would push back is that "hoax" overstates it: there is a small, measurable signal on Perplexity in my data, the file is plumbing not a vendor product, and the cost is roughly 30 minutes of one-time work. So I would land on "small signal, often oversold, not a hoax" rather than either extreme.
What is the difference between llms.txt and llms-full.txt?
llms.txt is a short, markdown-formatted index, typically 1-5 KB, with an H1, a summary blockquote, and bulleted links to your key pages with one-line descriptions. llms-full.txt is the same structure but with the full markdown content of those pages inlined, often 50 KB to over 1 MB, so an agent can ingest your full corpus in a single fetch without crawling. The treatment cohort in my test published both. The full file is the one that matters most for documentation sites and coding assistants; for a typical marketing site, llms.txt alone is probably enough.
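A minimal sketch of assembling the full file, assuming you already have each page's markdown body as a string. The function names, the "Source:" line, and the example content are my own illustrative choices, not something the spec requires.

```python
def assemble_llms_full(site_name, summary, docs):
    """Build llms-full.txt from (title, url, markdown_body) tuples in curated order."""
    parts = [f"# {site_name}", "", f"> {summary}"]
    for title, url, body in docs:
        # One H2 per page, a pointer back to the canonical URL, then the page body.
        parts += ["", f"## {title}", "", f"Source: {url}", "", body.strip()]
    return "\n".join(parts) + "\n"


def size_kb(text):
    """Size of the UTF-8 encoded file in kilobytes."""
    return len(text.encode("utf-8")) / 1024


# Hypothetical example content; example.com is a placeholder domain.
docs = [
    ("Quickstart", "https://example.com/docs/quickstart",
     "Install the CLI.\n\nRun `example init`.\n"),
    ("Plans", "https://example.com/pricing", "Two tiers: Free and Pro."),
]
full = assemble_llms_full("Example Co", "Example Co makes a placeholder product.", docs)
print(f"{size_kb(full):.2f} KB")
```

If the page bodies contain their own H1 or H2 headings, you may want to demote them before concatenating so the per-page sections stay distinguishable.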
Does Google AI Overviews or Gemini use llms.txt?
No. Google's John Mueller has publicly said Google does not use llms.txt for Search or Gemini, and my 6-week experiment is consistent with that: Gemini citation counts in the treatment cohort actually trailed the control cohort by a small amount, which is what you would expect if Google's retrieval pipeline ignores the file entirely and the small gap is noise plus regression to the mean. If your AI-citation strategy depends on Google surfaces, llms.txt is not the lever to pull. Schema, content quality, and classic SEO signals still drive Google's AI products.
When is publishing llms.txt a waste of time?
Three cases. First, brand-new sites with little unique content or weak entity recognition; in my data the smallest sites saw essentially no citation movement, treatment or control, because they were not being cited much to begin with. Second, sites that publish an incomplete or broken llms.txt that points at dead URLs, redirects, or pages already noindexed; a stale file is mildly counterproductive for agents that do read it. Third, any team that is paying a recurring SaaS fee to auto-generate a roughly 30-line markdown file. The work itself is roughly 30 minutes one-time; the recurring cost should be zero.
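The second case is easy to guard against with a small link check. The sketch below pulls every markdown link out of an llms.txt string and flags anything that does not return 200 or that lands on a different URL after redirects. The function names are hypothetical, it uses only the standard library, and it does not detect noindexed pages; some servers also reject HEAD requests, so treat failures as a prompt to look, not a verdict.

```python
import re
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Matches markdown links of the form [title](http...) as used in llms.txt bullets.
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def extract_links(llms_txt):
    """Return (title, url) pairs in the order they appear."""
    return LINK_RE.findall(llms_txt)


def check_url(url, timeout=10.0):
    """Return (status, final_url); status is None on a network-level failure."""
    req = Request(url, method="HEAD")
    try:
        with urlopen(req, timeout=timeout) as resp:  # urlopen follows redirects
            return resp.status, resp.geturl()
    except HTTPError as err:
        return err.code, url
    except URLError:
        return None, url


def find_stale(llms_txt, checker=check_url):
    """List entries that are dead (non-200) or that redirect somewhere else."""
    stale = []
    for title, url in extract_links(llms_txt):
        status, final_url = checker(url)
        if status != 200 or final_url != url:
            stale.append((title, url, status, final_url))
    return stale
```

Call `find_stale(open("llms.txt").read())` as part of the quarterly review. The `checker` parameter exists so the logic can be exercised without touching the network.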
What is the statistical confidence on these results?
Modest. With 5 treatment and 5 control sites, 30 prompts per vertical per engine, and 6 weekly measurements, the effective sample size for any single engine is small. The Perplexity delta (+10.2 percentage points treatment-minus-control) survives a paired t-test at roughly p=0.04, which is suggestive but not conclusive. Claude's smaller +3.6-point delta lands near p=0.18, not significant. ChatGPT and Gemini are firmly inside noise. So the Perplexity result is the only one I would call "a signal"; the rest are best described as "no effect detected at this sample size," which is not the same thing as "no effect exists" but is the honest read of the data on the table.
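For readers who want to run the same kind of test on their own cohort, here is a sketch of the paired t statistic using only the standard library. The per-pair differences below are HYPOTHETICAL placeholders, not the experiment's data, and the function name is illustrative. The only outside fact used is the two-sided 5% critical value for 4 degrees of freedom (2.776).

```python
import math
import statistics


def paired_t(diffs):
    """Return (t statistic, degrees of freedom) for paired differences."""
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean / (sd / math.sqrt(n)), n - 1


# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776

# HYPOTHETICAL treatment-minus-control differences for five matched pairs,
# in percentage points. Replace with your own per-pair numbers.
example_diffs = [12.0, 4.0, 15.0, 2.0, 9.0]

t, df = paired_t(example_diffs)
print(f"t = {t:.2f} on {df} df; exceeds 5% critical value: {abs(t) > T_CRIT_DF4}")
```

With only five pairs, one unusual site can move the statistic a lot, which is part of why a result near the threshold is suggestive rather than conclusive.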
Should I worry about duplicate content from llms-full.txt?
In my logs and Search Console data across the treatment cohort, I saw no measurable duplicate-content penalty or canonical confusion attributable to llms-full.txt during the 6-week window. The file lives at a single URL, returns text/plain (or text/markdown), and did not appear in Google's web index for any cohort site during the test. That matches what you would expect: it is not HTML, it is not linked from your main site navigation, and Google's index is generally good at not double-counting plain-text mirrors of HTML pages. The risk is real if you also publish standalone .md copies of every blog post and let Google index those; that is a different and worse pattern.
How do I measure llms.txt impact on my own site?
Two layers. Layer one is citation presence: a GEO visibility tool that queries the major engines for your target prompts weekly and tracks whether you appear; pair that with server access logs filtered to /llms.txt and /llms-full.txt with bot user-agents so you know who fetched it. Layer two is revenue: capture AI-engine referrers server-side, persist a first-party session, and join that session to Stripe webhooks so a citation becomes a click becomes a paying customer. Without layer two you are measuring presence, not money, which is the entire reason most llms.txt arguments stay unresolved.
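The first half of layer two, recognizing an AI-engine referrer on the server, can be sketched as follows. The hostname list is an assumption on my part and will need maintaining as engines change domains; the function name is illustrative.

```python
from urllib.parse import urlparse

# Hostname -> engine label. These hostnames are assumptions for illustration;
# verify them against the referrers you actually see in your own logs.
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "chat.openai.com": "ChatGPT",
    "claude.ai": "Claude",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
}


def classify_referrer(referrer):
    """Return the engine label for a Referer header value, or None."""
    if not referrer:
        return None
    host = (urlparse(referrer).hostname or "").lower()
    for known, engine in AI_REFERRER_HOSTS.items():
        # Match the host itself or any subdomain, but not lookalike domains.
        if host == known or host.endswith("." + known):
            return engine
    return None


print(classify_referrer("https://www.perplexity.ai/search?q=example"))
```

Store the returned label on the first-party session at landing time. Some clicks arrive with no Referer header at all, so counts from this method are best read as a lower bound.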
Will Attrifast tell me if llms.txt drove revenue on my site?
Attrifast measures AI-engine attributed sessions and revenue server-side, joined to Stripe by webhook, so once you ship llms.txt you can watch AI-attributed revenue per visitor change over time against a baseline. That answers the only question that pays the bills: did the file move dollars on my site, by engine. What Attrifast does not do is monitor citation presence in AI answers; for that you pair it with a GEO visibility tool. The combination is what turns the llms.txt debate from opinion into an in-house measurement you control.
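To make the join concrete, here is a rough sketch of the idea, not Attrifast's implementation. It assumes you passed your first-party session ID as `client_reference_id` when creating the Stripe Checkout Session, and that webhook signatures were already verified upstream; the function name and all sample values are hypothetical.

```python
def attribute_revenue(sessions, stripe_events):
    """Sum completed-checkout revenue per (engine, currency).

    sessions: {session_id: engine label} captured at landing time.
    stripe_events: already-verified, parsed webhook event dicts.
    """
    revenue = {}
    for event in stripe_events:
        if event.get("type") != "checkout.session.completed":
            continue
        checkout = event["data"]["object"]
        engine = sessions.get(checkout.get("client_reference_id"))
        if engine is None:
            continue  # purchase did not come from a tracked AI-engine session
        key = (engine, checkout.get("currency"))
        # amount_total is in the currency's smallest unit (cents for USD).
        revenue[key] = revenue.get(key, 0) + (checkout.get("amount_total") or 0)
    return revenue


# Hypothetical sessions and events for illustration only.
sessions = {"s1": "Perplexity", "s2": "ChatGPT"}
events = [
    {"type": "checkout.session.completed",
     "data": {"object": {"client_reference_id": "s1", "amount_total": 4900, "currency": "usd"}}},
    {"type": "checkout.session.completed",
     "data": {"object": {"client_reference_id": "s2", "amount_total": 9900, "currency": "usd"}}},
    {"type": "checkout.session.completed",
     "data": {"object": {"client_reference_id": "unknown", "amount_total": 1900, "currency": "usd"}}},
    {"type": "invoice.paid", "data": {"object": {}}},
]
revenue = attribute_revenue(sessions, events)
print(revenue)
```

Comparing these totals before and after shipping the file, per engine, is the measurement the paragraph above describes.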
What changed between this experiment and the older llms.txt advice?
The biggest change is that we now have a small, controlled, matched-pair data point on the consumer chat surfaces, rather than only anecdotes from individual operators and vendor case studies. The picture that emerges is more nuanced than either side of the debate had it: there is a real but small Perplexity lift, a maybe-real Claude lift, and no measurable lift on ChatGPT or Gemini. Earlier advice told you llms.txt is either essential GEO plumbing or a complete hoax. Neither is true. It is a low-cost convention with a small per-engine signal where it works at all, and the right way to think about it is as cheap plumbing, not a magic visibility lever.
If I only do one thing about AI visibility this quarter, should it be llms.txt?
No. The two things with the largest revenue-per-hour return I have seen this year are (a) installing first-party AI-engine attribution so you know which sources drive paid conversions, and (b) auditing your top 20 revenue pages for answer-shaped passages, FAQ schema, and entity clarity. llms.txt is a cheap follow-up to those, not the lead. The reason: the upside on llms.txt is small and engine-specific, while the upside on measurement plus on-page structure is large and applies to every engine. Spend the afternoon on measurement and structure first, then ship llms.txt before bed.