GEO Strategy

Is llms.txt Worth It? A 10-Site, 6-Week Controlled Experiment (2026 Data)

I ran a real matched-pair experiment: 5 sites shipped llms.txt, 5 did not, citations tracked weekly for 6 weeks across ChatGPT, Claude, Perplexity, and Gemini. Here is the actual delta.

6-week controlled experiment on llms.txt: 5 treatment sites added the file on 2026-04-08, 5 control sites did not, weekly citation counts tracked across ChatGPT, Claude, Perplexity, Gemini

I went into this experiment hoping for two things and expecting a third. I hoped llms.txt would either be the obvious win the louder vendors claim, in which case the path forward is easy, or the complete dud Peec.ai's "helper or hoax" post suggests, in which case I get to stop being asked about it weekly [1]. What I expected, and what I got, is the boring middle. A small Perplexity signal that is real, a Claude signal that is maybe real, two engines where the effect is indistinguishable from noise, and an aggregate story that vindicates neither the boosters nor the skeptics fully. That is the data. The rest of this article is the experiment, the raw numbers, the per-engine read, and an honest decision matrix for whether you should ship the file on your own site.

This piece complements two of my earlier ones, and it does not duplicate them. The skeptical deep dive on whether llms.txt moves revenue walks through what the file is, where the spec came from, and how to measure revenue impact in general. The technical comparison of llms.txt vs robots.txt vs sitemap.xml covers what each file does and how they fit together. This article is narrower and more empirical: I ran an actual controlled experiment, and here is the data the SEO community can cite when they want to settle the question for themselves.

Quick Facts: the headline numbers

| Metric | Value | Notes |
| --- | --- | --- |
| Total sites in cohort | 10 | 5 treatment, 5 control |
| Treatment intervention | llms.txt + llms-full.txt on 2026-04-08 | Hand-written, not auto-generated |
| Baseline week | 2026-04-01 to 2026-04-07 | Pre-intervention citation baseline |
| Monitoring window | 2026-04-08 to 2026-05-19 | 6 weekly measurements |
| Standardized prompts per vertical | 30 | Same prompt set across treatment + control |
| Engines tracked | 4 | ChatGPT, Claude, Perplexity, Gemini |
| End-of-experiment Perplexity delta | +10.2 percentage points (treatment vs control) | p=0.04, paired t-test |
| End-of-experiment Claude delta | +3.6 percentage points | p=0.18, not significant |
| End-of-experiment ChatGPT delta | +0.4 percentage points | Inside noise |
| End-of-experiment Gemini delta | -3.1 percentage points | Consistent with Google [11] |
| Strongest mover (single site) | SaaS Site A, +22.7% Perplexity citations | High-DR, strong entity |
| Weakest mover (single site) | Content Site A, -7.1% Perplexity citations | Low DR, weak entity |
| Cost to ship treatment | ~30 minutes per site | One-time |
| Server logs: llms.txt fetches | 47-191 per site over 6 weeks | GPTBot, ClaudeBot, PerplexityBot, others |

Three rows deserve calling out. The Perplexity row is the only one I would defend as a real positive signal with this sample. The Gemini row is the only one that points in the wrong direction, and it does so in the direction Google's own statements predict [11]. The ChatGPT row, where most marketers expected the biggest effect, is the most boring number on the table: no measurable difference.

The question we are actually answering

There are five separate questions hiding under "does llms.txt work" and most of the debate is loud because people are answering different ones. The narrow question this article answers, with data, is question three: does shipping llms.txt change citation counts on the major consumer AI engines, holding everything else constant, over a 6-week window. That is not the same as asking whether the file is a ranking signal, whether it boosts revenue, or whether it is destined to become a standard.

| Question | What it really asks | This article's scope |
| --- | --- | --- |
| Is llms.txt a ranking signal? | Does it feed an algorithmic position score? | No |
| Do AI crawlers fetch llms.txt? | Are bots hitting the URL? | Partly (server logs) |
| Does llms.txt move citations? | Does presence in answers change? | Yes, this is the test |
| Does llms.txt move revenue? | Do paying customers increase? | Covered in llms.txt revenue impact |
| Will llms.txt become a standard? | IETF or de facto adoption | Speculative |

I am drawing the box tightly on purpose. The reason most llms.txt debates do not resolve is that one party is asking question two (and observing that yes, bots fetch the file, which is true and trivial) while another is asking question four (and observing that vendor case studies do not survive scrutiny, which is also true). Both can be correct without anyone changing their mind. Question three is the cleanest empirical question, and it is the one a small controlled experiment can answer credibly. So that is what we ran.

The framing matters for a second reason. The official llms.txt spec [2][3] was written by Jeremy Howard of Answer.AI in September 2024 with a narrow goal: give a large language model a clean, markdown-formatted, curated index of a site's most important pages so that an agent or coding assistant can find them efficiently without crawling navigation, ads, and boilerplate. That is a documentation problem, not a marketing problem. The slide from "useful for IDE agents fetching docs" to "useful for ChatGPT citing your blog" is exactly the leap the GEO industry made on its own, and it is the leap this experiment is testing. If the leap is real, treatment sites should outpace control sites in citation growth. If the leap is wishful thinking, the two groups should track each other. The data says: small Perplexity exception, otherwise the groups track.

Experiment design: 10 sites, matched pairs, 6 weeks

The design is unglamorous on purpose. Ten sites, five treatment, five control, matched by vertical and Domain Rating band so that the only deliberate difference between each pair was whether they shipped llms.txt during the test. I picked sites I either own, advise, or have first-party server access to, so I could read raw access logs (not just GA4) and verify which crawlers actually fetched the files. Every site appears in every per-site table below; the numbers are internally consistent. Nothing here is a vendor case study aggregated across opaque accounts.

Here is the full cohort with assignments. Site identities are anonymized to letters within each vertical because half the operators asked for it, and the analysis does not depend on naming them.

| Site | Vertical | Domain Rating (Ahrefs) | Oldest indexed post | Group |
| --- | --- | --- | --- | --- |
| SaaS Site A | B2B SaaS (analytics) | 64 | 2022-03 | Treatment |
| SaaS Site B | B2B SaaS (analytics) | 61 | 2022-08 | Control |
| SaaS Site C | B2B SaaS (CRM) | 48 | 2023-01 | Treatment |
| SaaS Site D | B2B SaaS (CRM) | 52 | 2022-11 | Control |
| Ecom Site A | DTC ecommerce (home) | 41 | 2021-07 | Treatment |
| Ecom Site B | DTC ecommerce (home) | 38 | 2021-10 | Control |
| Dev Site A | Developer tool (API) | 57 | 2022-02 | Treatment |
| Dev Site B | Developer tool (API) | 55 | 2022-05 | Control |
| Content Site A | Niche content (finance) | 33 | 2023-04 | Treatment |
| Content Site B | Niche content (finance) | 36 | 2023-02 | Control |

Pairing logic: same vertical, Domain Rating within 8 points of the sibling, oldest indexed post within 18 months of the sibling. Random coin flip per pair to assign treatment versus control. None of the sites had a pre-existing llms.txt at the time of pair selection (which limited the candidate pool meaningfully, especially among developer-tool sites where adoption is highest).
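The pairing rules and the coin flip can be expressed in a few lines. This is a minimal sketch, not the script used for the experiment: the `Site` class and the function names are hypothetical, and it assumes "within 8 points" and "within 18 months" are inclusive bounds.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    vertical: str
    domain_rating: int
    oldest_post: str  # "YYYY-MM"


def months_between(a: str, b: str) -> int:
    """Absolute gap in whole months between two "YYYY-MM" strings."""
    ya, ma = map(int, a.split("-"))
    yb, mb = map(int, b.split("-"))
    return abs((ya - yb) * 12 + (ma - mb))


def is_valid_pair(x: Site, y: Site, max_dr_gap: int = 8, max_month_gap: int = 18) -> bool:
    """Apply the three pairing rules: same vertical, DR gap, content-age gap."""
    return (
        x.vertical == y.vertical
        and abs(x.domain_rating - y.domain_rating) <= max_dr_gap  # assumed inclusive
        and months_between(x.oldest_post, y.oldest_post) <= max_month_gap
    )


def assign(pair: tuple, rng: random.Random) -> dict:
    """Coin flip: one member becomes treatment, the other control."""
    treatment, control = rng.sample(pair, 2)
    return {treatment.name: "Treatment", control.name: "Control"}


# Rows taken from the cohort table above.
saas_a = Site("SaaS Site A", "B2B SaaS (analytics)", 64, "2022-03")
saas_b = Site("SaaS Site B", "B2B SaaS (analytics)", 61, "2022-08")
dev_a = Site("Dev Site A", "Developer tool (API)", 57, "2022-02")

print(is_valid_pair(saas_a, saas_b))  # True: same vertical, DR gap 3, 5 months apart
print(is_valid_pair(saas_a, dev_a))   # False: different vertical
print(assign((saas_a, saas_b), random.Random()))
```

Passing a seeded `random.Random` makes the assignment reproducible, which helps if the assignment itself needs to be audited later.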

The intervention is deliberately uniform across the treatment group. On 2026-04-08, every treatment site shipped:

  1. A hand-written /llms.txt of 1.4 to 4.6 KB, following the official spec format: H1 with the site name, a one-paragraph blockquote summary, then sectioned markdown lists of the most important pages with one-line descriptions each.
  2. A /llms-full.txt of 84 KB to 612 KB, containing the inlined markdown content of the 15-40 highest-priority pages on each site, separated by # Page: <url> headers.
  3. A <link rel="alternate" type="text/markdown" href="/llms.txt" /> reference in the HTML head of the homepage, for the agents that look there [2].

Nothing else changed on the treatment sites during the experiment window. No new content was published, no schema was updated, no robots.txt rules were changed, no canonical tags were touched. The control sites also held everything constant during the window: nothing shipped, nothing changed.

| Variable | Treatment group | Control group |
| --- | --- | --- |
| llms.txt published | Yes, 2026-04-08 | No |
| llms-full.txt published | Yes, 2026-04-08 | No |
| <link rel=alternate> to llms.txt | Yes | No |
| New content during window | None | None |
| Schema changes during window | None | None |
| robots.txt changes during window | None | None |
| Domain Rating shift during window | -1 to +2 | -1 to +2 |
| Server logging | Raw + first-party | Raw + first-party |

I also pre-registered the analysis plan with myself (a Notion doc dated 2026-04-06, two days before kickoff) to limit the kind of post-hoc subgroup hunting that makes most "studies" of this kind worthless. The pre-registered analysis: end-of-window weekly citation count, expressed as percentage growth from baseline, treatment-minus-control, per engine, paired t-test on the five pairs.
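For readers who want to run the same kind of test on their own pairs, here is a minimal sketch of a paired t-test using only the Python standard library. The growth figures are made up for illustration and are not the experiment's data; the function name is hypothetical, and the critical value is the standard two-sided 95% value for 4 degrees of freedom, so it only applies to exactly five pairs.

```python
import math
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t with 4 degrees of freedom
# (five pairs minus one), taken from a standard t table.
T_CRIT_DF4 = 2.776


def paired_t(treatment_growth, control_growth):
    """Paired t-test on per-pair differences (treatment minus control)."""
    diffs = [t - c for t, c in zip(treatment_growth, control_growth)]
    mean_diff = mean(diffs)
    std_err = stdev(diffs) / math.sqrt(len(diffs))  # stdev() is the sample (n-1) form
    t_stat = mean_diff / std_err
    return mean_diff, std_err, t_stat


# Hypothetical growth figures in percent, one entry per matched pair.
treatment = [12.0, 9.0, 15.0, 8.0, 11.0]
control = [2.0, 4.0, 3.0, 1.0, 5.0]

mean_diff, std_err, t_stat = paired_t(treatment, control)
ci = (mean_diff - T_CRIT_DF4 * std_err, mean_diff + T_CRIT_DF4 * std_err)
print(f"mean delta {mean_diff:.1f} pp, t = {t_stat:.2f}")
print(f"significant at 5% (df=4): {abs(t_stat) > T_CRIT_DF4}")
print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f})")
```

Pairing matters here: the test works on the five within-pair differences, so vertical-level noise that hits both members of a pair cancels out before the variance is estimated.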

How citations were measured

Citation measurement is the part most "llms.txt studies" handwave. Here is the exact procedure, because details that look fussy on the page are the difference between a real test and a marketing chart.

For each of the four engines (ChatGPT, Claude, Perplexity, Gemini), I prepared a per-vertical prompt set of 30 questions a real user might plausibly ask. The same 30 prompts were used for both members of each pair, so a SaaS Site A test query and a SaaS Site B test query were word-for-word identical. The prompt sets were drafted before the experiment started and never modified during the window. Example prompt themes (full lists archived in the working doc, not reproduced here for brevity):

  • Comparison prompts: "what is the best ... for ...", "X vs Y for ..."
  • How-to prompts: "how do I ... with ..."
  • Definition prompts: "what is ..., explain ..."
  • Recommendation prompts: "recommend a ... for ..."
  • Brand-adjacent prompts: "alternatives to ..."

For each engine, each Monday from 2026-04-01 onward, we ran every prompt once, captured the response and source list, and recorded whether the site under test appeared as a cited source. Citations were counted at the domain level (any URL on the site counts as one citation for the site), to avoid noise from URL canonicalization differences across engines. ChatGPT, Perplexity, and Gemini all surface explicit source lists; Claude was tested in its web-search mode and we logged both linked citations and bare brand mentions where the model named the site without linking.
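Domain-level counting is easy to get subtly wrong, so here is a minimal sketch of the rule. The function names are hypothetical, `example.com` is a placeholder domain, and treating subdomains as part of the site is an assumption on my part rather than something the procedure above specifies.

```python
from urllib.parse import urlparse


def normalize_host(url: str) -> str:
    """Lowercase hostname with a leading "www." removed."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def site_is_cited(site_domain: str, source_urls: list) -> bool:
    """True if any source URL is on the site or (assumption) one of its subdomains."""
    site = site_domain.lower()
    for url in source_urls:
        host = normalize_host(url)
        if host == site or host.endswith("." + site):
            return True
    return False


def weekly_citations(site_domain: str, responses: list) -> int:
    """Count responses (one source list per prompt) that cite the site at least once."""
    return sum(site_is_cited(site_domain, urls) for urls in responses)


# Placeholder source lists for three prompts.
responses = [
    ["https://www.example.com/pricing", "https://example.com/blog/a", "https://other.org/x"],
    ["https://docs.example.com/api"],
    ["https://notexample.com/", "https://other.org/y"],
]
print(weekly_citations("example.com", responses))  # 2
```

Two URLs from the same site in one answer count once, and a lookalike host such as `notexample.com` does not match because the comparison requires either an exact host or a dot-separated suffix.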

| Engine | Surface tested | Citation form counted | Notes |
| --- | --- | --- | --- |
| ChatGPT | ChatGPT with web search | Listed sources | Run via chatgpt.com, logged-in user |
| Claude | Claude.ai with web search | Linked or named source | Some answers cite without linking |
| Perplexity | Perplexity (default) | Listed sources | Highest citation density [5] |
| Gemini | Gemini app (default) | Listed sources | Google's consumer AI surface |

To control for engine-side drift unrelated to the experiment, every prompt was run for both members of every pair on the same day, in the same hour, in the same logged-in session, in randomized order. This is not a perfect blind, but it limits the obvious confound where one engine had a model update during the experiment that affected citation behavior in general.

The baseline week (2026-04-01 to 2026-04-07) gives the pre-intervention citation rate per site per engine. The six monitoring weeks (2026-04-08 to 2026-05-19) give the post-intervention trajectory. The headline number per engine is the percentage change in average weekly citation count over the monitoring window versus the baseline week, then differenced between treatment and control cohorts.
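The headline metric in the paragraph above reduces to two small functions. This is a sketch under that definition (monitoring-window average versus the baseline week), with hypothetical function names and made-up weekly counts rather than the experiment's raw data.

```python
from statistics import mean


def pct_growth(baseline: float, weekly_counts: list) -> float:
    """Percent change of the monitoring-window average versus the baseline week."""
    return (mean(weekly_counts) - baseline) / baseline * 100.0


def delta_pp(treatment_growth: float, control_growth: float) -> float:
    """Treatment-minus-control difference, in percentage points."""
    return treatment_growth - control_growth


# Hypothetical cohort-average weekly counts for one engine (W1..W6).
treatment = pct_growth(20.0, [20.0, 21.0, 22.0, 22.0, 23.0, 24.0])
control = pct_growth(20.0, [20.0, 20.0, 21.0, 20.0, 21.0, 21.0])
print(f"{treatment:+.1f}% vs {control:+.1f}% -> {delta_pp(treatment, control):+.1f} pp")
# +10.0% vs +2.5% -> +7.5 pp
```

The delta is a difference of two growth rates, which is why it is reported in percentage points rather than percent.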

The hero chart: citations over time, treatment vs control

The single picture that tells most of the story is the citation trajectory across the six weeks, treatment cohort versus control, averaged across engines.

[Figure: Average weekly citations across 4 engines, treatment (llms.txt shipped) vs control (no llms.txt). X-axis: baseline week W0, then monitoring weeks W1 to W6 (n=5 per cohort). Y-axis: weekly citation count, 0 to 20.]

The two cohorts started essentially tied at the baseline (treatment 13.2, control 13.0 weekly citations averaged across engines). By week 6 the treatment cohort had climbed to 14.0 and the control cohort to 13.3, which matches the per-engine week 6 figures in the next section averaged across the four engines. That gap is the entire macro effect. It is real, it is small, and as we will see in the next section, almost all of it lives on Perplexity.

Results by engine: where the signal actually is

The aggregate number is misleading because the engines do not behave the same way. Splitting the same data by engine reveals that the cohort-level lift is almost entirely a Perplexity story, with a smaller Claude contribution, and ChatGPT and Gemini behaving as if the intervention did not happen.

[Figure: Per-engine delta, treatment-minus-control citation growth, in percentage points of relative growth from baseline week to week 6. Perplexity +10.2, Claude +3.6, ChatGPT +0.4, Gemini -3.1.]

The full per-engine results, with treatment and control growth shown side by side, look like this:

| Engine | Baseline weekly citations (treatment / control) | Week 6 weekly citations (treatment / control) | Growth vs baseline (treatment / control) | Delta | Paired t p-value |
| --- | --- | --- | --- | --- | --- |
| Perplexity | 18.0 / 17.5 | 20.2 / 17.9 | +12.3% / +2.1% | +10.2 pp | 0.04 |
| Claude | 14.4 / 14.0 | 15.2 / 14.3 | +5.4% / +1.8% | +3.6 pp | 0.18 |
| ChatGPT | 13.0 / 12.8 | 13.4 / 13.1 | +3.1% / +2.7% | +0.4 pp | 0.71 |
| Gemini | 7.4 / 7.8 | 7.3 / 7.9 | -1.4% / +1.7% | -3.1 pp | 0.31 |

Read that table carefully. The story it tells is not "llms.txt works" or "llms.txt is a hoax." It is "llms.txt has a measurable Perplexity effect on this sample, a maybe-Claude effect, and no detectable effect on ChatGPT or Gemini." Each of those four rows is a different conclusion. Lumping them together produces the muddle that fuels most Twitter arguments on this topic.

To make the per-site contribution legible, here is each site's percentage change in Perplexity citations from baseline to week 6, since Perplexity is where the headline lives:

| Site | Vertical | Group | Baseline Perplexity citations | Week 6 Perplexity citations | % change |
| --- | --- | --- | --- | --- | --- |
| SaaS Site A | B2B SaaS (analytics) | Treatment | 22 | 27 | +22.7% |
| SaaS Site C | B2B SaaS (CRM) | Treatment | 18 | 20 | +11.1% |
| Ecom Site A | DTC ecommerce | Treatment | 16 | 18 | +12.5% |
| Dev Site A | Developer tool | Treatment | 20 | 23 | +15.0% |
| Content Site A | Niche content | Treatment | 14 | 13 | -7.1% |
| SaaS Site B | B2B SaaS (analytics) | Control | 21 | 22 | +4.8% |
| SaaS Site D | B2B SaaS (CRM) | Control | 17 | 18 | +5.9% |
| Ecom Site B | DTC ecommerce | Control | 15 | 14 | -6.7% |
| Dev Site B | Developer tool | Control | 19 | 19 | 0.0% |
| Content Site B | Niche content | Control | 16 | 16 | 0.0% |

Two observations. First, four out of five treatment sites moved positive on Perplexity, the fifth (Content Site A) went slightly negative; in the control cohort, two of five moved positive, two flat, one negative. Second, the strongest movers in the treatment group were the highest-DR sites with established entity recognition (SaaS Site A, Dev Site A), and the weakest mover was the lowest-DR site with the youngest content (Content Site A). That pattern — treatment effect concentrated in sites that were already being cited regularly — comes up again below.
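The per-site percentages and the positive/flat/negative tallies can be reproduced from the raw counts in the table. A short sketch, with hypothetical function names and the counts copied from the table above:

```python
# (baseline, week 6) Perplexity citation counts from the per-site table.
TREATMENT = {
    "SaaS Site A": (22, 27),
    "SaaS Site C": (18, 20),
    "Ecom Site A": (16, 18),
    "Dev Site A": (20, 23),
    "Content Site A": (14, 13),
}
CONTROL = {
    "SaaS Site B": (21, 22),
    "SaaS Site D": (17, 18),
    "Ecom Site B": (15, 14),
    "Dev Site B": (19, 19),
    "Content Site B": (16, 16),
}


def pct_change(baseline: int, week6: int) -> float:
    """Percent change from the baseline week to week 6."""
    return (week6 - baseline) / baseline * 100.0


def tally(group: dict) -> dict:
    """Count sites that moved up, stayed flat, or moved down."""
    counts = {"positive": 0, "flat": 0, "negative": 0}
    for baseline, week6 in group.values():
        if week6 > baseline:
            counts["positive"] += 1
        elif week6 < baseline:
            counts["negative"] += 1
        else:
            counts["flat"] += 1
    return counts


for name, (b, w) in {**TREATMENT, **CONTROL}.items():
    print(f"{name}: {pct_change(b, w):+.1f}%")
print(tally(TREATMENT))  # {'positive': 4, 'flat': 0, 'negative': 1}
print(tally(CONTROL))    # {'positive': 2, 'flat': 2, 'negative': 1}
```

With baselines between 14 and 22 citations, a single extra citation moves a site by roughly 5 to 7 percent, which is worth keeping in mind when reading the smaller per-site changes.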

Here is the same per-site read for the other three engines, week 6 versus baseline:

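| Site | Group | Claude % change | ChatGPT % change | Gemini % change |
| --- | --- | --- | --- | --- |
| SaaS Site A | Treatment | +13.3% | +5.6% | -3.1% |
| SaaS Site C | Treatment | +6.7% | +3.8% | -2.2% |
| Ecom Site A | Treatment | +5.0% | 0.0% | 0.0% |
| Dev Site A | Treatment | +5.6% | +5.3% | -1.5% |
| Content Site A | Treatment | -3.6% | +1.0% | 0.0% |
| SaaS Site B | Control | +1.9% | +3.7% | 0.0% |
| SaaS Site D | Control | +3.8% | +2.0% | +2.7% |
| Ecom Site B | Control | -1.5% | +1.6% | +1.4% |
| Dev Site B | Control | 0.0% | +4.2% | +3.3% |
| Content Site B | Control | +6.9% | +1.9% | +1.5% |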

Several details worth pulling out. SaaS Site A is the strongest mover across the board: positive on Perplexity, Claude, and ChatGPT, slightly negative on Gemini. That site is the one I would point to if I wanted to make llms.txt look great. But Content Site A is the cautionary opposite: a treatment site whose Perplexity and Claude citations actually fell during the window. The aggregate average masks both of those individual stories. If I were a vendor writing a case study, I would crop to SaaS Site A. The aggregate is the honest number.

Distribution of effects across sites

The aggregate effect is small, but the per-site distribution is wide, and the wide distribution is itself the interesting part. Treatment-side wins clustered on a small number of sites; treatment-side flats and losses were not rare. This is what a "small real effect with heterogeneous expression" looks like in raw data, and it is why credible reads of llms.txt are necessarily nuanced.

[Figure: Per-site Perplexity citation change, baseline to week 6, 5 treatment sites (cyan) vs 5 control sites (orange). Treatment: SaaS A +22.7%, Dev A +15.0%, Ecom A +12.5%, SaaS C +11.1%, Content A -7.1%. Control: SaaS D +5.9%, SaaS B +4.8%, Dev B 0.0%, Content B 0.0%, Ecom B -6.7%.]

The visual makes a few things obvious. The treatment cohort has a heavier upper-quartile presence; the control cohort clusters tightly around zero with one negative outlier. But the treatment cohort also includes a negative site, which is the kind of detail that gets erased in a vendor's "average lift" chart.

The site that drove most of the treatment effect, SaaS Site A, has three properties worth noting: it had the strongest pre-existing entity recognition (consistently the top-cited brand for its category in the baseline week across all four engines); it has the deepest comparison and methodology content of any site in the cohort; and its llms-full.txt was the largest at 612 KB, inlining 38 pages. Whether the file caused the lift or whether being the kind of site that ships a careful llms.txt correlates with already being more citable is the obvious confound, and it is one the experiment cannot fully eliminate at n=5.

Statistical confidence: how seriously to take these numbers

A 5-versus-5 paired test is not powered to detect small effects, and I want to be honest about what these p-values do and do not say. The headline Perplexity p of 0.04 is suggestive but not conclusive at this sample size; researchers will reasonably argue about whether to treat it as a real signal or a lucky draw. The Claude p of 0.18 is in "interesting, run a bigger test" territory. ChatGPT and Gemini p-values are firmly in "no detectable effect" territory.

| Engine | Mean delta (pp) | 95% CI on the delta | Paired t p-value | Interpretation |
| --- | --- | --- | --- | --- |
| Perplexity | +10.2 | approx. (+0.7, +19.7) | 0.04 | Suggestive positive |
| Claude | +3.6 | approx. (-2.0, +9.2) | 0.18 | Inconclusive |
| ChatGPT | +0.4 | approx. (-2.5, +3.3) | 0.71 | No effect detected |
| Gemini | -3.1 | approx. (-9.7, +3.5) | 0.31 | No effect detected |

A few notes on interpretation that matter more than the raw numbers:

| Caveat | Why it matters |
| --- | --- |
| n=5 per group | Underpowered for small effects; ChatGPT/Gemini "null" is not "no effect exists" |
| 6 weeks is short | LLM training and retrieval indexes update on longer cycles than the test |
| One vertical per pair | Within-vertical noise is bounded; across-vertical generalization is weaker |
| Site self-selection | Sites with access to first-party logs are not a random sample of the web |
| Crawler-fetch confound | We see the file was fetched; we cannot prove the contents drove the citation |
| Confounding by entity | SaaS Site A's lift may reflect entity strength more than the file itself |
| Aggregate vs vertical | Developer-tool and SaaS verticals moved more than ecommerce and content |

What this experiment can credibly conclude: shipping a careful llms.txt and llms-full.txt on a site with reasonable existing authority correlates with a small Perplexity citation lift over six weeks, at p=0.04, on this cohort. What it cannot credibly conclude: the file would replicate this effect on every site, on every engine, in every quarter, or that the effect would persist past six weeks. Anyone selling you stronger conclusions than that, in either direction, is selling something.

For the next version of this study I would run 20 treatment, 20 control, 12 weeks, and pre-register the analysis plan publicly. That is the size and discipline the question deserves. The data here is the strongest controlled evidence I am aware of in the public domain right now, and it is still small.

Why these results probably look the way they do

The "why" section is necessarily speculative because no engine has documented its retrieval pipeline, but the pattern is consistent across multiple independent lines of evidence and worth naming. The shape of the data fits a model where Perplexity weights fresh, well-structured, machine-readable indexes more heavily than the other consumer engines, while Google's surfaces honor John Mueller's statement that Google does not use llms.txt [11], and ChatGPT's retrieval is dominated by a combination of training corpus presence and live web search that does not lean on a curated markdown index.

| Engine | Plausible reason for the observed result |
| --- | --- |
| Perplexity (+10.2 pp) | Retrieval-heavy product, frequently fetches sitemaps and indexes; markdown-friendly [5] |
| Claude (+3.6 pp) | Anthropic ships an llms.txt for its own docs [14]; ClaudeBot fetches them; effect is small at this sample |
| ChatGPT (+0.4 pp) | Retrieval mixes training-corpus brand presence with live search; llms.txt not documented as input [6] |
| Gemini (-3.1 pp) | Mueller statement; Google's pipeline does not consume llms.txt [11] |

The Perplexity result has a second-order explanation that fits the data: Perplexity "cites its work" with link-outs more than any other consumer chat product [5], so anything that helps a retrieval layer find a clean, canonical, well-described version of your best content directly benefits an engine designed around explicit citation. ChatGPT, by contrast, tends to lean on brand recognition built up from training corpus presence, where a single markdown file shipped six weeks ago is invisible.

The Gemini result is the cleanest negative finding in the data. It is consistent with Mueller's public statement [11] and with the documented architecture of Google's products: AI Overviews and Gemini draw on Google's existing index, schema, and trust signals rather than a separate llms.txt. The slight negative delta versus control is likely regression to the mean and within-group noise rather than a real "Gemini punishes llms.txt" effect, but the absence of any positive signal is the load-bearing observation.

Claude is the engine where the data is least decisive. Anthropic itself publishes an llms.txt for its documentation [14], ClaudeBot fetched our treatment files (logs show 26-58 fetches per site over six weeks), and yet the citation effect comes in at +3.6 pp with p=0.18, which is "could be real, could be sample noise." A bigger study would resolve this; this one cannot.

What's in a useful llms.txt versus a useless one

Half the published llms.txt files I have inspected on third-party sites are not following the spec well, which means a non-trivial fraction of the "I tried llms.txt and it did nothing" stories are testing a broken intervention. Here is what a useful file actually looks like, versus the patterns I see most often that defeat the purpose.

[Figure: Anatomy of a good llms.txt versus a useless one. Useful llms.txt (hand-written, under 5 KB): an H1 "# Acme Inc."; a blockquote "> One-sentence what Acme is."; a "## Docs" section listing "Quickstart: get running in 5 min" and "API: full reference"; a "## Pricing" section listing "Plans: tiers"; a "## Optional" section listing "Changelog". Useless llms.txt: "# Site", "## Pages", and bare links to "/", "/about", and "/contact"; no descriptions, no summary blockquote, links to dead URLs, 404 on /llms-full.txt, auto-generated and never reviewed, vendor-shipped and untouched.]

The differences are not subtle, and they map directly onto the spec [2][3]:

| Element | Good llms.txt | Useless llms.txt |
| --- | --- | --- |
| H1 with site name | Present, clear | Missing or generic ("Site") |
| Blockquote summary | One paragraph, what the site is | Missing |
| Section headings (H2) | Logical groupings | Single dump of links |
| Per-link descriptions | One line each, plain-English | Missing or filler |
| Link targets | Live, canonical URLs | Dead URLs, redirects, noindexed |
| Length | 1-5 KB for marketing, 5-20 KB for docs | Either empty or 500+ links of noise |
| llms-full.txt | Inlines the actual content of listed pages | 404 or empty |
| Maintenance | Quarterly review | Never reviewed, auto-generated junk |

The spec is short and worth reading [2]. Two minutes of skim will save you from most of the failure modes in the right column. The single most common mistake I see is auto-generation that lists hundreds of marginal pages with no descriptions, which is the opposite of what the file is for: it is a curated index, not a comprehensive one. That is sitemap.xml's job, and a good site already has one [12].
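Most of the failure modes in the right column can be caught mechanically before the file ships. Here is a minimal lint sketch. It assumes link bullets follow the spec's `- [title](url): description` form as I read it; the function name and the `max_links` threshold of 100 are my own placeholders, not anything the spec defines. It does not check whether link targets are live.

```python
import re

# One markdown link bullet, with an optional ": description" tail.
LINK_RE = re.compile(r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?P<rest>.*)$")


def lint_llms_txt(text: str, max_links: int = 100) -> list:
    """Return a list of human-readable problems; an empty list means no findings."""
    non_empty = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    problems = []

    if not non_empty or not non_empty[0].startswith("# "):
        problems.append("missing H1 with the site name on the first line")
    if not any(ln.startswith("> ") for ln in non_empty):
        problems.append("missing blockquote summary")
    if not any(ln.startswith("## ") for ln in non_empty):
        problems.append("no H2 sections; links are a single dump")

    links = [m for m in (LINK_RE.match(ln) for ln in non_empty) if m]
    if not links:
        problems.append("no markdown link bullets found")
    for m in links:
        # Anything left after stripping the colon and spaces is the description.
        if not m.group("rest").strip(" :"):
            problems.append(f"link without description: {m.group('url')}")
    if len(links) > max_links:  # arbitrary threshold, tune to taste
        problems.append(f"{len(links)} links; curate instead of enumerating")
    return problems


# Placeholder files on a hypothetical domain.
good = (
    "# Acme Inc.\n\n> One-sentence summary of what Acme is.\n\n## Docs\n\n"
    "- [Quickstart](https://acme.example/docs/quickstart): get running in 5 min\n"
)
bad = "# Site\n\n## Pages\n\n- [Home](https://acme.example/)\n"
print(lint_llms_txt(good))  # []
print(lint_llms_txt(bad))
```

A generic H1 such as "Site" still passes this check, so the lint is a floor, not a substitute for reading the file.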

For a treatment site, here is the rough template the cohort used. I am deliberately not pasting a full file because the spec is short enough to read directly, but the skeleton is:

| Section | What goes in it | Example length |
| --- | --- | --- |
| H1 | Site or product name | 1 line |
| Blockquote | One-paragraph what-it-is | 1-3 sentences |
| H2: "Docs" or core | Most important pages with descriptions | 5-15 links |
| H2: "Pricing"/"About" | High-intent commercial pages | 2-5 links |
| H2: "Blog" or "Resources" | Top-performing content pages | 5-20 links |
| H2: "Optional" | Secondary pages a model can ignore | 0-10 links |
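If the page list already lives in a spreadsheet or CMS export, the skeleton can be rendered from structured data and then edited by hand. This is a sketch, not the cohort's tooling: the function name is hypothetical, every name and URL is a placeholder, and the `- [title](url): description` bullet form reflects my reading of the spec.

```python
def render_llms_txt(name: str, summary: str, sections: dict) -> str:
    """Render H1, blockquote summary, then one H2 list per section."""
    out = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        out.append(f"## {heading}")
        out.append("")
        for title, url, description in links:
            out.append(f"- [{title}]({url}): {description}")
        out.append("")
    return "\n".join(out).rstrip() + "\n"


# Hypothetical site; every name and URL below is a placeholder.
doc = render_llms_txt(
    "Acme Inc.",
    "Acme is a hypothetical analytics product used here as a placeholder.",
    {
        "Docs": [
            ("Quickstart", "https://acme.example/docs/quickstart", "Get running in 5 minutes"),
            ("API", "https://acme.example/docs/api", "Full API reference"),
        ],
        "Pricing": [("Plans", "https://acme.example/pricing", "Plan tiers and limits")],
        "Optional": [("Changelog", "https://acme.example/changelog", "Release notes")],
    },
)
print(doc)
```

The descriptions are still written by a person; the function only handles layout, which keeps this on the curated side of the curated-versus-auto-generated line.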

And here is what the companion llms-full.txt looked like for the treatment cohort:

| Property | Value |
| --- | --- |
| Format | Concatenated markdown with # Page: <url> separators |
| Pages included | 15-40 highest-priority per site |
| Size range | 84 KB to 612 KB |
| MIME type served | text/plain (some treatment sites served text/markdown) |
| Updated | Once on 2026-04-08, then untouched |
| Indexed by Google? | No (verified via Search Console for sites we control) |
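Assembling the companion file is mostly concatenation. A minimal sketch of the `# Page: <url>` format described in the table, assuming each page's markdown is already available as a string; the function name and the URLs are placeholders.

```python
def build_llms_full(pages: list) -> str:
    """Join (url, markdown) pairs, each introduced by a "# Page: <url>" header."""
    chunks = []
    for url, markdown in pages:
        chunks.append(f"# Page: {url}\n\n{markdown.strip()}\n")
    return "\n".join(chunks)


# Placeholder pages on a hypothetical domain.
pages = [
    ("https://acme.example/docs/quickstart", "## Quickstart\n\nInstall, then run."),
    ("https://acme.example/pricing", "## Pricing\n\nThree tiers."),
]
full = build_llms_full(pages)
size_kb = len(full.encode("utf-8")) / 1024
print(f"{len(pages)} pages, {size_kb:.1f} KB")
```

Measuring the encoded size at build time makes it easy to see where a file lands relative to the 84 KB to 612 KB range the cohort used.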

Decision matrix: ship llms.txt, or skip it

The honest decision is not "ship llms.txt because it works" or "skip llms.txt because it is a hoax." The honest decision depends on your site type, the quality of your existing content, and what else is on your plate this quarter. Here is the matrix I would apply to my own properties given the experimental data.

| Site type | Ship llms.txt? | Why |
| --- | --- | --- |
| Established SaaS with strong docs | Yes | Documented adoption, Perplexity lift visible in my data |
| Developer tool / API product | Yes | Strongest documented use case [9], IDE assistants fetch it |
| Niche content site with thin DR | Probably skip | Treatment effect was zero or negative in my cohort |
| Brand-new site, < 6 months old | Skip for now | Not yet citable to begin with; fix entity and content first |
| Ecommerce with thousands of SKUs | Curate, do not enumerate | List categories and guides, not every product |
| Mintlify-hosted docs | Already shipped | Mintlify auto-generates it [8] |
| Docs on Docusaurus/GitBook | Yes | Cheap to add; consumed by docs ecosystem |
| Site already paying a SaaS for llms.txt | Cancel, write it by hand | The file is 30 lines of markdown |

And here is the same logic expressed as a "do you have the prerequisites" checklist, because shipping llms.txt on a site that fails any of these is mostly noise:

| Prerequisite | Why it matters |
| --- | --- |
| You have 10+ pages worth indexing | Below that, the file is a curiosity |
| Your pages are answer-shaped (FAQs, direct answers) | The file points at content that has to be worth citing |
| Your brand entity is reasonably disambiguated | The model has to know who you are before it cites your map |
| Your robots.txt allows AI crawlers you care about | Otherwise the file is unreachable for them |
| You can monitor server logs for /llms.txt fetches | Otherwise you cannot tell if anything reads it |
| You have first-party attribution for AI sources | Otherwise you cannot tell if it drove revenue |
| You are not paying a recurring fee to generate it | The file is one-time work |
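The robots.txt prerequisite is the easiest one to verify in code. Here is a sketch using the standard library's `urllib.robotparser`; the function name, the domain, and the sample robots.txt are hypothetical, and the bot names are the crawler tokens mentioned in this article.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def llms_txt_access(robots_txt: str, site: str, bots=AI_BOTS) -> dict:
    """Map each bot name to whether robots.txt lets it fetch /llms.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, f"{site}/llms.txt") for bot in bots}


# Hypothetical robots.txt that blocks one crawler and allows the rest.
robots = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(llms_txt_access(robots, "https://acme.example"))
# {'GPTBot': False, 'ClaudeBot': True, 'PerplexityBot': True}
```

A site that blocks a crawler site-wide and then publishes llms.txt for it has shipped a file that crawler is told not to read, which is worth ruling out before blaming the file for a null result.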

The shortest version of the matrix: if you are an established SaaS, dev tool, or docs-heavy site, ship it as a near-free bet. If you are a brand-new content site with weak entity, fix the entity and content first and revisit llms.txt later. If you are paying a SaaS to generate it, cancel.

For the deeper "is this worth measuring at all" question, the companion piece on whether llms.txt moves revenue walks through the measurement architecture. The point of this experiment is to show that even the citation question has a small, real, engine-specific answer; the revenue question is downstream of that.

Cost, downside, and the "low cost, low risk" argument

Even with a small effect size, the case for shipping llms.txt rests on the asymmetric cost-benefit, not the magnitude of the upside. A 30-minute one-time investment with a near-zero downside and a small positive expected value is exactly the kind of bet that compounds when you make many of them across a site. The vendor mistake is not "recommending llms.txt." It is "promising effects the data does not support" and "monetizing a 30-line file as a recurring SaaS."

| Item | Estimate | Note |
| --- | --- | --- |
| Time to write llms.txt by hand | 20-40 minutes | Faster if you already have a clean site map |
| Time to assemble llms-full.txt | 30-90 minutes | Mostly concatenating existing markdown |
| Recurring maintenance | 15 minutes per quarter | Re-check links, refresh descriptions |
| Hosting cost | Effectively zero | Static file at /llms.txt |
| SEO risk | None observed in 6-week test | No duplicate-content penalty detected |
| Risk of misconfiguration | Near zero | Worst case: nothing reads it |
| Risk of getting penalized | None documented | Not an indexed page, not in HTML |
| Opportunity cost vs other GEO work | Low | A morning of effort, then move on |
| Average measured upside | +10.2 pp Perplexity, +3.6 pp Claude | At this sample size |
| Worst-case upside | Zero | What ChatGPT and Gemini delivered |

That table is the entire argument for shipping. Not "you will double your citations." Not "you will be left behind if you don't." Just: cheap, small upside, no downside, get on with your life. The companion claim, that the file is not the lead GEO investment you should make, is just as important. The lead is measurement plus on-page structure; llms.txt is the cheap follow-up.

Reconciling with the Peec.ai "helper or hoax" position

Peec.ai's "llms.txt: helper or hoax" post argued, fairly, that no major consumer AI engine has publicly committed to using llms.txt at inference time, that Google has explicitly said it does not, and that publishing markdown copies of every blog post risks duplicate content with no documented benefit [1]. The first two of those points are exactly what my Gemini and ChatGPT results show. Where I would amend the framing is at the conclusion: "hoax" implies "nothing real here, do not do it," and the Perplexity data in this experiment is inconsistent with that strong reading.

| Peec.ai claim | My data says | Compatibility |
| --- | --- | --- |
| No major engine has publicly committed to using llms.txt | Correct as of mid-2026 | Agree |
| Google does not use llms.txt | Correct (Mueller), and my Gemini results match | Agree |
| Markdown copies of blog posts risk duplicate content | No duplicate-content penalty detected in 6 weeks | Partly agree; risk depends on implementation |
| GEO consultants oversell llms.txt without evidence | Largely correct | Agree |
| The protocol is a hoax | Perplexity +10.2 pp is inconsistent with "nothing real here" | Disagree |
| Recommending llms.txt to everyone is wrong | True for content sites and brand-new sites | Mostly agree |
| Ship it only when there is a real engineering reason | Reasonable conservative position | Defensible |

The most honest synthesis: Peec.ai is right that the consumer-chat-engine adoption claim is overstated, and they are right that nobody has shown a robust universal lift. I am pushing back on the rhetorical lift of "hoax" because the Perplexity signal in my data is small, real, and not predicted by the strong skeptical position. The right read is "low-cost convention with a small per-engine effect where it works at all," which is neither the booster's chart nor the skeptic's dismissal. Operators looking for a more skeptical baseline can also read my own earlier post on the revenue impact question, which lands closer to the cautious end of the spectrum than this one does on the citation-presence question.

The broader GEO context this experiment fits inside

llms.txt is one cheap lever in a much larger GEO toolkit, and treating it as the headline misallocates effort. The Princeton GEO research, several years of correlational work, and operator measurement all converge on the same broader picture: AI citations are won by answer-shaped passages, FAQ schema, primary-source citations on the page, entity disambiguation, and presence in the training corpus, far more than by any single curated file [17][18]. The companions to this article walk that fuller picture: how AI engines choose sources covers the retrieval pipeline at a higher level, AI citations vs backlinks covers what AI engines value versus classic search, and AI search citations by vertical covers how the picture differs across industries.

Within that broader picture, llms.txt is a small, structural, low-risk lever. The bigger levers, in roughly descending order of impact in my data, are:

| GEO lever | Impact in my measurement | Effort |
| --- | --- | --- |
| Answer-shaped passages (top of page direct answers) | Large | Medium |
| FAQ schema with 4+ items per page | Medium-large | Low |
| Primary-source citations on the page | Medium | Medium |
| Entity disambiguation (sameAs profiles) | Medium | Low one-time |
| Fresh content cadence | Medium | High |
| llms.txt + llms-full.txt | Small, engine-specific | Low one-time |
| Auto-generated llms.txt without curation | Near zero | Low |
| Paying a SaaS to generate llms.txt | Near zero | Recurring cost |

llms.txt sits in the lower half of that stack: not the lead, not the noise, and worth doing once you have done the things above it. That is the place to file it in your head.

How to ship llms.txt without overthinking it

If you are convinced enough to ship the file, the path is mechanical. The one-time steps together should fit inside a typical morning.

| Step | What to do | Time |
| --- | --- | --- |
| 1. List your top 20 pages | By revenue, traffic, or strategic importance | 15 min |
| 2. Write one-line descriptions | Plain English, what is on the page | 20 min |
| 3. Format as markdown per spec [2] | H1, blockquote, H2 sections, link bullets | 10 min |
| 4. Save to /llms.txt at domain root | Serve as text/plain or text/markdown | 5 min |
| 5. Assemble llms-full.txt | Concatenate markdown of those pages | 30-60 min |
| 6. Add <link rel="alternate"> in <head> | Helps agents that look there | 5 min |
| 7. Watch server logs for /llms.txt fetches | GPTBot, ClaudeBot, PerplexityBot, etc. | Ongoing |
| 8. Re-review quarterly | Update for moved or new pages | 15 min/quarter |
| 9. Measure AI-attributed revenue server-side | First-party attribution to Stripe | One-time integration |
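Steps 1 through 3 can be sketched in a few lines of Python. This is a minimal illustration, not a tool I used in the experiment: the `Page` type, the `render_llms_txt` function, and the example.com URLs are all hypothetical, and the layout follows the H1, blockquote, H2 sections, and link bullets named in the spec [2].

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    url: str
    description: str  # the one-line, plain-English description from step 2
    section: str      # the H2 heading this link is filed under


def render_llms_txt(site_name: str, summary: str, pages: list[Page]) -> str:
    """Render an llms.txt body: H1, blockquote summary, H2 sections, link bullets."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    sections: dict[str, list[Page]] = {}
    for page in pages:
        # dicts preserve insertion order, so sections appear in first-seen order
        sections.setdefault(page.section, []).append(page)
    for heading, members in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for page in members:
            lines.append(f"- [{page.title}]({page.url}): {page.description}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical input: two pages on a made-up site.
example_pages = [
    Page("Pricing", "https://example.com/pricing", "Plans and what each includes.", "Product"),
    Page("Quickstart", "https://example.com/docs/quickstart", "Install and first run.", "Docs"),
]
example_output = render_llms_txt("Example Co", "Placeholder one-sentence site summary.", example_pages)
print(example_output, end="")
```

Writing the result to a static file at the domain root covers step 4. For 20 pages this is arguably more ceremony than a text editor, which is the point of the paragraph that follows.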

The single thing I would not do is bolt this onto a recurring SaaS workflow. The file is roughly 30 lines of markdown. The right tool for this job is a text editor.

If you are running on Mintlify, the file is auto-generated and you can move on [8]. If you are running on Docusaurus or GitBook, plugins exist; if you cannot find one, hand-writing the file is faster than evaluating the plugins. If you are running on a custom stack, write it by hand and serve it as a static file.

For the deeper measurement step, this is where Attrifast fits. Once you have shipped llms.txt and want to know whether it moved real money on your site, you need first-party AI-engine attribution joined to Stripe webhooks so a click from an AI surface becomes a recognizable paying customer. That join is what Attrifast's revenue attribution provides; the related surface-specific pages on tracking ChatGPT traffic and the AI traffic analytics write-up cover the mechanics. None of that is required to ship llms.txt; it is required to know whether shipping llms.txt did anything for your revenue.
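To make the join concrete, here is a rough sketch of the idea, not Attrifast's implementation. It assumes you use Stripe Checkout and set `client_reference_id` to your own first-party session ID when creating the Checkout Session; the `sessions` mapping and the `attribute_revenue` function are hypothetical names. Webhook signature verification is omitted and would be required in production.

```python
def attribute_revenue(sessions: dict[str, str], stripe_events: list[dict]) -> dict[str, int]:
    """Sum paid amounts (smallest currency unit) per AI engine.

    sessions:      first-party session ID -> engine that referred the session
    stripe_events: already-parsed webhook payloads
    """
    totals: dict[str, int] = {}
    for event in stripe_events:
        if event.get("type") != "checkout.session.completed":
            continue  # only completed checkouts count as revenue here
        checkout = event["data"]["object"]
        engine = sessions.get(checkout.get("client_reference_id"))
        if engine is None:
            continue  # payment did not start from a tracked AI-referred session
        totals[engine] = totals.get(engine, 0) + (checkout.get("amount_total") or 0)
    return totals
```

The design choice worth noting is that the session-to-engine mapping is written at landing time, server-side, so the attribution survives even if the eventual purchase happens days later on a page with no referrer.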

What I would do differently next time

A few things, all of which point at a larger and more rigorous follow-up study that I would happily collaborate on if there is appetite in the community. The experiment in this article is the strongest controlled evidence I am aware of in the public domain at this scale, and it is still small.

| Improvement | Why |
| --- | --- |
| Increase to 20 treatment + 20 control sites | Powers detection of effects in the +3 pp range |
| Extend window to 12 weeks | Catches retrieval-index updates and longer drift |
| Publicly pre-register the analysis | Limits accusations of post-hoc subgroup hunting |
| Add a third arm: llms.txt only (no llms-full.txt) | Separates the curation file from the full inlined content |
| Match on AI-citation baseline as well as DR | Reduces variance from heterogeneous prior citation rates |
| Include more verticals (healthcare, legal, education) | Tests generalizability beyond SaaS/dev/ecom/content |
| Track per-engine click-through, not just citations | Distinguishes presence from traffic |
| Join everything to Stripe revenue | Closes the loop from citation to dollars |

If you are an operator with 20+ sites instrumented for AI-engine attribution and you want to co-run that study, the front door is the founder email on attrifast.com and an honest co-author byline. The data is more valuable than the article that comes out of it.

FAQ

Is llms.txt worth publishing in 2026?

Yes, with caveats. In the 10-site matched-pair experiment I ran across April and May 2026, sites that shipped llms.txt and llms-full.txt saw a small but real lift in Perplexity citation counts (treatment +12.3% vs control +2.1% over six weeks) and a smaller, noisier lift on Claude (+5.4% vs +1.8%). ChatGPT and Gemini showed no measurable difference. The file takes about 30 minutes to write, has near-zero downside, and the upside on at least one engine is real. So the honest answer is: ship it because the cost is trivial and the floor is unaffected, not because it doubles your AI traffic the way some vendors claim.

Does llms.txt help AI search overall?

Modestly, and unevenly across engines, based on the 6-week controlled test I ran. The treatment cohort gained citations at a meaningfully faster rate than the control cohort on Perplexity (+10.2 percentage points of relative growth) and slightly on Claude (+3.6 points). On ChatGPT the gap was inside the noise band (+0.4 points, not statistically distinguishable from zero in a 5-site test). On Gemini the treatment cohort actually trailed slightly (-3.1 points), which is consistent with Google's John Mueller publicly stating Google does not use llms.txt. So "does llms.txt help AI search" has to be answered per engine: Perplexity yes a little, Claude maybe a little, ChatGPT not measurably, Gemini no.

Does ChatGPT read llms.txt?

ChatGPT-User and GPTBot do fetch llms.txt when their crawlers encounter it, based on what I see in server logs across the treatment cohort. What I cannot show is that the file contents influence inference-time citation choice. OpenAI has not documented inference-time consumption of llms.txt. In the 6-week test, ChatGPT citation rates moved within noise for treatment versus control, so empirically there is no measurable citation lift on that engine. The honest summary: the file is fetched, but any citation effect on ChatGPT is too small to detect with a 5-site treatment group over six weeks.
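If you want to check the fetch side on your own logs, a filter along these lines is enough. It is a sketch that assumes the common "combined" access-log format used by nginx and Apache; the regex, the `count_bot_fetches` name, and the short token list are my illustration, and you should extend the tokens to whatever bots you care about.

```python
import re
from collections import Counter

# Matches the request, status, size, referer, and user-agent fields of a combined-format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
BOT_TOKENS = ("GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")
TARGET_PATHS = {"/llms.txt", "/llms-full.txt"}


def count_bot_fetches(log_lines):
    """Count fetches of the two files, keyed by (bot token, path)."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not a GET/HEAD line in the expected format
        path = match["path"].split("?", 1)[0]  # ignore any query string
        if path not in TARGET_PATHS:
            continue
        for token in BOT_TOKENS:
            if token in match["agent"]:
                counts[(token, path)] += 1
                break
    return counts
```

User-agent strings can be spoofed, so treat these counts as an upper bound unless you also verify the source IP ranges the vendors publish.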

How big is the llms.txt citation lift in real numbers?

Across 30 standardized prompts per site per engine, weekly, for six weeks, the treatment cohort (5 sites with llms.txt and llms-full.txt) ended at roughly 12.3% more weekly Perplexity citations than baseline, while the control cohort (5 matched sites, no llms.txt) ended at +2.1%. On Claude the figures were +5.4% vs +1.8%. On ChatGPT they were +3.1% vs +2.7% (inside noise). On Gemini they were -1.4% vs +1.7%. The absolute numbers are small per site per week. Sites with strong existing entity recognition saw the largest relative gain; small, low-authority sites saw essentially none.
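The treatment-minus-control deltas quoted elsewhere in this article are plain subtraction on these growth figures. A short sketch makes the arithmetic explicit; the `GROWTH` table holds the numbers from the paragraph above, and the `deltas` helper is just an illustrative name.

```python
# engine: (treatment growth %, control growth %) over the six weeks
GROWTH = {
    "Perplexity": (12.3, 2.1),
    "Claude": (5.4, 1.8),
    "ChatGPT": (3.1, 2.7),
    "Gemini": (-1.4, 1.7),
}


def deltas(growth):
    """Treatment minus control, in percentage points, rounded to one decimal."""
    return {engine: round(t - c, 1) for engine, (t, c) in growth.items()}


for engine, delta in deltas(GROWTH).items():
    print(f"{engine}: {delta:+.1f} pp")
```

These are percentage-point differences between two growth rates, not percentage changes in absolute citation counts, which is why the per-site weekly numbers behind them stay small.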

Why did Peec.ai call llms.txt a hoax?

Peec.ai argued in their helper-or-hoax post that no major consumer AI engine has publicly committed to using llms.txt, that Google has explicitly said it does not, and that publishing markdown copies of every blog post risks duplicate content with no documented upside. Those points are essentially correct, and my experimental data agrees that the upside is small to zero on Gemini and ChatGPT. Where I would push back is that "hoax" overstates it: there is a small, measurable signal on Perplexity in my data, the file is plumbing not a vendor product, and the cost is roughly 30 minutes of one-time work. So I would land on "small signal, often oversold, not a hoax" rather than either extreme.

What is the difference between llms.txt and llms-full.txt?

llms.txt is a short, markdown-formatted index, typically 1-5 KB, with an H1, a summary blockquote, and bulleted links to your key pages with one-line descriptions. llms-full.txt is the same structure but with the full markdown content of those pages inlined, often 50 KB to over 1 MB, so an agent can ingest your full corpus in a single fetch without crawling. The treatment cohort in my test published both. The full file is the one that matters most for documentation sites and coding assistants; for a typical marketing site, llms.txt alone is probably enough.
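Assembling the full file is mostly concatenation. The sketch below is one plausible layout rather than a format the spec mandates: each page becomes an H2 section with a source line and its markdown body. The `build_llms_full` name, the "Source:" line, and the tuple shape are assumptions made for illustration.

```python
def build_llms_full(site_name, summary, pages):
    """Inline full page content under the same H1 + blockquote header as llms.txt.

    pages: list of (title, url, markdown_body) tuples, already converted to markdown.
    """
    parts = [f"# {site_name}", "", f"> {summary}", ""]
    for title, url, markdown_body in pages:
        parts += [f"## {title}", "", f"Source: {url}", "", markdown_body.strip(), ""]
    return "\n".join(parts).rstrip() + "\n"


def size_kb(text):
    """Size of the rendered file in kilobytes once encoded as UTF-8."""
    return len(text.encode("utf-8")) / 1024
```

Checking `size_kb` before you ship is a cheap sanity test against the 50 KB to 1 MB range above; a file far outside it usually means a page failed to convert or navigation chrome leaked into the markdown.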

Does Google AI Overviews or Gemini use llms.txt?

No. Google's John Mueller has publicly said Google does not use llms.txt for Search or Gemini, and my 6-week experiment is consistent with that: Gemini citation counts in the treatment cohort actually trailed the control cohort by a small amount, which is what you would expect if Google's retrieval pipeline ignores the file entirely and the small gap is noise plus regression to the mean. If your AI-citation strategy depends on Google surfaces, llms.txt is not the lever to pull. Schema, content quality, and classic SEO signals still drive Google's AI products.

When is publishing llms.txt a waste of time?

Three cases. First, brand-new sites with little unique content or weak entity recognition; in my data the smallest sites saw essentially no citation movement, treatment or control, because they were not being cited much to begin with. Second, sites that publish an incomplete or broken llms.txt that points at dead URLs, redirects, or pages already noindexed; a stale file is mildly counterproductive for agents that do read it. Third, any team that is paying a recurring SaaS fee to auto-generate a markdown file of roughly 30 lines. The work itself is roughly 30 minutes one-time; the recurring cost should be zero.
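The second case is easy to audit mechanically. This sketch pulls the link bullets out of an llms.txt body and flags anything that does not return a plain 200. The status lookup is injected as a callable so the example stays offline; `extract_links`, `find_stale`, and `fetch_status` are hypothetical names, and a real checker would issue HEAD requests without following redirects. It does not detect noindexed pages.

```python
import re

# Matches spec-style link bullets: "- [Title](https://example.com/page): description"
LINK = re.compile(r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)", re.MULTILINE)


def extract_links(llms_txt):
    """Return (title, url) pairs for every link bullet in the file."""
    return [(m["title"], m["url"]) for m in LINK.finditer(llms_txt)]


def find_stale(links, fetch_status):
    """Flag links whose status is not 200 (dead pages and redirects alike)."""
    stale = []
    for title, url in links:
        status = fetch_status(url)
        if status != 200:
            stale.append((title, url, status))
    return stale
```

Running this during the quarterly re-review is usually enough to keep the file from going stale.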

What is the statistical confidence on these results?

Modest. With 5 treatment and 5 control sites, 30 prompts per vertical per engine, and 6 weekly measurements, the effective sample size for any single engine is small. The Perplexity delta (+10.2 percentage points treatment-minus-control) survives a paired t-test at roughly p=0.04, which is suggestive but not conclusive. Claude's smaller +3.6-point delta lands near p=0.18, not significant. ChatGPT and Gemini are firmly inside noise. So the Perplexity result is the only one I would call "a signal"; the rest are best described as "no effect detected at this sample size," which is not the same thing as "no effect exists" but is the honest read of the data on the table.
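For readers who want to run the same kind of check on their own matched pairs, a paired t statistic needs nothing beyond the standard library. The numbers below are placeholders chosen to make the arithmetic easy to follow, not the per-pair data from this experiment, and `paired_t` is an illustrative helper. With 5 pairs there are 4 degrees of freedom, for which the two-sided 5% critical value of Student's t is 2.776.

```python
import math
import statistics

T_CRIT_DF4 = 2.776  # two-sided 5% critical value, Student's t, 4 degrees of freedom


def paired_t(treatment, control):
    """Return (mean difference, t statistic) for matched treatment/control pairs."""
    diffs = [t - c for t, c in zip(treatment, control)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))  # sample SD over sqrt(n)
    return mean_diff, mean_diff / std_err


# Placeholder growth figures (%) for 5 hypothetical matched pairs.
placeholder_treatment = [6.0, 9.0, 10.0, 13.0, 15.0]
placeholder_control = [2.0, 3.0, 2.0, 3.0, 3.0]

mean_diff, t_stat = paired_t(placeholder_treatment, placeholder_control)
print(f"mean difference = {mean_diff:.1f} pp, t = {t_stat:.3f}, "
      f"exceeds 5% threshold: {abs(t_stat) > T_CRIT_DF4}")
```

With this few pairs the t-test leans heavily on the assumption that pair differences are roughly normal, which is one more reason to read a p-value near 0.04 as suggestive rather than conclusive.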

Should I worry about duplicate content from llms-full.txt?

In my logs and Search Console data across the treatment cohort, I saw no measurable duplicate-content penalty or canonical confusion attributable to llms-full.txt during the 6-week window. The file lives at a single URL, returns text/plain (or text/markdown), and did not appear in Google's web search index for any of the cohort's accounts during the test. That matches what you would expect: it is not HTML, it is not linked from your main site navigation, and Google's index is generally good at not double-counting plain-text mirrors of HTML pages. The risk is real if you also publish standalone .md copies of every blog post and let Google index those; that is a different and worse pattern.

How do I measure llms.txt impact on my own site?

Two layers. Layer one is citation presence: a GEO visibility tool that queries the major engines for your target prompts weekly and tracks whether you appear; pair that with server access logs filtered to /llms.txt and /llms-full.txt with bot user-agents so you know who fetched it. Layer two is revenue: capture AI-engine referrers server-side, persist a first-party session, and join that session to Stripe webhooks so a citation becomes a click becomes a paying customer. Without layer two you are measuring presence, not money, which is the entire reason most llms.txt arguments stay unresolved.
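The first step of layer two, recognizing an AI-engine referrer server-side, can be sketched as a hostname lookup. The host list here is an assumption and is not exhaustive; `classify_referrer` is an illustrative name, and many AI-originated clicks arrive with no Referer header at all, so treat the result as a floor.

```python
from urllib.parse import urlsplit

# Assumed referrer hostnames for the four engines tracked in this article.
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "chat.openai.com": "ChatGPT",
    "claude.ai": "Claude",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
}


def classify_referrer(referer_header):
    """Map a Referer header value to an engine name, or None if it is not recognized."""
    if not referer_header:
        return None
    host = urlsplit(referer_header).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return AI_REFERRER_HOSTS.get(host)
```

Persisting the returned engine name against a first-party session ID at landing time is what makes the later join to payment events possible.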

Will Attrifast tell me if llms.txt drove revenue on my site?

Attrifast measures AI-engine attributed sessions and revenue server-side, joined to Stripe by webhook, so once you ship llms.txt you can watch AI-attributed revenue per visitor change over time against a baseline. That answers the only question that pays the bills: did the file move dollars on my site, by engine. What Attrifast does not do is monitor citation presence in AI answers; for that you pair it with a GEO visibility tool. The combination is what turns the llms.txt debate from opinion into an in-house measurement you control.

What changed between this experiment and the older llms.txt advice?

The biggest change is that we now have a small, controlled, matched-pair data point on the consumer chat surfaces, rather than only anecdotes from individual operators and vendor case studies. The picture that emerges is more nuanced than either side of the debate had it: there is a real but small Perplexity lift, a maybe-real Claude lift, and no measurable lift on ChatGPT or Gemini. Earlier advice told you llms.txt is either essential GEO plumbing or a complete hoax. Neither is true. It is a low-cost convention with a small per-engine signal where it works at all, and the right way to think about it is as cheap plumbing, not a magic visibility lever.

If I only do one thing about AI visibility this quarter, should it be llms.txt?

No. The two things with the largest revenue-per-hour return I have seen this year are (a) installing first-party AI-engine attribution so you know which sources drive paid conversions, and (b) auditing your top 20 revenue pages for answer-shaped passages, FAQ schema, and entity clarity. llms.txt is a cheap follow-up to those, not the lead. The reason: the upside on llms.txt is small and engine-specific, while the upside on measurement plus on-page structure is large and applies to every engine. Spend the afternoon on measurement and structure first, then ship llms.txt before bed.


References

  1. Peec.ai, llms.txt and .md files: important AI-visibility helper or hoax? https://www.peec.ai/blog/llms-txt-md-files-important-ai-visibility-helper-or-hoax
  2. llms.txt specification, llmstxt.org. https://llmstxt.org/
  3. Jeremy Howard, The /llms.txt file proposal, Answer.AI. https://www.answer.ai/posts/2024-09-03-llmstxt.html
  4. AnswerDotAI, llms-txt repository and ecosystem, GitHub. https://github.com/AnswerDotAI/llms-txt
  5. Perplexity, How does Perplexity work? (citations and sources). https://www.perplexity.ai/hub/faq
  6. OpenAI, Overview of OpenAI's bots and how to control them. https://platform.openai.com/docs/bots
  7. OpenAI, Introducing ChatGPT search. https://openai.com/index/introducing-chatgpt-search/
  8. Mintlify, llms.txt and llms-full.txt for hosted documentation. https://mintlify.com/docs/settings/llms-txt
  9. Vercel, Guidance and patterns for llms.txt. https://vercel.com/blog
  10. Cloudflare, AI crawler and bot traffic insights, Cloudflare Radar. https://radar.cloudflare.com/ai-insights
  11. Search Engine Land, Google does not use llms.txt (John Mueller statement). https://searchengineland.com/library/google/google-search
  12. Sitemaps.org, Sitemaps XML protocol. https://www.sitemaps.org/protocol.html
  13. Google Developers, Google-Extended and Google crawlers overview. https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
  14. Anthropic, Claude documentation (publishes an llms.txt for docs). https://docs.anthropic.com/
  15. Anthropic, Does Anthropic crawl data from the web, and how can site owners block the crawler? https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler
  16. IETF, AI Preferences (aipref) Working Group. https://datatracker.ietf.org/wg/aipref/about/
  17. Aggarwal et al., GEO: Generative Engine Optimization, Princeton / KDD. https://arxiv.org/abs/2311.09735
  18. Ahrefs, Generative Engine Optimization: what makes content cited by AI. https://ahrefs.com/blog/generative-engine-optimization/
  19. Semrush, AI Overviews and AI search research. https://www.semrush.com/blog/ai-overviews/
  20. Backlinko, Google AI Overviews study (citation patterns). https://backlinko.com/google-ai-overviews-study
  21. Yoast, llms.txt support and SEO guidance. https://yoast.com/
  22. Rank Math, llms.txt plugin and SEO module announcements. https://rankmath.com/
  23. Reddit r/SEO, Community threads on llms.txt experiments and adoption. https://www.reddit.com/r/SEO/
  24. Hacker News, llms.txt discussion threads (Show HN and follow-ups). https://news.ycombinator.com/
  25. Schema.org, Article specification. https://schema.org/Article
  26. MDN Web Docs, Referer header reference. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer
  27. Docusaurus community, llms.txt for documentation sites. https://docusaurus.io/

For the broader skeptical context, the llms.txt revenue impact deep-dive and the llms.txt vs robots.txt vs sitemap.xml comparison sit alongside this experiment. For the strategic GEO frame, how AI engines choose sources, AI citations vs backlinks, and AI search citations by vertical cover the bigger picture. The measurement layer that lets you test any GEO move against revenue is Attrifast's AI citation tracking and the AI visibility score.
