How LLMs choose their sources

Models do not cite at random, and they do not cite whoever ranks first. Understanding the two-stage logic behind a citation tells you exactly what to fix.

Two stages, two different bars

Stage	What gets you through
Retrieval	Relevance, authority, crawlability, clean structure, schema
Synthesis	Specificity, verifiable claims, original data, consistency with the wider web

Retrieval decides who is in the room. Synthesis decides who gets quoted. Most pages that lose are relevant enough to be retrieved but not trustworthy or specific enough to be cited. That gap is where the work is.
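One way to picture the two bars is as two filters applied in sequence. The sketch below is a toy model, not how any particular model works: the `Page` fields, the 0 to 1 scores, and both thresholds are hypothetical values chosen only to show a page clearing retrieval and still losing at synthesis.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    relevance: float  # hypothetical 0-1 score: how well the page matches the query
    trust: float      # hypothetical 0-1 score: specificity, verifiability, consistency


def retrieve(pages, min_relevance=0.5):
    """Stage 1: keep every page relevant enough to enter the candidate pool."""
    return [p for p in pages if p.relevance >= min_relevance]


def cite(candidates, min_trust=0.7):
    """Stage 2: from the candidates, quote only those that clear the trust bar."""
    return [p for p in candidates if p.trust >= min_trust]


# Made-up pages for illustration only.
pages = [
    Page("example.com/generic-overview", relevance=0.9, trust=0.3),
    Page("example.com/benchmark-with-data", relevance=0.8, trust=0.9),
    Page("example.com/off-topic", relevance=0.2, trust=0.9),
]

candidates = retrieve(pages)
cited = cite(candidates)

print([p.url for p in candidates])  # retrieved: generic-overview, benchmark-with-data
print([p.url for p in cited])       # cited: benchmark-with-data only
```

The generic overview is the most relevant page in the set and still goes unquoted, which is the gap described above.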

What makes a model reach for you

You lower its uncertainty

A model writing an answer is constantly estimating how confident it can be. A specific number with a clear source, a first-hand test, and a named method all reduce that uncertainty, so they get pulled in. Vague or generic claims do the opposite.

You agree with the world, or prove why you do not

Models are wary of claims that contradict the consensus without evidence. If you take a contrarian position, back it with data. If you contradict the record without support, you become a source to route around.

You are easy to read

Facts trapped in images, scripts, or sprawling unstructured prose are facts a model might miss or mangle. Clean structure and a Markdown version make your claims trivial to extract correctly. This is the foundation of GEO.
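A rough way to check whether a fact is readable is to look for it in the page's plain text. The sketch below uses Python's standard `html.parser` and assumes a reader that neither runs JavaScript nor reads images, which will not match every crawler. The `VisibleText` class, the `fact_is_extractable` helper, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text a plain-HTML reader would see, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script and style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def fact_is_extractable(page_html, fact):
    """Return True if the fact appears in the page's visible text."""
    parser = VisibleText()
    parser.feed(page_html)
    parser.close()
    return fact in " ".join(parser.chunks)


# Made-up page: one fact in a paragraph, one buried in a script.
sample_html = """
<h2>Results</h2>
<p>Median load time fell to 1.2 seconds.</p>
<img src="chart.png">
<script>var stat = "Conversion rose 18 percent";</script>
"""

print(fact_is_extractable(sample_html, "1.2 seconds"))
print(fact_is_extractable(sample_html, "Conversion rose 18 percent"))
```

The paragraph fact is found and the script-only fact is not. Any number that exists only inside the chart image never shows up in the text at all.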

Next: the AI SEO and visibility tools that help you act on this, or the AI search optimization overview.

How LLMs cite, answered

How do large language models decide what to cite?

In two stages. First, retrieval gathers candidate sources from an index, a live search, or training data, rewarding relevance, authority, and clean structure. Then synthesis writes the answer from the candidates the model trusts most, rewarding sources that are specific, verifiable, and consistent with everything else it has read. You have to optimize for both stages.

Why do models skip some relevant pages?

Because relevance gets you into the candidate pool, but trust decides who gets quoted. A page that contradicts the consensus without evidence, hides its facts in images or scripts, or reads like filler adds risk for the model. It would rather cite a source that lowers its uncertainty than one that raises it.

What content gets cited most?

Content that reduces the model's uncertainty: specific numbers, first-hand testing, named methodology, and clearly attributed facts. Restated consensus rarely gets cited, because it adds nothing the model did not already have from a dozen other pages.
