# SOM vs llms.txt: When to Use Which
llms.txt tells an agent what your site is. SOM tells an agent what your page contains. They are different layers of the same problem, and publishers should ship both.
Two conventions are circulating for making a website legible to AI agents. The first is llms.txt, a plain-text file at the root of a domain that summarizes the site for large language models. The second is the Semantic Object Model — SOM — which represents an individual web page as a typed JSON document optimized for agent consumption. The two are routinely compared as if they were alternatives. They are not. They solve different problems at different layers, and a publisher who is serious about being read by agents should ship both.
This piece explains where each fits, why the comparison keeps surfacing, and what a well-instrumented site looks like when both are deployed correctly.
## The two layers, plainly stated
llms.txt is a site-level introduction. It tells an agent what your site is, what it covers, where the important pages live, and what tone the agent should expect. It is the equivalent of a README at the top of a repository, or the front matter of a research monograph. There is one llms.txt per domain. It does not change when individual pages change. It does not contain the contents of those pages. Its job is orientation.
SOM is a page-level representation. It tells an agent what an individual page contains: a typed list of regions, a typed list of elements, an explicit set of available actions, and stable identifiers that survive page refreshes. There is one SOM document per addressable page. It changes when the page changes. Its job is comprehension.
The two artefacts answer different questions. *What is this site?* versus *What is on this page, and what can I do with it?*
## What each looks like in practice
A minimal llms.txt for a documentation site might look like this:
```text
# Acme Inc.
> Acme builds developer tools for distributed systems.
> Documentation, blog, and changelog are public; pricing requires a free account.

## Documentation
- [Quickstart](https://acme.dev/docs/quickstart): five-minute setup
- [API Reference](https://acme.dev/docs/api): full HTTP surface
- [SDKs](https://acme.dev/docs/sdks): Python, Go, Rust, TypeScript

## Pricing
- [Plans](https://acme.dev/pricing): Free, Team, Enterprise

## Notes for agents
- Tone: precise and technical. Avoid superlatives.
- Authoritative source for our pricing is /pricing, not third-party reviews.
```

A SOM document for a single page on the same site looks materially different; it is the page itself, rendered in machine-native form:
```json
{
  "som_version": "1.0",
  "url": "https://acme.dev/docs/quickstart",
  "title": "Quickstart",
  "lang": "en",
  "regions": [
    {
      "id": "r_main",
      "role": "main",
      "elements": [
        { "id": "e_3f8a", "role": "heading", "text": "Quickstart", "attrs": { "level": 1 } },
        { "id": "e_9d4e", "role": "paragraph", "text": "Get Acme running in five minutes." },
        { "id": "e_b711", "role": "code", "text": "npm install @acme/sdk", "attrs": { "lang": "bash" } },
        { "id": "e_c082", "role": "link", "text": "Continue to API Reference",
          "actions": ["click"], "attrs": { "href": "/docs/api" } }
      ]
    }
  ],
  "meta": { "html_bytes": 28104, "som_bytes": 412, "compression_ratio": 68.2 }
}
```

The llms.txt is one file, hand-edited, updated rarely. The SOM document is one of thousands, generated dynamically, refreshed whenever the underlying content changes. They are not in tension. They are not duplicates. They are different artefacts at different cardinalities.
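To see the consumer's side concretely, here is a minimal sketch of fetching and walking a document of this shape. The endpoint URL and field names are taken from the examples in this piece; the helpers (`fetch_som`, `actionable_elements`) and the `requests` dependency are illustrative, and a real agent would validate responses against the SOM/1.0 schema.

```python
import requests

# Endpoint as advertised in the publisher's robots.txt (shown later).
SOM_ENDPOINT = "https://acme.dev/api/v1/som"


def fetch_som(page_url: str) -> dict:
    """Fetch the SOM/1.0 document for a single page."""
    resp = requests.get(SOM_ENDPOINT, params={"url": page_url}, timeout=10)
    resp.raise_for_status()
    return resp.json()


def actionable_elements(som: dict) -> list[dict]:
    """Return every element that advertises at least one action."""
    return [el
            for region in som.get("regions", [])
            for el in region.get("elements", [])
            if el.get("actions")]


som = fetch_som("https://acme.dev/docs/quickstart")
print(som["title"], som["meta"].get("compression_ratio"))
for el in actionable_elements(som):
    print(el["id"], el["role"], el.get("attrs", {}).get("href"))
```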
## Why the comparison keeps coming up
The conflation has three sources, and naming them helps dispel the confusion.
First, both formats arrived to solve the same anxiety. Publishers and framework authors woke up to the realisation that AI agents had become a meaningful share of their traffic, and that those agents were paying enormous token costs to read HTML that was never designed for them. Both llms.txt and SOM are attempts to give agents a friendlier surface. But identical motivation does not imply identical scope.
Second, both invoke robots.txt as a precedent. llms.txt is positioned as “robots.txt for LLMs”; SOM Directives are positioned as a robots.txt extension. The structural analogy is real, but it points to where the artefact lives, not to what the artefact contains. Robots.txt itself answers only one question, whether an agent may fetch a given URL, and neither llms.txt nor SOM answers that question. They are layered on top of it.
Third, the public discourse rarely distinguishes site-level from page-level infrastructure. The same engineer who says “we shipped llms.txt this week” will say “we shipped a SOM endpoint” the next week, and the observer hears two attempts at the same thing. They are not. The first is a single markdown file at /llms.txt; the second is a JSON endpoint that returns a per-page document at, e.g., /api/v1/som?url=….
## How they compose
A site that has shipped both well will have:
- An llms.txt at the domain root describing the site, its high-level structure, and any agent-specific guidance.
- A robots.txt that advertises a SOM endpoint via SOM Directives. Five lines. Tells any agent that a structured per-page representation is available.
- A SOM endpoint that, given a URL on the domain, returns a SOM/1.0 document representing the contents of that page. Cached aggressively, regenerated when the underlying content changes. (A minimal serving sketch follows this list.)
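As a hedged sketch of that last item: the Flask app and in-process dict cache below are arbitrary choices, and `render_som` is a hypothetical stand-in for whatever converts the publisher's content store into SOM/1.0 documents.

```python
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
_cache: dict[str, dict] = {}  # stand-in; invalidate entries when content changes


def render_som(page_url: str) -> dict | None:
    """Hypothetical renderer: convert the stored content behind page_url
    into a SOM/1.0 document like the example above. Stubbed here."""
    if not page_url.startswith("https://acme.dev/"):
        return None  # not our domain, nothing to render
    return {"som_version": "1.0", "url": page_url, "title": "stub",
            "lang": "en", "regions": [], "meta": {}}


@app.route("/api/v1/som")
def som_endpoint():
    page_url = request.args.get("url")
    if not page_url:
        abort(400)  # the endpoint is keyed on a page URL
    if page_url not in _cache:  # cache aggressively, regenerate on change
        doc = render_som(page_url)
        if doc is None:
            abort(404)
        _cache[page_url] = doc
    resp = jsonify(_cache[page_url])
    resp.headers["Cache-Control"] = "public, max-age=3600"  # mirrors SOM-Freshness
    return resp
```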
Concretely, the publisher’s robots.txt looks like this:
```text
User-agent: *
Allow: /

# Site overview for LLMs
# (See also: /llms.txt for a markdown summary)

# Per-page structured representation
SOM-Endpoint: https://acme.dev/api/v1/som
SOM-Format: SOM/1.0
SOM-Scope: main-content
SOM-Freshness: 3600
SOM-Token-Budget: 15000

Sitemap: https://acme.dev/sitemap.xml
```

An agent visiting acme.dev for the first time can take three different paths through this stack depending on its sophistication.
- A simple agent reads `/llms.txt`, treats it as the canonical map of the site, and follows the URLs it finds there as ordinary HTML pages.
- A better agent reads `/llms.txt` for orientation and uses the SOM endpoint for any page it actually needs to comprehend in detail. Token cost drops by an order of magnitude or more on each per-page fetch.
- A specialist agent consumes only SOM, treating `/llms.txt` as optional context and the SOM endpoint as the primary substrate.
All three paths are valid. The publisher does not have to know which kind of agent will visit. The infrastructure supports all three by virtue of having shipped both layers.
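On the agent side, discovering the endpoint is a matter of scanning robots.txt for the SOM-* keys. A minimal sketch, assuming the directive names used above and nothing beyond the Python standard library; the SOM Directives proposal remains the authoritative reference for parsing rules.

```python
import urllib.request

# Directive names as used in the robots.txt example above.
SOM_KEYS = {"som-endpoint", "som-format", "som-scope",
            "som-freshness", "som-token-budget"}


def som_directives(domain: str) -> dict[str, str]:
    """Return any SOM-* directives from a domain's robots.txt,
    keyed by lowercased directive name."""
    url = f"https://{domain}/robots.txt"
    with urllib.request.urlopen(url, timeout=10) as f:
        text = f.read().decode("utf-8", errors="replace")
    found: dict[str, str] = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#") or ":" not in line:
            continue  # skip comments and non-directive lines
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key in SOM_KEYS:
            found[key] = value.strip()
    return found


# For the robots.txt above:
# som_directives("acme.dev")["som-endpoint"] -> "https://acme.dev/api/v1/som"
```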
## What each is bad at
Confusion is reduced further by being honest about each artefact’s limits.
llms.txt is not a content delivery format. Cramming the contents of a large documentation site into a single markdown file (or even into the proposed llms-full.txt variant) is a workable trick for very small sites and a category mistake for any site of meaningful size. The first time an agent has to re-fetch the same 800 KB summary file to answer a single question, the flaw in the design reveals itself. The right scope for llms.txt is orientation, not transport.
SOM is not a site-level introduction. A SOM document for a single page does not tell the agent what the rest of the site contains, what the publisher’s editorial stance is, or which pages should be considered authoritative. A first-time agent fetching a single SOM document has the page but not the site. SOM also does not replace sitemap.xml, OpenAPI specifications, or schema.org markup; each of those answers a different question.
A site that ships only llms.txt is a site that has put up a directory and called it infrastructure. A site that ships only SOM is a site that has built a high-quality per-page surface and forgotten to introduce itself. The combination is what matters.
## Cardinality, freshness, and where the work lives
The two artefacts also have different operational profiles, which is a useful lens for deciding who on the team owns each.
| Property | llms.txt | SOM endpoint |
|---|---|---|
| Cardinality | One per domain | One per addressable page |
| Authoring | Hand-edited | Generated from page content |
| Refresh cadence | Quarterly or on major site changes | Every time the underlying page changes |
| Owner on the team | Editorial / DevRel | Engineering / platform |
| Token cost to consumer | Sub-1k for the file itself | Hundreds to low thousands per page |
| Discovery | Convention: /llms.txt | Robots.txt SOM Directives |
| Format | Markdown | JSON, SOM/1.0 schema |
| Validation | None standardised | somspec.org/validate |
Treating these as competing standards puts engineers and editorial in the same decision arena and forces them to choose. Treating them as different layers puts each artefact under the team that is best positioned to maintain it.
## What an agent author should do
For agent authors, the implication is symmetric: support both, in this order.
- On entering a new domain, fetch `/llms.txt` for orientation. If present, use it to seed the agent's map of the site.
- On any URL where the agent intends to do non-trivial reasoning, parse the domain's `robots.txt` for SOM Directives and, if a SOM endpoint is advertised, prefer SOM over raw HTML for that fetch.
- Fall back to HTML fetch only when neither is available. Even there, consider running a local SOM-style structural extraction so the LLM never has to ingest raw markup directly.
This is the same mental model an LSP-aware editor uses when deciding how to render code: ask the language server first, then fall back to syntax highlighting, then fall back to plain text. Multiple compatible layers, queried in order of richness.
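A sketch of that ordering, assuming the `requests` library and the directive format shown earlier; `orient` and `read_page` are illustrative names, and a production agent would cache robots.txt and llms.txt per domain.

```python
import urllib.parse

import requests


def orient(domain: str) -> str | None:
    """Once per domain: fetch /llms.txt for orientation, if present."""
    resp = requests.get(f"https://{domain}/llms.txt", timeout=10)
    return resp.text if resp.ok else None


def read_page(url: str) -> tuple[str, object]:
    """Fetch one page at the richest layer available: SOM, then raw HTML."""
    domain = urllib.parse.urlsplit(url).netloc

    # Preferred layer: per-page SOM, if robots.txt advertises an endpoint.
    robots = requests.get(f"https://{domain}/robots.txt", timeout=10)
    if robots.ok:
        for line in robots.text.splitlines():
            key, _, value = line.partition(":")
            if key.strip().lower() == "som-endpoint":
                som = requests.get(value.strip(), params={"url": url}, timeout=10)
                if som.ok:
                    return "som", som.json()

    # Last resort: raw HTML. A local SOM-style structural extraction could
    # run here so the model never ingests raw markup directly.
    html = requests.get(url, timeout=10)
    html.raise_for_status()
    return "html", html.text
```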
## Why publishers should not pick one
Publishers occasionally ask which standard to bet on, as if shipping both required choosing a winner. The cost of shipping llms.txt is roughly thirty minutes of editorial work and a static file deploy. The cost of shipping SOM is a one-time engineering investment in a renderer that converts the publisher’s existing content into SOM/1.0 documents, plus the five lines in robots.txt to advertise the endpoint. Neither is on the critical path of the other. Neither blocks the other. Neither benefits the publisher who waits.
The publishers who win agent attention in 2026 will be the ones whose sites are legible at every layer an agent can read: a sitemap for crawl planning, a robots.txt for access guidance and SOM endpoint discovery, an llms.txt for the site-level introduction, and a SOM endpoint for the per-page structured surface. Each layer is cheap. Each layer compounds.
## Bottom line
llms.txt and SOM are not competitors. They sit at different cardinalities, are maintained by different parts of the team, refresh on different cadences, and answer different questions. The right strategy for any publisher who expects AI agents in their reader population is to ship both, with the SOM endpoint advertised in robots.txt and llms.txt published at the domain root.
For implementation guidance, see the SOM/1.0 specification, the SOM Directives proposal, and the validator. For a list of publishers already shipping the full stack, see the publishers leaderboard.