# SOM vs llms.txt: When to Use Which
llms.txt tells an agent what your site is. SOM tells an agent what your page contains. They are different layers of the same problem, and publishers should ship both.
Two conventions are circulating for making a website legible to AI agents. The first is llms.txt, a plain-text file at the root of a domain that summarizes the site for large language models. The second is the Semantic Object Model — SOM — which represents an individual web page as a typed JSON document optimized for agent consumption. The two are routinely compared as if they were alternatives. They are not. They solve different problems at different layers, and a publisher who is serious about being read by agents should ship both.
This piece explains where each fits, why the comparison keeps surfacing, and what a well-instrumented site looks like when both are deployed correctly.
## The two layers, plainly stated
llms.txt is a site-level introduction. It tells an agent what your site is, what it covers, where the important pages live, and what tone the agent should expect. It is the equivalent of a README at the top of a repository, or the front matter of a research monograph. There is one llms.txt per domain. It does not change when individual pages change. It does not contain the contents of those pages. Its job is orientation.
SOM is a page-level representation. It tells an agent what an individual page contains: a typed list of regions, a typed list of elements, an explicit set of available actions, and stable identifiers that survive page refreshes. There is one SOM document per addressable page. It changes when the page changes. Its job is comprehension.
The two artefacts answer different questions. *What is this site?* versus *What is on this page, and what can I do with it?*
## What each looks like in practice
A minimal llms.txt for a documentation site might look like this:
```text
# Acme Inc.
> Acme builds developer tools for distributed systems.
> Documentation, blog, and changelog are public; pricing requires a free account.

## Documentation
- [Quickstart](https://acme.dev/docs/quickstart): five-minute setup
- [API Reference](https://acme.dev/docs/api): full HTTP surface
- [SDKs](https://acme.dev/docs/sdks): Python, Go, Rust, TypeScript

## Pricing
- [Plans](https://acme.dev/pricing): Free, Team, Enterprise

## Notes for agents
- Tone: precise and technical. Avoid superlatives.
- Authoritative source for our pricing is /pricing, not third-party reviews.
```

A SOM document for a single page on the same site looks materially different; it is the page itself, rendered in machine-native form:
```json
{
  "som_version": "1.0",
  "url": "https://acme.dev/docs/quickstart",
  "title": "Quickstart",
  "lang": "en",
  "regions": [
    {
      "id": "r_main",
      "role": "main",
      "elements": [
        { "id": "e_3f8a", "role": "heading", "text": "Quickstart", "attrs": { "level": 1 } },
        { "id": "e_9d4e", "role": "paragraph", "text": "Get Acme running in five minutes." },
        { "id": "e_b711", "role": "code", "text": "npm install @acme/sdk", "attrs": { "lang": "bash" } },
        { "id": "e_c082", "role": "link", "text": "Continue to API Reference",
          "actions": ["click"], "attrs": { "href": "/docs/api" } }
      ]
    }
  ],
  "meta": { "html_bytes": 28104, "som_bytes": 412, "compression_ratio": 68.2 }
}
```

The llms.txt is one file, hand-edited, updated rarely. The SOM document is one of thousands, generated dynamically, refreshed whenever the underlying content changes. They are not in tension. They are not duplicates. They are different artefacts at different cardinalities.
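To see the consumer's side concretely, here is a minimal sketch of fetching and walking a document of this shape. The endpoint URL and field names are taken from the examples in this piece; the helpers (`fetch_som`, `actionable_elements`) and the `requests` dependency are illustrative, and a real agent would validate responses against the SOM/1.0 schema.

```python
import requests

# Endpoint as advertised in the publisher's robots.txt (shown later).
SOM_ENDPOINT = "https://acme.dev/api/v1/som"


def fetch_som(page_url: str) -> dict:
    """Fetch the SOM/1.0 document for a single page."""
    resp = requests.get(SOM_ENDPOINT, params={"url": page_url}, timeout=10)
    resp.raise_for_status()
    return resp.json()


def actionable_elements(som: dict) -> list[dict]:
    """Return every element that advertises at least one action."""
    return [el
            for region in som.get("regions", [])
            for el in region.get("elements", [])
            if el.get("actions")]


som = fetch_som("https://acme.dev/docs/quickstart")
print(som["title"], som["meta"].get("compression_ratio"))
for el in actionable_elements(som):
    print(el["id"], el["role"], el.get("attrs", {}).get("href"))
```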
## Why the comparison keeps coming up
The conflation has three sources, and naming them helps dispel the confusion.
First, both formats arrived to solve the same anxiety. Publishers and framework authors woke up to the realisation that AI agents had become a meaningful share of their traffic, and that those agents were paying enormous token costs to read HTML that was never designed for them. Both llms.txt and SOM are attempts to give agents a friendlier surface. But identical motivation does not imply identical scope.
Second, both invoke robots.txt as a precedent. llms.txt is positioned as “robots.txt for LLMs”; SOM Directives are positioned as a robots.txt extension. The structural analogy is real, but it points to where the artefact lives, not to what the artefact contains. Robots.txt itself answers only one question, whether an agent may fetch a given URL, and neither llms.txt nor SOM answers that question. They are layered on top of it.
Third, the public discourse rarely distinguishes site-level from page-level infrastructure. The same engineer who says “we shipped llms.txt this week” will say “we shipped a SOM endpoint” the next week, and the observer hears two attempts at the same thing. They are not. The first is a single markdown file at /llms.txt; the second is a JSON endpoint that returns a per-page document at, e.g., /api/v1/som?url=….
## How they compose
A site that has shipped both well will have:
- An llms.txt at the domain root describing the site, its high-level structure, and any agent-specific guidance.
- A robots.txt that advertises a SOM endpoint via SOM Directives. Five lines. Tells any agent that a structured per-page representation is available.
- A SOM endpoint that, given a URL on the domain, returns a SOM/1.0 document representing the contents of that page. Cached aggressively, regenerated when the underlying content changes. (A minimal serving sketch follows this list.)
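As a hedged sketch of that last item: the Flask app and in-process dict cache below are arbitrary choices, and `render_som` is a hypothetical stand-in for whatever converts the publisher's content store into SOM/1.0 documents.

```python
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
_cache: dict[str, dict] = {}  # stand-in; invalidate entries when content changes


def render_som(page_url: str) -> dict | None:
    """Hypothetical renderer: convert the stored content behind page_url
    into a SOM/1.0 document like the example above. Stubbed here."""
    if not page_url.startswith("https://acme.dev/"):
        return None  # not our domain, nothing to render
    return {"som_version": "1.0", "url": page_url, "title": "stub",
            "lang": "en", "regions": [], "meta": {}}


@app.route("/api/v1/som")
def som_endpoint():
    page_url = request.args.get("url")
    if not page_url:
        abort(400)  # the endpoint is keyed on a page URL
    if page_url not in _cache:  # cache aggressively, regenerate on change
        doc = render_som(page_url)
        if doc is None:
            abort(404)
        _cache[page_url] = doc
    resp = jsonify(_cache[page_url])
    resp.headers["Cache-Control"] = "public, max-age=3600"  # mirrors SOM-Freshness
    return resp
```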
Concretely, the publisher’s robots.txt looks like this:
```text
User-agent: *
Allow: /

# Site overview for LLMs
# (See also: /llms.txt for a markdown summary)

# Per-page structured representation
SOM-Endpoint: https://acme.dev/api/v1/som
SOM-Format: SOM/1.0
SOM-Scope: main-content
SOM-Freshness: 3600
SOM-Token-Budget: 15000

Sitemap: https://acme.dev/sitemap.xml
```

An agent visiting acme.dev for the first time can take three different paths through this stack depending on its sophistication.
- A simple agent reads `/llms.txt`, treats it as the canonical map of the site, and follows the URLs it finds there as ordinary HTML pages.
- A better agent reads `/llms.txt` for orientation and uses the SOM endpoint for any page it actually needs to comprehend in detail. Token cost drops by an order of magnitude or more on each per-page fetch.
- A specialist agent consumes only SOM, treating `/llms.txt` as optional context and the SOM endpoint as the primary substrate.
All three paths are valid. The publisher does not have to know which kind of agent will visit. The infrastructure supports all three by virtue of having shipped both layers.
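On the agent side, discovering the endpoint is a matter of scanning robots.txt for the SOM-* keys. A minimal sketch, assuming the directive names used above and nothing beyond the Python standard library; the SOM Directives proposal remains the authoritative reference for parsing rules.

```python
import urllib.request

# Directive names as used in the robots.txt example above.
SOM_KEYS = {"som-endpoint", "som-format", "som-scope",
            "som-freshness", "som-token-budget"}


def som_directives(domain: str) -> dict[str, str]:
    """Return any SOM-* directives from a domain's robots.txt,
    keyed by lowercased directive name."""
    url = f"https://{domain}/robots.txt"
    with urllib.request.urlopen(url, timeout=10) as f:
        text = f.read().decode("utf-8", errors="replace")
    found: dict[str, str] = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#") or ":" not in line:
            continue  # skip comments and non-directive lines
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key in SOM_KEYS:
            found[key] = value.strip()
    return found


# For the robots.txt above:
# som_directives("acme.dev")["som-endpoint"] -> "https://acme.dev/api/v1/som"
```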
## What each is bad at
Confusion is reduced further by being honest about each artefact’s limits.
llms.txt is not a content delivery format. Cramming the contents of a large documentation site into a single markdown file (or even into the proposed llms-full.txt variant) is a workable trick for very small sites and a category mistake for any site of meaningful size. The first time an agent has to re-fetch the same 800 KB summary file to answer a single question, the flaw in the design reveals itself. The right scope for llms.txt is orientation, not transport.
SOM is not a site-level introduction. A SOM document for a single page does not tell the agent what the rest of the site contains, what the publisher’s editorial stance is, or which pages should be considered authoritative. A first-time agent fetching a single SOM document has the page but not the site. SOM also does not replace sitemap.xml, OpenAPI specifications, or schema.org markup; each of those answers a different question.
A site that ships only llms.txt is a site that has put up a directory and called it infrastructure. A site that ships only SOM is a site that has built a high-quality per-page surface and forgotten to introduce itself. The combination is what matters.
## Cardinality, freshness, and where the work lives
The two artefacts also have different operational profiles, which is a useful lens for deciding who on the team owns each.
| Property | llms.txt | SOM endpoint |
|---|---|---|
| Cardinality | One per domain | One per addressable page |
| Authoring | Hand-edited | Generated from page content |
| Refresh cadence | Quarterly or on major site changes | Every time the underlying page changes |
| Owner on the team | Editorial / DevRel | Engineering / platform |
| Token cost to consumer | Sub-1k for the file itself | Hundreds to low thousands per page |
| Discovery | Convention: /llms.txt | Robots.txt SOM Directives |
| Format | Markdown | JSON, SOM/1.0 schema |
| Validation | None standardised | somspec.org/validate |
Treating these as competing standards puts engineers and editorial in the same decision arena and forces them to choose. Treating them as different layers puts each artefact under the team that is best positioned to maintain it.
## What an agent author should do
For agent authors, the implication is symmetric: support both, in this order.
- On entering a new domain, fetch `/llms.txt` for orientation. If present, use it to seed the agent's map of the site.
- On any URL where the agent intends to do non-trivial reasoning, parse the domain's `robots.txt` for SOM Directives and, if a SOM endpoint is advertised, prefer SOM over raw HTML for that fetch.
- Fall back to HTML fetch only when neither is available. Even there, consider running a local SOM-style structural extraction so the LLM never has to ingest raw markup directly.
This is the same mental model an LSP-aware editor uses when deciding how to render code: ask the language server first, then fall back to syntax highlighting, then fall back to plain text. Multiple compatible layers, queried in order of richness.
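A sketch of that ordering, assuming the `requests` library and the directive format shown earlier; `orient` and `read_page` are illustrative names, and a production agent would cache robots.txt and llms.txt per domain.

```python
import urllib.parse

import requests


def orient(domain: str) -> str | None:
    """Once per domain: fetch /llms.txt for orientation, if present."""
    resp = requests.get(f"https://{domain}/llms.txt", timeout=10)
    return resp.text if resp.ok else None


def read_page(url: str) -> tuple[str, object]:
    """Fetch one page at the richest layer available: SOM, then raw HTML."""
    domain = urllib.parse.urlsplit(url).netloc

    # Preferred layer: per-page SOM, if robots.txt advertises an endpoint.
    robots = requests.get(f"https://{domain}/robots.txt", timeout=10)
    if robots.ok:
        for line in robots.text.splitlines():
            key, _, value = line.partition(":")
            if key.strip().lower() == "som-endpoint":
                som = requests.get(value.strip(), params={"url": url}, timeout=10)
                if som.ok:
                    return "som", som.json()

    # Last resort: raw HTML. A local SOM-style structural extraction could
    # run here so the model never ingests raw markup directly.
    html = requests.get(url, timeout=10)
    html.raise_for_status()
    return "html", html.text
```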
## Why publishers should not pick one
Publishers occasionally ask which standard to bet on, as if shipping both required choosing a winner. The cost of shipping llms.txt is roughly thirty minutes of editorial work and a static file deploy. The cost of shipping SOM is a one-time engineering investment in a renderer that converts the publisher’s existing content into SOM/1.0 documents, plus the five lines in robots.txt to advertise the endpoint. Neither is on the critical path of the other. Neither blocks the other. Neither benefits the publisher who waits.
The publishers who win agent attention in 2026 will be the ones whose sites are legible at every layer an agent can read: a sitemap for crawl planning, a robots.txt for access guidance and SOM endpoint discovery, an llms.txt for the site-level introduction, and a SOM endpoint for the per-page structured surface. Each layer is cheap. Each layer compounds.
## Bottom line
llms.txt and SOM are not competitors. They sit at different cardinalities, are maintained by different parts of the team, refresh on different cadences, and answer different questions. The right strategy for any publisher who expects AI agents in their reader population is to ship both, with the SOM endpoint advertised in robots.txt and llms.txt published at the domain root.
For implementation guidance, see the SOM/1.0 specification, the SOM Directives proposal, and the validator. For a list of publishers already shipping the full stack, see the publishers leaderboard.