PROPOSAL · DRAFT v0.1
SOM Directives for robots.txt
A proposed extension to the robots.txt standard that lets publishers advertise SOM availability and declare interaction preferences for AI agents. Rooted in the Plasmate Labs proposal and the W3C Web Content Browser for AI Agents Community Group.
Why Extend robots.txt
The robots.txt standard has governed crawler behavior since 1994. Today it is also how website owners signal their preferences to AI agents — yet it only answers one question: may this agent access this URL?
It cannot say: yes, you may read my content, and here is a more efficient representation of it. Publishers are left with a binary choice — block agents entirely, or serve them raw HTML at full bandwidth and token cost.
SOM Directives extend robots.txt to express a third option: cooperative content negotiation. A publisher can advertise a structured SOM endpoint, declare freshness preferences, suggest token budgets, and set interaction policies — all within the file agents already check.
Extending robots.txt is preferable to a new discovery file because it requires no new infrastructure, no additional HTTP round-trips, and builds on a well-understood trust model that every crawler and agent already implements.
Base Directives
The following five directives form the base proposal, as specified by Hurley (2026) and documented at docs.plasmate.app/robots-txt-proposal.
```
User-agent: *
Allow: /

# Semantic Object Model available
SOM-Endpoint: https://cache.example.com/v1/som
SOM-Format: SOM/1.0
SOM-Scope: main-content
SOM-Freshness: 3600
SOM-Token-Budget: 15000
```
| Directive | Type | Description |
|---|---|---|
| SOM-Endpoint | URL | Base URL of the SOM service. Agents append ?url= with the target page URL. |
| SOM-Format | string | Format of the representation. Values: SOM/1.0, markdown, accessibility-tree. |
| SOM-Scope | string | Content coverage. Values: full-page, main-content, article-body. |
| SOM-Freshness | seconds | Maximum age of a cached SOM representation. Default: 86400 (24 hours). |
| SOM-Token-Budget | integer | Suggested maximum token count, helping agents estimate costs before fetching. |
When an agent encounters these directives, it should prefer the SOM endpoint over fetching raw HTML, constructing the request URL as: {SOM-Endpoint}?url={encoded-page-url}
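The URL construction above can be sketched in a few lines. The endpoint and page URL here are the illustrative values from the table, not real services:

```python
from urllib.parse import quote

def som_request_url(som_endpoint: str, page_url: str) -> str:
    """Build the SOM fetch URL following the {SOM-Endpoint}?url={encoded-page-url}
    convention. The target page URL is fully percent-encoded, including
    ':' and '/', so it survives as a single query parameter value."""
    return f"{som_endpoint}?url={quote(page_url, safe='')}"

print(som_request_url("https://cache.example.com/v1/som",
                      "https://example.com/articles/42"))
# → https://cache.example.com/v1/som?url=https%3A%2F%2Fexample.com%2Farticles%2F42
```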
Extended Directives (Proposed)
The following directives extend the base proposal. They are not yet part of the Plasmate Labs specification but are proposed here for community discussion via the W3C CG. They follow the same robots.txt syntax and are ignored by agents that do not understand them.
```
# Extended interaction preferences
SOM-Rate-Limit: 60/minute
SOM-Concurrent: 5
SOM-Attribution: required
SOM-Attribution-Format: Source: {publisher} ({url})
SOM-Contact: agents@example.com
SOM-Paywall: /premium/* /members/*
```

| Directive | Type | Description |
|---|---|---|
| SOM-Rate-Limit | N/period | Advisory rate limit for SOM endpoint requests. Format: integer/minute or integer/hour. |
| SOM-Concurrent | integer | Advisory limit on simultaneous sessions per agent identity. |
| SOM-Attribution | required \| optional | Whether the publisher requests attribution when content is cited or summarized. |
| SOM-Attribution-Format | template | Attribution template. Variables: {publisher}, {url}, {title}. |
| SOM-Contact | email | Contact address for agent-related issues. Not for automated use. |
| SOM-Paywall | glob patterns | Space-separated path patterns indicating gated content. Agents should not attempt SOM fetch for these paths. |
These directives are advisory. Publishers declare preferences; enforcement remains the publisher's responsibility through server-side controls. Agents that honor extended directives are considered Level 2 compliant (see Section 05).
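Two of the advisory values above are machine-interpretable and worth a sketch: converting SOM-Rate-Limit into a request delay, and filling an SOM-Attribution-Format template. The helper names are ours, not part of the proposal:

```python
def parse_rate_limit(value: str) -> float:
    """Turn an advisory SOM-Rate-Limit value ('60/minute', '1000/hour')
    into a minimum delay in seconds between SOM endpoint requests."""
    count, _, period = value.partition("/")
    return {"minute": 60, "hour": 3600}[period] / int(count)

def render_attribution(template: str, **fields: str) -> str:
    """Fill an SOM-Attribution-Format template; supported placeholders
    are {publisher}, {url}, and {title} per the table above."""
    return template.format(**fields)

print(parse_rate_limit("60/minute"))  # 1.0 (one second between requests)
print(render_attribution("Source: {publisher} ({url})",
                         publisher="Example News", url="https://example.com/a"))
```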
Per-Path Overrides
Standard robots.txt user-agent blocks allow per-path scoping. SOM directives inherit this mechanism naturally:
```
# Default: SOM available site-wide
User-agent: *
Allow: /
SOM-Endpoint: https://cache.example.com/v1/som
SOM-Format: SOM/1.0
SOM-Freshness: 3600

# Documentation: longer freshness, full-page scope
User-agent: *
SOM-Scope: full-page
SOM-Freshness: 86400
Disallow:

# API routes: not content, no SOM served
User-agent: *
Disallow: /api/
```
Where multiple blocks apply to the same agent, the most specific match wins, consistent with RFC 9309 precedence rules.
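A minimal sketch of that precedence rule, resolving a page path against per-path directive sets by longest matching prefix. The dict shape is illustrative, not part of the proposal:

```python
def most_specific_rules(groups: dict[str, dict], path: str) -> dict:
    """Return the SOM directive set whose path prefix matches `path`
    most specifically (longest match), in the spirit of RFC 9309
    precedence. `groups` maps a path prefix to its directives."""
    matches = [prefix for prefix in groups if path.startswith(prefix)]
    return groups[max(matches, key=len)] if matches else {}

groups = {
    "/":      {"SOM-Freshness": "3600", "SOM-Scope": "main-content"},
    "/docs/": {"SOM-Freshness": "86400", "SOM-Scope": "full-page"},
}
print(most_specific_rules(groups, "/docs/setup"))  # the /docs/ block wins
```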
Compliance Levels
Compliance is voluntary. SOM Directives express publisher preferences, not enforcement mechanisms. Three levels are defined for interoperability:

- Level 0: honors standard robots.txt access rules (RFC 9309) but ignores SOM directives.
- Level 1: additionally parses and honors the five base directives.
- Level 2: also honors the extended, advisory directives.
Relationship to Other Standards
SOM Directives sit within an emerging stack of standards for the agentic web:
| Standard | Layer | Question answered |
|---|---|---|
| robots.txt (RFC 9309) | Access | May this agent access this URL? |
| SOM Directives | Representation | What format and endpoint should agents use? |
| SOM (v1.0) | Format | What does a structured page representation look like? |
| AWP Protocol | Interaction | How does an agent act on a page once fetched? |
| schema.org / JSON-LD | Semantics | What does this content mean? |
SOM Directives do not replace robots.txt — they extend it. An agent should check robots.txt first. If access is denied, directives are irrelevant. If access is allowed, directives provide guidance on how to fetch efficiently.
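That check order can be sketched with the standard library. Python's `robotparser` handles the access question but silently ignores unknown directives, so the SOM-Endpoint line is scanned for separately; the return values here are our own convention:

```python
from urllib import robotparser

def choose_fetch_strategy(robots_txt: str, agent: str, path: str) -> str:
    """Decide how to fetch a page: access check first, then directives."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if not rp.can_fetch(agent, path):
        return "denied"                    # access denied: directives irrelevant
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "som-endpoint":
            return "som:" + value.strip()  # fetch via the SOM endpoint
    return "html"                          # no directive: fall back to raw HTML
```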
The Discovery Gap
Empirical research by Hurley (2026) in Agent Compliance with robots.txt SOM Directives: Empirical Evidence of the Discovery Gap [9] found that even agents capable of honoring SOM directives frequently fail to check for them. The primary causes are:
- robots.txt is fetched once per session and cached, but SOM directives are ignored even when present
- Most agent frameworks treat robots.txt as access-only and do not parse unknown directives
- No standard library implements SOM-directive parsing, creating friction for framework authors
Closing this gap requires adoption at the framework level — LangChain, LlamaIndex, Browser Use, CrewAI, and other orchestration tools adding robots.txt directive parsing to their fetch pipelines.
This page is intended to serve as a reference that framework authors can point to when implementing compliance.
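As a starting point for that parsing step, here is a minimal sketch of extracting SOM-* directives from a robots.txt body. It deliberately ignores user-agent grouping and assumes values contain no '#' characters:

```python
def parse_som_directives(robots_txt: str) -> dict[str, str]:
    """Collect SOM-* directives from a robots.txt body into a dict."""
    directives: dict[str, str] = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        key, sep, value = line.partition(":")  # split on the first ':' only,
        if sep and key.strip().lower().startswith("som-"):  # so URLs survive
            directives[key.strip()] = value.strip()
    return directives
```

Splitting on the first colon only keeps `https://…` values intact; directive names are matched case-insensitively, consistent with robots.txt practice.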
Publisher Quickstart
Add to your robots.txt:
```
# Minimal — advertise SOM endpoint only
User-agent: *
Allow: /
SOM-Endpoint: https://cache.plasmate.app/v1/som
SOM-Format: SOM/1.0
SOM-Freshness: 3600
```
If you self-host Plasmate, replace the endpoint with your own instance. If you use the Plasmate SOM Cache, the endpoint is https://cache.plasmate.app/v1/som.
Verify your configuration at somspec.org/validate or use somordom.com for a live head-to-head comparison.
Get Involved
This proposal is discussed in the W3C Web Content Browser for AI Agents Community Group. The extended directives in Section 03 are not yet adopted and are open for comment.