
The Discovery Gap: Why AI Agents Miss Your SOM Directives

The robots.txt SOM Directives proposal is technically elegant. A publisher adds five lines to an existing file. An AI agent reads those lines, discovers a structured SOM endpoint, and fetches compact JSON instead of raw HTML. Token costs drop 10–100×. The page still gets read. The publisher still gets visited. Everyone benefits.
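To make the proposal concrete, an addition of roughly this shape would sit in an existing robots.txt file. Only two of the directives, SOM-Endpoint and SOM-Format, are named later in this piece, so the sketch shows just those; the values are hypothetical placeholders, not the specification.

```
User-agent: *
SOM-Endpoint: https://example.com/.well-known/som.json   # hypothetical value
SOM-Format: json                                         # hypothetical value
```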

The problem is that in empirical testing, most agents never read those five lines.

This is the “discovery gap” — the distance between a published standard and actual agent behavior. Hurley (2026) measured it in Agent Compliance with robots.txt SOM Directives: Empirical Evidence of the Discovery Gap. The findings are instructive for anyone invested in the future of cooperative web infrastructure.

What the research found

Most agent frameworks treat robots.txt as an access-control file. They parse Disallow and Allow directives. They check Crawl-delay. But unknown directives — including all SOM-* directives — are silently ignored. The framework never errors. The publisher never knows. The agent just keeps fetching raw HTML, consuming tens of thousands of tokens per page for content that could be represented in hundreds.

This is not a bug. It is standard behavior for robust parsers: ignore what you don’t understand. The same principle that makes robots.txt backward-compatible — allowing new directives to be added without breaking existing crawlers — is exactly what creates the discovery gap. Backward compatibility and forward discovery are, in this case, in direct tension.
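The silent-drop behavior is easy to see in a toy parser. This is a sketch of the pattern described above, not any particular library's implementation; the set of "known" directives and the sample robots.txt are illustrative.

```python
# Directives a typical robots.txt parser recognizes (illustrative set).
KNOWN = {"user-agent", "disallow", "allow", "crawl-delay", "sitemap"}

def parse_robots(text: str) -> list:
    """Parse robots.txt the 'robust' way: ignore what you don't understand."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # strip comments
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key not in KNOWN:
            continue   # SOM-* lines vanish here: no error, no warning
        rules.append((key, value.strip()))
    return rules

sample = """\
User-agent: *
Disallow: /private/
SOM-Endpoint: https://example.com/som.json
SOM-Format: json
"""
print(parse_robots(sample))
# → [('user-agent', '*'), ('disallow', '/private/')]
```

The two SOM-* lines parse cleanly as key-value pairs, yet never reach the caller. Nothing fails, which is exactly why the gap stays invisible.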

Three causes

The research identifies three structural causes for the gap, none of which reflects an individual engineering failure:

1. No standard library. There is no widely adopted robots.txt parsing library that implements SOM directive discovery. The popular libraries — in Python, JavaScript, Go, and Rust — parse the directives defined in the original 1994 standard and its subsequent extensions. SOM directives are new. Framework authors who want to support them would need to write the parsing themselves, or wait for upstream libraries to add support. Most wait.

2. No incentive signal. Until agents that honor SOM directives are measurably cheaper to run than those that don’t, framework maintainers have no urgent reason to prioritize compliance. The cost savings are real — but they accrue downstream, at the LLM inference layer, not at the framework layer. The person who would save the most money is not the person who needs to write the code. This is a classic misaligned-incentive problem, and it does not resolve itself without measurement.

3. No visible compliance tracking. Before the compliance matrix at somspec.org existed, there was no public record of which frameworks were or weren’t compliant. The gap was invisible. You cannot fix what you cannot see, and you cannot prioritize what nobody is measuring. The compliance matrix changes this — not by shaming anyone, but by making the state of the ecosystem legible.

What Level 1 compliance requires

The full compliance checklist defines three levels. Level 1 — the minimum — is deliberately simple. A framework must:

1. Parse SOM-* directives from robots.txt.
2. Check for the presence of SOM-Endpoint and SOM-Format.
3. If both are present, fetch from the SOM endpoint instead of the raw HTML URL.

That is the complete change. No new protocol. No new authentication. No breaking change to existing behavior. Just: if the publisher has told you where to find structured content, go get it.
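The whole change fits in a few lines. This sketch assumes only what the article states: collect SOM-* directives instead of dropping them, and route to the endpoint when both required keys are present. How the endpoint is parameterized per page is not specified here, so the sketch simply returns the endpoint value.

```python
def som_directives(robots_txt: str) -> dict:
    """Collect SOM-* directives from robots.txt text (keys lowercased)."""
    found = {}
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # strip comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower().startswith("som-"):
            found[key.strip().lower()] = value.strip()
    return found

def url_to_fetch(robots_txt: str, page_url: str) -> str:
    """Level 1 rule: if SOM-Endpoint and SOM-Format are both present,
    fetch the SOM endpoint; otherwise fall back to the raw HTML URL."""
    som = som_directives(robots_txt)
    if "som-endpoint" in som and "som-format" in som:
        return som["som-endpoint"]
    return page_url

robots = """User-agent: *
SOM-Endpoint: https://example.com/som
SOM-Format: json
"""
print(url_to_fetch(robots, "https://example.com/article"))
# → https://example.com/som
```

Note the fallback: if either directive is missing, behavior is unchanged, which is what makes Level 1 a non-breaking change.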

Why this matters

Token costs compound. Consider a framework serving 1 million page fetches per day at an average of 45,000 HTML tokens per page. At GPT-4o pricing ($0.0000025 per token), that is approximately $112,500 per day — $3,375,000 per month — in input tokens alone. The same traffic routed through SOM endpoints at a conservative 17× compression ratio: $6,600 per day. $198,000 per month.

The gap is $3.17 million per month. Per framework. And that is a conservative estimate using a moderate compression ratio. Sites with heavy JavaScript and complex navigation structures routinely see 40–100× compression, which would widen the gap further.
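The arithmetic above is easy to reproduce. The inputs are the article's own figures; the computed values match the article's to rounding (the article rounds $6,618/day to $6,600 and the gap to $3.17 million).

```python
# Cost comparison: raw HTML fetches vs. SOM endpoints (figures from the article).
fetches_per_day = 1_000_000
html_tokens_per_page = 45_000
price_per_token = 0.0000025   # GPT-4o input pricing, as stated above
compression = 17              # conservative SOM compression ratio

html_daily = fetches_per_day * html_tokens_per_page * price_per_token
som_daily = html_daily / compression
monthly_gap = (html_daily - som_daily) * 30

print(f"HTML: ${html_daily:,.0f}/day, ${html_daily * 30:,.0f}/month")
print(f"SOM:  ${som_daily:,.0f}/day, ${som_daily * 30:,.0f}/month")
print(f"Gap:  ${monthly_gap:,.0f}/month")
```

At a 40–100× ratio the SOM line shrinks by another factor of 2–6, so the gap only widens from here.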

The discovery gap is not a technical problem. The five lines of robots.txt work. The SOM endpoints return valid, structured content. The specification is stable and public. This is an adoption problem — and adoption problems are solved by visibility, documentation, and a clear, public record of who has crossed the line and who has not.

Closing the gap

SOM Directives are in robots.txt. The endpoints are running. The cost math is not ambiguous. Framework authors who close the discovery gap will ship lower-cost systems, deliver those savings to their users, and gain a competitive advantage that compounds with every page fetch.

The compliance matrix at somspec.org tracks progress. The bar is low. The incentive is real. The gap is waiting to be closed.