
What Three Weeks of Public Benchmarks Reveal About the Web’s Token Bill

Three weeks. Two Plasmate releases. Thirty-eight production sites. The WebTaskBench public benchmark gives the first defensible empirical answer to a question publishers and agent authors have been asking each other since 2024: how much of the web’s token cost is actually load-bearing?

Tags: Benchmarks · WebTaskBench · v0.5.0 · v0.5.1 · Token Economics

On April 4, 2026, the WebTaskBench observatory began publishing weekly, machine-readable token-efficiency measurements for AI agent web fetching against a curated battery of production sites. Three weeks later, two Plasmate releases later, and thirty-eight successful site fetches later, the public dataset is large enough to say something defensible about a question publishers and agent authors have been asking each other since 2024: how much of the web’s token cost is actually load-bearing?

The short answer is: not much. The average production page in the current run carries roughly thirty times the tokens it would need if represented in a format designed for AI agents. The peak ratio is over one hundred. The median is closer to ten. And the long tail — the small set of pages where structured representation is actually worse than raw HTML — turns out to be more interesting than the headline number.

This piece walks through the public dataset as it stands on April 28, 2026. All numbers are reproducible from webtaskbench.com/api/v1/benchmark.json or by following the methodology at /protocol.

The setup, briefly

For each URL in the registry, the harness captures two values weekly:

  1. html_tokens — tokens in the raw HTML response from curl -sL, with a thirty-second timeout, tokenised using tiktoken with the cl100k_base encoding.
  2. som_tokens — tokens in the SOM/1.0 document produced by the current Plasmate release for the same URL, tokenised identically.

The compression ratio is html_tokens / som_tokens. A ratio above 1.0 means SOM is more compact than the raw HTML. The HTML baseline is intentionally conservative: curl -sL does not execute JavaScript, so it under-counts the tokens a real headless browser would see on JS-heavy pages. Real-world ratios in production are likely higher than what this benchmark reports.
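For concreteness, here is a minimal sketch of the two per-URL measurements in Python, mirroring the stated baseline (curl -sL, thirty-second timeout, no JavaScript, tiktoken with the cl100k_base encoding). The harness itself is defined at /protocol; this sketch assumes the SOM/1.0 document has already been produced by Plasmate and saved to a local file, since the Plasmate invocation is not part of this piece, and the filename used is hypothetical.

```python
# Minimal sketch of the two weekly measurements for a single URL.
# Baseline as stated: curl -sL, thirty-second timeout, no JS execution,
# tokenised with tiktoken's cl100k_base encoding.
import subprocess
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    # disallowed_special=() keeps encode() from raising if the page text
    # happens to contain strings that look like special tokens
    return len(enc.encode(text, disallowed_special=()))

def html_tokens(url: str) -> int:
    html = subprocess.run(
        ["curl", "-sL", "--max-time", "30", url],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_tokens(html)

def som_tokens(path: str) -> int:
    # Assumes the SOM/1.0 document was already written to a local file.
    with open(path, encoding="utf-8") as f:
        return count_tokens(f.read())

h = html_tokens("https://cloud.google.com/")
s = som_tokens("cloud_google.som")              # hypothetical filename
print(f"compression ratio: {h / s:.1f}x")       # > 1.0 means SOM is smaller
```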

The headline numbers

As of the most recent run (Plasmate v0.5.1, dated 2026-04-20):

| Metric | Value |
| --- | --- |
| Sites attempted | 38 |
| Sites succeeded | 38 (100%) |
| Average compression ratio | 29.6× |
| Median compression ratio | 9.8× |
| Peak compression ratio | 118.5× (cloud.google.com) |
| Sites where SOM is larger than HTML | 7 (ratio < 1.0) |

The gap between the average (29.6×) and the median (9.8×) is itself the story. The distribution is heavily right-skewed: a small number of large, scaffolding-heavy pages are responsible for most of the aggregate token cost, and they are also the pages where a structured representation pays off most dramatically. The median page does well — call it order-of-magnitude. The 90th-percentile page does spectacularly.
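These aggregates can be recomputed directly from the public feed. A short sketch follows, assuming the benchmark.json payload exposes per-site html_tokens and som_tokens fields; the field names are an assumption about the schema, not a documented contract, so check the live payload.

```python
# Recompute the headline aggregates from the public dataset.
# Field names ("results", "html_tokens", "som_tokens") are assumptions
# about the benchmark.json schema.
import json
import statistics
import urllib.request

URL = "https://webtaskbench.com/api/v1/benchmark.json"

with urllib.request.urlopen(URL, timeout=30) as resp:
    data = json.load(resp)

ratios = sorted(
    r["html_tokens"] / r["som_tokens"]
    for r in data["results"]
    if r.get("som_tokens")          # skip failed or empty fetches
)

print(f"sites:   {len(ratios)}")
print(f"average: {statistics.mean(ratios):.1f}x")
print(f"median:  {statistics.median(ratios):.1f}x")
print(f"peak:    {ratios[-1]:.1f}x")
```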

Where the compression actually lives

Sorted by compression ratio, the top of the leaderboard is dominated by enterprise marketing pages and reference documentation:

| URL | HTML tokens | SOM tokens | Ratio | Category |
| --- | --- | --- | --- | --- |
| cloud.google.com | 762,516 | 6,435 | 118.5× | SaaS & Cloud |
| arstechnica.com | 139,906 | 1,294 | 108.1× | News & Media |
| kubernetes.io/docs | 123,418 | 1,210 | 102.0× | Dev Tools |
| techcrunch.com | 139,498 | 1,398 | 99.8× | News & Media |
| nytimes.com | 375,828 | 4,294 | 87.5× | News & Media |
| linear.app | 893,116 | 11,046 | 80.9× | SaaS & Cloud |
| docker.com | 139,097 | 2,596 | 53.6× | SaaS & Cloud |

The pattern is consistent. Pages that ship hundreds of kilobytes of JSON state, build-tool runtime hydration, ad-tech beacons, analytics initialisers, design-system tokens, and nested layout containers are pages where the human-relevant content (the headline, the body, the actionable controls) is one or two percent of the byte budget. The other ninety-eight percent is scaffolding for browsers, telemetry, and the publisher’s own authoring pipeline. None of it is information an AI agent needs.

cloud.google.com at 118× is not an outlier in kind. It is an outlier in degree. A marketing page at cloud.google.com/products/databases has to load the Google design system, the ads SDK, the consent management script, the localisation runtime, the visitor analytics, four flavours of A/B testing harness, and the rest. The actual page content — twelve sentences of marketing copy and a feature grid — fits in about three thousand tokens of SOM.

Per-vertical: news, SaaS, dev docs, and the long tail

Aggregating by category gives a cleaner picture of where structured representation delivers the most value:

| Category | n | Average ratio | Notes |
| --- | --- | --- | --- |
| News & Media | ~8 | ~41× | Anti-bot walls were the binding constraint until v0.5.0 |
| SaaS & Cloud | ~12 | ~47× | Highest-bloat category; design-system overhead dominates |
| Dev Tools / Documentation | ~18 | ~12× | Already structured for human reading; less to compress |
| General | ~6 | ~4× | Mostly small pages; SOM overhead can dominate |
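The per-category averages are a straightforward group-by over the same records. A sketch, reusing the assumed schema from the earlier snippet and additionally assuming a per-site category field:

```python
# Average compression ratio per category, mirroring the table above.
from collections import defaultdict
from statistics import mean

def ratios_by_category(results: list[dict]) -> dict[str, float]:
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in results:
        if r.get("som_tokens"):
            buckets[r["category"]].append(r["html_tokens"] / r["som_tokens"])
    return {cat: mean(vals) for cat, vals in buckets.items()}

# e.g. ratios_by_category(data["results"]), with `data` loaded as in the
# earlier sketch
```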

SaaS & Cloud and News & Media are the two verticals where the payoff is the largest. For both, the agent-native economic argument is unambiguous: a publisher serving SOM cuts agent token cost by an order of magnitude or more, which translates directly into more agent visits, deeper agent reasoning, and lower agent-side cost per successful session. Developer documentation sits in the middle of the distribution. Documentation sites have already done much of the work of structuring information for human readers, and the marginal compression of SOM over a well-formed Markdown rendering of a docs page is real but smaller.

The long tail: where SOM is bigger than HTML

Seven of the thirty-eight current sites have a compression ratio below 1.0 — meaning the SOM document is larger than the raw HTML. This is the most diagnostically useful part of the dataset, because the explanation is structural rather than accidental. Examples:

| URL | HTML tokens | SOM tokens | Ratio |
| --- | --- | --- | --- |
| example.com | 152 | 331 | 0.5× |
| crates.io | 71 | 372 | 0.2× |
| news.ycombinator.com | 11,924 | 14,573 | 0.8× |
| jsonplaceholder.typicode.com | 2,476 | 3,282 | 0.8× |
| postgresql.org | 6,322 | 9,321 | 0.7× |

Two distinct cases are visible. The first — example.com and crates.io at the bottom — covers pages whose entire HTML is so small (under 200 tokens) that the structural overhead of SOM (region headers, element role annotations, stable IDs) exceeds the content itself. SOM has a non-zero floor; on sub-kilobyte pages, the floor is the dominant cost.

The second case — news.ycombinator.com, postgresql.org — is more useful. These are pages whose HTML is already aggressively minimal: no design-system runtime, minimal styling, content-first markup. Hacker News is famously a single nested table with inline styles. PostgreSQL.org has thirty years of editorial discipline. SOM’s value is the delta from typical; on these pages, the delta is negative because the typical bloat isn’t there in the first place.
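The two cases can also be separated mechanically from the public records. A sketch, again under the assumed schema (including an assumed per-site url field), using the informal 200-token threshold from the first case:

```python
# Split the sub-1.0 sites into the two cases described above:
# tiny pages where SOM's fixed overhead dominates, and already-minimal
# pages with little bloat to strip in the first place.
TINY_PAGE_TOKENS = 200   # informal threshold used in the text above

def classify_long_tail(results: list[dict]) -> dict[str, list[str]]:
    cases: dict[str, list[str]] = {"som_floor": [], "already_minimal": []}
    for r in results:
        if not r.get("som_tokens"):
            continue
        if r["html_tokens"] / r["som_tokens"] >= 1.0:
            continue
        key = "som_floor" if r["html_tokens"] < TINY_PAGE_TOKENS else "already_minimal"
        cases[key].append(r["url"])
    return cases
```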

The lesson for publishers: SOM is not a universal good. It is a structural improvement over the kind of HTML that ships in 2026. Pages that already practice a 1996 discipline of minimal markup get little benefit. Most pages do not.

What changed between v0.5.0 and v0.5.1

The v0.5.0 release on April 4 made one decisive change: it broke through anti-bot infrastructure that had been blocking benchmark fetches on major news sites. The cleanest example is TechCrunch, which had been on the “failed sites” list for the entire pre-launch period and which now sits at 77× to 100× depending on the run. Five other news domains followed the same trajectory in the same release.

v0.5.1 on April 20 was a stability release — improved retry behaviour, tighter per-URL timeouts, better handling of large multi-region SaaS pages. The aggregate numbers shifted by less than a point (29.7× → 29.6× average) but the sites at the long tail tightened their ratios, suggesting more reliable extraction on the marginal cases.

Two trajectories are now visible across the three weeks of data: the previously unreachable becomes reachable, and the unstable becomes stable. The first is what unlocks new categories; the second is what makes the benchmark defensible quarter-over-quarter.

What this means for publishers

Three implications follow from the dataset, in increasing order of how often publishers seem to miss them:

The first: if your site is in the SaaS & Cloud or News & Media categories, you are paying somewhere between 30× and 100× more agent tokens than you need to. Every AI agent fetch is an inefficient one, and on most marketing pages there are now substantially more AI agent fetches than human visits. A SOM endpoint is the single highest-leverage performance optimisation available to a publisher in 2026, and it is one that no human visitor will ever see.

The second: if your compression ratio is below ten, the right move is not to skip SOM. The right move is to ship SOM and use it as a forcing function for editorial discipline. Pages where SOM is barely better than HTML are pages where the underlying HTML is bloated; the ratio is a rough proxy for how much of your page is content versus apparatus. A useful number, even if it’s embarrassing.

The third: the failed-sites list is the most uncomfortable part of the public benchmark and the most important. Seven sites are currently invisible to AI agents because of anti-bot infrastructure or aggressive JavaScript-only rendering. Those publishers are paying full hosting and rendering cost for traffic from agents that cannot read the page, and they are doing so without knowing it. The long-term fix is not to harden the wall but to publish a SOM endpoint that returns the same content in a form designed for the visitor. This is a single deploy, not a project.

What changes the next time we update this

Three things are due in the next one to three benchmark runs and will materially shift the dataset:

  • Plasmate v0.6.0 is expected to add structured-data extraction (JSON-LD, microdata, OpenGraph) directly into the SOM meta block. This will not change ratios meaningfully but will raise the floor on what a SOM document carries.
  • Vertical expansion. The benchmark is currently weighted toward the marketing-page surface of the web. Adding e-commerce category and product pages, long-form journalism with comments, and forum / community pages will broaden the distribution and likely lower the average (these categories tend to compress less dramatically) while exposing new failure modes.
  • Third-party submissions under Protocol v1.0. The first non-Plasmate tool that publishes results under the same protocol — and the protocol is open, documented at /directives and webtaskbench.com/protocol — will be the first chance to compare engines on equal footing. Until that arrives, the dataset is single-engine, which is a bound on how strong the conclusions can be.

Bottom line

Three weeks of public benchmarks across thirty-eight production sites suggest that the web’s typical page is between an order of magnitude and two orders of magnitude more expensive for AI agents to read than it needs to be. The fix is not a new framework, a new browser, or a new tax on the publisher. It is a small structured endpoint, advertised in five lines of robots.txt, that returns the agent-relevant content of a page in agent-native form. The numbers are now public, reproducible, and updated weekly.

For the underlying methodology, see How to Read the WebTaskBench Leaderboard. For the specification this benchmark validates against, see the SOM/1.0 specification. To check whether your own site is currently SOM-ready, use somready.com. To compare a single URL live, use somordom.com.

Related context: the cost numbers above are what motivates the layered approach outlined in SOM vs llms.txt: When to Use Which, the protocol-versus-format distinction in SOM vs MCP: How Publishers and Agents Are Different Problems, and the historical framing in The Web’s Second Reader.

The next quarterly retrospective will publish after the next major Plasmate release and the addition of e-commerce and community-forum verticals to the registry.