5 min read · Benchmarks · v0.5.0 · News & Media

TechCrunch Was Blocked. Now It's 77×. What Changed?

When we ran the first WebTaskBench benchmark in early April, TechCrunch was on the failure list. Not because the site is poorly structured or agent-hostile by policy, but because Cloudflare's anti-bot protection was returning a challenge page instead of content. A challenge page contains almost no extractable content: its raw HTML is a handful of tokens, and the resulting SOM was empty. We marked it as a failed fetch and moved on.

After upgrading to Plasmate v0.5.0 and re-running the benchmark, TechCrunch fetched cleanly: 108,481 HTML tokens compressed to 1,398 SOM tokens. A ratio of 77.6× — the highest compression we have measured on any news site, and among the highest in the entire benchmark.

That number is worth pausing on. An AI agent reading TechCrunch through raw HTML consumes roughly the same token budget as reading a short novel. The same page, served as SOM, fits comfortably in a single fast API call. The content is equivalent. The cost is not.

What v0.5.0 changed

The TechCrunch fix is a byproduct of improved browser fingerprinting in v0.5.0. Cloudflare's challenge pages are triggered by HTTP client signatures that look automated: missing headers, incorrect TLS extension ordering, suspicious ALPN values. Plasmate's v0.5.0 release tightened its browser emulation at the network layer, making its requests indistinguishable from a standard Chrome session to most bot-detection systems. TechCrunch was not a specific target; it was simply one of the sites that had been blocking Plasmate for this reason.
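The TLS-layer signals live below HTTP and cannot be reproduced in a few lines, but the header-level side of the story is easy to sketch. The toy scorer below uses real Chrome request-header names; the scoring heuristic itself is invented for illustration and is not how Cloudflare or Plasmate actually work.

```python
# Toy illustration of header-level bot signals. Real bot detection also
# inspects TLS extension ordering and ALPN values, which sit below the
# HTTP layer and are not visible here.

EXPECTED_BROWSER_HEADERS = [
    "user-agent", "accept", "accept-language", "accept-encoding",
    "sec-ch-ua", "sec-fetch-mode", "sec-fetch-site",
]

def automation_score(headers: dict) -> int:
    """Count missing browser-typical headers; higher looks more automated."""
    present = {name.lower() for name in headers}
    return sum(1 for h in EXPECTED_BROWSER_HEADERS if h not in present)

# A bare HTTP client sends almost none of the headers a browser would.
bare_client = {"User-Agent": "python-requests/2.31"}

# A Chrome session sends a full complement of them.
chrome_like = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
                  " (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-CH-UA": '"Chromium";v="120", "Google Chrome";v="120"',
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
}
```

Even this crude count separates the two: the bare client misses six of the seven expected headers, the Chrome-like session misses none. Production fingerprinting applies the same idea across far more signals, which is why the fix had to happen at the network layer rather than in a header dictionary.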

A second change had a more systematic effect on news sites specifically: GDPR consent overlay stripping. Publishers serving European visitors are legally required to present cookie consent banners, and many US publishers show similar banners to satisfy state privacy laws. These banners are implemented as full-page overlays with substantial DOM footprints. The New York Times consent layer, for example, contains several hundred elements, multiple tracking script declarations, and a complete alternative navigation structure for non-consenting users. None of this is content. All of it inflates HTML token counts.
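The DOM-footprint claim is easy to check mechanically. The sketch below is a toy counter built on Python's stdlib html.parser; the marker list and the detect-by-id/class heuristic are illustrative assumptions, not Plasmate's actual detector, and a real implementation would need a proper DOM to handle void elements correctly.

```python
from html.parser import HTMLParser

# Illustrative markers; real consent-management platforms use many more.
CONSENT_MARKERS = ("cookie", "consent", "gdpr", "cmp")

class OverlayCounter(HTMLParser):
    """Count elements inside vs. outside consent-overlay subtrees."""

    def __init__(self):
        super().__init__()
        self.stack = []      # True = current element is inside a flagged subtree
        self.stripped = 0    # elements a consent stripper could drop
        self.kept = 0        # elements that are potentially content

    def handle_starttag(self, tag, attrs):
        flagged = any(
            marker in (value or "").lower()
            for name, value in attrs
            if name in ("id", "class")
            for marker in CONSENT_MARKERS
        )
        inside = bool(self.stack) and self.stack[-1]
        self.stack.append(inside or flagged)
        if self.stack[-1]:
            self.stripped += 1
        else:
            self.kept += 1

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()
```

Run against a saved news homepage, a counter like this makes the overhead concrete: every element under the consent layer is token cost with zero content value, which is exactly what overlay stripping removes before extraction.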

Plasmate v0.5.0 detects and removes these overlays before extraction. The effect on our benchmark was immediate. The New York Times moved from an estimated 43× compression ratio to a measured 59.9×. The Guardian, which has a particularly complex consent implementation, showed different results — its SOM is now larger relative to its HTML because Plasmate is correctly capturing more article-body content, including elements that were previously hidden behind the consent layer. The compression ratio dropped; the quality of what is captured increased.

The measurement problem

This creates an interesting tension in how we report benchmark numbers. Compression ratio is a useful proxy for efficiency, but it is not the same thing as usefulness to an agent. A SOM that captures 95% of the article body at 13× compression is more useful than one that captures 40% of the article body at 43× compression. The Guardian's apparent regression in our numbers is likely an improvement in practice.

Plasmate's own internal benchmarks for v0.5.0 report a 4× average compression across 100 agent tasks on 50 real websites. Our measurement of 17.7× average on 45 sites uses a different methodology: we compare raw HTML bytes to SOM JSON bytes, without normalization for content relevance. Neither number is wrong. They measure different things. Plasmate's number answers “how much cheaper is an agent task?”. Ours answers “how much smaller is the structured representation?”
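The two questions can be written down as two ratios over the same fetch. A minimal sketch follows; the function names are ours for illustration, not Plasmate's or WebTaskBench's API.

```python
def size_ratio(html_tokens: int, som_tokens: int) -> float:
    """Our benchmark's question: how much smaller is the representation?"""
    return html_tokens / som_tokens

def task_cost_ratio(html_tokens_read: int, som_tokens_read: int) -> float:
    """Plasmate's question: how much cheaper is an agent task? An agent
    working through raw HTML rarely reads the whole page, so this ratio
    is naturally smaller than the size ratio for the same site."""
    return html_tokens_read / som_tokens_read

# TechCrunch figures from this post:
print(int(size_ratio(108_481, 1_398)))   # 77
```

A 77× size ratio and a 4× task-cost ratio can describe the same page without contradiction: the denominator of the first is the full structured representation, while the second compares only the tokens an agent actually consumed to finish a task.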

The distinction matters because publishers making implementation decisions care primarily about the first question. How much will this reduce the cost of agents reading my site? That depends on what those agents are doing, not just on how compressible the page is.

What the numbers suggest for publishers

Across the news vertical specifically, the v0.5.0 data shows compression ratios ranging from 13× to 78×, with an average in the high thirties. These are homepage measurements; article pages, which have higher signal-to-noise ratios, tend to compress even more favorably.

For a publisher receiving meaningful AI agent traffic (crawlers researching recent coverage, summarization services, research agents building knowledge bases), the economics of serving SOM vs. raw HTML are straightforward. At 40× compression and standard LLM API pricing of $3 per million input tokens, a raw news page of roughly 40,000 HTML tokens costs about $0.12 per read in model fees, while its 1,000-token SOM costs about $0.003. Over one million agent page-reads, that is roughly $120,000 versus $3,000. The SOM Directives robots.txt implementation that routes agents to the SOM endpoint takes about five minutes to add.
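The arithmetic behind that comparison is a one-line fee formula. In the sketch below, the per-page token counts are illustrative assumptions chosen to give a 40× ratio, not measured benchmark averages:

```python
def model_fee_usd(page_reads: int, tokens_per_page: int,
                  usd_per_million_tokens: float) -> float:
    """Input-token model fees for a given volume of page fetches."""
    return page_reads * tokens_per_page * usd_per_million_tokens / 1_000_000

# Assumed per-page sizes (illustrative, yielding a 40x compression ratio):
RAW_HTML_TOKENS = 40_000
SOM_TOKENS = 1_000
PRICE = 3.0               # $3 per million input tokens
READS = 1_000_000

raw_cost = model_fee_usd(READS, RAW_HTML_TOKENS, PRICE)   # 120000.0
som_cost = model_fee_usd(READS, SOM_TOKENS, PRICE)        # 3000.0
```

Whatever the absolute page sizes turn out to be, the saving factor equals the compression ratio: here raw_cost / som_cost is exactly 40.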

The barrier is not technical. It is awareness.

Next in the data

The WebTaskBench benchmark now runs automatically each week against the latest Plasmate release. As v0.5.x improvements continue — particularly in table extraction, ARIA state handling, and the compile subcommand for publisher-side pre-rendering — we will track the real-world effect on compression ratios across all three verticals.

The TechCrunch story is, in a small way, a model for what this ecosystem is trying to do. A site that was invisible to structured agent fetching is now accessible at 77× compression. No change was required on TechCrunch's side. The improvement came from the tooling getting better. When publishers add SOM Directives, agents that respect them get the same result automatically, without waiting for the next round of tooling improvements on either side.