Open Specification v1.0

Structured Object Model

An open specification for representing web pages as structured JSON. SOM defines typed element roles, stable identifiers, and semantic regions — producing documents an order of magnitude smaller than raw HTML.

Version 1.0  ·  Apache 2.0 License  ·  W3C Community Group Proposal
json
{
  "som_version": "1.0",
  "url": "https://example.com",
  "title": "Example Domain",
  "regions": [{
    "id": "r_main",
    "role": "main",
    "elements": [{
      "id": "e_3f8a2b1c",
      "role": "heading",
      "text": "Example Domain",
      "attrs": { "level": 1 }
    }, {
      "id": "e_9d4e7f2a",
      "role": "paragraph",
      "text": "This domain is for use in illustrative examples."
    }, {
      "id": "e_1a5c8b3e",
      "role": "link",
      "text": "More information...",
      "attrs": { "href": "https://www.iana.org/domains/example" },
      "actions": ["click"]
    }]
  }],
  "meta": {
    "html_bytes": 1256,
    "som_bytes": 312,
    "element_count": 3,
    "compression_ratio": 4.0
  }
}
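The meta block in the example can be sanity-checked with a few lines of Python. This is a minimal sketch that assumes compression_ratio is html_bytes divided by som_bytes, rounded to one decimal place; the specification text shown here does not state the formula, and the helper name compression_ratio is hypothetical.

```python
import json

# The "meta" block copied from the example document above.
meta = json.loads(
    '{"html_bytes": 1256, "som_bytes": 312, '
    '"element_count": 3, "compression_ratio": 4.0}'
)


def compression_ratio(html_bytes: int, som_bytes: int) -> float:
    """Assumed definition: source HTML size over SOM size, to one decimal."""
    return round(html_bytes / som_bytes, 1)


ratio = compression_ratio(meta["html_bytes"], meta["som_bytes"])
print(ratio, ratio == meta["compression_ratio"])
```

Under that assumption the computed value agrees with the compression_ratio reported in the example.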

Format Analysis

Related Formats

Prior to SOM, agent pipelines consumed web content as raw HTML, stripped Markdown, or accessibility trees — each a repurposing of a format designed for other consumers. The table below characterises the trade-offs.

| Property | Description | HTML | Markdown | A11y Tree | SOM |
|---|---|---|---|---|---|
| Token overhead | Relative to content density | High | Moderate | Moderate | Minimal |
| Structural typing | Typed element roles and semantic regions | None | None | Partial | Complete |
| Interactivity preserved | Clickable, typeable, scrollable elements | Raw attributes | Not preserved | Present | Typed with actions |
| Stable element IDs | Reproducible across independent fetches | None | None | None | SHA-256 derived |
| Publisher-servable | Cacheable as an alternate representation | Yes | Yes | No | Yes |
| Approx. tokens per page | Median across 51 representative sites | ~80,000 | ~12,000 | ~8,000 | ~4,600 |

Token estimates derived from the Plasmate benchmark suite (51 sites, April 2026). A11y Tree figures represent Playwright accessibility snapshot output. SOM figures represent plasmate fetch output without selector filtering.
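To put the token figures in relative terms, the short calculation below divides each format's approximate median by the SOM median. It is only arithmetic on the rounded values quoted in the table; the function name tokens_relative_to_som is a hypothetical helper, not part of any benchmark tooling.

```python
# Approximate median tokens per page, copied from the table above.
tokens_per_page = {
    "HTML": 80_000,
    "Markdown": 12_000,
    "A11y Tree": 8_000,
    "SOM": 4_600,
}


def tokens_relative_to_som(tokens: dict) -> dict:
    """Return each non-SOM format's token count as a multiple of SOM's."""
    som = tokens["SOM"]
    return {
        name: round(count / som, 1)
        for name, count in tokens.items()
        if name != "SOM"
    }


ratios = tokens_relative_to_som(tokens_per_page)
for name, multiple in ratios.items():
    print(f"{name}: about {multiple}x the tokens of SOM")
```

On these rounded figures, raw HTML comes out at roughly 17 times the SOM token count, Markdown at about 2.6 times, and the accessibility tree at about 1.7 times.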

Specification

The Specification

SOM v1.0 defines a compact, typed JSON representation of web pages.

Every SOM document is a single JSON object with the following top-level fields:

  • som_version (string, required) - Specification version, currently "1.0"
  • url (string, required) - The canonical URL of the source page
  • title (string, required) - The document title extracted from the page
  • lang (string, optional) - BCP 47 language tag (e.g., "en", "fr")
  • regions (array, required) - Ordered list of semantic page regions
  • meta (object, required) - Compression and structure metadata
  • structured_data (object, optional) - Extracted semantic data (JSON-LD, OpenGraph, etc.)
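The field list above maps directly onto a small top-level check. The following is a minimal sketch, not the reference validator: it tests only for the presence and JSON type of the listed fields, and the names REQUIRED_FIELDS, OPTIONAL_FIELDS, and validate_top_level are hypothetical.

```python
# Field names and types taken from the list above (JSON type -> Python type).
REQUIRED_FIELDS = {
    "som_version": str,
    "url": str,
    "title": str,
    "regions": list,
    "meta": dict,
}
OPTIONAL_FIELDS = {
    "lang": str,
    "structured_data": dict,
}


def validate_top_level(doc: dict) -> list:
    """Return a list of problems; an empty list means the top level looks valid."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in doc:
            problems.append(f"missing required field: {name}")
        elif not isinstance(doc[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    for name, expected in OPTIONAL_FIELDS.items():
        # Optional fields are only type-checked when present.
        if name in doc and not isinstance(doc[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    return problems


example = {
    "som_version": "1.0",
    "url": "https://example.com",
    "title": "Example Domain",
    "regions": [],
    "meta": {},
}
print(validate_top_level(example))
```

A fuller validator would also check the contents of regions and meta, which this sketch leaves out.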

The document structure is intentionally flat. The nesting depth is fixed: the document contains regions, and regions contain elements. This avoids the deeply nested trees that make HTML expensive for LLMs to process.
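Because regions hold their elements directly, two plain loops are enough to visit every element, with no recursion. The sketch below illustrates this on a trimmed copy of the example document; iter_elements and actionable_ids are hypothetical helper names, and the use of the actions field to find interactive elements follows the example rather than any normative text.

```python
from typing import Iterator, Tuple

# Trimmed copy of the example document shown earlier.
doc = {
    "regions": [{
        "id": "r_main",
        "role": "main",
        "elements": [
            {"id": "e_3f8a2b1c", "role": "heading", "text": "Example Domain"},
            {"id": "e_9d4e7f2a", "role": "paragraph",
             "text": "This domain is for use in illustrative examples."},
            {"id": "e_1a5c8b3e", "role": "link",
             "text": "More information...", "actions": ["click"]},
        ],
    }],
}


def iter_elements(som: dict) -> Iterator[Tuple[str, dict]]:
    """Yield (region_id, element) pairs in document order."""
    for region in som["regions"]:
        for element in region["elements"]:
            yield region["id"], element


def actionable_ids(som: dict) -> list:
    """Collect IDs of elements that declare at least one action."""
    return [el["id"] for _, el in iter_elements(som) if el.get("actions")]


print(actionable_ids(doc))
```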


Quick Start

Get Started

Install the reference implementation and start converting pages to SOM in seconds.

bash
# Install
npm install -g plasmate
# or: brew install plasmate-labs/tap/plasmate
# Fetch any page as SOM
plasmate fetch https://example.com
# With selector to strip nav/footer
plasmate fetch https://example.com --selector main
# Compile existing HTML to SOM
cat page.html | plasmate compile
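
The same commands can be driven from a script. The sketch below wraps plasmate fetch with Python's subprocess module, using only the subcommand and the --selector flag shown above. It assumes the CLI prints the SOM document as JSON on standard output, which this page implies but does not state; build_fetch_command and fetch_som are hypothetical helper names.

```python
import json
import subprocess
from typing import List, Optional


def build_fetch_command(url: str, selector: Optional[str] = None) -> List[str]:
    """Build the argv for `plasmate fetch`, optionally with --selector."""
    command = ["plasmate", "fetch", url]
    if selector is not None:
        command += ["--selector", selector]
    return command


def fetch_som(url: str, selector: Optional[str] = None) -> dict:
    """Run the CLI and parse its output.

    Assumption: the SOM document is written to stdout as JSON.
    Raises CalledProcessError if the command exits with a non-zero status.
    """
    result = subprocess.run(
        build_fetch_command(url, selector),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


print(build_fetch_command("https://example.com", selector="main"))
```

Calling fetch_som("https://example.com") requires plasmate to be installed and on the PATH; passing the arguments as a list avoids shell quoting issues with URLs.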