The spatial intelligence stack: capture, understand, act

“Spatial intelligence” sounds abstract until you break it into layers. Once you do, it becomes a checklist you can actually build against — and a way to spot which layer a given product is really competing on. I think of it as three stacked jobs: capture, understand, and act.

The three layers of spatial intelligence.

Layer 1 — Capture

Before software can do anything with space, it has to sense it. A decade ago this needed specialist rigs; now any modern phone can scan a room, and lidar, depth cameras, and photogrammetry fill in the rest. The output is raw: a point cloud, a mesh, a stream of frames. On its own it’s just geometry — useful, but dumb. Capture is increasingly a solved, commodity layer. Competing here alone is a race to the bottom.

Layer 2 — Understand

This is where it gets interesting. Understanding turns raw geometry into meaning: that surface is a floor, that shape is a chair, that flow of movement means people avoid this corner. It’s the difference between a scan of a room and a system that knows it’s a room. This is also where behaviour lives (gaze, dwell, hesitation) and where most of the durable value sits, because interpretation is hard to get right and the advantage compounds over time.
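To make the "that surface is a floor" step concrete, here is a minimal sketch of how a planar patch from a scan might be labelled from its normal and height alone. Everything in it is an assumption for illustration: the `Patch` type, the `label_patch` function, the z-up convention, and the thresholds are hypothetical, and real systems typically rely on learned models rather than hand-set rules.

```python
import math
from dataclasses import dataclass


@dataclass
class Patch:
    """A planar patch from a scan (hypothetical type for illustration)."""
    normal: tuple[float, float, float]  # surface normal, any non-zero length
    height: float                       # centroid height above the lowest scanned point, in metres


def label_patch(patch, up=(0.0, 0.0, 1.0), tol_deg=15.0,
                floor_max=0.2, ceiling_min=2.0):
    """Label a patch using only orientation and height.

    Assumes a z-up coordinate frame; all thresholds are illustrative guesses.
    """
    nx, ny, nz = patch.normal
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    if length == 0:
        return "unknown"

    # Cosine of the angle between the patch normal and the up direction.
    cos_up = (nx * up[0] + ny * up[1] + nz * up[2]) / length

    # A horizontal surface has a normal (anti)parallel to up;
    # a vertical surface has a normal perpendicular to up.
    is_horizontal = abs(cos_up) >= math.cos(math.radians(tol_deg))
    is_vertical = abs(cos_up) <= math.sin(math.radians(tol_deg))

    if is_horizontal:
        if patch.height <= floor_max:
            return "floor"
        if patch.height >= ceiling_min:
            return "ceiling"
        return "raised surface"  # e.g. a table top or shelf
    if is_vertical:
        return "wall"
    return "unknown"
```

Even this toy rule shows why the layer is hard: a sloped ramp or a low shelf falls straight into "unknown" or the wrong bucket, which is where real interpretation work begins.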

Layer 3 — Act

Understanding is wasted if nothing happens with it. The action layer is where intelligence touches a person or a decision: anchoring a digital object to a real surface, surfacing the right information at the right place, or feeding a recommendation back to a planner or designer. Good action feels effortless and almost invisible — the answer is simply there, in the space, when you need it.
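As one small illustration of "anchoring a digital object to a real surface", the sketch below snaps a 3D point onto a detected plane by orthogonal projection. The function name `anchor_to_plane` and the plane representation (a point on the plane plus a normal) are assumptions made for this example, not the API of any particular AR framework.

```python
def anchor_to_plane(point, plane_point, plane_normal):
    """Project `point` onto the plane through `plane_point` with normal `plane_normal`.

    Hypothetical helper: all arguments are (x, y, z) tuples, and the normal
    does not need to be unit length.
    """
    normal_len_sq = sum(c * c for c in plane_normal)
    if normal_len_sq == 0:
        raise ValueError("plane_normal must be non-zero")

    # Offset of the point from the plane along the normal,
    # scaled so that subtracting offset * normal lands on the plane.
    offset = sum(
        (p - q) * n for p, q, n in zip(point, plane_point, plane_normal)
    ) / normal_len_sq

    return tuple(p - offset * n for p, n in zip(point, plane_normal))


# A virtual object hovering 0.5 m above a floor at z = 0 is snapped down onto it.
anchored = anchor_to_plane((1.0, 2.0, 0.5), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
```

The geometry is the easy part. Which plane to anchor to, and whether the object belongs there at all, is decided by the understanding layer.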

Most “AR apps” only do capture and a thin slice of action. The winners go deep on understanding — that’s the moat.

Why the model is useful

When you’re evaluating a product, an idea, or a vendor, ask which layer it actually competes on. A clever overlay with no real understanding is fragile. A system that genuinely understands space and behaviour — like a VR space audit reading emotional response, or AR crowdsourcing turning placements into planning data — has something defensible. And tools like Magic show the same logic in content: the value isn’t the render, it’s the understanding that makes one button enough.

If you want the bigger picture of why this matters now, start with what spatial intelligence is.
