AI has a blind spot

Agent accessibility and the promise of browser use

The next frontier for agentic AI is browser use. Browsers are multi-modal and stateful in nature, which pushes the limits of what AI can do. But we can mitigate the pain on our end as developers. Let me propose a few ideas on how to future-proof our web interfaces today.1

A step change

These past 12 months saw revolutionary advances with multi-modality and Computer-Using Agents, or CUAs. And while that’s exciting, current foundation models still struggle to understand software intuitively the way humans do. Having tested the whole smorgasbord of agentic tools recently, I find this missing skill increasingly evident. And it’s holding AI assistants back.


Lossy interoperation

At work, we experiment with fully autonomous front-end development. Instead of using AI to augment human development workflows, this means letting AI solve tasks end-to-end.2 The bottom line: not terrible, not great either. To debug implementations in the browser and get somewhat interesting results, we need several expert models working together in a multi-agent network.3
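To make the moving parts concrete, here is a rough sketch of such a closed loop. Every interface and prompt below is hypothetical; a real setup would wire this up with whatever agent SDK and sandboxing happens to be in use.

```ts
// Hypothetical closed loop: planner -> coder -> computer-use expert -> judge.
// None of these interfaces map to a specific product or SDK.
interface Agent {
  run(input: string): Promise<string>;
}

async function solveTaskEndToEnd(
  task: string,
  planner: Agent, // general-purpose model that plans the work
  coder: Agent,   // writes and edits front-end code in a sandbox
  cua: Agent,     // computer-use expert that verifies results in the browser
  judge: Agent,   // LLM-as-a-judge deciding pass or fail
  maxIterations = 5
): Promise<boolean> {
  let plan = await planner.run(`Plan this front-end task: ${task}`);
  for (let i = 0; i < maxIterations; i++) {
    const change = await coder.run(plan);
    // Hand-over point: the CUA only gets a textual description of the change.
    const observation = await cua.run(`Open the app and verify: ${change}`);
    const verdict = await judge.run(`Task: ${task}\nObservation: ${observation}`);
    if (verdict.includes("PASS")) return true;
    // And back again: the planner only sees a textual observation of the screen.
    plan = await planner.run(`Revise the plan. Browser feedback: ${observation}`);
  }
  return false;
}
```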

If I had to give an estimate, I’d say this gets us to 51%.4

Why is the number that low? Interoperability. The hand-over between the orchestrator agent and its computer-use counterparts is incredibly lossy. It’s like watching a group of toddlers try to assemble an IKEA shelf: lovely to watch, but it ultimately amounts to nothing.5

[Diagram: an orchestrator agent coordinating browser, CLI, and code agents]

A hard problem

When I talk about computer use, I really mean browser use. The browser is a highly valuable and well-defined problem space.6 Productivity apps keep moving to the web, for better or worse. Figma is the de-facto standard for UI/UX design. Microsoft built the latest Office 365 on web technologies. The list goes on.

Why is browser use so incredibly hard? To humans, using a graphical user interface, or GUI, feels like second nature. We recognize that a floppy disk icon means save. Learned patterns of interaction. Collective memory. Intuition. We know instinctively how to drag and drop or scroll. Or we just brute-force our way through a new application.

AI, however, has to work far harder here than it does in its home paradigm of token prediction. A textbook case of Moravec’s Paradox.

Tasks that are effortless for humans can be fiendishly hard for machines.

Grounded understanding

Successfully operating a browser involves a mix of skills that AI has traditionally handled in isolation but not in combination: visual understanding, context and planning, and precise action execution.

First, the CUA must interpret visual concepts from a mass of pixels. While some browser-automation agents simply read the underlying accessibility tree, a general CUA sees the screen as a stream of raw images. It needs to visually ground its understanding, localize semantic regions in pixel space, and figure out what those painstakingly crafted icons represent. That is far more complex than parsing text input or structured data like JSON.
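The contrast between the two access paths is easy to sketch. Assuming Playwright as the automation layer, structured access is a one-liner, while pixel grounding needs a vision model; the groundElement call below is a hypothetical stand-in, not a real API.

```ts
import { chromium } from "playwright";

async function clickSave(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Path 1: structured access. The DOM / accessibility tree exposes the
  // control by role and name; no pixels need to be interpreted.
  await page.getByRole("button", { name: "Save" }).click();

  // Path 2: pixel grounding. A general CUA only sees a screenshot and has to
  // localize the floppy-disk icon itself before it can click coordinates.
  const screenshot = await page.screenshot();
  // const { x, y } = await groundElement(screenshot, "save icon"); // hypothetical vision model call
  // await page.mouse.click(x, y);

  await browser.close();
}
```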

Now imagine that, on top of this, an orchestrator model with its own planning memory needs to be told what is happening on screen. It is easy to see how this amount of complexity can fail in many places.

Measuring success

Browser use, and computer use in general, is also inherently hard to measure. What constitutes success? Which use cases should be included in testing? I don’t see a standard materializing among the competing benchmarks.7 In academia, a critical debate about the reward design of some larger benchmarks is ongoing.

Often, general computer use test suites are combined with synthetic benchmarks that test visual grounding or GUI understanding.

By anyone’s measure, computer use is hard. Claude Sonnet, a model with comparatively abysmal visual understanding, ranks first on the widely used OSWorld benchmark at 44%. Which is puzzling to me.

Overall, not a great situation.

Agent accessibility

The GUI or browser-use problem is so tough, in fact, that I believe we can only solve it from both ends. At least for now. Instead of just building agents that can browse the web, we should be building a web that is accessible to agents.8

A recent study on agent-web interactions points out that modern agents “never scroll beyond two viewports” and ignore purely visual CTAs.9 Agents respond more reliably when semantic information is available next to visual elements. The study then proposes design principles like semantic overlays, hidden labels and even standardized placement of UI elements to increase agent accessibility.
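In practice, a hidden label is a small change. Here is a minimal sketch of an icon-only CTA that exposes its intent to agents (and screen readers) without changing its visual design; it is my illustration, not code from the cited study, and the class names are assumed utility classes.

```ts
// Build an icon-only "save" CTA whose meaning is machine-readable.
function createAgentAccessibleCta(): HTMLButtonElement {
  const button = document.createElement("button");
  button.type = "button";

  // Purely visual icon: hidden from the accessibility tree on purpose.
  const icon = document.createElement("span");
  icon.className = "icon icon-floppy"; // assumed icon utility class
  icon.setAttribute("aria-hidden", "true");

  // Hidden label: visually invisible text that gives the button an
  // accessible name agents can read without interpreting pixels.
  const label = document.createElement("span");
  label.className = "visually-hidden"; // assumed CSS clipping utility
  label.textContent = "Save document";

  button.append(icon, label);
  return button;
}

document.body.appendChild(createAgentAccessibleCta());
```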

The gist of it all: know the limitations, alleviate the pain. Think of agents as a new group of sensorially challenged users. Agent accessibility is a superset of human accessibility.

Starting today, we can quality-test every application we build against a catalog of criteria relevant to agent accessibility. And we can work to make our extended development environment more agent-accessible.

That includes interfaces like Storybook and Figma. Following the RTFM mindset, agents get a condensed cheat sheet that explains how to use the respective GUI.

We wrap interfaces in CLI tools so that they are safely authenticated and target elements can be debugged in isolation. Consent overlays or cookie banners are removed or replaced. We prefer reduced motion by default.

And so forth.
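Wired together with a browser automation layer, those measures might look roughly like the following sketch. It assumes Playwright; the storage-state path, blocked route pattern, and consent storage key are all illustrative.

```ts
import { chromium } from "playwright";

// Spin up an agent-friendly browser context: pre-authenticated,
// reduced motion by default, and no consent overlays in the way.
export async function newAgentContext() {
  const browser = await chromium.launch();
  const context = await browser.newContext({
    reducedMotion: "reduce",           // prefer reduced motion by default
    storageState: "auth/session.json", // assumed pre-authenticated session file
  });

  // Block third-party consent scripts before they ever load.
  await context.route("**/*consent*", (route) => route.abort());

  // Pretend consent was already given so in-house banners never render.
  await context.addInitScript(() => {
    window.localStorage.setItem("cookie-consent", "accepted"); // hypothetical storage key
  });

  return { browser, context };
}
```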

Future-proofing

What does this mean for web applications or regular, read-only websites?

Think about how agents might discover your internal GUIs in a corporate network. Use sandboxed agents with browser access and establish your RTFM context.10 Test your UI with one of the advanced CUAs, just like you would with a screen reader. Make sure your next internal web application complies with the latest a11y standards.
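Because agent accessibility is a superset of human accessibility, existing a11y tooling is a reasonable starting point for that compliance check. A minimal sketch, assuming Playwright Test together with @axe-core/playwright; the URL is a placeholder for your internal application.

```ts
import { test, expect } from "@playwright/test";
import AxeBuilder from "@axe-core/playwright";

// Fail the build if the internal app violates WCAG 2.0 A/AA rules.
test("internal app has no detectable a11y violations", async ({ page }) => {
  await page.goto("https://intranet.example.com/app"); // placeholder URL

  const results = await new AxeBuilder({ page })
    .withTags(["wcag2a", "wcag2aa"])
    .analyze();

  expect(results.violations).toEqual([]);
});
```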

Agent accessibility does not equal buzzword bingo. It’s the draft stage of a future UI contract between humans and agents.

A win for both teams.

Footnotes

  1. This post contains 0% AI-generated text. ↩︎

  2. Which involves context engineering the problem space, running sandboxes, planning tasks with a suitable general-purpose agent model, offloading computer use to a CUA expert model, and running an orchestrator or LLM-as-a-judge in a closed loop. ↩︎

  3. Claude itself is abysmal at visual understanding. Don’t take my word for it; see, for example, Andrew Chan’s PBF Bench. Add to this the economic issue of constantly encoding image data: a single 1000×1000 screenshot can add roughly 5,000 tokens to the prompt. ↩︎

  4. Based on anecdotal evidence from our team. Code appears relatively clean, conforms to formatting guidelines. Results are well covered by unit tests. Component variants are mostly there. UI might be vastly off. ↩︎

  5. Planning, memory, general context awareness. Some computer use models like VyUI are clearly ahead of the competition but run on-device and lack the planning and coding skills of their SOTA peers like Claude or Gemini. ↩︎

  6. Web standards to rely on, well-documented browser APIs, and markup always available as a fallback. ↩︎

  7. For visual grounding, GroundUI and ScreenSpot are often found in marketing material. Older benchmarks like WebArena, OSWorld and Mind2Web appear questionable in their reward designs. Others like OpenAI’s BrowseComp or WebShop focus on specific tasks like information retrieval or shopping, respectively. ↩︎

  8. Lú et al. (2025): Build the web for agents, not agents for the web. arxiv ↩︎

  9. Nitu, Mühle & Stöckl (2025): Machine-Readable Ads: Accessibility and Trust Patterns for AI Web Agents interacting with Online Advertisements. arxiv ↩︎

  10. Delivered via MCP, but be aware of Context Rot. ↩︎