How We Built Browser Automation with MCP: A Technical Deep Dive

Most AI platforms connect to external services through REST APIs. That works for structured data, such as reading a CRM contact or creating a Jira ticket, but it misses everything happening inside the browser: the email draft in Gmail, the Slack thread you’re reading, the half-filled form in ServiceNow.

We built a browser extension that bridges this gap using the Model Context Protocol (MCP). This post covers the architecture decisions, the problems we hit, and how the system works end to end.

The Problem: Browser State is Invisible to APIs

APIs give you database records. They don’t give you what’s on screen. A sales rep composing a follow-up email in Gmail has context that no API can capture — the tone, the half-written paragraphs, the tabs open alongside it. We wanted AI workflows to operate on this live browser context the way a human assistant sitting next to you would.

Architecture Overview

The extension is built on WXT 0.20 (Manifest V3) with Svelte 5 and TypeScript. It implements an MCP client that connects to the JieGou server via WebSocket, exposing 60+ browser automation tools that AI workflows can invoke.

The message flow looks like this:

MCP Server → WebSocket → Background Service Worker → Content Script → Page

When a recipe or workflow needs browser interaction, it sends a JSON-RPC 2.0 tool call through the WebSocket connection. The extension’s background worker routes it to the appropriate tool executor, which injects a content script into the active tab and returns the result.
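To make the routing step concrete, here is a minimal sketch of how a background worker might dispatch a tools/call request to an executor. The envelope fields follow JSON-RPC 2.0, and the tools/call method with { name, arguments } params follows MCP. The ToolExecutor type, the routeToolCall function, and the choice of error codes are illustrative assumptions, not our actual implementation.

```typescript
// JSON-RPC 2.0 request envelope carrying an MCP tools/call.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number | string;
  method: string;
  params?: { name: string; arguments?: Record<string, unknown> };
}

interface JsonRpcResponse {
  jsonrpc: "2.0";
  id: number | string;
  result?: unknown;
  error?: { code: number; message: string };
}

// Hypothetical executor signature; a real executor would talk to a tab.
type ToolExecutor = (args: Record<string, unknown>) => Promise<unknown>;

async function routeToolCall(
  request: JsonRpcRequest,
  executors: Map<string, ToolExecutor>,
): Promise<JsonRpcResponse> {
  const base = { jsonrpc: "2.0" as const, id: request.id };
  if (request.method !== "tools/call" || !request.params) {
    // -32601 is the JSON-RPC 2.0 "Method not found" code.
    return { ...base, error: { code: -32601, message: `Unsupported method: ${request.method}` } };
  }
  const executor = executors.get(request.params.name);
  if (!executor) {
    // -32602 is the JSON-RPC 2.0 "Invalid params" code.
    return { ...base, error: { code: -32602, message: `Unknown tool: ${request.params.name}` } };
  }
  try {
    const result = await executor(request.params.arguments ?? {});
    return { ...base, result };
  } catch (err) {
    // -32603 is the JSON-RPC 2.0 "Internal error" code.
    const message = err instanceof Error ? err.message : String(err);
    return { ...base, error: { code: -32603, message } };
  }
}
```

The response always echoes the request id, which is what lets the caller match results to calls when several are in flight on the same socket.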

WebSocket Bridge

The WebSocket proxy handles authentication, heartbeats, reconnection, and token refresh.

Authentication happens on connect: the client sends an authenticate message carrying a JWT. The server validates it and then begins accepting tool calls.

Heartbeats run every 15 seconds at the application level (not relying on WebSocket protocol pings). If a pong doesn’t arrive within 5 seconds, the connection is considered dead and reconnection starts.
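The heartbeat timing can be sketched as a small state machine. This version takes the current time as an argument so it runs without timers or a socket; the HeartbeatMonitor class and its action names are hypothetical, and only the 15-second interval and 5-second pong timeout come from the description above.

```typescript
const HEARTBEAT_INTERVAL_MS = 15000; // ping every 15 seconds
const PONG_TIMEOUT_MS = 5000; // declare the link dead after 5 seconds of silence

type HeartbeatAction = "send_ping" | "reconnect" | "none";

class HeartbeatMonitor {
  private lastPingAt: number;
  private awaitingPong = false;

  constructor(connectedAt: number) {
    // Treat the connect time as the last ping so the first ping is due one interval later.
    this.lastPingAt = connectedAt;
  }

  // Call when an application-level pong message arrives.
  onPong(): void {
    this.awaitingPong = false;
  }

  // Call periodically with the current time; returns what the caller should do.
  tick(now: number): HeartbeatAction {
    if (this.awaitingPong) {
      return now - this.lastPingAt >= PONG_TIMEOUT_MS ? "reconnect" : "none";
    }
    if (now - this.lastPingAt >= HEARTBEAT_INTERVAL_MS) {
      this.lastPingAt = now;
      this.awaitingPong = true;
      return "send_ping";
    }
    return "none";
  }
}
```

In an extension, tick would typically be driven by a short repeating timer, with "send_ping" writing a ping message to the socket and "reconnect" closing it and starting the reconnect path.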

Auto-reconnect waits 3 seconds after a disconnect, then backs off exponentially on repeated failures. Token refresh happens proactively 5 minutes before JWT expiry via a REST endpoint, so the connection should not drop because of expired credentials.
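Both timings reduce to simple arithmetic. The sketch below assumes the delay doubles on each failed attempt and is capped at 60 seconds; the doubling factor, the cap, and the function names are assumptions for illustration. The 3-second base and the 5-minute refresh lead come from the description above, and the exp claim is in seconds since the epoch, as JWTs define it.

```typescript
const BASE_DELAY_MS = 3000; // first reconnect attempt after 3 seconds
const MAX_DELAY_MS = 60000; // assumed cap; not stated in the post
const REFRESH_LEAD_MS = 5 * 60 * 1000; // refresh 5 minutes before expiry

// attempt is 0 for the first reconnect try, 1 for the second, and so on.
function reconnectDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * Math.pow(2, attempt), MAX_DELAY_MS);
}

// expSeconds is the JWT "exp" claim; nowMs is the current time in milliseconds.
// Returns how long to wait before refreshing, or 0 if the refresh is already due.
function refreshDelayMs(expSeconds: number, nowMs: number): number {
  return Math.max(expSeconds * 1000 - REFRESH_LEAD_MS - nowMs, 0);
}
```

Under these assumptions the reconnect delays run 3 s, 6 s, 12 s, 24 s, 48 s, and then stay at the 60 s cap.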

Tool Executor Pipeline

All tools extend a BaseBrowserToolExecutor class that provides common helpers: injectContentScript() with ping/pong deduplication, sendMessageToTab(), getActiveTabOrThrow(), and tab focus management.
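As a rough illustration of that shape, the sketch below puts two of the helpers on an abstract base class. The TabBridge interface is a stand-in invented for this example so the code runs without a browser; in an extension it would be backed by the chrome.tabs and chrome.scripting APIs. The message shapes and the ClickElementExecutor subclass are likewise hypothetical.

```typescript
// Stand-in for the browser tab APIs (hypothetical, for illustration only).
interface TabBridge {
  queryActiveTabId(): Promise<number | undefined>;
  sendMessage(tabId: number, message: unknown): Promise<unknown>;
  injectScript(tabId: number, file: string): Promise<void>;
}

abstract class BaseBrowserToolExecutor {
  constructor(protected readonly tabs: TabBridge) {}

  abstract execute(args: Record<string, unknown>): Promise<unknown>;

  protected async getActiveTabOrThrow(): Promise<number> {
    const tabId = await this.tabs.queryActiveTabId();
    if (tabId === undefined) throw new Error("No active tab");
    return tabId;
  }

  // Ping first; inject only when no content script answers with a pong.
  protected async injectContentScript(tabId: number, file: string): Promise<void> {
    try {
      const reply = await this.tabs.sendMessage(tabId, { type: "ping" });
      if (reply === "pong") return; // already injected
    } catch {
      // No listener in the tab yet: fall through and inject.
    }
    await this.tabs.injectScript(tabId, file);
  }
}

// Example subclass: each tool only implements execute().
class ClickElementExecutor extends BaseBrowserToolExecutor {
  async execute(args: Record<string, unknown>): Promise<unknown> {
    const tabId = await this.getActiveTabOrThrow();
    await this.injectContentScript(tabId, "content.js");
    return this.tabs.sendMessage(tabId, { type: "click_element", selector: args["selector"] });
  }
}
```

Keeping the ping check inside the base class means individual tools never need to know whether the tab was already prepared by an earlier call.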

Tools fall into several categories:

Page interaction — click_element, fill_form_field, select_dropdown, check_box, scroll_page, and navigate. These operate on DOM elements identified by CSS selectors or auto-generated reference IDs.

Content reading — read_page parses the DOM and assigns stable reference IDs to interactive elements. This is how AI “sees” a page — it gets structured text with clickable references rather than raw HTML.

Platform-specific extractors — web_fetcher has specialized parsers for Gmail, Slack, Jira, Salesforce, ServiceNow, and HubSpot. Instead of generic DOM scraping, these understand each platform’s markup structure and extract clean, typed data.

Browser internals — javascript executes arbitrary code via Chrome DevTools Protocol Runtime.evaluate, network_capture monitors HTTP traffic, screenshot captures the viewport or specific elements, and gif_recorder creates animated recordings of multi-step interactions.
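To illustrate the read_page idea from the list above, here is a simplified sketch that walks a tree and tags interactive elements with reference IDs. It operates on a plain PageNode structure rather than the real DOM so it can run anywhere, and the ref_N naming, the set of interactive tags, and the output format are all assumptions. IDs assigned in document order like this are only stable while the page structure is unchanged; the sketch does not attempt anything stronger.

```typescript
// Simplified stand-in for a DOM element (hypothetical).
interface PageNode {
  tag: string;
  text?: string;
  children?: PageNode[];
}

interface PageSnapshot {
  lines: string[]; // structured text handed to the model
  refs: Map<string, PageNode>; // reference ID -> element, for later clicks and fills
}

const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

function readPage(root: PageNode): PageSnapshot {
  const lines: string[] = [];
  const refs = new Map<string, PageNode>();

  const visit = (node: PageNode, depth: number): void => {
    const indent = "  ".repeat(depth);
    const label = node.text ? ` "${node.text}"` : "";
    if (INTERACTIVE_TAGS.has(node.tag)) {
      // Interactive elements get a reference the model can point at.
      const ref = `ref_${refs.size + 1}`;
      refs.set(ref, node);
      lines.push(`${indent}[${ref}] ${node.tag}${label}`);
    } else if (node.text) {
      // Non-interactive elements appear only when they carry text.
      lines.push(`${indent}${node.tag}${label}`);
    }
    for (const child of node.children ?? []) visit(child, depth + 1);
  };

  visit(root, 0);
  return { lines, refs };
}
```

A later click_element call can then accept one of these reference IDs and look the element up in refs instead of relying on a CSS selector.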

Inject Scripts: Running in the Page’s World

The trickiest part of the architecture is the inject scripts. Content scripts run in an isolated world: they can read and modify the DOM, but they can’t access the page’s JavaScript context, React component state, or framework internals.

We have 16 TypeScript modules that get bundled as IIFEs via a custom esbuild plugin and injected into the MAIN world. This lets them access React internals, call page-level APIs, and interact with single-page app routers.

The injection uses ping/pong deduplication to avoid injecting the same script twice into a tab, and results flow back through window.postMessage to the content script, then up to the background worker.
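The return path can be sketched as a small relay on the content-script side that validates incoming messages and matches them to pending requests. The envelope fields, the "mcp-inject" channel tag, and the ResultRelay class are assumptions made for this example; the point is that anything arriving via window.postMessage should be checked before it is trusted, because the page itself can post messages too.

```typescript
// Hypothetical envelope posted by an inject script running in the MAIN world.
const CHANNEL = "mcp-inject";

interface InjectResult {
  channel: typeof CHANNEL;
  requestId: string;
  ok: boolean;
  payload: unknown;
}

// Type guard: reject anything that does not look like our envelope.
function isInjectResult(data: unknown): data is InjectResult {
  if (typeof data !== "object" || data === null) return false;
  const d = data as Record<string, unknown>;
  return d["channel"] === CHANNEL && typeof d["requestId"] === "string" && typeof d["ok"] === "boolean";
}

class ResultRelay {
  private pending = new Map<string, (result: InjectResult) => void>();

  // Register interest in the result for one request.
  expect(requestId: string, onResult: (result: InjectResult) => void): void {
    this.pending.set(requestId, onResult);
  }

  // Returns true when the message was a valid result for a pending request.
  handleMessage(data: unknown, fromSameWindow: boolean): boolean {
    if (!fromSameWindow || !isInjectResult(data)) return false;
    const onResult = this.pending.get(data.requestId);
    if (!onResult) return false;
    this.pending.delete(data.requestId); // each result is delivered once
    onResult(data);
    return true;
  }
}
```

In a content script this would be wired up with window.addEventListener("message", (event) => relay.handleMessage(event.data, event.source === window)), and the callback would forward the payload to the background worker.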

What We Learned

Manifest V3 service workers are ephemeral. The browser can terminate them at any time, so we had to make the WebSocket connection resilient to service worker restarts, reconnecting transparently without losing pending tool calls.

Platform-specific parsing beats generic scraping. Our first version used generic DOM extraction for everything. Gmail’s HTML is deeply nested and changes between updates. Writing targeted parsers for each platform (6 so far) dramatically improved reliability and data quality.

MCP is a good protocol for this. The JSON-RPC 2.0 base with tools/list discovery and typed tools/call invocation is simple enough to implement but structured enough to be reliable. We’ve found it easier to extend than a custom protocol would have been.
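For readers who have not seen the discovery side, here is a minimal sketch of a registry that produces a tools/list result. The name, description, and inputSchema fields match the shape MCP defines for tool definitions; the ToolRegistry class, its method names, and the simplified schema type are assumptions for illustration.

```typescript
// Tool definition as returned by tools/list (inputSchema is a JSON Schema object).
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description?: string }>;
    required?: string[];
  };
}

class ToolRegistry {
  private tools = new Map<string, ToolDefinition>();

  register(tool: ToolDefinition): void {
    if (this.tools.has(tool.name)) throw new Error(`Duplicate tool: ${tool.name}`);
    this.tools.set(tool.name, tool);
  }

  // Body of the "result" field in a tools/list response.
  listResult(): { tools: ToolDefinition[] } {
    return { tools: Array.from(this.tools.values()) };
  }

  // Cheap pre-flight check before dispatching a tools/call.
  missingArguments(name: string, args: Record<string, unknown>): string[] {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`Unknown tool: ${name}`);
    return (tool.inputSchema.required ?? []).filter((key) => !(key in args));
  }
}
```

Because every tool carries its own schema, adding a tool is a matter of registering one more definition; callers discover it on their next tools/list without any protocol change.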

What’s Next

We’re working on expanding platform-specific handlers, improving element targeting reliability across single-page app navigation, and exploring ways to share tool definitions with the broader MCP ecosystem.

If you’re building MCP integrations or browser automation tools, we’d love to hear what patterns you’ve found useful. The protocol is still young and the community is figuring out best practices together.