Browser agents — autonomous LLM-driven systems that drive a real Chromium instance to fill forms, scrape SPAs, and complete multi-step web tasks — graduated from curiosity to production tool over the last twelve months. Honestly, this is the year I stopped writing brittle Selenium scripts for client work and started shipping agents instead. Browser Use shipped its BU 2.0 model in January 2026 with a 12% accuracy lift over BU 1.0, and the open-source Python library is now the default choice for natural-language browser automation. But going from a 15-line demo to something that runs reliably behind your product is a different problem entirely. Anti-bot fingerprinting, CAPTCHA flows, parallel session limits, cost control, Chromium memory blow-ups — they all show up the moment you stop running on cooperative test sites.
So, let's walk through the entire production pipeline — installation, the agent loop, custom tools, structured output, stealth patterns, deployment on Browserless, and cost management — using the current Browser Use 2.x API. Every code sample below works against the May 2026 release (I just re-ran them all this morning to be sure).
What Is Browser Use, and Why It Won 2026
Browser Use is a Python library (Apache-2.0, Python ≥ 3.11) that wraps Playwright with an LLM-driven agent loop. You pass a natural-language task, the agent observes the DOM, plans an action, executes it in a real Chromium tab, and repeats until the goal is reached or a stop condition triggers. It's model-agnostic — OpenAI, Anthropic, Gemini, and Ollama all work through thin wrappers — and it ships with optional cloud infrastructure for folks who don't want to babysit Chromium pods at 2 AM.
Three things made it the dominant choice over Stagehand (TypeScript), Skyvern, and AgentQL during 2026:
- DOM accessibility tree, not screenshots. Browser Use builds a token-efficient accessibility tree and feeds element indices to the LLM, dropping per-step cost by roughly 4× compared with vision-only approaches.
- BU 2.0 model. The December 2025 release pushed task-per-dollar to ~200 on the WebVoyager benchmark; the January 2026 BU 2.0 model added another 12% on top.
- Custom Chromium fork. Patches at the C++ and OS level remove the fingerprint leaks that defeat stock Playwright before the request even reaches the page.
Installation and Project Setup
Use uv for fast, reproducible installs (if you're still on plain pip, this is your sign to switch). The library requires Python 3.11+ because of structural pattern-matching used in the action planner.
```bash
uv init browser-agent && cd browser-agent
uv add browser-use python-dotenv
uvx browser-use install   # installs patched Chromium
```
Create a .env file with the keys you need:
```env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
BROWSERLESS_TOKEN=...      # optional, for cloud deployment
BROWSER_USE_API_KEY=...    # optional, for managed cloud
```
Your First Browser Agent: The 15-Line Version
The minimum viable agent fits in a single file. The agent receives a task, opens a Chromium window, completes the work, and returns the final result. That's it.
```python
import asyncio

from dotenv import load_dotenv
from browser_use import Agent
from browser_use.llm import ChatOpenAI

load_dotenv()

async def main():
    agent = Agent(
        task="Go to https://news.ycombinator.com and return the top 5 post titles with their points.",
        llm=ChatOpenAI(model="gpt-4o-mini"),
    )
    result = await agent.run(max_steps=15)
    print(result.final_result())

asyncio.run(main())
```
max_steps is your single most important guardrail in production. Without it, a confused agent will burn through your token budget retrying a broken login flow until your bill page lights up. (Ask me how I know.)
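Beyond max_steps, I also like a hard dollar ceiling. The helper below is a hypothetical wrapper of my own, not a Browser Use API: the per-token price is a placeholder, and record_step stands in for wherever your loop can observe per-step token counts.

```python
class BudgetExceeded(RuntimeError):
    pass

class RunBudget:
    """Track cumulative token spend across steps and fail fast past a ceiling.

    Illustrative only: the price and the record_step hook are assumptions,
    not part of the Browser Use library.
    """

    def __init__(self, max_usd: float, usd_per_1k_tokens: float = 0.002):
        self.max_usd = max_usd
        self.rate = usd_per_1k_tokens / 1000  # dollars per single token
        self.spent_usd = 0.0

    def record_step(self, tokens: int) -> None:
        self.spent_usd += tokens * self.rate
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} of ${self.max_usd:.2f} budget"
            )

budget = RunBudget(max_usd=0.05)
for step_tokens in [8000, 9000, 7000]:   # simulated per-step token counts
    budget.record_step(step_tokens)
```

Raising instead of silently truncating means the failure shows up in your traces rather than as a mysteriously incomplete result.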
The Agent Loop, Explained
Every step of the agent loop performs four operations:
- Observe. Browser Use serializes the current page to a compact accessibility tree, assigning an integer index to every interactive element.
- Plan. The LLM receives the tree, the task, the action history, and a tool schema describing available actions (click_element, input_text, scroll, switch_tab, extract_content, etc.).
- Act. The chosen tool calls drive Playwright. Failed clicks fall back to coordinate clicks; missing elements trigger a re-render of the tree.
- Reflect. The agent records the outcome and decides whether to continue, finish, or ask for help.
Understanding this loop matters because almost every production failure is a failure of one of the four — a stale DOM tree, a planner hallucinating an element, or an action timing out under network jitter. When something goes wrong (and it will), debugging starts with figuring out which of the four steps broke.
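The control flow itself can be sketched in a few lines. Everything here is a stand-in (the observe, plan, and act callables are stubs I wrote for illustration), but the skeleton mirrors the four steps above:

```python
def run_agent_loop(task, observe, plan, act, max_steps=10):
    """Minimal observe/plan/act/reflect skeleton. All callables are stubs."""
    history = []
    for _ in range(max_steps):
        tree = observe()                      # 1. Observe: serialize page state
        action = plan(task, tree, history)    # 2. Plan: LLM picks next action
        if action == "done":
            return history                    # 4. Reflect: goal reached
        outcome = act(action)                 # 3. Act: drive the browser
        history.append((action, outcome))     # 4. Reflect: record the outcome
    raise TimeoutError("step budget exhausted")

# Stub "site": the planner declares success after two clicks.
def plan(task, tree, history):
    return "done" if len(history) >= 2 else "click"

history = run_agent_loop("demo", observe=lambda: "<tree>",
                         plan=plan, act=lambda a: "ok")
print(len(history))  # 2
```

When you instrument a real agent, each of those four call sites is exactly where a trace span belongs.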
Structured Output with Pydantic
For real applications you want typed data, not free-form text. Browser Use integrates with Pydantic to validate the final result against a schema and re-prompt on failure.
```python
from typing import List

from pydantic import BaseModel
from browser_use import Agent
from browser_use.llm import ChatAnthropic

class HNPost(BaseModel):
    title: str
    points: int
    url: str
    comments: int

class HNTopPosts(BaseModel):
    posts: List[HNPost]

agent = Agent(
    task="Extract the top 10 posts from Hacker News.",
    llm=ChatAnthropic(model="claude-sonnet-4-6"),
    output_model=HNTopPosts,
)

result = await agent.run(max_steps=20)
parsed: HNTopPosts = result.final_result()
for p in parsed.posts:
    print(p.title, p.points)
```
If validation fails, Browser Use feeds the validation error back to the model and gives it one chance to fix the response — the same pattern Instructor popularized for chat completions, applied at the end of the agent run. It's a small detail, but it's saved me from a lot of broken-JSON post-mortems.
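You can see the shape of that retry contract without any of the machinery. The sketch below hand-rolls the validation (no Pydantic, and fix_response stands in for the real re-prompt call to the model):

```python
def parse_with_one_retry(raw: dict, validate, fix_response):
    """Validate a model response; on failure, re-prompt exactly once with the error."""
    try:
        return validate(raw)
    except ValueError as err:
        # Feed the validation error back to the "model" and try its new answer.
        return validate(fix_response(raw, str(err)))

def validate(d: dict) -> dict:
    if not isinstance(d.get("points"), int):
        raise ValueError("points must be an int")
    return d

# Simulated model: the first answer has points as a string, the retry fixes it.
fixed = parse_with_one_retry(
    {"title": "Show HN", "points": "312"},
    validate,
    fix_response=lambda raw, err: {**raw, "points": int(raw["points"])},
)
print(fixed["points"])  # 312
```

The single-retry cap matters: an unbounded validation loop is just another way to burn tokens on a model that fundamentally misunderstood the schema.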
Custom Tools: Giving the Agent Domain Skills
Browser-only actions aren't enough for production. Most real workflows need to write to a database, call an internal API, or compute something the LLM can't. Browser Use exposes a Tools registry for exactly this.
```python
import httpx
from browser_use import Agent, Tools
from browser_use.llm import ChatOpenAI

tools = Tools()

@tools.action(description="Look up a SKU in the internal product catalog and return price plus stock.")
async def lookup_sku(sku: str) -> dict:
    async with httpx.AsyncClient() as client:
        r = await client.get(f"https://catalog.internal/products/{sku}")
        r.raise_for_status()
        return r.json()

@tools.action(description="Persist an extracted record to the orders database.")
async def save_order(order_id: str, total_cents: int, currency: str) -> str:
    # write to your DB / queue here
    return f"saved order {order_id}"

agent = Agent(
    task="Open the supplier portal, enrich each order with our internal SKU data, and save it.",
    llm=ChatOpenAI(model="gpt-4o"),
    tools=tools,
)

await agent.run(max_steps=40)
```
The description field isn't cosmetic — it's the only signal the planner has when deciding whether to call a tool. Treat it as prompt engineering: write it as if explaining the tool to a junior developer who's never seen the system before. Vague descriptions are the #1 reason agents skip tools they were supposed to use.
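Under the hood, a registry like this is mostly a mapping from names and descriptions to callables plus a schema the planner can read. Here's my own toy version, not the library's implementation, just to show why the description travels with the function:

```python
import inspect

class ToyTools:
    """Tiny stand-in for a tool registry: stores callables plus the
    (name, description, params) schema a planner LLM would see."""

    def __init__(self):
        self.registry = {}

    def action(self, description: str):
        def wrap(fn):
            params = list(inspect.signature(fn).parameters)
            self.registry[fn.__name__] = {
                "fn": fn, "description": description, "params": params,
            }
            return fn
        return wrap

    def schema(self) -> list[dict]:
        # What the planner sees: everything except the callable itself.
        return [{"name": n, "description": m["description"], "params": m["params"]}
                for n, m in self.registry.items()]

tools = ToyTools()

@tools.action(description="Look up a SKU and return its price in cents.")
def lookup_sku(sku: str) -> int:
    return {"ABC-1": 1999}.get(sku, 0)

print(tools.schema()[0]["name"])  # lookup_sku
```

The schema list is the entirety of what the model knows about your tools, which is why a vague description is indistinguishable from no tool at all.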
Anti-Bot Detection: What Actually Blocks You in 2026
The web in 2026 is openly adversarial toward automation. Akamai, Cloudflare, DataDome, and PerimeterX all run real-time ML models on every request. They see far more than they currently block — false positives kill conversions, so detection thresholds remain conservative — but the moment agent traffic crosses a tipping point, those "monitor only" rules will flip to "block." Building stealth into your stack today is the only realistic posture.
Layer 1: Fingerprint Patching
Stock Playwright leaks navigator.webdriver, the HeadlessChrome User-Agent marker, missing plugins, inconsistent timezone/locale, and a half-dozen other tells. playwright-stealth v2 patches these via a context manager:
```python
import asyncio

from playwright.async_api import async_playwright
from playwright_stealth import Stealth
from browser_use import Agent, BrowserSession
from browser_use.llm import ChatOpenAI

async def main():
    async with Stealth().use_async(async_playwright()) as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 14_5) AppleWebKit/537.36 ...",
            locale="en-US",
            timezone_id="America/New_York",
            viewport={"width": 1366, "height": 768},
        )
        session = BrowserSession(playwright_context=context)
        agent = Agent(
            task="Search Amazon for 'mechanical keyboard' and return the top 5.",
            llm=ChatOpenAI(model="gpt-4o-mini"),
            browser_session=session,
        )
        await agent.run(max_steps=20)

asyncio.run(main())
```
Layer 2: TLS and Network Fingerprinting
Stealth plugins do nothing about JA3 / JA4 TLS signatures or HTTP/2 frame ordering. For sites protected by Cloudflare Bot Management or Akamai, you need either a custom Chromium fork (Browser Use's cloud, Camoufox before maintenance lapsed) or a residential proxy provider that rewrites the TLS handshake. My honest advice: stick with monitor-grade targets while testing, and don't build production logic against an aggressively defended site without budgeting for managed infrastructure.
Layer 3: Behavioral Plausibility
Modern detectors flag clicks that travel in straight lines and inputs that arrive in single keystroke bursts. Browser Use's human_like_movement=True flag (added in 2.1) generates Bezier-curve mouse paths and per-character typing delays drawn from a Gaussian distribution. Enable it for any session that touches a real production site:
```python
session = BrowserSession(human_like_movement=True, action_delay_ms=(80, 240))
```
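If you're curious what "Bezier-curve paths with Gaussian delays" means mechanically, here's a standalone sketch. This is my own math for illustration, not the library's internals, and the parameter defaults are arbitrary:

```python
import random

def bezier_path(start, end, control, steps=20):
    """Sample points along a quadratic Bezier curve from start to end.

    The control point pulls the path off the straight line, which is the
    property bot detectors look for in mouse trajectories.
    """
    pts = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * start[0] + 2 * (1 - t) * t * control[0] + t ** 2 * end[0]
        y = (1 - t) ** 2 * start[1] + 2 * (1 - t) * t * control[1] + t ** 2 * end[1]
        pts.append((x, y))
    return pts

def typing_delays(text, mean_ms=120, sd_ms=40, floor_ms=30):
    """Per-character delays drawn from a Gaussian, clamped to a sane floor."""
    return [max(floor_ms, random.gauss(mean_ms, sd_ms)) for _ in text]

path = bezier_path((0, 0), (300, 120), control=(50, 200))
delays = typing_delays("mechanical keyboard")
```

A straight-line path is the degenerate case where the control point sits on the segment between start and end; any detector that histograms path curvature will flag it instantly.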
CAPTCHA Handling
Three options in 2026, in increasing order of cost and reliability:
- Browser Use built-in solver. Currently handles Cloudflare Turnstile, PerimeterX Click & Hold, and reCAPTCHA v2. Free with the Cloud plan.
- 2Captcha / CapSolver. External services, ~$1–$3 per 1,000 solves, integrated as custom tools.
- Human-in-the-loop. Pause the agent on CAPTCHA, push the screenshot to a Slack channel, resume on operator response. Best for low-volume, high-value flows like onboarding partners.
```python
@tools.action(description="Solve any CAPTCHA visible on the page using the external solver.")
async def solve_captcha_via_2captcha(page_url: str, sitekey: str) -> str:
    # call 2captcha API, poll for solution, return token
    ...
```
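Option 3's pause-and-resume mechanic fits neatly on an asyncio.Event. In this sketch the Slack post is faked with a print and the agent wiring is left out; only the gate itself is real:

```python
import asyncio

class HumanGate:
    """Pause an agent run until a human operator signals the CAPTCHA is solved."""

    def __init__(self):
        self._solved = asyncio.Event()

    async def wait_for_operator(self, screenshot_ref: str, timeout_s: float = 300) -> bool:
        # In production this would push the screenshot to Slack or a dashboard.
        print(f"CAPTCHA hit, posted {screenshot_ref} for an operator")
        try:
            await asyncio.wait_for(self._solved.wait(), timeout=timeout_s)
            return True
        except asyncio.TimeoutError:
            return False   # operator never responded; fail the task gracefully

    def operator_done(self):
        self._solved.set()

async def demo():
    gate = HumanGate()
    # Simulate an operator responding 10 ms later (e.g. a Slack button callback).
    asyncio.get_running_loop().call_later(0.01, gate.operator_done)
    return await gate.wait_for_operator("screenshot-123.png", timeout_s=1)

solved = asyncio.run(demo())
print(solved)  # True
```

The timeout is the important part: a human-in-the-loop flow without a deadline is just a hung worker.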
Production Deployment Patterns
Option A: Self-Host Chromium
Run a pool of Chromium pods behind a queue (Redis Streams, SQS, or Temporal). Each pod consumes one task at a time, exits cleanly, and is replaced. Memory is the killer — Chromium with realistic extensions and a single tab needs ~600 MB resident, so ten parallel sessions on a 16 GB box is the safe ceiling. Push past that and you'll start seeing OOM kills at the worst possible moments.
```dockerfile
# Dockerfile snippet
FROM python:3.12-slim

RUN apt-get update && apt-get install -y \
    libnss3 libatk1.0-0 libatk-bridge2.0-0 libcups2 libxkbcommon0 \
    libxcomposite1 libxdamage1 libxfixes3 libxrandr2 libgbm1 libpango-1.0-0

WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --frozen
RUN uvx browser-use install --with-deps
COPY . .

CMD ["uv", "run", "python", "-m", "agent.worker"]
```
Option B: Browserless (Managed Chromium)
Browserless runs the Chromium pods for you and exposes a CDP endpoint over WebSocket. Your agent code stays unchanged except for the session URL — which is honestly the cleanest migration path I've seen for any infrastructure tool this year.
```python
import os

from browser_use import Agent, BrowserSession
from browser_use.llm import ChatOpenAI

token = os.environ["BROWSERLESS_TOKEN"]

session = BrowserSession(
    cdp_url=f"wss://production-sfo.browserless.io?token={token}&stealth=true",
)

agent = Agent(
    task="Sign in to the partner portal and download yesterday's settlement report.",
    llm=ChatOpenAI(model="gpt-4o"),
    browser_session=session,
)

await agent.run(max_steps=30)
```
Option C: Browser Use Cloud
The fully managed offering: stealth Chromium, residential proxies, CAPTCHA solving, and parallel session quotas behind a single API key. Use this when you don't want to operate any infrastructure and your task volume justifies the per-task fee.
Observability: Tracing the Agent Loop
Black-box agents are impossible to debug. Wire OpenTelemetry into every action and ship spans to Langfuse, LangSmith, or your existing tracing backend.
```python
from opentelemetry import trace
from browser_use import Agent

tracer = trace.get_tracer("browser-agent")

class TracedAgent(Agent):
    async def step(self, *args, **kwargs):
        with tracer.start_as_current_span("agent.step") as span:
            result = await super().step(*args, **kwargs)
            span.set_attribute("action", result.action.name)
            span.set_attribute("url", result.page_url)
            span.set_attribute("dom_tokens", result.dom_token_count)
            return result
```
Track three metrics minimum: steps-per-task, tokens-per-task, and success rate. Drift in any of them is the earliest signal that the upstream site changed its DOM, or that your prompts have decayed. I usually set up Slack alerts on a 20% week-over-week change in steps-per-task — caught more silent breakages that way than I can count.
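That week-over-week check is simple enough to pin down in code. A minimal version follows; the 20% threshold and metric names come from the paragraph above, and the alert transport (Slack, PagerDuty, whatever) is up to you:

```python
def drift_alerts(this_week: dict, last_week: dict, threshold: float = 0.20) -> list[str]:
    """Flag any metric whose week-over-week relative change exceeds threshold."""
    alerts = []
    for metric, prev in last_week.items():
        cur = this_week.get(metric)
        if cur is None or prev == 0:
            continue  # new or degenerate metric; nothing to compare against
        change = (cur - prev) / prev
        if abs(change) > threshold:
            alerts.append(f"{metric}: {change:+.0%} week-over-week")
    return alerts

last = {"steps_per_task": 12.0, "tokens_per_task": 90_000, "success_rate": 0.94}
now = {"steps_per_task": 16.0, "tokens_per_task": 95_000, "success_rate": 0.93}
print(drift_alerts(now, last))
```

Note it alerts on drops too: a sudden *decrease* in steps-per-task usually means the agent started bailing out early, not that it got smarter.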
Cost Control
Browser-agent runs are dominated by LLM token cost, not infrastructure. A non-trivial multi-step task can easily burn 30–80 KB of context per step over 15–25 steps. Three levers move the bill:
- Model cascading. Plan with Claude Sonnet or GPT-4o; execute simple actions with Gemini Flash or GPT-4o-mini. Browser Use 2.x supports passing a separate planner_llm and action_llm.
- DOM truncation. Set max_dom_tokens conservatively. Long pages with deep menus rarely need more than 4K tokens of accessibility tree at any one time.
- Prompt caching. Anthropic and OpenAI both cache the static portion of the system prompt. Browser Use marks the planner system prompt as cacheable by default — verify your logs show cache hits above 80% (if they don't, something's invalidating the cache key, usually a timestamp you forgot about).
```python
agent = Agent(
    task=task,
    planner_llm=ChatAnthropic(model="claude-sonnet-4-6"),
    action_llm=ChatOpenAI(model="gpt-4o-mini"),
    max_dom_tokens=4000,
    use_prompt_cache=True,
)
```
Failure Modes (and How to Mitigate Them)
- Stale DOM after SPA navigation. Force a tree refresh with await session.refresh_dom() after any route change.
- Infinite click loops. Detect identical action signatures three times in a row and abort with an explicit error rather than draining the step budget.
- Login expiry. Persist storage state to disk (session.save_storage_state(path)) and validate it on startup; refresh the cookie before tasks rather than mid-task.
- Rate-limited LLM. Wrap the agent loop in a Tenacity retry with exponential backoff on 429s; do not retry on 4xx auth errors.
- Memory leaks. Recycle the browser context every N tasks (50 is a safe default) — Chromium leaks under heavy DOM churn, and no, the upstream fix isn't coming any time soon.
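The loop-detection mitigation is small enough to show in full. A sketch, where the action signature is whatever tuple uniquely identifies an action in your logs (action name plus element index works well):

```python
from collections import deque

class LoopDetector:
    """Abort when the same action signature repeats N times consecutively."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=max_repeats)  # sliding window of signatures

    def check(self, signature: tuple) -> None:
        self.recent.append(signature)
        if len(self.recent) == self.max_repeats and len(set(self.recent)) == 1:
            raise RuntimeError(
                f"action loop detected: {signature!r} x{self.max_repeats}"
            )

detector = LoopDetector(max_repeats=3)
detector.check(("click", 14))
detector.check(("click", 14))
# A third identical call would raise; any different action in between
# breaks the consecutive run:
detector.check(("scroll", 0))
detector.check(("click", 14))
```

Call check() once per step inside the agent loop; aborting at step 3 instead of step 30 is the difference between a cheap failure and an expensive one.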
Putting It Together: A Production-Ready Worker
```python
import asyncio, os, structlog
from tenacity import retry, stop_after_attempt, wait_exponential
from browser_use import Agent, BrowserSession, Tools
from browser_use.llm import ChatOpenAI, ChatAnthropic

log = structlog.get_logger()
tools = Tools()

@tools.action(description="Persist an order record to the warehouse.")
async def save_order(order_id: str, total_cents: int) -> str:
    return order_id

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=2, max=30))
async def run_task(task_description: str) -> dict:
    session = BrowserSession(
        cdp_url=f"wss://production-sfo.browserless.io?token={os.environ['BROWSERLESS_TOKEN']}&stealth=true",
        human_like_movement=True,
    )
    agent = Agent(
        task=task_description,
        planner_llm=ChatAnthropic(model="claude-sonnet-4-6"),
        action_llm=ChatOpenAI(model="gpt-4o-mini"),
        tools=tools,
        max_dom_tokens=4000,
        use_prompt_cache=True,
    )
    try:
        result = await agent.run(max_steps=30)
        log.info("task.complete", steps=result.step_count, tokens=result.token_usage)
        return result.final_result()
    finally:
        await session.close()

if __name__ == "__main__":
    task = "Sign in to vendor.example.com, download today's invoices, and call save_order for each row."
    asyncio.run(run_task(task))
```
This worker pattern — stateless, retry-wrapped, observability-instrumented, with separate planner and action models — is the baseline you should build every production browser agent on top of. Everything else is workflow-specific tooling layered on this foundation.
Frequently Asked Questions
Is Browser Use better than Playwright alone for scraping?
For deterministic, well-structured sites, raw Playwright with selectors is faster and cheaper — full stop. Browser Use earns its keep when the site changes layout frequently, requires multi-step reasoning (login → navigate → filter → export), or when you want one agent to handle dozens of similar-but-not-identical sites without bespoke code per site.
How much does it cost to run a Browser Use agent in production?
Three components: LLM tokens ($0.01–$0.10 per page-interaction depending on model), browser infrastructure ($0.005–$0.05 per session if managed), and optional anti-detection / proxy costs. A typical 20-step task on GPT-4o-mini with Browserless lands at $0.03–$0.08; the same task on Claude Opus or GPT-4o is closer to $0.30–$1.00.
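The arithmetic behind those ranges is worth making explicit. The helper below is illustrative: the per-1K-token prices and the flat session fee are placeholders I chose to land inside the ranges above, so check your provider's current price sheet before trusting any output.

```python
def estimate_task_cost(steps: int, tokens_per_step: int,
                       usd_per_1k_tokens: float, session_usd: float = 0.02) -> float:
    """Back-of-envelope cost of one agent task: LLM tokens plus managed browser."""
    llm_usd = steps * tokens_per_step / 1000 * usd_per_1k_tokens
    return round(llm_usd + session_usd, 4)

# A 20-step task at ~10K context tokens per step:
cheap = estimate_task_cost(20, 10_000, usd_per_1k_tokens=0.00015)   # budget model
frontier = estimate_task_cost(20, 10_000, usd_per_1k_tokens=0.0025)  # frontier model
print(cheap, frontier)
```

The spread between the two calls is almost entirely the model price, which is why the model-cascading lever earlier in the article dominates everything else.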
Can Browser Use bypass Cloudflare?
The open-source library can't defeat aggressive Cloudflare Bot Management on its own. The Browser Use Cloud product, Browserless's stealth mode, and residential proxy providers like Bright Data or Smartproxy can — but no provider offers a 100% guarantee, and the cat-and-mouse cycle is continuous. Build your system to fail gracefully on detection rather than betting on perpetual evasion.
Does Browser Use work with local LLMs like Llama or Qwen?
Yes, via the Ollama or vLLM wrapper. Performance depends heavily on the model — instruction-following and JSON-output reliability matter more than raw size. Llama 3.3 70B and Qwen 2.5 72B are the practical floor; smaller models tend to confuse the action schema and waste steps.
How do I prevent the agent from doing something destructive?
Three layers: (1) restrict tools — never expose write actions the task doesn't need; (2) use the allowed_domains session argument to refuse navigation outside an allowlist; (3) enable a confirmation step for irreversible actions via the requires_confirmation=True flag on a tool, which pauses the agent and demands a callback before executing.
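The domain allowlist in layer 2 is cheap to approximate yourself as well, for example as a pre-navigation check in a custom tool. A standalone sketch, assuming exact-host or subdomain matching is the policy you want:

```python
from urllib.parse import urlsplit

def is_allowed(url: str, allowed_domains: set[str]) -> bool:
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    # endswith("." + d) prevents the classic suffix trick, where
    # "notexample.org" would otherwise match an "example.org" allowlist.
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

ALLOWED = {"vendor.example.com", "example.org"}

assert is_allowed("https://vendor.example.com/invoices", ALLOWED)
assert is_allowed("https://app.example.org/login", ALLOWED)
assert not is_allowed("https://evil.example.net/", ALLOWED)
assert not is_allowed("https://notexample.org/", ALLOWED)
```

Parsing the hostname rather than substring-matching the raw URL is the whole point; a URL like https://example.org.evil.net/ defeats naive string checks.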
Where to Go Next
Once you have a worker like the one above running, the natural next steps are evaluation and trajectory testing — running fixed tasks against snapshotted sites to detect regressions before deployment — and stitching multiple browser agents together with an orchestrator like LangGraph or Temporal for durable, multi-day workflows. The browser is finally a first-class actuator for AI agents in 2026; treating it like infrastructure (deploy, observe, recycle) rather than a script is what separates demos from production.