I replaced 100 login scripts with a browser agent loop

Richárd Hruby @HrubyOnRails
Thursday, February 12, 2026 · AI


Most websites have a login flow. None of them look the same. Some have email first, then password on a second screen. Some combine both. Most throw a CAPTCHA. Some ask you to pick a workspace. Some show a cookie banner to dismiss before you can do anything. If you're building browser agents for authenticated use cases, this is your nightmare.

At Anon, I've spent the last year building infrastructure that lets AI agents act on behalf of users across the web. In the past, logging into hundreds of different services programmatically meant writing a dedicated login state machine for every single provider. Each one was a brittle mess: hardcoded selectors and transitions, and everything breaks when a site ships a redesign on a Tuesday afternoon.

This is why I built the Login Machine (GitHub, hosted demo). One loop: screenshot the page, ask an LLM what it sees, act programmatically on the structured response. No hardcoded scripts, and it works for any login! Here's how I did it, and how you can replicate it 👇

Why This Works (and why scripts don't)

Before I walk through the implementation, read this, because it might change how you think about browser automation entirely.

Traditional login automation treats every provider as a known, static flow (a state machine, if you will). You write a Playwright script full of hardcoded steps, and it works until the site renames #email to #username, a cookie banner covers the submit button, or the flow adds an MFA step you didn't account for. For one provider, you can maintain this. For ten, it's annoying. For a hundred, it's untenable.

At scale, it's the combinatorial explosion of cases that really kills this (RIP ✝︎). Different account types (member, admin, external admin, users with multiple roles, etc.) behave differently across login flows. Multiply the providers by the account types by the flow variations and you get hundreds of possible paths through every login.

But wait… login pages are designed for humans. Every screen is self-contained.
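Before moving on, here's a sketch of the kind of per-provider script this approach replaces. Everything in it is hypothetical (the provider, the URL, the selectors); the point is how many assumptions about the page it bakes in:

```typescript
// Hypothetical hardcoded login for one provider. `page` stands in for a
// Playwright Page. Every line is an assumption the site is free to break.
async function loginExampleShop(page: any, email: string, password: string) {
  await page.goto("https://shop.example.com/login"); // hypothetical URL
  await page.fill("#email", email);                  // breaks if renamed to #username
  await page.click("button[type=submit]");           // assumes email-first flow
  await page.fill("#password", password);            // assumes password is step two
  await page.click("button[type=submit]");
  await page.waitForURL("**/dashboard");             // assumes no MFA, no CAPTCHA
}
```

Multiply this file by a hundred providers and a handful of flow variations each, and the maintenance burden becomes clear.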
You can always deduce what to do without knowing the previous or next step. An LLM with vision can do the same. Instead of hardcoding what each login page looks like, you send the model a screenshot and stripped-down HTML of the current page and ask: what kind of screen is this, and what are the interactive elements? It returns structured data you can act on programmatically. That's the entire idea.

This also solves the problem of credential custody. When the LLM analyzes a page, its structured output describes exactly which fields are needed (an email, a password, a workspace picker). That output becomes an input request you can surface to the user through a dynamic UI, a password manager API, or a secrets vault. Credentials flow transiently into the browser session and are never stored or surfaced to the LLM. The Login Machine tells you what to ask for. You decide how to collect it.

How It Works

The Core Loop

No hardcoded transitions. No state machine that assumes "after email comes password." After every action, the system takes a fresh screenshot and sends it to the LLM.

The browser runs in the cloud on Browserbase. No Docker containers to manage, no browser binaries to update. Browserbase handles fingerprinting, residential proxies, and keeps the session alive server-side. I just connect via CDP, do my work, and disconnect.

HTML Extraction + Screenshot: Preparing the Prompt

I don't send raw HTML to the LLM. I gut it and send it along with a screenshot. A typical login page has thousands of lines of scripts, stylesheets, SVGs, tracking pixels, and hidden elements that add zero signal. Sending all of that burns tokens and drowns the actual form fields in noise. The extraction function walks the DOM recursively and is ruthless about what it keeps. Three things matter here:

Shadow DOM traversal - Many modern login forms (especially enterprise SSO widgets) live behind Shadow DOM boundaries. If you only walk node.children, you miss them entirely.
Checking node.shadowRoot catches these.

Aggressive tag stripping - Scripts, styles, SVGs, images, noscript blocks, link tags: all gone. They contribute nothing to form identification and massively inflate token count.

Attribute filtering - Even on the tags we keep, we strip most attributes. The LLM doesn't need data-analytics-id or aria-describedby. It needs id, name, type, and placeholder: the things that help it generate a working Playwright locator.

Iframe content gets extracted separately and appended with explicit tags. This cut token usage by roughly 10x on complex pages like Amazon's login flow. And the LLM makes better decisions with less noise: fewer hallucinated locators, faster classification. Along with all this, the screenshot gives the LLM visual clues.

Screen Classification: Structured Output Screen Types

I define six screen types, each with a strict Zod schema. Vercel AI SDK's generateText with Output.object makes this straightforward: I pass a Zod schema and get typed, validated output back. The LLM classifies every page into one of them.

credential_login_form

The most common: a page with input fields and a submit button. The LLM doesn't just identify that there's a form; it returns the exact Playwright locators needed to interact with every element.

choice_screen

An account picker, workspace selector, or MFA method screen. No input fields, just clickable options. The submit field is optional; most choice screens don't need it because the options themselves are clickable. It's only included when the page requires selecting an option and then clicking a separate "Continue" button to confirm.

magic_login_link

A "check your email" screen. No input fields, just an instruction to the user. The handler surfaces this to the user, waits for them to paste the magic link URL, then navigates the browser to it.

blocked_screen

A cookie banner, popup, or modal blocking the login flow.
The LLM finds the dismiss button and the system clicks it automatically, then re-analyzes.

loading_screen

The page isn't ready yet. The system waits and retries.

logged_in_screen

Terminal state. We're in. 🚀

Screen Type Handlers: Acting on the Classification

I create programmatic handlers for every structured output above, so the LLM never sees user credentials. The LLM's job ends at classification: it returns structured data (the screen type, locators, field labels). Then a completely separate handler takes over and uses those locators to interact with the page programmatically.

For a credential_login_form, the handler creates an input request in the database describing which fields are needed. A dynamic UI surfaces that request to the user, who types their credentials directly. Those credentials flow transiently into the browser session, get filled into the form, and are discarded. The LLM told us where the email field is (#email); it never sees what the user types into it. Credentials pass from the user's browser directly into the login form. They're never sent to the LLM, never logged, never stored.

For a choice_screen, the handler presents the options to the user and clicks their selection. For a blocked_screen, it clicks the dismiss button automatically; no user input needed. For a loading_screen, it waits and re-screenshots. Each handler is simple because the LLM already did the hard work of understanding the page.

Putting It All Together: Multi-Step Flows with Dynamic UI

Here's a real four-step login showing the full system in action: Email → Password → MFA → Account Selection → Logged In. Four different screen types, handled by the same loop. The system didn't need to know this provider has four steps. It just kept looking at the page and responding to what it saw. At every step, the UI that collected user input was generated dynamically from the LLM's page analysis. Credentials flowed transiently through the browser session and were never stored.
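The observe-classify-act loop described above can be sketched as follows. The names are illustrative, not the actual implementation; in the real system `observe` would screenshot the page, strip the HTML, and call the LLM, and each handler would act on Zod-validated locators:

```typescript
// The six screen types the LLM can classify a page into. In the real system
// each variant carries a Zod-validated payload (locators, field labels, options).
type ScreenKind =
  | "credential_login_form"
  | "choice_screen"
  | "magic_login_link"
  | "blocked_screen"
  | "loading_screen"
  | "logged_in_screen";

// One observe → classify → act iteration per screen; no transition table anywhere.
async function loginLoop(
  observe: () => Promise<ScreenKind>,                 // screenshot + stripped HTML → LLM
  handlers: Record<ScreenKind, () => Promise<void>>,  // programmatic, credential-free handlers
  maxSteps = 20,
): Promise<boolean> {
  for (let step = 0; step < maxSteps; step++) {
    const screen = await observe();                   // fresh look at the page, every time
    if (screen === "logged_in_screen") return true;   // terminal state
    await handlers[screen]();                         // act, then loop back and re-observe
  }
  return false;                                       // safety valve: give up after maxSteps
}
```

Because the loop re-observes after every action, a four-step flow and a one-step flow run through exactly the same code.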
This is the part that made me delete my state machines with zero regret. The LLM figures out the flow. The database mediates the handoff. The UI renders whatever the AI discovers. I didn't have to anticipate anything.

Handling the Real World

Login pages in the wild are messy. A few things I had to solve that aren't obvious:

Validation

Asking an LLM to generate Playwright locators sounds risky. What if it hallucinates a selector that doesn't exist? I don't simply trust the LLM output; I validate every locator against the page before acting on it. There are three layers of validation:

Zod schema parsing - The LLM output must conform to the exact shape I expect. If it returns a garbage structure, it's rejected before I even check the DOM.

Element existence - Every Playwright locator is checked against the live DOM, including all iframes. If the LLM says #login-button exists but it doesn't, validation fails.

Retry with context - If validation fails, the error is fed back to the LLM for self-correction.

Self-Correction via Error History

When a locator fails validation, the system doesn't just retry blindly. It tells the LLM what went wrong. The LLM sees its previous attempts and the specific validation errors, so it learns from its mistakes within a single login attempt. Attempt 1 might generate #login-button, which doesn't exist, so attempt 2 tries button[type="submit"] instead.

This is the same "feedback loop for self-correction" pattern that Netflix uses in their text-to-query work: validating LLM output against a known schema and feeding errors back for re-generation. They apply it to DSL generation. I apply it to browser automation. Same idea.

CAPTCHAs

I'm excluding CAPTCHA handling from this demo. It's a rabbit hole that deserves its own post. But here's what I learned the hard way: the best CAPTCHA solution is not seeing one in the first place. Residential proxies and proper browser fingerprinting eliminate most challenges before they appear.
After that, it's a stack of fallbacks: browser-level CDP commands, vendor-specific solutions like Browserbase's advanced stealth mode, and dedicated solvers like 2Captcha for the rest. The Login Machine itself doesn't understand CAPTCHAs; it just needs the page to be clear before the LLM can analyze it. In the production implementation, CAPTCHAs are checked before and after every form submission. If one shows up, it gets handled at the infrastructure layer, and the Login Machine continues as if nothing happened.

Why This Works

Three design decisions make this system reliable:

1. Observe, don't assume. Every action is followed by a fresh page analysis. The system never guesses what screen comes next.

2. Validate before acting. LLM outputs are checked against the live DOM. Hallucinated selectors are caught and corrected before they cause errors.

3. Fail forward with context. When something goes wrong, the error becomes part of the next attempt's context. The LLM doesn't repeat the same mistake.

The result: a single system that handles Target, Amazon, Home Depot, QuickBooks, and any other login page without a single line of provider-specific code.

The Stack

LLM - Claude Sonnet 4.5 via Vercel AI SDK
Structured Output - Zod schemas + Vercel AI SDK generateText
Browser Automation - Playwright via CDP on Browserbase
Frontend - Next.js 16, React 19, Tailwind 4
Validation - Zod schema parsing + live DOM locator checks

What's Next

I'm open-sourcing the Login Machine. You can try the hosted demo or check out the code on GitHub. If you're building AI agents that need to interact with authenticated web services, you may find this a useful starting point.

Built at Anon by Richard and Jesse.
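To close the loop on the Validation and Self-Correction sections above, here's a minimal sketch of the retry-with-error-history pattern. `classify` and `locatorExists` are illustrative stand-ins for the real LLM call and the live-DOM check, not the production code:

```typescript
// Sketch: validate LLM-proposed locators against the page, feed failures
// back into the next attempt so the model can self-correct.
type Analysis = { locators: string[] };

async function classifyWithRetry(
  classify: (errors: string[]) => Promise<Analysis>,  // LLM call; sees prior errors
  locatorExists: (loc: string) => Promise<boolean>,   // live DOM check (incl. iframes)
  maxAttempts = 3,
): Promise<Analysis> {
  const errors: string[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const analysis = await classify(errors);          // fresh attempt, with error context
    const missing: string[] = [];
    for (const loc of analysis.locators) {
      if (!(await locatorExists(loc))) missing.push(loc);
    }
    if (missing.length === 0) return analysis;        // every locator resolves: accept
    errors.push(`Locators not found on page: ${missing.join(", ")}`); // feed back
  }
  throw new Error(`Validation failed after ${maxAttempts} attempts: ${errors.join(" | ")}`);
}
```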
Explore Further

This piece includes a linked resource worth reading, a tool worth trying, and an idea worth prototyping.

Quick Insight

This is a deep technical breakdown of building AI agents that can automatically log into any website using vision models instead of hardcoded scripts. The author built a "Login Machine" that screenshots pages, sends them to an LLM for analysis, then acts on structured responses - solving the nightmare of maintaining hundreds of brittle login automations.

Actionable Takeaway

Try the hosted demo and GitHub repo to see how vision-based browser automation works in practice. The core pattern (screenshot → LLM analysis → structured action) could be adapted for other browser automation tasks beyond just logins.

Related to Your Work

This directly applies to webhook testing and partner integrations at your fintech platform - instead of maintaining brittle scripts for each merchant's admin portal, you could use vision-based automation to handle configuration flows. Also relevant for your Chrome extension side projects where you need to interact with varied website structures.

Source Worth Reading

**YES** - The linked article is a comprehensive technical guide covering BrowserBase integration, HTML extraction strategies, Zod schemas for structured output, and complete code examples. It's a practical implementation guide, not just theory, with specific solutions for shadow DOM handling and token optimization.

Tags

#ai-agents #browser-automation #playwright #computer-vision #fintech-tools