How to really stop your agents from making the same mistakes

LangChain has raised $160 million. Three years of development. A billion-dollar valuation. LangSmith, their testing platform, is genuinely sophisticated: trajectory evals, trace-to-dataset pipelines, LLM-as-judge, regression suites, unit test frameworks for tools. They have the pieces. Credit where it's due.

But pieces aren't a practice. LangChain gives you testing tools. It never tells you what to test, in what order, or when you're done. There's no opinionated workflow that says, in order:

this failure happened
now write a skill
now write the deterministic code
now write unit tests
now write LLM evals
now add a resolver trigger
now eval the resolver
now audit for duplicates
now smoke test
now file correctly

That loop doesn't exist. You have to invent it yourself from scattered primitives. $160 million in funding, and most LangChain users still don't test their agents, because the framework gave them a gym membership without a workout plan.

Most AI agent "reliability" is vibes-based. Prompt tweaks. Bigger system messages. "Please don't hallucinate" incantations. That stuff decays the moment the conversation gets complex. The frameworks that raised hundreds of millions of dollars to solve this gave you monitoring dashboards and unit test helpers and said "good luck."

My agent screwed up twice this week. Neither failure can happen again. Not because I asked nicely. Because I turned each failure into a permanent structural fix: a skill with tests that run every day, forever.

What hundreds of millions of dollars in venture capital couldn't buy you, I am going to give you today, for free, as open source. I call the practice "skillify." Once you use it, your agents won't keep making the same mistakes. Here's how it works.

Failure 1: The Trip That Was Already in the Database

I asked my OpenClaw about an old business trip, nearly ten years back, buried somewhere in calendar history. Simple question. Should take one second.
Instead the agent did this:

Called the live calendar API → blocked (too far back).
Tried email search → noisy results, nothing conclusive.
Tried the calendar API again with different params → still blocked.
Five minutes later, searched my local knowledge base and found it instantly.

The answer had been sitting in my own data the whole time. 3,146 calendar files spanning 2013 through 2026. Already indexed. Already local. One grep away. The agent just didn't look there first.

In the framework I've been writing about (thin harness, fat skills) there's a key distinction between work that requires judgment and work that requires precision. I call them latent and deterministic. Calendar grep is deterministic. Same input, same output, every time. No model needed. But the agent did it in latent space anyway, spinning up reasoning, making API calls, interpreting results, when a three-line script would have returned the answer instantly.

That's the bug. Not a wrong answer. A wrong side.

The fix: calendar-recall (Steps 1 + 2)

In thin harness / fat skills, a skill is a markdown procedure that teaches the model how to approach a task. Not what to do (the user supplies the what). The skill supplies the process. Think of it like a method call: same procedure, radically different outputs depending on what you pass in.

Here's the skill that came out of this failure:

    name: calendar-recall
    description: "Brain-first historical calendar lookup. ALWAYS use this before any live API for any event not in the future or the last 48 hours."

And the hard rule inside:

    Live calendar APIs are ONLY for events in the FUTURE or the LAST 48 HOURS. Everything historical goes through the local knowledge base first.

Here's the thing that makes this work: the agent itself wrote the deterministic script. The skill file (markdown, living in latent space) told the agent how to fix the problem.
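
To make the hard rule above concrete (live APIs only for future events or the last 48 hours, local knowledge base for everything else), here is a minimal sketch of that routing decision in JavaScript. The function name `chooseCalendarSource`, the two return labels, and the inclusive handling of the 48-hour boundary are assumptions for illustration. This is not code from the actual calendar-recall skill.

```javascript
// Hypothetical sketch: names and labels are illustrative, not from the real skill.
const LIVE_WINDOW_MS = 48 * 60 * 60 * 1000; // "the last 48 hours" in milliseconds

// Decide which source a calendar question should hit first.
function chooseCalendarSource(eventDate, now = new Date()) {
  const ageMs = now.getTime() - eventDate.getTime();
  // Future events have a negative age; recent events fall inside the window.
  if (ageMs <= LIVE_WINDOW_MS) {
    return "live-api";
  }
  // Everything older is historical: deterministic local lookup, no model needed.
  return "local-knowledge-base";
}

const now = new Date("2026-01-15T12:00:00Z"); // fixed clock so the example is repeatable
console.log(chooseCalendarSource(new Date("2016-05-07T00:00:00Z"), now)); // local-knowledge-base
console.log(chooseCalendarSource(new Date("2026-01-14T18:00:00Z"), now)); // live-api
console.log(chooseCalendarSource(new Date("2026-02-01T09:00:00Z"), now)); // live-api
```

Putting the rule in code rather than in a prompt means the decision comes out the same on every run, which is the whole point of moving it to the deterministic side.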
The agent read the skill, understood that calendar search is deterministic work, and generated a script to handle it:

    $ node scripts/calendar-recall.mjs search "Beijing"
    Found 2 matching day(s):
    ── 2016-05-07 ──
    Flight to Beijing, InterContinental Beichen check-in
    ── 2016-05-08 ──
    Lunch with investors in Guomao, Chaoyang

Code that runs in under 100 milliseconds (most of which is Bun startup; the actual grep is sub-millisecond). Zero LLM calls. Zero network. Just local files.

This is the loop that makes the whole architecture work: the latent space builds the deterministic tool, then the deterministic tool constrains the latent space. The agent used judgment (latent) to write calendar-recall.mjs. Now the skill forces the agent to run that script instead of reasoning about calendar data. The model's intelligence created the constraint that prevents the model from being stupid.

The old failure path becomes structurally unreachable. The skill says "search local first." The script does the search. The agent never gets a chance to be clever about it or get it wrong again.

Failure 2: "28 Minutes" (Steps 1 + 2 again)

Same day. Agent says: "Your ne
