Measuring agent reliability on your own tasks

Dec 2025

Public benchmarks (like ARC or GPQA) are useful for comparing LLMs in the abstract. A model that scores 90% on a benchmark is probably stronger on that kind of problem than one that scores 85%. But the score tells you almost nothing about whether the model is good enough for your specific task.

A model might score well on general knowledge questions but struggle with your specific site's quirky UI. It might excel at reading tables but fail on dynamic content. It might handle English perfectly but choke on a date format you use. A benchmark score is a proxy, not a prediction.

BrowserPilot measures reliability on your actual tasks. Every time you run a template, the result is recorded: did it complete? Did it fail? If it failed, did a retry recover it? How many steps did it take? How much did it cost?

These metrics roll up per template: completion rate (the fraction of runs that finish), recovery rate (the fraction of failures that a retry recovered), median steps (the number of actions a typical run takes), and median cost. You can view them for a single template, for a time period, or across your whole workspace.
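To make the definitions concrete, here is a minimal sketch of how the four metrics could be computed from a list of run records. The `Run` fields and the `summarize` function are hypothetical names chosen for illustration, not BrowserPilot's actual schema or API. The sketch also assumes that a run recovered by retry counts as completed.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Run:
    # Hypothetical record of one template run (an assumption, not a real schema).
    first_attempt_failed: bool  # some step failed before any retry
    completed: bool             # the run finished, possibly after a retry
    steps: int                  # number of actions the run took
    cost_usd: float             # cost of the run


def summarize(runs: list[Run]) -> dict[str, Optional[float]]:
    """Roll up the four metrics for a non-empty list of runs."""
    if not runs:
        raise ValueError("need at least one run")
    failures = [r for r in runs if r.first_attempt_failed]
    recovered = [r for r in failures if r.completed]
    return {
        "completion_rate": sum(r.completed for r in runs) / len(runs),
        # Undefined when nothing failed, so report None rather than 0 or 1.
        "recovery_rate": len(recovered) / len(failures) if failures else None,
        "median_steps": median(r.steps for r in runs),
        "median_cost_usd": median(r.cost_usd for r in runs),
    }


# Four made-up runs: two clean, one recovered by retry, one failed outright.
runs = [
    Run(first_attempt_failed=False, completed=True, steps=10, cost_usd=0.02),
    Run(first_attempt_failed=True, completed=True, steps=14, cost_usd=0.05),
    Run(first_attempt_failed=True, completed=False, steps=20, cost_usd=0.08),
    Run(first_attempt_failed=False, completed=True, steps=12, cost_usd=0.03),
]
summary = summarize(runs)
```

With these four made-up runs, three complete, so the completion rate is 0.75. One of the two failures recovers, so the recovery rate is 0.5. Reporting `None` when there are no failures avoids implying a perfect or zero recovery rate from no data.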

Completion rate matters most: a production-ready workflow should complete consistently. Whether a given rate is good enough depends on your use case. At 95% completion, the remaining 5% of failures might be acceptable or catastrophic. A price monitor that misses 5% of updates is still useful. A checkout verification that fails 5% of the time is not.

Recovery rate tells you whether failures are transient or systemic. If a failing step recovers on retry, the cause is probably a flaky target or a network glitch. If it fails consistently, either your task definition has a problem or the agent cannot handle the site. Watching recovery rate over time helps you tell the two apart and decide whether to lean on retries or rework the task.
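The transient-versus-systemic distinction can be turned into a rough rule of thumb. The sketch below labels each time period by its recovery rate. The function name, the 0.5 threshold, and the minimum of five failures are arbitrary assumptions chosen for illustration and should be tuned to your own workload.

```python
def classify_failures(failed: int, recovered: int,
                      threshold: float = 0.5, min_failures: int = 5) -> str:
    """Label one period's failures from its recovery rate (illustrative only)."""
    if failed < 0 or recovered < 0 or recovered > failed:
        raise ValueError("expected 0 <= recovered <= failed")
    # With very few failures the rate is too noisy to interpret.
    if failed < min_failures:
        return "insufficient data"
    rate = recovered / failed
    # High recovery suggests flaky targets; low recovery suggests a task or agent issue.
    return "mostly transient" if rate >= threshold else "likely systemic"


# Made-up weekly counts of (failed runs, runs recovered by retry).
weekly = [(20, 18), (22, 19), (25, 6)]
labels = [classify_failures(failed, recovered) for failed, recovered in weekly]
```

In this made-up series the first two weeks are labeled mostly transient and the third likely systemic. A drop like that would suggest something changed on the target site or in the task, rather than ordinary flakiness.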