WABER: Evaluating Reliability and Efficiency of Web Agents with Existing Benchmarks

ICLR 2025 Workshop on Foundation Models in the Wild

Organized by Microsoft Research, Redmond, USA

Most existing web agent benchmarks evaluate agents solely on their task completion rate, overlooking other crucial aspects of agent behavior that affect usability and deployability in the real world. We propose incorporating two additional metrics into web agent benchmarks: reliability, which assesses how consistently an agent completes tasks despite the transient web failures that are common in the wild, and efficiency, which measures the speed and cost-effectiveness of the agent's task completion. Developing new benchmarks to measure these metrics would take significant effort. To address this, we introduce WABER, a novel network proxy-based solution that enables evaluating these two metrics on existing agents and benchmarks without requiring any modifications to them, so developers can adopt it on any agent and benchmark with zero additional effort. Using our WABER prototype, we evaluated two existing agents on the WebArena benchmark: Stacked LLM Policy and Agent Workflow Memory. Our results show that current state-of-the-art agents struggle to complete tasks under the WABER framework, demonstrating the need to design agents that generalize to real-world, unreliable scenarios.
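Because WABER sits between the agent and the web as a network proxy, transient unreliability can be injected without touching the agent or the benchmark. The sketch below is illustrative only: it assumes a recent mitmproxy version and its addon API, and the failure rate, status code, and delay bound are hypothetical parameters chosen for the example, not WABER's actual configuration.

```python
# fault_proxy.py -- illustrative sketch of a fault-injecting proxy addon.
# Run with, e.g.:  mitmdump -s fault_proxy.py
# Assumes a recent mitmproxy version; parameters are hypothetical, not WABER's.
import random
import time

from mitmproxy import http
from mitmproxy.script import concurrent

FAILURE_RATE = 0.1  # fraction of requests that fail transiently (hypothetical)
MAX_DELAY_S = 2.0   # upper bound on injected latency, in seconds (hypothetical)


@concurrent  # run in a worker thread so the injected delay does not stall other flows
def request(flow: http.HTTPFlow) -> None:
    if random.random() < FAILURE_RATE:
        # Short-circuit the request with a transient server error; a reliable
        # agent must detect the failure and retry or re-plan.
        flow.response = http.Response.make(
            503,
            b"Service Temporarily Unavailable",
            {"Content-Type": "text/plain", "Retry-After": "1"},
        )
        return
    # Otherwise add a small random delay to mimic a slow, flaky network.
    time.sleep(random.uniform(0.0, MAX_DELAY_S))
```

Under this setup, reliability can be read off as the task success rate with fault injection enabled, and efficiency from wall-clock time and token or API cost per completed task.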