
Introducing WebGames: a suite of challenges for web-browsing AI agents

We’re excited to announce the release of WebGames, a comprehensive collection of web-based challenges designed to test and evaluate general-purpose web-browsing AI agents. The suite features over 50 unique challenges that are intentionally crafted to be easy for humans but hard for today’s AI systems.

What makes WebGames special?

WebGames is carefully designed to be:

  • Human-friendly: tasks that humans can complete with ease
  • AI-challenging: tests the limits of current AI agents
  • Easy and quick to run: just client-side state and a single-page JavaScript app
  • Clear to evaluate: each challenge reveals a unique password upon successful completion, which can be used to verify that the agent completed it
  • Orthogonal: many challenges test a single ability in isolation, such as scrolling, navigating iframes, or downloading files
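
The password mechanism keeps scoring mechanical: a harness only needs to compare the agent’s final output against the expected password for each challenge. A minimal sketch of such a scorer — the challenge IDs and passwords below are made up for illustration, not the real ones:

```python
# Minimal sketch of password-based scoring for WebGames-style challenges.
# Challenge IDs and passwords here are illustrative placeholders.

def score_run(expected: dict[str, str], agent_outputs: dict[str, str]) -> float:
    """Return the fraction of challenges whose completion password matches."""
    solved = sum(
        1
        for challenge_id, password in expected.items()
        if agent_outputs.get(challenge_id, "").strip() == password
    )
    return solved / len(expected)

expected = {"scroll-maze": "QUIET-FALCON", "iframe-hop": "AMBER-RIVER"}
agent_outputs = {"scroll-maze": "QUIET-FALCON", "iframe-hop": "wrong-guess"}
print(score_run(expected, agent_outputs))  # 0.5
```

Because the password is only revealed on success, exact string comparison is all the verification required — no DOM inspection or human grading.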

Relation to other web benchmarks

Compared to WebVoyager: WebGames has no external dependencies and does not rely on third-party websites. The tasks are hermetic, can be run locally, and have easily verified ground-truth solutions.

Compared to WebArena: WebGames is easier, faster, and cheaper to run yourself. It’s also hosted at webgames.convergence.ai. The challenges are faster to complete, and the evaluation is more straightforward.

Challenge categories

The suite includes a range of challenges across multiple categories:

  1. Fundamental browser interaction: covers basic operations like DOM element selection/activation, viewport manipulation, tab management, and file system tasks (download, parsing, upload). These form the essential building blocks of web navigation.
  2. Advanced input processing: tests sophisticated interaction patterns including drag-and-drop operations, hover state management, and complex keyboard commands. Focuses on fine-grained control and temporal coordination.
  3. Cognitive and memory tasks: evaluates higher-order reasoning through tree-based search problems, mental mapping, data visualization interpretation, and state management across interactions. Tests planning, reasoning, and adaptation capabilities.
  4. Workflow automation: assesses real-world task completion like e-commerce inventory management, retail transactions, and event coordination. Requires maintaining consistency across extended interaction sequences.
  5. Interactive entertainment systems: features real-time challenges including arcade game reproductions, obstacle navigation, and physics engine interactions. Requires rapid visual processing, precise timing, and adaptive strategy formation.

Preliminary results

We ran WebGames on leading large vision-language models including GPT-4o, Claude Computer Use (Sonnet 3.5), and Gemini 1.5 Pro. We found a significant capabilities gap when compared against human-level performance. A detailed discussion of these results will follow in an in-depth technical report.

| Model               | Environment   | Scaffolding            | Performance (%) ↑ |
|---------------------|---------------|------------------------|-------------------|
| GPT-4o              | Web browser   | SoMs + ReAct Prompting | 41.2 ± 7.0        |
| Claude Computer-Use | Linux machine | ReAct Prompting        | 35.3 ± 6.8        |
| Gemini-1.5-Pro      | Web browser   | SoMs + ReAct Prompting | 27.5 ± 6.3        |
| Qwen2-VL-7b         | Web browser   | SoMs + ReAct Prompting | 13.7 ± 4.9        |
| Qwen2-VL-72b        | Web browser   | SoMs + ReAct Prompting | 29.4 ± 6.4        |
| Proxy               | Web browser   | —                      | 43.1 ± 7.0        |
| Human               | Computer      | —                      | 95.7 ± 0.6        |
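
The “SoMs + ReAct Prompting” scaffolding above combines Set-of-Marks visual annotations with a ReAct-style reason-then-act loop. A schematic of such a loop — the model call, action space, and step limit here are placeholders for illustration, not our actual harness:

```python
# Schematic ReAct-style agent loop. query_model and take_action are
# placeholder callables, not the actual WebGames evaluation harness.

def react_loop(task: str, query_model, take_action, max_steps: int = 20):
    history = []
    for _ in range(max_steps):
        # The model sees the task plus prior thoughts/actions/observations
        # and returns a thought and a next action (e.g. click, type, scroll).
        thought, action = query_model(task, history)
        if action["type"] == "finish":
            # The agent reports the challenge's completion password, if any.
            return action.get("password")
        observation = take_action(action)  # execute the action in the browser
        history.append((thought, action, observation))
    return None  # step budget exhausted without finishing
```

The returned password (or lack of one) then feeds directly into the password-matching evaluation described earlier.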

For developers and researchers

The entire suite is open-source and available on GitHub. You can:

  • Download the complete challenge set in JSONL format
  • Integrate the challenges into your testing pipeline
  • Contribute new challenges or improvements
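
Because the challenge set ships as JSONL (one JSON object per line), loading it takes only a few lines. A sketch — the field names (`"id"`, `"title"`) are assumptions for illustration; check the actual file in the GitHub repo for the real schema:

```python
# Load a WebGames-style challenge set from a JSONL file.
# Field names such as "id" and "title" are assumed for illustration;
# consult the repository for the actual schema.
import json

def load_challenges(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each returned dict is one challenge record, ready to feed into a testing pipeline.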

Try it out

Visit WebGames to explore the challenges yourself. Whether you’re a human looking for some interesting puzzles or developing an AI system, we’d love to see how you fare against our challenges. Let the WebGames begin!

Built with ❤️ by convergence.ai
