We’re excited to announce the release of WebGames, a collection of more than 50 unique web-based challenges for testing and evaluating general-purpose web-browsing AI agents. The challenges are intentionally crafted to be easy for humans but hard for today’s AI systems.
WebGames is carefully designed to be:
Compared to WebVoyager: WebGames has no external dependencies. It does not rely on any external websites. The tasks are hermetic, can be run locally, and have easily-verified ground-truth solutions.
Compared to WebArena: WebGames is easier, faster, and cheaper to run yourself, and it is also available online at webgames.convergence.ai. The challenges are quicker to complete, and the evaluation is more straightforward.
The suite includes a range of challenges across multiple categories:
We ran WebGames on leading large vision-language models, including GPT-4o, Claude Computer Use (Sonnet 3.5), and Gemini 1.5 Pro. We found a significant capabilities gap relative to human-level performance: the best of these models, GPT-4o, scored 41.2%, compared with 95.7% for humans. A detailed discussion of these results will follow in an in-depth technical report.
| Model | Environment | Scaffolding | Performance (%) ↑ |
|---|---|---|---|
| GPT-4o | Web browser | SoMs + ReAct Prompting | 41.2 ± 7.0 |
| Claude Computer-Use | Linux Machine | ReAct Prompting | 35.3 ± 6.8 |
| Gemini-1.5-Pro | Web browser | SoMs + ReAct Prompting | 27.5 ± 6.3 |
| Qwen2-VL-7b | Web browser | SoMs + ReAct Prompting | 13.7 ± 4.9 |
| Qwen2-VL-72b | Web browser | SoMs + ReAct Prompting | 29.4 ± 6.4 |
| Proxy | Web browser | – | 43.1 ± 7.0 |
| Human | Computer | – | 95.7 ± 0.6 |
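If you want to report your own results in the same "rate ± error" form, here is a minimal sketch. The announcement does not say how the ± values were computed or exactly how many tasks were scored, so both the formula (standard error of the mean over pass/fail outcomes, with Bessel's correction) and the count of 51 tasks are assumptions. They do happen to reproduce the model rows in the table, for example 21 of 51 gives 41.2 ± 7.0. The Human row presumably aggregates many more attempts and is not covered by this sketch.

```python
import math


def success_rate_with_stderr(successes: int, trials: int) -> tuple[float, float]:
    """Return (success rate, standard error of the mean), both in percent.

    Assumption: each trial is an independent pass/fail outcome and the
    error bar is the standard error computed from the sample standard
    deviation (n - 1 in the denominator). The source does not confirm this.
    """
    if trials < 2:
        raise ValueError("need at least two trials to estimate a standard error")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials
    # Sample variance of 0/1 outcomes with Bessel's correction.
    sample_variance = p * (1 - p) * trials / (trials - 1)
    stderr = math.sqrt(sample_variance / trials)
    return 100 * p, 100 * stderr


if __name__ == "__main__":
    # Hypothetical run: 21 of 51 challenges solved.
    rate, err = success_rate_with_stderr(21, 51)
    print(f"{rate:.1f} ± {err:.1f}")
```

With only around 50 tasks, error bars of this size mean that gaps of a few points between models should be read with caution.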
The entire suite is open-source and available on GitHub. You can:
- Download the complete challenge set in JSONL format
- Integrate the challenges into your testing pipeline
- Contribute new challenges or improvements
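As a starting point for wiring the JSONL challenge set into a testing pipeline, here is a minimal loading sketch. JSONL means one JSON object per line. The file name `webgames.jsonl` and the field names `id` and `title` are hypothetical placeholders used only for illustration, so check the actual file for its real schema.

```python
import json
from pathlib import Path
from typing import Iterable


def parse_challenges(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL text: one JSON object per non-empty line."""
    challenges = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        challenges.append(record)
    return challenges


def load_challenges(path: str) -> list[dict]:
    """Read a JSONL file from disk and return its records."""
    with Path(path).open(encoding="utf-8") as handle:
        return parse_challenges(handle)


if __name__ == "__main__":
    # Hypothetical file name; point this at the file you downloaded.
    for challenge in load_challenges("webgames.jsonl"):
        # "id" and "title" are assumed field names, read defensively.
        print(challenge.get("id"), challenge.get("title"))
```

Keeping the parser separate from the file access makes it easy to unit-test against a few inline lines before running it on the full set.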
Visit WebGames to explore the challenges yourself. Whether you’re a human looking for some interesting puzzles or developing an AI system, we’d love to see how you fare against our challenges. Let the WebGames begin!
Built with ❤️ by convergence.ai