Introducing Web-World Models
At Convergence, we’re building agents that are helpful in the workforce. We are focused on building agents that are reliable and autonomous, and that can learn from everyone.
We have spent a lot of effort on improving the reliability of our agents and reducing their error rate in common environments encountered in day-to-day work. Today, we are excited to share state-of-the-art results and outline how we achieved these new levels of performance.
We are proud to announce that our agent, Proxy, achieves a top score of 82% on the Web Voyager benchmark.
A New State-of-the-Art
At the heart of Proxy are highly capable web-browsing models that can execute a wide variety of tasks, from simple information retrieval to multi-step workflows.
In particular, these agents excel at planning. Using our method of Generative Tree Search, our agents leverage Web-World Models that predict the state of the web after a proposed action has been taken. These predictions are generated recursively to produce a tree of possible futures, which is then searched to select the best next action, as ranked by our value models. Our Web-World Models can also be used to train agents in hypothetical situations without generating large amounts of expensive data.
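To make the idea of searching a tree of predicted futures concrete, here is a minimal sketch in Python. It is not Proxy's implementation: the function names, the exhaustive depth-limited search, and the toy integer "web state" are all assumptions chosen for illustration. A world model predicts the next state for each proposed action, the predictions are expanded recursively, and a value model scores the leaves.

```python
from typing import Callable, Sequence

# Hypothetical names throughout; a real system would use learned models
# in place of the toy functions defined further down.

def best_future_value(state, depth: int,
                      propose: Callable, world_model: Callable,
                      value_model: Callable) -> float:
    """Best leaf value reachable within `depth` further predicted steps."""
    actions: Sequence = propose(state)
    if depth == 0 or not actions:
        return value_model(state)  # leaf of the tree: score it directly
    return max(
        best_future_value(world_model(state, action), depth - 1,
                          propose, world_model, value_model)
        for action in actions
    )

def select_action(state, depth: int,
                  propose: Callable, world_model: Callable,
                  value_model: Callable):
    """Pick the action whose predicted subtree contains the best-scoring leaf."""
    actions: Sequence = propose(state)
    if depth < 1 or not actions:
        raise ValueError("need depth >= 1 and at least one candidate action")
    return max(
        actions,
        key=lambda action: best_future_value(
            world_model(state, action), depth - 1,
            propose, world_model, value_model),
    )

# Toy stand-ins (assumptions): the "state" is an integer and the goal is 10.
TARGET = 10

def toy_propose(state: int) -> list:
    return ["double", "inc"]

def toy_world_model(state: int, action: str) -> int:
    # Predicts the state after the action, without executing it for real.
    return state * 2 if action == "double" else state + 1

def toy_value_model(state: int) -> float:
    return -abs(TARGET - state)  # closer to the target scores higher

if __name__ == "__main__":
    print(select_action(4, 1, toy_propose, toy_world_model, toy_value_model))
    print(select_action(4, 2, toy_propose, toy_world_model, toy_value_model))
```

With one step of lookahead the toy search prefers "double" (4 becomes 8, the closest state to 10), but with two steps it prefers "inc", because 4 to 5 to 10 reaches the target exactly. This is the kind of case where looking further into predicted futures changes the chosen action.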
Under the hood, Proxy consists of an orchestration system of agents that plan, reason, and execute actions while interacting with one another. Most excitingly, Proxy can dynamically spin up and launch task-specific agents (and even full copies of itself) to complete sub-tasks in parallel, before aggregating the results and responding to a user request. For computational tasks, we also give Proxy access to a sandboxed virtual machine for both managing resources and running code.
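The fan-out and aggregate pattern described above can be sketched with Python's standard library. This is an illustrative assumption about the general shape of such an orchestrator, not a description of Proxy's internals: `run_subtasks`, `toy_agent`, and `aggregate` are hypothetical names, and a thread pool stands in for whatever mechanism actually launches sub-agents.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

def run_subtasks(subtasks: List[str],
                 agent: Callable[[str], str]) -> Dict[str, str]:
    """Run one (hypothetical) agent per sub-task in parallel and collect results."""
    if not subtasks:
        return {}
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        # pool.map returns results in the same order as the inputs.
        results = list(pool.map(agent, subtasks))
    return dict(zip(subtasks, results))

def aggregate(results: Dict[str, str]) -> str:
    """Combine sub-task results into a single response for the user."""
    return "\n".join(f"{task}: {result}" for task, result in results.items())

def toy_agent(subtask: str) -> str:
    # Stand-in for a real sub-agent that would browse, reason, and act.
    return f"done ({len(subtask)} chars)"

if __name__ == "__main__":
    results = run_subtasks(["find flights", "find hotels"], toy_agent)
    print(aggregate(results))
```

A thread pool suits this sketch because sub-agents that browse the web spend most of their time waiting on I/O. A production system might instead use separate processes or remote workers.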
Given these advances, we are pleased to announce that Proxy has achieved state-of-the-art results on the Web Voyager benchmark, outperforming Agent E (Emergence), Runner H (H Company), and Claude Computer Use (Anthropic). This validates the practical capabilities of Proxy on a popular benchmark in a realistic setting.
Benchmarking & Validation
The Web Voyager benchmark consists of 640 tasks, including shopping, news crawling, and hotel booking, as well as using tools like maps and GitHub. This benchmark is run on real websites rather than simulated scenarios.
At Convergence, we are focused on building highly capable agents that can benefit everyone. We are mindful of the potential risks of letting autonomous systems interact freely with the broader world. Consequently, we have adopted the UK AI Safety Institute’s InspectAI framework for evaluations, both for running standard benchmarks like Web Voyager above, and for closely monitoring unexpected behaviour and potentially dangerous capabilities.
We ran all of our tasks at roughly the same time at the beginning of December. We first evaluated the success of each task using the original Web Voyager auto-evaluator, which is based on GPT-4o and uses the last 15 screenshots from the trajectory. We then double-checked the results using paid human evaluators, who were shown the entirety of Proxy’s decisions and actions.
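As a rough illustration of this two-stage check, the sketch below selects the last 15 screenshots of a trajectory for an auto-evaluator and lists the tasks where automatic and human verdicts differ. The function names and data shapes are assumptions made for the example. The sketch does not reproduce the Web Voyager evaluator or say how any disagreements were actually resolved.

```python
from typing import Dict, List, Sequence

def evaluator_inputs(screenshots: Sequence, k: int = 15) -> List:
    """Return the last `k` screenshots of a trajectory (all of them if fewer)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return list(screenshots[-k:])

def success_rate(verdicts: Dict[str, bool]) -> float:
    """Fraction of tasks judged successful."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts.values()) / len(verdicts)

def disagreements(auto: Dict[str, bool], human: Dict[str, bool]) -> List[str]:
    """Task ids where the automatic and human verdicts differ."""
    return [task for task in auto if task in human and auto[task] != human[task]]

if __name__ == "__main__":
    # Hypothetical task ids and verdicts, for illustration only.
    auto = {"task-1": True, "task-2": False, "task-3": True}
    human = {"task-1": True, "task-2": True, "task-3": True}
    print(len(evaluator_inputs(list(range(20)))))
    print(disagreements(auto, human))
    print(success_rate(human))
```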
Outlook
We’re working closely with enterprises, putting Proxy to work in real-world scenarios. The capability improvements we’ve discussed have increased Proxy’s reliability on a wide range of tasks and unlocked new workflows involving more sophisticated planning and computation. Keep an eye out for an open-source release of our web-browsing models in the new year. Proxy will also become generally available in January.
Sign up to try Proxy today and see how it could benefit your business.