
@VentureBeat shared a link

2025-02-22 18:34:44

The rise of browser-use agents: Why Convergence's Proxy is beating OpenAI's Operator

A new wave of AI-powered browser-use agents is emerging, promising to transform how enterprises interact with the web. These agents can autonomously navigate websites, retrieve information, and even complete transactions, but early testing reveals significant gaps between promise and performance.

While consumer examples offered by OpenAI's new browser-use agent Operator, like ordering pizza or buying game tickets, have grabbed headlines, the open question is where the main developer and enterprise use cases are.

"The thing that we don't know is what will be the killer app," said Sam Witteveen, co-founder of Red Dragon, a company that develops AI agent applications. "My guess is it's going to be things that just take time on the web that you don't actually enjoy."

This includes things like going on the web and searching for the cheapest price of a product or booking the best hotel accommodations.
More likely it will be used in combination with other tools like Deep Research, where companies can then do even more sophisticated research plus execution of tasks around the web.

Companies need to carefully evaluate the rapidly evolving landscape as established players and startups take different approaches to solving the autonomous browsing challenge.

Key players in the browser-use agent landscape

The field has quickly become crowded with both major tech companies and innovative startups:

OpenAI's Operator (launched January 2025): Available to ChatGPT Pro subscribers ($200/month), focusing on consumer-friendly web automation
Convergence's Proxy (launched December 2024): A UK startup offering free limited use (5 sessions/day) or unlimited access for $20/month
Google's Project Mariner: Currently in preview testing with a waitlist for access
Anthropic's Computer Use (launched October 2024): Expected to release an update soon
Microsoft's OmniParser V2 (February 2025): An open-source project for converting UI screenshots into structured data, allowing LLMs to interpret and interact with sites
ByteDance's UI-TARS: Requires deeper system access, raising potential security concerns
Browser-Use: A developer-focused tool allowing choice of AI models, including Google's Gemini 2.0 Flash

Operator and Proxy are the most advanced, in terms of being consumer-friendly and out-of-the-box ready. Many of the others appear to be positioning themselves more for developer or enterprise usage. For example, Browser-Use, a Y Combinator startup, allows users to customize the models used with the agent. This gives you more control over how the agent works, including using a model from your local machine. But it's definitely more involved. The others listed above provide varying degrees of functionality and interaction with local machine resources.
I decided not even to test ByteDance's UI-TARS for now, because it requested lower-level access to my machine's security and privacy features (if I test it out, I'll definitely use a secondary computer).

Testing reveals reasoning challenges

So the easiest to test are OpenAI's Operator and Convergence's Proxy. In our testing, the results highlighted how reasoning capabilities can matter more than raw automation features. Operator, in particular, was buggier.

For example, I asked the agents to find and summarize VentureBeat's five most popular stories. It was an ambiguous task, because VentureBeat doesn't have a "most popular" section per se. Operator struggled with this. It first fell into an infinite scrolling loop while searching for "most popular" stories, requiring manual intervention. In another attempt, it found a three-year-old article titled "Top five stories of the week." In contrast, Proxy demonstrated better reasoning by identifying the five most visible stories on the homepage as a practical proxy for popularity, and it gave accurate summaries.

The distinction became even clearer in real-world tasks. I asked the agents to book a reservation at a romantic restaurant for noon in Napa, California. Operator approached the task linearly, finding a romantic restaurant first, then checking availability at noon. When no tables were available, it reached a dead end. Proxy showed more sophisticated reasoning by starting with OpenTable to find restaurants that were both romantic and available at the desired time. It even came back with a slightly better-rated restaurant.

Even seemingly simple tasks revealed important differences. When searching for a YubiKey 5C NFC price on Amazon, Proxy found the item more quickly and easily than Operator.

OpenAI hasn't divulged much about the technologies it uses for training its Operator agent, other than saying it has trained its model on browser-use tasks.
Convergence, however, has provided more detail. According to the company, its agent uses something called Generative Tree Search to leverage "Web-World Models" that predict the state of the web after a proposed action has been taken. "These are generated recursively to produce a tree of possible futures that are searched over to select the next optimal action, as ranked by our value models," the company says. "Our Web-World models can also be used to train agents in hypothetical situations without generating a lot of expensive data."

Benchmarks may be useless for now

On paper, these tools appear closely matched. Convergence's Proxy achieves 88% on the WebVoyager benchmark, which evaluates web agents across 643 real-world tasks on 15 popular websites like Amazon and Booking.com. OpenAI's Operator scores 87%, while Browser-Use says it reaches 89%, but only after "changing the WebVoyager codebase slightly," it conceded, "according to our needs."

These benchmark scores should really be taken with a grain of salt, though, as they can be gamed. The real test comes in practical usage for real-world cases. It's very early, the space is changing rapidly, and these products are changing almost on a daily basis. The results will depend more on the specific jobs you're trying to do, and you may want to instead rely on the vibes you get while using the different products.

Enterprise implications

The implications for enterprise automation are significant. As Witteveen points out in our video podcast conversation, where we do a deep dive into this browser-use trend, many companies are currently paying for virtual assistants operated by real people to handle basic web research and data gathering tasks. These browser-use agents could dramatically change that equation.

"If AI takes this over," Witteveen notes, "that's going to be some of the first low-hanging fruit of people losing their jobs."
"It's going to show up in some of these kinds of things."

This could feed into the robotic process automation (RPA) trend, where browser use is pulled in as just another tool for companies to automate more tasks. And as mentioned earlier, the more powerful use cases will be when an agent combines browser use with other tools, including things like Deep Research, where an LLM-driven agent uses a search tool plus browser use to do more sophisticated jobs.

Cost dynamics driving innovation

Another key factor driving rapid development is the availability of powerful open-source reasoning models like DeepSeek-R1. This allows companies building these browser-use agents to compete effectively with larger players by leveraging these models rather than building their own.

The pricing pressure is already evident. While OpenAI requires a $200 monthly ChatGPT Pro subscription to access Operator, Convergence offers limited free use (up to five uses per day) and a $20/month unlimited plan. This competitive dynamic should accelerate enterprise adoption, though clear use cases are still emerging.

Security and integration challenges

Several hurdles remain before widespread enterprise adoption. Some websites actively block automated browsing, while others require CAPTCHA verification. While OpenAI and Convergence have tools that can get past CAPTCHAs, they let users take over the task to fill them out instead of doing them directly, since the whole point of CAPTCHAs is to ensure a human is at the other end. Tools like ByteDance's UI-TARS request deep system access, which raises security concerns for enterprise deployment.

Additionally, the approach to website cooperation varies. OpenAI has worked with specific partners like Instacart, Priceline, DoorDash and Etsy, while others attempt to navigate any website. This inconsistency could impact reliability for enterprise use cases.
And of course, any time an agent hits a site requiring login details, that will slow things down, as the agent will hand control back to you to fill in those details.

For enterprises evaluating these tools, the focus should be on specific use cases where autonomous web interaction could provide clear value, whether in research, customer service, or process automation. The technology is progressing rapidly, but success will depend on matching capabilities to concrete business needs.

As this space evolves, expect to see more enterprise-focused features and potentially specialized agents for specific industries or tasks. The race between established players and innovative startups should drive both technical advancement and competitive pricing, making 2025 a crucial year for enterprise browser-use agent adoption.

For more detail on these trends and testing results, check out the full video conversation between Sam Witteveen and myself.
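
For readers who want a more concrete picture of the Generative Tree Search idea Convergence describes, here is a minimal Python sketch. It is not Convergence's implementation. The page states, actions, scores, and function names are all invented for illustration, loosely modeled on the restaurant-booking test: a toy "world model" predicts the page that follows each action, a toy "value model" scores pages, and the agent picks the action whose predicted futures look best.

```python
from typing import Dict, List, Tuple

# Hypothetical toy "world model": the predicted next page for (state, action).
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("start", "pick_restaurant_first"): "restaurant_page",
    ("start", "search_by_time"): "filtered_results",
    ("restaurant_page", "check_noon"): "no_table",
    ("filtered_results", "book_top_result"): "booked",
}

# Hypothetical toy "value model": how promising each page looks for the goal.
VALUES: Dict[str, float] = {
    "start": 0.0,
    "restaurant_page": 0.6,   # looks good at first glance
    "filtered_results": 0.4,
    "no_table": 0.0,          # dead end
    "booked": 1.0,            # goal reached
}


def available_actions(state: str) -> List[str]:
    """Actions the toy world model knows about in this state."""
    return [action for (s, action) in TRANSITIONS if s == state]


def predict(state: str, action: str) -> str:
    """Predicted page state after taking `action` in `state`."""
    return TRANSITIONS[(state, action)]


def best_future_value(state: str, depth: int) -> float:
    """Recursively expand predicted futures and return the best leaf score."""
    actions = available_actions(state)
    if depth == 0 or not actions:
        return VALUES[state]
    return max(best_future_value(predict(state, a), depth - 1) for a in actions)


def choose_action(state: str, depth: int) -> str:
    """Pick the action whose predicted subtree has the highest value."""
    return max(
        available_actions(state),
        key=lambda a: best_future_value(predict(state, a), depth - 1),
    )


if __name__ == "__main__":
    print(choose_action("start", depth=1))  # one-step lookahead
    print(choose_action("start", depth=2))  # two-step lookahead
```

With one step of lookahead, the sketch picks the restaurant first and walks into the no-table dead end. With two steps, it prefers searching by time, because that branch leads to a booking. This only illustrates the general principle of searching over predicted futures, not how either product works internally.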

