Understanding AI Agents With Multiple Personalities

In the coming years, AI agents are widely expected to take over a growing number of chores on behalf of humans, including using computers and smartphones. For now, though, they are too error-prone to be of much use.

A new agent called S2, created by the startup Simular AI, combines frontier models with models specialized for using computers. The agent achieves state-of-the-art performance on tasks such as using applications and manipulating files, and suggests that switching between different models in different situations may help agents improve.

“Computer-using agents are different from large language models and different from coding,” said Ang Li, co-founder and CEO of Simular. “It’s a different kind of problem.”

In Simular’s approach, a powerful general-purpose AI model, such as OpenAI’s GPT-4o or Anthropic’s Claude 3.7, is used to reason about how best to accomplish the task at hand, while smaller open source models step in for tasks such as interpreting web pages.

Li, who was a researcher at Google DeepMind before founding Simular in 2023, explained that large language models are good at planning but not as good at recognizing the elements of a graphical user interface.
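The division of labor described above can be illustrated with a toy sketch: one component plans the steps, and a separate one locates the matching element on screen. This is not Simular's code. The names `UIElement`, `toy_planner`, `toy_grounder`, and `run_agent` are hypothetical, and both "models" are simple stand-in functions rather than real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class UIElement:
    """A clickable element on a (simulated) screen."""
    label: str
    x: int
    y: int


def toy_planner(task: str) -> List[str]:
    """Hypothetical stand-in for a large general-purpose model.

    Splits a task into ordered steps. A real planner would query a model.
    """
    return [step.strip() for step in task.split(" then ")]


def toy_grounder(step: str, screen: List[UIElement]) -> Optional[UIElement]:
    """Hypothetical stand-in for a small GUI-grounding model.

    Returns the first element whose label appears in the step text.
    """
    for element in screen:
        if element.label.lower() in step.lower():
            return element
    return None


def run_agent(
    task: str,
    screen: List[UIElement],
    planner: Callable[[str], List[str]] = toy_planner,
    grounder: Callable[[str, List[UIElement]], Optional[UIElement]] = toy_grounder,
) -> List[Tuple[str, Optional[Tuple[int, int]]]]:
    """Route planning to one component and element-finding to another."""
    actions = []
    for step in planner(task):
        target = grounder(step, screen)
        # None marks a step the grounder could not locate on screen.
        coords = (target.x, target.y) if target is not None else None
        actions.append((step, coords))
    return actions
```

Calling `run_agent("open File then click Save", screen)` with a screen containing "File" and "Save" elements yields one click coordinate per step. Because the planner and grounder are passed in as arguments, either one could be swapped for a different model without touching the other.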

S2 is designed to learn from experience, using an external memory module that records actions and user feedback and draws on those records to improve future actions.
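As a rough illustration of what such a memory module might look like, the sketch below stores past attempts along with a feedback score and returns the best-rated attempt when the same task comes up again. The class `ExperienceMemory`, its methods, and the integer feedback scale are all assumptions made for this example. The article does not describe how S2 stores or retrieves its records.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class ExperienceMemory:
    """Hypothetical external memory: task -> list of (actions, feedback)."""

    def __init__(self) -> None:
        self._records: Dict[str, List[Tuple[List[str], int]]] = defaultdict(list)

    def record(self, task: str, actions: List[str], feedback: int) -> None:
        """Store one attempt. Higher feedback means the user was happier."""
        # Copy the list so later changes by the caller do not alter the record.
        self._records[task].append((list(actions), feedback))

    def best_actions(self, task: str) -> Optional[List[str]]:
        """Return the highest-rated past attempt for this task, if any."""
        attempts = self._records.get(task)
        if not attempts:
            return None
        actions, _ = max(attempts, key=lambda attempt: attempt[1])
        return list(actions)
```

An agent built this way would check `best_actions(task)` before planning from scratch, and call `record(...)` after each run. Matching on the exact task string is a simplification; a practical system would need some way to recognize similar tasks.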

On particularly complex tasks, S2 performs better than any other model on OSWorld, a benchmark that measures an agent’s ability to use a computer operating system.

For example, S2 can complete 34.5% of tasks that involve 50 steps, beating OpenAI’s Operator, which can complete 32%. Similarly, S2 scores 50% on AndroidWorld, a benchmark for smartphone-using agents, while the next best agent scores 46%.

Victor Zhong, a computer scientist at the University of Waterloo in Canada and one of the creators of OSWorld, believes that future large AI models may incorporate training data that helps them understand the visual world and make sense of graphical user interfaces.

“This will help agents navigate GUIs with much higher accuracy,” Zhong said. “I think in the meantime, before such fundamental breakthroughs, state-of-the-art systems will resemble Simular’s in that they combine multiple models to patch the limitations of a single model.”

To prepare for this column, I used Simular to book flights and search Amazon for deals, and it seemed better than some of the open source agents I tried last year, including AutoGen and vimGPT.

But even the smartest AI agents still seem to be troubled by edge cases and occasionally exhibit strange behavior. In one instance, when I asked S2 to help find contact information for the researchers behind OSWorld, the agent got stuck in a loop, bouncing between the project page and the login page for OSWorld’s Discord.

OSWorld’s benchmarks show why agents are currently more hype than reality. While humans can complete 72% of OSWorld tasks, the figure reported for agents on complex tasks is 38%. That said, when the benchmark was launched in April 2024, the best agent could complete only 12% of the tasks.
