
Hype vs. Reality of AI Agents: GPT-4 Falls Short, Success Rate on Real Tasks Below 15%

Wed, May 29 2024 08:04 AM EST

Synced

By: Yali

As large language models continue to evolve and innovate, their performance, accuracy, and stability have seen significant improvements, as validated by various benchmark datasets.

However, for current iterations of LLMs, it seems that their overall capabilities cannot yet fully support AI agents.

Multimodal, multitask, and cross-domain capabilities have become essential requirements for AI agents as portrayed in the media. Actual performance, however, falls short of those expectations, a reminder for AI startups and tech giants alike to stay grounded and improve AI capabilities step by step.

A recent blog post highlighted the discrepancy between the promotion and real-world performance of AI agents, emphasizing that "AI agents are giants in promotion, but the reality is far from ideal."

It is undeniable that the prospect of autonomous AI agents performing complex tasks has generated great excitement. Large Language Models (LLMs) can complete multi-step workflows without human intervention through interactions with external tools and functionalities.

However, reality has proven to be more challenging than anticipated.

Benchmark tests on the WebArena leaderboard, a reproducible online environment for evaluating practical agent performance, revealed that even the best-performing model achieved a success rate of only 35.8% on real-world tasks.

WebArena leaderboard results for LLM agents on real-world tasks: the SteP model performed best, with a success rate of 35.8%, while the renowned GPT-4 achieved only 14.9%.

What is an AI agent?

The term "AI agent" has not been clearly defined, and there is much debate about what exactly constitutes an intelligent agent.

An AI agent can be defined as "an LLM endowed with the ability to act (usually by making function calls, often in a RAG-based environment) so that it can make high-level decisions about how to accomplish tasks in its environment."
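This "LLM with the ability to act" pattern is essentially a loop in which a model proposes tool calls and a runtime executes them until the model declares the task done. A minimal sketch, using a deterministic stub in place of a real LLM API (the `search` tool and all names here are hypothetical):

```python
def stub_model(history):
    """Pretend LLM: decide the next action from the conversation so far."""
    if not any(m["role"] == "tool" for m in history):
        # No tool results yet: ask the runtime to call a tool.
        return {"type": "tool_call", "name": "search", "args": {"query": "flight prices"}}
    # A tool result is available: produce the final answer.
    return {"type": "final", "content": "Cheapest flight found: $120"}

# Registry of callable tools the agent is allowed to use (hypothetical).
TOOLS = {
    "search": lambda query: f"results for '{query}'",
}

def run_agent(task, model, max_steps=5):
    """Drive the propose-act loop until the model returns a final answer."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model(history)
        if action["type"] == "final":
            return action["content"]
        result = TOOLS[action["name"]](**action["args"])
        history.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not finish within max_steps")

print(run_agent("find a cheap flight", stub_model))
```

With a real model, the `max_steps` cap and the explicit tool registry are what keep the loop bounded and constrained rather than fully open-ended.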

Currently, there are two main architectural approaches to building AI agents:

  1. Single agent: A large model handles the entire task and makes all decisions and actions based on its comprehensive contextual understanding. This approach leverages the emergent capabilities of large models, avoiding information loss from task decomposition.
  2. Multi-agent systems: Tasks are decomposed into subtasks, each handled by a smaller, more specialized agent. Instead of trying to use a large, unwieldy general-purpose agent, people can use many smaller agents to select the right strategies for specific subtasks. This approach is sometimes necessary due to practical constraints such as limitations on context window length or the need for different skill combinations.

In theory, a single agent with unlimited context length and perfect attention would be ideal. In practice, however, multi-agent systems often outperform single-agent systems on specific problems, because each agent works with a shorter, more focused context.
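The two architectures above can be contrasted in a toy sketch, where stub functions stand in for models and the planner/specialist names are invented for illustration:

```python
def single_agent(task):
    """Single-agent approach: one model sees the whole task and full context."""
    return f"[one model handled: {task}]"

# Multi-agent approach: a planner decomposes the task, and each subtask goes
# to a smaller specialist that only sees its own narrow context (all stubs).
SPECIALISTS = {
    "plan": lambda task: ["parse invoice", "enter data"],
    "parse invoice": lambda subtask: "fields extracted",
    "enter data": lambda subtask: "record saved",
}

def multi_agent(task):
    subtasks = SPECIALISTS["plan"](task)
    return [SPECIALISTS[s](s) for s in subtasks]

print(single_agent("process invoice"))
print(multi_agent("process invoice"))
```

The trade-off the article describes is visible even here: the multi-agent version keeps each step small and specialized, but information available only to the planner is lost once the task is split.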

Challenges in practice

After witnessing many attempts at building AI agents, the author believes they are still immature: costly, slow, and unreliable. Many AI agent startups seem to be waiting for a breakthrough model to kick-start the race to commercialize AI agents.

The performance of AI agents in practical applications is not yet mature, reflected in issues such as inaccurate outputs, subpar performance, high costs, liability risks, and difficulties in gaining user trust:

  • Reliability: It is well known that LLMs are prone to hallucinations and inconsistencies. Connecting multiple AI steps exacerbates these issues, especially for tasks requiring precise outputs.
  • Performance and cost: GPT-4, Gemini-1.5, and Claude Opus perform well in tool/function calls, but they are still slow and costly, especially when loops and automatic retries are needed.
  • Legal issues: Companies may be held accountable for errors made by their AI agents. A recent example is Air Canada being ordered to compensate a customer misled by an airline chatbot.
  • User trust: The "black box" nature of AI agents, together with incidents like the Air Canada case, makes it difficult for users to understand and trust their outputs. Winning user trust will be especially hard for sensitive tasks involving payments or personal information, such as paying bills or shopping.
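The reliability point can be made concrete with a bit of arithmetic: if each chained step succeeds independently with probability p, a k-step workflow succeeds with probability p^k, so per-step errors compound quickly. A minimal illustration (the independence assumption is a simplification):

```python
def workflow_success(p_step, n_steps):
    """Probability an n-step workflow succeeds, assuming independent steps."""
    return p_step ** n_steps

# Even with a very good 95%-reliable step, ten chained steps
# succeed only about 60% of the time.
print(round(workflow_success(0.95, 10), 2))  # → 0.6
```

This is why connecting multiple AI steps exacerbates hallucinations and inconsistencies: the whole chain is only as reliable as the product of its parts.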

Real-world attempts

Currently, several startups are venturing into the AI agent field, but most are still in the experimental stage or limited to invite-only use:

  • adept.ai - Raised $350 million in funding, but access remains very limited.
  • MultiOn - Funding status unknown, their API-first approach looks promising.
  • HyperWrite - Raised $2.8 million; initially an AI writing assistant, later expanded into the agent field.
  • minion.ai - Initially garnered some attention but has since gone quiet, with only a waiting list.

It seems that only MultiOn is pursuing the approach of "giving instructions and observing their execution," which aligns more closely with the promise of AI agents.

All the other companies are following a record-and-replay route, similar to robotic process automation (RPA), which may be necessary at this stage to ensure reliability.

Meanwhile, some major companies are bringing AI capabilities to desktops and browsers, and it appears they will achieve local AI integration at the system level.

OpenAI announced their Mac desktop application that can interact with the operating system screen.

At the Google I/O conference, Google demonstrated Gemini automatically handling a shopping return.

Microsoft announced Copilot Studio, which will allow developers to build AI-powered virtual agents.

The technology demos are impressive, and people can look forward to seeing how these AI agents perform in real-world scenarios upon public release, rather than only in carefully selected demo cases.

Which path will AI agents take?

The author emphasizes, "AI agents have been overhyped, with most not yet ready for critical tasks."

However, with rapid advancements in base models and architectures, he suggests that we can still expect to see more successful practical applications.

The most promising path for AI agents may be as follows:

  • The current focus should be on using AI to enhance existing tools rather than providing broad fully autonomous services.
  • A human-machine collaborative approach, involving humans in supervising and handling edge cases.
  • Setting realistic expectations based on current capabilities and limitations.

By combining tightly constrained LLMs, good evaluation data, human-machine collaborative supervision, and traditional engineering methods, reliable results can be achieved even on complex automation tasks.
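The human-machine collaborative pattern described above can be sketched as a gate: the model's proposed action is checked against a tightly constrained action space and a confidence threshold, and anything outside those bounds is escalated to a human reviewer instead of being executed. All names and thresholds here are illustrative assumptions:

```python
# Tightly constrained action space: only low-risk, repetitive actions
# may run without human sign-off.
ALLOWED_ACTIONS = {"fill_form", "scrape_page"}
CONFIDENCE_THRESHOLD = 0.9

def escalate_to_human(action):
    """Route an action to a human review queue (stubbed)."""
    return f"escalated to human: {action['name']}"

def supervised_step(action):
    """Execute automatically only inside the constrained, high-confidence zone."""
    if action["name"] in ALLOWED_ACTIONS and action.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD:
        return f"executed: {action['name']}"
    return escalate_to_human(action)

print(supervised_step({"name": "fill_form", "confidence": 0.97}))
print(supervised_step({"name": "make_payment", "confidence": 0.99}))
```

Note that a risky action like `make_payment` is escalated even at high model confidence: the constraint on the action space, not the model's self-reported confidence, is what carries the liability and trust concerns raised earlier.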

Will AI agents automate mundane repetitive tasks such as web scraping, form filling, and data entry?

Author: "Yes, absolutely."

Will AI agents automatically book holidays without human intervention?

Author: "It is unlikely at least in the near future."

Original article link: https://www.kadoa.com/blog/ai-agents-hype-vs-reality