Setting another new SOTA on the Generalized AI Assistant (GAIA) benchmark with Trase Agent

We learned quite a bit from our last Trase Agent submission that set a new state-of-the-art on the Generalized AI Assistant (GAIA) benchmark, and this time we built a new agent from scratch to pull ahead of strong recent submissions from Google and H2O. I'll review what changed between this submission and the last, giving as much detail as possible without spilling the beans too much for our upcoming paper.
The H2O team wrote a great blog post after getting to #1 on GAIA, and many of my findings with this dataset vibe with theirs. In particular, the recent emphasis on multi-agent systems is a bit overblown, and single agents are underrated: a single agent is much easier to debug and prompt engineer. I also really enjoyed this blog post by the OpenHands team on the benefits of single-agent systems and agree with most of their points as well. The real benefit of multi-agent systems is that different teams can develop agents independently (which we leveraged), but I still don't see the need for more than two agents on GAIA. We did find it critical to get the right tasks to the right LLMs, though, and used a combination of Claude, Gemini, o1, and DeepSeek R1.
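As a rough illustration of what that routing can look like in code, here is a minimal sketch; the task attributes and model assignments are placeholders, not our actual routing logic.

```python
# Illustrative sketch only: the task attributes and model choices below are
# hypothetical placeholders, not our production routing rules.
from dataclasses import dataclass


@dataclass
class Task:
    requires_browser: bool
    requires_images: bool
    heavy_reasoning: bool


def route_model(task: Task) -> str:
    """Pick an LLM for a task based on coarse task attributes."""
    if task.requires_images:
        return "gemini"       # vision-heavy tasks -> multimodal model (placeholder)
    if task.heavy_reasoning:
        return "o1"           # long-horizon math/logic -> reasoning model (placeholder)
    if task.requires_browser:
        return "claude"       # tool-call-heavy browsing -> strong tool user (placeholder)
    return "deepseek-r1"      # everything else (placeholder)
```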
Web browsing was one of the areas I pointed out in my last post as something we could improve, given that 70% of GAIA tasks require a browser. We tested several alternatives to the typical markdown approach, including using the accessibility tree like Agent-E and set-of-mark (SoM) screenshots like WebVoyager and AutoGen Magentic-One. The breadth of sites in this benchmark revealed that there isn't a one-size-fits-all way to represent a given page.
Dense sites can have an accessibility tree that takes up too much of the agent's context window. A markdown representation can be a high-fidelity rendering of the original content, but what happens when the question is about the indentation of stanzas in a poem, which is lost during conversion to markdown? Screenshots can also get you into trouble, especially for tabular data, where multimodal LLMs don't have the visual acuity to read small numbers. We found that the trick to getting the most robust answers was combining multiple modalities and letting the agent resolve discrepancies between them with follow-up tool calls.
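To make that concrete, here's a minimal sketch of a combined observation, assuming Playwright and markdownify; the helper and the observation layout are illustrative, not our actual implementation.

```python
# Minimal sketch, assuming Playwright and markdownify are installed.
# The observation layout is illustrative, not our exact prompt format.
from playwright.sync_api import sync_playwright
from markdownify import markdownify


def observe(url: str) -> dict:
    """Capture several representations of the same page so the agent can
    cross-check them against each other and resolve discrepancies."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        observation = {
            # High-fidelity text, but layout details like indentation are lost.
            "markdown": markdownify(page.content()),
            # Compact structural view; can blow up the context on dense pages.
            "accessibility_tree": page.accessibility.snapshot(),
            # Pixels for layout questions; risky for small numbers and tables.
            "screenshot_png": page.screenshot(full_page=True),
        }
        browser.close()
    return observation
```

When the views disagree (say, the markdown and the screenshot imply different values in a table), the agent is prompted to issue follow-up tool calls rather than pick one arbitrarily.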
In our last submission, we copied the approach to planning/replanning steps used by most open-source GAIA agents, where every n iterations the facts gathered so far are collected and the agent's plan is updated based on those facts. However, as we watched our agent click through menus trying to download the highest-resolution image from a Wikipedia page, we saw its chain of thought get interrupted by a replanning step, after which it went off track and did a Google search to find the image instead. In this submission, we sought to minimize these periodic interruptions to the agent's chain of thought as much as possible, using a method that hasn't been published yet but relies on letting the agent decide when to update the facts and plan.
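I can't describe our actual method before the paper, but one generic way to let the agent own this decision is to expose replanning as an ordinary tool it can choose to call, rather than a step forced on it every n iterations. The schema below is a hypothetical sketch of that general idea, not our implementation.

```python
# Hypothetical sketch: replanning as a tool the agent calls when it decides to,
# instead of a forced interruption every n iterations.
UPDATE_PLAN_TOOL = {
    "name": "update_facts_and_plan",
    "description": (
        "Call this ONLY when new information invalidates the current plan "
        "or a major milestone has been completed. Do not call it in the "
        "middle of a multi-step interaction such as a download flow."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "new_facts": {"type": "array", "items": {"type": "string"}},
            "revised_plan": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["new_facts", "revised_plan"],
    },
}


def handle_tool_call(name: str, args: dict, state: dict) -> str:
    """Update the agent's facts and plan only when the agent itself asks to."""
    if name == "update_facts_and_plan":
        state["facts"].extend(args["new_facts"])
        state["plan"] = args["revised_plan"]
        return "Facts and plan updated."
    return f"Unknown tool: {name}"
```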
We also improved our critic in this submission, leveraging some of the latest reasoning models to decide when we should stop calling tools. By giving lots of structure to our facts + planning structured output, we could prompt the critic to keep going until key information had been verified multiple ways. This helped us catch cases where the agent would answer a question based on a single screenshot without confirming with other tools and sites. With our new critic setup, the agent has a habit of being extremely thorough, sometimes overly so: it wants to click through every single YouTube video on a page with hundreds of videos. Finding the balance between an overly critical critic and an overly lenient one is probably the hardest and, at the same time, most important thing to get right on this benchmark. We came up with a way to ground the critic's behavior in actual human behavior, the details of which will be published in an upcoming paper.
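To give a flavor of what "lots of structure" means here, the sketch below shows roughly the shape of state a critic can be prompted over and the stopping condition we push it toward. The schema and the deterministic check are illustrative only; in our setup the critic is itself a reasoning model, not a hand-written rule.

```python
# Illustrative facts + plan schema and stopping rule; the field names are made
# up for this sketch, and the real critic is an LLM prompted over this state.
from typing import List, Optional
from pydantic import BaseModel


class Fact(BaseModel):
    statement: str
    sources: List[str]             # which tools/sites produced this fact
    independently_verified: bool   # confirmed via a second tool or site?


class AgentState(BaseModel):
    facts: List[Fact]
    plan: List[str]
    proposed_answer: Optional[str] = None


def critic_should_stop(state: AgentState) -> bool:
    """Stop only when an answer exists and every supporting fact has been
    cross-checked at least two ways; otherwise keep calling tools."""
    if state.proposed_answer is None:
        return False
    return all(
        fact.independently_verified and len(fact.sources) >= 2
        for fact in state.facts
    )
```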
Last but certainly not least were new tools. We added tons of new tools, way more than last time, because it's always less error-prone to make an API call than to get an agent to nail interacting with many different related pages. Our agent can still have a tough time answering questions that involve data spread across many different pages, so we try to download the data directly whenever possible.
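For example, rather than having the agent page through Wikipedia's revision history UI, a single call to the public MediaWiki API is far less error-prone. Here is a minimal sketch (the helper name is ours, and error handling is omitted):

```python
# Minimal sketch of a "just download the data" tool: hit the public MediaWiki
# API directly instead of having the agent click through the revision history UI.
import requests


def wikipedia_revisions(title: str, limit: int = 50) -> list:
    """Return recent revisions (timestamp, user, comment) for a Wikipedia page."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "revisions",
            "titles": title,
            "rvlimit": limit,
            "rvprop": "timestamp|user|comment",
            "format": "json",
        },
        timeout=30,
    )
    pages = resp.json()["query"]["pages"]
    return next(iter(pages.values()))["revisions"]
```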
One area for improvement we've identified is reasoning over images, which is also prevalent in the Humanity's Last Exam benchmark. We found that LLMs usually fail on these problems because they can't see enough detail in the image, not because of a lack of reasoning ability. For example, there is one question that asks for the difference in the standard deviations of a grid of numbers, where the numbers come in two different colors (a toy version is sketched below). The major LLMs, including Claude, Gemini, and the GPT family, can extract the numbers perfectly, but we found none that get the colors right. Similarly for chess boards: o1 can solve some problems if we give the board as text, but when it has to work from the image representation, it fails to read the board correctly.
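To be clear about where the difficulty lies in that grid question: once the colors are read correctly, the rest is trivial. With made-up numbers standing in for the real grid, the whole thing reduces to two standard deviations and a subtraction.

```python
# Toy version of the two-color grid question with made-up numbers; whether the
# question wants population or sample standard deviation depends on its wording.
import numpy as np

red = np.array([3, 7, 2, 9, 4], dtype=float)    # hypothetical red numbers
blue = np.array([5, 1, 8, 6, 2], dtype=float)   # hypothetical blue numbers

diff = abs(np.std(red) - np.std(blue))          # population std devs by default
print(round(diff, 3))
```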
We hope to improve in these areas as time goes on. Like H2O emphasized, the key is to keep your agent simple enough that you can take advantage of newer, more powerful models as they are released, and not go nuts building a complicated multi-agent system. As we were working on this submission, OpenAI published on their website that their new Deep Research agent got 67.36% pass@1 on the GAIA validation set and 72.57% with 64 votes using their new o3 model. We actually scored slightly higher than that for pass@1, and while I can't disclose how many votes we used for the submitted validation set results, it was far fewer than 64. Given that we've found many problems where having images in context is critical, I found it interesting that o3 does not support images at this time; I have a feeling the version they are using is an unreleased one that does. Another interesting thing is that they didn't post test set results anywhere, nor any results at all to the leaderboard.
Our CEO wrote another great blog post describing our journey to the top of the leaderboard again, so definitely check that out.