Setting a new SOTA on the General AI Assistants (GAIA) benchmark with Trase Agent

The GAIA benchmark is the only benchmark that simultaneously tests reasoning, multi-modality handling, web browsing, and general tool-use proficiency. It’s designed to be non-gameable, since the answers can only be obtained by completing a number of steps and do not exist in the large plain-text web scrapes that LLMs are trained on. GAIA questions are conceptually simple for humans, who get around 92% correct, yet challenging for even the most advanced LLM-powered agents, which struggled to get above 15% when the paper was published in late 2023 by authors including Meta Chief Scientist Yann LeCun.

GAIA abstract

When we started on this project, the leaderboard looked like this:

  1. Sibyl System at 34.6% pass rate. Created by Baichuan, an Alibaba-backed startup.
  2. HuggingFace at 33.3%. Created by the transformers library team.
  3. Microsoft at 32.3%. Created by the AutoGen team at MSR.

All three teams had made their submission code public, but the first two borrowed most of their text-based web browsing code from the Microsoft AutoGen submission. I started by running the first two to verify their results and got 25.9% for Sibyl and 31.2% for HuggingFace.

I never could figure out why Sibyl did so much worse than what they posted and figured they must have done majority voting over multiple runs. The fact that their code ran 16 tasks in parallel told me they had a seriously high rate-limit tier at OpenAI, since all of the top three were using GPT-4o. Sibyl was supposed to be "guided by Society of Mind Theory and implements a multi-agent debate-based jury to self-refine the final answers," which sounded a little too fanciful for what I was looking for.

HuggingFace wrote a great blog post about their CodeAct agent, which specifies actions as code rather than JSON blobs and executes them via a custom Python interpreter that handles import whitelisting, execution timeouts, and excessive printing. I'd heard of CodeAct getting good results on SWE-Bench in the OpenDevin/OpenHands project, and I liked the concept, so I decided to use the HuggingFace CodeAct implementation as my starting point.
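To make the pattern concrete, here is a minimal sketch of what a CodeAct-style restricted executor can look like. This is not the actual HuggingFace interpreter; the whitelist, limits, and function names are illustrative assumptions.

```python
# Sketch of a CodeAct-style executor: the LLM emits a Python snippet as its
# "action", and a restricted executor runs it with an import whitelist, a
# wall-clock timeout, and capped stdout. Names and limits are hypothetical.
import builtins
import contextlib
import io
import multiprocessing

ALLOWED_IMPORTS = {"math", "re", "json", "datetime"}  # hypothetical whitelist
MAX_OUTPUT_CHARS = 4000                               # cap on excessive printing

def _guarded_import(name, *args, **kwargs):
    if name.split(".")[0] not in ALLOWED_IMPORTS:
        raise ImportError(f"import of '{name}' is not whitelisted")
    return __import__(name, *args, **kwargs)

def _run(code, queue):
    env = {"__builtins__": {**vars(builtins), "__import__": _guarded_import}}
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, env)
        queue.put(buffer.getvalue()[:MAX_OUTPUT_CHARS])
    except Exception as exc:
        queue.put(f"Execution error: {exc!r}")

def execute_action(code, timeout_s=10.0):
    """Run one LLM-written code action in a subprocess and return the observation."""
    queue = multiprocessing.Queue()
    process = multiprocessing.Process(target=_run, args=(code, queue))
    process.start()
    process.join(timeout_s)
    if process.is_alive():
        process.terminate()
        return "Execution error: timed out"
    return queue.get() if not queue.empty() else ""

if __name__ == "__main__":
    print(execute_action("import math\nprint(math.sqrt(2))"))   # allowed
    print(execute_action("import os\nprint(os.listdir('.'))"))  # blocked by whitelist
```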

I planned to modify their code as little as possible, as my primary goal was to assess the benefits of fine-tuning the generator LLM. One interesting thing was that the agent was pretty good at knowing when it was wrong and had to guess at the final answer. I leveraged this by adding a critic step after the agent thinks it's done calling tools, which asks the same LLM whether the chain of reasoning so far is thorough and sound, and if not, for a summary of what to do next. If the decision is to keep going, that summary is passed back as a user-role message. I also made some bug fixes to the HF custom Python interpreter, fixed how an observation was created from the final_answer tool, and, most importantly, added several new tools to help the agent with common stumbling blocks.
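A hedged sketch of that critic loop is below; the prompt wording and the `agent`/`llm` interfaces are illustrative assumptions, not the exact Trase or HuggingFace code.

```python
# Self-critique loop: after the agent proposes a final answer, ask the same LLM
# whether the reasoning is sound; if not, feed its summary back as a user message.
CRITIC_PROMPT = (
    "Here is the question and the agent's reasoning and tool calls so far:\n"
    "{trajectory}\n\n"
    "Is this chain of reasoning thorough and sound enough to commit to the final "
    "answer? Answer YES or NO. If NO, summarize what to do next."
)

def run_with_critic(agent, llm, task, max_rounds=3):
    trajectory = agent.run(task)  # agent calls tools until it proposes final_answer
    for _ in range(max_rounds):
        verdict = llm(CRITIC_PROMPT.format(trajectory=trajectory.as_text()))
        if verdict.strip().upper().startswith("YES"):
            break
        # Pass the critic's summary back as a user-role message and keep going.
        next_steps = verdict.split("\n", 1)[-1]
        trajectory = agent.resume(trajectory, user_message=next_steps)
    return trajectory.final_answer
```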

At Trase, our customers have unique action spaces and sets of question-answer pairs similar in number to the GAIA validation set. I wanted to leverage self-play and have a way for the agent to train on its own reasoning traces and solve increasingly difficult problems. The o1-preview reasoning model came out around this time, and most people seemed to think it used a process reward model (PRM) trained on step-level labels marking each step as good or bad, like the following from the OpenAI paper Let's Verify Step by Step:

Math steps

There is a good amount of literature on training these PRMs for math problems, where each step is like the ones in this example. In my head, there is a direct extension of these ideas to agents if you just consider one tool call to be a step: each tool call an agent makes could get a similar thumbs-up/down rating. However, for both GAIA and our customer datasets, we have no such step-level labels, and there's often no single correct ordering of the right actions as long as the final answer is correct. In GAIA, for example, one question requires that you look up the word counts of a list of books; it doesn't matter which order you do that in, as long as you do it. For one of our oil and gas customers using an agent to derive the chain of ownership of a mineral rights lease, the agent should call one courthouse data tool and another tax assessor's office data tool, but the order doesn't matter. This meant that my approach needed to be outcome-supervised, where the agent is rewarded for getting the final answer correct but not for the intermediate steps.
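The contrast between the two supervision signals, sketched with hypothetical helper names and a placeholder answer-normalization step:

```python
# Step-level (PRM-style) vs. outcome-level supervision; names are illustrative.
def step_level_rewards(step_labels):
    # PRM-style: every tool call carries its own thumbs-up/down label.
    return [1.0 if label == "good" else 0.0 for label in step_labels]

def outcome_reward(final_answer, ground_truth, normalize=str.strip):
    # Outcome-supervised: one reward for the whole trajectory, based only on
    # whether the final answer matches -- tool-call order is irrelevant.
    return 1.0 if normalize(final_answer) == normalize(ground_truth) else 0.0
```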

I had recently read the paper Self-Taught Reasoner (STaR), where an LLM is iteratively trained on a growing dataset of good reasoning trajectories and gets better after each round. A similar idea was rumored to be what the newly released o1-preview was using. With GAIA there are 300 blind test questions and 165 validation questions with known answers that you can train on. In the STaR paper, the LLM is trained to get better at generating rationales for answers to multiple-choice questions, which is equivalent to an RL-style policy gradient:

Base STaR
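For reference, here is my reconstruction of that objective as I recall it from the STaR paper; the notation may differ slightly from the figure.

```latex
% Reconstruction from memory of the STaR objective and its policy gradient.
% M is the model, x_i a question, \hat{r}_i a sampled rationale, \hat{y}_i the
% sampled answer, and y_i the ground-truth answer.
J(M, X, Y) = \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)}
             \left[ \mathbf{1}\!\left(\hat{y}_i = y_i\right) \right]

\nabla J(M, X, Y) = \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)}
                    \left[ \mathbf{1}\!\left(\hat{y}_i = y_i\right)
                    \nabla \log p_M(\hat{y}_i, \hat{r}_i \mid x_i) \right]
```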

I came up with a modified version of STaR to use with agents, where the dataset consists of agent trajectories for arriving at answers to open-ended questions rather than chain-of-thought rationales for multiple-choice questions. Our rationale generation consists of running the agent and collecting its action trajectory: the planning steps, code actions, and self-critiques.
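The outer loop, sketched under the assumption of hypothetical `agent`, `is_correct`, and `fine_tune` helpers (the real pipeline has more bookkeeping):

```python
# Agent-STaR outer loop: run the agent, keep only trajectories whose final
# answer is correct, fine-tune the generator LLM on them, and repeat.
def agent_star(agent, questions, answers, n_iterations=3):
    model = agent.base_model
    for _ in range(n_iterations):
        dataset = []
        for question, gold in zip(questions, answers):
            trajectory = agent.run(question, model=model)  # planning, code actions, self-critiques
            if is_correct(trajectory.final_answer, gold):
                dataset.append(trajectory.to_training_example())
        # Train on the agent's own successful trajectories, then run the next
        # round with the improved model.
        model = fine_tune(model, dataset)
    return model
```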

In the original STaR, for questions the LLM answered incorrectly, the LLM was given the actual answer to the multiple-choice question as a hint before a new rationalization was generated; the LLM was then fine-tuned as if it had come up with that rationalization without any hint, in order to give some training signal from failures. For agents that rely on calling multiple tools, providing the final answer as a hint doesn’t tell the agent much about which tools to use to arrive at that answer. Instead, our hints consist of the high-level human descriptions of how annotators solved the problem, as provided in the GAIA validation set metadata. The descriptions are numbered lists of steps, which we incrementally add to the agent’s planning step to give a subtle nudge in the right direction without leaking too much ground-truth information into the context. These hints are then removed from the agent’s trajectory before it is added to the fine-tuning dataset, as if the agent had come up with the correct answer without the hint. This ensures the agent can learn from cases where it failed.
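A hedged sketch of that hint-based rationalization step, again with illustrative helper names:

```python
# For a question the agent failed, reveal the annotator's numbered steps one at
# a time until the agent succeeds, then strip the hint so the trajectory looks
# like an unaided success before it joins the fine-tuning dataset.
def rationalize_with_hints(agent, question, gold_answer, human_steps):
    for k in range(1, len(human_steps) + 1):
        hint = "\n".join(human_steps[:k])  # first k annotator steps as the nudge
        trajectory = agent.run(question, planning_hint=hint)
        if is_correct(trajectory.final_answer, gold_answer):
            return trajectory.strip_hint().to_training_example()
    return None  # no amount of hinting worked; skip this question
```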

In the end, we got to the top of the leaderboard and have been there for a month at the time of writing. Our CEO wrote a great blog post on our success here.

Trase Systems tops leaderboard

Thinking ahead on how to do even better, I feel the most room for improvement on GAIA is in web browsing, since 70% of the validation-set questions require a web browser according to the human annotator metadata. We used the Markdown browser first put forth by the AutoGen team and later lightly modified by HuggingFace. This limits the level of interaction with cookie banners, JavaScript, etc., and we observed several questions where this was the limiting factor. The handoff between tools and the agent was also critical, and while we used the separate web browser agent that HuggingFace proved out, I found the multi-agent setup slightly annoying to debug and prompt-engineer, plus it complicates STaR training since you have one trajectory nested inside another.