Experiments and results
We evaluated Agentic RAG on FramesQA, which is based on the FRAMES paper. An example of a multi-hop question is:
“Of the top two most-watched TV season finales (as of June 2024), which finale ran the longest and how long was it?”
The RAG system must perform several steps before arriving at the correct answer. First, it needs to determine that the two most-watched finales are from the shows “M*A*S*H” and “Cheers.” Next, it needs to find their running times and calculate the difference in length. In many RAG configurations (vanilla RAG, or agentic RAG without sufficient context), the model can end up in the following situation:
“Despite multiple searches, we found no explicit running times for M*A*S*H or Cheers. The documents provide viewership data, but not duration in minutes or hours.”
This does not answer the question.
Fortunately, our Agentic RAG can solve this problem by first searching for the TV shows and then using the query rewriter and the Sufficient Context Agent to target the running times of M*A*S*H and Cheers. That way, Gemini can easily determine which finale was the longest and by how much:
“The M*A*S*H finale lasted 150 minutes, the longest of the highest two. It was 52 minutes longer than the Cheers finale, which lasted about 98 minutes.”
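The search, check, and rewrite loop described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation. The toy corpus and the functions `search`, `missing_facts`, `rewrite_query`, and `agentic_retrieve` are all hypothetical stand-ins. A real system would call a search engine and an LLM where this sketch uses dictionary lookups. The running times are the ones quoted in the answer above.

```python
import re

# Toy stand-in for a document corpus; keys play the role of queries that
# retrieve each snippet. The running times are the ones quoted in the text.
CORPUS = {
    "most-watched season finales": "The two most-watched season finales are M*A*S*H and Cheers.",
    "M*A*S*H finale running time": "The M*A*S*H finale ran 150 minutes.",
    "Cheers finale running time": "The Cheers finale ran 98 minutes.",
}


def search(query: str) -> list[str]:
    """Hypothetical retriever: exact-match lookup instead of a real search engine."""
    return [CORPUS[query]] if query in CORPUS else []


def missing_facts(context: list[str], needed: list[str]) -> list[str]:
    """Stand-in for a sufficient-context check: list needed facts not yet retrieved."""
    return [fact for fact in needed if CORPUS[fact] not in context]


def rewrite_query(fact: str) -> str:
    """Stand-in for a query rewriter; here a missing fact is already a usable query."""
    return fact


def agentic_retrieve(first_query: str, needed: list[str], max_rounds: int = 3) -> list[str]:
    """Search once, then issue targeted follow-up queries until nothing is missing."""
    context = search(first_query)
    for _ in range(max_rounds):
        gaps = missing_facts(context, needed)
        if not gaps:
            break
        for fact in gaps:
            context.extend(search(rewrite_query(fact)))
    return context


def running_times(context: list[str]) -> dict[str, int]:
    """Parse snippets of the form 'The <show> finale ran <N> minutes.'"""
    times = {}
    for snippet in context:
        match = re.match(r"The (.+) finale ran (\d+) minutes\.", snippet)
        if match:
            times[match.group(1)] = int(match.group(2))
    return times


context = agentic_retrieve(
    "most-watched season finales",
    ["M*A*S*H finale running time", "Cheers finale running time"],
)
times = running_times(context)
longest = max(times, key=times.get)
gap = max(times.values()) - min(times.values())
print(f"{longest} ran longest, by {gap} minutes")  # M*A*S*H ran longest, by 52 minutes
```

The first search returns only the names of the two shows. The follow-up queries fetch the running times, and those are what make the final subtraction (150 - 98 = 52) possible.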
We ran an experiment to test this capability at scale (the FramesQA corpus contains 824 queries and 2,676 PDF documents). The “vanilla” RAG configuration uses Google’s RAG engine (which includes an advanced search engine, an LLM parser, and a reranker). We compared this to Agentic RAG in two settings. In the single-corpus setting, retrieval draws only on the FramesQA documents. The cross-corpus setting also includes three other distractor datasets, and the planner agent must decide which corpus to retrieve from. This cross-corpus setup mimics a use case where a company has databases managed by separate teams. We compute accuracy by using an LLM as a judge, comparing the system’s response to the ground-truth answer in the dataset.
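As a rough sketch of the accuracy computation, assuming each judged response counts as simply correct or incorrect: the `Judge` signature, `toy_judge`, and the two sample questions below are hypothetical and exist only to exercise the function. In the experiment the judge is an LLM. Here a substring match takes its place so the example runs without any external service.

```python
from typing import Callable

# (question, response, ground_truth) -> True if the response is judged correct.
Judge = Callable[[str, str, str], bool]


def toy_judge(question: str, response: str, ground_truth: str) -> bool:
    """Placeholder for an LLM judge: case-insensitive substring match only."""
    return ground_truth.lower() in response.lower()


def accuracy(
    examples: list[tuple[str, str]],
    answer_fn: Callable[[str], str],
    judge: Judge,
) -> float:
    """Fraction of (question, ground_truth) pairs whose response the judge accepts."""
    if not examples:
        raise ValueError("no examples to score")
    correct = sum(judge(q, answer_fn(q), truth) for q, truth in examples)
    return correct / len(examples)


# Two made-up examples, only to exercise the function.
examples = [
    ("Which finale ran longer, M*A*S*H or Cheers?", "M*A*S*H"),
    ("By how many minutes?", "52"),
]
# Canned responses standing in for a RAG system's output.
canned = {
    examples[0][0]: "The M*A*S*H finale was the longer of the two.",
    examples[1][0]: "The documents do not give running times.",
}
score = accuracy(examples, canned.get, toy_judge)
print(score)  # 0.5
```

Passing the judge in as a callable keeps the scoring loop the same whether the judge is a string match, as here, or a call to a language model.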
In the cross-corpus setting, our system roughly matches its single-corpus accuracy. Even when the planner agent has to choose the correct corpus among four possibilities, it successfully routes the search query and correctly answers 90.1% of the questions. Furthermore, the single-corpus and cross-corpus versions have similar latency (within 3% on average). This shows that the Agentic RAG system can reason over multiple unrelated data sources, opening up the possibility of more flexible retrieval scenarios.


