Beyond RAG - Retrieval Agents

Mark Lowe
03.08.2025

In the previous articles of this series, I've highlighted why and when we need to use RAG and how a typical RAG pipeline works. More and more refined RAG pipelines and products supporting them have been published during 2023/2024 but are already being made obsolete again by a new information retrieval technique: Retrieval Agents.

If you haven't read the previous articles, find them here:

Part 1: RAG Explained
Part 2: How RAG Pipelines Work

(Image: Search Assistant Chat)

Quick Recap

First, a quick recap on the steps included in a typical RAG Query:

🔢 Pre-analyze question
🔍 Retrieval
🤖 Generation

In the Pre-analyze and Retrieval steps, we prepare an LLM prompt; the Generation step then uses it to produce an answer to the user's question.

With tool calling becoming available on many LLMs, we can now hand the Pre-analyze and Retrieval steps directly to the LLM.

Setting Up a Retrieval Agent

Let's rebuild our prompt from the previous posts:

SYSTEM PROMPT:
You are a helpful agent that helps users find information in a document database. Use the provided query tool to search for information that answers the user's question.

Question:
Hey RAG-Agent, what were the earnings in 2024 and how do they compare to the earnings of 2023?

Available categories are:
- Finance
- HR
- IT Support

Available Languages are:
- EN
- FR
- DE

Available Tools:
queryForInformation
- string: query
- string: language
- string: category

(tool config details are spared here for readability)
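The spared tool config could look roughly like the following sketch, written in the OpenAI-style function-calling format (the exact schema depends on your LLM provider; the descriptions and enum values here are illustrative, derived from the prompt above):

```javascript
// Hypothetical tool definition for queryForInformation.
// The enum values constrain the agent to the languages and
// categories listed in the prompt, which already reduces
// hallucinated parameter values.
const queryTool = {
  type: 'function',
  function: {
    name: 'queryForInformation',
    description: 'Search the document database for text chunks matching a query.',
    parameters: {
      type: 'object',
      properties: {
        query: { type: 'string', description: 'Search query text' },
        language: { type: 'string', enum: ['EN', 'FR', 'DE'] },
        category: { type: 'string', enum: ['Finance', 'HR', 'IT Support'] }
      },
      required: ['query', 'language', 'category']
    }
  }
};
```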

We now need to implement our queryForInformation function.

Using the KeySemantics Seek API:

```javascript
const API_BASE = 'https://portal.keysemantics.ai';
const API_KEY = '<your-api-key-here>'; // Replace with your actual API key

async function queryForInformation(query, language, category) {
  const url = `${API_BASE}/query/agents/seek?query=${encodeURIComponent(query)}&language=${encodeURIComponent(language)}&category=${encodeURIComponent(category)}`;

  try {
    const response = await fetch(url, {
      method: 'GET',
      headers: {
        'Authorization': `Bearer ${API_KEY}`,
        'Accept': 'text/plain'
      }
    });

    if (!response.ok) {
      throw new Error(`HTTP error ${response.status}: ${response.statusText}`);
    }

    // Return the matching text chunks as plain text.
    return await response.text();
  } catch (error) {
    console.error('Error querying information:', error.message);
    throw error;
  }
}
```

This code snippet calls the KeySemantics Seek API, which returns matching text chunks. The backend could also be any vector DB or search engine.
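The surrounding agent loop that dispatches the LLM's tool calls to queryForInformation could be sketched as follows. This assumes an OpenAI-compatible chat client (`llmClient` and its `chat` method are placeholders for your provider's SDK); the tool function is passed in as a parameter so the loop stays generic:

```javascript
// Minimal agent loop sketch: keep calling the LLM, execute any tool
// calls it requests, feed the results back, and stop once the model
// returns a plain answer instead of a tool call.
async function runRetrievalAgent(llmClient, queryFn, userQuestion) {
  const messages = [
    { role: 'system', content: 'Use the provided query tool to answer the question.' },
    { role: 'user', content: userQuestion }
  ];

  while (true) {
    const message = await llmClient.chat(messages);
    messages.push(message);

    // No tool calls: the agent is confident enough to answer.
    if (!message.tool_calls || message.tool_calls.length === 0) {
      return message.content;
    }

    // Execute each requested tool call and append the result
    // so the next LLM turn can refine its queries.
    for (const call of message.tool_calls) {
      const { query, language, category } = JSON.parse(call.function.arguments);
      const result = await queryFn(query, language, category);
      messages.push({ role: 'tool', tool_call_id: call.id, content: result });
    }
  }
}
```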

Inside a Retrieval Agent Query

Hey RAG-Agent, what were the earnings in 2024 and how do they compare to the earnings of 2023?

🤖 RAG-Agent:

1. 🔍 queryForInformation("Annual Report 2024", "EN", "Finance")
-- ...This is the Annual Report for the financial year of 2023/2024...
2. 🔍 queryForInformation("Annual Report 2024 earnings", "EN", "Finance")
-- ...the earnings of 2024 were 42 Mio...
3. 🔍 queryForInformation("Annual Report 2023 earnings", "EN", "Finance")
-- ...the earnings of the 2022/2023 fiscal year were 40 Mio which was an increase of...
4. 🔍 queryForInformation("earnings 2023 2024 comparison", "EN", "Finance")
-- ...the earnings of 2024 were 5% higher compared to 2023...

Answer: The earnings of 2024 were 42 Mio which is an increase of 5% compared to the previous year (40 Mio).

The RAG-Agent will continue to call our queryForInformation function until it decides that the information it has gathered is enough to confidently answer the question.

A very interesting effect here is that the agent learns more about the available data as it scans the results. Subsequent search queries incorporate terms found in previously retrieved chunks, often leading to better-formulated queries.

Handing the query capability to the LLM eliminates the need to pre-analyze the user's question because the agent will interpret the question and form queries autonomously.

Important Things To Note

Retrieval speed matters more than ever! While a "traditional" RAG system runs a single query against its data storage, an agent runs multiple. Very quick query times are essential for interaction-based systems like chatbots.

Limit and sanitize tool calls: If an agent cannot find what it is looking for in the retrieved information, it will continue to query the available tool until its context window (e.g., 128K tokens for GPT-4 Turbo) is exhausted, which leads to slow answers and high costs. It is important to limit the number of tool calls to prevent this. Also, query parameters such as language or category need to be sanitized because LLMs will sometimes hallucinate invalid values.
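A simple way to enforce both safeguards is to wrap the tool function before handing it to the agent. The following is a sketch; the call budget, the fallback language, and the allowed value lists are illustrative choices, not part of any real API:

```javascript
// Guarded tool wrapper: enforces a per-conversation call budget and
// rejects or repairs hallucinated parameter values before they reach
// the retrieval backend.
const ALLOWED_LANGUAGES = ['EN', 'FR', 'DE'];
const ALLOWED_CATEGORIES = ['Finance', 'HR', 'IT Support'];
const MAX_TOOL_CALLS = 5;

function guardQueryTool(queryFn) {
  let callCount = 0;
  return async function guardedQuery(query, language, category) {
    // Budget check: stop the agent before it fills its context window.
    if (++callCount > MAX_TOOL_CALLS) {
      return 'Tool call limit reached. Answer using the information gathered so far.';
    }
    // Sanitize hallucinated values: fall back to a safe default language...
    if (!ALLOWED_LANGUAGES.includes(language)) {
      language = 'EN';
    }
    // ...and report invalid categories back to the agent as a tool result.
    if (!ALLOWED_CATEGORIES.includes(category)) {
      return `Invalid category "${category}". Valid values: ${ALLOWED_CATEGORIES.join(', ')}.`;
    }
    return queryFn(query, language, category);
  };
}
```

Returning the limit and validation messages as ordinary tool results (rather than throwing) lets the agent recover gracefully and still produce an answer from whatever it has gathered.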

Data Governance: While data governance is important for any type of interaction with an LLM, it becomes particularly important when we start handing over control to an external LLM. Ensure that tools can only access data that is allowed to be processed by the LLM.

Recap

Retrieval Agents offer a means of greatly improving RAG pipelines. Done right, the pipeline will be able to answer even vague questions or complex comparisons. But working with agents in general is challenging due to their nondeterministic nature: safeguards need to be in place, and a rigorous quality-testing framework is important.

More information on Retrieval Agents:

https://blog.langchain.com/conversational-retrieval-agents
https://huggingface.co/learn/agents-course/en/unit2/smolagents/retrieval_agents

Q&A

Q: What are Retrieval Agents?
Retrieval Agents are a newer approach to information retrieval where the LLM actively forms and refines its own search queries based on available tools and data, removing the need for manual query pre-analysis as done in traditional RAG pipelines.

Q: How does a Retrieval Agent answer a question?
Instead of a single query, the agent issues multiple, increasingly refined queries using a tool like queryForInformation, learning from each response until it has enough context to answer confidently.

Q: What are the risks of using Retrieval Agents?
They can generate excessive tool calls, inflate cost, slow down response time, and even misuse query parameters if not properly sanitized or constrained, especially when handling sensitive internal data.

Q: Why does retrieval speed matter so much?
Because the agent runs multiple queries per request, slow retrieval can significantly delay responses, making fast, reliable data access essential for a good user experience.

How can KeySemantics help?

We offer a complete, state-of-the-art RAG pipeline from crawling to UI. Ready to use as a simple online service. Just enter your sitemap on our portal and we will:

  • Crawl and update your content daily (or on demand)
  • Analyze webpages, images, and documents
  • Index, vectorize, and build a knowledge graph — more advanced than a basic vector store
  • Enable you to query via API or embed our UI widget
  • Query Agent built-in to our API
  • Clear data governance. All data stored in Switzerland.
  • Self-Hosting options (on-prem or cloud) available.

Semantic Tags

retrieval agents, RAG, queryForInformation, information retrieval, RAG pipeline, LLM prompt, KeySemantics Seek API, retrieval speed, data governance, search query, pre-analyze, context window, language model, tool usage, query refinement, sanitization, tool call limits, query formulation, retrieval quality, earnings comparison, token limit, cost optimization, non-determinism, vector database, search engine, document database, language, category, LLM tools, Azure GPT