RAG Explained

Mark Lowe, 01.08.2025

Usually, when we ask ChatGPT or other LLM-based chat agents a question, the answer we receive is based on the knowledge the underlying Large Language Model acquired during training. Models typically have a cutoff date, which means information newer than that date is not included in the model. For example, if I ask what the weather is like today, the LLM won't have an answer. Most chat clients use tools like web search to work around this issue.

But what if I need to ask ChatGPT about information from a specific source, e.g. my company's internal documents? It doesn't have access to this information and therefore won't be able to answer, or it will make up an answer based on other information. This is where Retrieval Augmented Generation (RAG) comes into play.

[Image: Search Assistant Chat]

Is it... Magic?

Retrieval Augmented Generation has generated a lot of hype in recent years because of the stunning results it can achieve. It leverages an LLM's ability to answer questions based on a given text:

Hey GPT, please analyze the attached PDF and tell me what the annual earnings of 2024 were and how they compare to the 2023 earnings.

If the information can be found in the PDF, the answer will most likely be correct. It never ceases to amaze me how well these LLMs can extract information from a given text based solely on probability. It seems like magic, but it is pure mathematics.

Limitations of Prompts

This works well for single documents, but what if I want to search ALL documents of my company or all the content of a specific website? We quickly hit some limitations:

1. Context Size

Uploading ALL documents to ChatGPT is probably not a good idea for many reasons. Besides being impractical and raising obvious data governance issues, a single prompt has a maximum size, usually referred to as the context window, that it can't exceed. Typical context sizes:

| Model           | Context Size                     |
| --------------- | -------------------------------- |
| GPT-3           | 2'049 tokens (~1'537 words)      |
| GPT-4-turbo     | 128'000 tokens (~96'000 words)   |
| Claude Sonnet 4 | 200'000 tokens (~150'000 words)  |

This means a single prompt to ChatGPT (as of today) is limited to about 300 pages of text at most. If I want to search through thousands of documents, this will not work.
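As a rough sanity check, we can sketch this limit in a few lines of Python. The estimate assumes the common rule of thumb of roughly 0.75 words per token; the model names and limits below are taken from the table above:

```python
# Context limits in tokens, matching the table above.
CONTEXT_LIMITS = {
    "gpt-3": 2_049,
    "gpt-4-turbo": 128_000,
    "claude-sonnet-4": 200_000,
}

def estimate_tokens(text: str) -> int:
    """Approximate token count from word count (~0.75 words per token)."""
    return round(len(text.split()) / 0.75)

def fits_in_context(text: str, model: str) -> bool:
    """Check whether a prompt would fit into the model's context window."""
    return estimate_tokens(text) <= CONTEXT_LIMITS[model]
```

Real tokenizers count differently depending on the model, so treat this purely as a ballpark estimate.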

2. Cost

Prompts are usually billed by token: the larger the prompt, the more it costs.
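A quick back-of-the-envelope calculation makes the difference concrete. The price below is a hypothetical 2.50 USD per million input tokens, for illustration only; real prices vary by provider and change frequently:

```python
# Hypothetical price, USD per 1M input tokens (illustration only).
PRICE_PER_MILLION_INPUT_TOKENS = 2.50

def prompt_cost(token_count: int,
                price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input cost of a prompt in USD."""
    return token_count / 1_000_000 * price_per_million

# A full 128k-token context vs. a 2k-token prompt containing only the
# relevant excerpts:
full_context = prompt_cost(128_000)   # 0.32 USD
trimmed = prompt_cost(2_000)          # 0.005 USD
```

At these assumed rates, trimming the prompt to the relevant context makes each request 64 times cheaper.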

3. Prompt Speed

Large prompts take a while to execute. For research it might be OK to wait 30-60 seconds for an answer, but for other applications like voice-based chat agents, speed is crucial!

4. Answer Quality

Studies have shown that answer quality decreases the more (irrelevant) information is included in the context [1][2]. If I included my entire intranet in a prompt, there would be so much noise and irrelevant information in the context that the likelihood of receiving incorrect answers from the LLM would increase significantly. These studies also found a general quality degradation across different models for very large contexts above 32k tokens.
[1] https://arxiv.org/abs/2410.05983
[2] https://www.databricks.com/blog/long-context-rag-performance-llms

Overcoming these limitations

To overcome these limitations, we need to make sure that our prompt includes only as much relevant context as is needed to answer the question. If I ask a question like "What were the earnings in 2024?", the answer will probably be found in documents relating to finances and the 2024 fiscal year. Older earnings documents are less relevant, and HR or IT-support documents are even less relevant to this question.
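This relevance filtering can be sketched in a few lines. The sketch below uses a naive keyword-overlap score; real RAG systems use embeddings and vector search instead, but the principle is the same: keep only the documents that relate to the question. The example documents are made up:

```python
import re

def score(question: str, document: str) -> int:
    """Number of distinct words the question and document have in common."""
    q_words = set(re.findall(r"\w+", question.lower()))
    d_words = set(re.findall(r"\w+", document.lower()))
    return len(q_words & d_words)

def top_documents(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most related to the question."""
    return sorted(documents, key=lambda d: score(question, d), reverse=True)[:k]

docs = [
    "Annual report 2024: the earnings in 2024 rose to 5.2M CHF.",
    "Annual report 2023: earnings were 4.8M CHF.",
    "IT support: how to reset your password.",
]
# The finance documents rank first; the IT-support document is filtered out.
relevant = top_documents("What were the earnings in 2024?", docs, k=2)
```

Only `relevant` would then be sent to the LLM, instead of the whole corpus.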

This is where a RAG pipeline comes in.

RAG Pipelines

A RAG pipeline is a system that combines:

  • Retrieval – searching a knowledge base or corpus for relevant documents.
  • Generation – using a language model to produce an answer based on the query and retrieved documents.

This allows the model to generate accurate, up-to-date responses using external knowledge, not just what it memorized during training. By limiting the prompt to only relevant information, a RAG system ensures answers are faster, cheaper, and more accurate.
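The two stages can be sketched end to end in a few lines of Python. Here, retrieval is a naive keyword-overlap ranking standing in for real vector search, and `call_llm` is a placeholder for whatever chat-completion API you use:

```python
import re

def retrieve(question: str, corpus: list[str], k: int = 3) -> list[str]:
    """Retrieval step: rank documents by keyword overlap with the question."""
    q = set(re.findall(r"\w+", question.lower()))
    overlap = lambda d: len(q & set(re.findall(r"\w+", d.lower())))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def build_prompt(question: str, context_docs: list[str]) -> str:
    """Combine the retrieved documents and the question into one prompt."""
    context = "\n\n".join(context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def answer(question: str, corpus: list[str], call_llm) -> str:
    """Full pipeline: retrieve relevant documents, then generate an answer."""
    docs = retrieve(question, corpus)
    return call_llm(build_prompt(question, docs))
```

The LLM only ever sees the retrieved excerpts, which is what keeps the prompt small, cheap, and on-topic.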

Read more about how RAG pipelines work in the next article.

Q&A

Why do we need RAG?
Because traditional LLMs can't access private or up-to-date information and struggle with large or irrelevant contexts. RAG enables accurate, real-time answers by retrieving only the most relevant data from your own sources.

What is RAG?
RAG is a method that combines document retrieval with a language model to generate accurate, context-based answers using external knowledge.

How does a RAG pipeline work?
It retrieves relevant documents from a knowledge base and feeds them to an LLM, which then generates an answer based only on that targeted content.

Why does a large context reduce answer quality?
More context increases the chance of including irrelevant information, which can distract the model and lead to incorrect answers. Studies have also shown that answer quality on current LLMs generally decreases with contexts larger than 32k tokens.

How can KeySemantics help?

We offer a complete, state-of-the-art RAG pipeline, from crawling to UI, ready to use as a simple online service. Just enter your sitemap on our portal and we will:

  • Crawl and update your content daily (or on demand)
  • Analyze webpages, images, and documents
  • Index, vectorize, and build a knowledge graph that goes beyond a basic vector store
  • Let you query via API or embed our UI widget
  • Provide a query agent built into our API
  • Guarantee clear data governance: all data is stored in Switzerland
  • Offer self-hosting options (on-prem or cloud)

Semantic Tags

RAG, context, retrieval augmented generation, retrieval, ChatGPT, LLM, tokens, prompt, documents, pipeline, generation, accuracy, knowledge cutoff, query, context window, document search, cost, speed, internal data, search, model, knowledge base, prompt engineering, relevance filtering, noise, PDF, web search, semantic, earnings, GPT-4, training data, inference, Claude, text analysis