Is it... Magic?
Retrieval Augmented Generation (RAG) has generated a lot of hype in recent years because of the stunning results it can achieve. It builds on the ability of an LLM to answer questions based on a given text:
Hey GPT, please analyze the attached PDF and tell me what the annual earnings of 2024 were and how they compare to the 2023 earnings.
If the information can be found in the PDF, the answer will most likely be correct. It never ceases to amaze me how well these LLMs can extract information from a given text based purely on probability. It seems like magic, but it is pure mathematics.
Limitations of Prompts
This works well with single documents, but what if I want to search ALL documents of my company, or all the contents of a specific website? We will quickly hit some limitations:
1. Context Size
Uploading ALL documents to ChatGPT is probably not a good idea for many reasons. Besides being impractical and raising obvious data-governance issues, a single prompt has a maximum size - usually referred to as the context window - that it cannot exceed. Typical context sizes:
| Model | Context Size |
|---|---|
| GPT-3 | 2'049 tokens (~1'537 words) |
| GPT-4-turbo | 128'000 tokens (~96'000 words) |
| Claude Sonnet 4 | 200'000 tokens (~150'000 words) |
This means a single prompt to ChatGPT (as of today) is limited to roughly 300 pages of text at most. If I want to search through thousands of documents, this will not work.
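The page estimate above follows from a common rule of thumb: English text averages roughly 4 characters per token and about 500 words (~3'000 characters) per page. A minimal sketch of that back-of-the-envelope check (the helper names and the 4-characters-per-token ratio are assumptions, not an exact tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: English text averages ~4 characters per token."""
    return max(1, len(text) // 4)

def fits_context(text: str, context_size: int = 128_000) -> bool:
    """Check whether a text would fit in a given context window (default: GPT-4-turbo)."""
    return estimate_tokens(text) <= context_size

# A 300-page document at ~3'000 characters per page:
doc = "x" * (300 * 3_000)
print(estimate_tokens(doc))  # 225000 tokens
print(fits_context(doc))     # False -- too large for a 128k window
```

For precise counts you would use the model's actual tokenizer; this approximation is only good enough for sizing intuitions.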
2. Cost
Prompts are usually billed by token: the larger the prompt, the more it will cost.
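Because billing is linear in token count, the difference between a focused prompt and a context stuffed to the limit is dramatic. A quick sketch with hypothetical prices (the $5/$15 per-million-token rates below are made-up example values, not any provider's actual pricing):

```python
def prompt_cost(input_tokens: int, output_tokens: int,
                usd_per_1m_input: float, usd_per_1m_output: float) -> float:
    """Token-based billing: cost scales linearly with prompt and response size."""
    return (input_tokens * usd_per_1m_input
            + output_tokens * usd_per_1m_output) / 1_000_000

# Hypothetical rates: $5 per 1M input tokens, $15 per 1M output tokens.
small = prompt_cost(2_000, 500, 5.0, 15.0)    # focused, retrieval-trimmed prompt
large = prompt_cost(128_000, 500, 5.0, 15.0)  # context window filled to the brim
print(f"${small:.4f} vs ${large:.4f}")        # $0.0175 vs $0.6475
```

The stuffed prompt costs ~37x more per question, which adds up fast at any real query volume.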
3. Prompt Speed
Large prompts take a while to execute. For research it might be acceptable to wait 30-60 seconds for an answer, but for other applications, like voice-based chat agents, speed is crucial!
4. Answer Quality
Studies have shown that the quality of answers decreases as more (irrelevant) information is included in the context [1][2]. If I included my entire intranet in a prompt, there would be so much noise and irrelevant information in the context that the likelihood of receiving incorrect answers from the LLM would increase significantly. These studies have also found a general quality degradation for very large contexts above 32k tokens across different models.
[1] https://arxiv.org/abs/2410.05983
[2] https://www.databricks.com/blog/long-context-rag-performance-llms
Overcoming these limitations
To overcome these limitations, we need to make sure that our prompt includes only as much relevant context as is needed to answer the question. If I ask something like "What were the earnings in 2024?", the answer will probably be found in documents relating to finances and the 2024 fiscal year. Older earnings documents are not relevant, and HR or IT-support documents are even less relevant to this question.
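The idea of ranking documents by relevance to a question can be sketched with a deliberately naive scorer; real systems use embeddings or full-text search engines instead, and the document names and contents below are invented for illustration:

```python
def relevance_score(query: str, document: str) -> float:
    """Naive relevance: fraction of query words that also appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)

# Hypothetical corpus: only the finance docs mention earnings at all.
docs = {
    "finance_2024": "annual earnings report 2024 revenue and profit figures",
    "finance_2019": "annual earnings report 2019 revenue and profit figures",
    "it_support":   "how to reset your password and configure the vpn client",
}
query = "what were the earnings in 2024"
ranked = sorted(docs, key=lambda name: relevance_score(query, docs[name]),
                reverse=True)
print(ranked[0])  # finance_2024
```

Even this crude word-overlap score pushes the 2024 finance document to the top and the IT-support document to the bottom, which is exactly the filtering a retrieval step performs before anything reaches the LLM.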
This is where a RAG pipeline comes in.
RAG Pipelines
A RAG pipeline is a system that combines:
- Retrieval – searching a knowledge base or corpus for relevant documents.
- Generation – using a language model to produce an answer based on the query and retrieved documents.
This allows the model to generate accurate, up-to-date responses using external knowledge, not just what it memorized during training. By limiting the prompt to only the relevant information, a RAG system keeps answers faster, cheaper, and more accurate.
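The retrieve-then-generate loop described above can be sketched as a single function. The `retrieve` and `generate` callables here are placeholders (in practice a vector-database query and an LLM API call); the function signature and prompt wording are assumptions for illustration:

```python
from typing import Callable, List

def rag_answer(query: str,
               retrieve: Callable[[str], List[str]],
               generate: Callable[[str], str],
               top_k: int = 3) -> str:
    """Minimal RAG loop: fetch the most relevant chunks, then prompt the
    model with only those chunks as context."""
    chunks = retrieve(query)[:top_k]
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return generate(prompt)

# Tiny demo with stub components instead of a real search index and LLM:
fake_retrieve = lambda q: ["Earnings 2024: 12M USD", "Earnings 2023: 10M USD"]
fake_generate = lambda prompt: prompt  # a real call would return the model's answer
print(rag_answer("What were the earnings in 2024?", fake_retrieve, fake_generate))
```

The key design point is that the context window now holds only `top_k` relevant chunks, which directly addresses all four limitations above: size, cost, speed, and answer quality.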
Read more about how RAG pipelines work in the next article.


