How RAG Pipelines Work

Mark Lowe
01.08.2025

I explained the limitations of Large Language Models in my previous post. To overcome these, we need to ensure that the context of our RAG prompts includes only information that might be relevant to the question asked. This is not a trivial task and typically involves the following steps:

Processing:
📥 Data Ingestion
🧹 Preprocessing & Chunking
🧠 Embedding & Indexing
💾 Storage

RAG Query:
🔢Pre-analyze question
🔍 Retrieval
🤖 Generation

A typical RAG pipeline

📥 Data Ingestion

To be efficiently retrieved (searched) later, text data first has to be collected: read from websites (crawling) or extracted from documents and images.
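As a minimal sketch of the extraction step, here is a plain-text extractor built on Python's standard-library HTML parser. It is only an illustration; real pipelines typically use dedicated crawling and extraction tools:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script and style tags."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text outside of script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

Extracting from PDFs or images (OCR) requires specialized libraries, but the goal is the same: turn every source into plain text that can be chunked.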

🧹 Preprocessing & Chunking

Extracted text data is split into smaller chunks (typically paragraphs). Irrelevant content is removed.
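A simple paragraph-based chunker might look like this sketch, where `max_chars` is an assumed tuning parameter (real chunkers often also use token counts and overlapping windows):

```python
def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Split text into paragraph-based chunks, merging short
    paragraphs until a chunk approaches max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the limit
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Chunk size is a trade-off: chunks that are too small lose context, chunks that are too large dilute the embedding's meaning.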

🧠 Embedding & Indexing

In this stage a vector embedding is assigned to each chunk. A vector embedding is, in a way, an "address" given to the chunk by an embedding model, representing the meaning/topic of the encoded text. I like to think of a bookstore: when you ask for a book about economics, they might tell you it can be found on floor 2, aisle 4, at the back. "Floor 2, aisle 4, at the back" would be the vector, and books on similar topics will have similar vectors.
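To make the bookstore analogy concrete, here is how "similar vectors" is usually measured: cosine similarity. The three-dimensional vectors below are made-up toy values (real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings:
economics_book = [0.9, 0.1, 0.2]   # "floor 2, aisle 4, at the back"
finance_book   = [0.8, 0.2, 0.3]   # a nearby shelf
cookbook       = [0.1, 0.9, 0.5]   # a different section

print(cosine_similarity(economics_book, finance_book))  # close to 1
print(cosine_similarity(economics_book, cookbook))      # much lower
```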

Side-note on embedding models:
Many commercial and open-source embedding models exist for numerous use cases. The MTEB Leaderboard compares existing models, but picking the best embedding model isn't just a matter of picking the leader on the MTEB list. Proper data ingestion and chunking matter much more than the embedding model used. Speed and ease of use, which the MTEB benchmarks do not measure, are also important factors in my opinion.

💾 Storage

Here we store the chunks and embeddings in a vector store. This allows us to query for matching chunks later on.
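Conceptually, a vector store holds (chunk, embedding) pairs and answers "give me the k chunks closest to this query vector". A toy in-memory version, purely for illustration (production systems use Elasticsearch, Solr, or dedicated vector databases with approximate nearest-neighbor indexes):

```python
import math

def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

class InMemoryVectorStore:
    """Toy vector store: brute-force top-k by cosine similarity."""

    def __init__(self):
        self.entries = []  # list of (chunk_text, embedding) tuples

    def add(self, chunk: str, embedding: list[float]) -> None:
        self.entries.append((chunk, embedding))

    def query(self, query_embedding: list[float], k: int = 3) -> list[str]:
        scored = [(_cosine(query_embedding, emb), chunk)
                  for chunk, emb in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[:k]]
```

Brute-force scan is fine for a demo; at scale, ANN indexes (e.g. HNSW) make this lookup fast.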

Side-note on vector storage:
There has been a lot of hype around new vector database providers, but proven search engines like Apache Solr and Elasticsearch offer state-of-the-art fulltext search on top of vector search, which helps improve retrieval quality and allows for filters, facets, and more. Hybrid search (fulltext and vector search combined) has been shown to be more precise, faster, and more explainable than plain vector-based retrieval [1][2].
[1] https://arxiv.org/abs/2412.03736
[2] https://www.elastic.co/search-labs/blog/improving-information-retrieval-elastic-stack-hybrid
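One common way to combine fulltext and vector results is reciprocal rank fusion (RRF), which merges several ranked lists without needing to compare their raw scores. A minimal sketch, with `k=60` as the conventional default constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists (e.g. one from fulltext search, one
    from vector search). Each document scores 1/(k + rank) per list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fulltext = ["doc_a", "doc_b", "doc_c"]
vector   = ["doc_b", "doc_d", "doc_a"]
# doc_b ranks first: it appears near the top of both lists
print(reciprocal_rank_fusion([fulltext, vector]))
```

Documents found by both retrieval methods bubble up, which is exactly the behavior hybrid search exploits.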


Inside a RAG Query

When we ask questions to a RAG component, typically the following happens:

Hey RAG-Tool, what were the earnings in 2024 and how do they compare to the earnings of 2023?

🔢 Question is pre-analyzed

  1. Filter for malicious queries => Query is OK
  2. Extract a short, precise search query => "earnings 2024 2023"
  3. Which vector store should I search (Finances, HR, IT-Support,...) => "Finances"
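In practice, all three pre-analysis steps are often delegated to a single LLM call. Purely to illustrate the shape of steps 2 and 3, here is a naive keyword-based sketch; the stopword and routing tables are made-up examples:

```python
import re

# Hypothetical stopword list and store-routing keywords
STOPWORDS = {"hey", "rag", "tool", "what", "were", "the", "and", "how",
             "do", "they", "compare", "to", "of", "in"}

STORE_KEYWORDS = {
    "Finances": {"earnings", "revenue", "profit", "fiscal"},
    "HR": {"vacation", "salary", "onboarding"},
    "IT-Support": {"password", "laptop", "vpn"},
}

def pre_analyze(question: str) -> tuple[str, str]:
    """Return (search_query, target_store). A real pipeline would usually
    do this (plus malicious-query filtering) with an LLM call."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    seen, keywords = set(), []
    for t in tokens:  # deduplicate while preserving order
        if t not in STOPWORDS and t not in seen:
            seen.add(t)
            keywords.append(t)
    store = max(STORE_KEYWORDS,
                key=lambda name: len(STORE_KEYWORDS[name] & seen))
    return " ".join(keywords), store
```

For the example question above, this yields the search query "earnings 2024 2023" routed to the "Finances" store.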

🔍 Vector Storage is queried for matching text chunks

🤖 An LLM prompt is executed using the matching text chunks

Please answer the user's question based on the provided data:
QUESTION:
Hey RAG-Tool, what were the earnings in 2024 and how do they compare to the earnings of 2023?

DATA:
Earnings Document 2024
...the earnings of 2024 were 42 Mio...

Earnings Document 2023
...in the fiscal year of 2023/2024 the company had earnings of 40 Mio, which was a 20% decrease from the previous year...

Answer:

The earnings of 2024 were 42 Mio, which is an increase compared to the previous year (40 Mio).
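Assembling such a prompt is plain string templating. A minimal sketch, assuming retrieval returns (document title, text) pairs:

```python
def build_rag_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble the final LLM prompt from the user question and the
    retrieved (document_title, text) chunks."""
    data = "\n\n".join(f"{title}\n{text}" for title, text in chunks)
    return (
        "Please answer the user's question based on the provided data:\n"
        f"QUESTION:\n{question}\n\n"
        f"DATA:\n{data}\n\nAnswer:"
    )

prompt = build_rag_prompt(
    "What were the earnings in 2024?",
    [("Earnings Document 2024", "...the earnings of 2024 were 42 Mio...")],
)
print(prompt)
```

The quality of the answer depends almost entirely on whether the right chunks made it into DATA, which is why the earlier pipeline stages matter so much.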

Why not just ask ChatGPT?

The outputs of this RAG Pipeline offer the following advantages over just asking LLMs like Claude and ChatGPT:

  • The information you're searching for is not necessarily part of the LLM's training data
  • The information source becomes clear: I control what data is searched
  • Data is up to date (no cutoff)

What can RAG be used for?

Apart from the many Chatbots that are popping up everywhere, RAG also has other use cases:

🔍 Semantic Search in natural language
📧 Automated e-mail answering
☎️ Customer support automation
⚖️ Legal and compliance assistants
📄 Coding and writing assistants

Q&A

What is a RAG pipeline?
A RAG pipeline helps large language models answer questions more accurately by providing relevant, real-time information that isn’t part of their training data.

How does the processing stage work?
It involves collecting and cleaning data, splitting it into chunks, turning those into vector embeddings, and storing them so they can be quickly retrieved during question answering.

What happens during a RAG query?
The system analyzes the question, retrieves matching content from the vector store, and builds a prompt that helps the language model generate a grounded, accurate response.

Why not just ask an LLM directly?
LLMs may not have the latest or domain-specific data. RAG ensures answers are based on current, controlled sources, making responses more reliable and transparent.

How can KeySemantics help?

We offer a complete, state-of-the-art RAG pipeline from crawling to UI, ready to use as a simple online service. Just enter your sitemap on our portal and we will:

  • Crawl and update your content daily (or on demand)
  • Analyze webpages, images, and documents
  • Index, vectorize, and build a knowledge graph (more advanced than a basic vector store)
  • Enable you to query via API or embed our UI widget
  • Provide a query agent built into our API
  • Ensure clear data governance: all data is stored in Switzerland
  • Offer self-hosting options (on-prem or cloud)

Semantic Tags

retrieval augmented generation, RAG pipeline, vector embeddings, vector storage, chunking, semantic search, hybrid search, large language models, data ingestion, preprocessing, indexing, LLM prompt engineering, Elasticsearch, Apache Solr, malicious query filtering