RAG Pipeline Demo: Understanding Retrieval Augmented Generation

This project is a deep, production-aligned demonstration of a Retrieval Augmented Generation (RAG) system applied to realistic insurance documents.

Rather than hiding complexity, this demo makes every stage observable: document ingestion, chunking, embeddings, vector search, retrieval behavior, and how the LLM ultimately produces grounded answers.

This post walks through the system exactly as an insurance AI engineer would debug, evaluate, and productionize it. I’ve also written it so that even if you’ve never touched a RAG system before, you’ll understand what’s happening at each stage and why it matters.


Project Directory Structure

The repository is intentionally structured to mirror a real RAG service, with clear separation between ingestion, querying, exploration, and UI layers.

rag-documents-demo/
├── docs/                          # Mock insurance documents
│   ├── property_insurance_policy.txt
│   ├── motor_claims_procedure.txt
│   ├── underwriting_guidelines_commercial_property.txt
│   ├── business_interruption_policy.txt
│   ├── cyber_insurance_policy.txt
│   └── claims_faq.txt
│
├── chroma_db/                     # Persisted vector database
│
├── ingest.py                      # One-time document ingestion
├── explore_chunks.py              # Chunk inspection & validation
├── explore_embeddings.py          # Embedding + vector search inspection
│
├── query_cli.py                   # Interactive CLI RAG interface
├── app.py                         # Streamlit web UI
│
├── requirements.txt               # Python dependencies
├── .env.example                   # Environment variable template
└── README.md

This separation allows each stage of the RAG lifecycle to be inspected independently, which is critical when debugging hallucinations or retrieval failures.


Step 1: Document Ingestion

The ingestion phase loads raw insurance documents, splits them into chunks, creates embeddings, and stores them in a persistent vector database.

Think of ingestion as the preparation stage. It’s similar to how a new claims handler would read through every policy document on their first week, highlight the important sections, and organise their notes so they can find answers quickly later. The system does this once upfront, so every future question can be answered in seconds rather than requiring a full search through every document.

Command

python3 ingest.py

Execution Output

INSURANCE DOCUMENTS INGESTION PIPELINE
================================================================================
Loaded 6 documents

property_insurance_policy.txt: 4,205 characters
motor_claims_procedure.txt: 7,453 characters
underwriting_guidelines_commercial_property.txt: 12,204 characters
business_interruption_policy.txt: 13,200 characters
cyber_insurance_policy.txt: 17,260 characters
claims_faq.txt: 18,483 characters

Splitting documents into chunks (size=1000, overlap=200)
Created 96 chunks
Average chunk size: 824 characters

Creating embeddings with text-embedding-3-small
Persisted vector store to chroma_db

INGESTION COMPLETE
================================================================================

What This Means

  • Each document is loaded with source metadata so the system always knows which file an answer came from
  • Documents are split into overlapping chunks to preserve context (more on this below)
  • Each chunk is embedded exactly once, converted into a numerical fingerprint the system can search against
  • The vector database is saved to disk and reused across every future query

This mirrors production best practice: ingestion is a batch job run once (or when documents change), not something that happens every time someone asks a question.
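The load-with-metadata step can be sketched in plain Python. This is a stdlib-only stand-in for what ingest.py does with LangChain's DirectoryLoader; the throwaway folder and sample content here are illustrative, not the real docs:

```python
import tempfile
from pathlib import Path

def load_documents(folder: Path) -> list[dict]:
    """Load every .txt file, attaching the filename as source metadata
    so answers can later be traced back to the originating document."""
    return [
        {"source": path.name, "text": path.read_text()}
        for path in sorted(folder.glob("*.txt"))
    ]

# Demonstrate with a throwaway folder standing in for docs/
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "claims_faq.txt").write_text(
        "Q: How do I report a claim?\nA: Call the claims line."
    )
    documents = load_documents(folder)
```

Every downstream stage carries that `source` field along, which is what makes the expandable citations in the web UI possible.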


Step 2: Chunking. Why We Break Documents Into Pieces

Chunking is the most important design decision in any RAG system. Poor chunking guarantees poor retrieval, and poor retrieval means the AI gives bad answers regardless of how good the language model is.

Why not just feed the whole document to the AI? Language models have a limited context window, which is the maximum amount of text they can process at once. Even with modern models that accept large inputs, sending entire documents is wasteful and expensive. More importantly, it’s less accurate. If you dump a 30-page policy document into the AI and ask about exclusions, the model has to find the relevant paragraph buried in thousands of words of irrelevant content. It’s like asking someone to find a specific clause by reading an entire filing cabinet instead of going straight to the right folder.

Instead, we break each document into smaller, focused pieces called chunks, and only send the most relevant ones to the AI when a question is asked.

Command

python3 explore_chunks.py

Chunking Configuration

chunk_size: 1000 characters
chunk_overlap: 200 characters
separators:
- Paragraph breaks
- Line breaks
- Spaces
- Characters (fallback)

Chunk size (1000 characters) means each piece is roughly a long paragraph. This is large enough to contain a complete thought (an entire exclusion clause, a full FAQ answer, or a complete step in a claims process) but small enough that it stays focused on one topic.

Chunk overlap (200 characters) is the clever part. Imagine you’re cutting a long document with scissors. If you cut cleanly between paragraphs, you might separate a sentence from the context that makes it meaningful. For example, a policy might say “Subject to the conditions in Section 3.2 above, the following exclusions apply…” and if Section 3.2 ended up in the previous chunk, the exclusions chunk loses critical context. The 200-character overlap means each chunk shares its edges with its neighbours, like overlapping tiles on a roof. Nothing falls through the gaps.
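The overlap mechanism can be illustrated with a simplified character-window splitter. This is a toy version: the real RecursiveCharacterTextSplitter also respects paragraph and sentence boundaries, which this sketch deliberately ignores:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows; each window starts
    `overlap` characters before the previous one ended."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# 2,500 characters of varied text to make the overlap visible
text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text)
# The tail of each chunk repeats as the head of the next,
# so no sentence is stranded without its surrounding context.
```

Comparing `chunks[0][-200:]` with `chunks[1][:200]` confirms that consecutive chunks share their 200-character edge, which is the "overlapping roof tiles" effect described above.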

Chunk Statistics

Total chunks: 96
Average size: 824 characters
Min size: 115 characters
Max size: 996 characters

Chunks per Document

business_interruption_policy.txt: 18
claims_faq.txt: 25
cyber_insurance_policy.txt: 22
motor_claims_procedure.txt: 9
property_insurance_policy.txt: 6
underwriting_guidelines_commercial_property.txt: 16

Notice the claims FAQ produced the most chunks (25). That’s because FAQs have natural paragraph breaks between each question-answer pair, and the splitter respects those boundaries. This is exactly what we want. Each FAQ entry becomes its own chunk, so when someone asks “how do I report a claim?” the system retrieves that specific Q&A rather than a random slice of text.

Why This Matters for Answer Quality

  • Exclusions stay grouped so the AI can list all exclusions from a single retrieval
  • Definitions are not split mid-sentence, so the AI never sees half a definition
  • Claims workflows remain sequential. Step 1, 2, 3 stay together
  • FAQ questions stay paired with their answers. The system never retrieves a question without its answer

Chunk inspection is how you prevent hallucinations before they happen. If the AI is giving wrong answers, the first thing you check is whether the chunks themselves make sense, because the AI can only work with what it’s given.


Step 3: Embeddings. Teaching the Computer to Understand Meaning

This is where the system goes from working with text to working with meaning, and it’s the core technology that makes intelligent search possible.

The problem with traditional search: If you search a document for the word “exclusion” you’ll find every mention of that exact word. But what if the policy says “this coverage does not extend to…” or “the following are not covered…”? Traditional keyword search misses those entirely, even though they mean the same thing. And if someone asks “what am I NOT covered for?” there’s no keyword match at all, despite it being the same question.

What embeddings do: Each chunk of text is converted into a list of 1,536 numbers, called a vector, that represents what the text means, not just what words it contains. Think of it like plotting every chunk on an enormous map with 1,536 dimensions. Chunks that discuss similar topics end up close together on this map, even if they use completely different words.

So “exclusions under this policy” and “what am I NOT covered for?” end up in the same neighbourhood on the map, because they’re about the same concept. Meanwhile, “exclusion zone” (a geography term) would be plotted far away, because its meaning is different despite sharing the word “exclusion”.

Embedding Properties

  • 1,536 dimensions per chunk. Each chunk is represented by 1,536 numbers, giving the system a rich understanding of meaning
  • Zero-centred values. The numbers range around zero, which is standard for this type of mathematical representation
  • Optimised for cosine similarity. The system measures “closeness” by comparing the angle between vectors, not the raw distance

Example Embedding

Vector (first 10 of 1,536 dimensions):
[-0.0023, 0.0599, 0.0538, 0.0673, -0.0519,
 0.0266, -0.0368, 0.0067, 0.0085, 0.0493]

These individual numbers don’t mean anything on their own. You can’t look at 0.0599 and say “that’s the insurance dimension.” The meaning emerges from the pattern across all 1,536 numbers taken together, and specifically from how one chunk’s pattern compares to another’s. Two chunks with similar patterns are about similar topics. That’s the entire principle behind semantic search.
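The “closeness” comparison is cosine similarity, which takes only a few lines to compute. The vectors below are tiny made-up illustrations, not real embeddings, but the relationship they demonstrate is the one real embeddings exhibit:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real ones have 1,536 dimensions)
exclusions_chunk = [0.8, 0.1, 0.3, 0.2]
not_covered_query = [0.7, 0.2, 0.4, 0.1]   # similar meaning, different words
geography_chunk = [0.1, 0.9, -0.2, -0.5]   # unrelated topic

sim_related = cosine_similarity(exclusions_chunk, not_covered_query)
sim_unrelated = cosine_similarity(exclusions_chunk, geography_chunk)
# sim_related comes out much higher than sim_unrelated
```

Because the comparison uses the angle between vectors rather than their raw distance, it is insensitive to vector length, which is why it is the standard metric for embedding search.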

The Trade-offs of This Approach

Converting text into vectors is what makes the whole system work, but it comes with trade-offs worth understanding:

On the plus side, retrieval is extremely fast because the search is just maths on numbers rather than scanning through text. It also understands semantic meaning, so you get results based on what text means rather than which keywords it contains.

On the other hand, every piece of text must be converted to vectors before it can be searched, and that conversion costs API calls. There’s also a storage cost: 96 chunks multiplied by 1,536 numbers multiplied by 4 bytes per number comes to roughly 590KB for this demo. That’s trivial at this scale, but it grows linearly with your document corpus and becomes a real consideration when you’re indexing thousands of policy documents.
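The storage figure follows directly from the dimensions, assuming 4-byte float32 values, which is the common default for vector stores:

```python
chunks = 96
dimensions = 1536
bytes_per_value = 4  # float32

total_bytes = chunks * dimensions * bytes_per_value
print(f"{total_bytes:,} bytes ≈ {total_bytes / 1000:.0f} KB")  # 589,824 bytes ≈ 590 KB
```

Scaling the same arithmetic to 100,000 chunks gives roughly 614 MB, which is when a managed vector database starts to earn its keep.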


Step 4: Semantic Similarity Search. Finding the Right Answers

When a user asks a question, the exact same embedding process is applied to their query, turning it into another point on that 1,536-dimension map. The system then finds the chunks that are closest to the query on that map.

This is like walking into a library where every book is arranged by topic rather than alphabetically. You describe what you’re looking for, the librarian figures out which section that belongs in, and pulls the most relevant books from that shelf. Except this librarian understands meaning, not just keywords.

Example Query

What does the cyber policy exclude?

Retrieved Chunks

Rank 1 - cyber_insurance_policy.txt (score: 0.35)
Rank 2 - cyber_insurance_policy.txt (score: 0.34)
Rank 3 - business_interruption_policy.txt (score: 0.24)

The scores represent how semantically close each chunk is to the question. A score of 0.35 means strong relevance; that chunk is about the same topic as the question. The system correctly identifies that the top two results come from the cyber policy (which is exactly where exclusions for cyber coverage would be), and also surfaces a business interruption chunk that likely discusses related exclusions.

What “good” scores look like: In real-world RAG systems, similarity scores typically fall between 0.2 and 0.5 for relevant results. You won’t see scores of 0.9 or 1.0 unless the query is almost identical to the chunk text. A score of 0.3 to 0.4 indicates the system has found genuinely relevant content. Not an exact match, but a strong semantic relationship.
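Stripped of the database machinery, the search itself is just “score every stored chunk against the query and keep the top k.” A toy in-memory version, with made-up 3-dimensional vectors and the demo’s filenames standing in for stored chunk embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy index: (source, vector) pairs standing in for stored chunk embeddings
index = [
    ("cyber_insurance_policy.txt", [0.9, 0.1, 0.0]),
    ("cyber_insurance_policy.txt", [0.8, 0.2, 0.1]),
    ("business_interruption_policy.txt", [0.5, 0.5, 0.2]),
    ("claims_faq.txt", [0.0, 0.1, 0.9]),
]

def retrieve(query_vec: list[float], k: int = 3) -> list[tuple[float, str]]:
    """Score every stored vector against the query, return the k best."""
    scored = [(cosine(query_vec, vec), source) for source, vec in index]
    return sorted(scored, reverse=True)[:k]

# A query vector "near" the cyber policy chunks
results = retrieve([0.9, 0.2, 0.0])
# Both cyber chunks rank first, the business interruption chunk third,
# mirroring the ranked output above.
```

A real vector database does the same ranking, but with approximate nearest-neighbour indexes so it never has to score every chunk exhaustively.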


Step 5: Interactive Retrieval (Raw View)

Interactive mode shows what the retriever returns before the LLM reasons over it. This is the debugging view. It lets you see exactly what context the AI will receive, so you can understand why it gives the answers it does.

Command

python3 explore_embeddings.py

Example Query: “waiting period”

Rank 1 - business_interruption_policy.txt (0.12)
Rank 2 - claims_faq.txt (0.05)
Rank 3 - claims_faq.txt (0.05)
Rank 4 - claims_faq.txt (0.01)

Why This Looks Noisy

Notice the scores are much lower here (0.01 to 0.12) compared to the cyber exclusions query. That’s because “waiting period” is a short, ambiguous query. Multiple documents discuss timing concepts in different contexts (claims processing times, business interruption waiting periods, policy cooling-off periods). The retriever casts a wide net because it can’t be sure which “waiting period” the user means.

  • The query is short and ambiguous. More specific questions produce tighter results
  • Multiple documents discuss timing concepts in different ways
  • The system intentionally prioritises recall (finding everything potentially relevant) over precision (only returning perfect matches)

This is by design. The LLM in the next stage acts as the intelligent filter. It reads all the retrieved chunks, works out which ones actually answer the question, and synthesises a coherent response while ignoring the noise. The retriever’s job is to make sure the right information is in the mix; the LLM’s job is to make sense of it.


Step 6: LangChain Orchestration

LangChain is the framework that connects all of these components (document loading, chunking, embedding, vector storage, retrieval, and LLM generation) into a single pipeline.

Core Components

DirectoryLoader("docs")                        # Load documents from folder
RecursiveCharacterTextSplitter(...)            # Split into chunks
OpenAIEmbeddings(...)                          # Convert to vectors
Chroma.from_documents(...)                     # Store in vector database
ConversationalRetrievalChain.from_llm(...)     # Wire it all together

Without a framework like LangChain, you’d be writing hundreds of lines of glue code to pass data between these stages, manage conversation history, format prompts, and handle errors. LangChain removes that boilerplate while keeping the behaviour explicit and debuggable. You can inspect what’s happening at every stage, which matters when you need to understand why the system gave a particular answer.

When to Use Which Framework

The project also includes a LlamaIndex implementation for comparison. The choice between the two comes down to what you’re building. LlamaIndex is purpose-built for RAG and document Q&A. It’s more opinionated, has less boilerplate, and gets you to a working retrieval system faster. This insurance demo is a textbook LlamaIndex use case. LangChain is the better choice when your system needs to go beyond retrieval into agents, complex multi-step chains, tool calling, or memory systems. It’s more flexible but comes with more wiring.

If your project is “ask questions, get grounded answers from documents,” start with LlamaIndex. If your project is “orchestrate multiple AI capabilities, call external tools, and maintain complex state,” reach for LangChain. Both are production-capable and both are included in this demo so you can compare them directly.


CLI and Web Interfaces

Two user-facing interfaces are provided:

  • CLI for debugging and evaluation, useful for testing queries quickly and inspecting raw retrieval results
  • Streamlit UI for demonstration, providing a clean chat interface with expandable source citations and conversation history

Both interfaces use the exact same retrieval and generation pipeline underneath. The only difference is how the results are displayed. This is important because it means any answer you get from the web UI is identical to what the CLI would produce, making it easy to test and debug without switching tools.


What’s Missing for Production

This demo is deliberately focused on correctness and observability, making every stage visible and verifiable. A production deployment would add the resilience, scale, and governance layers that enterprise systems require:

Required Additions

  • Error handling and retries for graceful recovery when API calls fail
  • Monitoring and cost tracking for visibility into query volumes, response times, and API spend
  • Managed vector database (Pinecone / Weaviate) for scale, reliability, and multi-user access
  • Caching (Redis) to avoid re-computing answers for repeated questions
  • Security and RBAC to control who can query which documents
  • Evaluation frameworks (RAGAS) for systematic measurement of answer quality
  • Hybrid search and reranking to combine semantic search with keyword matching for better recall
  • Docker, CI/CD, backups as standard production infrastructure

The demo focuses on getting the fundamentals right. Production systems layer resilience, scale, and governance on top of those fundamentals, but without correct chunking, good embeddings, and reliable retrieval, none of the production tooling matters.


This is a practical way to learn what real AI systems actually do under the hood.

This project demonstrates how to build a trustworthy documents RAG system by making every stage explicit, inspectable, and testable.

In regulated industries, or any setting built on a large document corpus, the AI’s answer is only as good as the evidence behind it. Every response in this system can be traced back to specific document chunks, with similarity scores that explain why those chunks were selected. That transparency is what makes LLMs usable in environments where getting it wrong has real consequences.
