
A Practical Guide to Building Your First RAG Pipeline


If you’ve ever asked a large language model a factual question and gotten a confident, well-structured, completely wrong answer, you already understand why RAG exists. The model wasn’t broken – it was doing exactly what it was designed to do: predict plausible-sounding text. It just didn’t have the right information to work with.

Retrieval-Augmented Generation, or RAG, fixes this by giving the model access to external knowledge at query time. Instead of relying solely on what it memorized during training, the model first retrieves relevant documents from a knowledge base, then generates a response grounded in that retrieved context. It’s the difference between answering from memory and answering with your notes open in front of you.

This guide walks through building a RAG pipeline from scratch – not a toy demo, but something you can actually extend into a production system. I’ll cover the architecture, the tools, the decisions that actually matter, and the mistakes I’ve watched people make repeatedly.

The Architecture at a Glance

Every RAG pipeline follows the same basic flow, regardless of the tools you use:

  1. Load your source documents (PDFs, web pages, databases, whatever)
  2. Chunk them into smaller passages
  3. Embed each chunk into a vector representation
  4. Store those vectors in a vector database
  5. Retrieve the most relevant chunks when a user asks a question
  6. Generate a response using the LLM, with the retrieved chunks as context

That’s it. Six steps. The devil, predictably, is in the details of each one.

[Figure: RAG Architecture Flow – how data flows from raw documents to AI-generated answers: Documents (PDFs, web, DBs) → Chunking (split into passages) → Embedding (vectorize text) → Vector Store (index & persist) → Retrieval (find relevant chunks) → LLM Generation (grounded response)]

Step 1: Loading Your Documents

Before you can search your data, you need to ingest it. The loading step handles reading files from various formats and converting them into plain text that the rest of the pipeline can work with.

If you’re using LangChain, it ships with document loaders for just about everything – PDFs (PyPDFLoader), web pages (WebBaseLoader), CSVs, Notion exports, Google Drive, Confluence, and dozens more. LlamaIndex has a similar ecosystem through its SimpleDirectoryReader and the LlamaHub connector library.

A practical tip here: garbage in, garbage out applies tenfold with RAG. If your PDF loader is mangling tables, dropping headers, or concatenating columns into nonsense strings, your retrieval quality will suffer downstream no matter how good your embeddings are. I’ve seen teams spend weeks tuning their retrieval and generation steps when the real problem was a sloppy document parser. Spend time inspecting your loaded documents before moving on. Print a few out. Read them. If they don’t make sense to you, they won’t make sense to an embedding model either.

For PDFs specifically, PyMuPDF (imported as fitz) tends to produce cleaner text extraction than the pypdf/PyPDF2 parsers that most default loaders use. For scanned documents, you’ll need OCR – Tesseract works, but unstructured.io’s library handles mixed-format documents (text + tables + images) more gracefully.
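
If you’re using LangChain, the loading step can be very small. Here’s a minimal sketch – the file path is a placeholder, and import paths shift slightly between LangChain versions:

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/annual_report.pdf")  # placeholder path
docs = loader.load()  # one Document per page, with .page_content and .metadata

# Eyeball the extraction before moving on. If this text is mangled,
# fix the loader before touching chunking, embeddings, or retrieval.
print(docs[0].metadata)
print(docs[0].page_content[:500])
```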

Step 2: Chunking – Where Most People Go Wrong First

You can’t feed an entire 200-page PDF into an LLM’s context window as retrieval context. Even if the context window is large enough (and with modern models, it might be), retrieving an entire document when the user only needs one paragraph is wasteful and often degrades answer quality. So you split your documents into chunks.

The question everyone asks: how big should my chunks be?

There’s no single right answer, but here’s a framework that works well in practice:

  • 250-500 tokens – Good for precise factual retrieval. If your use case is answering specific questions from a knowledge base (like an FAQ or technical documentation), smaller chunks ensure the retrieved context is focused and relevant. Less noise in the context means better answers.
  • 500-1000 tokens – A solid middle ground for most applications. Big enough to preserve context within a passage, small enough to keep retrieval precise. This is where I’d start for a general-purpose RAG system.
  • 1000-2000 tokens – Better for tasks that require broader context, like summarizing sections of a legal document or following the argument arc of an essay. Retrieval precision drops, but each retrieved chunk carries more of its surrounding context.

Beyond raw size, your chunking strategy matters just as much. The naive approach – splitting every N characters regardless of content – produces chunks that break mid-sentence, separate a question from its answer, or split a code block in half. You almost always want to use a recursive character text splitter that tries to split on paragraph boundaries first, then sentence boundaries, then falls back to character count. LangChain’s RecursiveCharacterTextSplitter does this well out of the box.

Chunk overlap is another knob worth tuning. Adding 50-200 tokens of overlap between adjacent chunks ensures that information sitting on a chunk boundary doesn’t get lost. It does increase your storage and embedding costs slightly, but the retrieval improvement is usually worth it. I typically start with 10-20% overlap relative to chunk size.
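
Putting those knobs together, here’s a minimal sketch with LangChain’s recursive splitter, using `docs` from Step 1. The from_tiktoken_encoder constructor counts tokens rather than characters (it needs the tiktoken package), which matches the sizing guidance above:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=800,     # middle of the 500-1000 token range
    chunk_overlap=100,  # roughly 10-15% overlap
)
chunks = splitter.split_documents(docs)
print(f"{len(docs)} documents -> {len(chunks)} chunks")
```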

Step 3: Embedding – Turning Text Into Vectors

Once you have your chunks, you need to convert each one into a numerical vector – a list of floating-point numbers that captures the semantic meaning of the text. This is what makes similarity search possible: chunks with similar meaning end up close together in vector space.

The embedding model you choose matters a lot. Some solid options as of early 2025:

  • OpenAI’s text-embedding-3-small – 1536 dimensions, good performance, easy to use via API. Costs $0.02 per million tokens. Reasonable for most use cases, though you’re sending your data to OpenAI.
  • OpenAI’s text-embedding-3-large – 3072 dimensions, better accuracy, costs $0.13 per million tokens. Worth it if retrieval quality is critical.
  • Sentence-transformers (all-MiniLM-L6-v2) – Free, open-source, runs locally. 384 dimensions. Fast and lightweight but less accurate on complex queries. Good for prototyping.
  • BGE (BAAI General Embedding) models – Open-source models from the Beijing Academy of Artificial Intelligence. The bge-large-en-v1.5 consistently ranks near the top of the MTEB leaderboard. My go-to recommendation for production systems where you want to self-host.
  • Cohere embed-v3 – Strong multilingual support. Worth considering if your documents aren’t exclusively in English.

One thing to be aware of: you must use the same embedding model for indexing and querying. If you embed your documents with OpenAI’s model and then embed queries with a sentence-transformer model, the vector spaces won’t align and your similarity search will return garbage. This sounds obvious, but I’ve debugged this exact issue for people more than once.
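
Here’s a small sketch using sentence-transformers with the BGE model mentioned above, embedding the chunks from Step 2. The thing to notice is that the exact same model object will later embed queries too:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

doc_vectors = model.encode(
    [c.page_content for c in chunks],
    normalize_embeddings=True,  # normalized vectors make cosine similarity a dot product
)
print(doc_vectors.shape)  # (num_chunks, 1024) for bge-large
```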

Step 4: Vector Storage

Your embedded chunks need to live somewhere searchable. This is where vector databases come in. The main options, roughly ordered by complexity:

ChromaDB – My recommendation for getting started. It’s open-source, runs in-memory or with persistent storage, and has a clean Python API. You can go from zero to a working vector store in about five lines of code. It’s not built for billions of vectors, but for datasets up to a few million chunks, it works great.
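
Here is roughly that five-line version, indexing the vectors from Step 3 with Chroma’s native client. Setting hnsw:space to cosine up front matters, since Chroma defaults to L2 distance (see the pitfalls section below). The storage path is a placeholder:

```python
import chromadb

chroma_client = chromadb.PersistentClient(path="./rag_index")  # placeholder path
collection = chroma_client.get_or_create_collection(
    "docs", metadata={"hnsw:space": "cosine"}
)
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=[c.page_content for c in chunks],
    embeddings=doc_vectors.tolist(),
    metadatas=[c.metadata for c in chunks],  # source, page number, etc.
)
```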

FAISS (Facebook AI Similarity Search) – A library, not a database. Extremely fast similarity search, battle-tested at Meta’s scale. No built-in persistence – you manage the index files yourself. Great for performance-critical applications where you want full control.

Pinecone – Fully managed cloud vector database. You don’t manage infrastructure, scaling, or backups. Easy to use, but it’s a paid service and your data lives on their servers. Good for teams that want to move fast and don’t want to operate a database.

Weaviate, Qdrant, Milvus – More full-featured open-source vector databases with built-in hybrid search (combining vector similarity with keyword matching), filtering, and multi-tenancy. These are what you graduate to when ChromaDB’s limitations start showing.

Step 5: Retrieval – Finding the Right Context

When a user sends a query, you embed their question using the same embedding model, then search your vector store for the K most similar chunks. This is the retrieval step, and it’s where the “R” in RAG does its work.

Starting value for K: retrieve 3-5 chunks. Fewer than 3 and you risk missing relevant information. More than 10 and you’re stuffing the LLM’s context with marginally relevant text that can actually hurt answer quality – a phenomenon researchers call “lost in the middle,” where models pay less attention to context in the middle of a long prompt.
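
Continuing the sketches above (the same `model` and `collection` from Steps 3 and 4), a top-K query looks like this; the question is a made-up example:

```python
question = "What did the report say about 2024 revenue?"  # example query

q_vec = model.encode(question, normalize_embeddings=True)
results = collection.query(query_embeddings=[q_vec.tolist()], n_results=5)

# Log what came back -- inspecting retrieved chunks is the single most
# useful debugging habit for a RAG system.
for dist, doc in zip(results["distances"][0], results["documents"][0]):
    print(f"{dist:.3f}  {doc[:80]}")
```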

Pure vector similarity search works well for most cases, but it has known weaknesses: it can miss documents that contain the query’s exact keywords (product names, error codes, acronyms) but read very differently overall, and it can return results that are semantically similar yet not actually relevant. Hybrid search – combining vector similarity with BM25 keyword matching – often outperforms either approach alone. Weaviate and Qdrant support this natively. With LangChain, you can build an EnsembleRetriever that merges results from both methods, sketched below.
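
As a rough sketch of that ensemble setup – assuming you already have a LangChain vector-store retriever (called vector_retriever here, e.g. from the Chroma wrapper) and the rank_bm25 package installed for the keyword side:

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

bm25 = BM25Retriever.from_documents(chunks, k=10)  # keyword side, over the same chunks
hybrid = EnsembleRetriever(
    retrievers=[bm25, vector_retriever],  # vector_retriever: assumed to exist, e.g. a Chroma wrapper's .as_retriever()
    weights=[0.4, 0.6],                   # weight the semantic side a bit more
)
candidates = hybrid.invoke(question)
```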

Another technique worth implementing early: reranking. After your initial retrieval returns, say, 20 candidates, pass them through a cross-encoder reranking model (like Cohere’s Rerank API or an open-source cross-encoder from sentence-transformers) that scores each chunk’s relevance to the specific query more carefully. Then take the top 3-5. This two-stage approach – fast retrieval followed by precise reranking – consistently improves answer quality.
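
A minimal version of that two-stage setup, using an open-source cross-encoder from sentence-transformers to rerank candidates pulled from the Chroma query above (re-run with a larger n_results):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

candidates = collection.query(query_embeddings=[q_vec.tolist()], n_results=20)["documents"][0]
scores = reranker.predict([(question, doc) for doc in candidates])  # one relevance score per chunk
ranked = sorted(zip(scores, candidates), reverse=True, key=lambda pair: pair[0])
top_chunks = [doc for _, doc in ranked[:4]]  # keep only the best few for the prompt
```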

Step 6: Generation – The LLM Does Its Thing

Finally, you construct a prompt that includes the retrieved chunks and the user’s question, and send it to your LLM. A basic prompt template looks something like this:

Use the following context to answer the question. If the context doesn’t contain enough information to answer, say so – don’t make up information.

Context: {retrieved_chunks}

Question: {user_question}

That instruction to not make things up is important. Without it, the model will happily blend retrieved facts with its own training data, which defeats the purpose of RAG. You want the model to treat the retrieved context as its primary source of truth.

For the LLM itself, GPT-4o, Claude 3.5 Sonnet, and Llama 3.1 70B all work well for generation. If cost is a concern, GPT-4o-mini or Llama 3.1 8B can handle straightforward Q&A tasks without much quality loss. The retrieval quality matters more than the generator model in most RAG setups – a good retriever with an average LLM outperforms a bad retriever with a great LLM almost every time.
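
Wiring it together, here’s a minimal generation call using the prompt template above and the OpenAI chat completions API with gpt-4o-mini – any of the models mentioned would slot in the same way, and it assumes OPENAI_API_KEY is set in the environment:

```python
from openai import OpenAI

llm = OpenAI()  # reads OPENAI_API_KEY from the environment

context = "\n\n---\n\n".join(top_chunks)  # reranked chunks from Step 5
prompt = (
    "Use the following context to answer the question. If the context doesn't "
    "contain enough information to answer, say so – don't make up information.\n\n"
    f"Context: {context}\n\nQuestion: {question}"
)
response = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # keep the answer tightly grounded in the retrieved context
)
print(response.choices[0].message.content)
```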

Common Pitfalls (and How to Avoid Them)

After building and debugging a fair number of these systems, here are the mistakes I see most often:

  • Not evaluating retrieval separately from generation. When your RAG system gives a bad answer, you need to know whether the retriever found the right chunks but the LLM fumbled the synthesis, or whether the retriever returned irrelevant chunks and the LLM never had a chance. Log your retrieved chunks. Inspect them. Build a small eval set of questions with known source passages and measure retrieval recall.
  • Ignoring metadata. Your chunks should carry metadata – source document, page number, section heading, date. This lets you filter retrieval (e.g., “only search documents from 2024”) and lets the LLM cite its sources in the response. Users trust answers with citations more than answers without them.
  • Chunking without preprocessing. Raw documents are messy. Headers, footers, page numbers, copyright notices, table-of-contents entries – all of this becomes noise in your chunks. Clean your documents before chunking. Strip boilerplate. Normalize formatting. It’s tedious work, but it pays dividends.
  • Using the wrong distance metric. Most embedding models are designed for cosine similarity. If your vector store defaults to L2 (Euclidean) distance, your results may be subtly wrong. Check the documentation for your embedding model and make sure the distance metric matches.
  • Not handling “I don’t know.” Your RAG system will receive questions that your knowledge base simply doesn’t cover. A well-built system should recognize low retrieval confidence and respond honestly rather than hallucinating an answer from the LLM’s training data. Set a similarity score threshold and return a graceful fallback when nothing relevant is found.
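
For that last point, here’s a minimal fallback sketch, assuming the cosine-distance setup from the earlier Chroma sketches (lower distance means more similar). The 0.6 cutoff is purely illustrative – tune it against your own eval set – and generate_answer stands in for whatever generation function you’ve built:

```python
DISTANCE_THRESHOLD = 0.6  # illustrative value, not a universal constant

hits = collection.query(query_embeddings=[q_vec.tolist()], n_results=5)
if not hits["documents"][0] or hits["distances"][0][0] > DISTANCE_THRESHOLD:
    answer = "I couldn't find anything in the knowledge base that answers that question."
else:
    answer = generate_answer(hits["documents"][0], question)  # hypothetical helper
```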

Where to Go From Here

Once your basic pipeline works, here’s the progression I’d recommend:

  1. Add evaluation. Use a framework like RAGAS or DeepEval to systematically measure faithfulness (does the answer match the retrieved context?), relevance (are the retrieved chunks actually relevant?), and answer correctness.
  2. Experiment with query transformation. Sometimes the user’s raw question isn’t the best search query. Techniques like HyDE (Hypothetical Document Embeddings) – where you ask the LLM to generate a hypothetical answer, then use that as the search query – can dramatically improve retrieval for certain types of questions; a bare-bones sketch follows this list.
  3. Implement a chat memory layer. If your RAG system needs to handle multi-turn conversations, you’ll need to reformulate follow-up questions to be self-contained before sending them to the retriever. “What about their revenue?” means nothing without the context of the previous question about a specific company.
  4. Consider agentic RAG. Instead of a fixed retrieve-then-generate pipeline, let an LLM agent decide when to retrieve, what query to use, whether to retrieve again with a refined query, and when it has enough information to answer. LangChain’s agent framework and LlamaIndex’s query engine agents both support this pattern.
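
For point 2, here’s that bare-bones HyDE sketch, built on the earlier pieces (the llm client, embedding model, and collection from the previous steps): generate a hypothetical answer, then search with its embedding instead of the raw question’s.

```python
hypothetical = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Write a short passage that plausibly answers this question: {question}",
    }],
).choices[0].message.content

# Search with the hypothetical answer's embedding rather than the question's.
hyde_vec = model.encode(hypothetical, normalize_embeddings=True)
hits = collection.query(query_embeddings=[hyde_vec.tolist()], n_results=5)
```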

Building a RAG pipeline isn’t conceptually difficult – the individual pieces are well-understood and well-documented. The hard part is making it work reliably on your specific data, for your specific use case, at your specific scale. Start simple, measure everything, and resist the urge to add complexity until you’ve confirmed that the simple version isn’t good enough. More often than you’d expect, it is.
