withsoon
← Back to guides
2026-06-04RAGembeddingsLLM

Build a RAG Pipeline from Scratch

A complete walkthrough of Retrieval-Augmented Generation: chunking, embedding, vector search, and generation.

What is RAG?

Retrieval-Augmented Generation (RAG) lets you give an LLM access to your own documents without fine-tuning. It works in two phases:

  1. Retrieval — find the most relevant chunks from your document store
  2. Generation — pass those chunks as context to the LLM and generate an answer

The pipeline

Documents → Chunk → Embed → Store in vector DB
Query → Embed → Search vector DB → Top-K chunks → LLM → Answer

Step 1: Chunk your documents

Split documents into overlapping windows (e.g. 500 tokens, 50 overlap) so no single chunk is too long.

def chunk_text(text, size=500, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[i : i + size]))
    return chunks

Step 2: Embed each chunk

Use an embedding model (e.g. text-embedding-3-small) to convert each chunk to a vector.

import openai

def embed(texts):
    resp = openai.embeddings.create(model="text-embedding-3-small", input=texts)
    return [r.embedding for r in resp.data]

Step 3: Store in a vector database

Use Chroma, Pinecone, or pgvector. For a quick start:

import chromadb

client = chromadb.Client()
col = client.create_collection("docs")
col.add(documents=chunks, embeddings=embeds, ids=[str(i) for i in range(len(chunks))])

Step 4: Query and generate

query = "How do I set up authentication?"
q_embed = embed([query])[0]
results = col.query(query_embeddings=[q_embed], n_results=5)

context = "\n\n".join(results["documents"][0])
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

More steps and a working GitHub repo coming soon.