What is RAG?
Retrieval-Augmented Generation (RAG) lets you give an LLM access to your own documents without fine-tuning. It works in two phases:
- Retrieval — find the most relevant chunks from your document store
- Generation — pass those chunks as context to the LLM and generate an answer
The pipeline
Documents → Chunk → Embed → Store in vector DB
Query → Embed → Search vector DB → Top-K chunks → LLM → Answer
Step 1: Chunk your documents
Split documents into overlapping windows (e.g. 500 tokens, 50 overlap) so no single chunk is too long.
def chunk_text(text, size=500, overlap=50):
words = text.split()
chunks = []
for i in range(0, len(words), size - overlap):
chunks.append(" ".join(words[i : i + size]))
return chunks
Step 2: Embed each chunk
Use an embedding model (e.g. text-embedding-3-small) to convert each chunk to a vector.
import openai
def embed(texts):
resp = openai.embeddings.create(model="text-embedding-3-small", input=texts)
return [r.embedding for r in resp.data]
Step 3: Store in a vector database
Use Chroma, Pinecone, or pgvector. For a quick start:
import chromadb
client = chromadb.Client()
col = client.create_collection("docs")
col.add(documents=chunks, embeddings=embeds, ids=[str(i) for i in range(len(chunks))])
Step 4: Query and generate
query = "How do I set up authentication?"
q_embed = embed([query])[0]
results = col.query(query_embeddings=[q_embed], n_results=5)
context = "\n\n".join(results["documents"][0])
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
More steps and a working GitHub repo coming soon.