What is RAG?
Retrieval-Augmented Generation (RAG) gives an LLM access to your own documents without fine-tuning. At query time, the system retrieves the most relevant chunks and passes them to the model as context.
                  ┌─────────────┐
Documents ──────▶ │  Indexing   │ ──▶ Vector DB
                  └─────────────┘
                            ▲ offline
─────────────────────────────────────────────────
                            ▼ online
Query ──────────▶ Embed ──▶ Search ──▶ Top-K chunks ──▶ LLM ──▶ Answer
Step 1: Document ingestion
Target formats: PDF, DOCX, HTML, Markdown, and plain text. The loader below covers .txt, .md, and .pdf files; DOCX and HTML need their own parsers.
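from pathlib import Path
from pypdf import PdfReader  # assumption: pypdf is available for PDF text extraction

def read_file(path: Path) -> str:
    if path.suffix.lower() == ".pdf":  # PDFs are binary, so read_text() would fail or return garbage
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return path.read_text(encoding="utf-8", errors="replace")

def load_documents(directory: str) -> list[dict]:
    wanted = {".txt", ".md", ".pdf"}
    paths = [p for p in Path(directory).rglob("*") if p.is_file() and p.suffix.lower() in wanted]
    return [{"path": str(p), "text": read_file(p)} for p in paths]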
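Real document folders often contain files that are not valid UTF-8. As a minimal sketch, assuming a short list of fallback encodings is acceptable for your corpus, a helper can try each encoding in turn and report failure instead of crashing the whole ingestion run. The name `read_text_safely` and the default encoding list are illustrative, not part of the pipeline above.

```python
from pathlib import Path
from typing import Optional


def read_text_safely(path: Path, encodings: tuple = ("utf-8", "cp1252")) -> Optional[str]:
    """Try each encoding in order; return None if none of them decodes the file."""
    raw = path.read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # fall through to the next candidate encoding
    return None  # caller decides whether to skip or log the file
```

Returning `None` rather than raising lets the caller log the path and continue with the remaining files.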
Step 2: Chunking strategy
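Chunking is one of the most underrated factors in RAG quality: the retriever can only return whole chunks, so chunk boundaries decide what context the model sees.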
| Strategy | When to use |
|---|---|
| Fixed size (512 tokens) | Simple, fast baseline |
| Sentence splitter | Better coherence |
| Recursive character | Good for mixed content |
| Semantic chunking | Best quality, slowest |
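# Newer LangChain releases ship this class in the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # counted in characters by default, not tokens
    chunk_overlap=64,  # 12.5% of chunk_size
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_text(document_text)  # document_text: the text of one loaded document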
Overlap matters: 10-15% overlap (64 of 512 here, or 12.5%) reduces the chance that a sentence or idea is cut in half at a chunk boundary.
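To make the overlap arithmetic concrete, here is a minimal sketch of fixed-size chunking over a pre-tokenized list. It is not how `RecursiveCharacterTextSplitter` works internally; the function name `chunk_with_overlap` and the whitespace "tokenizer" in the example are assumptions for illustration. Each chunk starts `chunk_size - overlap` tokens after the previous one, so the last `overlap` tokens of one chunk reappear at the start of the next.

```python
def chunk_with_overlap(tokens: list, chunk_size: int = 512, overlap: int = 64) -> list:
    """Split tokens into windows of chunk_size that share `overlap` tokens with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Toy example: 10 words, windows of 4 with 1 word of overlap
words = "the quick brown fox jumps over the lazy dog again".split()
pieces = chunk_with_overlap(words, chunk_size=4, overlap=1)
```

In the toy example, the word "fox" ends the first chunk and begins the second, which is the boundary protection that overlap is meant to provide.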
Step 3: Embedding models
| Model | Dims | Best for |
|---|---|---|
| text-embedding-3-small | 1536 | Cost-effective, great quality |
| text-embedding-3-large | 3072 | Highest quality, higher cost (about 6.5x small at launch list prices) |
| nomic-embed-text | 768 | Open source, self-hostable |
| bge-large-en | 1024 | Strong on retrieval benchmarks |
import openai  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed a list of texts in a single API request."""
    resp = openai.embeddings.create(
        model="text-embedding-3-small",
        input=chunks,
    )
    return [r.embedding for r in resp.data]
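The function above sends every chunk in one request, which can run into per-request input limits on large corpora. A minimal sketch of batching follows, assuming the embedding call is passed in as a function; `embed_in_batches`, `batched`, and the default batch size of 100 are illustrative choices, not values taken from any provider's documentation.

```python
from typing import Callable, Iterator


def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    chunks: list,
    embed_fn: Callable[[list], list],
    batch_size: int = 100,
) -> list:
    """Call embed_fn once per batch and concatenate the vectors in input order."""
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

With the function from this step, the call would look like `embed_in_batches(chunks, embed_chunks)`. A sleep or retry between batches is a natural place to add rate limiting.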
Step 4: Vector database
| DB | Best for |
|---|---|
| Chroma | Local dev, prototypes |
| Pinecone | Managed, production |
| Weaviate | Hybrid search (keyword + vector) |
| pgvector | Already on Postgres |
| Qdrant | High performance, self-hosted |
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")

embeddings = embed_chunks(chunks)  # one vector per chunk, from Step 3
collection.add(
    documents=chunks,
    embeddings=embeddings,
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    # Placeholder metadata: replace with each chunk's real source and page
    metadatas=[{"source": "doc.pdf", "page": i} for i in range(len(chunks))],
)
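IDs of the form `chunk_0`, `chunk_1`, ... repeat as soon as a second document is indexed, and record IDs need to be unique within a collection. One possible scheme, sketched here under the assumption that the source path plus chunk position identifies a chunk, appends a short content hash so that an edited chunk gets a new ID. The helpers `chunk_id` and `build_records` are hypothetical names, not Chroma APIs.

```python
import hashlib


def chunk_id(source: str, index: int, text: str) -> str:
    """Build a deterministic ID from the source, the chunk position, and a content hash."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return f"{source}:{index}:{digest}"


def build_records(source: str, chunks: list) -> tuple:
    """Return parallel lists of ids and metadata dicts for one document's chunks."""
    ids = [chunk_id(source, i, chunk) for i, chunk in enumerate(chunks)]
    metadatas = [{"source": source, "chunk_index": i} for i in range(len(chunks))]
    return ids, metadatas
```

The two returned lists can be passed as the `ids` and `metadatas` arguments of `collection.add`. Because the IDs are deterministic, re-running ingestion on an unchanged file produces the same IDs.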
Step 5: Retrieval + Re-ranking
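Basic retrieval returns the top-K chunks by vector similarity (typically cosine). A re-ranking pass then rescores those candidates with a cross-encoder, which reads the query and the chunk together and usually improves precision.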
from sentence_transformers import CrossEncoder

def retrieve(query: str, k: int = 10) -> list[str]:
    """Return the k chunks nearest to the query embedding."""
    q_embed = embed_chunks([query])[0]
    results = collection.query(query_embeddings=[q_embed], n_results=k)
    return results["documents"][0]

# Re-rank with a cross-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    # Sort on the score alone so ties never fall back to comparing chunk text
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]
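To show what "top-K by cosine similarity" means without a vector database, here is a brute-force sketch in plain Python. It is for intuition only: real vector stores use approximate indexes, and the function names `cosine_similarity` and `top_k` plus the 2-dimensional toy vectors are assumptions made for this example.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list, chunk_vecs: list, chunks: list, k: int = 3) -> list:
    """Score every chunk against the query and return the k best (score, chunk) pairs."""
    scored = [(cosine_similarity(query_vec, vec), chunk) for vec, chunk in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors: "a" points the same way as the query, "b" is orthogonal, "c" is in between
best = top_k([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], ["a", "b", "c"], k=2)
```

Here `best` contains chunk "a" first and chunk "c" second, because "a" is parallel to the query and "c" sits at 45 degrees to it.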
Step 6: Generation
import anthropic

def answer(query: str) -> str:
    chunks = retrieve(query, k=10)
    top_chunks = rerank(query, chunks, top_n=5)
    context = "\n\n---\n\n".join(top_chunks)
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a helpful assistant. Answer based only on the provided context. If the answer isn't in the context, say so.",
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }]
    )
    return message.content[0].text
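Joining chunks directly works, but numbering them lets the model cite which chunk supported a claim, and a size cap keeps the prompt bounded. The sketch below is one possible variation, not part of the function above; `build_context` and the 6000-character default are hypothetical, and the budget counts chunk text only, not the separators.

```python
def build_context(chunks: list, max_chars: int = 6000) -> str:
    """Number each chunk and stop adding chunks once the character budget is used up."""
    parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        block = f"[{number}] {chunk.strip()}"
        if used + len(block) > max_chars:
            break  # chunks arrive best-first, so drop the lower-ranked remainder
        parts.append(block)
        used += len(block)
    return "\n\n---\n\n".join(parts)
```

Because the chunks are already sorted by re-ranker score, truncating from the end discards the least relevant material first.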
Production checklist
- [ ] Add metadata filters (date, source, category) to narrow retrieval
- [ ] Implement hybrid search (BM25 + vector) for keyword-heavy queries
- [ ] Cache embeddings: don't re-embed unchanged documents
- [ ] Log retrieved chunks for debugging low-quality answers
- [ ] Evaluate with RAGAS or DeepEval: measure faithfulness and relevance
- [ ] Handle chunking failures (malformed PDFs, encoding errors)
- [ ] Rate limit embedding API calls for large ingestion jobs
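For the hybrid search item, one common way to merge a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF), which needs only the rank positions and not comparable scores. The sketch below assumes each retriever returns an ordered list of chunk IDs; the function name, the example IDs, and the conventional constant `k=60` are illustrative choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists: each appearance adds 1 / (k + rank) to an item's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a keyword retriever and a vector retriever
bm25_ranking = ["c2", "c1", "c4"]
vector_ranking = ["c1", "c3", "c2"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Chunks that appear in both lists ("c1" and "c2") rise to the top of `fused`, ahead of chunks that only one retriever found.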