PATH 02

Retrieval-Augmented Generation

From document ingestion to GraphRAG: build RAG systems that are accurate, grounded, and production-ready.

01

Why RAG?

🧩
Hallucinations

LLMs confidently generate false information. RAG grounds responses in retrieved facts from your actual data.

📅
Knowledge Cutoff

An LLM's knowledge is frozen at training time. RAG supplies current information from any source you keep indexed and up to date.

🏢
Private Data

Your company's proprietary documents aren't in the LLM. RAG lets you query internal knowledge securely.

Core RAG principle: Don't bake knowledge into parameters; retrieve it dynamically at inference time.

02

The RAG Pipeline

flowchart LR
  subgraph INGEST ["📥 Ingestion (Offline)"]
    A[Raw Documents\nPDF, DOCX, HTML] --> B[Text Extraction\n& Cleaning]
    B --> C[Chunking\n500-1000 tokens]
    C --> D[Embedding Model\ntext-embedding-3-large]
    D --> E[(Vector DB\nPinecone / Azure AI Search)]
  end
  subgraph QUERY ["🔍 Retrieval (Online)"]
    F[User Question] --> G[Embed Query]
    G --> H[Similarity Search\nTop-k chunks]
    E --> H
    H --> I[Re-Ranker\nCross-encoder]
  end
  subgraph GENERATE ["✨ Generation (Online)"]
    I --> J[Build Prompt\nContext + Question]
    J --> K[LLM\nGPT-4o / Claude]
    K --> L[Answer + Citations]
  end
  style INGEST fill:#e0f2fe,stroke:#0284c7
  style QUERY fill:#f0fdf4,stroke:#16a34a
  style GENERATE fill:#fef3c7,stroke:#d97706
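The "Build Prompt" step in the diagram can be sketched roughly as follows. This is a minimal illustration rather than a prescribed template: the function name `build_prompt`, the chunk keys `source` and `text`, the sample chunks, and the instruction wording are all assumptions made for this example.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks (hypothetical format)."""
    # Number each chunk so the model can cite it as [1], [2], ...
    context_lines = [
        f"[{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for the re-ranker's output.
chunks = [
    {"source": "handbook.pdf", "text": "Employees accrue 1.5 vacation days per month."},
    {"source": "faq.html", "text": "Unused vacation days expire after 18 months."},
]
prompt = build_prompt("How many vacation days do I get per year?", chunks)
print(prompt)
```

Numbering the chunks is one simple way to get the "Answer + Citations" output shown in the diagram, since the model can refer back to `[1]` or `[2]` in its answer.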

βœ‚οΈ Chunking Strategies

How you split documents is critical to retrieval quality:

  • Fixed-size: 500–1000 tokens + 20% overlap. Fast, simple, fine for uniform docs.
  • Recursive Character: Splits on paragraphs → sentences → words. Best general default.
  • Semantic Chunking: Split where the topic changes (embedding similarity drop). High quality, slower.
  • Document-aware: Respect headers, tables, code blocks. Use for structured docs.
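
As a rough sketch of the fixed-size strategy from the list above, the snippet below slides a 500-token window with a 100-token (20%) overlap. The function name `chunk_fixed` is hypothetical, and whitespace splitting stands in for a real tokenizer, which is an assumption made to keep the example dependency-free.

```python
def chunk_fixed(tokens, size=500, overlap=100):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer here.
text = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_fixed(text.split(), size=500, overlap=100)
print([len(c) for c in chunks])  # prints [500, 500, 400]
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.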
03

Embeddings & Similarity Search

An embedding is a dense vector capturing semantic meaning. Similar meanings → similar vectors (high cosine similarity). A chunk about "ML" and a chunk about "neural networks" will sit close together in vector space.
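
To make cosine similarity concrete, here is a small sketch. The three-dimensional vectors are made-up toy values chosen for illustration, not the output of any real embedding model (real embeddings have hundreds or thousands of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, for illustration only).
ml = [0.9, 0.8, 0.1]
neural_networks = [0.85, 0.75, 0.2]
cooking = [0.1, 0.2, 0.95]

print(cosine_similarity(ml, neural_networks))  # close to 1: similar meaning
print(cosine_similarity(ml, cooking))          # much lower: unrelated topic
```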

Embedding Models

  • text-embedding-3-large (OpenAI): 3072d, best
  • text-embedding-3-small (OpenAI): 1536d, cost-effective
  • Azure ada-002 (Azure OpenAI): 1536d
  • all-mpnet-base-v2 (Open source): 768d
  • BGE-M3 (Open): multilingual

Search Types

Dense (Semantic) Search

Cosine similarity between query and doc embeddings. Finds related content even without exact keyword match.

Sparse (Keyword): BM25

Term-based ranking in the TF-IDF family. Best for exact terms, product codes, proper nouns. Fast, no embeddings.
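
A minimal sketch of BM25 scoring, assuming the common Lucene-style IDF variant and the frequently used defaults `k1=1.5` and `b=0.75`. The function name `bm25_scores`, the whitespace tokenization, and the sample documents are assumptions for illustration; a production system would use a search engine's built-in implementation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with BM25 (Lucene-style IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up documents; note the exact product code in the first one.
docs = [
    "Replacement filter for model SKU-4471 vacuum",
    "General vacuum maintenance and cleaning guide",
    "Warranty terms for all kitchen appliances",
]
scores = bm25_scores("SKU-4471 filter", docs)
print(scores)
```

Only the first document contains the exact product code, so it is the only one with a non-zero score. This is the case where keyword search beats dense search.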

Hybrid Search ⭐ Recommended

Dense + Sparse via Reciprocal Rank Fusion (RRF). Best of both worlds for production.
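
Reciprocal Rank Fusion can be sketched in a few lines. Each document receives 1 / (k + rank) from every ranked list it appears in, and the sums decide the final order. The constant `k=60` is the value commonly used in practice; the function name and the document IDs below are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


dense = ["d3", "d1", "d7"]   # made-up result of a semantic search
sparse = ["d1", "d9", "d3"]  # made-up result of a BM25 search
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # prints ['d1', 'd3', 'd9', 'd7']
```

Because RRF uses only ranks, it needs no calibration between cosine scores and BM25 scores, which live on different scales.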

04

Vector Database Landscape

| Database | Best For | Hosting | Hybrid? |
|---|---|---|---|
| Azure AI Search | Enterprise, ACLs, M365, SharePoint | Azure | ✅ Built-in |
| Pinecone | Serverless, fast startup | SaaS | ✅ Sparse |
| Weaviate | GraphQL, multi-tenancy | Self/Cloud | ✅ BM25+vec |
| Qdrant | High performance, filtering | Self/Cloud | ✅ Sparse |
| Chroma | Local dev, prototyping only | Local | ❌ Dense only |
| FAISS | In-memory, research | In-process | ❌ Dense only |

Production tip: Azure AI Search for enterprise (compliance, ACLs, M365 integration). Qdrant/Weaviate for self-hosted. Chroma/FAISS for local dev only.

05

Advanced RAG Patterns

πŸ” HyDE β€” Hypothetical Document Embeddings

Ask the LLM to generate a hypothetical answer first, embed that answer, then search for chunks close to it. Because a full answer tends to resemble the target passages more than a short question does, this often improves recall for complex questions.

flowchart LR
  A[User Query] --> B[LLM: Generate\nhypothetical answer]
  B --> C[Embed hypothetical]
  C --> D[Search with hypothetical]
  D --> E[Real chunks returned]
  style B fill:#f0fdf4,stroke:#16a34a
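
The HyDE flow can be sketched without committing to a particular model provider. Everything below is a stand-in: `fake_llm` returns a canned answer in place of a real LLM call, and `toy_embed` counts words from a tiny hand-picked vocabulary in place of a real embedding model. Only the control flow in `hyde_search` reflects the pattern itself.

```python
import math
import re

VOCAB = ["refund", "days", "shipping", "warranty", "return"]  # toy vocabulary


def toy_embed(text):
    """Stand-in embedder: counts of vocabulary words (not a real model)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count(term) for term in VOCAB]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fake_llm(query):
    """Stand-in for an LLM call that drafts a hypothetical answer."""
    return "You can request a refund within 30 days and return the item."


def hyde_search(query, generate, embed, index, top_k=1):
    hypothetical = generate(query)   # 1. draft a hypothetical answer
    query_vec = embed(hypothetical)  # 2. embed the draft, not the query
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]  # 3. return real chunks


docs = [
    "Refund requests are accepted within 30 days of purchase.",
    "Shipping takes 5 business days.",
    "The warranty covers manufacturing defects.",
]
index = [(doc, toy_embed(doc)) for doc in docs]
results = hyde_search("Can I get my money back?", fake_llm, toy_embed, index)
print(results)
```

Note that the original question shares no vocabulary with the refund document, while the hypothetical answer does. That gap is what HyDE is meant to close.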

🔀 Multi-Query Retrieval

Generate multiple query variants, retrieve for each, deduplicate. Gets broader coverage of the knowledge base.

# LangChain: generates 3-5 query variants
# Assumes `vectorstore` and `llm` have already been created elsewhere.
from langchain.retrievers import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm,
)

📊 Re-Ranking

Retrieve the top 50 candidates, then re-rank them with a cross-encoder that scores the query and chunk together. This is more accurate than embedding similarity alone, at the cost of one model pass per candidate.

  • Cohere Rerank: managed API, state-of-the-art
  • cross-encoder/ms-marco: open source
  • BGE Reranker: multilingual

Pattern: Retrieve top-50 → rerank → pass top-5 to LLM
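
The retrieve, rerank, truncate pattern above can be sketched with a pluggable scoring function. The names `rerank` and `overlap_score` are hypothetical, and `overlap_score` is a toy word-overlap scorer standing in for a real cross-encoder or reranking API, which would be passed in as `score_pair` instead.

```python
def rerank(query, candidates, score_pair, top_n=5):
    """Re-score first-stage candidates jointly with the query; keep the best."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def overlap_score(query, chunk):
    """Toy scorer: fraction of query words found in the chunk.

    A stand-in for a cross-encoder, used only to keep the sketch runnable.
    """
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words)


# Made-up first-stage results (in practice, the top-50 from vector search).
candidates = [
    "our office is closed on public holidays",
    "password rules require twelve characters",
    "reset your password from the account settings page",
    "contact support to reset a locked account",
]
top = rerank("how do I reset my password", candidates, overlap_score, top_n=2)
print(top)
```

Keeping the scorer as a parameter means the first-stage retriever and the reranker can be swapped independently.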

πŸ•ΈοΈ GraphRAG (Microsoft)

Extract entities and relationships into a knowledge graph. Enables multi-hop reasoning across document relationships.

graph LR
  Q[Query: Who reports\nto CTO at Acme?] --> KG[(Knowledge\nGraph Neo4j)]
  KG --> E1[CTO → Alice]
  KG --> E2[Alice → Bob, Charlie]
  style KG fill:#f3e8ff,stroke:#7c3aed
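
The two-hop lookup in the diagram can be sketched with an in-memory triple store. This is only an illustration of multi-hop traversal, not how Microsoft's GraphRAG or Neo4j work internally; the relation names `held_by` and `reports_to` and the helper functions are assumptions introduced for this example.

```python
from collections import defaultdict

# (subject, relation, object) triples taken from the diagram above.
triples = [
    ("CTO", "held_by", "Alice"),
    ("Bob", "reports_to", "Alice"),
    ("Charlie", "reports_to", "Alice"),
]


def build_index(triples):
    """Index triples in both directions for fast hop-by-hop lookup."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[(subject, relation)].append(obj)
        index[(obj, "inv_" + relation)].append(subject)  # inverse edge
    return index


def who_reports_to(role, index):
    """Hop 1: role -> person holding it. Hop 2: person -> direct reports."""
    reports = []
    for person in index[(role, "held_by")]:
        reports.extend(index[(person, "inv_reports_to")])
    return reports


index = build_index(triples)
print(who_reports_to("CTO", index))  # prints ['Bob', 'Charlie']
```

A plain vector search would need a single chunk that mentions both the CTO and the reports. The graph answers the question by chaining two separate facts.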

RAG Maturity Levels

1. Naive RAG
Prototype only

Chunk → embed → similarity search → stuff into prompt. Fast, poor precision.

2. Advanced RAG
Production

Query rewrite + HyDE + hybrid search + reranking + citations.

3. Modular RAG
Enterprise

Flexible pipeline: swap retriever, reranker, generator independently.

4. GraphRAG
Complex Reasoning

Knowledge graph extraction, multi-hop reasoning, community summaries.

06

Local vs Cloud Deployment

LOCAL / ON-PREMISE
  • Vector Store: FAISS, Chroma
  • Embeddings: sentence-transformers
  • LLM: Ollama, LM Studio
  • Cost: $0–$500/mo
  • Best: Prototyping, air-gapped, sensitive data

AZURE CLOUD
  • Vector Store: Azure AI Search
  • Embeddings: Azure OpenAI Ada
  • LLM: Azure OpenAI GPT-4
  • Cost: $500–$5000+/mo
  • Best: Enterprise, ACLs, compliance, M365

HYBRID
  • Vector Store: Azure AI Search
  • Embeddings: Azure OpenAI
  • LLM: Self-hosted Llama
  • Cost: Medium
  • Best: Cloud search + private LLM