RAG Complete Guide | AI Engineer Portal

Why RAG?

🧩

Hallucinations

LLMs can confidently generate false information. RAG grounds responses in facts retrieved from your own data.

📅

Knowledge Cutoff

An LLM's knowledge is frozen at its training cutoff. RAG supplies current information from any source you keep up to date, so answers are as fresh as the index.

🏢

Private Data

Your company's proprietary documents aren't in the LLM. RAG lets you query internal knowledge securely.

Core RAG principle: Don't bake knowledge into parameters — retrieve it dynamically at inference time.

The RAG Pipeline

flowchart LR
    subgraph INGEST ["📥 Ingestion (Offline)"]
        A[Raw Documents\nPDF, DOCX, HTML] --> B[Text Extraction\n& Cleaning]
        B --> C[Chunking\n500-1000 tokens]
        C --> D[Embedding Model\ntext-embedding-3-large]
        D --> E[(Vector DB\nPinecone / Azure AI Search)]
    end
    subgraph QUERY ["🔍 Retrieval (Online)"]
        F[User Question] --> G[Embed Query]
        G --> H[Similarity Search\nTop-k chunks]
        E --> H
        H --> I[Re-Ranker\nCross-encoder]
    end
    subgraph GENERATE ["✨ Generation (Online)"]
        I --> J[Build Prompt\nContext + Question]
        J --> K[LLM\nGPT-4o / Claude]
        K --> L[Answer + Citations]
    end
    style INGEST fill:#e0f2fe,stroke:#0284c7
    style QUERY fill:#f0fdf4,stroke:#16a34a
    style GENERATE fill:#fef3c7,stroke:#d97706
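
The "Build Prompt" step in the generation stage can be sketched in a few lines. This is only an illustration under stated assumptions: the `Chunk` type, the `build_prompt` helper, and the prompt wording are hypothetical and not taken from any particular library.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A retrieved chunk plus the source it came from (hypothetical type)."""
    source: str
    text: str


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Number each chunk so the model can cite it as [1], [2], ..."""
    context = "\n\n".join(
        f"[{i}] ({c.source})\n{c.text}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    demo_chunks = [
        Chunk("handbook.pdf", "Employees accrue 1.5 vacation days per month."),
        Chunk("faq.html", "Unused vacation days roll over for one year."),
    ]
    print(build_prompt("How many vacation days do I get per year?", demo_chunks))
```

Numbering the chunks in the prompt is one simple way to get the "Answer + Citations" output shown in the diagram, since the bracketed numbers can be mapped back to sources afterwards.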

✂️ Chunking Strategies

How you split documents is critical to retrieval quality:

▸ Fixed-size: 500–1000 tokens + 20% overlap. Fast, simple, fine for uniform docs.
▸ Recursive Character: Splits on paragraphs → sentences → words. Best general default.
▸ Semantic Chunking: Split where topic changes (embedding similarity drop). High quality, slower.
▸ Document-aware: Respect headers, tables, code blocks. Use for structured docs.
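
As a rough sketch of the fixed-size strategy, the function below slides a window over a token list with a fractional overlap. It assumes the text is already tokenized; the demo uses whitespace splitting as a stand-in for a real tokenizer, and `chunk_fixed` is an illustrative name rather than a library function.

```python
def chunk_fixed(
    tokens: list[str], size: int = 500, overlap_ratio: float = 0.2
) -> list[list[str]]:
    """Split a token list into windows of `size` tokens that overlap."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    overlap = int(size * overlap_ratio)
    step = max(1, size - overlap)  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


if __name__ == "__main__":
    # Whitespace split is only a proxy for model tokens (assumption).
    words = ("lorem ipsum " * 600).split()
    pieces = chunk_fixed(words, size=500, overlap_ratio=0.2)
    print(len(pieces), [len(p) for p in pieces])
```

With `size=500` and a 20% overlap the window advances 400 tokens at a time, so the last 100 tokens of each chunk reappear at the start of the next one.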

Embeddings & Similarity Search

An embedding is a dense vector that captures semantic meaning. Similar meanings → similar vectors (high cosine similarity). A chunk about "ML" and a chunk about "neural networks" will sit close together in vector space.
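
Cosine similarity itself is a short computation. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy vectors (assumed values, not real model output).
    ml = [0.9, 0.8, 0.1]
    neural_networks = [0.8, 0.9, 0.2]
    cooking = [0.1, 0.2, 0.9]
    print(cosine_similarity(ml, neural_networks))  # close to 1
    print(cosine_similarity(ml, cooking))          # much lower
```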

Embedding Models

text-embedding-3-large (OpenAI): 3072d, best quality
text-embedding-3-small (OpenAI): 1536d, cost-effective
text-embedding-ada-002 (Azure OpenAI): 1536d
all-mpnet-base-v2 (open source): 768d
BGE-M3 (open source): multilingual

Search Types

Dense (Semantic) Search

Cosine similarity between query and doc embeddings. Finds related content even without exact keyword match.

Sparse (Keyword) — BM25

A TF-IDF-style ranking function that adds term-frequency saturation and document-length normalization. Best for exact terms, product codes, proper nouns. Fast, no embeddings.
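
To make the keyword side concrete, here is a minimal sketch of Okapi BM25 scoring over pre-tokenized documents. The parameter values `k1=1.5` and `b=0.75` are commonly used defaults chosen here as an assumption, and the IDF uses the non-negative "+1" variant; production systems rely on a search engine rather than a loop like this.

```python
import math
from collections import Counter


def bm25_scores(
    query: list[str], docs: list[list[str]], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Return one BM25 score per document for the given query terms."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        ["sku", "ab-1234", "widget"],
        ["blue", "widget", "manual"],
        ["shipping", "policy"],
    ]
    print(bm25_scores(["ab-1234"], corpus))
```

Only the first document contains the product code, so it is the only one with a non-zero score, which is the exact-match behaviour that dense search can miss.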

Hybrid Search ⭐ Recommended

Dense + Sparse via Reciprocal Rank Fusion (RRF). Best of both worlds for production.
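
Reciprocal Rank Fusion only needs the rank positions from each retriever, not their raw scores. A minimal sketch, assuming each retriever returns an ordered list of document IDs and using the commonly cited constant `k=60`:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]   # from semantic search (toy data)
    sparse_hits = ["b", "d", "a"]  # from BM25 (toy data)
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Documents that appear in both lists accumulate two contributions, so they rise above documents that only one retriever found.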

Vector Database Landscape

Database	Best For	Hosting	Hybrid?
Azure AI Search	Enterprise, ACLs, M365, SharePoint	Azure	✅ Built-in
Pinecone	Serverless, fast startup	SaaS	✅ Sparse
Weaviate	GraphQL, multi-tenancy	Self/Cloud	✅ BM25+vec
Qdrant	High performance, filtering	Self/Cloud	✅ Sparse
Chroma	Local dev, prototyping only	Local	❌ Dense only
FAISS	In-memory, research	In-process	❌ Dense only

Production tip: Azure AI Search for enterprise (compliance, ACLs, M365 integration). Qdrant/Weaviate for self-hosted. Chroma/FAISS for local dev only.

Advanced RAG Patterns

🔁 HyDE — Hypothetical Document Embeddings

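Ask an LLM to generate a hypothetical answer first, embed it, then search for chunks close to that hypothetical answer. This can improve recall for complex questions, because an answer-shaped passage often sits closer to the relevant chunks than the short question does.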

flowchart LR
    A[User Query] --> B[LLM: Generate\nhypothetical answer]
    B --> C[Embed hypothetical]
    C --> D[Search with hypothetical]
    D --> E[Real chunks returned]
    style B fill:#f0fdf4,stroke:#16a34a
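
The HyDE flow can be sketched as a small function that takes its three dependencies as callables. Everything here is an assumption for illustration: `generate`, `embed`, and `search` stand in for whatever LLM client, embedding model, and vector store you use, and the prompt wording is made up.

```python
from typing import Callable, Sequence


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list[str]],
    k: int = 5,
) -> list[str]:
    """Retrieve chunks near a generated hypothetical answer, not the query."""
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {query}\nPassage:"
    )
    hypothetical = generate(prompt)   # may be wrong; used only for search
    vector = embed(hypothetical)      # embed the hypothetical, not the query
    return search(vector, k)          # real chunks come back from the index


if __name__ == "__main__":
    # Stub dependencies so the sketch runs without any external service.
    chunks = hyde_retrieve(
        "What is hybrid search?",
        generate=lambda prompt: "Hybrid search combines dense and sparse retrieval.",
        embed=lambda text: [float(len(text)), 1.0],
        search=lambda vector, k: ["chunk-1", "chunk-2", "chunk-3"][:k],
        k=2,
    )
    print(chunks)
```

The hypothetical passage is discarded after the search; only the retrieved chunks go into the final prompt.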

🔀 Multi-Query Retrieval

Generate multiple query variants, retrieve for each, deduplicate. Gets broader coverage of the knowledge base.

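# LangChain: generates several query variants (the count depends on the prompt)
# Assumes `vectorstore` and `llm` are already configured elsewhere.
from langchain.retrievers import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm,
)

# Runs retrieval for each variant and merges the results
docs = retriever.invoke("your question here")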
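
The retrieve-and-deduplicate idea can also be sketched without any framework. In this illustration the query variants are supplied by the caller (in practice an LLM would write them), and `retrieve` is a hypothetical callable that returns chunk IDs for one query.

```python
from typing import Callable, Iterable


def multi_query_retrieve(
    variants: Iterable[str], retrieve: Callable[[str], list[str]]
) -> list[str]:
    """Run every query variant and return the union, first occurrence wins."""
    seen: set[str] = set()
    merged: list[str] = []
    for variant in variants:
        for chunk_id in retrieve(variant):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


if __name__ == "__main__":
    # Toy index mapping each phrasing to the chunks it would surface.
    fake_index = {
        "vacation policy": ["c1", "c2"],
        "paid time off rules": ["c2", "c3"],
        "how many days off": ["c3", "c4"],
    }
    print(multi_query_retrieve(fake_index, lambda q: fake_index[q]))
```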

📊 Re-Ranking

Retrieve the top 50 candidates, then re-rank them with a cross-encoder that scores the query and each chunk together. More accurate than embedding similarity alone.

▸ Cohere Rerank — managed API, state-of-the-art
▸ cross-encoder/ms-marco — open source
▸ BGE Reranker — multilingual

Pattern: Retrieve top-50 → rerank → pass top-5 to LLM
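
The pattern above can be sketched as a two-stage function. The `retrieve` and `score` callables are assumptions: `score` stands in for a cross-encoder or a rerank API, and the demo replaces it with a crude word-overlap count so the example runs on its own.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    score: Callable[[str, str], float],
    fetch_k: int = 50,
    top_n: int = 5,
) -> list[str]:
    """Fetch a wide candidate set cheaply, then keep the best few by score."""
    candidates = retrieve(query, fetch_k)
    ranked = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_n]


def word_overlap(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: count shared lowercase words."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


if __name__ == "__main__":
    corpus = [
        "cats sleep a lot",
        "reset your password in settings",
        "password rules",
    ]
    top = retrieve_then_rerank(
        "how to reset password",
        retrieve=lambda q, k: corpus[:k],
        score=word_overlap,
        fetch_k=50,
        top_n=2,
    )
    print(top)
```

Keeping the scorer as a parameter means the first-stage retriever and the reranker can be tuned or replaced separately.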

🕸️ GraphRAG (Microsoft)

Extract entities and relationships into a knowledge graph. Enables multi-hop reasoning across document relationships.

graph LR
    Q[Query: Who reports\nto CTO at Acme?] --> KG[(Knowledge\nGraph Neo4j)]
    KG --> E1[CTO → Alice]
    KG --> E2[Alice → Bob, Charlie]
    style KG fill:#f3e8ff,stroke:#7c3aed
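
Multi-hop reasoning over the example graph amounts to a bounded traversal. The sketch below uses a plain dictionary in place of a graph database and covers only the traversal step; entity extraction and community summaries are out of scope, and `reports_under` is an illustrative name.

```python
from collections import deque


def reports_under(
    graph: dict[str, list[str]], root: str, max_hops: int = 2
) -> dict[str, int]:
    """Breadth-first walk; returns each reachable person and the hop count."""
    hops: dict[str, int] = {}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for child in graph.get(node, []):
            if child != root and child not in hops:
                hops[child] = depth + 1
                queue.append((child, depth + 1))
    return hops


if __name__ == "__main__":
    # Edges mirror the diagram: Alice reports to the CTO, Bob and Charlie to Alice.
    org = {"CTO": ["Alice"], "Alice": ["Bob", "Charlie"]}
    print(reports_under(org, "CTO", max_hops=1))  # direct reports only
    print(reports_under(org, "CTO", max_hops=2))  # includes second-hop reports
```

A similarity search over chunks would need both facts to land in the retrieved set, while the graph walk follows the relationship explicitly.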

RAG Maturity Levels

1. Naive RAG

Prototype only

Chunk → embed → similarity search → stuff into prompt. Fast, poor precision.

2. Advanced RAG

Production

Query rewrite + HyDE + hybrid search + reranking + citations.

3. Modular RAG

Enterprise

Flexible pipeline — swap retriever, reranker, generator independently.
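
One way to get independently swappable parts is to define a small interface per stage. This is a sketch under assumptions: the `Retriever`, `Reranker`, and `Generator` protocols and the `RagPipeline` class are hypothetical names, not an existing framework's API.

```python
from dataclasses import dataclass
from typing import Protocol


class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]: ...


class Reranker(Protocol):
    def rerank(self, query: str, chunks: list[str]) -> list[str]: ...


class Generator(Protocol):
    def generate(self, query: str, chunks: list[str]) -> str: ...


@dataclass
class RagPipeline:
    """Wires the three stages together; each one can be replaced on its own."""
    retriever: Retriever
    reranker: Reranker
    generator: Generator
    fetch_k: int = 50
    top_n: int = 5

    def answer(self, query: str) -> str:
        candidates = self.retriever.retrieve(query, self.fetch_k)
        best = self.reranker.rerank(query, candidates)[: self.top_n]
        return self.generator.generate(query, best)
```

Any object with the matching method satisfies the protocol, so moving from a local vector store to a hosted one, or changing the LLM, touches a single constructor argument.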

4. GraphRAG

Complex Reasoning

Knowledge graph extraction, multi-hop reasoning, community summaries.

Local vs Cloud Deployment

LOCAL / ON-PREMISE

Vector Store: FAISS, Chroma
Embeddings: sentence-transformers
LLM: Ollama, LM Studio
Cost: $0–$500/mo
Best: Prototyping, air-gapped, sensitive data

AZURE CLOUD

Vector Store: Azure AI Search
Embeddings: Azure OpenAI Ada
LLM: Azure OpenAI GPT-4
Cost: $500–$5000+/mo
Best: Enterprise, ACLs, compliance, M365

HYBRID

Vector Store: Azure AI Search
Embeddings: Azure OpenAI
LLM: Self-hosted Llama
Cost: Medium
Best: Cloud search + private LLM

