LLM Fundamentals | AI Engineer Portal

What is a Large Language Model?

A Large Language Model (LLM) is a deep learning model trained on massive text corpora to predict the next token in a sequence. With enough scale and data, these models develop emergent capabilities — reasoning, code generation, translation — without explicit programming for those tasks.

Modern LLMs are based on the Transformer architecture (Vaswani et al., 2017) and trained on hundreds of billions to trillions of tokens via self-supervised learning.

📊 Scale

Billions of parameters. GPT-4 is rumored to have ~1.8T (MoE), though OpenAI has not confirmed this. Third-party estimates put the compute cost of training GPT-3 at ~$4.6M.

🎓 Training

Pre-trained on web data, books, code. Fine-tuned via SFT + RLHF for instruction following and safety.

⚡ Emergent

In-context learning, chain-of-thought reasoning, and few-shot generalization emerge at scale.

Transformer Architecture

The Transformer uses self-attention to process entire sequences in parallel. Modern LLMs use a decoder-only variant that predicts the next token from all previous ones.

flowchart TD
    A["Input Tokens\n['The', ' cat', ' sat']"] --> B["Token Embeddings\n+ Positional Encoding"]
    B --> C["Multi-Head Self-Attention\n(Attend to all positions simultaneously)"]
    C --> D["Add & Norm (Residual)"]
    D --> E["Feed-Forward Network\n(4x hidden dim expansion)"]
    E --> F["Add & Norm"]
    F --> G["Repeat N layers\n(GPT-3: 96 layers)"]
    G --> H["Language Model Head\n(Linear + Softmax)"]
    H --> I["Token Probabilities → Sample next token"]
    style A fill:#ede9fe,stroke:#8b5cf6
    style H fill:#ede9fe,stroke:#8b5cf6
    style I fill:#ddd6fe,stroke:#7c3aed

🔎 Self-Attention

For each token, attention computes how much to "attend" to every other token in the sequence (in decoder-only models, a causal mask restricts this to the token itself and earlier positions):

Q = X · W_Q (Query)
K = X · W_K (Key)
V = X · W_V (Value)

Attn = softmax(QKᵀ / √d_k) · V

Multi-head: run h attention heads in parallel, concatenate → richer representations.
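The formulas above can be traced in a few lines of plain Python. This is a minimal single-head sketch for illustration, not how production frameworks implement attention: the helper names (`matmul`, `softmax`, `attention`) and the `causal` flag are assumptions of this example, and the weight matrices are passed in rather than learned.

```python
import math

def matmul(a, b):
    # Plain nested-list matrix product: (n x m) @ (m x p) -> (n x p).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v, causal=False):
    # Q = X · W_Q, K = X · W_K, V = X · W_V
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # Scores = QKᵀ / √d_k
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    if causal:
        # Decoder-only masking: position i may not look at positions j > i.
        scores = [
            [s if j <= i else float("-inf") for j, s in enumerate(row)]
            for i, row in enumerate(scores)
        ]
    weights = [softmax(row) for row in scores]
    # Attn = softmax(...) · V
    return matmul(weights, v), weights

# Three tokens with 2-dimensional embeddings and identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(x, identity, identity, identity, causal=True)
```

Each row of `weights` sums to 1, and with `causal=True` the first token can only attend to itself. Multi-head attention would run this function h times with different projection matrices and concatenate the outputs.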

📍 Positional Encoding

Since attention has no inherent order, positions are injected as vectors:

▸Sinusoidal — original Transformer, fixed formula
▸Learned Absolute — GPT-2/3, embeddings per position
▸RoPE — LLaMA/Mistral, rotary, better long-context
▸ALiBi — attention with linear biases
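As an illustration of the first option in the list, the sinusoidal scheme from the original Transformer paper can be sketched as follows. The function name `sinusoidal_encoding` is an assumption of this example; even dimensions use sine and odd dimensions use cosine, with the base 10000 taken from the paper.

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    table = []
    for pos in range(num_positions):
        row = []
        for dim in range(d_model):
            pair_index = dim // 2  # dims 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * pair_index / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# One vector per position; each is added to the token embedding at that position.
encoding = sinusoidal_encoding(num_positions=4, d_model=8)
```

Because the table is a fixed formula, it needs no training, unlike the learned absolute embeddings used by GPT-2/3.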

Tokenization & Context Windows

How Tokenization Works

Text is split into tokens (~¾ of a word on average in English). GPT-4 uses a vocabulary of ~100K tokens (the cl100k_base encoding).

# "Hello, world!" → tokens, then token IDs (illustrative; exact IDs depend on the tokenizer)
["Hello", ",", " world", "!"]
[9907, 11, 1917, 0]

BPE (Byte Pair Encoding): GPT, LLaMA

WordPiece: BERT

SentencePiece: T5, PaLM, multilingual models

Context Window Sizes

Gemini 1.5 Pro: 1M tokens

Claude 3.5 Sonnet: 200K tokens

GPT-4 Turbo: 128K tokens

LLaMA 3.1: 128K tokens

GPT-3.5-turbo: 16K tokens

Rule of thumb for English text: 1 token ≈ 4 chars ≈ 0.75 words. 100K tokens ≈ 75K words, roughly a novel.
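The rule of thumb can be turned into a quick estimator for cases where no tokenizer is at hand. This is a rough sketch only: the function names are assumptions of this example, and real counts vary by tokenizer, language, and content (code and non-English text usually need more tokens).

```python
import math

CHARS_PER_TOKEN = 4      # rule-of-thumb value from the text above
WORDS_PER_TOKEN = 0.75   # rule-of-thumb value from the text above

def estimate_tokens(text):
    # Approximate token count from character length.
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_words(num_tokens):
    # Approximate English word count for a given token budget.
    return int(num_tokens * WORDS_PER_TOKEN)

print(estimate_tokens("Hello, world!"))  # 13 characters -> 4
print(estimate_words(100_000))           # -> 75000
```

For billing or hard limits, count with the model's actual tokenizer instead of this heuristic.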

Prompt Engineering

Zero-Shot Prompting

Ask directly without examples. Works well for GPT-4 / Claude on straightforward tasks.

System: You are a helpful assistant.

User: Classify sentiment:
"I love this product!"
→ Positive/Negative/Neutral

Chain-of-Thought (CoT)

Adding "Think step by step" often improves accuracy on multi-step reasoning tasks, because the model writes out intermediate steps before committing to an answer.

User: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many balls does he have now? Think step by step.

Model: Starts with 5.
Buys 2×3=6 more.
Total: 5+6 = 11

Few-Shot Prompting

Provide examples in the prompt — the model learns from demonstration.

"Great!" → Positive
"Terrible." → Negative
"It's okay." → Neutral

"I love this!" → Positive

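A few-shot prompt like the one in the Few-Shot Prompting section is usually assembled from a list of labeled examples. The sketch below shows one way to do that; `build_few_shot_prompt` is a hypothetical helper, and the arrow format simply mirrors the example above.

```python
def build_few_shot_prompt(examples, query):
    # One demonstration per line, then the unlabeled query for the model to complete.
    lines = [f'"{text}" → {label}' for text, label in examples]
    lines.append("")
    lines.append(f'"{query}" →')
    return "\n".join(lines)

examples = [
    ("Great!", "Positive"),
    ("Terrible.", "Negative"),
    ("It's okay.", "Neutral"),
]
prompt = build_few_shot_prompt(examples, "I love this!")
print(prompt)
```

The prompt ends right after the final arrow, so the model's most likely continuation is the label.
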
System Prompt Design

Set persona, constraints, and output format before the conversation starts.

System: You are a Python expert.
- Always add type hints
- Include error handling
- Respond in markdown
- Explain your reasoning

Sampling Parameters

🌡️ Temperature

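Controls randomness by rescaling the logits before softmax. 0.0 = greedy decoding (near-deterministic in practice). 1.0 = sample from the model's unmodified distribution; higher values flatten it further. Use 0.0 for code/facts, ~0.7 for creative writing.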
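The effect of temperature is easiest to see on a small set of logits. A minimal sketch, assuming the common convention of dividing logits by the temperature before softmax and treating 0 as greedy decoding; the function name is an assumption of this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)   # more peaked than the default
default = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)    # closer to uniform
```

Lower temperatures concentrate probability on the top token; higher temperatures spread it across alternatives.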

🎯 Top-p (Nucleus)

Sample from the smallest set of tokens whose cumulative probability is ≥ p. With top_p=0.9, only the tokens that make up the top 90% of probability mass are candidates; the long tail is discarded.
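Nucleus sampling can be sketched in two steps: keep the most likely tokens until their cumulative probability reaches p, then renormalize and sample. The function names below are assumptions of this example, and the input is a plain list of probabilities indexed by token ID.

```python
import random

def top_p_filter(probs, top_p):
    # Rank token indices by probability, highest first.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break  # smallest set whose mass reaches top_p
    total = sum(p for _, p in kept)
    # Renormalize so the kept probabilities sum to 1.
    return {index: p / total for index, p in kept}

def sample_top_p(probs, top_p, rng=random):
    nucleus = top_p_filter(probs, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, 0.9)   # keeps tokens 0, 1, 2; drops the 0.05 tail
token = sample_top_p(probs, 0.9)
```

With these probabilities and top_p=0.9, the least likely token can never be sampled.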

🛑 Max Tokens

Hard limit on output length. Remember: input + output ≤ context window. Set based on expected response size.
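The constraint input + output ≤ context window translates into a simple budget check before each request. A minimal sketch; `clamp_max_tokens` is a hypothetical helper, and the token counts are assumed to come from the model's tokenizer.

```python
def clamp_max_tokens(prompt_tokens, context_window, requested):
    # Space left for the response after the prompt is accounted for.
    available = context_window - prompt_tokens
    if available <= 0:
        raise ValueError("prompt alone fills or exceeds the context window")
    # Never ask for more output than the window can hold.
    return min(requested, available)

# A 120K-token prompt in a 128K window leaves room for at most 8K output tokens.
print(clamp_max_tokens(prompt_tokens=120_000, context_window=128_000, requested=16_000))  # 8000
```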

Major LLMs Compared

Model	Provider	Context	Strengths	Open?
GPT-4o	OpenAI	128K	General purpose, reasoning, multimodal	Closed
Claude 3.5 Sonnet	Anthropic	200K	Long context, coding, safety, analysis	Closed
Gemini 1.5 Pro	Google	1M	Massive context, multimodal, BigQuery	Closed
LLaMA 3.1 405B	Meta	128K	Open weights, customizable, self-hostable	Open
Mistral Large	Mistral AI	128K	European, multilingual, function calling	Partial
Phi-3 Medium	Microsoft	128K	Small but capable, edge deployment	Open

Fine-tuning vs Prompting vs RAG

💬 Prompt Engineering

Modify the input prompt. No training required. Fast iteration.

✅ No training or infrastructure cost
✅ Instant iteration
✅ Works with any model
❌ Limited by context window
❌ No persistent memory

Best for: Most use cases — start here

🔍 RAG

Retrieve relevant docs at runtime. Ground LLM in external knowledge.

✅ Up-to-date knowledge
✅ Source attribution
✅ No training cost
❌ Retrieval quality matters
❌ Latency overhead

Best for: Knowledge bases, docs Q&A
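The retrieve-then-ground flow can be sketched without any external services. This toy version ranks documents by word overlap with the query, which stands in for the embedding search a real system would typically use; all function names and the prompt wording are assumptions of this example.

```python
import re

def words(text):
    # Lowercased word set; a stand-in for real embeddings.
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, k=2):
    # Rank documents by how many query words they share, highest first.
    query_words = words(query)
    ranked = sorted(documents, key=lambda doc: len(query_words & words(doc)), reverse=True)
    return ranked[:k]

def build_rag_prompt(query, documents, k=2):
    # Number the retrieved sources so the model can attribute its answer.
    sources = retrieve(query, documents, k)
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(sources, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {query}"
    )

docs = [
    "RoPE is a rotary positional encoding used by LLaMA.",
    "BPE merges frequent byte pairs into tokens.",
    "Temperature controls sampling randomness.",
]
print(build_rag_prompt("What positional encoding does LLaMA use?", docs, k=1))
```

The quality of the final answer depends heavily on the `retrieve` step, which is why retrieval quality is listed as the main risk above.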

🔧 Fine-tuning

Update model weights on domain-specific data.

✅ Custom style/behavior
✅ No runtime retrieval
✅ Smaller prompts
❌ Expensive ($$$)
❌ Knowledge becomes stale

Best for: Style, format, domain adaptation
