PATH 01

Large Language
Models

From transformer basics to production APIs โ€” everything you need to understand and work with modern LLMs.

01

What is a Large Language Model?

A Large Language Model (LLM) is a deep learning model trained on massive text corpora to predict the next token in a sequence. With enough scale and data, these models develop emergent capabilities โ€” reasoning, code generation, translation โ€” without explicit programming for those tasks.

Modern LLMs are based on the Transformer architecture (Vaswani et al., 2017) and trained on hundreds of billions of tokens via self-supervised learning.

๐Ÿ“Š Scale

Billions of parameters. GPT-4 estimated at ~1.8T (MoE). Training GPT-3 cost ~$4.6M in compute.

๐ŸŽ“ Training

Pre-trained on web data, books, code. Fine-tuned via SFT + RLHF for instruction following and safety.

โšก Emergent

In-context learning, chain-of-thought reasoning, and few-shot generalization emerge at scale.

02

Transformer Architecture

The Transformer uses self-attention to process entire sequences in parallel. Modern LLMs use a decoder-only variant that predicts the next token from all previous ones.

flowchart TD A["Input Tokens\n['The', ' cat', ' sat']"] --> B["Token Embeddings\n+ Positional Encoding"] B --> C["Multi-Head Self-Attention\n(Attend to all positions simultaneously)"] C --> D["Add & Norm (Residual)"] D --> E["Feed-Forward Network\n(4x hidden dim expansion)"] E --> F["Add & Norm"] F --> G["Repeat N layers\n(GPT-4: ~96 layers)"] G --> H["Language Model Head\n(Linear + Softmax)"] H --> I["Token Probabilities โ†’ Sample next token"] style A fill:#ede9fe,stroke:#8b5cf6 style H fill:#ede9fe,stroke:#8b5cf6 style I fill:#ddd6fe,stroke:#7c3aed

๐Ÿ”Ž Self-Attention

For each token, attention computes how much to "attend" to every other token in the sequence:

Q = X ยท W_Q  (Query)
K = X ยท W_K  (Key)
V = X ยท W_V  (Value)

Attn = softmax(QKแต€ / โˆšd_k) ยท V

Multi-head: run h attention heads in parallel, concatenate โ†’ richer representations.

๐Ÿ“ Positional Encoding

Since attention has no inherent order, positions are injected as vectors:

  • โ–ธSinusoidal โ€” original Transformer, fixed formula
  • โ–ธLearned Absolute โ€” GPT-2/3, embeddings per position
  • โ–ธRoPE โ€” LLaMA/Mistral, rotary, better long-context
  • โ–ธALiBi โ€” attention with linear biases
03

Tokenization & Context Windows

How Tokenization Works

Text is split into tokens (~ยพ of a word on average). GPT-4 uses ~100,256 token vocabulary.

# "Hello, world!" โ†’ tokens
["Hello", ",", " world", "!"]
[9907, 11, 1917, 0]
BPE (Byte Pair Encoding)GPT, LLaMA
WordPieceBERT, PaLM
SentencePieceT5, multilingual

Context Window Sizes

Gemini 1.5 Pro1M tokens
Claude 3.5 Sonnet200K tokens
GPT-4 Turbo128K tokens
LLaMA 3.1128K tokens
GPT-3.5-turbo16K tokens

Rule of thumb: 1 token โ‰ˆ 4 chars โ‰ˆ 0.75 words. 100K tokens โ‰ˆ a novel.

04

Prompt Engineering

Zero-Shot Prompting

Ask directly without examples. Works well for GPT-4 / Claude on straightforward tasks.

System: You are a helpful assistant.

User: Classify sentiment:
"I love this product!"
โ†’ Positive/Negative/Neutral

Chain-of-Thought (CoT)

"Think step by step" dramatically improves multi-step reasoning accuracy.

User: Roger has 5 balls. Buys 2 cans of 3. How many? Think step by step.

Model: Starts with 5.
Buys 2ร—3=6 more.
Total: 5+6 = 11

Few-Shot Prompting

Provide examples in the prompt โ€” the model learns from demonstration.

"Great!" โ†’ Positive
"Terrible." โ†’ Negative
"It's okay." โ†’ Neutral

"I love this!" โ†’ Positive

System Prompt Design

Set persona, constraints, and output format before the conversation starts.

System: You are a Python expert.
- Always add type hints
- Include error handling
- Respond in markdown
- Explain your reasoning

Sampling Parameters

๐ŸŒก๏ธ Temperature

Controls randomness. 0.0 = deterministic. 1.0 = very random. Use 0.0 for code/facts, 0.7 for creative.

๐ŸŽฏ Top-p (Nucleus)

Sample from smallest set of tokens whose cumulative probability โ‰ฅ p. top_p=0.9 = only top 90% probability mass.

๐Ÿ›‘ Max Tokens

Hard limit on output length. Remember: input + output โ‰ค context window. Set based on expected response size.

05

Major LLMs Compared

Model Provider Context Strengths Open?
GPT-4oOpenAI128KGeneral purpose, reasoning, multimodalClosed
Claude 3.5 SonnetAnthropic200KLong context, coding, safety, analysisClosed
Gemini 1.5 ProGoogle1MMassive context, multimodal, BigQueryClosed
LLaMA 3.1 405BMeta128KOpen weights, customizable, self-hostableOpen
Mistral LargeMistral AI128KEuropean, multilingual, function callingPartial
Phi-3 MediumMicrosoft128KSmall but capable, edge deploymentOpen
06

Fine-tuning vs Prompting vs RAG

๐Ÿ’ฌ

Prompt Engineering

Modify the input prompt. No training required. Fast iteration.

  • โœ… Zero cost to implement
  • โœ… Instant iteration
  • โœ… Works with any model
  • โŒ Limited by context window
  • โŒ No persistent memory
Best for: Most use cases โ€” start here
๐Ÿ”

RAG

Retrieve relevant docs at runtime. Ground LLM in external knowledge.

  • โœ… Up-to-date knowledge
  • โœ… Source attribution
  • โœ… No training cost
  • โŒ Retrieval quality matters
  • โŒ Latency overhead
Best for: Knowledge bases, docs Q&A
๐Ÿ”ง

Fine-tuning

Update model weights on domain-specific data.

  • โœ… Custom style/behavior
  • โœ… No runtime retrieval
  • โœ… Smaller prompts
  • โŒ Expensive ($$$)
  • โŒ Knowledge becomes stale
Best for: Style, format, domain adaptation
โ† Home Next: RAG Guide โ†’