From transformer basics to production APIs โ everything you need to understand and work with modern LLMs.
A Large Language Model (LLM) is a deep learning model trained on massive text corpora to predict the next token in a sequence. With enough scale and data, these models develop emergent capabilities โ reasoning, code generation, translation โ without explicit programming for those tasks.
Modern LLMs are based on the Transformer architecture (Vaswani et al., 2017) and trained on hundreds of billions of tokens via self-supervised learning.
Billions of parameters. GPT-4 estimated at ~1.8T (MoE). Training GPT-3 cost ~$4.6M in compute.
Pre-trained on web data, books, code. Fine-tuned via SFT + RLHF for instruction following and safety.
In-context learning, chain-of-thought reasoning, and few-shot generalization emerge at scale.
The Transformer uses self-attention to process entire sequences in parallel. Modern LLMs use a decoder-only variant that predicts the next token from all previous ones.
For each token, attention computes how much to "attend" to every other token in the sequence:
Multi-head: run h attention heads in parallel, concatenate โ richer representations.
Since attention has no inherent order, positions are injected as vectors:
Text is split into tokens (~ยพ of a word on average). GPT-4 uses ~100,256 token vocabulary.
Rule of thumb: 1 token โ 4 chars โ 0.75 words. 100K tokens โ a novel.
Ask directly without examples. Works well for GPT-4 / Claude on straightforward tasks.
"Think step by step" dramatically improves multi-step reasoning accuracy.
Provide examples in the prompt โ the model learns from demonstration.
Set persona, constraints, and output format before the conversation starts.
Controls randomness. 0.0 = deterministic. 1.0 = very random. Use 0.0 for code/facts, 0.7 for creative.
Sample from smallest set of tokens whose cumulative probability โฅ p. top_p=0.9 = only top 90% probability mass.
Hard limit on output length. Remember: input + output โค context window. Set based on expected response size.
| Model | Provider | Context | Strengths | Open? |
|---|---|---|---|---|
| GPT-4o | OpenAI | 128K | General purpose, reasoning, multimodal | Closed |
| Claude 3.5 Sonnet | Anthropic | 200K | Long context, coding, safety, analysis | Closed |
| Gemini 1.5 Pro | 1M | Massive context, multimodal, BigQuery | Closed | |
| LLaMA 3.1 405B | Meta | 128K | Open weights, customizable, self-hostable | Open |
| Mistral Large | Mistral AI | 128K | European, multilingual, function calling | Partial |
| Phi-3 Medium | Microsoft | 128K | Small but capable, edge deployment | Open |
Modify the input prompt. No training required. Fast iteration.
Retrieve relevant docs at runtime. Ground LLM in external knowledge.
Update model weights on domain-specific data.