Understand how LLMs work to engineer with them better
For engineers who use GPT APIs but don’t trust the “it’s magic” answer.
Large Language Models (LLMs) like GPT‑4 or Claude aren’t magical. They’re predictive engines trained to guess the next token in a sequence: that’s it.
1. How does next-token prediction work?
Every output, whether chat, code, or essay, is a sequence of guesses.
graph LR
Input[Prompt tokens] --> Embed[Token embeddings]
Embed --> Attn[Attention layers]
Attn --> Predict[Next token<br/>probability distribution]
Predict --> Sample[Sample one token]
Sample --> Append[Append to output]
Append --> Input
style Predict fill:#dbeafe,stroke:#1e40af
style Sample fill:#dcfce7,stroke:#15803d
# filename: example.py
# description: One generation step: score, normalize, sample.
logits = model(tokens)        # raw scores for every vocabulary entry
probs = softmax(logits[-1])   # distribution over the next token
next_token = sample(probs)    # draw one token from that distribution
Each new token depends on all previous ones. If the model drifts, the problem started earlier in the sequence.
Analogy: Think of an LLM as an engineer typing code with autocomplete turned on, one token at a time, no global plan.
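The score → normalize → sample step above can be sketched in plain Python. The numbers below are toy logits over a made-up 4-token vocabulary, not output from a real model:

```python
import math
import random

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs):
    # draw one token index according to the probability distribution
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# toy "model output": raw scores over a 4-token vocabulary
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)
next_token = sample(probs)
```

Run this in a loop, appending each sampled token to the input, and you have the generation cycle from the diagram: no plan, just one weighted guess after another.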
2. Why are tokens not words?
Models don’t read words; they read subword tokens.
"antidisestablishmentarianism" → ["anti", "dis", "establish", "ment", "arian", "ism"]
- Common words ≈ 1 token
- Rare or technical words = multiple tokens
- Cost and latency scale with token count
👉 Always check token length before sending prompts.
len(tokenizer.encode("Your prompt"))
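When an exact tokenizer (such as tiktoken) isn’t available, a common rule of thumb is roughly 4 characters per token for English text. The helper below is that heuristic only, useful for budget estimates, not billing:

```python
def rough_token_count(text: str) -> int:
    # crude estimate: ~4 characters per token for typical English text;
    # use the model's real tokenizer for exact counts
    return max(1, len(text) // 4)

prompt = "Summarize the following incident report in three bullet points."
estimate = rough_token_count(prompt)
```

Rare or technical words break into more tokens than this heuristic suggests, which is exactly why bills climb on jargon-heavy prompts.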
3. How does attention let the model "think"?
Each token decides which earlier tokens matter using self‑attention.
Query × Key → Attention weights → Weighted sum of Values
That’s the heart of the Transformer architecture.
During generation:
- First token = slow (O(n²))
- Later tokens = faster (cached, ≈ O(n))
Result: initial delay, then smooth streaming.
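The Query × Key → weights → weighted Values pipeline can be sketched for a single query vector. The vectors below are toy numbers chosen so the first key clearly matches the query:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(query, keys, values):
    # scaled dot-product attention for one query over n key/value pairs
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    # softmax the scores into attention weights
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # output = weighted blend of the value vectors
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]      # first key aligns with the query
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention(query, keys, values)
```

Because the first key aligns with the query, its value dominates the output: attention is just a learned, differentiable lookup over earlier tokens.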
4. What is the context window, and why does it limit memory?
The context window (e.g., 128k tokens) defines how much the model can “see.” Old tokens fade from focus in long prompts.
Fix it:
- Summarize before Q&A
- Use RAG to load only relevant chunks
- Keep critical info near the end; recency bias helps
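One mechanical way to apply the last two fixes is to trim history oldest-first so recent messages always survive. A minimal sketch, assuming messages are plain strings and using the rough 4-characters-per-token estimate:

```python
def fit_to_budget(messages, max_tokens, count=lambda m: len(m) // 4):
    # keep the most recent messages that fit the budget; drop the oldest first
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count(msg)
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Swap the `count` callable for a real tokenizer in production; the dropping order is the point, since it keeps critical recent context inside the window.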
Key learnings
LLMs are not reasoning machines; they’re compression‑based next‑token predictors with limited memory. Understanding that boundary makes you a better builder.
For a hands-on path through this topic, see Prompt Engineering Crash Course.
Key takeaways
- Next-token prediction means errors compound: if the output drifts, the problem started earlier in the sequence.
- Cost and latency scale with tokens, not words. Count tokens before sending prompts.
- Expect a slow first token (O(n²) attention), then cached, roughly O(n) streaming; design your UI around that delay.
- Context windows are finite and recency-biased: summarize, use RAG, and keep critical info near the end.
- To see this pattern wired into a full production agent stack, walk through the Build your own coding agent course, or start with the AI Agents Fundamentals primer.
Frequently asked questions
Why do LLMs perform worse with very long context windows?
Attention to earlier tokens fades as sequences grow long. The model's attention mechanism can process the entire input, but earlier tokens lose focus relative to recent ones. The post explains this as a memory limitation of next-token prediction. Solutions: summarize before Q&A, use RAG for relevant chunks only, or place critical info near the end to exploit recency bias.
Why does my LLM have high initial latency before streaming?
The first token requires computing attention across the entire input sequence, an O(n²) operation with high latency. Subsequent tokens use cached computations, reducing complexity to approximately O(n) and enabling smooth streaming. The post explains this is fundamental to Transformer architecture. Initial delay is unavoidable, but recognizing it helps you design buffering into your UI.
Why does my LLM prompt cost more than expected?
Cost scales with token count, not word count. LLMs tokenize text into subword units. Common words are 1 token, rare terms split into many. 'Antidisestablishmentarianism' costs 6 tokens. The post shows this matters for real systems. Always count tokens before sending prompts. Technical text and rare words inflate costs fast, so review your prompt structure if bills climb.
For the full reference, see the Anthropic API documentation.
Take the next step
- Generative AI Foundation Course: a deep dive into LLM internals, tokens, and practical applications
Continue Reading
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.