For engineers who use GPT APIs but don’t trust the “it’s magic” answer.

Large Language Models (LLMs) like GPT‑4 or Claude aren't magical. They're predictive engines trained to guess the next token in a sequence. That's it.

1. How does next-token prediction work?

Every output (chat, code, essay) is a sequence of guesses.

graph LR
    Input[Prompt tokens] --> Embed[Token embeddings]
    Embed --> Attn[Attention layers]
    Attn --> Predict[Next token<br/>probability distribution]
    Predict --> Sample[Sample one token]
    Sample --> Append[Append to output]
    Append --> Input

    style Predict fill:#dbeafe,stroke:#1e40af
    style Sample fill:#dcfce7,stroke:#15803d

# filename: example.py
# description: One generation step (model, softmax, and sample are placeholders, not a real API).
logits = model(tokens)        # forward pass over every token generated so far
probs = softmax(logits[-1])   # probability distribution for the next position
next_token = sample(probs)    # pick one token from that distribution

Each new token depends on all previous ones. If the model drifts, the problem started earlier in the sequence.
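
The same step repeats in a loop, with each new token appended to the input. A minimal sketch reusing the placeholder model, softmax, and sample helpers from the snippet above (none of them is a real library API):

def generate(model, tokens, max_new_tokens=50):
    # tokens: list of token ids from the prompt; grows by one id per iteration
    for _ in range(max_new_tokens):
        logits = model(tokens)           # re-run over everything generated so far
        probs = softmax(logits[-1])      # distribution over the vocabulary for the next position
        tokens.append(sample(probs))     # the new token becomes input for the next step
    return tokens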

Analogy: Think of an LLM as an engineer typing code with autocomplete turned on, one token at a time, no global plan.

2. Why are tokens not words?

Models don't read "words"; they read subword tokens.

"antidisestablishmentarianism" → ["anti", "dis", "establish", "ment", "arian", "ism"]

  • Common words ≈ 1 token
  • Rare or technical words = multiple tokens
  • Cost and latency scale with token count

👉 Always check token length before sending prompts.

len(tokenizer.encode("Your prompt"))
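
For example, with OpenAI's tiktoken library (a minimal sketch; swap in whatever tokenizer matches your model and provider):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")         # encoding used by the target model
token_count = len(enc.encode("Your prompt here"))  # this count, not word count, drives cost and latency
print(token_count)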

3. How does attention let the model "think"?

Each token decides which earlier tokens matter using self‑attention.

Query × Key → Attention weights → Weighted sum of Values

That’s the heart of the Transformer architecture.
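
In code, that single line is roughly the following. A minimal NumPy sketch of scaled dot-product attention for one head, without masking or batching:

import numpy as np

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) query, key, and value matrices
    scores = Q @ K.T / np.sqrt(K.shape[-1])                   # how much each query token cares about each key token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # weighted sum of the values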

During generation:

  • First token = slow (O(n²))
  • Later tokens = faster (cached, ≈ O(n))

Result: initial delay, then smooth streaming.
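
The reason later tokens are cheaper is the KV cache: keys and values for earlier tokens are stored, so each new token only computes its own projections and attends over the cache. A toy sketch built on the attention function above (random weights stand in for learned ones):

import numpy as np

d_k = 64
W_q, W_k, W_v = (np.random.randn(d_k, d_k) for _ in range(3))  # toy projections; a real model learns these
cache_k, cache_v = [], []

def decode_step(x):
    # x: embedding of the single newest token, shape (d_k,)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    cache_k.append(k)                           # earlier keys/values are reused, never recomputed
    cache_v.append(v)
    K, V = np.stack(cache_k), np.stack(cache_v)
    return attention(q[None, :], K, V)          # one query against the whole cache: ~O(n) per new token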

4. What is the context window and why does it limit memory?

The context window (e.g., 128k tokens) defines how much the model can “see.” Old tokens fade from focus in long prompts.

Fix it:

  • Summarize before Q&A
  • Use RAG to load only relevant chunks
  • Keep critical info near the end; recency bias helps (see the sketch below)
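
One mechanical way to respect the window is to trim history from the oldest message until the rest fits. A minimal sketch; fit_to_context, token_len, and the 128,000 limit are illustrative, not any provider's API:

def fit_to_context(system_prompt, messages, token_len, max_tokens=128_000):
    # token_len: callable returning the token count of a string (e.g. wrap tiktoken)
    budget = max_tokens - token_len(system_prompt)
    kept = []
    for msg in reversed(messages):    # walk from newest to oldest
        cost = token_len(msg)
        if cost > budget:
            break                     # everything older than this point is dropped
        kept.append(msg)
        budget -= cost
    return [system_prompt] + list(reversed(kept))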

Key learnings

LLMs are not reasoning machines; they're compression-based next-token predictors with limited memory. Understanding that boundary makes you a better builder.


For a hands-on path through this topic, see Prompt Engineering Crash Course.


Frequently asked questions

Why do LLMs perform worse with very long context windows?

Attention to earlier tokens fades as sequences grow long. The model's attention mechanism can technically process the entire input, but earlier tokens lose focus relative to recent ones. The post explains this as a memory limitation of next-token prediction. Solutions: summarize before Q&A, use RAG to load only the relevant chunks, or place critical info near the end to exploit recency bias.

Why does my LLM have high initial latency before streaming?

The first token requires computing attention across the entire input sequence, an O(n²) operation with high latency. Subsequent tokens reuse cached computations, reducing the per-token cost to roughly O(n) and enabling smooth streaming. The post explains that this is fundamental to the Transformer architecture. The initial delay is unavoidable, but recognizing it helps you design buffering into your UI.

Why does my LLM prompt cost more than expected?

Cost scales with token count, not word count. LLMs tokenize text into subword units: common words are usually 1 token, while rare or technical terms split into many ("antidisestablishmentarianism" alone is 6 tokens). The post shows why this matters for real systems. Always count tokens before sending prompts; technical text and rare words inflate costs fast, so review your prompt structure if bills climb.

For the full reference, see the Anthropic API documentation.
