For engineers who use GPT APIs but don’t trust the “it’s magic” answer.

Large Language Models (LLMs) like GPT‑4 or Claude aren't magical. They're predictive engines trained to guess the next token in a sequence. That's it.

1. How does next-token prediction work?

Every output (chat, code, essay) is a sequence of guesses.

graph LR
    Input[Prompt tokens] --> Embed[Token embeddings]
    Embed --> Attn[Attention layers]
    Attn --> Predict[Next token<br/>probability distribution]
    Predict --> Sample[Sample one token]
    Sample --> Append[Append to output]
    Append --> Input

    style Predict fill:#dbeafe,stroke:#1e40af
    style Sample fill:#dcfce7,stroke:#15803d

# filename: example.py
# description: One generation step (model, softmax, and sample are placeholders, not a real API).
logits = model(tokens)        # forward pass over every token generated so far
probs = softmax(logits[-1])   # probability distribution for the next position
next_token = sample(probs)    # pick one token from that distribution

Each new token depends on all previous ones. If the model drifts, the problem started earlier in the sequence.
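
The same step repeats in a loop, with each new token appended to the input. A minimal sketch reusing the placeholder model, softmax, and sample helpers from the snippet above (none of them is a real library API):

def generate(model, tokens, max_new_tokens=50):
    # tokens: list of token ids from the prompt; grows by one id per iteration
    for _ in range(max_new_tokens):
        logits = model(tokens)           # re-run over everything generated so far
        probs = softmax(logits[-1])      # distribution over the vocabulary for the next position
        tokens.append(sample(probs))     # the new token becomes input for the next step
    return tokens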

Analogy: Think of an LLM as an engineer typing code with autocomplete turned on, one token at a time, no global plan.

2. Why are tokens not words?

Models don't read "words"; they read subword tokens.

"antidisestablishmentarianism" → ["anti", "dis", "establish", "ment", "arian", "ism"]

  • Common words ≈ 1 token
  • Rare or technical words = multiple tokens
  • Cost and latency scale with token count

👉 Always check token length before sending prompts.

len(tokenizer.encode("Your prompt"))
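
For example, with OpenAI's tiktoken library (a minimal sketch; swap in whatever tokenizer matches your model and provider):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")         # encoding used by the target model
token_count = len(enc.encode("Your prompt here"))  # this count, not word count, drives cost and latency
print(token_count)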

3. How does attention let the model "think"?

Each token decides which earlier tokens matter using self‑attention.

Query × Key → Attention weights → Weighted sum of Values

That’s the heart of the Transformer architecture.
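
In code, that single line is roughly the following. A minimal NumPy sketch of scaled dot-product attention for one head, without masking or batching:

import numpy as np

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) query, key, and value matrices
    scores = Q @ K.T / np.sqrt(K.shape[-1])                   # how much each query token cares about each key token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # weighted sum of the values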

During generation:

  • First token = slow (O(n²))
  • Later tokens = faster (cached, ≈ O(n))

Result: initial delay, then smooth streaming.
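
The reason later tokens are cheaper is the KV cache: keys and values for earlier tokens are stored, so each new token only computes its own projections and attends over the cache. A toy sketch built on the attention function above (random weights stand in for learned ones):

import numpy as np

d_k = 64
W_q, W_k, W_v = (np.random.randn(d_k, d_k) for _ in range(3))  # toy projections; a real model learns these
cache_k, cache_v = [], []

def decode_step(x):
    # x: embedding of the single newest token, shape (d_k,)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    cache_k.append(k)                           # earlier keys/values are reused, never recomputed
    cache_v.append(v)
    K, V = np.stack(cache_k), np.stack(cache_v)
    return attention(q[None, :], K, V)          # one query against the whole cache: ~O(n) per new token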

4. What is the context window and why does it limit memory?

The context window (e.g., 128k tokens) defines how much the model can “see.” Old tokens fade from focus in long prompts.

Fix it:

  • Summarize before Q&A
  • Use RAG to load only relevant chunks
  • Keep critical info near the end; recency bias helps (see the sketch below)
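
One mechanical way to respect the window is to trim history from the oldest message until the rest fits. A minimal sketch; fit_to_context, token_len, and the 128,000 limit are illustrative, not any provider's API:

def fit_to_context(system_prompt, messages, token_len, max_tokens=128_000):
    # token_len: callable returning the token count of a string (e.g. wrap tiktoken)
    budget = max_tokens - token_len(system_prompt)
    kept = []
    for msg in reversed(messages):    # walk from newest to oldest
        cost = token_len(msg)
        if cost > budget:
            break                     # everything older than this point is dropped
        kept.append(msg)
        budget -= cost
    return [system_prompt] + list(reversed(kept))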

Key learnings

LLMs are not reasoning machines; they're compression-based next-token predictors with limited memory. Understanding that boundary makes you a better builder.


For a hands-on path through this topic, see Prompt Engineering Crash Course.


Frequently asked questions

Why do LLMs perform worse with very long context windows?

Attention to earlier tokens fades as sequences grow long. The model's attention mechanism can technically process the entire input, but earlier tokens lose focus relative to recent ones. The post explains this as a memory limitation of next-token prediction. Solutions: summarize before Q&A, use RAG to load only the relevant chunks, or place critical info near the end to exploit recency bias.

Why does my LLM have high initial latency before streaming?

The first token requires computing attention across the entire input sequence, an O(n²) operation with high latency. Subsequent tokens reuse cached computations, reducing the per-token cost to roughly O(n) and enabling smooth streaming. The post explains that this is fundamental to the Transformer architecture. The initial delay is unavoidable, but recognizing it helps you design buffering into your UI.

Why does my LLM prompt cost more than expected?

Cost scales with token count, not word count. LLMs tokenize text into subword units: common words are usually 1 token, while rare or technical terms split into many ("antidisestablishmentarianism" alone is 6 tokens). The post shows why this matters for real systems. Always count tokens before sending prompts; technical text and rare words inflate costs fast, so review your prompt structure if bills climb.

For the full reference, see the Anthropic API documentation.
