The big idea: it's all about prediction
If you saw the sentence: The dog chased the ___, what word comes next?
You probably thought ball, cat, or squirrel. You didn't know the answer for sure; you predicted it based on common patterns you've seen in language before.
That's exactly what an LLM does. It's a prediction machine. It predicts the most likely next word (or "token") based on all the text it has learned from.
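To make that concrete, here's a toy sketch in Python. The words and probabilities are made up purely for illustration (a real model scores tens of thousands of possible tokens), but the core move is the same: weigh every candidate continuation and favor the likely ones.
import random

# Made-up probabilities for the word after "The dog chased the ..."
next_word_probs = {
    "ball": 0.40,
    "cat": 0.30,
    "squirrel": 0.20,
    "mailman": 0.08,
    "spaceship": 0.02,
}

# The core move: pick the next word, weighted by how likely it is
words = list(next_word_probs)
weights = list(next_word_probs.values())
print(random.choices(words, weights=weights, k=1)[0])  # usually "ball"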
But if it's just guessing, how does it seem so smart? It's about how it guesses and the settings we can control.
How do we talk to an LLM?
In the coding world, you don't just "talk" to an LLM. You send it a structured request, often called an API call. Think of it as filling out a form for the LLM.
# 1. Import the necessary library
from openai import OpenAI
# 2. Initialize the connection client
# (In a real app, the API key is loaded securely)
client = OpenAI(api_key="...")
# 3. Create the chat completion request
response = client.chat.completions.create(
    model="gpt-4o-mini",  # Specify the model
    messages=[  # Define the message history
        {"role": "user", "content": "Explain what an LLM is in one sentence."}
    ],
    max_tokens=150  # Set a maximum token limit for the reply
)
# 4. Extract and print the text content of the reply
answer = response.choices[0].message.content
print(answer)
The magic isn't just in the prompt. It's in the settings you can add to that request. The most important one is "temperature."
Temperature: the creativity dial
"Temperature" is a setting that controls how creative or random the LLM's predictions are.
graph TD
A[Temperature Dial] --> B{LLM's Behavior}
B --> C[0.0: Predictable, Factual]
B --> D[0.7: Balanced, General Chat]
B --> E[1.5+: Creative, Unpredictable]
- Low Temperature (e.g., 0.0): The LLM will always pick the most obvious, safest next word. It's boring, predictable, and great for facts or code.
- High Temperature (e.g., 1.5): The LLM takes more risks, picking less common words. This makes it highly creative and imaginative, but also more likely to go off-topic.
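Under the hood, temperature rescales the model's scores for each candidate word before it picks one. Here's a rough sketch with made-up numbers; real models do this over an enormous vocabulary, but the effect is the same: low temperature piles the probability onto the top choice, high temperature spreads it out.
import math

# Made-up raw scores for the next word (purely illustrative)
scores = {"ball": 2.0, "cat": 1.5, "squirrel": 1.0, "spaceship": -1.0}

def to_probabilities(scores, temperature):
    # Divide each score by the temperature, then apply softmax
    scaled = {word: s / temperature for word, s in scores.items()}
    total = sum(math.exp(s) for s in scaled.values())
    return {word: round(math.exp(s) / total, 3) for word, s in scaled.items()}

print(to_probabilities(scores, temperature=0.2))  # "ball" gets ~92% of the probability
print(to_probabilities(scores, temperature=1.5))  # spread out: ball ~42%, cat ~30%, ...
(At temperature 0.0 there's nothing to divide by; in practice the API treats it as "always take the top choice," which is why the output barely changes between runs.)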
Here's how we'd add that setting to our code:
# This request asks for a creative slogan
# by turning the temperature up.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a catchy slogan for a coffee shop."}
    ],
    temperature=1.2,  # <-- Set the "creativity dial" high
    max_tokens=20
)
If you ran this, you'd get a different slogan almost every time. If you set temperature=0.0, you'd probably get the same slogan every single time.
What are tokens?
LLMs don't actually see "words." They see tokens.
Think of tokens as "word pieces." For English, 1 token is about 0.75 words.
graph TD
A["Human Language: 'Hello world!'"] --> B[Tokenization]
B --> C["'Hello' (1 token)"]
B --> D["' world' (1 token)"]
B --> E["'!' (1 token)"]
This matters for two big reasons: cost and limits.
- Cost: You pay for every token: both the tokens you send in (your prompt) and the tokens you get back (the answer).
- Limits: Every LLM has a "context window," or a maximum number of tokens it can remember at one time.
Different pieces of text "cost" different amounts of tokens: long words and other languages are often "more expensive." You can count tokens with a library like tiktoken:
import tiktoken
# Get the encoder for a specific model
encoding = tiktoken.encoding_for_model("gpt-4o")
# 'encoding.encode' turns text into a list of token numbers
tokens_hello = encoding.encode("Hello")
tokens_long_word = encoding.encode("Antidisestablishmentarianism")
tokens_chinese = encoding.encode("人工智能") # "Artificial Intelligence"
print(f"'Hello': {len(tokens_hello)} tokens")
print(f"'Antidisestablishmentarianism': {len(tokens_long_word)} tokens")
print(f"'人工智能': {len(tokens_chinese)} tokens")
# Output:
# 'Hello': 1 tokens
# 'Antidisestablishmentarianism': 5 tokens
# '人工智能': 6 tokens
This is why an LLM might feel "smarter" or "cheaper" in English: it was trained on more English tokens, so it's more efficient at processing them.
Context windows: the LLM's short-term memory
The context window is the LLM's entire short-term memory. It's the maximum number of tokens (your prompt + its answer) it can handle at once.
graph TD
A["Your Input (e.g., 2,000 Tokens)"] --- B["LLM's Brain"]
C["LLM's Reply (e.g., 1,000 Tokens)"] --- B
subgraph TOTAL["Total Memory Used: 3,000 Tokens"]
A
C
end
D{"Context Window Limit (e.g., 4,000 Tokens)"}
TOTAL -- "Must Be Less Than" --> D
If your conversation (all your prompts and all its answers) gets longer than this limit, the LLM starts to "forget" the beginning of the conversation.
This is the single biggest challenge in using LLMs. You can't just ask it to "summarize this 500-page book" by pasting the whole book, because the book is probably 200,000 tokens, but the LLM's memory (context window) might only be 8,000 or 128,000 tokens.
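You can check whether a request will even fit before you send it. Here's a rough sketch using tiktoken; the 8,000-token window and the 1,000 tokens reserved for the reply are made-up example numbers, not real model limits.
import tiktoken

CONTEXT_WINDOW = 8_000       # example limit; varies by model
RESERVED_FOR_REPLY = 1_000   # leave room for the answer

encoding = tiktoken.encoding_for_model("gpt-4o")

def fits_in_context(prompt_text):
    # Count the prompt's tokens and compare them against the budget
    prompt_tokens = len(encoding.encode(prompt_text))
    return prompt_tokens + RESERVED_FOR_REPLY <= CONTEXT_WINDOW

print(fits_in_context("Summarize this paragraph: ..."))  # True
# Pasting a whole 200,000-token book here would return False.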
The cost equation
Using LLMs isn't free. The cost is calculated very simply:
Total Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price)
A key insight: Output tokens (the LLM's answer) are almost always more expensive than input tokens (your prompt). It "costs" more for the LLM to think than to listen.
Here's the logic for a cost-calculating function:
# A simple function to estimate the cost of one LLM call.
# count_tokens reuses tiktoken, the same library from the earlier example.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text):
    # Turn the text into tokens and count them
    return len(encoding.encode(text))

def estimate_cost(input_text, output_text):
    # 1. Define example prices (per 1 MILLION tokens)
    INPUT_PRICE_PER_1M_TOKENS = 0.15   # $0.15
    OUTPUT_PRICE_PER_1M_TOKENS = 0.60  # $0.60 (4x more expensive!)

    # 2. Count the tokens in the prompt and in the reply
    input_tokens = count_tokens(input_text)
    output_tokens = count_tokens(output_text)

    # 3. Calculate the cost for each part
    input_cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M_TOKENS
    output_cost = (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M_TOKENS

    # 4. Add them up
    total_cost = input_cost + output_cost
    return total_cost
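To get a feel for the numbers (remember, the prices above are example values, not real pricing): a 1,000-token prompt with a 500-token reply would cost (1,000 / 1,000,000) × $0.15 + (500 / 1,000,000) × $0.60 = $0.00015 + $0.00030 = $0.00045. That's a fraction of a cent per call, but roughly $450 across a million similar calls.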
When do LLMs fail?
LLMs are amazing, but they are not perfect. They have very predictable failure modes.
1. Hallucinations (making stuff up)
An LLM's job is to predict the next word. It does not know what is true or false. A hallucination is when the LLM confidently generates a plausible-sounding but completely false statement.
If you ask it:
"Tell me about the 2023 Nobel Prize winner in Astrobotany."
It won't say, "That's not a real prize." It will invent a person, a university, and their "groundbreaking research" because those words statistically follow the pattern of your question.
2. Bad at math
LLMs are text-prediction machines, not calculators. They can recognize simple math (like 2 + 2 = 4) because they've seen that text in their training data. But they can't do math.
If you ask it:
"What is 234 * 567?"
It is very likely to give you the wrong answer. It's just predicting what a number looks like in that position, not actually performing the calculation.
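The practical fix is to do the arithmetic in ordinary code and let the LLM handle the words around it. A one-line check (the real answer, for the record):
# The reliable way: compute the number yourself instead of trusting predicted digits
print(234 * 567)  # 132678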
Frequently asked questions
Why do LLMs hallucinate?
Hallucinations happen because LLMs are prediction machines, not fact-checkers. They pick the most statistically likely next word based on training patterns, not on what's true. When asked something outside their training, they confidently generate plausible-sounding but false statements. The LLM doesn't know the difference between a real fact and an invented one.
Should I increase temperature in LLM requests?
Temperature controls the randomness of an LLM's predictions: lower temps give consistent outputs, higher temps give creative ones. Use low temperature (0.0-0.3) for code generation and fact-checking. Use high temperature (0.7-1.5) for creative writing or brainstorming. Pick based on your actual use case needs, not on theory.
What happens when my prompt exceeds the context window?
The LLM forgets earlier conversation when your context exceeds its window limit. This is the single biggest challenge when using LLMs. You can't paste a 500-page book (200k tokens) into an 8k context window and expect answers about the whole thing. Design around context limits from the start.
For a hands-on path through this topic, see Prompt Engineering Crash Course.
For the full reference, see the Anthropic API documentation.
Key takeaways
- LLMs are predictors, not thinkers. They just guess the next most likely word.
- Temperature is your "creativity dial." Low for facts, high for fiction.
- Tokens are the "word pieces" you pay for. Everything has a cost.
- Context Windows are the LLM's "short-term memory." This is their biggest limitation.
- LLMs Hallucinate and are bad at math. Never trust them with facts or numbers without checking.
For more on building production AI systems, check out our AI Engineering Bootcamp.
Take the next step
- Generative AI Foundation Course: master LLM fundamentals, tokens, and structured output
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.