The big idea: it's all about prediction
If you saw the sentence: The dog chased the ___, what word comes next?
You probably thought ball, cat, or squirrel. You didn't know the answer for sure; you predicted it based on common patterns you've seen in language before.
That's exactly what an LLM does. It's a prediction machine. It predicts the most likely next word (or "token") based on all the text it has learned from.
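To make that concrete, here's a toy sketch in Python. The words and probabilities are made up purely for illustration (a real model scores tens of thousands of possible tokens), but the core move is the same: weigh every candidate continuation and favor the likely ones.
import random

# Made-up probabilities for the word after "The dog chased the ..."
next_word_probs = {
    "ball": 0.40,
    "cat": 0.30,
    "squirrel": 0.20,
    "mailman": 0.08,
    "spaceship": 0.02,
}

# The core move: pick the next word, weighted by how likely it is
words = list(next_word_probs)
weights = list(next_word_probs.values())
print(random.choices(words, weights=weights, k=1)[0])  # usually "ball"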
But if it's just guessing, how does it seem so smart? It's about how it guesses and the settings we can control.
How do we talk to an LLM?
In the coding world, you don't just "talk" to an LLM. You send it a structured request, often called an API call. Think of it as filling out a form for the LLM.
# 1. Import the necessary library
from openai import OpenAI
# 2. Initialize the connection client
# (In a real app, the API key is loaded securely)
client = OpenAI(api_key="...")
# 3. Create the chat completion request
response = client.chat.completions.create(
    model="gpt-4o-mini",  # Specify the model
    messages=[  # Define the message history
        {"role": "user", "content": "Explain what an LLM is in one sentence."}
    ],
    max_tokens=150  # Set a maximum token limit for the reply
)
# 4. Extract and print the text content of the reply
answer = response.choices[0].message.content
print(answer)
The magic isn't just in the prompt. It's in the settings you can add to that request. The most important one is "temperature."
Temperature: the creativity dial
"Temperature" is a setting that controls how creative or random the LLM's predictions are.
graph TD
A[Temperature Dial] --> B{LLM's Behavior}
B --> C[0.0: Predictable, Factual]
B --> D[0.7: Balanced, General Chat]
B --> E[1.5+: Creative, Unpredictable]
- Low Temperature (e.g., 0.0): The LLM will always pick the most obvious, safest next word. It's boring, predictable, and great for facts or code.
- High Temperature (e.g., 1.5): The LLM takes more risks, picking less common words. This makes it highly creative and imaginative, but also more likely to go off-topic.
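Under the hood, temperature rescales the model's scores for each candidate word before it picks one. Here's a rough sketch with made-up numbers; real models do this over an enormous vocabulary, but the effect is the same: low temperature piles the probability onto the top choice, high temperature spreads it out.
import math

# Made-up raw scores for the next word (purely illustrative)
scores = {"ball": 2.0, "cat": 1.5, "squirrel": 1.0, "spaceship": -1.0}

def to_probabilities(scores, temperature):
    # Divide each score by the temperature, then apply softmax
    scaled = {word: s / temperature for word, s in scores.items()}
    total = sum(math.exp(s) for s in scaled.values())
    return {word: round(math.exp(s) / total, 3) for word, s in scaled.items()}

print(to_probabilities(scores, temperature=0.2))  # "ball" gets ~92% of the probability
print(to_probabilities(scores, temperature=1.5))  # spread out: ball ~42%, cat ~30%, ...
(At temperature 0.0 there's nothing to divide by; in practice the API treats it as "always take the top choice," which is why the output barely changes between runs.)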
Here's how we'd add that setting to our code:
# This request asks for a creative slogan
# by turning the temperature up.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Write a catchy slogan for a coffee shop."}
    ],
    temperature=1.2,  # <-- Set the "creativity dial" high
    max_tokens=20
)
If you ran this, you'd get a different slogan almost every time. If you set temperature=0.0, you'd probably get the same slogan every single time.
What are tokens?
LLMs don't actually see "words." They see tokens.
Think of tokens as "word pieces." For English, 1 token is about 0.75 words.
graph TD
A["Human Language: 'Hello world!'"] --> B[Tokenization]
B --> C["'Hello' (1 token)"]
B --> D["' world' (1 token)"]
B --> E["'!' (1 token)"]
This matters for two big reasons: cost and limits.
- Cost: You pay for every token: both the tokens you send in (your prompt) and the tokens you get back (the answer).
- Limits: Every LLM has a "context window," or a maximum number of tokens it can remember at one time.
Different pieces of text "cost" different amounts of tokens: long words and other languages are often "more expensive." You can count tokens with a library like tiktoken:
import tiktoken
# Get the encoder for a specific model
encoding = tiktoken.encoding_for_model("gpt-4o")
# 'encoding.encode' turns text into a list of token numbers
tokens_hello = encoding.encode("Hello")
tokens_long_word = encoding.encode("Antidisestablishmentarianism")
tokens_chinese = encoding.encode("人工智能") # "Artificial Intelligence"
print(f"'Hello': {len(tokens_hello)} tokens")
print(f"'Antidisestablishmentarianism': {len(tokens_long_word)} tokens")
print(f"'人工智能': {len(tokens_chinese)} tokens")
# Output:
# 'Hello': 1 tokens
# 'Antidisestablishmentarianism': 5 tokens
# '人工智能': 6 tokens
This is why an LLM might feel "smarter" or "cheaper" in English: it was trained on more English tokens, so it's more efficient at processing them.
Context windows: the LLM's short-term memory
The context window is the LLM's entire short-term memory. It's the maximum number of tokens (your prompt + its answer) it can handle at once.
graph TD
A["Your Input (e.g., 2,000 Tokens)"] --- B["LLM's Brain"]
C["LLM's Reply (e.g., 1,000 Tokens)"] --- B
subgraph TOTAL["Total Memory Used: 3,000 Tokens"]
A
C
end
D{"Context Window Limit (e.g., 4,000 Tokens)"}
TOTAL -- "Must Be Less Than" --> D
If your conversation (all your prompts and all its answers) gets longer than this limit, the LLM starts to "forget" the beginning of the conversation.
This is the single biggest challenge in using LLMs. You can't just ask it to "summarize this 500-page book" by pasting the whole book, because the book is probably 200,000 tokens, but the LLM's memory (context window) might only be 8,000 or 128,000 tokens.
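You can check whether a request will even fit before you send it. Here's a rough sketch using tiktoken; the 8,000-token window and the 1,000 tokens reserved for the reply are made-up example numbers, not real model limits.
import tiktoken

CONTEXT_WINDOW = 8_000       # example limit; varies by model
RESERVED_FOR_REPLY = 1_000   # leave room for the answer

encoding = tiktoken.encoding_for_model("gpt-4o")

def fits_in_context(prompt_text):
    # Count the prompt's tokens and compare them against the budget
    prompt_tokens = len(encoding.encode(prompt_text))
    return prompt_tokens + RESERVED_FOR_REPLY <= CONTEXT_WINDOW

print(fits_in_context("Summarize this paragraph: ..."))  # True
# Pasting a whole 200,000-token book here would return False.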
The cost equation
Using LLMs isn't free. The cost is calculated very simply:
Total Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price)
A key insight: Output tokens (the LLM's answer) are almost always more expensive than input tokens (your prompt). It "costs" more for the LLM to think than to listen.
Here's the logic for a cost-calculating function:
# A simple function to estimate the cost of one LLM call.
# count_tokens reuses tiktoken, the same library from the earlier example.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text):
    # Turn the text into tokens and count them
    return len(encoding.encode(text))

def estimate_cost(input_text, output_text):
    # 1. Define example prices (per 1 MILLION tokens)
    INPUT_PRICE_PER_1M_TOKENS = 0.15   # $0.15
    OUTPUT_PRICE_PER_1M_TOKENS = 0.60  # $0.60 (4x more expensive!)

    # 2. Count the tokens in the prompt and in the reply
    input_tokens = count_tokens(input_text)
    output_tokens = count_tokens(output_text)

    # 3. Calculate the cost for each part
    input_cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_1M_TOKENS
    output_cost = (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_1M_TOKENS

    # 4. Add them up
    total_cost = input_cost + output_cost
    return total_cost
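To get a feel for the numbers (remember, the prices above are example values, not real pricing): a 1,000-token prompt with a 500-token reply would cost (1,000 / 1,000,000) × $0.15 + (500 / 1,000,000) × $0.60 = $0.00015 + $0.00030 = $0.00045. That's a fraction of a cent per call, but roughly $450 across a million similar calls.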
When do LLMs fail?
LLMs are amazing, but they are not perfect. They have very predictable failure modes.
1. Hallucinations (making stuff up)
An LLM's job is to predict the next word. It does not know what is true or false. A hallucination is when the LLM confidently generates a plausible-sounding but completely false statement.
If you ask it:
"Tell me about the 2023 Nobel Prize winner in Astrobotany."
It won't say, "That's not a real prize." It will invent a person, a university, and their "groundbreaking research" because those words statistically follow the pattern of your question.
2. Bad at math
LLMs are text-prediction machines, not calculators. They can recognize simple math (like 2 + 2 = 4) because they've seen that text in their training data. But they can't do math.
If you ask it:
"What is 234 * 567?"
It is very likely to give you the wrong answer. It's just predicting what a number looks like in that position, not actually performing the calculation.
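The practical fix is to do the arithmetic in ordinary code and let the LLM handle the words around it. A one-line check (the real answer, for the record):
# The reliable way: compute the number yourself instead of trusting predicted digits
print(234 * 567)  # 132678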
Frequently asked questions
Why do LLMs hallucinate?
Hallucinations happen because LLMs are prediction machines, not fact-checkers. They pick the most statistically likely next word based on training patterns, not on what's true. When asked something outside their training, they confidently generate plausible-sounding but false statements. The LLM doesn't know the difference between a real fact and an invented one.
Should I increase temperature in LLM requests?
Temperature controls the randomness of an LLM's predictions: lower temps give consistent outputs, higher temps give creative ones. Use low temperature (0.0-0.3) for code generation and fact-checking. Use high temperature (0.7-1.5) for creative writing or brainstorming. Pick based on your actual use case needs, not on theory.
What happens when my prompt exceeds the context window?
The LLM forgets earlier conversation when your context exceeds its window limit. This is the single biggest challenge when using LLMs. You can't paste a 500-page book (200k tokens) into an 8k context window and expect answers about the whole thing. Design around context limits from the start.
For a hands-on path through this topic, see Prompt Engineering Crash Course.
For the full reference, see the Anthropic API documentation.
Key takeaways
- LLMs are predictors, not thinkers. They just guess the next most likely word.
- Temperature is your "creativity dial." Low for facts, high for fiction.
- Tokens are the "word pieces" you pay for. Everything has a cost.
- Context Windows are the LLM's "short-term memory." This is their biggest limitation.
- LLMs Hallucinate and are bad at math. Never trust them with facts or numbers without checking.
For more on building production AI systems, check out our AI Engineering Bootcamp.
Take the next step
- Generative AI Foundation Course: master LLM fundamentals, tokens, and structured output
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.