Chain-of-thought prompting for reasoning
We've built a bot that gives specific answers (explicit instructions and role prompting) and another that formats data (structured JSON output). Now, we'll tackle a harder problem: logic.
This post is for you if you've ever been shocked by an AI failing a simple riddle. LLMs are "text predictors", not "calculators". They are brilliant at language, but they often fail at simple math and logic because they take an "intuitive" shortcut. They predict the most likely next word, which is often the wrong but plausible-sounding answer.
Today, we'll build a Math Tutor Bot and use Chain-of-Thought (CoT) Prompting to force it to slow down, show its work, and get the correct answer.
Why does the model confidently guess wrong?
Let's give an LLM a classic logic puzzle.
Use Case: "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?"
```mermaid
graph TD
    A["User: 'A bat and a ball cost $1.10...'"] --> B(LLM)
    B --> C["Bot: 'The ball costs 10 cents.'"]
    style C fill:#ffebee,stroke:#b71c1c,color:#212121
```
Why this is bad:
- It's WRONG.
- It's confident. The user will trust this incorrect answer.
- The Flawed Logic: The LLM's "fast brain" saw $1.10 and $1.00 and just subtracted.
- The Real Logic: If the ball is $0.10, the bat is $1.10 ($1.00 more). The total is $1.20, which is wrong.
The correct answer is 5 cents.
- Ball = $0.05
- Bat = $1.05
- Total = $1.10
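The algebra behind those numbers is easy to verify in a few lines of Python (a standalone sanity check, not part of the bot):

```python
# The puzzle as two equations:
#   ball + bat = 1.10    (the total)
#   bat = ball + 1.00    (the bat costs $1.00 more)
# Substituting: ball + (ball + 1.00) = 1.10  ->  2 * ball = 0.10
total = 1.10
difference = 1.00

ball = round((total - difference) / 2, 2)  # round away float noise
bat = round(ball + difference, 2)

print(f"Ball: ${ball:.2f}, Bat: ${bat:.2f}")  # Ball: $0.05, Bat: $1.05
assert round(ball + bat, 2) == 1.10
```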
How does chain-of-thought (CoT) prompting fix it?
The LLM got it wrong because it tried to answer in one step. We can fix this by forcing it to show its work. The process of reasoning helps the LLM catch its own mistakes.
This is the famous Chain-of-Thought (CoT) technique.
The "How": We'll add a simple magic phrase to our prompt: "Let's think step by step."
```python
# filename: example.py
# description: The bat-and-ball prompt with the chain-of-thought trigger added.
prompt = """
A bat and a ball cost $1.10 in total. The bat costs $1.00
more than the ball. How much does the ball cost?

Let's think step by step.
"""
```
```mermaid
graph TD
    A["User: Bat & Ball problem"] --> B["Add magic phrase: Let's think step by step"]
    B --> C(LLM)
    C --> D["Step 1: Define variables"]
    D --> E["Step 2: Set up equations"]
    E --> F["Step 3: Solve step by step"]
    F --> G["Step 4: Verify answer"]
    G --> H["Bot: Correct answer 5 cents"]
    style H fill:#e8f5e9,stroke:#388e3c,color:#212121
```
Observation: It worked! By forcing the LLM to write out the logical steps, it's no longer a "guessing" problem. It's a "sequence completion" problem, which LLMs are excellent at. We've guided it to the correct answer.
Think About It: This CoT prompt makes the answer very long (verbose). In a real app, how could you show this to the user? (Hint: Maybe a "Show my work" toggle?)
This single technique is one of the most powerful in all of prompt engineering. It's the simplest way to improve the "reasoning" ability of any LLM. For more on how LLMs work internally, see our LLM fundamentals guide.
Challenge for you
- Use Case: You have a list of tasks for a project: Task A (5 days), Task B (3 days, needs A to finish), Task C (2 days, needs B to finish).
- The Problem: You ask the LLM, "When will Task C be done?" and it just guesses.
- Your Task: Write a Chain-of-Thought prompt that forces the LLM to calculate the dependencies and find the correct total time.
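Whatever prompt you write, you can check the LLM's answer against ground truth computed in code. A minimal sketch (the schedule representation and function name are ours):

```python
# Each task: (duration_days, prerequisite or None)
tasks = {
    "A": (5, None),
    "B": (3, "A"),
    "C": (2, "B"),
}

def finish_day(name: str) -> int:
    """Day on which a task finishes, assuming work starts at day 0."""
    duration, prereq = tasks[name]
    start = finish_day(prereq) if prereq else 0
    return start + duration

print(finish_day("C"))  # the number your CoT prompt should arrive at
```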
Frequently asked questions
Why do LLMs fail at simple math problems?
LLMs are text predictors, not calculators. They pick the most likely next word based on probability, which often surfaces a plausible-sounding but incorrect answer. The bat-and-ball problem shows this: the model guesses $0.10 without actually calculating. Chain-of-Thought fixes it by forcing step-by-step work, shifting from guessing to sequence completion where LLMs excel.
How much does chain-of-thought prompting slow down responses?
CoT makes responses significantly longer and more verbose; the trade-off is correctness versus latency. For production systems, the solution isn't to avoid CoT but to design around it, like a "Show my work" toggle that keeps the UI fast while showing reasoning on demand. The trade-off is worth it when correctness matters.
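One way to implement that toggle: instruct the model to end its reply with a marked final line, then split the reasoning from the answer in code. A sketch — the `Final answer:` marker is a convention you must request in the prompt, not something the model emits by default:

```python
def split_reasoning(reply: str, marker: str = "Final answer:"):
    """Split an LLM reply into (reasoning, final_answer).

    Assumes the prompt told the model to end with a line
    like 'Final answer: 5 cents'.
    """
    if marker in reply:
        reasoning, _, answer = reply.rpartition(marker)
        return reasoning.strip(), answer.strip()
    return "", reply.strip()  # no marker found: show the whole reply

reply = (
    "Let the ball cost x. Then the bat costs x + 1.00.\n"
    "x + (x + 1.00) = 1.10, so 2x = 0.10 and x = 0.05.\n"
    "Final answer: 5 cents"
)
steps, answer = split_reasoning(reply)
print(answer)  # shown by default; `steps` goes behind the toggle
```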
Should I use chain-of-thought in every prompt?
No. CoT is specifically for problems requiring logical reasoning across multiple steps: math, dependencies, debugging logic. For simple retrieval or classification, the overhead doesn't justify the gains. Apply CoT where the real problem is reasoning, not guessing. Use the simplest technique that actually solves your problem.
For the full reference, see the Anthropic prompt engineering guide.
Key takeaways
- LLMs are pattern matchers, not calculators: They predict likely text, not solve equations
- Chain-of-thought forces reasoning: Asking for step-by-step work transforms guessing into logical problem-solving
- The magic phrase works: Simply adding "Let's think step by step" dramatically improves accuracy on logic problems
- CoT improves all reasoning tasks: This technique works for math, logic puzzles, planning, and any task requiring sequential thinking
For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
What does taking the next step look like?
- Prompt Engineering Crash Course: Master chain-of-thought, self-critique, and advanced techniques
Continue Reading
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.