LangChain chain types: stuff vs map reduce vs refine
You picked chain_type='stuff' and your summary crashed above 50 pages
You wired up a LangChain summarization chain in 5 minutes. It worked great on the test PDF. You handed it a real customer's document, hit the context limit, and got a ContextWindowExceeded error. Your fix was to switch to chain_type='map_reduce' because Stack Overflow said so. Now it works, but the summaries are weirdly disjointed and 3 times more expensive than they should be.
This is the LangChain summarization tax. The library hands you 4 chain types (stuff, map_reduce, refine, map_rerank) with almost no guidance about when each one wins. The defaults are tuned for demos. Production picks are different.
This post is the decision framework. By the end you will know exactly which chain type to use for which document size, what each one actually does under the hood, and the cost and quality trade-offs that should drive the choice.
What are LangChain chain types and why do they exist?
A chain type is a strategy for handling documents that are too big to fit in a single LLM call. Each strategy splits the document, sends pieces to the model, and combines the results in a different way.
graph TD
Doc[Long document N pages] --> Choice{Chain type}
Choice -->|stuff| Stuff[Send everything in one call]
Choice -->|map_reduce| Map[Summarize each chunk independently then combine]
Choice -->|refine| Refine[Summarize chunk 1 then iteratively refine with chunk 2 3 4]
Choice -->|map_rerank| Rerank[Score each chunk and return the best answer]
Stuff --> Out[Final summary]
Map --> Out
Refine --> Out
Rerank --> Out
style Stuff fill:#dcfce7,stroke:#15803d
style Map fill:#dbeafe,stroke:#1e40af
style Refine fill:#fef3c7,stroke:#b45309
style Rerank fill:#fce7f3,stroke:#be185d
The 4 strategies differ on 3 axes: how many LLM calls they make, how much context coherence they preserve, and how predictable their cost is. Pick wrong and you either overpay, lose coherence, or hit the context limit. Pick right and you get the cheapest option that fits your document.
When should you use the stuff chain?
Use stuff when the entire document fits comfortably inside the model's context window with room to spare for the summary. That is the entire test. Nothing else matters.
The stuff chain concatenates every chunk into one big prompt and makes a single LLM call. One round trip. Maximum coherence because the model sees everything at once. Lowest cost per token because there is no per-chunk overhead. The catch is that it crashes the moment your document exceeds the context window.
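That "fits with room to spare" test can be made concrete with a rough token estimate. Here is a minimal sketch; the 4-characters-per-token heuristic, the window size, and the overhead numbers are illustrative assumptions, not tokenizer-accurate counts:

```python
# Rough fits-in-context check for the stuff chain.
# Heuristic: ~4 characters per token. This over- or under-counts for
# code-heavy or non-English text; use a real tokenizer for exact math.

def estimate_tokens(text: str) -> int:
    return len(text) // 4

def fits_in_context(document: str, context_window: int = 200_000,
                    summary_budget: int = 2_000) -> bool:
    """True if the document, prompt overhead, and the summary's
    output budget all fit inside the model's context window."""
    prompt_overhead = 200  # instructions, formatting, etc.
    return (estimate_tokens(document) + prompt_overhead
            + summary_budget <= context_window)

# A 100-page contract at ~3,000 characters per page:
doc = "x" * (100 * 3_000)
print(fits_in_context(doc))  # True: ~75K tokens fits easily in 200K
```

If this returns True, use stuff and stop; the rest of this post only matters when it returns False.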
In practice, with modern long-context models (200K tokens for Claude, 128K for GPT-4-turbo), stuff handles roughly:
| Document type | Approximate page count | Use stuff? |
|---|---|---|
| Blog post | up to 50 pages | Yes |
| Legal contract | up to 100 pages | Yes |
| Technical RFC | up to 80 pages | Yes |
| Full novel | 300+ pages | No |
| Multi-doc corpus | varies | No, use Map Reduce |
Anything that fits, ship with stuff. The newer the model, the more this is the right answer. Long-context models killed off most of the use cases for the other chain types. Do not reach for fancier strategies until stuff actually fails.
When does map reduce beat stuff?
map_reduce wins when the document is larger than the context window or when you want parallel processing across many independent chunks. The pattern: summarize each chunk in parallel (the "map" step), then combine the chunk summaries with a final LLM call (the "reduce" step).
# filename: summarize_map_reduce.py
# description: A small map_reduce summarizer. One call per chunk, plus
# one final reduce call. Parallel across chunks for speed.
import asyncio

from anthropic import AsyncAnthropic

# The async client is required: the per-chunk calls below are awaited
# concurrently via asyncio.gather.
client = AsyncAnthropic()

MAP_PROMPT = 'Summarize the following passage in 3 sentences:\n\n{chunk}'
REDUCE_PROMPT = (
    'Combine the following chunk summaries into a single coherent '
    'summary of the whole document:\n\n{summaries}'
)

async def summarize_chunk(chunk: str) -> str:
    reply = await client.messages.create(
        model='claude-haiku-4-5-20251001',
        max_tokens=300,
        messages=[{'role': 'user', 'content': MAP_PROMPT.format(chunk=chunk)}],
    )
    return reply.content[0].text

async def map_reduce_summary(chunks: list[str]) -> str:
    # Map step: all chunk summaries run concurrently.
    summaries = await asyncio.gather(*[summarize_chunk(c) for c in chunks])
    joined = '\n\n'.join(f'- {s}' for s in summaries)
    # Reduce step: one call on a larger model merges the chunk summaries.
    final = await client.messages.create(
        model='claude-sonnet-4-6',
        max_tokens=800,
        messages=[{'role': 'user', 'content': REDUCE_PROMPT.format(summaries=joined)}],
    )
    return final.content[0].text
The win is that all chunk calls run in parallel, so wall-clock latency stays low even for big documents. The trade-off is that each chunk is summarized in isolation, so concepts that span chunk boundaries get lost. The reduce step recovers some of that coherence but not all.
Cost-wise, you pay for N map calls plus one reduce call. With chunk-level summaries that are short, the reduce call is cheap. Total cost is usually dominated by the map step, which is why people use a small model (Haiku tier) for the map and a big one for the reduce.
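To see that cost shape concretely, here is a back-of-the-envelope estimator. All prices are hypothetical placeholders in dollars per million tokens, not real rates; plug in your provider's current pricing:

```python
# Back-of-the-envelope cost model for map_reduce:
# N map calls on a small model + 1 reduce call on a large model.
# Prices below are hypothetical placeholders, $ per 1M tokens.

def map_reduce_cost(
    n_chunks: int,
    tokens_per_chunk: int,
    summary_tokens: int = 150,                        # output per map call
    reduce_out: int = 800,                            # final summary length
    small_in: float = 1.0, small_out: float = 5.0,    # map model rates
    large_in: float = 3.0, large_out: float = 15.0,   # reduce model rates
) -> tuple[float, float]:
    """Return (map_cost, reduce_cost) in dollars."""
    map_cost = n_chunks * (
        tokens_per_chunk * small_in + summary_tokens * small_out
    ) / 1e6
    reduce_in = n_chunks * summary_tokens  # all chunk summaries as input
    reduce_cost = (reduce_in * large_in + reduce_out * large_out) / 1e6
    return map_cost, reduce_cost

m, r = map_reduce_cost(n_chunks=50, tokens_per_chunk=2_000)
print(f"map ${m:.4f} vs reduce ${r:.4f}")  # map $0.1375 vs reduce $0.0345
```

Even with the reduce step on a model three times as expensive, the map step dominates, which is exactly why it pays to run it on the cheapest model that summarizes acceptably.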
For a deeper walkthrough of how summarization fits into a real RAG pipeline, see the Agentic RAG Masterclass, which covers Map Reduce alongside reranking and quote extraction. The free RAG Fundamentals primer is the right starting point if you are still building your first retrieval pipeline.
When does refine beat map reduce?
refine wins when chunk order matters and you cannot afford to lose cross-chunk context. The pattern: summarize chunk 1, then feed the running summary plus chunk 2 into the next call ("refine the summary with this new information"), then chunk 3, and so on.
# filename: summarize_refine.py
# description: A refine summarizer. Sequential, one call per chunk,
# each call sees the running summary plus the next chunk.
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

async def refine_summary(chunks: list[str]) -> str:
    running = ''  # empty summary before the first chunk
    for chunk in chunks:
        prompt = (
            f'Existing summary so far:\n{running}\n\n'
            f'Refine it using this new passage:\n{chunk}'
        )
        reply = await client.messages.create(
            model='claude-sonnet-4-6',
            max_tokens=600,
            messages=[{'role': 'user', 'content': prompt}],
        )
        running = reply.content[0].text
    return running
Two important properties. First, it is sequential: chunk N depends on chunk N-1. You cannot parallelize. Latency scales linearly with document length. Second, the running summary preserves narrative coherence in a way Map Reduce cannot. If you are summarizing a story, an investigation, or a thread of reasoning, refine produces noticeably better output.
The trade-off is cost and time. N sequential calls means N times the latency of stuff and roughly N times the cost. For a 100-chunk document this is too slow and too expensive to run interactively. Refine is for batch jobs where coherence matters more than speed.
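The latency gap is easy to quantify. A sketch, assuming a flat 3 seconds per LLM call and a rate limit of 20 concurrent requests (both numbers are illustrative assumptions):

```python
# Wall-clock sketch: refine is strictly sequential, while map_reduce
# runs in parallel waves plus one reduce call. per_call is an assumed
# constant latency per LLM request.
import math

def refine_latency(n_chunks: int, per_call: float = 3.0) -> float:
    return n_chunks * per_call  # chunk N waits on chunk N-1

def map_reduce_latency(n_chunks: int, per_call: float = 3.0,
                       max_parallel: int = 20) -> float:
    waves = math.ceil(n_chunks / max_parallel)  # rate-limited batches
    return waves * per_call + per_call          # map waves + reduce call

print(refine_latency(100))      # 300.0 seconds: unusable interactively
print(map_reduce_latency(100))  # 18.0 seconds: 5 waves of 20, plus reduce
```

Under these assumptions a 100-chunk refine takes five minutes of wall clock where map_reduce takes under twenty seconds, which is why refine belongs in batch jobs.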
How do you pick the right chain type? A decision tree
Skip the axis-by-axis comparison matrix. The decision is shorter than people make it.
graph TD
Start[Document to summarize] --> Q1{Fits in context?}
Q1 -->|yes| Stuff[Use stuff. Done.]
Q1 -->|no| Q2{Order matters for coherence?}
Q2 -->|no, chunks independent| MR[Use map_reduce. Parallel, cheap, fast.]
Q2 -->|yes, narrative or thread| Ref[Use refine. Sequential, slow, coherent.]
style Stuff fill:#dcfce7,stroke:#15803d
style MR fill:#dbeafe,stroke:#1e40af
style Ref fill:#fef3c7,stroke:#b45309
3 questions, one answer:
- Does the document fit in the context window? If yes, use `stuff`. Stop reading.
- If not, does chunk order matter for the summary to make sense? If no, use `map_reduce`. It is parallel, cheap, and fast.
- If chunk order matters (a story, an investigation, a step-by-step argument), use `refine`. Accept the latency hit.
map_rerank is for question-answering, not summarization. Skip it for this use case.
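The whole tree collapses into a few lines of routing code. A sketch; the 4-characters-per-token estimate is a rough assumption, and `order_matters` is a flag you set per document type rather than something detectable automatically:

```python
# Decision-tree router for chain type selection.

def pick_chain(doc: str, context_window: int = 200_000,
               order_matters: bool = False, qa_task: bool = False) -> str:
    """Route a document to a chain type using the decision tree above."""
    if qa_task:
        return "map_rerank"  # question answering, not summarization
    est_tokens = len(doc) // 4  # rough heuristic, not a real tokenizer
    if est_tokens + 2_000 <= context_window:  # room for the summary
        return "stuff"
    return "refine" if order_matters else "map_reduce"

print(pick_chain("x" * 10_000))                         # stuff
print(pick_chain("x" * 2_000_000))                      # map_reduce
print(pick_chain("x" * 2_000_000, order_matters=True))  # refine
```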
In practice, 80 percent of production summarization should be stuff because long-context models exist now. The other 20 percent is map_reduce. refine is rare enough that I have only shipped it twice in 3 years.
What about RAG summarization specifically?
RAG summarization is the case where you retrieve N chunks for a query and want to summarize them into an answer. This is not the same as summarizing a single long document.
For RAG, the right pattern is almost always stuff plus quote extraction. Retrieve the top K chunks, run quote extraction to compress them (see Advanced RAG: Quote Extraction for Context Compression for the technique), then stuff the quotes into one summarization call. You get the coherence of stuff, the cost savings of quote extraction, and you avoid the boundary-loss problem of Map Reduce.
If your retrieved chunks are too large to fit even after extraction, fall back to map_reduce. But this case is rare and usually a sign your retriever is returning too many chunks, not that you need a fancier chain.
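The RAG routing logic is a two-step version of the same fit check. In this sketch, `extract_quotes` is a toy keyword filter standing in for the LLM quote-extraction call described above; everything here is illustrative, not a production implementation:

```python
# Toy RAG summarization router: compress retrieved chunks to quotes,
# then stuff if the quotes fit, else fall back to map_reduce.
# extract_quotes is a keyword filter standing in for an LLM
# quote-extraction call; real extraction returns verbatim sentences.

def extract_quotes(chunks: list[str], query: str) -> list[str]:
    terms = set(query.lower().split())
    return [
        sentence.strip()
        for chunk in chunks
        for sentence in chunk.split(".")
        if terms & set(sentence.lower().split())
    ]

def route_rag_summary(chunks: list[str], query: str,
                      context_window: int = 200_000) -> str:
    quotes = extract_quotes(chunks, query)
    est_tokens = sum(len(q) for q in quotes) // 4  # rough heuristic
    if est_tokens + 2_000 <= context_window:
        return "stuff"       # one call over the compressed quotes
    return "map_reduce"      # rare; usually means K is too large
```

After compression, the stuff branch fires almost every time; hitting the map_reduce branch is the signal to tune your retriever, not your chain.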
What to do Monday morning
- List every place in your codebase where you use `chain_type=` or `load_summarize_chain`. Most of them are probably `stuff` or `map_reduce` and were copied from a tutorial.
- For each, measure the actual document size you feed into it. If the document fits in your model's context window with room to spare, switch to `stuff` even if it currently uses `map_reduce`. You will save money and get better summaries.
- For the remaining cases that genuinely overflow context, ask yourself the order question. If chunks are independent, stay with `map_reduce`. If order matters, switch to `refine` and accept the latency.
- For RAG summarization, replace any per-chunk summarization with quote extraction plus `stuff`. This is almost always the right answer.
- Track per-chain cost in your observability stack. The wrong chain type is invisible until your bill arrives.
The headline: long-context models killed most of the reasons to use anything other than stuff. Use the simple thing until it actually fails.
Frequently asked questions
What is the difference between stuff, map_reduce, and refine in LangChain?
stuff sends every chunk in one LLM call. map_reduce summarizes each chunk independently in parallel and then combines the summaries. refine walks the chunks sequentially, updating a running summary with each new chunk. Stuff is cheapest and most coherent but limited by context length. Map Reduce is fast and parallel but can lose cross-chunk context. Refine preserves narrative order at the cost of latency and total tokens.
When should I use the stuff chain in LangChain?
Whenever the entire document fits in the model's context window with room for the summary itself. With Claude's 200K context or GPT-4-turbo's 128K, this covers most documents under 100 pages. The stuff chain is the cheapest, simplest, and most coherent option, so the rule is "use it until it breaks." Long-context models made stuff the default for almost all summarization in 2026.
When should I use map_reduce instead of stuff?
When the document is larger than the model's context window or when you have many independent documents to summarize at once. Map Reduce is parallel, so wall-clock latency stays low even for very large inputs. The trade-off is that concepts spanning chunk boundaries can get lost. Use it for corpus-wide summarization or for documents over the context limit.
When does refine beat map_reduce?
When chunk order matters for the summary to make sense. Refine processes chunks sequentially and feeds a running summary into each step, so it preserves narrative coherence in a way Map Reduce cannot. The trade-off is that refine cannot run in parallel and costs roughly N times as much as stuff. Use it for stories, investigations, or step-by-step arguments where order is the meaning.
What chain type should I use for RAG summarization?
Use stuff plus quote extraction. Retrieve the top chunks, extract only the verbatim sentences relevant to the query, then send those quotes through a single stuff-style summarization call. This combines the coherence of stuff with the token savings of compression and avoids the boundary problems of Map Reduce. Fall back to Map Reduce only if your compressed quotes still overflow the context window, which is rare.
Key takeaways
- `stuff` is the default for almost all summarization in 2026. Long-context models killed off the reasons to use anything fancier.
- `map_reduce` is for documents larger than the context window where chunk order does not matter. Parallel, cheap, fast, with some boundary loss.
- `refine` is for sequential, narrative documents where coherence beats latency. Rare in practice.
- `map_rerank` is for question answering, not summarization. Do not use it here.
- For RAG summarization specifically, combine quote extraction with `stuff`. It dominates the alternatives on cost, latency, and quality.
- To see chain selection wired into a working agentic RAG pipeline, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.
For the official chain type reference, see the LangChain summarization how-to guide. It documents every chain type with code, but stops short of telling you which one to pick. This post is the missing decision tree.
Continue Reading
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.