Persistent memory for coding agents: cross-session context
Your agent forgets everything between sessions and it's driving you crazy
You tell your agent "we use pytest, not unittest, and all our services live under src/services/." It nods, makes the fix, and 20 minutes later in a new session you ask for a bug fix and the agent writes unittest boilerplate and puts the new file in the wrong directory. You explain again. It forgets again. Every session is day 1.
This is the gap between a chatbot and an assistant. A chatbot forgets on purpose (because each conversation is independent). An assistant has to remember (because the user expects continuity). Most coding agent tutorials stop at "here is the loop" and skip the memory layer entirely, which is why shipped agents feel amnesic compared to human collaborators.
This post is the memory model I use for coding agents: 3 kinds of memory, the storage layer, the recall pattern that avoids context bloat, and the roughly 50 lines that turn "stateless loop" into "agent that remembers you."
Why can't you just stuff past conversations into the prompt?
Because context windows fill up fast and you will poison the current session with irrelevant detail. Naive "memory" implementations dump the entire conversation history into every new call, which costs money on every token and confuses the model when old context contradicts new instructions.
3 problems with the naive approach:
- Token cost. A user who has 50 past conversations sends the union of all of them on every new call. Even with 200k context, a verbose history will push into expensive territory.
- Relevance drift. Old conversations often contradict current state. The user changed their mind, the codebase evolved, the preferences shifted. Including old context makes the model reason about stale information.
- Signal dilution. Even if everything is still true, the important facts (preferred test framework, project layout) are buried in chatter about unrelated bugs.
The fix is to store memory as structured facts with metadata, not as raw transcripts. Recall only the facts that are relevant to the current task. Keep the conversation itself short.
```mermaid
graph TD
    Session1[Session 1: raw messages] --> Extract[Extraction step]
    Extract --> Store[Fact store]
    NewTask[New session starts] --> Retrieve[Retrieve relevant facts]
    Store --> Retrieve
    Retrieve --> Inject[Inject into system prompt]
    Inject --> Agent[Agent with memory]
    style Store fill:#dbeafe,stroke:#1e40af
    style Agent fill:#dcfce7,stroke:#15803d
```
2 separate steps: extracting facts from conversations and retrieving them in new sessions. The agent loop itself is unchanged.
What are the 3 types of memory a coding agent needs?
1. User facts
Things about the user that rarely change. "Prefers pytest over unittest." "Works primarily in Python." "Projects live under ~/code/work/." These are preferences and environment facts you can surface in any future session.
Stored as short declarative sentences with a category and the date they were written. Rewritten when they change. Deleted when the user says "stop doing that" more than twice.
2. Project facts
Things about the current project that change slowly. "Tests are in tests/, not test/." "Uses httpx for HTTP calls, not requests." "The main entry point is src/cli.py." These are codebase conventions the agent should respect on every task in that project.
Stored scoped to a project ID (usually the git repo root or the workspace path). Retrieved when the agent starts a new session in that project. Different projects have different conventions, so this memory has to be scoped or it leaks across workspaces.
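Scoping needs a stable project ID. A minimal sketch of deriving one, assuming the agent knows its working directory (the `project_id` helper name is mine): use the git repo root when one exists, and fall back to the resolved directory path otherwise.

```python
# Sketch: derive a stable project ID for scoping project facts.
# Assumption: the git repo root is the natural scope boundary;
# a non-git workspace falls back to its absolute path.
import subprocess
from pathlib import Path

def project_id(cwd: Path) -> str:
    try:
        # Ask git for the repo root; raises if git is missing
        # or cwd is not inside a repository.
        result = subprocess.run(
            ['git', 'rev-parse', '--show-toplevel'],
            cwd=cwd, capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return str(cwd.resolve())
```

Two worktrees of the same repo resolve to different IDs under this scheme, which is usually what you want: conventions can differ between a main checkout and an experimental branch.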
3. Session recap
A short summary of what just happened in the previous session. "We refactored auth middleware, added session timeout handling, and left test coverage at 74 percent." Retrieved when the user says "keep going" or "continue the work from last time." Not relevant to new tasks.
Stored as a single paragraph per session with a timestamp. Retrieved only on explicit continuity cues. Truncated to the last N sessions (say, the 5 most recent) to prevent unbounded growth.
How do you store and retrieve these memories?
A tiny SQLite table (or Postgres if you already have one) with five columns: id, type, scope, content, created_at. The type is one of user/project/session. The scope is the user ID for user facts, the project ID for project facts, and a user-plus-project key for session recaps, so the latest recap can be found at the start of the next session.
```python
# filename: memory_store.py
# description: A minimal memory store with 3 types and scoped recall.
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

SCHEMA = '''
CREATE TABLE IF NOT EXISTS memory (
    id INTEGER PRIMARY KEY,
    type TEXT NOT NULL,
    scope TEXT NOT NULL,
    content TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_memory_type_scope ON memory(type, scope);
'''

class MemoryStore:
    def __init__(self, path: Path):
        self.conn = sqlite3.connect(path)
        self.conn.executescript(SCHEMA)

    def add(self, type_: str, scope: str, content: str) -> int:
        cursor = self.conn.execute(
            'INSERT INTO memory (type, scope, content, created_at) VALUES (?, ?, ?, ?)',
            (type_, scope, content, datetime.now(timezone.utc).isoformat()),
        )
        self.conn.commit()
        return cursor.lastrowid

    def recall(self, type_: str, scope: str, limit: int = 20) -> list[str]:
        rows = self.conn.execute(
            'SELECT content FROM memory WHERE type = ? AND scope = ? '
            'ORDER BY created_at DESC LIMIT ?',
            (type_, scope, limit),
        ).fetchall()
        return [row[0] for row in rows]

    def remove(self, memory_id: int) -> None:
        self.conn.execute('DELETE FROM memory WHERE id = ?', (memory_id,))
        self.conn.commit()
```
SQLite is enough for 99 percent of agent memory workloads. A single-user agent stores a few hundred facts total. A multi-user service scales to tens of thousands per user. Neither hits SQLite's limits.
How do you extract facts from a completed session?
An LLM call at the end of each session. Give the model the conversation and a short prompt asking it to extract any durable facts worth remembering. Save the output as individual memories.
```python
# filename: extract_facts.py
# description: Run after a session completes to extract durable memories.
import json
from anthropic import Anthropic
from memory_store import MemoryStore

client = Anthropic()

# Literal braces in the JSON example are doubled so str.format
# does not treat them as placeholders.
EXTRACT_PROMPT = '''Read this conversation and extract any durable facts
worth remembering for future sessions. Categorize each fact as:
- user: preferences or environment that rarely change
- project: codebase conventions for this specific project
- session: one-paragraph recap of what we accomplished
Output JSON only: {{"user": ["..."], "project": ["..."], "session": "..."}}.
Skip things that are one-off, ephemeral, or already obvious from the code.
Conversation:
{transcript}'''

def extract_and_save(
    store: MemoryStore, user_id: str, project_id: str, transcript: str,
) -> dict:
    reply = client.messages.create(
        model='claude-haiku-4-5-20251001',
        max_tokens=800,
        messages=[{'role': 'user', 'content': EXTRACT_PROMPT.format(transcript=transcript)}],
    )
    facts = json.loads(reply.content[0].text)
    for fact in facts.get('user', []):
        store.add('user', user_id, fact)
    for fact in facts.get('project', []):
        store.add('project', project_id, fact)
    if facts.get('session'):
        # Scope recaps to user+project so recall at session start finds them.
        store.add('session', f'user:{user_id}:project:{project_id}', facts['session'])
    return facts
```
Haiku is the right model for extraction. It is cheap, fast, and the task is simple enough that flagship reasoning is overkill. The prompt's "skip things that are one-off" clause is doing a lot of work; without it, the extractor dumps every line of the conversation.
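One failure mode worth guarding against: even with "Output JSON only" in the prompt, models occasionally wrap their reply in a markdown fence, which makes a bare `json.loads` throw. A small defensive parser (a sketch; the helper name is mine):

```python
# Sketch: defensively parse the extractor's reply. Models sometimes
# wrap JSON in a markdown fence; strip it before parsing.
import json

def parse_extractor_reply(text: str) -> dict:
    cleaned = text.strip()
    if cleaned.startswith('```'):
        # Drop the opening fence line (with optional language tag)
        # and everything from the closing fence onward.
        cleaned = cleaned.split('\n', 1)[1]
        cleaned = cleaned.rsplit('```', 1)[0]
    return json.loads(cleaned)
```

If parsing still fails, the safest behavior is to skip saving memories for that session rather than retry in a loop: a missed extraction costs little, since durable facts tend to resurface in later sessions.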
How do you recall memories into a new session?
At the start of a session, read relevant memories and inject them into the system prompt. User facts always go in. Project facts go in when a project ID is known. Session recap goes in only on explicit continuity cues from the user.
```python
# filename: recall.py
# description: Build the memory section of the system prompt at session start.
from memory_store import MemoryStore

def build_memory_context(
    store: MemoryStore, user_id: str, project_id: str | None, continuity: bool,
) -> str:
    parts = []
    user_facts = store.recall('user', user_id, limit=10)
    if user_facts:
        parts.append('User preferences:\n' + '\n'.join(f'- {f}' for f in user_facts))
    if project_id:
        project_facts = store.recall('project', project_id, limit=10)
        if project_facts:
            parts.append('Project conventions:\n' + '\n'.join(f'- {f}' for f in project_facts))
    if continuity and project_id:
        recaps = store.recall('session', f'user:{user_id}:project:{project_id}', limit=1)
        if recaps:
            parts.append(f'Last session recap:\n{recaps[0]}')
    return '\n\n'.join(parts)
```
The system prompt then includes this memory block before the task-specific instructions. The model sees durable facts as part of its background, not as a recent user message, which matches the way humans treat long-term knowledge.
For the broader system prompt and planning pattern that memory layers on top of, see the Build a Coding Agent with Claude: A Step-by-Step Guide post. For the event loop that memory-aware agents plug into, see The Event Loop Inside a Coding Agent.
How do you prevent memory from growing out of control?
3 mechanisms:
- Deduplication. Before adding a new fact, check if a semantically similar one already exists. Either skip or replace. Avoid storing "user prefers pytest" 50 times over 50 sessions.
- Expiration. Project facts older than 90 days should be re-verified or dropped. Codebases evolve; stale facts become wrong. Session recaps older than 30 days should be deleted; nobody asks to continue work from 2 months ago.
- User control. A `/forget` command or explicit "stop remembering that" detection. If a user tells the agent to forget something twice in a row, delete the matching memory and never re-add it in the same shape.
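The first 2 mechanisms can be sketched in a few lines. "Semantically similar" is approximated here with token overlap; a production system might use embeddings instead. The function names and the 0.8 threshold are my assumptions, and `prune_old` assumes the memory table from this post:

```python
# Sketch: naive deduplication and age-based expiration for the
# memory table. Threshold and names are illustrative assumptions.
import sqlite3
from datetime import datetime, timedelta, timezone

def _tokens(text: str) -> set[str]:
    return {w.strip('.,').lower() for w in text.split()}

def is_duplicate(new_fact: str, existing: list[str], threshold: float = 0.8) -> bool:
    # Jaccard overlap on normalized tokens: crude, but catches
    # "user prefers pytest" resurfacing session after session.
    new_t = _tokens(new_fact)
    for fact in existing:
        old_t = _tokens(fact)
        if len(new_t & old_t) / max(len(new_t | old_t), 1) >= threshold:
            return True
    return False

def prune_old(conn: sqlite3.Connection, type_: str, max_age_days: int) -> int:
    # Drop facts of one type older than the cutoff (e.g. 90 days
    # for project facts, 30 for session recaps, per this post).
    cutoff = (datetime.now(timezone.utc) - timedelta(days=max_age_days)).isoformat()
    cursor = conn.execute(
        'DELETE FROM memory WHERE type = ? AND created_at < ?', (type_, cutoff),
    )
    conn.commit()
    return cursor.rowcount
```

Call `is_duplicate` with the result of `recall` for the same type and scope before every `add`; run `prune_old` once at session start, where it adds one cheap DELETE per memory type.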
The auto memory skill pattern in Claude Code uses exactly this 3-part model. If you are using Claude Code already, you can see a reference implementation there.
What to do Monday morning
- Create a `memory.db` SQLite file for your agent. Add the five-column schema from this post. 10 lines of setup code.
- At session start, recall user facts and project facts and inject them into the system prompt. Confirm the memories show up in the first model call.
- At session end, run the extraction prompt on the transcript. Save the 3 fact types. Start with Haiku or another cheap model for the extractor.
- Add a `/forget <substring>` command that deletes any memory matching the substring. The user will appreciate the escape hatch.
- Measure: after a week, count how many memories exist per user. If the number is over 100, your extraction prompt is too eager. Tune the "skip one-off things" clause.
The headline: persistent memory for coding agents is a SQLite file, one LLM call for extraction at session end, a recall query at session start, and a scope rule. Roughly 50 lines total. The difference between an amnesic tool and an assistant that knows you.
Frequently asked questions
What is persistent memory in a coding agent?
It is the ability to remember facts across conversations. Instead of starting every session as a blank slate, the agent stores durable facts (user preferences, project conventions, session recaps) in a database and retrieves them at the start of each new session. The memory layer sits above the agent loop and does not change the loop itself.
What kinds of memory should a coding agent keep?
3 types: user facts (preferences that rarely change), project facts (codebase conventions scoped to a project), and session recaps (short summaries of what happened). Each has a different retrieval policy. User facts always go in, project facts are scoped to the current project, session recaps go in only when the user signals continuity.
How do you extract memories from a session?
Run an LLM call at the end of the session with the transcript and a prompt asking for durable facts categorized by type. Use a cheap model like Haiku for extraction because the task is simple. Save the extracted facts to the store. The extraction prompt should explicitly say "skip one-off details" or the extractor will dump everything.
Where should agent memories be stored?
SQLite for single-user agents and small multi-user deployments. Postgres when you already run it for other reasons. A vector database only if you need semantic recall at scale. The storage backend rarely matters until you have tens of thousands of memories per user, which is far beyond most agents.
How do you prevent memory from becoming context bloat?
By storing structured facts instead of raw transcripts, capping the number of facts recalled per session (10 per type is a reasonable default), and deduplicating or expiring stale facts. Inject memories into the system prompt, not the user message, so the model treats them as background knowledge instead of recent instructions.
Key takeaways
- A stateless agent loop is fine for chatbots and wrong for assistants. Coding agents need persistent memory to respect user preferences and project conventions across sessions.
- Store memory as structured facts with type and scope, not raw transcripts. Transcripts cost tokens and poison the context with stale detail.
- 3 memory types cover most needs: user facts, project facts, and session recaps. Each has its own retrieval policy.
- Extract facts at session end with a cheap LLM call. Recall them at session start and inject into the system prompt, not the message history.
- Prevent bloat with deduplication, expiration, and a `/forget` escape hatch. Users will thank you for the control.
- To see memory wired into a full coding agent with the loop, tool registry, and safety rails, walk through the Build Your Own Coding Agent course, or start with the AI Agents Fundamentals primer.
For a reference implementation of agent memory with scoped types, see the MemGPT paper and codebase. The typed-memory hierarchy in this post borrows from the main-context/archival distinction introduced there.