Modular architectures for agentic AI: the 4-layer pattern
Your agent codebase is a 900-line main.py and nobody wants to touch it
You started with one file. Prompt, tool registry, agent loop, FastAPI routes, database session, LLM client, all in main.py. It shipped. A month later it is 900 lines. A second engineer joins, tries to add a tool, breaks the retry loop, and quietly stops contributing. Every change is a merge conflict. Every bug lives in a nested if-branch three screens deep.
This is the default failure mode of every agentic AI project. The code works, the product works, but the codebase refuses to grow. The fix is not a framework. The fix is 4 modules with hard boundaries between them, and the discipline to keep each module from knowing things it does not need to know.
This post is the 4-layer pattern I ship on every agent project: the cut lines, what belongs in each layer, the dependency rules, and the refactor that turns a 900-line file into something a team can work on without stepping on each other.
Why does the single-file approach stop working past a certain size?
Because every new feature touches every other feature. A single-file agent has no natural boundaries, so every change reads like a diff across the whole system. Tests cannot be scoped to one concept. Mocking is painful because everything imports everything. New engineers read the whole file before they can touch anything.
3 specific failure modes:
- Merge conflicts on every pull request, because every change lands in main.py.
- Test paralysis. You cannot test the agent loop without also instantiating the database and the LLM client. Unit tests become integration tests become slow.
- Feature drift. Because there are no boundaries, shortcuts accumulate. A tool handler imports the database directly. The prompt template pulls from the response serializer. After a month the dependency graph is a spider web.
Modularization fixes all 3 by drawing hard lines between layers and making each layer only depend on the layer below it. Same code, same features, different shape.
```mermaid
graph TD
    Routes[Layer 1: HTTP routes<br/>FastAPI endpoints]
    Services[Layer 2: Services<br/>Agent loop, business logic]
    Domain[Layer 3: Domain<br/>Tools, prompts, schemas]
    Infra[Layer 4: Infrastructure<br/>DB, LLM client, cache]
    Routes --> Services
    Services --> Domain
    Services --> Infra
    style Routes fill:#dbeafe,stroke:#1e40af
    style Services fill:#dcfce7,stroke:#15803d
    style Domain fill:#fef3c7,stroke:#b45309
    style Infra fill:#e5e7eb,stroke:#374151
```
One direction of dependency: top to bottom, never sideways, never upward. That rule is the whole game.
What goes in each of the 4 layers?
Layer 1: Routes (app/routes/)
HTTP endpoints only. FastAPI route decorators, request and response models, auth middleware, rate limit middleware. Nothing else. A route handler reads the request, calls one service function, and returns the response. If a route has business logic, it is in the wrong layer.
```python
# filename: app/routes/chat.py
# description: Chat route delegates to the service layer. No agent logic here.
from fastapi import APIRouter, Depends

from app.auth import get_auth
from app.schemas import ChatRequest, ChatResponse
from app.services.chat import run_chat_turn

router = APIRouter()

@router.post('/chat', response_model=ChatResponse)
async def chat(body: ChatRequest, auth=Depends(get_auth)) -> ChatResponse:
    result = await run_chat_turn(auth.user_id, body.message, body.session_id)
    return ChatResponse(answer=result.answer, session_id=result.session_id)
```
Layer 2: Services (app/services/)
The business logic. The agent loop lives here. Session management, retry logic, tool dispatch, response assembly. Services call into Domain and Infrastructure but never into Routes.
Layer 3: Domain (app/domain/)
Pure functions and types. Prompt templates, Pydantic schemas, tool definitions, content filters. No I/O. No database, no LLM client, no HTTP. Domain is the testable core that has zero external dependencies.
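To make "pure functions and types" concrete, here is a minimal sketch of what a Domain tool module might look like. The tool names (`word_count`, `to_upper`) and the `dispatch` helper are hypothetical, chosen only to show the shape: plain callables in a registry, no clients, no I/O.

```python
# filename: app/domain/tools.py  (illustrative sketch; tool names are hypothetical)
# description: Pure tool definitions. No database, no LLM client, no HTTP.

def word_count(text: str) -> int:
    # A pure tool: deterministic output, trivially unit-testable
    return len(text.split())

def to_upper(text: str) -> str:
    return text.upper()

# The registry maps tool names to callables; the service layer looks tools up here
TOOLS = {"word_count": word_count, "to_upper": to_upper}

def dispatch(name: str, arguments: dict):
    # Pure dispatch: decides nothing about when or why a tool runs,
    # only how a name plus arguments maps to a call
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)
```

Because nothing here touches the outside world, these functions can be imported and tested with zero fixtures.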
Layer 4: Infrastructure (app/infra/)
Everything that touches the outside world. Database session factories, LLM client wrappers, Redis, cache layers, observability setup. Infrastructure is instantiated once at startup and injected into services via dependency injection.
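A sketch of what "instantiated once at startup and injected" can look like. The `Infra` container, the `LLMClient` wrapper, and `build_infra` are hypothetical names; the point is that services receive these objects rather than constructing clients themselves.

```python
# filename: app/infra/container.py  (hypothetical sketch)
# description: Infrastructure built once at startup and handed to services,
# instead of living as module-level globals scattered through the codebase.
from dataclasses import dataclass, field

@dataclass
class LLMClient:
    # Thin wrapper around a real vendor SDK; the rest of the app never sees the SDK
    model: str = "gpt-4o"  # assumed model name, purely illustrative

    async def complete(self, messages, system=None) -> str:
        raise NotImplementedError("wired to the vendor SDK in production")

@dataclass
class Infra:
    llm: LLMClient
    sessions: dict = field(default_factory=dict)  # stand-in for a DB or Redis

def build_infra() -> Infra:
    # Called once, e.g. from the FastAPI lifespan block; services get `Infra`
    # via dependency injection instead of importing clients directly.
    return Infra(llm=LLMClient())
```

Swapping `Infra` for a version full of fakes is then a one-line change in tests.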
For the broader production FastAPI stack that this layering sits inside, see the FastAPI and Uvicorn for production agentic AI systems post.
How do you enforce the dependency rule?
By convention and by import checks. The rule: each layer imports only from layers below it. Routes can import from Services, Domain, and Infra. Services can import from Domain and Infra. Domain imports only from the standard library (plus Pydantic for schemas). Infra imports nothing from the rest of the app.
```python
# filename: app/services/chat.py
# description: The chat service orchestrates the agent loop.
# Imports from domain and infra only.
from app.domain.prompts import SYSTEM_PROMPT
from app.domain.tools import dispatch
from app.infra.db import load_history
from app.infra.llm import call_llm

async def run_chat_turn(user_id: str, message: str, session_id: str):
    messages = await load_history(session_id)
    messages.append({'role': 'user', 'content': message})
    # agent loop continues: call the LLM, dispatch tool calls via dispatch(),
    # repeat until the model returns a final answer
    reply = await call_llm(messages, system=SYSTEM_PROMPT)
    return reply
```
Enforce the rule with a linter (import-linter Python package) or a pre-commit hook that greps for forbidden import patterns. Once the boundaries are set, CI catches any regression automatically.
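For the import-linter route, a `layers` contract in pyproject.toml might look like the following. This is a sketch assuming the `app.routes` / `app.services` / `app.domain` / `app.infra` layout from this post; the contract names are arbitrary.

```toml
# pyproject.toml — import-linter contracts (sketch, assuming the app/ layout above)
[tool.importlinter]
root_packages = ["app"]

# A layers contract: higher layers may import lower ones, never the reverse
[[tool.importlinter.contracts]]
name = "4-layer dependency rule"
type = "layers"
layers = [
    "routes",
    "services",
    "domain",
    "infra",
]
containers = ["app"]

# A layers contract alone still permits downward imports, so keeping Domain
# free of Infra needs an explicit forbidden contract
[[tool.importlinter.contracts]]
name = "Domain stays pure"
type = "forbidden"
source_modules = ["app.domain"]
forbidden_modules = ["app.infra"]
```

Running `lint-imports` in CI then fails the build on any violation.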
For the tool registry pattern that belongs in Layer 3 (Domain), see the Designing modular tool integrations for coding agents post.
What does the refactor look like in practice?
Extract one layer at a time, bottom up, then lock the boundaries in at the end. A typical week-long refactor:

- Day 1: Extract Infrastructure. Move the LLM client, database session, Redis, and cache into app/infra/. Add a dependency injection container or a `lifespan` block that instantiates each at startup. Don't change any business logic yet.
- Day 2: Extract Domain. Move prompts, Pydantic schemas, and tool definitions into app/domain/. Strip any I/O that snuck in. Domain becomes a set of pure modules you can import in tests without touching a database.
- Day 3: Extract Services. Pull the agent loop, session management, and retry logic into app/services/. Services are now the only layer that knows how features compose.
- Day 4: Thin the routes. Each route handler becomes 5-10 lines that parse input, call a service, and format the response.
- Day 5: Add the import-linter config. Freeze the architecture so future changes cannot break the rule.
Post-refactor, you measure wins in merge conflicts avoided, test run time, and how many files a new feature touches. A clean 4-layer agent typically needs 2-4 files per feature, not 1 giant main.py.
How do you test each layer independently?
Each layer has its own test strategy because each layer has its own shape.
- Domain tests: pure unit tests. No fixtures, no mocks, no network. Prompt templates, schema validators, and pure tool functions are trivial to test.
- Service tests: use fakes for Infrastructure. Inject a `FakeLLMClient` that returns canned responses and a `FakeDB` that stores sessions in memory. The agent loop is then testable without a network or a database.
- Route tests: use FastAPI's `TestClient`. Stub the service layer. Verify request parsing, auth, and response shape.
- Infrastructure tests: integration tests that hit real dependencies. Slow, run on a smaller cadence, isolated in a separate CI job.
This split lets you run 95 percent of tests in under 2 seconds, which is the threshold for tests being run on every save instead of every commit.
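A minimal sketch of the service-test style described above. `FakeLLMClient`, `FakeDB`, and the inlined `run_chat_turn` stand-in are all hypothetical; the point is that the loop under test receives its infrastructure as arguments, so the test needs no network and no database.

```python
# filename: tests/test_chat_service.py  (illustrative sketch)
# description: Service-layer test against fakes. No network, no database.
import asyncio

class FakeLLMClient:
    # Returns canned replies in order and records what it was asked
    def __init__(self, replies):
        self.replies = list(replies)
        self.calls = []

    async def complete(self, messages, system=None):
        self.calls.append(list(messages))
        return self.replies.pop(0)

class FakeDB:
    # In-memory session store standing in for the real database
    def __init__(self):
        self.histories = {}

    async def load_history(self, session_id):
        return list(self.histories.get(session_id, []))

async def run_chat_turn(llm, db, session_id, message):
    # Minimal stand-in for the real service under test: infrastructure
    # arrives as parameters, which is what makes the fakes injectable
    messages = await db.load_history(session_id)
    messages.append({"role": "user", "content": message})
    return await llm.complete(messages)

def test_chat_turn_sends_user_message():
    llm = FakeLLMClient(replies=["hello!"])
    db = FakeDB()
    reply = asyncio.run(run_chat_turn(llm, db, "s1", "hi"))
    assert reply == "hello!"
    assert llm.calls[0][-1] == {"role": "user", "content": "hi"}
```

Tests in this shape run in milliseconds, which is what keeps the fast 95 percent fast.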
When is the 4-layer pattern overkill?
For truly tiny agents (one tool, one route, under 300 lines), a single file is fine. The pattern earns its keep when the codebase crosses roughly 800 lines or when a second engineer joins. Below that, the overhead of 4 modules is larger than the benefit.
The rule I use: refactor when you hit your second merge conflict or your second "I'm scared to touch this file" moment, whichever comes first.
For the full production stack picture that the 4-layer pattern fits into, the Build your own coding agent course walks through it module by module. The free AI Agents Fundamentals primer is the right starting point if the agent loop concept is still new.
What to do Monday morning
- Open your biggest agent file. If it is over 800 lines, start the refactor. Below that, note the layers mentally and wait.
- Extract Infrastructure first. LLM client, database, cache, observability. Move each into app/infra/{module}.py and instantiate in the FastAPI `lifespan` block.
- Extract Domain next. Prompts, schemas, tool definitions. These should have zero external imports beyond Pydantic and standard library.
- Extract Services third. The agent loop, session management, and anything that composes features. This is your new business-logic core.
- Thin the routes last. Each handler should be under 10 lines: parse input, call a service, format the response.
- Add import-linter or a pre-commit hook that enforces the dependency rule. Without enforcement, the architecture drifts back within a month.
The headline: 4 layers, one dependency direction, one file per concept. The refactor is boring by design. What you get back is a codebase a team can grow.
Frequently asked questions
What is a modular architecture for agentic AI?
A modular architecture splits an agentic AI codebase into layers with hard boundaries and a strict dependency direction. The 4-layer version uses Routes, Services, Domain, and Infrastructure. Each layer only imports from layers below it, which keeps the dependency graph acyclic and lets each layer be tested and changed in isolation. The goal is a codebase that survives team growth and feature additions.
How do you decide when to refactor from a single file to modules?
Refactor when the single file crosses roughly 800 lines or when a second engineer cannot confidently change it without breaking something else. Below that, the overhead of 4 modules is larger than the benefit. The trigger is not really the line count, it is the second merge conflict or the second "I am scared to touch this" moment.
Which layer does the agent loop belong to?
The Services layer. The agent loop orchestrates prompts (Domain), tool dispatch (Domain), and LLM calls (Infrastructure). It is business logic, not infrastructure, so it stays in Services. Keeping the loop in Services means you can test it with fake tools and a fake LLM client, which is the single biggest testing win of the 4-layer pattern.
How do you prevent the dependency rule from being violated?
Use a linter like import-linter with a contract that enforces the layer hierarchy. Add it to your pre-commit hooks or CI pipeline. Without automated enforcement, architectural rules drift within a month because every engineer takes one small shortcut that adds up. The linter makes violations fail the build, which is the only enforcement that actually works.
What is the right test strategy for each layer?
Domain gets pure unit tests with no mocks. Services get tests with fake infrastructure (FakeLLMClient, FakeDB). Routes get FastAPI TestClient tests with the service layer stubbed. Infrastructure gets integration tests that hit real dependencies on a slower cadence. This split keeps 95 percent of tests fast and the remaining 5 percent as a checkpoint before deploy.
Key takeaways
- Single-file agents stop scaling past ~800 lines because every change touches every other change. The fix is not a framework, it is 4 modules with hard boundaries.
- The 4 layers are Routes, Services, Domain, Infrastructure. Each imports only from layers below. One direction of dependency, never sideways or upward.
- Domain is the testable core with zero external dependencies. Prompts, schemas, and pure tool functions live here.
- Services own the agent loop and business logic. They are testable with fake infrastructure and are where feature composition happens.
- Enforce the dependency rule with import-linter in CI. Without enforcement, the architecture drifts within a month.
- To see this layering wired into a full production agent stack with auth, tools, and observability, walk through the Build your own coding agent course, or start with the AI Agents Fundamentals primer.
For the original Clean Architecture framing that this layering borrows from, see Robert Martin's Clean Architecture book summary. The dependency rule in this post is the same one Uncle Bob drew in 2012.
Ready to go deeper?
Go beyond articles. Build production AI systems with hands-on workshops and our intensive AI Bootcamp.