Custom tools for coding agents: beyond the built-ins

Your agent keeps reimplementing the same helper, poorly

Your agent has read_file, edit_file, run_bash. You ask it to "find all callers of validate_token and add logging." The agent opens 5 files, greps each one manually with run_bash, parses the output, decides which ones match, and edits them one by one. It takes 25 tool calls. 3 of them are wrong. The whole task took 90 seconds.

If you had given the agent a find_callers tool, the same task would be 4 tool calls and 15 seconds. The tool encapsulates a repeatable subroutine the agent would otherwise have to reinvent in every conversation. This is what custom tools are for. Not replacing the basics, but adding higher-level operations the model cannot improvise as cheaply.

This post is the design rule for custom tools, the schema patterns that make them usable by models, and 3 specific custom tools I ship in every coding agent because they pay for themselves on the first task.

Why are the built-in tools not enough?

Because built-in tools are primitives. Read, edit, and bash cover the mechanics but they force the model to chain many low-level operations for tasks a single well-designed tool could handle in one call.

3 failure modes of primitive-only agents:

Reimplementation cost. Every conversation starts from scratch. The model has no memory of how it greps for callers, so it grep-parses its own output fresh every time. Expensive and lossy.
Context pressure. Each low-level tool call adds to the message history. 25 tool calls is 25 assistant messages and 25 tool results. The agent hits context limits much faster than it would with 4 well-chosen custom tool calls.
Error accumulation. Every manual step has a small error rate. Chain 25 of them and the probability at least one fails is high. Chain 4 and it is small. Fewer tool calls means fewer places to be wrong.

Custom tools collapse multi-step routines into single calls with precise semantics. The trade-off is the engineering cost to build them and the risk that the model does not know when to use them. Both are manageable if you follow a few rules.

graph TD
    Task[Find callers and add logging] --> P[Primitive-only agent]
    P -->|25 tool calls| Many[Multiple errors, long history]

    Task --> C[With find_callers tool]
    C -->|1 call| List[List of call sites]
    List -->|4 edit_file calls| Done[Done in 4 total calls]

    style Many fill:#fee2e2,stroke:#b91c1c
    style Done fill:#dcfce7,stroke:#15803d

The custom tool does not replace edit. It replaces the search-parse-interpret chain that would otherwise happen before edit.

What makes a custom tool worth building?

5 criteria, all of which should be true before you build a new tool:

The operation happens in 20 percent or more of tasks. Tools for rare operations waste tool-description budget and confuse the model.
The operation requires 3 or more primitive calls to replicate. Below that, the primitive chain is fine and the custom tool is redundant.
The operation has deterministic correct output. Tools whose correctness depends on judgment are better as prompts, not tools.
The operation can return structured data, not prose. Structured output is easier for the model to consume and reduces error rates downstream.
The operation does not overlap with an existing tool by more than 50 percent. If it overlaps, merge or specialize.

Apply all 5 and you end up with a small set of high-use custom tools instead of a sprawling toolbox nobody can remember.

What are 3 custom tools worth shipping in every coding agent?

1. `find_callers(function_name: str)`

Given a function name, return every call site in the codebase with file, line, and 3 lines of context. Uses rg or the language's own indexing under the hood. Solves the "how is this function used" question in one tool call instead of a 5-step search-and-parse chain.

# filename: find_callers.py
# description: Custom tool that wraps ripgrep with context and returns
# structured call site data for the model.
import subprocess
import json


def find_callers(function_name: str, root: str = '.') -> dict:
    pattern = f'\\b{function_name}\\s*\\('
    cmd = [
        'rg', '--json', '--max-count', '5', '-C', '2',
        '--type-add', 'code:*.{py,ts,js,go,rs,java}',
        '--type', 'code',
        pattern, root,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=15)
    callers = []
    for line in result.stdout.splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if event.get('type') == 'match':
            data = event['data']
            callers.append({
                'path': data['path']['text'],
                'line': data['line_number'],
                'context': data['lines']['text'].rstrip('\n'),
            })
    return {'function': function_name, 'callers': callers, 'count': len(callers)}

The tool does the grep, parses the output, and returns a list the model can iterate over without re-parsing.

2. `run_tests(scope: str = "all")`

Run the project's test suite and return a structured pass/fail summary, not raw output. The model does not need to see 5000 lines of pytest output; it needs to know which tests failed and what the first line of each failure said.

# filename: run_tests.py
# description: Structured test runner. Returns pass/fail counts and the
# top failure from each failing test.
import re
import subprocess


def run_tests(scope: str = 'all') -> dict:
    target = '' if scope == 'all' else scope
    cmd = ['pytest', '--tb=line', '-q']
    if target:
        cmd.append(target)

    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=300,
    )

    passed = re.search(r'(\d+) passed', result.stdout)
    failed = re.search(r'(\d+) failed', result.stdout)
    failures = re.findall(r'FAILED (.+?) - (.+)', result.stdout)

    return {
        'ok': result.returncode == 0,
        'passed': int(passed.group(1)) if passed else 0,
        'failed': int(failed.group(1)) if failed else 0,
        'failures': [{'test': t, 'error': e} for t, e in failures[:5]],
    }

The trimmed output is a fraction of the raw pytest dump and contains exactly what the model needs to decide its next edit.

3. `project_tree(depth: int = 3)`

Return a concise tree of the project root, respecting .gitignore, up to a given depth. Solves "what files exist in this project" without the agent having to ls its way down manually.

# filename: project_tree.py
# description: A concise project tree for model orientation.
# Respects .gitignore, caps depth and result count.
import subprocess


def project_tree(depth: int = 3) -> dict:
    cmd = [
        'rg', '--files', '--hidden', '--glob', '!.git',
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    files = result.stdout.splitlines()
    filtered = [f for f in files if f.count('/') <= depth]
    return {'files': filtered[:200], 'total': len(files), 'shown': len(filtered[:200])}

Uses rg's file listing mode (--files) which respects gitignore for free. Caps the output so a huge monorepo does not overwhelm the model.

How do you name custom tools for best model accuracy?

Verb-object, specific, and non-overlapping with built-in tools. find_callers is better than search because the model can differentiate it from code_search. run_tests is better than test because it is a verb, not an ambiguous noun.

3 naming anti-patterns to avoid:

Names that are verbs used by other tools. get_file next to read_file is confusing. Pick one.
Names that are nouns. callers sounds like a data structure, not an action. Use find_callers.
Names that abbreviate. fc and rt save 10 characters of tool description and cost the model significant accuracy. Spell it out.

For the full registry and dispatch pattern that makes adding custom tools painless, see Designing Modular Tool Integrations for Coding Agents. For the event loop that these tools plug into, see The Event Loop Inside a Coding Agent.

How do you teach the model when to use custom tools?

Through the tool description, not the system prompt. System prompts are read once; tool descriptions are injected into every call. Putting the "when to use" hint in the description is 100x more reliable than putting it in the system prompt.

A good description has 4 parts: what the tool does, when to use it, when NOT to use it, and 1 example invocation. The "when not to" line is the one most teams skip and it is the highest-use part:

# filename: tool_descriptions.py
# description: Tool descriptions that include when-to-use and when-not-to-use
# hints to steer the model away from misuse.
FIND_CALLERS_DESC = (
    'Find every call site of a function across the codebase and return '
    'structured results with file, line, and context. Use this instead of '
    'grep when the goal is to understand how a function is used. '
    'Do NOT use for general text search (use code_search for that). '
    'Example: find_callers(function_name="validate_token").'
)

RUN_TESTS_DESC = (
    'Run the project test suite and return a structured pass/fail summary. '
    'Use after any edit_file change that touches logic. '
    'Do NOT use to run a single arbitrary command (use run_bash). '
    'Example: run_tests(scope="tests/test_auth.py").'
)

The "do NOT" clause is what prevents the model from defaulting to the new tool for every search. Models follow negation better than ambiguous guidance.

When does a custom tool stop paying off?

When the operation is called fewer than once per 5 sessions. If you ship a tool and a week later the traces show it was called 3 times in 100 tasks, kill it. Every tool that sits in the description eats context tokens on every call and adds cognitive load for the model. Unused tools are pure cost.

The rule: measure call frequency per tool over a week. Anything under 5 percent of task sessions is a candidate for deletion. Exceptions: tools that are rarely called but critical (like approve_production_deploy) can stay even at low frequency because of their importance.

For the full build of a custom-tool-rich coding agent with the registry, event loop, and safety rails, walk through the Build Your Own Coding Agent course. For the conceptual starting point, the free AI Agents Fundamentals primer is the right on-ramp.

What to do Monday morning

Open a week of agent logs. Find 3 repeated tool-call chains (grep-then-edit, ls-then-cat, run-bash-then-grep-output). Each one is a candidate for a custom tool.
Build the find_callers tool first. It is the single highest-use custom tool for coding agents because call-site lookups happen in nearly every refactor task.
Build run_tests second. A structured test runner turns a 5000-line output into 10 lines and saves enormous context.
Write tool descriptions with "when to use" and "when NOT to use" clauses. The negation trains the model to avoid overuse.
After a week of traces, delete any custom tool called in fewer than 5 percent of sessions. Unused tools are pure tax on every call.

The headline: custom tools are not replacements for primitives; they are accelerators for repeated subroutines. 3 well-chosen tools turn a 25-call task into a 4-call task. Build them, measure their use, and delete the ones that do not earn their keep.

Frequently asked questions

When should I add a custom tool to a coding agent?

When an operation happens in 20 percent or more of tasks and takes 3 or more primitive calls to replicate. Below either threshold, the primitive chain is fine. Above both, a custom tool collapses many low-level calls into one, reduces error accumulation, and leaves more room in the context window for the actual task.

What custom tools are worth building in a coding agent?

3 pay off in almost every coding agent: find_callers for call-site lookups, run_tests for structured test runs, and project_tree for codebase orientation. Each replaces a multi-call chain with one call and returns structured data the model can consume without re-parsing. Other domain-specific tools depend on the codebase.

How do I name custom tools for best model accuracy?

Use verb-object names that are specific and non-overlapping with built-in tools. find_callers beats search, run_tests beats test. Avoid abbreviations, bare nouns, and verbs that collide with existing tools. A 2-word name with a clear action is the sweet spot for model accuracy and description economy.

Where should I teach the model when to use a custom tool?

In the tool description, not the system prompt. Descriptions are attached to every call; system prompts are read once. A good description has what the tool does, when to use it, when NOT to use it, and one example invocation. The "when not to" clause is the single highest-use part and the one most teams skip.

How do I know if a custom tool is worth keeping?

Measure call frequency per tool over a week. Tools called in fewer than 5 percent of sessions should usually be deleted because they cost context on every call and confuse the model. Exceptions: safety-critical tools like deploy approvals can stay at low frequency because of their importance, not their usage.

Key takeaways

Custom tools accelerate repeated subroutines; they do not replace primitives. 3 well-chosen custom tools turn a 25-call task into a 4-call task.
Build a custom tool only when it is used in more than 20 percent of tasks and replaces 3 or more primitive calls. Below either threshold it is waste.
find_callers, run_tests, and project_tree are the 3 highest-use custom tools in almost any coding agent. Ship them first.
Name tools verb-object. find_callers, not callers. The clarity beats the extra characters.
Put when-to-use and when-NOT-to-use hints in the tool description. Models follow negation in descriptions better than guidance in system prompts.
To see custom tools wired into a full coding agent with registry, event loop, and safety rails, walk through the Build Your Own Coding Agent course, or start with the AI Agents Fundamentals primer.

For deeper tool-design patterns and case studies, see the Anthropic tool use guide. The description-writing advice there maps directly onto the naming and structure rules in this post.

Your agent keeps reimplementing the same helper, poorly

Why are the built-in tools not enough?

3 failure modes of primitive-only agents:

Reimplementation cost. Every conversation starts from scratch. The model has no memory of how it greps for callers, so it grep-parses its own output fresh every time. Expensive and lossy.
Context pressure. Each low-level tool call adds to the message history. 25 tool calls is 25 assistant messages and 25 tool results. The agent hits context limits much faster than it would with 4 well-chosen custom tool calls.
Error accumulation. Every manual step has a small error rate. Chain 25 of them and the probability at least one fails is high. Chain 4 and it is small. Fewer tool calls means fewer places to be wrong.

graph TD
    Task[Find callers and add logging] --> P[Primitive-only agent]
    P -->|25 tool calls| Many[Multiple errors, long history]

    Task --> C[With find_callers tool]
    C -->|1 call| List[List of call sites]
    List -->|4 edit_file calls| Done[Done in 4 total calls]

    style Many fill:#fee2e2,stroke:#b91c1c
    style Done fill:#dcfce7,stroke:#15803d

The custom tool does not replace edit. It replaces the search-parse-interpret chain that would otherwise happen before edit.

What makes a custom tool worth building?

5 criteria, all of which should be true before you build a new tool:

The operation happens in 20 percent or more of tasks. Tools for rare operations waste tool-description budget and confuse the model.
The operation requires 3 or more primitive calls to replicate. Below that, the primitive chain is fine and the custom tool is redundant.
The operation has deterministic correct output. Tools whose correctness depends on judgment are better as prompts, not tools.
The operation can return structured data, not prose. Structured output is easier for the model to consume and reduces error rates downstream.
The operation does not overlap with an existing tool by more than 50 percent. If it overlaps, merge or specialize.

Apply all 5 and you end up with a small set of high-use custom tools instead of a sprawling toolbox nobody can remember.

What are 3 custom tools worth shipping in every coding agent?

1. `find_callers(function_name: str)`

# filename: find_callers.py
# description: Custom tool that wraps ripgrep with context and returns
# structured call site data for the model.
import subprocess
import json


def find_callers(function_name: str, root: str = '.') -> dict:
    pattern = f'\\b{function_name}\\s*\\('
    cmd = [
        'rg', '--json', '--max-count', '5', '-C', '2',
        '--type-add', 'code:*.{py,ts,js,go,rs,java}',
        '--type', 'code',
        pattern, root,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=15)
    callers = []
    for line in result.stdout.splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if event.get('type') == 'match':
            data = event['data']
            callers.append({
                'path': data['path']['text'],
                'line': data['line_number'],
                'context': data['lines']['text'].rstrip('\n'),
            })
    return {'function': function_name, 'callers': callers, 'count': len(callers)}

The tool does the grep, parses the output, and returns a list the model can iterate over without re-parsing.

2. `run_tests(scope: str = "all")`

# filename: run_tests.py
# description: Structured test runner. Returns pass/fail counts and the
# top failure from each failing test.
import re
import subprocess


def run_tests(scope: str = 'all') -> dict:
    target = '' if scope == 'all' else scope
    cmd = ['pytest', '--tb=line', '-q']
    if target:
        cmd.append(target)

    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=300,
    )

    passed = re.search(r'(\d+) passed', result.stdout)
    failed = re.search(r'(\d+) failed', result.stdout)
    failures = re.findall(r'FAILED (.+?) - (.+)', result.stdout)

    return {
        'ok': result.returncode == 0,
        'passed': int(passed.group(1)) if passed else 0,
        'failed': int(failed.group(1)) if failed else 0,
        'failures': [{'test': t, 'error': e} for t, e in failures[:5]],
    }

The trimmed output is a fraction of the raw pytest dump and contains exactly what the model needs to decide its next edit.

3. `project_tree(depth: int = 3)`

Return a concise tree of the project root, respecting .gitignore, up to a given depth. Solves "what files exist in this project" without the agent having to ls its way down manually.

# filename: project_tree.py
# description: A concise project tree for model orientation.
# Respects .gitignore, caps depth and result count.
import subprocess


def project_tree(depth: int = 3) -> dict:
    cmd = [
        'rg', '--files', '--hidden', '--glob', '!.git',
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    files = result.stdout.splitlines()
    filtered = [f for f in files if f.count('/') <= depth]
    return {'files': filtered[:200], 'total': len(files), 'shown': len(filtered[:200])}

Uses rg's file listing mode (--files) which respects gitignore for free. Caps the output so a huge monorepo does not overwhelm the model.

How do you name custom tools for best model accuracy?

3 naming anti-patterns to avoid:

Names that are verbs used by other tools. get_file next to read_file is confusing. Pick one.
Names that are nouns. callers sounds like a data structure, not an action. Use find_callers.
Names that abbreviate. fc and rt save 10 characters of tool description and cost the model significant accuracy. Spell it out.

How do you teach the model when to use custom tools?

A good description has 4 parts: what the tool does, when to use it, when NOT to use it, and 1 example invocation. The "when not to" line is the one most teams skip and it is the highest-use part:

# filename: tool_descriptions.py
# description: Tool descriptions that include when-to-use and when-not-to-use
# hints to steer the model away from misuse.
FIND_CALLERS_DESC = (
    'Find every call site of a function across the codebase and return '
    'structured results with file, line, and context. Use this instead of '
    'grep when the goal is to understand how a function is used. '
    'Do NOT use for general text search (use code_search for that). '
    'Example: find_callers(function_name="validate_token").'
)

RUN_TESTS_DESC = (
    'Run the project test suite and return a structured pass/fail summary. '
    'Use after any edit_file change that touches logic. '
    'Do NOT use to run a single arbitrary command (use run_bash). '
    'Example: run_tests(scope="tests/test_auth.py").'
)

The "do NOT" clause is what prevents the model from defaulting to the new tool for every search. Models follow negation better than ambiguous guidance.

When does a custom tool stop paying off?

What to do Monday morning

Open a week of agent logs. Find 3 repeated tool-call chains (grep-then-edit, ls-then-cat, run-bash-then-grep-output). Each one is a candidate for a custom tool.
Build the find_callers tool first. It is the single highest-use custom tool for coding agents because call-site lookups happen in nearly every refactor task.
Build run_tests second. A structured test runner turns a 5000-line output into 10 lines and saves enormous context.
Write tool descriptions with "when to use" and "when NOT to use" clauses. The negation trains the model to avoid overuse.
After a week of traces, delete any custom tool called in fewer than 5 percent of sessions. Unused tools are pure tax on every call.

Custom tools accelerate repeated subroutines; they do not replace primitives. 3 well-chosen custom tools turn a 25-call task into a 4-call task.
Build a custom tool only when it is used in more than 20 percent of tasks and replaces 3 or more primitive calls. Below either threshold it is waste.
find_callers, run_tests, and project_tree are the 3 highest-use custom tools in almost any coding agent. Ship them first.
Name tools verb-object. find_callers, not callers. The clarity beats the extra characters.
Put when-to-use and when-NOT-to-use hints in the tool description. Models follow negation in descriptions better than guidance in system prompts.
To see custom tools wired into a full coding agent with registry, event loop, and safety rails, walk through the Build Your Own Coding Agent course, or start with the AI Agents Fundamentals primer.

For deeper tool-design patterns and case studies, see the Anthropic tool use guide. The description-writing advice there maps directly onto the naming and structure rules in this post.

Custom tools for coding agents: beyond the built-ins

Share this post

Share this post

Continue Reading

Which language should you build Redis in? Lessons from rebuilding it 6 times

Query anonymization for RAG bias mitigation

pip vs uv vs poetry for Python AI services

Weekly Bytes of AI

Ready to go deeper?

Custom tools for coding agents: beyond the built-ins

Share this post

Share this post

Continue Reading

Which language should you build Redis in? Lessons from rebuilding it 6 times

Query anonymization for RAG bias mitigation

pip vs uv vs poetry for Python AI services

Weekly Bytes of AI

Ready to go deeper?