Dynamic evaluation metric loading in Python
Your eval pipeline hardcodes metrics and you cannot add one without redeploying
Your eval script imports RAGAS faithfulness, context precision, and answer relevancy at the top. A teammate wants to add a custom metric for "does the answer include the required citation." To add it, they edit the eval script, import the new metric, run the pipeline, and redeploy. Every metric change is a code change, a PR, a review, a deploy.
The fix is dynamic metric loading. Define a small registry interface, register metrics by name, load them at runtime from config. Adding a metric becomes: write a class, register it, add the name to the eval config. No code changes in the pipeline itself.
This post is the dynamic metric loading pattern for Python eval pipelines: the registry interface, how to load from config files or entry points, how tests swap metrics with fakes, and the CI workflow that runs only the metrics a config file asks for.
Why does hardcoding metrics hurt?
Because eval pipelines need to evolve: every project has its own quality signals. Hardcoded metrics have 3 specific failure modes:
- Every metric change is a deploy. Add a metric → edit the pipeline → PR → merge → deploy. Slow loop kills experimentation.
- Cannot run a subset. CI needs a fast subset (3 metrics, 30 seconds). Nightly needs the full suite (15 metrics, 10 minutes). With hardcoded imports, you cannot switch between them without a code change.
- Test pollution. Tests that want to mock one metric have to patch the import. Fragile, slow, and breaks on refactors.
A registry fixes all 3 by making metrics addressable by string name and loadable from config.
```mermaid
graph LR
    Config[eval_config.yaml] -->|list of metric names| Loader[Metric loader]
    Loader --> Registry[(Registry: name to class)]
    Registry --> Faithfulness[FaithfulnessMetric]
    Registry --> Precision[ContextPrecisionMetric]
    Registry --> Custom[CustomCitationMetric]
    Loader --> Pipeline[Eval pipeline]
    Pipeline --> Results[Scored results]
    style Registry fill:#dbeafe,stroke:#1e40af
    style Loader fill:#dcfce7,stroke:#15803d
```
What does the metric registry look like?
A dict mapping name to class, plus a decorator that registers a metric when its module is imported.
```python
# filename: app/eval/registry.py
# description: Dynamic metric registry with decorator-based registration.
from typing import Protocol, Callable


class Metric(Protocol):
    name: str

    async def score(self, question: str, actual: str, expected: str, context: str) -> float: ...


REGISTRY: dict[str, type[Metric]] = {}


def register_metric(name: str) -> Callable:
    def decorator(cls: type[Metric]) -> type[Metric]:
        if name in REGISTRY:
            raise ValueError(f'metric {name} already registered')
        cls.name = name
        REGISTRY[name] = cls
        return cls
    return decorator


def get_metric(name: str) -> Metric:
    cls = REGISTRY.get(name)
    if cls is None:
        raise KeyError(f'unknown metric: {name}. Known: {list(REGISTRY.keys())}')
    return cls()


def list_metrics() -> list[str]:
    return sorted(REGISTRY.keys())
```
3 pieces. Metric is a Protocol so any class with a score method qualifies. register_metric is a decorator that adds the class to the dict. get_metric(name) returns a live instance. Adding a metric is now a matter of writing a class with @register_metric('my_metric').
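Condensed to a standalone sketch (the metric name `always_one` is invented for illustration), the registration and lookup round-trip behaves like this:

```python
# Standalone sketch of the registry round-trip; mirrors the registry above
# but inlined so it runs on its own.
REGISTRY: dict[str, type] = {}


def register_metric(name: str):
    def decorator(cls):
        if name in REGISTRY:
            raise ValueError(f'metric {name} already registered')
        cls.name = name
        REGISTRY[name] = cls
        return cls
    return decorator


def get_metric(name: str):
    cls = REGISTRY.get(name)
    if cls is None:
        raise KeyError(f'unknown metric: {name}. Known: {sorted(REGISTRY)}')
    return cls()  # a fresh instance per lookup


@register_metric('always_one')
class AlwaysOne:
    async def score(self, **kwargs) -> float:
        return 1.0


print(sorted(REGISTRY))               # ['always_one']
print(get_metric('always_one').name)  # always_one
# get_metric('nope') raises KeyError and lists the known names.
```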
How do you define a metric?
Write a class with a score method. Register it with the decorator.
```python
# filename: app/eval/metrics/citation.py
# description: Custom metric: does the output include at least one citation?
import re

from app.eval.registry import register_metric


@register_metric('citation_presence')
class CitationPresenceMetric:
    async def score(self, question: str, actual: str, expected: str, context: str) -> float:
        has_url = bool(re.search(r'https?://\S+', actual))
        has_ref = bool(re.search(r'\[\d+\]|\[[a-zA-Z]+\]', actual))
        return 1.0 if (has_url or has_ref) else 0.0
```
One class, one decorator, one implementation. The file gets imported at startup (via a package-level __init__.py that imports all metric modules), the decorator runs, and the metric is live.
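If listing every metric module in `__init__.py` by hand feels error-prone, `pkgutil` can discover and import them automatically. A self-contained sketch of that discovery step, using a throwaway `demo_metrics` package built on the fly instead of the real `app.eval.metrics`:

```python
# Sketch: auto-import every submodule of a metrics package so that
# import-time registration code runs. `demo_metrics` is a temp package
# created here purely for the demo.
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk to stand in for app/eval/metrics.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_metrics"
pkg.mkdir()
(pkg / "__init__.py").write_text("REGISTERED = []\n")
(pkg / "citation.py").write_text(
    "from demo_metrics import REGISTERED\n"
    "REGISTERED.append('citation_presence')\n"  # stand-in for @register_metric
)
sys.path.insert(0, str(root))

import demo_metrics

# The discovery loop: import every submodule so its registration code runs.
for mod in pkgutil.iter_modules(demo_metrics.__path__):
    importlib.import_module(f"demo_metrics.{mod.name}")

print(demo_metrics.REGISTERED)  # ['citation_presence']
```

In the real package, that loop would live in `app/eval/metrics/__init__.py`, so `import app.eval.metrics` registers everything in one shot.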
For the broader eval pipeline that consumes these metrics, see the Automated evaluation pipelines for agentic systems post.
How do you load metrics from a config file?
The eval pipeline reads a YAML file, looks up each metric by name, and calls score on each one for every eval example.
```python
# filename: app/eval/runner.py
# description: Config-driven eval pipeline using the metric registry.
import yaml
from pathlib import Path

from app.eval.registry import get_metric
import app.eval.metrics  # triggers registration of all metrics


async def run_eval(config_path: str, eval_set: list[dict]) -> dict:
    config = yaml.safe_load(Path(config_path).read_text())
    metric_names = config['metrics']
    metrics = [get_metric(name) for name in metric_names]

    results = []
    for item in eval_set:
        row = {'id': item['id']}
        for m in metrics:
            row[m.name] = await m.score(
                question=item['question'],
                actual=item['actual'],
                expected=item.get('expected', ''),
                context=item.get('context', ''),
            )
        results.append(row)

    # Aggregate: mean score per metric across all examples
    agg = {
        m.name: sum(r[m.name] for r in results) / len(results)
        for m in metrics
    }
    return {'per_example': results, 'aggregate': agg}
```
The config file lists only the metrics you want for that run:
```yaml
# filename: eval_config_fast.yaml
# description: Fast CI metric subset, 3 metrics, under 30 seconds.
metrics:
  - faithfulness
  - context_precision
  - citation_presence
```
```yaml
# filename: eval_config_full.yaml
# description: Full nightly metric suite, 12 metrics, 5-10 minutes.
metrics:
  - faithfulness
  - context_precision
  - context_recall
  - answer_relevancy
  - answer_correctness
  - citation_presence
  - tone_safety
  - length_budget
  - numerical_accuracy
  - format_validity
  - refusal_appropriateness
  - llm_judge_overall
```
Same pipeline, different config, different metric set. CI uses the fast config, nightly uses the full config, both run without code changes.
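One safeguard worth adding before the loop starts: fail fast if the config names a metric nobody registered, instead of discovering it halfway through a run. A sketch (`validate_metric_names` is a hypothetical helper, not part of the runner above):

```python
def validate_metric_names(metric_names: list[str], registry: dict[str, type]) -> None:
    """Raise before the run starts if the config references unknown metrics."""
    unknown = sorted(n for n in metric_names if n not in registry)
    if unknown:
        raise KeyError(
            f'config names unknown metrics {unknown}; '
            f'registered: {sorted(registry)}'
        )


# Example: a registry with two metrics and a config asking for one it lacks.
registry = {'faithfulness': object, 'citation_presence': object}
validate_metric_names(['faithfulness', 'citation_presence'], registry)  # passes

try:
    validate_metric_names(['faithfulness', 'tone_safety'], registry)
except KeyError as e:
    print(e)  # names 'tone_safety' as unknown
```

Calling it right after `yaml.safe_load` turns a typo in the config into an immediate, readable error.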
How do tests override metrics?
Register a fake metric in a pytest fixture. The registry pattern makes this clean.
```python
# filename: tests/test_runner.py
# description: Override a metric in a pytest fixture.
import pytest

from app.eval.registry import REGISTRY, register_metric


@pytest.fixture
def fake_metric():
    @register_metric('test_metric')
    class FakeMetric:
        async def score(self, **kwargs):
            return 1.0
    yield
    REGISTRY.pop('test_metric', None)


@pytest.mark.asyncio
async def test_run_eval(fake_metric):
    from app.eval.runner import run_eval
    result = await run_eval('tests/fixtures/test_config.yaml', [
        {'id': '1', 'question': 'q', 'actual': 'a'},
    ])
    assert result['aggregate']['test_metric'] == 1.0
```
The fixture registers a fake metric at setup, tears it down after the test. No import patching, no module swaps, no fragile monkey-patching.
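If several tests need throwaway metrics, the register-then-pop pattern can be factored into a small context manager (`temp_metric` is a hypothetical helper, not part of the registry above):

```python
# Sketch: scoped metric registration for tests. REGISTRY stands in for
# app.eval.registry.REGISTRY.
from contextlib import contextmanager

REGISTRY: dict[str, type] = {}


@contextmanager
def temp_metric(name: str, cls: type):
    """Register cls under name for the duration of the with-block only."""
    if name in REGISTRY:
        raise ValueError(f'metric {name} already registered')
    cls.name = name
    REGISTRY[name] = cls
    try:
        yield cls
    finally:
        REGISTRY.pop(name, None)


class FakeMetric:
    async def score(self, **kwargs) -> float:
        return 1.0


with temp_metric('test_metric', FakeMetric):
    print('test_metric' in REGISTRY)  # True
print('test_metric' in REGISTRY)      # False -- cleaned up on exit
```

The `finally` clause guarantees cleanup even when the test body raises, which a bare register/pop pair in a fixture also gets via pytest's teardown, but this version works anywhere.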
For the broader test strategy that this sits inside, see the Modular architectures for agentic AI post.
What to do Monday morning
- Open your eval pipeline. Count the hardcoded metric imports at the top. If there are 3 or more, registry pattern is worth it.
- Create app/eval/registry.py with the register_metric decorator and get_metric function.
- Convert each existing metric to a class with @register_metric('name'). Keep the logic identical.
- Move the pipeline to read a YAML config listing metric names. Remove direct imports from the pipeline module.
- Create eval_config_fast.yaml and eval_config_full.yaml for CI and nightly. Point CI at the fast one.
- Run both. Confirm the fast config runs in under 1 minute and the full config matches your previous baseline.
The headline: dynamic metric loading is 40 lines of registry code plus config files. Adding a metric becomes a class + one line of config. No redeploys, no import surgery, no test pollution.
Frequently asked questions
Why use a metric registry instead of hardcoded imports?
Because hardcoded imports make every metric change a code change, and you cannot run a subset of metrics without touching the pipeline source. A registry lets you add metrics by writing a class with a decorator, and run subsets by swapping a config file. Every operational win (fast CI vs full nightly, A/B testing metrics, experimenting with new dimensions) flows from this one pattern.
How do I register a metric?
Write a class with a score method that takes question, actual, expected, and context keyword arguments. Decorate the class with @register_metric('my_name'). When the module is imported, the decorator runs and the metric becomes available via get_metric('my_name'). That is the whole registration contract.
How do I load different metric sets for CI vs production?
Put each metric set in its own YAML config file. The pipeline reads the config at runtime and loads only the metrics named there. CI uses a fast config (3 metrics, under 30 seconds). Nightly uses a full config (12 metrics, 5-10 minutes). Same pipeline code, different configs, no deployment between them.
Can tests mock individual metrics?
Yes, by registering a fake metric in a pytest fixture. Use @register_metric('test_metric') inside a fixture, then pop the entry from the registry in the fixture teardown. This is cleaner than import patching: the fixture adds exactly one registry entry and removes it afterward, so no module imports need to be patched or reloaded.
How do I handle metric dependencies (e.g., some need a real LLM)?
Pass the LLM client as a dependency to the metric class constructor. In the registry, store the class (not an instance). When get_metric(name) is called, instantiate with the required dependencies via __init__. This keeps the registry logic-free while supporting dependency injection.
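A minimal sketch of that constructor-injection idea, with `FakeLLM` standing in for a real client (all names here are illustrative):

```python
# Sketch: the registry stores classes; dependencies are injected when
# get_metric instantiates one.
import asyncio

REGISTRY: dict[str, type] = {}


def register_metric(name: str):
    def decorator(cls):
        cls.name = name
        REGISTRY[name] = cls
        return cls
    return decorator


def get_metric(name: str, **deps):
    # Dependencies arrive at instantiation time, not registration time.
    return REGISTRY[name](**deps)


@register_metric('llm_judge')
class LLMJudgeMetric:
    def __init__(self, llm):
        self.llm = llm  # injected, not a module-level global

    async def score(self, question: str, actual: str, **kwargs) -> float:
        verdict = await self.llm.complete(
            f'Is "{actual}" a good answer to "{question}"? Reply yes or no.'
        )
        return 1.0 if verdict.strip().lower().startswith('yes') else 0.0


class FakeLLM:
    async def complete(self, prompt: str) -> str:
        return 'yes'


metric = get_metric('llm_judge', llm=FakeLLM())
print(asyncio.run(metric.score(question='q', actual='a')))  # 1.0
```

In tests, the same `get_metric` call takes a fake client; in production it takes the real one. The registry never knows the difference.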
Key takeaways
- Hardcoded metric imports make every metric change a deploy. A registry turns it into a class + one line of config.
- The registry is 40 lines: a dict, a decorator, a lookup function. Every metric class registers itself at import time.
- Config files (YAML) list which metrics to run. CI and nightly use different configs on the same pipeline code.
- Tests override metrics with pytest fixtures that register fakes at setup and pop them at teardown. No monkey-patching.
- Metric classes take dependencies via __init__, not globals. The registry stores the class, not an instance.
- To see dynamic metric loading wired into a full production eval pipeline with RAGAS and dashboards, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.
For the Python entry-points mechanism that extends the registry pattern to third-party packages, see the packaging.python.org entry-points guide. Use it when you want metrics to be installable as pip packages.