# Automated evaluation pipelines for agentic AI systems
You built an eval framework and nobody runs it
You shipped LLM-as-a-judge. You shipped Ragas. You shipped a custom metric registry. All 3 work. Nobody on your team runs them because running the eval is a 10-minute manual command, and engineers ship changes faster than they run evals. Regressions leak through because the eval exists but is not automatic.
The fix is an automated eval pipeline: CI runs a fast subset on every PR, nightly runs the full suite, a dashboard shows the trend, and any regression above a threshold blocks the merge. Same framework, different operational loop.
This post is the CI wiring pattern for agent eval: the fast PR gate, the full nightly job, the dashboard format, and the regression threshold rules that catch real problems without blocking on noise.
## Why do manual evals stop working past week 2?
Because engineers optimize for the loop that shows up in their day. Things that run automatically shape the shipping rhythm. Things that require a manual command get skipped under deadline pressure. 3 specific failure modes of manual eval:
- Skipped under pressure. Deadline looms, engineer ships without running the eval, regression slips in.
- Inconsistent baselines. Engineer A ran eval on commit 1, engineer B did not run eval on commit 2, so there is no before/after comparison.
- No visibility. The eval dashboard exists but nobody remembers to check it. Changes ship on vibes.
Automated eval fixes all 3 by putting the eval on rails. PR gate forces a quick check. Nightly runs the full suite unsupervised. Dashboard alerts on regressions without manual polling.
```mermaid
graph TD
    PR[PR opened] --> Fast[Fast eval: 3 metrics, 30s]
    Fast --> Gate{"Score >= baseline - 0.1?"}
    Gate -->|yes| Merge[Merge allowed]
    Gate -->|no| Block[Blocked with report]
    Nightly[Nightly cron] --> Full[Full eval: 12 metrics, 5min]
    Full --> Dashboard[(Trend dashboard)]
    Dashboard --> Alert{"Regression >= 0.3?"}
    Alert -->|yes| Slack[Slack alert]
    style Fast fill:#dbeafe,stroke:#1e40af
    style Full fill:#dcfce7,stroke:#15803d
    style Alert fill:#fef3c7,stroke:#b45309
```
## What runs on every PR?
A fast subset of metrics on a small eval set: 3 metrics, 30-50 examples, under 60 seconds of eval runtime (about 90 seconds of wall clock once CI setup and the baseline comparison are included). The goal is catching obvious regressions without adding significant CI latency.
```yaml
# filename: .github/workflows/eval-pr.yml
# description: Fast eval gate on every PR.
name: Eval PR gate

on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install uv && uv sync --frozen
      - name: Run fast eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: uv run python -m app.eval.runner eval_config_fast.yaml > eval_result.json
      - name: Compare to baseline
        run: uv run python -m app.eval.compare_baseline eval_result.json baseline.json
```
The compare_baseline script reads the current run's scores and compares to a committed baseline JSON. If any metric drops by more than 0.1 (on a 0-1 scale) or 0.5 (on a 1-5 scale), the step fails and the PR is blocked.
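The repo behind the post is not shown here, so treat this as a sketch of what `compare_baseline` could look like. The `find_regressions` helper, the `aggregate` key in the result file, and the flat baseline shape are assumptions; the 0.1 threshold is the rule from above.

```python
# Hypothetical sketch of app/eval/compare_baseline.py. Assumed shapes:
# eval_result.json holds {"aggregate": {"metric": score}}; baseline.json is a
# flat {"metric": score} dict; both on a 0-1 scale.
import json
import sys
from pathlib import Path

THRESHOLD = 0.1  # max allowed drop per metric on a 0-1 scale


def find_regressions(result: dict, baseline: dict, threshold: float = THRESHOLD) -> list[str]:
    """Return a report line for every metric that dropped past the threshold."""
    regressions = []
    for name, base_score in baseline.items():
        current = result.get(name)
        if current is None:
            regressions.append(f"{name}: missing from current run")
        elif base_score - current > threshold:
            regressions.append(f"{name}: {base_score:.3f} -> {current:.3f}")
    return regressions


if __name__ == "__main__" and len(sys.argv) == 3:
    result = json.loads(Path(sys.argv[1]).read_text())["aggregate"]
    baseline = json.loads(Path(sys.argv[2]).read_text())
    if report := find_regressions(result, baseline):
        print("\n".join(report))
        sys.exit(1)  # non-zero exit fails the CI step, which blocks the merge
```

A metric missing from the current run is treated as a regression on purpose: deleting a metric should be a deliberate baseline update, not a silent pass.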
For the dynamic metric loader that reads the YAML config, see the Dynamic evaluation metric loading in Python post.
## What runs nightly?
The full metric suite on the full eval set. 12 metrics, 500-1000 examples, 5-10 minutes. The goal is catching subtle regressions that the fast PR gate misses and building a trend line over weeks.
```yaml
# filename: .github/workflows/eval-nightly.yml
# description: Full nightly eval with trend recording.
name: Eval nightly

on:
  schedule:
    - cron: '0 3 * * *' # 3am UTC daily

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install uv && uv sync --frozen
      - name: Run full eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: uv run python -m app.eval.runner eval_config_full.yaml > nightly_result.json
      - name: Upload to dashboard
        run: |
          curl -X POST https://eval-dashboard.internal/api/runs \
            -H 'Content-Type: application/json' \
            -d @nightly_result.json
      - name: Check for regressions
        run: uv run python -m app.eval.regression_check nightly_result.json
```
The dashboard is whatever you have: a Postgres table, a simple Metabase board, a Grafana setup, or even a Google Sheet that you append to. The key is that the data is stored somewhere that shows trends across runs.
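The `regression_check` step can then compare the fresh aggregate against the rolling mean of the last 7 stored runs. A sketch, assuming flat `{metric: score}` dicts on a 0-1 scale; the `nightly_regressions` name and record shapes are assumptions, and fetching the history from wherever your dashboard stores it is left out:

```python
# Hypothetical sketch of app/eval/regression_check.py. Each entry in `history`
# is assumed to be a flat {metric: score} dict for one past nightly run.
from statistics import mean

NIGHTLY_THRESHOLD = 0.3  # alert tier from the post, on a 0-1 scale


def nightly_regressions(latest: dict, history: list[dict],
                        threshold: float = NIGHTLY_THRESHOLD) -> dict[str, float]:
    """Map each regressed metric to its drop vs the mean of the last 7 runs."""
    window = history[-7:]
    drops = {}
    for name, current in latest.items():
        past = [run[name] for run in window if name in run]
        if not past:
            continue  # new metric: no history yet, nothing to compare against
        drop = mean(past) - current
        if drop >= threshold:
            drops[name] = round(drop, 3)
    return drops
```

Anything this returns goes to Slack; an empty dict means the night was clean.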
## How do you set the baseline?
Take the mean of the last 7 nightly runs and commit the result as baseline.json. Update the baseline on a weekly cadence, or when a significant intentional improvement lands. Never let the baseline drift downward silently, or you will ship regressions that look like non-events.
```python
# filename: app/eval/baseline.py
# description: Compute baseline from last 7 nightly runs.
import json
from pathlib import Path
from statistics import mean


def compute_baseline(runs: list[dict]) -> dict:
    last_7 = runs[-7:]
    metric_names = list(last_7[0]['aggregate'].keys())
    return {
        name: mean(r['aggregate'][name] for r in last_7)
        for name in metric_names
    }
```
The baseline represents the expected quality floor. Any PR dropping below it is a candidate regression.
## What do you alert on?
2 alert tiers:
- PR gate failure: any metric drops by more than 0.1 on a 0-1 scale (or equivalent on 1-5). Blocks the merge.
- Nightly regression: any metric drops by more than 0.3 compared to the 7-day average. Slack alert, no hard block, but investigate same-day.
Do not alert on smaller changes. Noise will swamp signal and your team will start ignoring the alerts.
For the LLM-as-a-judge rationale audit that complements this trend view, see the LLM judges enforcing reasoning post.
## What does the dashboard need to show?
3 views:
- Trend line per metric over the last 30 days. Spot gradual drift that point-in-time comparisons miss.
- Latest run details: mean, p50, p95, failures per dimension. Identify which kinds of outputs the agent struggles with.
- Rationale search. Filter judge rationales by keyword to find recurring failure modes.
Keep the dashboard opinionated. 20 tiles of metrics make nothing visible; 3 focused views keep the team looking.
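Before the rationale-search view becomes a dashboard widget, it can start as a plain filter over the stored run JSON. A sketch, assuming each run stores per-example judge records under an `examples` key; that shape and the `search_rationales` name are assumptions:

```python
# Hypothetical rationale search over stored runs. Assumed record shape:
# {"examples": [{"metric": ..., "rationale": ..., "score": ...}, ...]}
def search_rationales(runs: list[dict], keyword: str) -> list[dict]:
    """Return per-example judge records whose rationale mentions the keyword."""
    kw = keyword.lower()
    return [
        rec
        for run in runs
        for rec in run.get("examples", [])
        if kw in rec.get("rationale", "").lower()
    ]
```

Grepping rationales for a phrase like "not in context" is often the fastest way to turn a trend-line dip into a concrete failure mode.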
## What to do Monday morning
- Wire the fast eval into PR CI. Use the metric registry config and point it at your smallest stable eval set (30-50 examples).
- Wire the full eval into a nightly GitHub Actions job. Use the full metric set and a larger eval set (500+ examples).
- Compute an initial baseline from a single clean run. Commit baseline.json. Update weekly via a scheduled job that rewrites it.
- Build a minimal dashboard: 1 trend chart per metric, 1 latest-run table, 1 rationale search. Start with Metabase or Grafana if you have one; fall back to a Google Sheet.
- Add a Slack alert for nightly regressions of 0.3 or more. Alert once, wait 24 hours before alerting again on the same metric.
- After 2 weeks, look at the data. If nothing surprising showed up, trust the pipeline and use it to ship more confidently.
The headline: automated eval is 2 CI jobs, 1 baseline file, 1 dashboard, 1 Slack alert. Fast on PR, full at night, trend over weeks, alert on real regressions. Turns eval from "nobody runs it" into "every change is measured".
## Frequently asked questions

### Why run eval in CI at all?
Because manual eval gets skipped under deadline pressure, and regressions slip through without automated gates. CI makes the eval part of the shipping loop, so every change is measured without requiring willpower from engineers. The PR gate catches obvious breaks; nightly catches subtle drift. Together they catch the vast majority of real regressions before production.
### How long should the PR eval take?
Under 90 seconds total including setup, run, and compare. Longer than that and engineers start complaining, disabling it, or working around it. 90 seconds is enough to run 3 metrics on 30-50 carefully chosen examples, which is enough to catch obvious regressions. Reserve the full suite for nightly.
### How do I set a baseline without locking in bad quality?
Compute the baseline as the rolling 7-day mean of nightly runs. Update it weekly via a scheduled job that rewrites baseline.json. When a real improvement lands, the baseline rises within a week automatically. When a regression is intentional (a model swap with known trade-offs), update the baseline manually in the same PR.
### What regression threshold should I alert on?
Two tiers. PR gate: 0.1 drop on a 0-1 scale, or 0.5 on a 1-5 scale. Hard block. Nightly: 0.3 drop on a 0-1 scale, or 1.0 on a 1-5 scale. Slack alert, no hard block. Tighter thresholds produce false-positive noise; looser ones let real regressions through. Tune once, then trust the numbers.
### What goes in the eval dashboard?
Three views: a trend line per metric over 30 days, a table of the latest run's aggregate scores with p50 and p95, and a rationale search box. No more. 20-tile dashboards are ignored because nobody knows which tile to look at. 3 tiles get checked every morning.
## Key takeaways
- Manual eval gets skipped under pressure. Automate it or accept that regressions will leak.
- Fast eval runs on every PR: 3 metrics, 30-50 examples, under 90 seconds. Blocks merges on significant drops.
- Full eval runs nightly: 12 metrics, 500+ examples, 5-10 minutes. Records trend data in a dashboard.
- Baseline is a rolling 7-day mean. Auto-updates weekly so real improvements are absorbed, not hidden.
- Alert on 0.1 PR drop (block merge) and 0.3 nightly drop (Slack alert). Tighter thresholds = noise; looser = missed regressions.
- To see this eval pipeline wired into a production agent service with Ragas and LLM-as-a-judge, walk through the Agentic RAG Masterclass, or start with the RAG Fundamentals primer.
For the GitHub Actions scheduled workflow syntax and secrets management, see the GitHub Actions docs on scheduled events.