
Do We Really Need Evals for Agents?

Rod Rivera
Estimated read time: 3 min
Last updated: September 18, 2025
Earlier this week, we hosted a live conversation with Hamza Tahir, CTO of ZenML, to ask a simple question: do we need evals for agentic systems?
The question has been debated across Twitter, LinkedIn, and Discord. Some founders argue evals slow you down. Others claim they are non-negotiable for production. The reality, as Hamza explained, is less binary.
The Startup View: Move Fast, Stay Tolerant
If you’re building an agent to automate restaurant bookings or draft sales emails, you don’t need a perfect evaluation framework on day one.
You still need something. At minimum:
- Collect traces from your agent in production.
- Label failures manually (an Excel sheet is fine).
- Run a simple script to re-test against those labeled examples.
That’s the “starter pack.” It won’t slow you down, and it builds the muscle memory you’ll need when the stakes get higher.
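Here is a minimal sketch of what that starter-pack re-test script could look like, assuming your labeled failures are exported to a CSV and your agent can be called as a plain function. The names `run_agent`, `passes`, and `labeled_failures.csv` are illustrative placeholders, not part of any framework discussed in the talk.

```python
import csv

def run_agent(prompt: str) -> str:
    # Placeholder: wire this up to your actual agent.
    raise NotImplementedError("call your agent here")

def passes(expected: str, actual: str) -> bool:
    # Simplest possible check; swap in whatever heuristic fits your product.
    return expected.strip().lower() in actual.strip().lower()

def main() -> None:
    regressions = []
    with open("labeled_failures.csv", newline="") as f:
        # Expected columns: prompt, expected_behavior (exported from your Excel sheet).
        for row in csv.DictReader(f):
            output = run_agent(row["prompt"])
            if not passes(row["expected_behavior"], output):
                regressions.append(row["prompt"])
    print(f"{len(regressions)} regressions against labeled examples")
    for prompt in regressions:
        print(" -", prompt)

if __name__ == "__main__":
    main()
```

Run it after every prompt or model change; the point is not sophistication, it is having any repeatable check at all.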
The Enterprise View: Eval Discipline Matters
If your agent touches law, healthcare, or finance, mistakes are expensive. A silent failure or an infinite loop can cost millions, or worse.
Enterprises often fall into the opposite trap: months of theoretical risk analysis without deploying anything. Hamza’s advice was clear: ship something small, measure real failures, then harden from there.
It’s the same lesson self-driving teams learned a decade ago: you only discover edge cases on the road, not in the lab.
Common Mistakes We See
- Over-scoping: building one universal agent that does everything. Narrow use cases are easier to measure and safer to deploy.
- Relying on canned benchmarks: generic metrics rarely match your product. Define your own heuristics.
- Ignoring user experience: evals that optimize for “correctness” while frustrating users miss the point. If satisfaction drops, your evals are wrong.
What About LLM Judges?
There’s hype around using language models as evaluators. Done wrong, it’s unreliable. Done right, it’s powerful.
Hamza’s method:
- Annotate examples yourself.
- Compare your labels against the LLM judge.
- Tune prompts until agreement is high.
At that point, the model acts as a proxy for your own evaluation. Teams that do this well speed up the entire loop.
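As a rough illustration, the agreement check can be as simple as the sketch below, assuming you already have your own labels and the judge's verdicts side by side. The label values and thresholds are made up for the example.

```python
def agreement(human_labels: list[str], judge_labels: list[str]) -> float:
    """Fraction of examples where the LLM judge matches your own label."""
    assert len(human_labels) == len(judge_labels), "label lists must align"
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Illustrative: a handful of hand-annotated examples vs. the judge's verdicts.
human = ["pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail"]

score = agreement(human, judge)
print(f"judge agreement: {score:.0%}")

# If agreement is low, tweak the judge prompt and re-run until it tracks your labels.
if score < 0.9:
    print("judge is not yet a trustworthy proxy; keep tuning the prompt")
```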
Where ZenML Fits
ZenML describes itself as an “outer loop” framework: the infrastructure around agents, not the agents themselves.
That means:
- Pipelines for deploying and re-deploying agents.
- Data and artifact versioning.
- Batch evals and reproducibility across environments.
- Integrations with tracing and monitoring tools.
If LangGraph or OpenAI’s SDK define what the agent does, ZenML defines how it runs reliably at scale.
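To make the "outer loop" idea concrete, here is a minimal sketch of a batch-eval pipeline using ZenML's `@step` and `@pipeline` decorators. The step bodies, data shapes, and names are placeholder assumptions for illustration, not ZenML's or Hamza's actual code.

```python
from zenml import pipeline, step

@step
def load_labeled_traces() -> list[dict]:
    # Placeholder: pull labeled failure cases from wherever you store them.
    return [{"prompt": "book a table for two", "expected_behavior": "confirmation"}]

@step
def run_batch_eval(traces: list[dict]) -> float:
    # Placeholder scoring: call your agent on each trace and score the outputs.
    passed = sum(1 for t in traces if "confirmation" in t["expected_behavior"])
    return passed / len(traces)

@step
def report(score: float) -> None:
    print(f"batch eval pass rate: {score:.0%}")

@pipeline
def agent_eval_pipeline():
    traces = load_labeled_traces()
    score = run_batch_eval(traces)
    report(score)

if __name__ == "__main__":
    # Each run's inputs and outputs are versioned, which is what keeps
    # results reproducible across environments.
    agent_eval_pipeline()
```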
Our Take at Jentic
At Jentic, we believe reliability is the missing piece in the agent stack. Standards like Arazzo give you deterministic workflows. Frameworks like ZenML give you reproducibility. Together, they push agentic software closer to production-grade.
The lesson from Hamza’s talk is simple:
- Start narrow.
- Evaluate in a way that matches your use case.
- Communicate clearly to your users what the agent can and cannot do.
It’s not about choosing between evals or no evals. It’s about the right evals at the right time.
Watch the full conversation on our YouTube channel. Try the Arazzo Engine and the Standard Agent on GitHub. For early access, join the beta and our Discord community.
If you’re running agents in production, we’d like to hear how you evaluate them. Share your approach in Discord.