Building Agents We Can Measure

For a long time, most software followed a simple contract: a function takes input x, applies logic, and returns output y. The work was to make this logic correct, fast, secure, and maintainable.

Machine learning changed part of that contract. Instead of writing all the logic ourselves, we trained models to predict an output from examples. The system became less explicit, but the interface was still usually narrow: a model receives a structured input and returns a prediction.

Agents move the boundary again. An agent receives a request, reasons about it, uses tools, reads data, takes actions, observes the result, and decides what to do next. The input is often a loose user query. The output is not only a message, but also a sequence of decisions that may depend on context, memory, permissions, external systems, and the model itself.

This makes agents powerful, but it also changes the engineering problem. A small change in a system prompt, a tool description, a model, or an output schema can improve one behavior while affecting another in ways that are hard to anticipate. The system is no longer defined only by code paths. It is also defined by the interaction between instructions, tools, data, and runtime decisions.

At Next Gate Tech, we build Spark as an agentic platform for investment operations. Our agents interact with structured data, workflows, files, tasks, users, and external systems. They operate in environments where correctness, traceability, and control matter.

This forced us to rethink part of our engineering process. The question is no longer only "Does the agent answer well on this example?" It is "Can we measure whether the agent behaves correctly across many realistic situations, before and after every change?"

That is why we started building our agent framework with evaluation-driven development. The goal is not to make agents fully deterministic, or to remove every uncertainty from the system. The goal is to make their behavior visible enough that product and engineering teams can improve it with confidence.

Why manual testing is not enough

Manual testing was useful at the beginning. We could change a prompt, run a few examples, inspect the trace, and see whether the agent behaved better. This was fast, and it helped us build intuition about how the system reacted to different instructions, tools, and user queries. For a prototype, or even for a first useful version, this kind of direct inspection is still one of the best ways to understand what is happening.

As we added more capabilities, we started to see the limits of this approach. A small change to a system prompt could improve one behavior and weaken another. A small change to a tool description could make the agent select a different tool. A new model could reduce cost but change the style of reasoning.

This is part of what makes agent development interesting, but also more difficult to control. The final behavior depends on the user query, the model, the prompt, the available tools, the tool schemas, the data returned by the tools, the memory, the environment state, and the orchestration logic around the model. In our case, it also depends on each customer setup: their Spark configuration, their data objects, their workflows, their custom skills, and their third-party MCP tools.

We gradually learned that we needed more than spot checks and manual validation. User feedback is valuable, but it arrives too late to be the main quality process. In enterprise operations, we cannot rely on production issues to discover whether a change affected tool usage, reasoning quality, or the way the agent handles incomplete data.

We needed a way to keep the fast feedback loop that made manual testing useful, while adding enough structure to measure progress across many realistic situations. This is what led us to build evaluation more deeply into the way we design, implement, release, and improve agents.

Evaluation-driven development

An evaluation is a test for an AI system, but for agents it needs to cover more than the final answer. When an agent responds, the quality of the run depends on the request it received, the tools it selected, the intermediate steps it took, the observations it used, and sometimes the final state of the system after the run.

We use a simple vocabulary internally:

A task describes a situation we want the agent to handle. It includes the user input, the relevant context, and the success criteria.
A trial is one run of the agent on that task. Because agent behavior can vary, we may run the same task multiple times to understand how stable the behavior is.
A trace gives us the full execution path: messages, tool calls, observations, intermediate decisions, and final answer.
An outcome captures what happened in the environment, such as whether the right record was found, the right task was created, or the right object was referenced.
An evaluator scores the result of the trial. Some evaluators are deterministic, for example checking whether a tool error occurred or whether a required reference was returned. Others are model-based, especially when the judgment requires more nuance.
An evaluation suite is a collection of tasks designed to measure a capability or protect against regressions.
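
As an illustration, the vocabulary above can be sketched with a few Python dataclasses. This is a minimal sketch under stated assumptions, not Spark's implementation or the LangSmith API: every class, field, and function name here is hypothetical, and scoring a trial between 0.0 and 1.0 is just one possible convention.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Task:
    """A situation the agent should handle, with its success criteria."""
    task_id: str
    user_input: str
    context: dict = field(default_factory=dict)
    success_criteria: list[str] = field(default_factory=list)
    expected_outcome: dict = field(default_factory=dict)  # hypothetical field


@dataclass
class ToolCall:
    name: str
    arguments: dict
    error: Optional[str] = None


@dataclass
class Trace:
    """Execution path of one run: messages, tool calls, final answer."""
    messages: list[str]
    tool_calls: list[ToolCall]
    final_answer: str


@dataclass
class Trial:
    """One run of the agent on a task, with its trace and outcome."""
    task: Task
    trace: Trace
    outcome: dict  # what happened in the environment, e.g. {"record_id": "42"}


@dataclass
class EvaluationSuite:
    """A named collection of tasks measuring one capability."""
    name: str
    tasks: list[Task]


# An evaluator maps one trial to a score (assumed convention: 0.0 to 1.0).
Evaluator = Callable[[Trial], float]


def outcome_matches(trial: Trial) -> float:
    """Deterministic evaluator: every expected outcome key has the expected value."""
    expected = trial.task.expected_outcome
    return 1.0 if all(trial.outcome.get(k) == v for k, v in expected.items()) else 0.0


def pass_rate(trials: list[Trial], evaluator: Evaluator, threshold: float = 1.0) -> float:
    """Share of trials reaching the threshold: a rough stability signal over repeated runs."""
    if not trials:
        return 0.0
    return sum(evaluator(t) >= threshold for t in trials) / len(trials)
```

Running the same task several times and looking at `pass_rate` rather than a single score is one simple way to see how stable a behavior is.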

This gives us a shared language between product and engineering. Product can describe the expected behavior through tasks and success criteria, while engineering can implement changes against those expectations. Over time, the evaluation suite becomes a practical contract for how the agent should behave.

It does not replace unit tests, integration tests, or manual review. Those remain the right tools to validate deterministic logic and platform behavior. Evaluations complement them by measuring the parts of the system that are specific to agents: reasoning quality, tool usage, ambiguity handling, response structure, and the path taken to reach an answer.

Designing evaluations from real failures

We usually start from real behavior. A user asks a question, an internal tester finds a weak spot, a developer notices that the agent used the wrong tool, or a customer use case exposes an ambiguity we had not anticipated. These observations are valuable because they show how the agent behaves in practice, not only how we expected it to behave during design. When we see a recurring pattern, we turn it into an evaluation task.

One example came from portfolio data. A user might ask:

What percentage of my book is exposed to JPY?

At first glance, this looks like a normal analytical question. To answer it properly, however, the agent needs to understand that the request may require scanning a broad dataset. It should not randomly query records, guess filters, or search by date until it finds something that looks useful.

The expected behavior is more precise. The agent should understand the scope of the request and determine whether it has enough context, such as the relevant portfolio, date, or exposure field. It should also assess whether the data can be retrieved safely through a few queries. If not, it should ask the user to refine the request instead of pretending to have computed the answer.

That failure became an evaluation task. The goal was not only to check whether the agent could "answer the question." The success criteria also described the expected reasoning path, the acceptable tool usage, and the right behavior when the request is underspecified.
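
To make this concrete, the success criteria for the JPY exposure task could be written as deterministic checks along these lines. It is a hedged sketch: the required context keys, the limit of three queries standing in for "a few queries", the tool name `query_records`, and the `Run` structure are all assumptions made for illustration. The third check is only a crude proxy; judging the reasoning path properly would more likely need a model-based evaluator.

```python
from dataclasses import dataclass

# Assumptions for illustration only: which context the request needs,
# what "a few queries" means, and the name of the query tool.
REQUIRED_CONTEXT = ("portfolio", "date")
MAX_QUERIES = 3
QUERY_TOOL = "query_records"


@dataclass
class Run:
    context: dict             # what the agent knew before answering
    tool_calls: list[str]     # names of the tools called, in order
    asked_clarification: bool
    final_answer: str


def evaluate_exposure_run(run: Run) -> dict[str, bool]:
    """Check one run of the exposure task against its success criteria."""
    missing = [key for key in REQUIRED_CONTEXT if key not in run.context]
    query_count = run.tool_calls.count(QUERY_TOOL)
    checks = {
        # The agent should not scan the dataset query after query.
        "bounded_tool_usage": query_count <= MAX_QUERIES,
        # If context is missing, the agent should ask the user to refine the request.
        "clarifies_when_underspecified": run.asked_clarification or not missing,
        # Crude proxy for "pretending to have computed the answer":
        # a percentage in the answer although required context was missing.
        "no_figure_without_context": not (missing and "%" in run.final_answer),
    }
    checks["passed"] = all(checks.values())
    return checks
```

Returning each check separately, rather than a single pass or fail, keeps the result readable for the people who wrote the criteria.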

This is where evaluations become more than quality checks. They help us write down what "good" means in practical terms, and they turn expected agent behavior into product requirements that product and engineering can review together.

How evaluations fit into our development lifecycle

We use evaluations at four moments: design, implementation, release, and production.

Design

During design, evaluations help us define the behavior we want before we implement it. This is especially useful for agent work because two people can agree on the same feature but imagine different behaviors in edge cases. Should the agent answer directly? Should it ask for clarification? Should it use a tool? Should it refuse because the data is incomplete? Should it return references?

Writing the task and success criteria makes these choices explicit. It also helps us separate product decisions from implementation details. The task says what the agent should achieve. The implementation can then change over time: prompt changes, model changes, tool changes, or orchestration changes.

Implementation

During implementation, evaluations give developers a direct feedback loop. When a developer changes a prompt, a tool schema, or a routing rule, they can run the relevant suite and see whether the targeted behavior improves. They can also inspect the trade-offs across quality, latency, tool errors, token usage, and cost.

This matters because agent improvements are rarely one-dimensional. A larger model may improve reasoning but increase cost. A smaller model may be faster but fail on complex tasks. A stricter schema may make the UI easier to build but reduce flexibility. A more detailed tool description may improve tool selection but increase prompt size.

Evaluations make these trade-offs visible during development, when changes are still easy to adjust. They also reduce the risk of local optimization: it is easy to fix one failure and break another behavior somewhere else, and a regression suite helps us catch those side effects earlier.
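
The trade-offs discussed in this section become easier to compare once each suite run is reduced to a small summary. The sketch below assumes per-trial metrics are already collected in a hypothetical `TrialMetrics` record; the field names and units are illustrative, not the ones used in Spark.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TrialMetrics:
    """Metrics recorded for one trial (hypothetical structure)."""
    passed: bool
    latency_s: float
    tool_errors: int
    tokens: int
    cost_usd: float


def summarize(trials: list[TrialMetrics]) -> dict[str, float]:
    """Aggregate one suite run into the dimensions a developer wants to compare."""
    if not trials:
        raise ValueError("cannot summarize an empty run")
    return {
        "success_rate": mean(1.0 if t.passed else 0.0 for t in trials),
        "mean_latency_s": mean(t.latency_s for t in trials),
        "tool_errors": float(sum(t.tool_errors for t in trials)),
        "mean_tokens": mean(t.tokens for t in trials),
        "mean_cost_usd": mean(t.cost_usd for t in trials),
    }
```

Summarizing the same suite for two configurations, for example a larger and a smaller model, puts quality, latency, and cost side by side instead of leaving them to impression.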

Release

Before release, we run evaluations as part of the release process. The goal is to compare the new version with the production baseline before the change reaches users.

These evaluations are integrated into our CI/CD pipeline and run on release branches. We run evaluation suites on larger datasets and inspect how the scores move across capabilities. If a release improves one behavior but creates regressions in another, we can see it before production. If a change increases tool errors, recursion limit hits, latency, or cost per task, we can investigate while the release is still under review.
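
A baseline comparison of this kind can be sketched as a small function that flags any metric that got worse by more than a tolerance. The metric names, the 5% relative tolerance, and the idea of returning a list of regressed metrics are assumptions for illustration; the actual release checks are not described in this level of detail.

```python
# Direction of each metric (hypothetical names): True means higher is better.
HIGHER_IS_BETTER = {
    "success_rate": True,
    "tool_error_rate": False,
    "recursion_limit_hits": False,
    "mean_latency_s": False,
    "cost_per_task_usd": False,
}


def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.05) -> list[str]:
    """Return the metrics where the candidate is worse than the production
    baseline by more than `tolerance`, relative to the baseline value."""
    regressions = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        base, new = baseline[metric], candidate[metric]
        # Positive when the candidate is worse, whatever the metric's direction.
        worsening = (base - new) if higher_is_better else (new - base)
        # With a baseline of 0, any worsening at all is flagged.
        if worsening > tolerance * abs(base):
            regressions.append(metric)
    return regressions
```

A non-empty result does not have to block the release automatically. It can simply tell reviewers which traces to inspect first.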

This makes agent quality part of the release conversation in a more concrete way. Instead of relying only on manual impressions or a few selected examples, we can look at measurable evidence, compare it with the baseline, and decide whether the change is ready to ship.

Production

Production gives us some of the most valuable inputs for the evaluation suite. User feedback, failed runs, poor answers, tool errors, and edge cases all reveal situations where the expected behavior was either missing, unclear, or not robust enough.

When we find a new failure pattern, we try to do more than fix the immediate issue. When the failure represents a behavior we want to protect in the future, we also capture it as a new evaluation task. This helps the team turn production learning into a more durable part of the development process.

Over time, this creates a useful feedback loop. The more we learn from production, the more representative our evaluation suite becomes. The more representative the suite becomes, the easier it is to make future changes with confidence.

What we measure

We use several types of evaluators depending on what we want to understand. Some are simple and deterministic, which makes them easier to debug and cheaper to run. Others use a grading agent with a rubric, especially when the quality dimension is more nuanced and cannot be captured with a simple rule.

The main evaluators we use today are:

Success: did the run complete the task successfully?
Tool errors: did the agent generate an error while using a tool?
Tool selection: did the agent use the expected tools for the task?
Recursion limit: did the agent hit the limit we impose on the loop?
Response quality: did the final answer satisfy the expected behavior?
Trace quality: did the agent reach the answer in an acceptable way?
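
The deterministic evaluators in this list are simple enough to sketch directly. The example below assumes a trace is available as a plain dictionary with a `tool_calls` list and a `steps` count, which is a hypothetical shape chosen for illustration. Response quality and trace quality are left out because they rely on a grading agent with a rubric rather than on a rule.

```python
def tool_errors(trace: dict) -> float:
    """1.0 if no tool call returned an error, else 0.0."""
    return 0.0 if any(call.get("error") for call in trace["tool_calls"]) else 1.0


def tool_selection(trace: dict, expected_tools: set[str]) -> float:
    """1.0 if every expected tool was called at least once.
    Assumption: extra tool calls are tolerated by this check."""
    used = {call["name"] for call in trace["tool_calls"]}
    return 1.0 if expected_tools <= used else 0.0


def recursion_limit(trace: dict, limit: int) -> float:
    """1.0 if the agent loop finished before reaching the step limit."""
    return 1.0 if trace["steps"] < limit else 0.0


def evaluate_trace(trace: dict, expected_tools: set[str], limit: int) -> dict[str, float]:
    """Run the deterministic evaluators on one trace."""
    return {
        "tool_errors": tool_errors(trace),
        "tool_selection": tool_selection(trace, expected_tools),
        "recursion_limit": recursion_limit(trace, limit),
    }
```

Because these checks are plain functions over the trace, a failing score can be debugged by reading the trace itself, with no second model involved.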

Trace quality is important because, for agents, a correct final sentence is not always enough. An agent can produce a plausible answer after using the wrong tool, ignoring part of the context, or guessing from incomplete data. In our domain, we need to understand not only what the agent answered, but also how it produced the answer. The trace lets us inspect the behavior behind the response and decide whether the run was genuinely successful, not only whether the final message looked acceptable.

We look at these evaluators through two lenses: capability and regression. Capability evaluations help us answer whether the agent can handle a new class of task. These tasks are often difficult, and some failures are expected. The goal is to create a hill to climb, so that when the score improves, we know the agent is becoming better at a behavior that matters.

Regression evaluations help us answer a different question: can the agent still do what it used to do? These tasks should remain stable across changes. A drop in score can signal that a prompt change, model change, schema change, or tool update affected an existing behavior.

Both lenses are necessary. Capability evaluations help us make progress, while regression evaluations help us keep the progress we already made.

Our implementation with LangSmith

Our agent framework is built on LangChain, so LangSmith was a natural fit for our evaluation workflow. We use it to store datasets, run evaluation suites, inspect traces, compare runs, and connect results directly to our development process.

Evaluation runs are triggered from our GitHub pipeline on feature branches and release branches, and the results are visible from the pull request. This is important because evaluations should not live in a separate research workflow used only by a few people. They need to be part of the normal engineering process, so that the developer changing the agent can see the effect of the change where the work is already happening.

This also creates a better review process. When we review a pull request, we can discuss the code, the product behavior, and the evaluation results together. We can see whether the change improves the target task, whether it creates side effects, and whether it changes cost, latency, or tool usage. The conversation becomes more concrete because everyone can look at the same traces, scores, and baseline comparison.

A concrete example: object references

One recent feature we worked on is object references. The idea is that when the agent mentions an object in Spark, the response should be able to reference that object directly. This could be a data object, a task, a file, a workflow run, or a user. If the answer comes from a third-party MCP tool, the same mechanism should also allow the response to include the relevant external reference.

From a product perspective, this makes the answer easier to use and easier to trust. The user gets more than a text response: they can click back to the source object, inspect the underlying context, and continue their work from there. It also gives the UI a better structure to render agent responses and allows the system to preserve the link between the answer and the operational object behind it.

To implement this, we introduced a structured response format. Instead of returning only free text, the agent returns both a response and a list of references. This gives the application a cleaner interface and makes the output easier to consume downstream.
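
A structured response of this kind might look like the following sketch, which parses a JSON payload into a response text and a list of typed references. The field names (`response`, `references`, `kind`, `object_id`) and the set of reference kinds are assumptions derived from the object types mentioned above, not the actual Spark schema.

```python
import json
from dataclasses import dataclass

# Hypothetical reference kinds, based on the object types named in the text.
ALLOWED_KINDS = {"data_object", "task", "file", "workflow_run", "user", "external"}


@dataclass(frozen=True)
class Reference:
    kind: str
    object_id: str


@dataclass(frozen=True)
class AgentResponse:
    response: str
    references: tuple[Reference, ...]


def parse_agent_response(raw: str) -> AgentResponse:
    """Parse and validate the agent's structured output."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("response"), str):
        raise ValueError("expected an object with a string 'response'")
    references = []
    for item in data.get("references", []):
        kind, object_id = item.get("kind"), item.get("object_id")
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown reference kind: {kind!r}")
        if not isinstance(object_id, str) or not object_id:
            raise ValueError("reference needs a non-empty 'object_id'")
        references.append(Reference(kind, object_id))
    return AgentResponse(data["response"], tuple(references))
```

Validating at the boundary means the UI only ever receives references it knows how to render, and a malformed output shows up as an explicit error instead of a broken link.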

The first implementation looked right from a product and architecture point of view, but the evaluation run showed that it reduced quality on some existing tasks.

When we inspected the traces, we saw that the agent was now solving two problems at the same time: answering the user and producing a valid reference structure. On simple tasks, this worked well. On more complex analytical tasks, the additional output constraint sometimes competed with the main reasoning task. The agent became more likely to over-focus on producing references, include weak references, or simplify the answer to fit the output shape.

This is a good example of why evaluations are useful in practice. The feature made sense from an interface point of view, it improved the product contract, and it passed normal implementation checks. Still, it changed agent behavior in a way that only became visible when we ran it across realistic tasks.

The fix was not to remove structured outputs, but to make the expected behavior more precise. We adjusted when references are required, clarified what qualifies as a valid reference, improved the output instructions, and added targeted tasks to the evaluation suite. The result was a better feature and a stronger test suite.
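
One of the targeted checks could resemble the sketch below. It rests on an assumption the text does not spell out: that a reference counts as valid only if the object it points to was actually observed in a tool result during the run. The function name, the inputs, and that definition of validity are hypothetical.

```python
def check_references(references: list[dict], observed_ids: set[str],
                     references_required: bool) -> dict:
    """Check that references are grounded in objects the agent actually saw.

    references:          references returned by the agent, each with an "object_id"
    observed_ids:        ids of objects that appeared in tool results during the run
    references_required: whether this task expects at least one reference
    """
    grounded = [r for r in references if r["object_id"] in observed_ids]
    ungrounded = [r for r in references if r["object_id"] not in observed_ids]
    missing_required = references_required and not grounded
    return {
        "grounded": len(grounded),
        "ungrounded": len(ungrounded),
        # Fails on weak references and on missing ones when they are required.
        "passed": not ungrounded and not missing_required,
    }
```

Making `references_required` part of the task lets analytical questions pass without references while lookup-style tasks still demand them.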

Conclusion

Building agents is different from building classic software, but the engineering principle remains the same: we need to make quality visible. Evaluation-driven development helps us do that by giving the team a clearer way to define expected behavior, measure progress, and detect regressions before they reach production.

It also helps us turn what we learn into something durable. When an agent fails, we do more than fix the immediate issue: we capture the failure as a future safeguard. When we improve a capability, we can measure whether the improvement holds across realistic tasks. And when we change prompts, tools, models, or schemas, we can compare the new behavior against a baseline rather than relying only on manual impressions.

For us, this is not only an internal engineering practice. It is also part of the product we want to offer. If Spark enables teams to build their own agents, Spark should also help them measure those agents in their own environment, with their own data, tools, skills, and requirements.

This is how we believe agents move from impressive demos to reliable systems: not by removing uncertainty entirely, but by making behavior observable, learning from real usage, and improving it release after release.