Assess and improve end-to-end AI agent performance systematically across sessions, traces, and individual decision points.

Overview

Agentic Evaluation is a comprehensive framework for systematically analyzing AI agent performance in real-world production scenarios. It enables structured evaluation of complete AI agent trajectories across entire sessions and individual decision points within traces, providing both high-level and granular insights into how AI agents reason, act, and interact over time.

Key Features

  • Model Trace Analysis: Import and evaluate app sessions and traces from deployed apps (production data) or simulated sessions.
  • Simulation Support: Generate simulated sessions using Personas and Test Scenarios to validate agent behavior before go-live and test edge cases safely.
  • Multi-level Evaluation: Assess AI behavior across different layers—sessions (for example, goal achievement or tone), traces (for example, agent selection or tool usage), and specific interactions.
  • Evaluator Library: Apply predefined evaluators to assess the quality and effectiveness of Agentic app behavior.
  • Interactive Analysis Tools: Use scorecards and trace visualizations to explore evaluation results, drill into sessions or traces, and inspect full execution paths—including supervisors, agents, and tools—to pinpoint errors, inefficiencies, or optimization opportunities.
  • Actionable Insights: Identify failures, deviations, or redundant interactions to continuously improve your app’s responsiveness, reliability, and user satisfaction.

Key Benefits

  • Continuous Improvement: Understand how your AI agents behave in production and refine them based on real usage patterns.
  • Scalable Quality Assurance: Evaluate thousands of sessions in bulk using automated evaluators.
  • Informed Decision Making: Use evaluation data to prioritize fixes, redesign workflows, or refine prompts and tools.
  • Stronger Agentic Design: Gain visibility into how supervisors, agents, and tools interact, allowing you to design more robust, reliable, and context-aware agentic applications.

User Journey

The key actions at each stage of the Agentic Evaluation workflow:
  ┌───────────────────────┐
  │  1. Create Project    │
  └───────────┬───────────┘
              ▼
  ┌───────────────────────┐
  │ 2. Choose Data Source │
  └───────────┬───────────┘
      ┌───────┴─────────────────────────┐
      ▼                                 ▼
  ┌──────────────────────┐   ┌──────────────────────┐
  │   SIMULATED DATA     │   │   PRODUCTION DATA    │
  ├──────────────────────┤   ├──────────────────────┤
  │ Create Personas      │   │ Import sessions      │
  │         ▼            │   │ from deployed        │
  │ Define Test Scenarios│   │ Agentic apps         │
  │         ▼            │   └──────────┬───────────┘
  │ Run Simulations      │              │
  └──────────┬───────────┘              │
             └────────────┬────────────-┘
                          ▼
  ┌──────────────────────────┐
  │ 3. Create an Evaluation  │
  └──────────┬───────────────┘
             ▼
  ┌──────────────────────────┐
  │ 4. Import Data           │
  └──────────┬───────────────┘
             ▼
  ┌──────────────────────────┐
  │ 5. Configure Evaluators  │
  └──────────┬───────────────┘
             ▼
  ┌──────────────────────────┐
  │ 6. Run the Evaluation    │
  └──────────┬───────────────┘
             ▼
  ┌──────────────────────────┐
  │ 7. View & Analyze Results│
  └──────────────────────────┘

Create a Project

Before running any evaluations, you need to create a project in Evaluation Studio. A project serves as a workspace dedicated to a single Agentic app, where all related evaluations and session data are organized.
Note: Projects are scoped to a single Agentic app. All evaluations related to that app remain part of this project. You can continue to import more sessions to the same project as needed; however, projects are limited to sessions from the same app version and environment.
Steps to create a new project:
  1. Log in to the Platform and go to Evaluation Studio.
  2. In the left pane, select Agentic Evaluation.
  3. Select New Project.
    1. Enter a name for your project.
    2. Select the Agentic app for which you are creating the project.
    3. Select the environment for the app (for example, Draft, Testing, or Production). The selected environment determines which version of the agent is used when running simulations.
  4. On the Quick Overview page, choose your data path:
    • Simulated data: Create Personas → Define Test Scenarios → Run Simulations to import simulated sessions into an evaluation.
    • Production data: Select Evaluations to directly import sessions from deployed Agentic apps.

Simulations

Simulations enable automated creation of test sessions so you can evaluate your agent before your Agentic app goes live. Using simulations, you can test different scenarios with various personas and see how your agent performs across contexts and input types—without relying on real user data. Simulated evaluations help you:
  • Validate agent responses before go-live.
  • Benchmark results against desired outputs.
  • Iterate on logic, tone, or context handling.
  • Improve overall agent reliability and trustworthiness.

How It Works

Simulations create realistic testing conditions by linking three key components: Personas, Test Scenarios, and Simulation Runs.
  1. Create Personas — Define user profiles representing different end-user types with configurable traits such as demographics, tone, emotion, urgency, and intent patterns.
  2. Define Test Scenarios — Create or reuse test cases that represent typical or edge user queries, capturing the context and expected agent behavior for each situation.
  3. Run Simulations — Combine selected personas and test scenarios to generate simulated sessions. The system automatically generates queries and feeds them to the agent to simulate realistic mock conversations.
  4. Use Simulated Data for Evaluation — Import simulated sessions into Evaluation Studio and run evaluators to assess performance, tone, adherence, and tool accuracy.
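For intuition, the persona-scenario pairing in step 3 can be sketched in Python. The dict shapes below are hypothetical illustrations, not the platform's actual data model:

```python
from itertools import product

def build_simulation_queries(personas, scenarios):
    """Cross each persona with each test scenario to produce the query
    set a simulation run executes. Field names here are illustrative."""
    return [
        {
            "persona": persona["name"],
            "scenario": scenario["name"],
            "initial_query": scenario["initial_user_query"],
        }
        for persona, scenario in product(personas, scenarios)
    ]
```

With two personas and one scenario, this yields two queries, one per Persona × Test Scenario combination.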

Create Personas

Personas represent different types of users who interact with your agent. Each persona simulates distinct communication styles, goals, and behavior patterns. During simulation runs, they help test how your agent performs with varied user contexts and tones.
Steps to create a persona:
  1. Go to Evaluation Studio → Agentic Evaluation → Personas.
  2. Click Create Persona.
  3. Enter a Persona name and Description.
  4. (Optional) Add additional persona attributes such as role, gender, language, emotion, and patience or urgency levels.
  5. Click Save.
Use the Additional Information field to configure specific behaviors for the persona. This lets you define and control how the persona interacts or responds.
You can create multiple personas to represent different user types. Each persona can be used in one or more simulations. After creating personas, you can edit, duplicate, or delete them anytime from the Personas page. When a persona is deleted, any active simulations using it automatically switch to the default Unidentified Persona.
Personas are reusable across multiple simulations within the same Agentic Evaluation project.

Create Test Scenarios

Test scenarios define the context, user intent, and expected outcome for a simulated interaction. Each scenario captures the situation, input details, and desired response to help evaluate how the agent performs under different conditions. You can create test scenarios in two ways:
  • Write test scenarios — Manually define details such as the user query, context, and expected output.
  • Generate test scenarios — Automatically generate multiple scenarios using AI, based on your agent’s configured capabilities.

Write Test Scenarios

To manually create a test scenario:
  1. Go to Evaluation Studio → Agentic Evaluation → Test Scenarios.
  2. Click Write test scenario.
  3. Enter the following details:
    • Agentic capability — The specific skill or function being tested.
    • Initial user query — The user’s opening message or intent.
    • Information provided to persona — Any context or background knowledge for the persona.
    • End condition — A plain-text condition that signals when the simulation should stop. The simulation also stops automatically if the agent ends the conversation early; if the end condition is not reached, it stops at Max turns.
    • Max turns — The maximum number of turns before the simulated conversation ends.
  4. Click Save.
Provide any specific details required for the task—such as account numbers, external links, ticket IDs, or user IDs—so the simulation agent can supply the necessary context for a complete multi-turn conversation.
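For illustration, a written test scenario might be represented in code as follows. The field names mirror the form fields above; the shape (and the max_turns value) are hypothetical:

```python
# Hypothetical in-code representation of a test scenario; the platform
# stores scenarios internally, so this shape is illustrative only.
refund_scenario = {
    "agentic_capability": "Handle refund requests",
    "initial_user_query": "I canceled my order yesterday, why isn't my refund showing up?",
    "information_provided_to_persona": "Order ID: 123876452",
    "end_condition": "Agent confirms refund initiation",
    "max_turns": 8,  # assumed value for illustration
}

def validate_scenario(scenario):
    """Reject scenarios missing required fields or with a non-positive max_turns."""
    required = {"agentic_capability", "initial_user_query", "end_condition", "max_turns"}
    missing = required - scenario.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(scenario["max_turns"], int) or scenario["max_turns"] < 1:
        raise ValueError("max_turns must be a positive integer")
    return scenario
```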

Generate Test Scenarios

To automatically generate test scenarios:
  1. On the Test Scenarios page, click Generate test scenarios.
  2. Specify the following details:
    • Model and model parameters to use for generation.
    • Agentic capabilities to cover (for example, Transfer balance).
    • Number of test scenarios per capability (maximum: 15).
    • Information provided to the persona, such as background or knowledge context.
  3. Click Generate.
Provide any required details—such as account numbers, external links, ticket IDs, or user IDs—when generating scenarios. These details are appended to all generated scenarios and used only when needed by the simulation agent to complete the multi-turn conversation.
The system automatically generates multiple varied test queries for each capability, including both normal and negative cases. After creation, you can edit, duplicate, or delete scenarios anytime from the Test Scenarios page.
Manual and AI-generated scenarios can coexist and are reusable across multiple simulations within the Agentic Project.
Example:
  • Scenario name: Refund flow for impatient customer
  • Agentic capability: Handle refund requests
  • Information provided to persona: Order ID: 123876452
  • User query: “I canceled my order yesterday — why isn’t my refund showing up?”
  • End condition: Agent confirms refund initiation

Run Simulations

After creating personas and test scenarios, run a simulation to generate realistic interaction sessions. Simulations produce mock conversations that help you evaluate agent responses, measure consistency, track goal completion, and identify areas for improvement.
The Simulation module becomes active only when at least one persona and one test scenario exist in your workspace.
Steps to create and run a simulation:
  1. Go to Evaluation Studio → Agentic Evaluation → Simulations.
  2. Click Create Simulation.
  3. Specify the following:
    1. Name for your simulation.
    2. Model and model parameters to use.
    3. Agentic capabilities to filter available test scenarios.
    4. One or more personas and test scenarios.
  4. Click Run.
The system executes mock sessions for all generated test queries. Monitor progress in real time from the Simulation Runs panel.
What happens during a simulation:
  • The system combines each Persona × Test Scenario to generate multiple test queries.
  • Each query includes the first user input and context/knowledge for follow-up interactions.
  • Queries are passed to the agent to create mock conversations (each conversation representing one simulation session).
  • Outputs are recorded, including agent responses, scenario and persona information, and metadata.
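The turn loop described above can be sketched as follows. All callables are hypothetical stand-ins for the platform's persona simulator, agent, and end-condition check:

```python
def run_mock_session(initial_query, agent_reply, persona_reply, end_condition_met, max_turns):
    """One simulated session: the persona and agent alternate turns until
    the end condition is met or max_turns agent turns have elapsed."""
    transcript = [("user", initial_query)]
    for _ in range(max_turns):
        transcript.append(("agent", agent_reply(transcript)))
        if end_condition_met(transcript):
            break  # the scenario's end condition was reached
        transcript.append(("user", persona_reply(transcript)))
    return transcript
```

The returned transcript is the mock conversation that becomes one simulation session.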
After the simulation run completes:
  • View conversation transcripts between personas and the agent.
  • Analyze how well the agent handled each scenario.
  • Identify areas for improvement before production deployment.

Create an Evaluation

Once your project is set up, the next step is to create an evaluation where session data can be analyzed. Evaluations act as containers for imported sessions and are the starting point for applying evaluators.
Steps to create an evaluation:
  1. On the Session Evaluation page, click Create Evaluation.
  2. Enter a name for your evaluation.
  3. Select the data source and click Create:
    • Production Data — Sessions generated by real users in your deployed Agentic apps.
    • Simulated Data — Sessions generated from one or more simulation runs.
After you create the evaluation, it appears in the Evaluations list, where you can import data, configure evaluators, and start analysis.

Import Data

To evaluate how your AI agents perform, import data into Evaluation Studio. You can choose between:
  • Production data — Sessions generated by real users in your deployed Agentic apps, ensuring evaluation is based on actual user interactions. Sourced from Agentic app sessions and traces.
  • Simulated data — Sessions generated from simulation runs, allowing you to test agent performance in controlled, repeatable scenarios.

Import Production Data

Prerequisites: Make sure you have already interacted with the app and generated conversation sessions in your production environment.
Steps to import production data:
  1. Open your evaluation from Evaluation Studio → Agentic Evaluation → Evaluations.
  2. Select the evaluation and click Import sessions.
  3. In the Import Session dialog, enter the following and click Import:
    • Version — Select the version of the Agentic app to evaluate.
    • Environment — Choose the environment that contains the session data.
    • Date — Select the time range to filter sessions by a specific release or timeframe.
After the import is complete, session data appears in the Sessions and Traces tabs on the Evaluation page.

Import Simulated Data

Prerequisites: At least one simulation run with the relevant personas and test scenarios must exist.
Steps to import simulated data:
  1. Open your evaluation from Evaluation Studio → Agentic Evaluation → Evaluations.
  2. Select the evaluation and click Import simulated data.
  3. Choose the Simulation from which to import sessions and click Import.
After the import is complete, simulated sessions appear in the Sessions and Traces tabs, just like production data.

Understanding the Imported Data

Imported data is organized into two tabs:
  • Sessions — Displays the list of sessions with details like session ID, number of traces, creation time, and duration. Add session-level evaluators here to measure overall outcomes.
  • Traces — Breaks sessions into individual traces, each representing one pair of user input and Agentic app response. Add trace-level evaluators to check specific actions, like whether the correct agent or tool was used.

Configure Evaluators

After importing sessions, apply pre-built evaluators to measure your AI agent’s behavior and performance. Evaluations can be run at both the session and trace levels.
Steps to add evaluators:
  1. On the Imported Sessions page, click + Evaluators.
  2. Select the desired session or trace-level evaluators.
  3. Configure the evaluator settings (such as selecting the model and adjusting model configurations) and click Save.
  4. Click Run Evaluation to start the evaluation process.

Types of Evaluators

Agentic Evaluation uses two evaluator types to provide insights at both the overall conversation level and individual decision points.

Session Evaluators

Session evaluators assess the overall quality of a full session—a complete conversation between the user and the Agentic app, consisting of multiple traces.
  • Final Goal Check: Determines whether the agent successfully achieved the intended outcome by the end of the session.
  • Message Tonality: Evaluates whether the conversation maintained an appropriate and consistent tone (such as professional or friendly).
  • Topic Adherence: Assesses whether the agent stayed within the defined knowledge boundaries and relevant topics, avoiding unrelated areas.
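Conceptually, a session evaluator such as Final Goal Check is an LLM-judged pass/fail check over the whole transcript. A minimal Python sketch, assuming a hypothetical `judge` callable in place of the model call:

```python
def final_goal_check(transcript, goal, judge):
    """Session-level evaluator sketch. `judge` stands in for an LLM call
    returning True/False (its signature is an assumption); the evaluator
    maps that verdict to the pass/fail result shown in the session grid."""
    achieved = judge(
        f"By the end of this session, did the agent achieve the goal: {goal}?",
        transcript,
    )
    return {"evaluator": "Final Goal Check", "result": "pass" if achieved else "fail"}
```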

Trace Evaluators

Trace evaluators operate at a more granular level, analyzing decisions made within individual traces. A trace represents a single interaction cycle: one user query and the Agentic app’s response.
  • Agent Call Accuracy: Evaluates whether the supervisor/orchestrator calls the correct agents based on the user’s query.
  • Tool Call Accuracy: Checks if the agent invokes the correct tools with the right parameters based on context and intent.
Since traces are parts of sessions, you can add trace-level evaluators from the Sessions tab. However, session-level evaluators cannot be added from the Traces tab.

Run the Evaluation

Once sessions are imported and evaluators are configured, click Run Evaluation to apply the selected evaluators to the selected sessions or traces. The system runs all applicable evaluators, processes the data, and computes scores for each relevant session or trace segment. Results are displayed in the session grid, with each evaluator contributing one or more dedicated columns.
Evaluations can be run multiple times—for example, after updating agent logic, adding new simulated sessions, or importing additional production data. This allows you to benchmark performance across different agent versions and testing conditions.

View Evaluation Results

After the evaluation run completes, the session grid is updated with evaluator-specific columns displaying average scores for each session or trace. Clicking on a score allows you to drill down into detailed results.
  • In the Sessions tab, both session-level and trace-level evaluation results are visible.
  • In the Traces tab, only trace-level results are shown.

Analyze Evaluation Results

After running evaluations, explore results through an interactive interface designed to reveal how your Agentic app performed at both session and trace levels.
What you’ll see:
  • Summary scores at a glance — Scan evaluator columns to compare scores across sessions.
  • Drill-down views — Click into any session or trace to view a step-by-step evaluation breakdown, scores at the Supervisor, Agent, and Tool levels, inputs/outputs that were evaluated, and explanations behind each score.

Understanding the Session Table

Each imported session appears as a row in the table with:
  • Session metadata — Session ID, start time, duration, and number of traces.
  • Evaluator-specific columns — Each evaluator adds a column showing a summary score (numeric, categorical, or custom depending on the evaluator type).
Use the Sessions tab to view overall scores per session, including averages of trace-level evaluators. Use the Traces tab to see detailed scores for each step in a session.
Steps to analyze your evaluation:
  1. On the Sessions tab:
    • Click any session-level evaluator score (for example, Topic Adherence, Message Tonality, or Final Goal Check) to open a detailed view showing how the score was calculated.
    • Alternatively, click any session ID to open a filtered view in the Traces tab showing all traces for that session.
  2. On the Traces tab, click any trace-level evaluator score (for example, Tool Call Accuracy, Agent Call Accuracy) to open a detailed tree view showing reasoning steps and associated scores.

Session View Overview

Clicking a session score opens the Session Details View with two tabs:
  • Verdict — Displays the evaluation score (for example, pass/fail for Message Tonality) along with a written explanation.
  • Transcript — Shows the full conversation for that session.

Trace View Overview

Clicking a trace opens the Trace Details View with two panels.
Left Panel:
  • Displays the trace tree visualization.
  • Shows the reasoning structure across three levels: Supervisor, Agent, and Tool.
  • Each node is clickable; selecting a node updates the right panel with its corresponding evaluators.
Right Panel:
  • Evaluators tab — Displays all evaluators applied at the selected node level with their results.
  • Input/Output tab — Shows relevant inputs and outputs for the selected node.
The trace tree displays the hierarchical structure of the agentic system.
Root level (Trace root):
  • Represents the entire trace and aggregates all evaluators applied to it.
  • Clicking the root node displays all attached evaluators (for example, Agent Call Accuracy and Tool Call Accuracy) in the right panel.
Supervisor level:
  • Contains evaluations assessing whether the supervisor called the correct agents.
  • Clicking a supervisor node shows only supervisor-specific evaluators (for example, Agent Call Accuracy).
Agent level:
  • Represents actions and decisions made by an agent.
  • Displays only agent-specific evaluators (for example, Tool Call Accuracy).
  • Clicking an agent node reveals tool usage and corresponding evaluation results.
Tool level:
  • Represents individual tool calls made by the agent.
  • Clicking a tool node opens the Tool Call Results view displaying the evaluation score, evaluator explanations, and the tool’s input and output.

Understanding Evaluation Scores

Evaluation scores are assigned based on how well the Agentic app performed according to specific criteria.

Session Evaluator Scoring

  • Final Goal Check: Pass if the agent achieved the intended goal by the end of the session; fail if the agent did not fulfill the user’s objective.
  • Message Tonality: Pass if the tone is professional, friendly, or appropriate; fail if the tone is inconsistent or inappropriate.
  • Topic Adherence: Pass if the agent stays within defined knowledge boundaries; fail if the agent goes off-topic or outside its defined scope.

Trace Evaluator Scoring

Agent Call Accuracy:
  • Score 1 (Good) — The correct agent is selected.
  • Score 0 (Bad) — The wrong agent is selected.
Tool Call Accuracy:
  • Score 1 (Good) — The right tool is called with the correct parameters. For example, the user asks for the weather in Chicago, and the agent calls the weather tool with Chicago as the location.
  • Score 0.5 (Partial) — The right tool is used, but parameters are incorrect. For example, the agent calls the weather tool but passes “New York” instead of “Chicago.”
  • Score 0 (Bad) — The agent calls the wrong tool and also provides incorrect or missing parameters.
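The Tool Call Accuracy rubric above can be sketched as a scoring function. This simplifies to exact matching; the platform's actual evaluator judges tool and parameter choice in context:

```python
def tool_call_accuracy(expected_tool, expected_params, actual_tool, actual_params):
    """Simplified 1 / 0.5 / 0 rubric for tool-call scoring (illustrative)."""
    if actual_tool != expected_tool:
        return 0.0  # wrong tool
    if actual_params != expected_params:
        return 0.5  # right tool, wrong or missing parameters
    return 1.0      # right tool, right parameters
```

For the weather example above, passing “New York” instead of “Chicago” to the weather tool scores 0.5.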

Error Scenarios

During evaluation, you may encounter issues that affect scoring or analysis. These are surfaced clearly in the interface to help you diagnose and resolve problems.
Model errors (401):
  • Token limit exceeded: The input to the model exceeded its maximum token limit.
  • Rate limit: Too many requests were made in a short time, exceeding API limits.
  • Context length exceeded: The evaluation payload was too large for the model’s context window.
  • Malformed JSON in model output: The model returned a response that couldn’t be parsed (for example, missing brackets or invalid formatting).
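As a hedged sketch, a client might cope with the rate-limit and malformed-JSON errors above by retrying with exponential backoff; `call_model` here is a hypothetical stand-in for the model API, not a real platform function:

```python
import json
import time

def call_with_retries(call_model, payload, max_attempts=3, backoff_s=1.0):
    """Retry a model call that may return malformed JSON or raise a
    transient error; re-raise once max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return json.loads(call_model(payload))  # raises on malformed JSON
        except (json.JSONDecodeError, RuntimeError):
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_s * 2 ** attempt)  # back off before retrying
```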
Missing tool or agent calls: If a trace does not contain any tool or agent calls, the corresponding evaluator cell is highlighted in yellow with a message such as No tool calls found or No agent calls found. For example, if a user says “Hi! How are you?” and the agent replies with small talk without triggering any tool calls, the evaluator cell displays a warning.