Evaluation Studio Overview
Evaluation Studio is a unified workspace for evaluating AI system performance across two main areas: Model Evaluation and Agentic Evaluation. It enables you to systematically assess both the quality of large language model (LLM) outputs and the behavior of agentic applications in real-world scenarios. Together, these capabilities provide a comprehensive foundation for improving LLM quality and agentic app behavior. Whether you're validating prompt effectiveness, debugging tools, or auditing full workflows, Evaluation Studio supports scalable, data-driven iteration, helping you build safer, more reliable, and higher-performing AI systems.
Model Evaluation
Model Evaluation enables you to assess the performance of large language models (LLMs) using configurable quality and safety metrics. You can:
- Upload datasets with input-output pairs.
- Apply built-in or custom evaluators.
- Analyze model effectiveness through visual scoring, thresholds, and collaborative projects (see the sketch after this list).
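The sketch below is a minimal, illustrative example of what an input-output dataset and a threshold-based custom evaluator might look like. The JSONL schema, the `length_evaluator` function, and the 0.8 pass threshold are assumptions chosen for illustration; they are not the documented Evaluation Studio format or SDK.

```python
# Illustrative only: the dataset schema and evaluator interface below are
# assumptions, not the documented Evaluation Studio format.
import json

# A hypothetical input-output dataset, one record per line (JSONL).
records = [
    {"input": "Summarize the refund policy in one sentence.",
     "output": "Refunds are issued within 14 days of purchase."},
    {"input": "Translate 'good morning' to French.",
     "output": "Bonjour."},
]

with open("eval_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# A hypothetical custom evaluator: scores each output and applies a pass/fail threshold.
def length_evaluator(record: dict, max_words: int = 30) -> dict:
    """Score 1.0 when the output stays under the word budget, scale down otherwise."""
    words = len(record["output"].split())
    score = min(1.0, max_words / max(words, 1))
    return {"score": round(score, 2), "passed": score >= 0.8}

# Run the evaluator over the dataset and inspect the per-record results.
with open("eval_dataset.jsonl") as f:
    results = [length_evaluator(json.loads(line)) for line in f]
print(results)
```

In practice, built-in evaluators would replace the hand-written scoring function, and the threshold would be configured in the project rather than hard-coded.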
Agentic Evaluation
Agentic Evaluation assesses how effectively an agentic app performs in both production and pre-production environments, whether by importing real session data from deployed apps or by generating simulated sessions to test behavior before go-live. You can:
- Import app sessions and traces from production or simulations.
- Generate simulated sessions using Personas and Test Scenarios to validate agent behavior before deployment.
- Run multi-level evaluations to see how well the app achieves goals, follows workflows, and uses tools.
- Analyze inputs and outputs across supervisors, agents, and tools (see the sketch after this list).
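The sketch below illustrates the general shape of a simulated session built from a persona and a test scenario, plus a simple goal-level check on tool coverage. The `Persona`, `TestScenario`, and `SimulatedSession` structures and the role naming are hypothetical, chosen only to show the idea; they are not the Evaluation Studio API.

```python
# Illustrative only: these data structures are assumptions meant to show the
# shape of a simulated agentic session, not the Evaluation Studio API.
from dataclasses import dataclass, field

@dataclass
class Persona:
    name: str
    goal: str
    tone: str

@dataclass
class TestScenario:
    description: str
    expected_workflow: list[str]   # steps the agent should follow
    expected_tools: list[str]      # tools the agent should call

@dataclass
class SimulatedSession:
    persona: Persona
    scenario: TestScenario
    turns: list[dict] = field(default_factory=list)  # user/agent/tool events

    def add_turn(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

# Build one simulated session before deployment.
persona = Persona(name="Frustrated shopper", goal="Get a refund", tone="impatient")
scenario = TestScenario(
    description="Refund request routed through the order-lookup tool",
    expected_workflow=["greet", "lookup_order", "issue_refund"],
    expected_tools=["order_lookup", "refund_api"],
)
session = SimulatedSession(persona=persona, scenario=scenario)
session.add_turn("user", "I want my money back for order 1234.")
session.add_turn("tool:order_lookup", '{"order_id": 1234, "status": "delivered"}')
session.add_turn("tool:refund_api", '{"order_id": 1234, "refund": "issued"}')
session.add_turn("agent", "Your refund for order 1234 has been issued.")

# A simple goal-level check: did the session exercise every expected tool?
tools_used = {t["role"].split(":", 1)[1] for t in session.turns if t["role"].startswith("tool:")}
print("expected tools covered:", tools_used >= set(scenario.expected_tools))
```

Multi-level evaluation in the product would layer goal, workflow, and tool checks over real or simulated sessions like this one.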
Accessing Evaluation Studio
- Log in to your Platform account.
- From the modules menu, select Evaluation Studio.
- On the Evaluation page, select Model evaluation or Agentic evaluation to begin.