Evaluation Studio Overview
Evaluation Studio is a unified workspace for evaluating AI system performance across two main areas: Model Evaluation and Agentic Evaluation. It enables you to systematically assess both the quality of large language model (LLM) outputs and the behavior of agentic applications in real-world scenarios. Together, these capabilities provide a comprehensive foundation for improving LLM quality and agentic app behavior. Whether you're validating prompt effectiveness, debugging tools, or auditing full workflows, Evaluation Studio supports scalable, data-driven iteration, helping you build safer, more reliable, and higher-performing AI systems.
Model Evaluation
Model Evaluation enables you to assess the performance of large language models (LLMs) using configurable quality and safety metrics. You can:
- Upload datasets with input-output pairs.
- Apply built-in or custom evaluators.
- Analyze model effectiveness through visual scoring, thresholds, and collaborative projects (see the sketch after this list).
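The sketch below is a minimal, illustrative example of what an input-output dataset and a threshold-based custom evaluator might look like. The JSONL schema, the `length_evaluator` function, and the 0.8 pass threshold are assumptions chosen for illustration; they are not the documented Evaluation Studio format or SDK.

```python
# Illustrative only: the dataset schema and evaluator interface below are
# assumptions, not the documented Evaluation Studio format.
import json

# A hypothetical input-output dataset, one record per line (JSONL).
records = [
    {"input": "Summarize the refund policy in one sentence.",
     "output": "Refunds are issued within 14 days of purchase."},
    {"input": "Translate 'good morning' to French.",
     "output": "Bonjour."},
]

with open("eval_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# A hypothetical custom evaluator: scores each output and applies a pass/fail threshold.
def length_evaluator(record: dict, max_words: int = 30) -> dict:
    """Score 1.0 when the output stays under the word budget, scale down otherwise."""
    words = len(record["output"].split())
    score = min(1.0, max_words / max(words, 1))
    return {"score": round(score, 2), "passed": score >= 0.8}

# Run the evaluator over the dataset and inspect the per-record results.
with open("eval_dataset.jsonl") as f:
    results = [length_evaluator(json.loads(line)) for line in f]
print(results)
```

In practice, built-in evaluators would replace the hand-written scoring function, and the threshold would be configured in the project rather than hard-coded.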
Agentic Evaluation
Agentic Evaluation assesses how effectively an agentic app performs in both production and pre-production environments, whether by importing real session data from deployed apps or by generating simulated sessions to test behavior before go-live. You can:
- Import app sessions and traces from production or simulations.
- Generate simulated sessions using Personas and Test Scenarios to validate agent behavior before deployment.
- Run multi-level evaluations to see how well the app achieves goals, follows workflows, and uses tools.
- Analyze inputs and outputs across supervisors, agents, and tools (see the sketch after this list).
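The sketch below illustrates the general shape of a simulated session built from a persona and a test scenario, plus a simple goal-level check on tool coverage. The `Persona`, `TestScenario`, and `SimulatedSession` structures and the role naming are hypothetical, chosen only to show the idea; they are not the Evaluation Studio API.

```python
# Illustrative only: these data structures are assumptions meant to show the
# shape of a simulated agentic session, not the Evaluation Studio API.
from dataclasses import dataclass, field

@dataclass
class Persona:
    name: str
    goal: str
    tone: str

@dataclass
class TestScenario:
    description: str
    expected_workflow: list[str]   # steps the agent should follow
    expected_tools: list[str]      # tools the agent should call

@dataclass
class SimulatedSession:
    persona: Persona
    scenario: TestScenario
    turns: list[dict] = field(default_factory=list)  # user/agent/tool events

    def add_turn(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

# Build one simulated session before deployment.
persona = Persona(name="Frustrated shopper", goal="Get a refund", tone="impatient")
scenario = TestScenario(
    description="Refund request routed through the order-lookup tool",
    expected_workflow=["greet", "lookup_order", "issue_refund"],
    expected_tools=["order_lookup", "refund_api"],
)
session = SimulatedSession(persona=persona, scenario=scenario)
session.add_turn("user", "I want my money back for order 1234.")
session.add_turn("tool:order_lookup", '{"order_id": 1234, "status": "delivered"}')
session.add_turn("tool:refund_api", '{"order_id": 1234, "refund": "issued"}')
session.add_turn("agent", "Your refund for order 1234 has been issued.")

# A simple goal-level check: did the session exercise every expected tool?
tools_used = {t["role"].split(":", 1)[1] for t in session.turns if t["role"].startswith("tool:")}
print("expected tools covered:", tools_used >= set(scenario.expected_tools))
```

Multi-level evaluation in the product would layer goal, workflow, and tool checks over real or simulated sessions like this one.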
Accessing Evaluation Studio
- Log in to your Platform account.
- From the modules menu, select Evaluation Studio.
- On the Evaluation page, select Model evaluation or Agentic evaluation to begin.