Evaluation Types
Model Evaluation
Model Evaluation enables you to assess LLM performance using configurable quality and safety metrics. Organize datasets into projects and evaluations, apply built-in or custom evaluators, and analyze results through visual scoring and collaborative workflows. Key capabilities:
- Upload datasets with input-output pairs, or generate outputs from a deployed model.
- Apply Quality, Safety, and RAGAS evaluators, or create custom evaluators tailored to your needs.
- Integrate external outputs via Run a Prompt, Run an API, or Run Search AI.
- Analyze model effectiveness through visual scoring, pass/fail thresholds, and exportable results.
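To make the workflow concrete, the sketch below shows what a dataset of input-output pairs and a simple custom evaluator with a pass/fail threshold could look like. The field names (`input`, `expected_output`, `model_output`) and the evaluator logic are illustrative assumptions, not the product's actual schema or API:

```python
# Hypothetical sketch: a dataset of input-output pairs plus a toy custom
# evaluator scored against a pass/fail threshold. Field names are
# illustrative only, not the platform's real schema.

dataset = [
    {"input": "What is the capital of France?",
     "expected_output": "Paris",
     "model_output": "The capital of France is Paris."},
    {"input": "Summarize: The cat sat on the mat.",
     "expected_output": "A cat sat on a mat.",
     "model_output": "A cat was sitting on a mat."},
]

def contains_expected(record):
    """Toy quality evaluator: pass if the expected answer appears verbatim."""
    return record["expected_output"].lower() in record["model_output"].lower()

def score(records, evaluator, threshold=0.5):
    """Return the overall pass rate and a verdict against the threshold."""
    passes = sum(evaluator(r) for r in records)
    rate = passes / len(records)
    return rate, rate >= threshold

rate, passed = score(dataset, contains_expected)
print(f"pass rate = {rate:.2f}, passed = {passed}")
```

A built-in or custom evaluator in the product plays the role of `contains_expected` here: it scores each record, and the aggregate is compared against a configurable threshold to produce the pass/fail result shown in the evaluation views.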
Agentic Evaluation
Agentic Evaluation assesses how effectively an agentic app performs in both production and pre-production environments. Import real session data from deployed apps or generate simulated sessions using Personas and Test Scenarios to validate behavior before go-live. Key capabilities:
- Import app sessions and traces from production or simulations.
- Generate simulated sessions using Personas and Test Scenarios to test agent behavior before deployment.
- Run multi-level evaluations across sessions and traces to assess goal achievement, workflow adherence, and tool usage.
- Analyze inputs and outputs across supervisors, agents, and tools to uncover coordination issues and optimization opportunities.
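As a rough illustration of what a multi-level evaluation over sessions and traces involves, the sketch below models a session with per-step traces across a supervisor, an agent, and its tools, then checks goal achievement and tool usage. The session structure, field names, and checks are hypothetical, not the platform's actual session or trace format:

```python
# Hypothetical sketch: an agent session with per-step traces, evaluated
# for goal achievement and required tool usage. Structure and field
# names are illustrative assumptions, not the platform's real format.

session = {
    "session_id": "sess-001",
    "goal": "book a flight",
    "goal_achieved": True,
    "traces": [
        {"actor": "supervisor", "action": "route", "target": "travel_agent"},
        {"actor": "travel_agent", "action": "tool_call", "tool": "flight_search"},
        {"actor": "travel_agent", "action": "tool_call", "tool": "booking_api"},
        {"actor": "travel_agent", "action": "respond", "output": "Flight booked."},
    ],
}

def tools_used(session):
    """Collect the tools invoked across a session's traces."""
    return [t["tool"] for t in session["traces"] if t["action"] == "tool_call"]

def evaluate_session(session, required_tools):
    """Session-level check: goal achieved and every required tool invoked."""
    return session["goal_achieved"] and required_tools <= set(tools_used(session))

print(evaluate_session(session, {"flight_search", "booking_api"}))
```

Session-level checks like `evaluate_session` correspond to goal-achievement metrics, while inspecting individual trace entries (which actor acted, which tool was called, with what output) is where coordination issues between supervisors, agents, and tools surface.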
Accessing Evaluation Studio
- Log in to your Platform account.
- From the modules menu, select Evaluation Studio.
- On the Evaluation page, select Model Evaluation or Agentic Evaluation to begin.
