Assess and optimize LLM performance systematically using datasets, automated evaluators, and visual insights.

Overview

Model Evaluation is a comprehensive tool in the Platform for assessing LLM performance. Users can choose from a variety of pre-built evaluators or create custom evaluators to measure model effectiveness, and can upload and organize datasets (including inputs and outputs) into designated projects for evaluation, with adjustable thresholds and scoring metrics to suit specific needs. Evaluation Studio also fosters collaboration and sharing: teams can work together on projects, share evaluation results, and collectively analyze model performance. Through a streamlined interface tailored to the selected criteria, users can assess datasets against models, analyze results, and gain insights.
Create Project → Create Evaluation → Import Dataset → Configure Evaluators → Run Evaluation → Analyze Results

Key Features

  1. Project-based Organization: Projects serve as containers for evaluations. Users can create, manage, and share projects, each containing multiple evaluations.
  2. Flexible Dataset Handling: Supports importing data through CSV files (with input-output pairs or input-only data) and production data from deployed model traces.
  3. Streamlined Evaluation Process: Evaluate model performance using out-of-the-box evaluators such as Groundness, Coherence, and Toxicity.
  4. Flexibility for Different Data Scenarios: Supports three evaluation scenarios:
    • One Input, One Output: Simple, straightforward evaluations.
    • One Input, Multiple Outputs: Assess multiple outputs from a single input.
    • Input Only: Generate outputs with your production model when only input data is available.
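As a concrete illustration of the simplest scenario (one input, one output), an uploaded CSV might look like this — the column headers here are hypothetical; use whatever names match your data:

```csv
input,output
"Summarize the customer call about a delayed order.","Customer reported a delayed order; the agent issued a refund and apologized."
"Summarize the chat about a password reset.","Customer needed a password reset; the agent sent a reset link."
```

For the input-only scenario, the same file would simply omit the output column, and outputs would be generated later with Run a Prompt, Run an API, or Run Search AI.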

Why Use Evaluation Studio?

  1. Streamlined Workflow: Manage projects, upload data, perform evaluations, and track results—all in one place.
  2. Customizability: Define evaluation criteria to suit specific needs, from simple to complex use cases.
  3. Collaboration: Share projects with collaborators, supporting team-based model evaluation in a centralized environment.
  4. Continuous Improvement: Run regular evaluations to track model performance over time and drive ongoing optimization.
  5. Seamless Integration: Import data from deployed production models and export custom trace data for further analysis.

User Journey

The following outlines the key actions at each stage of the Model Evaluation workflow:
  1. Create a project — Log in to the Platform, go to Evaluation Studio, and create a project under Model Evaluation.
  2. Create an evaluation — Create an evaluation within your project to organize session data.
  3. Import a dataset — Upload a CSV dataset or import production data for evaluation.
  4. Configure evaluators — Choose Quality, Safety, or RAGAS evaluators and map prompt variables to dataset columns.
  5. Run an evaluation — Trigger the evaluation against your dataset and selected evaluators.
  6. View evaluation results — Review evaluator tiles in Evaluation Insights with color-coded visual indicators.
  7. Export evaluation results — Export the evaluation table as a CSV for offline analysis.

Projects

In Evaluation Studio, projects act as the core containers for organizing evaluations. Each project can store multiple evaluations, and each evaluation contains a distinct dataset.

Roles and Permissions

Roles and permissions determine who can access and modify projects and evaluations.
Role | Capabilities
Admin | Full access to create, modify, and delete projects and evaluations. Can manage user permissions and invite new users.
Editor | Can create and modify evaluations within a project, but cannot manage users or project-level settings.
Viewer | Read-only access — can view all evaluations, scores, score explanations, and evaluator properties.
Note: Permissions are applied consistently across all evaluations within a project. Users invited to a project have access to all evaluations within it based on their role. Users cannot be invited at the evaluation level.

Creating a New Project

Steps to create a new project:
  1. Navigate to Model Evaluation.
  2. On the Projects tab, click New Project.
  3. In the Create a new project dialog, enter a name for the project (maximum 100 characters).
  4. Click Create. The system redirects you to the newly created project’s main page.
  5. Click the More options icon (three dots) in the project row to manage it:
    • Rename — Change the project name.
    • Delete — Permanently delete the project and all its associated evaluations and datasets.
    • Share — Share the project with other users by providing email addresses or user identifiers.

Evaluations

Evaluations are the central components within a project where users test model performance against datasets. Each evaluation resides within a project and includes one dataset. Because evaluations are structured within projects, there is no need to rename uploaded datasets—users can simply name each evaluation for its use case.

Creating an Evaluation

Steps to create an evaluation:
  1. Navigate to Model Evaluation.
  2. Click the Projects tab and select the relevant project.
  3. To create a first evaluation, click Create evaluation.
  4. If evaluations already exist for the project, click New evaluation in the Evaluations section.
  5. In the Create evaluation dialog, enter a name for the evaluation and click Done. The system redirects to the Import dataset page.
  6. To manage an existing evaluation, click the three-dot icon in its row:
    • Rename — Change the evaluation name.
    • Delete — Delete the evaluation.

Import Datasets

Evaluation Studio provides a flexible approach for importing and managing datasets. There are several ways to bring data into the platform:
  • Import a dataset — Upload datasets in CSV format containing input-output pairs or input-only data.
  • Import production data — Import data from real-time deployed models with filters for date range, source, and specific columns from model traces.

Adding a Dataset to an Evaluation

Steps to import a dataset:
  1. Navigate to Evaluation Studio, click the Projects tab, and select the relevant project.
  2. Select the specific evaluation to which you want to add a dataset.
  3. Choose an import method:
    • Upload from device — Click the Upload file link and select a CSV file from your local machine.
    • Import production data — Click Proceed and fill in the required fields:
      • Models — Choose the model deployed in production. Only models used in the Platform within Tools, Prompts, and endpoints appear in the dropdown.
      • Source — Select the source where the model is deployed (Tools, Prompts, endpoints, or All).
      • Date — Set the date range (default: last 30 days).
      • Columns — Input and output columns are auto-fetched. Select additional columns (request ID, input tokens, response time, etc.) for more detailed analysis.
  4. Review the dataset preview (first 10 rows) and click Proceed to confirm. The dataset is imported and displayed in a tabular format in the evaluation table.
  5. Click the + button on the Evaluations page for additional dataset actions:
    • Run a prompt — Generate outputs using a selected model and prompt.
    • Run an API — Fetch content from external APIs or deployed tools.
    • Run Search AI — Retrieve answers and context chunks via a Search AI integration (for RAG evaluation).
    • Add an evaluator — Add a Quality, Safety, or RAGAS evaluator.
    • Add human feedback — Manually input feedback for model outputs.
    You can also filter data (text, numeric, boolean), sort it, adjust row heights, and customize columns.

Running a Prompt

The Run a Prompt option enables users to generate customized model output based on a specific model and prompt. This is useful for evaluating fine-tuned models—for example, to summarize customer conversations and evaluate the summaries. Key benefits:
  • Efficiency: Generate content for multiple categories quickly.
  • Customization: Edit prompts and regenerate content as needs evolve.
  • Streamlined Workflow: Manage all content generation in one place.
Steps to run a prompt:
  1. On the Evaluations page, click + and select Run a Prompt.
  2. In the Run a Prompt dialog:
    • Enter the Column Name for the output data.
    • Choose the Model and Configuration settings.
    • Type the prompt, including any mapped variables (for example, summarize the {{input}}).
  3. Click Run to generate a new output column with the results.
After running the prompt, additional options appear:
  • Properties — Modify your prompt or model configuration.
  • Regenerate — Refresh the output based on updated data or prompts.
  • Delete — Remove the output column (ensure no evaluators depend on it first).
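Conceptually, the {{variable}} syntax in a prompt works like per-row template substitution: for each dataset row, the mapped column value is inserted in place of the placeholder before the model is called. A minimal sketch of that behavior (the Platform handles this internally; the row data and prompt below are illustrative):

```python
import re

def render_prompt(template: str, row: dict) -> str:
    """Replace each {{variable}} with the matching column value from the row."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(row[m.group(1)]), template)

row = {"input": "Customer asked about the refund policy."}
prompt = "Summarize the {{input}}"
print(render_prompt(prompt, row))
# → Summarize the Customer asked about the refund policy.
```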

Running an API

The Run an API option enables users to fetch content from external APIs or deployed tools directly into the evaluation process. This allows integration of live data, agent outputs, and outputs from models hosted outside the Platform. Key benefits:
  • External Data Integration: Bring in data or model outputs from external sources or deployed tools.
  • Flexible Data Handling: Evaluate dynamic content generated by deployed tools in real time.
  • Seamless Evaluation: Attach evaluators to API-generated outputs just like any other dataset column.
Steps to run an API:
  1. On the Evaluations page, click + and select Run an API.
  2. Configure the API call in the dialog:
    • Column Name — Name for the column where the API output will appear.
    • Method — HTTP method (GET, POST, PUT, DELETE, or PATCH).
    • Request URL — URL of the API endpoint. You can paste a cURL command here.
    • Headers — Key/value pairs for required headers (for example, authentication tokens).
    • Body — Data to send with the request. Use column variables for dynamic values (for example, "input":"{{column1}}").
    • Response — Auto-generated preview using the first row of input data.
    • JSON Output Path — Path to the specific field within the JSON response to display.
  3. Click Test to verify the API setup and preview the response.
  4. Click Run to fetch content for all dataset rows. The API response is added as a new column.
  5. Attach evaluators to the API-generated column and run the evaluation.
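Under the hood, the column variables in the Body are filled in per row before each API call is made. A rough sketch of that per-row substitution (column names and the body template are hypothetical; a production implementation would also JSON-escape the values rather than substitute raw strings):

```python
import json

def build_request_body(body_template: str, row: dict) -> str:
    """Fill {{column}} placeholders in the body template with row values.

    Note: naive string replacement; values containing quotes would need
    escaping (e.g., via json.dumps) in a robust implementation.
    """
    for column, value in row.items():
        body_template = body_template.replace("{{" + column + "}}", str(value))
    return body_template

row = {"column1": "What is the refund window?"}
template = '{"input": "{{column1}}"}'
payload = json.loads(build_request_body(template, row))
print(payload["input"])
# → What is the refund window?
```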

Example Workflow for Running an API

  1. Create and deploy a tool in the Platform.
  2. Copy the tool endpoint URL from the Tool Endpoint tab.
  3. Upload a dataset containing only input columns in Evaluation Studio.
  4. Click +, select Run an API, add a column name, and paste the endpoint URL in the Request URL field.
  5. Generate an API key from the tool’s API Keys tab and copy it.
  6. In the Headers tab, paste the API key in the Value field for the key x-api-key.
  7. In the Body tab, replace the placeholder with your input column name (for example, {{Input}}).
  8. Click Test to trigger the API using the first row and verify the response.
  9. In the JSON output path tab, specify the path to extract the required field. For example, if the API response is "output": { "Summarization": "(generated output)" }, enter output.Summarization.
  10. Click Run to fetch outputs for all dataset rows. A new column is populated with the API results.
Tip: Make sure the placeholder name in the Body exactly matches the input column name in your dataset so that each row’s input is sent to the API dynamically.
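The JSON output path in step 9 can be thought of as a dot-separated key lookup walked through the nested response. A minimal sketch, assuming a simple dot path with no array indices:

```python
def extract_path(response: dict, path: str):
    """Walk a dot-separated path like 'output.Summarization' through nested dicts."""
    value = response
    for key in path.split("."):
        value = value[key]
    return value

response = {"output": {"Summarization": "(generated output)"}}
print(extract_path(response, "output.Summarization"))
# → (generated output)
```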

Running Search AI

The Run Search AI option enables evaluation through retrieval-augmented generation (RAG). It uses a pre-configured Search AI integration to automatically retrieve answers and supporting context chunks for each input row. This is particularly useful for evaluating RAG systems, knowledge-grounded agents, and use cases where contextual accuracy is critical. Once executed, Evaluation Studio adds two new columns to the dataset:
  • Answers — RAG responses based on the retrieved context.
  • Retrieved Contexts — Supporting text chunks used to generate the answer.
Key benefits:
  • Import RAG pipelines: Integrate retrieval-augmented workflows using pre-validated Search AI connections.
  • Evaluate with custom criteria: Apply custom evaluators to assess RAG pipeline performance.
Steps to run Search AI:
  1. Click + in Evaluation Studio and select Run Search AI.
  2. In the Connection name field, select a pre-configured Search AI connection. Only integrations that are validated via the Integrations page appear in the dropdown.
  3. In the Map Variables section, specify the input column to use for querying the retrieval system.
  4. (Optional) Set Meta filters to narrow down search results—for example, specifying particular file names when sources contain multiple files.
  5. Click Test to verify the connection. The response from the first-row query is shown in the Response tab.
  6. Click Run to execute the retrieval. The Answers and Retrieved Contexts columns are populated. In the Retrieved Contexts column, click Show JSON to inspect the response. Retrieved contexts appear under the chunkText key.
  7. Click + and select Add Evaluator to attach evaluators to the Answers or Retrieved Contexts columns.
    Note: RAGAS evaluators are specifically designed to test RAG systems. Attach them to thoroughly assess RAG pipeline performance.
    You can also add an empty column (inline-editable, supports text and numeric values) for manually entering ground truth in RAGAS evaluations.
  8. Use Evaluation Studio’s filtering, sorting, and analysis tools to review Search AI outputs and identify opportunities for improvement.
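The Retrieved Contexts JSON mentioned in step 6 can also be inspected programmatically once exported. A hedged sketch, assuming each context entry carries its text under the chunkText key as described above (the surrounding list structure is illustrative):

```python
import json

# Illustrative Retrieved Contexts cell contents
raw = '[{"chunkText": "Refunds are issued within 14 days."}, {"chunkText": "Contact support to start a return."}]'

# Collect the text of each retrieved chunk
contexts = [entry["chunkText"] for entry in json.loads(raw)]
for chunk in contexts:
    print(chunk)
```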

Configure Evaluators

Evaluators are tools used to assess how well a model is performing based on specific tasks. They function like custom prompts or instructions designed to check certain aspects of a model’s output against predefined criteria.

Types of Evaluators

Evaluation Studio supports two primary evaluator types: AI evaluators and Human evaluators.

AI Evaluators

AI Evaluators are predefined instructions provided to an LLM to evaluate its outputs against a dataset of input and output data. There are two subtypes:
  • System AI Evaluators: Pre-built evaluators provided by the platform for common aspects of model performance (quality, correctness, safety). Ready-to-use and cannot be modified.
  • Custom AI Evaluators: User-defined evaluators with custom prompts and scoring mechanisms, offering full flexibility for specialized tasks.
Note: Access all available system evaluators through the global Evaluators page at the project level (click the Evaluators tab next to the Projects tab).
System evaluators are grouped into three categories:
Quality Metrics
Quality metrics assess the overall effectiveness and usefulness of model outputs.
Metric | Description | Required Dataset Components
Groundness | Evaluates whether the output accurately reflects the input without introducing information from the model’s knowledge base. | Input, Output
Query Relevance | Assesses the relevance of the output to the user query. | Input, Output, User Query
Ground Truth Relevance | Compares the output to a provided ground truth to assess relevance. | Input, Output, Ground Truth
Coherence | Evaluates logical consistency and natural flow of the generated output. | Output
Fluency | Assesses grammatical correctness and sentence quality. | Output
GPT Similarity Score | Compares the model’s response with a superior model’s response for the same input. | Input, Your model’s response, Superior model’s response
Paraphrasing | Assesses whether the output conveys the same meaning as the input using different phrasing. | Input, Output
Completeness | Evaluates whether the output conveys the full context from the input without omissions. | Input, Output
Safety Metrics
Safety metrics evaluate whether model outputs are free from harmful or unethical content.
Metric | Description | Required Dataset Components
Bias Detection | Analyzes the output for potential biases, ensuring no unfair or discriminatory tendencies. | Output
Banned Topics | Scans for prohibited content related to specified sensitive topics. | Output
Toxicity | Screens for violent, sexual, or otherwise inappropriate content. | Output
RAGAS Evaluators
RAGAS evaluators assess the performance of RAG (Retrieval-Augmented Generation) pipelines, evaluating both the accuracy of the answer and the relevance of the retrieved contexts. Users can adjust the following parameters (evaluation prompts themselves cannot be modified):
  • Model: Choose which model and connection to use for the evaluation.
  • Pass Threshold: Modify the threshold required for a pass.
  • Variables in the Prompt: Attach variables such as ground_truth, retrieved_contexts, user_input.
Metric | Description | Required Dataset Components
Context Precision | Measures the proportion of relevant chunks among all retrieved chunks for the given input. | Input, Response, Retrieved context
Context Recall | Evaluates whether the retrieved context is sufficient to address the user input. Higher recall means fewer significant chunks are omitted. | Input, Response, Retrieved context, Reference answer
Context Entity Recall | Evaluates common entities present in the retrieved context relative to the total entities in the reference context. | Retrieved context, Reference answer
Noise Sensitivity | Provides the proportion of incorrect claims in the total number of retrieved claims. | Input, Response, Retrieved context, Reference answer
Faithfulness | Measures how factually consistent a response is with the retrieved context. | Input, Response, Retrieved context
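For intuition, Context Precision reduces to the fraction of retrieved chunks that are relevant to the input. A toy sketch of that arithmetic (the relevance judgments here are illustrative — RAGAS derives them with an LLM, and its full metric is rank-weighted rather than a flat proportion):

```python
def context_precision(relevance_flags: list[bool]) -> float:
    """Proportion of relevant chunks among all retrieved chunks."""
    if not relevance_flags:
        return 0.0
    return sum(relevance_flags) / len(relevance_flags)

# 3 of 4 retrieved chunks judged relevant for one input
print(context_precision([True, True, False, True]))
# → 0.75
```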

Human Evaluators

Human evaluators provide insights into output quality that complement automated scoring:
  1. Thumbs Up/Down — Users provide a thumbs up (1) or thumbs down (0) reaction to the model’s output.
  2. Better Output — Users suggest an improved version of the model’s output, providing direct suggestions for improvement.
  3. Comments — Users leave short positive or negative comments providing richer, detailed feedback.
To add a human evaluator, click Add human feedback on the Evaluations page and choose one of the three options. Human evaluators are added as separate columns in the dataset.

Adding a System Evaluator

Steps to add a system evaluator:
  1. On the Evaluations page, click + and select Add evaluator.
  2. From the list of Quality and Safety evaluators, select the desired evaluator.
  3. In the Evaluators dialog, fill in these details:
    • Model — Choose the evaluator model (only models deployed in the Platform appear in the dropdown).
    • Model Configuration — Select hyperparameters such as Temperature, Output token limit, Top P.
    • Prompt — View the system prompt (view-only; cannot be edited).
    • Map variables — Map prompt variables to the corresponding dataset columns.
    • Pass threshold — Set the minimum score for a pass (choose Greater than or Less than, then enter a value from 1 to 5):
      • Positive evaluators (higher score is better, e.g., Completeness): Scores above the threshold are marked green (Pass).
      • Negative evaluators (lower score is better, e.g., Toxicity): Scores above the threshold are marked red (Fail).
  4. Click Save.
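The pass-threshold logic above can be summarized in a few lines. A sketch, assuming the Greater than/Less than choice maps to a simple comparison against the configured value:

```python
def passes(score: float, threshold: float, direction: str) -> bool:
    """direction is 'greater' for Greater than, 'less' for Less than."""
    return score > threshold if direction == "greater" else score < threshold

# Positive evaluator (e.g., Completeness): higher scores are better
print(passes(4.2, 2.5, "greater"))  # → True (Pass, shown green)
# Negative evaluator (e.g., Toxicity): lower scores are better
print(passes(4.2, 2.5, "less"))     # → False (Fail, shown red)
```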

Adding a Custom Evaluator

Custom evaluators allow users to design AI evaluators tailored to specific needs. Custom evaluators can be saved globally, making them available across different projects. Steps to add a custom evaluator:
  1. On the Evaluations page, click + and select Add evaluator.
  2. Click Add evaluator in the evaluators list.
  3. In the Custom evaluators dialog, fill in these details:
    • Evaluator Name — Enter a name for the evaluator.
    • Evaluator Type — Select Quality or Safety.
    • Description — Provide a brief description of the evaluator’s purpose.
    • Model — Choose the model for evaluation (both open-source and external models deployed in the Platform are available).
    • Model Configuration — Select hyperparameters such as Temperature, Output token limit, Top P.
    • Prompt — Enter the evaluation prompt, or click Template to start from a built-in template.
      Note: Do not specify the score format in the prompt—it is automatically determined by the selected output type. A mismatch may cause errors.
    • Save as a Global Evaluator — Check this box to make the evaluator available across all projects.
    • Map variables — Map prompt variables to the corresponding dataset columns.
    • Output Type — Select Score or Boolean.
    • Maximum Score — If output type is Score, specify the maximum value (for example, 1 to 10).
    • Pass threshold — If output type is Score, set the pass threshold using Greater than or Less than (same logic as system evaluators).
  4. Click Save.

Mapping Variables

Variable mapping connects the evaluator’s prompt variables to the corresponding columns in the dataset. This is a critical step for accurate evaluation results.
  • Prompt variables (shown in double curly braces, e.g., {{input}}, {{output}}, {{query}}) appear on the left side of the mapping section and are auto-populated by the system.
  • Dataset columns appear on the right side. Select the correct column to match each variable.
    • Map {{input}} to the input column in your dataset.
    • Map {{output}} to the output column.
  • For safety evaluators (Bias Detection, Toxicity), you may need to configure additional key-value pairs.
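Because prompt variables follow the double-curly-brace convention, they can be discovered with a simple pattern match. A sketch of how a system might auto-populate the left side of the mapping (illustrative only; the Platform does this for you):

```python
import re

def find_variables(prompt: str) -> list[str]:
    """Return the distinct {{variable}} names in order of first appearance."""
    seen = []
    for name in re.findall(r"\{\{(\w+)\}\}", prompt):
        if name not in seen:
            seen.append(name)
    return seen

print(find_variables("Given {{input}} and {{query}}, grade {{output}} against {{input}}."))
# → ['input', 'query', 'output']
```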

Run Evaluations

After setting up evaluators and mapping variables, trigger the evaluation to assess your model’s performance against the defined criteria.

Starting an Evaluation

Click Run on the Evaluation page. The system assesses model outputs based on configured evaluators. Once complete, evaluator columns are populated with scores and evaluator tiles appear in the Evaluation Insights section.

Stopping an Evaluation

If you need to stop a running evaluation—for example, when working with a large dataset and no longer wish to continue—you can stop the process to avoid unnecessary token consumption. To stop a running evaluation:
  1. Click the Stop button at the top of the page.
  2. A confirmation message appears; the evaluation continues running in the background until you click Continue to confirm.
  3. Once confirmed, the evaluation is fully stopped and all evaluators and rows are halted.
If you stop before completion, keep in mind:
  • Scores are shown only for rows processed up to the stop point.
  • Evaluation Insights reflects data generated up to the stop point.
  • Tokens are consumed for all rows processed up to the stop, even if the evaluation was not completed.

Tracking Evaluation Progress

The Evaluation Progress feature provides real-time visibility into the status of running evaluations. Click the progress ring icon in the top corner to view detailed progress information. The progress feature provides:
  1. Real-time monitoring — Shows elapsed time and number of rows being processed.
  2. Visibility into evaluation status — Instantly see whether an evaluation is stopped, in progress, or completed.
  3. Efficient management — Know exactly when the evaluation started, how much is completed, and when it finishes.
  4. Handling interruptions — If an evaluation is stopped, view how much work was completed before the stop. When restarted, the progress resets to reflect the new process.
  5. Comprehensive reporting — Once complete, the total elapsed time, the user who ran the evaluation, and the start time are displayed. If the evaluation was restarted, total elapsed time combines both runs.

View Results in Evaluation Insights

The Evaluation Insights section displays a detailed overview of model performance through evaluator tiles—visual summaries showing how well model outputs meet predefined criteria. Each evaluator tile shows a bar graph where color and bar length indicate pass/fail status relative to the configured threshold:
  • Pass — The output meets the evaluation criteria (for example, no bias, no toxicity).
  • Fail — The output does not meet the evaluation criteria (for example, contains bias, contains toxicity).
Scoring methods by evaluator type:
  • Quality evaluators (Groundness, Query Relevance, Coherence, Fluency, Paraphrasing, Completeness) use Continuous Scoring on a numerical scale (for example, 1–5 or 1–10):
    • Coherence (positive evaluator, threshold 2.5): Scores above 2.5 → Pass; scores below → Fail.
    • Toxicity (negative evaluator, threshold 2.5): Scores above 2.5 → Fail (toxic content detected); scores below → Pass.
  • Safety evaluators (Bias Detection, Banned Topics) use Boolean Scoring (pass = 1, fail = 0):
    • Bias Detection: 1 (Pass) = no bias detected; 0 (Fail) = bias detected.

Export Evaluation Results

Export the evaluation table as a CSV file for offline analysis or further use. To export an evaluation:
  1. Go to Evaluation Studio and open the desired project.
  2. Select the evaluation you want to export.
  3. Click the three-dot menu at the top-right corner of the evaluation table.
  4. Select Export.
The CSV file is automatically downloaded and includes:
  • All column data and score values.
  • Footer metrics:
    • Boolean columns: pass%, fail%
    • Score columns: pass%, fail%, average, min, max
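The footer metrics for a score column can be reproduced offline from the exported CSV. A minimal sketch, assuming pass/fail is determined by a Greater-than threshold (invert the comparison for Less-than evaluators):

```python
def score_footer(scores: list[float], threshold: float) -> dict:
    """Compute the footer metrics the CSV export includes for a score column."""
    passed = sum(1 for s in scores if s > threshold)
    n = len(scores)
    return {
        "pass%": 100 * passed / n,
        "fail%": 100 * (n - passed) / n,
        "average": sum(scores) / n,
        "min": min(scores),
        "max": max(scores),
    }

print(score_footer([4.0, 2.0, 3.5, 5.0], threshold=2.5))
```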