Evaluate Multi-Turn LLM Conversations

Analyze, measure, and improve AI agent conversations with structured evaluation workflows and hierarchical insights.


Three-Level Hierarchy

TurnWise organizes conversations into a three-level hierarchy, so you can evaluate at whatever granularity you need.

Hierarchical Structure: Conversation → Messages → Steps
1. Conversation

A complete multi-turn dialogue between users and AI agents. Evaluate the entire conversation flow and overall quality.

2. Messages

Individual messages within a conversation (user, assistant, system, tool). Analyze each exchange independently.

3. Steps

Individual reasoning steps within a message (thinking, tool calls, outputs). Dive deep into the agent's thought process.
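The three levels can be pictured as nested records. The sketch below is illustrative only: the class names, field names, and the `iter_targets` helper are assumptions made for this example, not TurnWise's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Step:
    kind: str       # assumed labels, e.g. "thinking", "tool_call", "output"
    content: str


@dataclass
class Message:
    role: str       # "user", "assistant", "system", or "tool"
    content: str
    steps: List[Step] = field(default_factory=list)


@dataclass
class Conversation:
    messages: List[Message] = field(default_factory=list)


def iter_targets(conversation: Conversation, level: str) -> Iterator[object]:
    """Yield the objects an evaluation at the given level would visit."""
    if level == "conversation":
        yield conversation
    elif level == "message":
        yield from conversation.messages
    elif level == "step":
        for message in conversation.messages:
            yield from message.steps
    else:
        raise ValueError(f"unknown level: {level}")


conv = Conversation(messages=[
    Message("user", "What is the weather in Paris?"),
    Message("assistant", "It is sunny.", steps=[
        Step("thinking", "I should call the weather tool."),
        Step("tool_call", "get_weather(city='Paris')"),
        Step("output", "It is sunny."),
    ]),
])

for level in ("conversation", "message", "step"):
    print(level, len(list(iter_targets(conv, level))))
```

Under this layout, a conversation-level evaluation sees one object, a message-level evaluation sees each message, and a step-level evaluation sees every step of every message.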

Powerful Features

Everything you need to evaluate and improve your LLM conversations.

Hierarchical Evaluation

Evaluate entire conversations, individual messages, or specific reasoning steps. Create custom evaluation metrics with prompts and output schemas. Run evaluations on-demand or in batch.
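As a rough illustration of what "a metric with a prompt and an output schema" could look like, here is a minimal sketch. The `Metric` class, the `run_metric` helper, and the schema format (field name mapped to a Python type) are hypothetical, and `fake_judge` stands in for a real LLM call so the example runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Metric:
    name: str
    level: str                       # "conversation", "message", or "step"
    prompt: str                      # instruction sent to the judge model
    output_schema: Dict[str, type]   # assumed format: field name -> expected type


def run_metric(metric: Metric, text: str, judge: Callable[[str], str]) -> dict:
    """Send prompt plus text to the judge, then check the reply against the schema."""
    raw = judge(f"{metric.prompt}\n\n{text}")
    result = json.loads(raw)
    for key, expected in metric.output_schema.items():
        if not isinstance(result.get(key), expected):
            raise ValueError(
                f"{metric.name}: field {key!r} is missing or not {expected.__name__}"
            )
    return result


def fake_judge(prompt: str) -> str:
    # Stand-in for a real LLM call; always returns the same structured answer.
    return json.dumps({"passed": True, "reason": "Answer addresses the question."})


helpfulness = Metric(
    name="helpfulness",
    level="message",
    prompt="Does the assistant message answer the user? Reply as JSON.",
    output_schema={"passed": bool, "reason": str},
)

result = run_metric(helpfulness, "user: hi\nassistant: hello", fake_judge)
print(result)
```

Validating the reply against the schema before storing it keeps malformed judge output from silently entering the results.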

Rolling Summaries

Automatically maintain compressed summaries of long conversations. This prevents context window overflow when evaluating lengthy dialogues, and summaries update incrementally as conversations grow.
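One common way to build such a summary is to keep the most recent messages verbatim and fold older ones into a running summary. The sketch below shows that general pattern; it is not TurnWise's implementation. `RollingSummary`, `max_recent`, and `truncate_summarizer` are names invented here, and the summarizer is a placeholder where a real system would likely call an LLM.

```python
from typing import Callable, List


class RollingSummary:
    """Keep the last `max_recent` messages verbatim; fold older ones into a summary."""

    def __init__(self, summarize: Callable[[str, str], str], max_recent: int = 4):
        self.summarize = summarize
        self.max_recent = max_recent
        self.summary = ""
        self.recent: List[str] = []

    def add(self, message: str) -> None:
        self.recent.append(message)
        # Incremental update: only the message that falls out of the window
        # is summarized, so earlier work is never redone.
        while len(self.recent) > self.max_recent:
            oldest = self.recent.pop(0)
            self.summary = self.summarize(self.summary, oldest)

    def context(self) -> str:
        """Text an evaluator would receive instead of the full history."""
        parts = [f"Summary so far: {self.summary}"] if self.summary else []
        return "\n".join(parts + self.recent)


def truncate_summarizer(previous: str, message: str) -> str:
    # Placeholder: keeps the first 20 characters of each folded message.
    snippet = message[:20]
    return f"{previous} | {snippet}" if previous else snippet


rs = RollingSummary(truncate_summarizer, max_recent=2)
for i in range(5):
    rs.add(f"message {i}")

print(rs.context())
```

The evaluator's input stays bounded by the summary plus `max_recent` messages, regardless of how long the conversation gets.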

Evaluation Pipelines

Define reusable evaluation workflows (pipelines). Each pipeline contains multiple evaluation nodes (metrics). Execute pipelines across datasets with streaming results.
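A pipeline of this kind can be sketched as a list of nodes plus a generator that yields each result as soon as it is computed. Everything below (`Node`, `Pipeline`, `run`, the toy metrics, and the dataset shape) is a hypothetical illustration, not the TurnWise API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass
class Node:
    name: str
    evaluate: Callable[[List[str]], float]   # conversation in, score out


@dataclass
class Pipeline:
    name: str
    nodes: List[Node]

    def run(self, dataset: Dict[str, List[str]]) -> Iterator[dict]:
        """Yield one result per (conversation, node) as soon as it is ready."""
        for conv_id, conversation in dataset.items():
            for node in self.nodes:
                yield {
                    "conversation": conv_id,
                    "metric": node.name,
                    "value": node.evaluate(conversation),
                }


# Toy metrics that need no model call, so the sketch runs offline.
turn_count = Node("turn_count", lambda conv: float(len(conv)))
asks_question = Node("asks_question", lambda conv: float(any("?" in m for m in conv)))

pipeline = Pipeline("smoke_test", [turn_count, asks_question])
dataset = {"c1": ["hi", "hello, how can I help?"], "c2": ["bye"]}

results = []
for result in pipeline.run(dataset):   # results arrive one at a time
    print(result)
    results.append(result)
```

Because `run` is a generator, a caller can display or store each result immediately instead of waiting for the whole dataset to finish.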

Data Management

Organize conversations into datasets. Track LLM calls, costs, and performance metrics. Store structured outputs and metadata for comprehensive analysis.

How It Works

Get started in three simple steps.

1. Upload Datasets

Import your multi-turn conversation data with messages and steps. Organize them into datasets for easy management.

2. Define Metrics

Create custom evaluation metrics with prompts and output schemas. Build reusable evaluation pipelines.

3. Run Evaluations

Execute evaluations and watch results stream in as they are produced. Get insights at the conversation, message, or step level.

Comprehensive Evaluation Metrics

TurnWise includes a powerful set of evaluation metrics designed specifically for multi-turn LLM agent conversations. Evaluate at message, step, or conversation level.

💬 Message Level Metrics

🔄 CCM (Message)

Conversation Continuity Metric - Detects when users re-ask similar questions, indicating the previous response was incomplete or unsatisfactory.

😞 RDM (Message)

Response Dissatisfaction Metric - Identifies explicit user corrections or expressions of dissatisfaction with the assistant's response.
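To make the idea behind the Conversation Continuity Metric concrete, the sketch below flags user messages that closely repeat the previous user message. It uses word-overlap (Jaccard) similarity as a deliberately crude stand-in; TurnWise's metrics are described as prompt-based, so this is not how CCM is computed. The function names and the 0.6 threshold are assumptions.

```python
import re
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of words the two texts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def find_reasks(messages: List[Tuple[str, str]], threshold: float = 0.6) -> List[int]:
    """Return indices of user messages that closely repeat the previous user message."""
    flagged = []
    previous = None
    for index, (role, content) in enumerate(messages):
        if role != "user":
            continue
        if previous is not None and jaccard(previous, content) >= threshold:
            flagged.append(index)
        previous = content
    return flagged


messages = [
    ("user", "How do I reset my password?"),
    ("assistant", "Go to settings."),
    ("user", "How do I reset my password please?"),
    ("assistant", "Open Settings, then Security, then Reset password."),
    ("user", "Thanks, that worked."),
]

print(find_reasks(messages))  # -> [2]
```

A word-overlap check misses paraphrased re-asks, which is one reason an LLM-judged metric is a better fit for this signal.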

🔧 Step Level Metrics

🛠️ TSE (Step)

Tool Selection Error - Evaluates whether the correct tool was selected for the given task context.

🎭 PH (Step)

Parameter Hallucination - Detects hallucinated or fabricated parameters passed to tools (e.g., invented file paths, non-existent IDs).

🔄 SCD (Step)

Self-Correction Detection - Measures the agent's ability to recognize and recover from its own errors.

🛠️ TUM (Step)

Tool Use Metrics - Comprehensive multi-dimensional analysis of tool usage including selection, parameter accuracy, and result handling.
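As an illustration of what Parameter Hallucination is looking for, the sketch below flags tool-call arguments whose values never appear in the preceding context. This substring check is a simplified stand-in chosen for the example, not TurnWise's method, and `unsupported_arguments` is a hypothetical helper.

```python
from typing import Dict, List


def unsupported_arguments(arguments: Dict[str, object], context: str) -> List[str]:
    """Return names of arguments whose string value never appears in the context."""
    return sorted(
        name for name, value in arguments.items() if str(value) not in context
    )


context = "User: please open ticket 4821 and attach report.pdf"
call = {"ticket_id": 4821, "file_path": "/tmp/report_final.pdf"}

print(unsupported_arguments(call, context))  # -> ['file_path']
```

Here the ticket ID is grounded in the user's request, while the file path was never mentioned and is a candidate for an invented parameter.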

📝 Conversation Level Metrics

🔗 TCI (Conversation)

Tool Chain Inefficiency - Identifies redundant, circular, or inefficient sequences of tool calls.

📈 ATA (Conversation)

Agent Trajectory Analysis - Analyzes conversation patterns for circular reasoning, regression, stalls, and goal drift.

🎯 IDM (Conversation)

Intent Drift Metric - Measures how well the agent maintains alignment with the original user intent throughout the conversation.
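One simple form of the redundancy that Tool Chain Inefficiency targets is the same tool being called repeatedly with identical arguments. The sketch below counts exact repeats. It is a simplified heuristic for illustration, not the TCI metric itself, and `redundant_calls` and the call format are assumptions.

```python
import json
from collections import Counter
from typing import Dict, List, Tuple


def redundant_calls(tool_calls: List[Tuple[str, Dict]]) -> List[Tuple[str, str, int]]:
    """Return (tool, arguments as JSON, count) for every call made more than once."""
    counts = Counter(
        (name, json.dumps(args, sort_keys=True)) for name, args in tool_calls
    )
    return [(name, args, n) for (name, args), n in counts.items() if n > 1]


calls = [
    ("search", {"query": "refund policy"}),
    ("read_file", {"path": "policy.md"}),
    ("search", {"query": "refund policy"}),
]

print(redundant_calls(calls))
```

Serializing arguments with `sort_keys=True` makes two calls compare equal even when their argument dictionaries were built in a different key order. Circular or merely inefficient chains need more context than exact matching can provide.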

Basic vs Advanced Metrics

TurnWise provides two variants of each metric:

  • 📌 Basic Metrics: Simple prompts without template variables. Suitable for quick evaluations.
  • 🚀 Advanced Metrics: Prompts that use template variables such as @HISTORY, {goal}, and {tools} for context-aware evaluation.
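The difference between the two variants can be sketched as a template-filling step. The variable names @HISTORY, {goal}, and {tools} come from the description above; how TurnWise actually substitutes them is not documented here, so `render_prompt` and the formatting of each value are assumptions.

```python
import re
from typing import List, Tuple


def render_prompt(template: str, history: List[Tuple[str, str]],
                  goal: str, tools: List[str]) -> str:
    """Fill @HISTORY, {goal}, and {tools} in one pass over the template."""
    values = {
        "@HISTORY": "\n".join(f"{role}: {content}" for role, content in history),
        "{goal}": goal,
        "{tools}": ", ".join(tools),
    }
    # A single regex pass means substituted text is never re-scanned, so a
    # user message that happens to contain "{goal}" is left untouched.
    pattern = re.compile("|".join(re.escape(key) for key in values))
    return pattern.sub(lambda match: values[match.group(0)], template)


basic = "Did the assistant select the right tool? Answer yes or no."
advanced = (
    "Goal: {goal}\n"
    "Available tools: {tools}\n"
    "Conversation so far:\n"
    "@HISTORY\n"
    "Did the assistant select the right tool?"
)

history = [("user", "Find me a flight to Oslo")]
rendered = render_prompt(advanced, history, "book a flight",
                         ["search_flights", "book_flight"])
print(rendered)
```

A basic prompt contains no variables, so rendering leaves it unchanged. An advanced prompt gives the judge the goal, the available tools, and the history.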

📤 Output Types

Metrics can output different types of results:

📊 Progress (0.0-1.0)
📋 JSON (structured)
📝 Text (free-form)
☑️ Checkbox (pass/fail)
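The four output types imply different validation rules for a raw model reply. The sketch below shows one way to parse each. The type identifiers ("progress", "json", "text", "checkbox") and the accepted pass/fail spellings are assumptions for this example, not TurnWise's documented values.

```python
import json
from typing import Any


def parse_output(output_type: str, raw: str) -> Any:
    """Convert a raw model reply into the value implied by the output type."""
    if output_type == "progress":
        value = float(raw)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"progress must be within 0.0-1.0, got {value}")
        return value
    if output_type == "json":
        return json.loads(raw)
    if output_type == "text":
        return raw.strip()
    if output_type == "checkbox":
        normalized = raw.strip().lower()
        if normalized in ("pass", "true", "yes"):
            return True
        if normalized in ("fail", "false", "no"):
            return False
        raise ValueError(f"cannot read {raw!r} as pass/fail")
    raise ValueError(f"unknown output type: {output_type}")


print(parse_output("progress", "0.75"))
print(parse_output("json", '{"score": 3}'))
print(parse_output("text", "  Looks fine.  "))
print(parse_output("checkbox", "PASS"))
```

Rejecting out-of-range or unreadable replies at parse time keeps invalid values out of stored results.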

Ready to Get Started?

Join TurnWise and start evaluating your multi-turn LLM conversations with powerful, hierarchical insights.

TurnWise - Multi-turn Dataset Evaluation