
Evaluator

The Evaluator block uses AI to score and assess content quality against evaluation metrics that you define. It is well suited for quality control, A/B testing, and ensuring your AI outputs meet specific standards.

Evaluator Block Configuration

Overview

The Evaluator block enables you to:

Score Content Quality: Use AI to evaluate content against custom metrics with numeric scores

Define Custom Metrics: Create specific evaluation criteria tailored to your use case

Automate Quality Control: Build workflows that automatically assess and filter content

Track Performance: Monitor improvements and consistency over time with objective scoring

How It Works

The Evaluator block processes content through AI-powered assessment:

  1. Receive Content - Takes input content from previous blocks in your workflow
  2. Apply Metrics - Evaluates content against your defined custom metrics
  3. Generate Scores - AI model assigns numeric scores for each metric
  4. Provide Summary - Returns detailed evaluation with scores and explanations
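
Conceptually, the round trip looks like the sketch below. The types and function signature are illustrative placeholders, not Sim's internal API:

```typescript
// Hypothetical types illustrating the Evaluator's round trip:
// content goes in, per-metric numeric scores and a summary come out.
interface Metric {
  name: string;        // short identifier, e.g. "accuracy"
  description: string; // what the metric measures
  min: number;         // lower bound of the scoring range
  max: number;         // upper bound of the scoring range
}

interface EvaluationResult {
  scores: Record<string, number>; // one numeric score per metric
  summary: string;                // explanation of the scores
}

// Placeholder signature: receives content from the previous block,
// applies the metrics, and returns scores with a summary.
declare function evaluate(content: string, metrics: Metric[]): Promise<EvaluationResult>;
```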

Configuration Options

Evaluation Metrics

Define custom metrics to evaluate content against. Each metric includes:

  • Name: A short identifier for the metric
  • Description: A detailed explanation of what the metric measures
  • Range: The numeric range for scoring (e.g., 1-5, 0-10)

Example metrics:

  • Accuracy (1-5): How factually accurate is the content?
  • Clarity (1-5): How clear and understandable is the content?
  • Relevance (1-5): How relevant is the content to the original query?
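
Expressed as plain data, these example metrics might look like the following (the field names are illustrative, not a fixed Sim schema):

```typescript
// Illustrative metric definitions matching the examples above.
const metrics = [
  { name: "accuracy",  description: "How factually accurate is the content?",             min: 1, max: 5 },
  { name: "clarity",   description: "How clear and understandable is the content?",       min: 1, max: 5 },
  { name: "relevance", description: "How relevant is the content to the original query?", min: 1, max: 5 },
];
```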

Content

The content to be evaluated. This can be:

  • Directly provided in the block configuration
  • Connected from another block's output (typically an Agent block)
  • Dynamically generated during workflow execution

Model Selection

Choose an AI model to perform the evaluation:

  • OpenAI: GPT-4o, o1, o3, o4-mini, gpt-4.1
  • Anthropic: Claude 3.7 Sonnet
  • Google: Gemini 2.5 Pro, Gemini 2.0 Flash
  • Other Providers: Groq, Cerebras, xAI, DeepSeek
  • Local Models: Any model running on Ollama

Recommendation: Use models with strong reasoning capabilities like GPT-4o or Claude 3.7 Sonnet for more accurate evaluations.

API Key

Your API key for the selected LLM provider. This is securely stored and used for authentication.
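
Together, the model and API key amount to a small piece of configuration, roughly like this (field names are illustrative, not Sim's exact schema):

```typescript
// Hypothetical evaluator configuration combining model choice and API key.
// Keep keys in environment variables rather than hard-coding them.
const evaluatorConfig = {
  model: "gpt-4o",                    // or "claude-3-7-sonnet", a Gemini model, etc.
  apiKey: process.env.OPENAI_API_KEY, // authentication for the chosen provider
};
```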

Evaluation Process

  1. The Evaluator block takes the provided content and your custom metrics
  2. It generates a specialized prompt that instructs the LLM to evaluate the content
  3. The prompt includes clear guidelines on how to score each metric
  4. The LLM evaluates the content and returns numeric scores for each metric
  5. The Evaluator block formats these scores as structured output for use in your workflow
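
A rough sketch of steps 2-5, calling the OpenAI Node SDK directly. This illustrates the general approach rather than Sim's actual implementation; the prompt wording and JSON parsing are assumptions:

```typescript
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

interface Metric { name: string; description: string; min: number; max: number }

// Build an evaluation prompt that tells the model how to score each metric,
// then request JSON so the scores can be parsed into structured output.
async function evaluate(content: string, metrics: Metric[]) {
  const guidelines = metrics
    .map((m) => `- ${m.name} (${m.min}-${m.max}): ${m.description}`)
    .join("\n");

  const response = await client.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "You are an evaluator. Score the content on each metric below, " +
          "staying within the given range, and return JSON of the form " +
          '{"scores": {"<metric>": <number>}, "summary": "<explanation>"}.\n' +
          guidelines,
      },
      { role: "user", content },
    ],
  });

  // Parse the model's JSON reply into structured scores for downstream blocks.
  return JSON.parse(response.choices[0].message.content ?? "{}");
}
```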

Example Use Cases

Content Quality Assessment

Scenario: Evaluate blog post quality before publication

  1. Agent block generates blog post content
  2. Evaluator assesses accuracy, readability, and engagement
  3. Condition block checks if scores meet minimum thresholds
  4. High scores → Publish, Low scores → Revise and retry
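
The threshold check in step 3 could be expressed as a small condition like the one below (the metric names and the minimum score of 4 are assumptions for the example):

```typescript
// Hypothetical scores produced by the Evaluator for the blog post.
const scores = { accuracy: 4, readability: 5, engagement: 3 };

// Publish only if every metric meets the minimum threshold.
const MIN_SCORE = 4;
const shouldPublish = Object.values(scores).every((s) => s >= MIN_SCORE);

console.log(shouldPublish ? "Publish" : "Revise and retry");
```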

A/B Testing Content

Scenario: Compare multiple AI-generated responses

  1. Parallel block generates multiple response variations
  2. Evaluator scores each variation on clarity and relevance
  3. Function block selects highest-scoring response
  4. Response block returns the best result
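
The selection in step 3 can be a short reducer over the scored variations (the variation shape and field names are illustrative):

```typescript
// Hypothetical evaluated variations from the parallel branch.
const variations = [
  { text: "Response A", clarity: 4, relevance: 3 },
  { text: "Response B", clarity: 5, relevance: 4 },
  { text: "Response C", clarity: 3, relevance: 5 },
];

// Rank by combined score and return the best variation.
const best = variations.reduce((top, v) =>
  v.clarity + v.relevance > top.clarity + top.relevance ? v : top
);

console.log(best.text); // "Response B"
```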

Customer Support Quality Control

Scenario: Ensure support responses meet quality standards

  1. Support agent generates response to customer inquiry
  2. Evaluator scores helpfulness, empathy, and accuracy
  3. Scores logged for training and performance monitoring
  4. Low scores trigger human review process
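
Steps 3 and 4 boil down to logging every score and flagging any metric below a threshold (the threshold and log destination here are assumptions):

```typescript
// Hypothetical quality scores for one support response.
const scores = { helpfulness: 4, empathy: 2, accuracy: 5 };

// Log every score for monitoring, and flag any metric below the threshold
// so the conversation can be routed to human review.
const REVIEW_THRESHOLD = 3;
const lowMetrics = Object.entries(scores)
  .filter(([, score]) => score < REVIEW_THRESHOLD)
  .map(([metric]) => metric);

console.log("scores:", scores);
if (lowMetrics.length > 0) {
  console.log("Needs human review, low metrics:", lowMetrics.join(", "));
}
```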

Inputs and Outputs

Inputs

  • Content: The text or structured data to evaluate
  • Evaluation Metrics: Custom criteria with scoring ranges
  • Model: AI model used to perform the evaluation
  • API Key: Authentication for the selected LLM provider

Outputs

  • evaluator.content: Summary of the evaluation
  • evaluator.model: Model used for evaluation
  • evaluator.tokens: Token usage statistics
  • evaluator.cost: Cost summary for the evaluation call
  • Metric Scores: Numeric scores for each defined metric
  • Evaluation Summary: Detailed assessment with explanations
  • Access: All outputs are available to blocks after the Evaluator
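
As a rough sketch, the Evaluator's output could be described with an interface like the one below; the exact field types and where the per-metric scores live are assumptions inferred from the list above:

```typescript
// Sketch of the Evaluator block's output, inferred from the list above.
// Exact field types and the placement of metric scores are assumptions.
interface EvaluatorOutput {
  content: string;                // summary of the evaluation
  model: string;                  // model used for evaluation
  tokens: Record<string, number>; // token usage statistics
  cost: number;                   // cost summary for the evaluation call
  scores: Record<string, number>; // numeric score for each defined metric
}
```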

Best Practices

  • Use specific metric descriptions: Clearly define what each metric measures to get more accurate evaluations
  • Choose appropriate ranges: Select scoring ranges that provide enough granularity without being overly complex
  • Connect with Agent blocks: Use Evaluator blocks to assess Agent block outputs and create feedback loops
  • Use consistent metrics: For comparative analysis, maintain consistent metrics across similar evaluations
  • Combine multiple metrics: Use several metrics to get a comprehensive evaluation