Evaluator
The Evaluator block uses a model to score content against metrics you define, returning a number per metric. Use it for quality gates, comparing variations, and quality control on AI output.
Configuration
Evaluation Metrics
The metrics to score against. Each metric has a name, a description of what it measures, and a numeric range:
Accuracy (1-5): How factually accurate is the content?
Clarity (1-5): How clear and understandable is it?
Relevance (1-5): How relevant is it to the original query?
The model scores the content on each metric and returns the numbers. Metrics missing a name or range are skipped.
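The skipping rule can be pictured with a small sketch. This is not Sim's implementation: the `Metric` class, its field names, and `usable_metrics` are hypothetical names chosen for illustration, and the only behavior taken from the text above is that a metric without a name or a range is dropped.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Metric:
    # Hypothetical shape for one metric: name, description, numeric range.
    name: str = ""
    description: str = ""
    min: Optional[float] = None
    max: Optional[float] = None


def usable_metrics(metrics: List[Metric]) -> List[Metric]:
    """Keep only metrics that have a non-empty name and a complete range."""
    return [
        m for m in metrics
        if m.name.strip() and m.min is not None and m.max is not None
    ]


metrics = [
    Metric("Accuracy", "How factually accurate is the content?", 1, 5),
    Metric("Clarity", "How clear and understandable is it?", 1, 5),
    Metric("Relevance", "How relevant is it to the original query?", 1, 5),
    Metric("", "No name, so this one is skipped", 1, 5),
    Metric("Tone", "No range, so this one is skipped"),
]

kept = [m.name for m in usable_metrics(metrics)]
print(kept)
```

Only the three complete metrics survive the filter, so only they would produce scores.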
Content
The content to score. Usually an earlier output like <agent.content>. Structured data is formatted to text before scoring; the evaluation is text-based, so it can't score images or audio directly.
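As a rough illustration of "formatted to text", the sketch below serializes structured data to JSON before it would be handed to the model. The exact format Sim uses is not stated here, so treat JSON as an assumption; `to_text` is a hypothetical helper, not part of Sim.

```python
import json
from typing import Any


def to_text(content: Any) -> str:
    """Return strings unchanged; serialize anything else to JSON text.

    Assumption: JSON is used as the text form. The real block may format
    structured data differently.
    """
    if isinstance(content, str):
        return content
    # default=str keeps the call from failing on values such as dates.
    return json.dumps(content, indent=2, ensure_ascii=False, default=str)


reply = {"subject": "Refund request", "body": "Your refund was issued today."}
text = to_text(reply)
print(text)
```

Binary inputs such as images or audio have no meaningful text form in this sketch, which matches the limitation described above.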
Model
The model that does the scoring, defaulting to claude-sonnet-4-6. Stronger reasoning models give more consistent scores. Type or pick any supported model. Temperature and a System Prompt are available under the advanced settings, and on hosted Sim the API key is supplied for you.
Outputs
The Evaluator returns a number for each metric. Reference each score by the metric's name in lowercase, so a metric named Accuracy is read as <evaluator.accuracy>:
| Output | What it is |
|---|---|
| <evaluator.accuracy> | The score for a metric (one output per metric you define) |
| <evaluator.content> | The evaluation summary |
| <evaluator.model> | The model that scored |
| <evaluator.tokens> | Token usage |
| <evaluator.cost> | Estimated cost of the call |
The block enforces a JSON Schema built from your metrics, so the model returns only the metric scores as numbers, no extra text.
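To make the idea concrete, here is a minimal sketch of how a JSON Schema could be derived from a metric list. The schema Sim actually generates is not shown in this page, so the shape below is an assumption: `build_schema` and the metric dictionary keys are hypothetical, and whether the real schema carries `minimum`/`maximum` or forbids extra properties is a guess. Names containing spaces or punctuation are not handled.

```python
from typing import Any, Dict, List


def build_schema(metrics: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build an object schema with one numeric property per metric."""
    properties: Dict[str, Any] = {}
    for m in metrics:
        key = m["name"].lower()  # outputs are read by the lowercase name
        properties[key] = {
            "type": "number",
            "minimum": m["min"],
            "maximum": m["max"],
            "description": m["description"],
        }
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),
        "additionalProperties": False,  # assumed: nothing but the scores
    }


schema = build_schema([
    {"name": "Accuracy", "description": "How factually accurate is the content?", "min": 1, "max": 5},
    {"name": "Clarity", "description": "How clear and understandable is it?", "min": 1, "max": 5},
    {"name": "Relevance", "description": "How relevant is it to the original query?", "min": 1, "max": 5},
])
print(schema["required"])
```

A response that validates against a schema like this is an object whose only fields are the metric scores, which is why downstream blocks can read each score as a plain number.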
Examples
Gate on a quality score
The Evaluator scores the draft, and a Condition gates on <evaluator.accuracy>: a strong draft is published, and a weak one is sent back for revision.
The same shape covers other quality work: score several parallel variations and pick the best, or score every support reply and flag the low ones for review.
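The three patterns reduce to simple comparisons on the returned scores. The sketch below mimics that logic in plain Python with made-up data; in a real workflow the comparison lives in a Condition block, not in code. The threshold of 4, the function names, and the record shapes are all assumptions for illustration.

```python
from typing import Dict, List

THRESHOLD = 4  # hypothetical cutoff on a 1-5 scale


def gate(scores: Dict[str, float], metric: str = "accuracy") -> str:
    """Quality gate: publish a strong draft, send a weak one back."""
    return "publish" if scores[metric] >= THRESHOLD else "revise"


def pick_best(variations: List[dict], metric: str = "accuracy") -> dict:
    """Choose the variation with the highest score on one metric."""
    return max(variations, key=lambda v: v["scores"][metric])


def flag_low(replies: List[dict], metric: str = "accuracy") -> List[str]:
    """Return the ids of replies that score below the threshold."""
    return [r["id"] for r in replies if r["scores"][metric] < THRESHOLD]


variations = [
    {"id": "a", "scores": {"accuracy": 3}},
    {"id": "b", "scores": {"accuracy": 5}},
    {"id": "c", "scores": {"accuracy": 4}},
]
replies = [
    {"id": "r1", "scores": {"accuracy": 5}},
    {"id": "r2", "scores": {"accuracy": 2}},
]

best = pick_best(variations)
flagged = flag_low(replies)
print(gate({"accuracy": 4.5}), best["id"], flagged)
```

Comparing variations this way only makes sense when every variation was scored against the same metrics and ranges.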
Best Practices
- Write specific metric descriptions. A clear definition of what each metric measures produces more consistent scores.
- Choose a sensible range. Use enough granularity to act on (1-5 or 0-10) without splitting hairs.
- Score Agent output and loop back. Pair an Evaluator with a Condition to gate on a threshold and route weak output back for another pass.
- Keep metrics consistent. For comparing variations, use the same metrics across each evaluation.
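The score-and-loop-back practice can be sketched as a bounded loop. Everything here is illustrative: `score` and `revise` are stand-ins for the Evaluator and Agent blocks, the threshold is arbitrary, and the `max_passes` cap is an added assumption to keep a draft that never improves from cycling forever.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]


def revise_until_good(
    draft: str,
    score: Callable[[str], Scores],
    revise: Callable[[str, Scores], str],
    metric: str = "accuracy",
    threshold: float = 4,
    max_passes: int = 3,
) -> Tuple[str, Scores, int]:
    """Score the draft and revise it until it passes or the cap is hit."""
    scores = score(draft)
    passes = 0
    while scores[metric] < threshold and passes < max_passes:
        draft = revise(draft, scores)
        scores = score(draft)
        passes += 1
    return draft, scores, passes


# Stub "Evaluator": each revision marker adds one point, capped at 5.
def fake_score(text: str) -> Scores:
    return {"accuracy": min(5, 2 + text.count("[revised]"))}


# Stub "Agent": appends a marker instead of really rewriting the draft.
def fake_revise(text: str, scores: Scores) -> str:
    return text + " [revised]"


final_draft, final_scores, passes = revise_until_good(
    "draft", fake_score, fake_revise
)
print(final_draft, final_scores, passes)
```

With these stubs the draft starts at 2, reaches 4 after two revisions, and the loop stops there.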