Evaluation Engine

Benchmarks you can defend.

Multi-model ensemble judging. Formal policy documents. Sensitivity analysis. Reproducible experiments. Every step auditable, every result defensible.

Multi-model ensemble judging

A rotating panel of frontier and open-weight models — selected per task domain — so no single judge controls the outcome.

Frontier generalist: broad reasoning across domains

Nuanced analyst: deep contextual understanding

Multilingual evaluator: cross-language consistency

Open-weight calibrator: bias check against proprietary models

Built-in consensus checks automatically flag items where judge agreement is weak, routing them to human review.
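
As a rough illustration of that mechanism, the sketch below flags items whose panel scores spread too widely to trust automatically. It assumes each judge returns a 0 to 1 score per item; the file name, the agreement_floor value, and every identifier here are hypothetical, not Rensei's actual interface.

consensus_check.py
from statistics import mean, pstdev

def flag_for_review(judge_scores: dict[str, list[float]], agreement_floor: float = 0.15):
    """Return items whose judges disagree too much for an automatic verdict."""
    flagged = []
    for item_id, scores in judge_scores.items():
        spread = pstdev(scores)            # how far the panel disagrees on this item
        if spread > agreement_floor:       # weak consensus -> route to human review
            flagged.append((item_id, round(mean(scores), 3), round(spread, 3)))
    return flagged

# Toy panel: item "q7" splits the judges, so it is routed to a human.
panel = {
    "q3": [0.82, 0.79, 0.84, 0.80],
    "q7": [0.90, 0.35, 0.70, 0.55],
}
print(flag_for_review(panel))              # [('q7', 0.625, 0.202)]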

Policy as code

Every judging rule is written down, versioned, and machine-readable. No hidden settings. No “we eyeballed it.”

policy.yaml
████████████████: [your threshold]   # Agreement level required before a verdict holds
████████████████: [configurable]     # Number of independent evaluations per target
████████████████: [tiered]           # How disagreements are triaged and escalated
████████████████: [your standard]    # Minimum evaluator reliability to participate
████████████████: [your tolerance]   # When to trigger recalibration
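
Because the real keys above are redacted, here is a purely hypothetical sketch of how a versioned, machine-readable policy could be parsed and enforced before a judging run. The field names are placeholders invented for the example, and it assumes PyYAML is installed; this is not Rensei's actual schema.

load_policy.py
import yaml  # PyYAML

POLICY_TEXT = """
version: 3
consensus_threshold: 0.75     # agreement level required before a verdict holds
judges_per_item: 5            # independent evaluations per target
escalation: tiered            # how disagreements are triaged and escalated
min_judge_reliability: 0.80   # minimum evaluator reliability to participate
drift_tolerance: 0.05         # when to trigger recalibration
"""

REQUIRED = {"consensus_threshold", "judges_per_item", "escalation",
            "min_judge_reliability", "drift_tolerance"}

def load_policy(text: str) -> dict:
    """Parse the policy document and fail loudly if any rule is missing."""
    policy = yaml.safe_load(text)
    missing = REQUIRED - policy.keys()
    if missing:
        raise ValueError(f"policy is incomplete, missing: {sorted(missing)}")
    return policy

policy = load_policy(POLICY_TEXT)
print(policy["consensus_threshold"])   # 0.75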

See how hidden settings flip rankings

Drag the slider to change the agreement strictness. Watch how model rankings shift — with the exact same underlying data.

Interactive Demo

Ranking Inversion


Strictness: 0 / 9 (Lenient to Strict)
Same data. Different settings. Different winner.
1. Model A: 0.82
2. Model B: 0.71
3. Model C: 0.68
4. Model D: 0.65
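
The effect the demo shows can be reproduced in a few lines. In the sketch below the per-item judge scores and the two thresholds are invented for illustration (they are not the demo's underlying data): under a lenient threshold the model with many adequate answers wins, while under a strict one the model with fewer but excellent answers takes first place.

ranking_inversion.py
SCORES = {
    "Model X": [0.92, 0.91, 0.55, 0.56, 0.58],   # many adequate answers
    "Model Y": [0.95, 0.93, 0.94, 0.40, 0.42],   # fewer, but excellent, answers
}

def rank(threshold: float) -> list[tuple[str, float]]:
    """Rank models by pass rate, where an item passes only if its score >= threshold."""
    pass_rate = {
        model: sum(s >= threshold for s in items) / len(items)
        for model, items in SCORES.items()
    }
    return sorted(pass_rate.items(), key=lambda kv: kv[1], reverse=True)

print(rank(0.5))   # lenient: [('Model X', 1.0), ('Model Y', 0.6)]
print(rank(0.9))   # strict:  [('Model Y', 0.6), ('Model X', 0.4)]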

Three stress tests, one answer

Before you publish a benchmark, Rensei answers the three questions that determine whether anyone should trust it.

Parametric

Sensitivity

How sensitive are your results to pipeline configuration? Identifies which settings matter and which are noise.
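
A minimal version of such a sweep, assuming a one-at-a-time design: vary each setting while holding the rest at their defaults and compare how far the headline score moves. The settings, their ranges, the 0.02 cutoff, and run_benchmark are all stand-ins invented for this sketch.

sensitivity_sweep.py
DEFAULTS = {"consensus_threshold": 0.75, "judges_per_item": 5, "temperature": 0.0}
SWEEP = {
    "consensus_threshold": [0.60, 0.75, 0.90],
    "judges_per_item": [3, 5, 7],
    "temperature": [0.0, 0.3],
}

def run_benchmark(config: dict) -> float:
    """Placeholder for a real pipeline run; returns the aggregate score for a config."""
    # Fake response surface so the example runs end to end.
    return round(0.7 - 0.2 * config["consensus_threshold"]
                 + 0.01 * config["judges_per_item"]
                 - 0.05 * config["temperature"], 4)

for setting, values in SWEEP.items():
    scores = [run_benchmark({**DEFAULTS, setting: v}) for v in values]
    spread = max(scores) - min(scores)
    verdict = "matters" if spread > 0.02 else "noise"
    print(f"{setting:<20} spread={spread:.3f} -> {verdict}")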

Validation

Fidelity

How close are automated results to your gold-standard labels? Measures agreement with statistical confidence so you know the margin.
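
One way to report that agreement is chance-corrected kappa with a bootstrap interval, sketched below on made-up verdicts. The labels, sample size, and settings are invented; the point is that the margin of error ships alongside the number.

fidelity_kappa.py
import random
from collections import Counter

def cohen_kappa(auto: list[str], gold: list[str]) -> float:
    """Chance-corrected agreement between automated verdicts and gold labels."""
    n = len(gold)
    observed = sum(a == g for a, g in zip(auto, gold)) / n
    pa, pg = Counter(auto), Counter(gold)
    expected = sum(pa[k] * pg[k] for k in pa.keys() | pg.keys()) / (n * n)
    return (observed - expected) / (1 - expected)

def bootstrap_ci(auto, gold, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for kappa, so the margin is explicit."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([auto[i] for i in idx], [gold[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * reps)], stats[int((1 - alpha / 2) * reps) - 1]

# Made-up verdicts purely to exercise the functions.
gold = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"] * 10
auto = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"] * 10
low, high = bootstrap_ci(auto, gold)
print(f"kappa = {cohen_kappa(auto, gold):.2f}  (95% CI {low:.2f} to {high:.2f})")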

Robustness

Stability

Could a small config change flip your model rankings? If so, the benchmark is fragile — and Rensei will tell you before anyone else finds out.
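
A crude stability probe, reusing the same kind of toy scores as the ranking sketch above: re-rank under every combination of a few plausible settings and report whether the leaderboard order survives. In a real pipeline each configuration would re-run the judging stage; every name and number here is illustrative.

stability_check.py
from itertools import product

SCORES = {
    "Model X": [0.92, 0.91, 0.55, 0.56, 0.58],
    "Model Y": [0.95, 0.93, 0.94, 0.40, 0.42],
}

def leaderboard(threshold: float, drop_worst: int) -> tuple[str, ...]:
    """Order models by how many items clear the threshold after dropping the worst few."""
    return tuple(sorted(
        SCORES,
        key=lambda m: sum(s >= threshold for s in sorted(SCORES[m])[drop_worst:]),
        reverse=True,
    ))

orders = {leaderboard(t, d) for t, d in product([0.5, 0.7, 0.9], [0, 1])}
if len(orders) > 1:
    print("FRAGILE: rankings depend on configuration:", orders)
else:
    print("STABLE: one ranking across the whole sweep:", orders.pop())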

Research Brief

Built on peer-reviewed methodology

Furiwake’s evaluation framework draws on published research in inter-rater reliability, ensemble consensus, and sensitivity analysis. The technical brief details the statistical foundations behind ensemble judging, policy parameterisation, and ranking stability — with citations and worked examples.

Ready to build defensible benchmarks?

See how Rensei can make your evaluation pipeline transparent, reproducible, and stress-tested.