Evaluation Engine
Benchmarks you can defend.
Multi-model ensemble judging. Formal policy documents. Sensitivity analysis. Reproducible experiments. Every step auditable, every result defensible.
Multi-model ensemble judging
A rotating panel of frontier and open-weight models — selected per task domain — so no single judge controls the outcome.
Frontier generalist
Broad reasoning across domains
Nuanced analyst
Deep contextual understanding
Multilingual evaluator
Cross-language consistency
Open-weight calibrator
Bias-check against proprietary models
Built-in consensus checks automatically flag items where judge agreement is weak, routing them to human review.
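A minimal sketch of how a consensus check like this can work, assuming simple majority-share agreement. The judge names, verdicts, and the 0.75 threshold below are illustrative placeholders, not Rensei's actual panel or configuration.

```python
from collections import Counter

def flag_low_agreement(items, min_agreement=0.75):
    """Flag items whose judge panel disagrees too much for automatic scoring.

    `items` maps an item id to the verdict returned by each judge.
    Agreement is the share of judges voting with the majority; anything
    below `min_agreement` is routed to human review.
    """
    needs_review = []
    for item_id, verdicts in items.items():
        counts = Counter(verdicts.values())
        majority_share = counts.most_common(1)[0][1] / len(verdicts)
        if majority_share < min_agreement:
            needs_review.append((item_id, dict(counts)))
    return needs_review

# Illustrative panel: the third judge dissents on q-002, so it is escalated.
panel = {
    "q-001": {"frontier": "pass", "analyst": "pass", "calibrator": "pass"},
    "q-002": {"frontier": "pass", "analyst": "fail", "calibrator": "fail"},
}
print(flag_low_agreement(panel, min_agreement=0.75))
# -> [('q-002', {'pass': 1, 'fail': 2})]
```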
Policy as code
Every judging rule is written down, versioned, and machine-readable. No hidden settings. No “we eyeballed it.”
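As a sketch of what "policy as code" can mean in practice, the snippet below models a judging policy as a versioned, serialisable record that lives in version control alongside the results it produced. The field names and values are assumptions for illustration, not Rensei's actual policy schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class JudgingPolicy:
    """One versioned, machine-readable record of every judging rule."""
    version: str
    task_domain: str                 # which domain this panel is rotated in for
    judges: tuple[str, ...]          # panel members that score this domain
    min_agreement: float             # consensus threshold before auto-accepting
    tie_break: str                   # what happens when the panel splits evenly
    escalate_to_human: bool          # route low-agreement items to review

# Illustrative values only; the point is that nothing about the rules is hidden.
policy_v3 = JudgingPolicy(
    version="3.2.0",
    task_domain="multilingual-qa",
    judges=("frontier-generalist", "nuanced-analyst", "open-weight-calibrator"),
    min_agreement=0.75,
    tie_break="escalate",
    escalate_to_human=True,
)

# Serialise the exact rules so they can ship with every published benchmark run.
print(json.dumps(asdict(policy_v3), indent=2))
```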
See how hidden settings flip rankings
Drag the slider to change the agreement strictness. Watch how model rankings shift — with the exact same underlying data.
Interactive Demo
Ranking Inversion
Drag the slider to change agreement strictness and watch the model rankings shift.
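The inversion the demo shows can be reproduced in a few lines: score two models from identical judge data under a loose and a strict agreement threshold, and the ranking flips. The models, verdicts, and agreement figures below are made-up illustrative data, not real results.

```python
def score(model_items, min_agreement):
    """Count an item as a pass only if the panel's agreement clears the bar."""
    passes = sum(
        1 for verdict, agreement in model_items
        if verdict == "pass" and agreement >= min_agreement
    )
    return passes / len(model_items)

# (verdict, judge-agreement) per item: illustrative data only.
results = {
    "model_a": [("pass", 0.95), ("pass", 0.60), ("pass", 0.55), ("fail", 0.90)],
    "model_b": [("pass", 0.90), ("pass", 0.85), ("fail", 0.95), ("fail", 0.80)],
}

for threshold in (0.5, 0.8):   # loose vs. strict agreement requirement
    ranking = sorted(results, key=lambda m: score(results[m], threshold), reverse=True)
    print(threshold, ranking)
# 0.5 -> model_a leads (3/4 vs 2/4); 0.8 -> model_b leads (1/4 vs 2/4).
```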
Three stress tests, one answer
Before you publish a benchmark, Rensei answers the three questions that determine whether anyone should trust it.
Sensitivity
How sensitive are your results to pipeline configuration? Identifies which settings matter and which are noise.
Fidelity
How close are automated results to your gold-standard labels? Measures agreement with statistical confidence so you know the margin (see the sketch after these three checks).
Stability
Could a small config change flip your model rankings? If so, the benchmark is fragile — and Rensei will tell you before anyone else finds out.
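A rough sketch of what a fidelity check can look like under simple assumptions: exact-match agreement between automated verdicts and gold labels, with a bootstrap interval for the margin. The labels, sample size, and resampling count below are illustrative only.

```python
import random

def fidelity(auto_labels, gold_labels, n_boot=2000, seed=0):
    """Agreement between automated verdicts and gold labels, with a 95% bootstrap CI."""
    assert len(auto_labels) == len(gold_labels)
    matches = [a == g for a, g in zip(auto_labels, gold_labels)]
    point = sum(matches) / len(matches)

    rng = random.Random(seed)
    resampled = []
    for _ in range(n_boot):
        sample = [rng.choice(matches) for _ in matches]
        resampled.append(sum(sample) / len(sample))
    resampled.sort()
    low = resampled[int(0.025 * n_boot)]
    high = resampled[int(0.975 * n_boot)]
    return point, (low, high)

# Illustrative labels only: 20 items, automated pipeline vs. human gold standard.
gold = ["pass"] * 12 + ["fail"] * 8
auto = ["pass"] * 11 + ["fail"] * 1 + ["fail"] * 7 + ["pass"] * 1
agreement, (low, high) = fidelity(auto, gold)
print(f"agreement {agreement:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```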
Research Brief
Built on peer-reviewed methodology
Rensei’s evaluation framework draws on published research in inter-rater reliability, ensemble consensus, and sensitivity analysis. The technical brief details the statistical foundations behind ensemble judging, policy parameterisation, and ranking stability, with citations and worked examples.
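One widely used measure from the inter-rater reliability literature, for example, is Cohen's kappa, which corrects raw agreement between two judges for the agreement expected by chance:

\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\]

where \(p_o\) is the observed agreement and \(p_e\) is the chance agreement implied by each judge's label frequencies. A kappa near 1 means the judges agree far beyond chance; a kappa near 0 means the apparent agreement is mostly luck.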
Ready to build defensible benchmarks?
See how Rensei can make your evaluation pipeline transparent, reproducible, and stress-tested.