LLM Bench Marker
AI Utility

Project Summary
A benchmarking tool that runs multi‑model sweeps on curated datasets with fixed prompts to identify the best cost/quality trade‑offs.
It includes a live config editor with a read‑only YAML preview for reproducible runs, plus model toggles with pricing and throughput hints.
Each run logs tokens, latency, and quality scores per prompt; the tool compares models side‑by‑side and highlights the most suitable option for a target budget or score.
Reports export to CSV/JSON with a table view, and a log inspector shows pretty JSON alongside parsed model response fields. A single‑model measurement mode supports quick spot checks.
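The run configuration described above might look like the following sketch. The field names (`models`, `dataset`, `scoring`, etc.) are illustrative assumptions, not the tool's actual schema:

```yaml
# Hypothetical sweep config -- field names are illustrative
run_id: sweep-2024-support-prompts
dataset:
  name: support-prompts
  version: v3            # versioned dataset for reproducibility
models:
  - id: openai/gpt-4o-mini
    enabled: true
  - id: anthropic/claude-3-haiku
    enabled: true
scoring:
  method: rubric
  normalize_per_dataset: true
export:
  formats: [csv, json]
```

Keeping the config in one file is what makes a run reproducible: re-running the same YAML against the same dataset version should yield comparable logs.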
Case Study
Overview
Built a repeatable evaluation pipeline to compare LLM providers on real production prompts, making model selection faster and less subjective.
Problem
Model selection was inconsistent and slow. Ad-hoc tests used different prompts, lacked versioning, and made it hard to compare cost, latency, and quality across providers.
Goals
- Reduce model evaluation time by >60% per selection cycle.
- Ensure runs are reproducible with versioned datasets + prompts.
- Support ≥10 models per sweep without manual tuning.
- Capture tokens, latency, and quality scores for every run.
- Produce exportable reports for product and engineering reviews.
Approach
- Chose OpenRouter as the primary router to avoid per-provider SDK sprawl and normalize rate limits, accepting less direct control over model-specific quirks.
- Kept reporting to CSV/JSON so stakeholders could slice data in their own tools without waiting for a bespoke dashboard.
- Used a rubric-based scoring pass with normalization per dataset to reduce model-family bias, then cross-checked scores on a small blind sample.
- Made the YAML config the source of truth so every run is auditable and reproducible, with the UI acting as a structured editor.
- Optimized for repeatable, batch-friendly sweeps rather than live inference to keep costs predictable and runs auditable.
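The per-dataset normalization step mentioned above can be sketched as a z-score pass over rubric scores, so models evaluated on harsher datasets are not penalized. This is a minimal illustration, not the project's actual scoring code; the row fields are assumptions:

```python
from statistics import mean, pstdev

def normalize_scores(rows):
    """Add a per-dataset z-score to each scored row.

    rows: list of dicts with 'dataset', 'model', and 'score' keys
    (hypothetical field names). Scores are normalized within each
    dataset so cross-dataset comparisons reflect relative quality.
    """
    by_dataset = {}
    for r in rows:
        by_dataset.setdefault(r["dataset"], []).append(r["score"])

    out = []
    for r in rows:
        scores = by_dataset[r["dataset"]]
        mu, sigma = mean(scores), pstdev(scores)
        # Degenerate case: one score or identical scores -> z of 0
        z = 0.0 if sigma == 0 else (r["score"] - mu) / sigma
        out.append({**r, "z": z})
    return out
```

A blind cross-check on a small sample, as described above, then guards against the normalization itself hiding a model-family bias.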
Solution
A benchmarking utility with a web UI that includes a versioned dataset registry, a parallel sweep runner, a scoring module, a YAML config editor, CSV/JSON exports, a report table, and a log inspector.
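The parallel sweep runner could be structured roughly as below: fan (model, prompt) pairs out over a thread pool and record latency per call. `call_model` is a stand-in for the real provider call, and the result fields are assumptions:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def call_model(model, prompt):
    # Stand-in for a real provider/router call.
    time.sleep(0.01)
    return {"text": f"{model} -> {prompt}", "tokens": len(prompt.split())}

def timed_call(model, prompt):
    """Wrap a model call and measure wall-clock latency."""
    start = time.perf_counter()
    resp = call_model(model, prompt)
    return {
        "model": model,
        "prompt": prompt,
        "latency_s": time.perf_counter() - start,
        **resp,
    }

def run_sweep(models, prompts, max_workers=8):
    """Run every (model, prompt) pair concurrently; collect per-call logs."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(timed_call, m, p)
            for m in models
            for p in prompts
        ]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results
```

Batching calls this way keeps sweeps repeatable: the same config produces the same set of (model, prompt) pairs, and each row carries its own latency and token counts for the report table.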
Outcomes
- Cut evaluation cycles from ~2 days to ~8–12 hours across 8 recorded sweeps.
- Enabled 10–12 model sweeps over 3 datasets with consistent scoring and repeatable run IDs.
- Reduced log triage from hours to ~30–45 minutes using the JSON inspector and parsed response view.
- Delivered 5 decision-ready reports used in product and engineering reviews.
Challenges
- Keeping prompts deterministic while maintaining realistic outputs.
- Balancing cost constraints with enough coverage for confidence.
- Normalizing quality scores across model families.