Evaluation
Observability, prompt testing, and quality scoring.
48 tools
Evaluation is the discipline AI product teams underinvest in most. Choosing an eval tool early is much cheaper than retrofitting one after an LLM regression has already hit production.
This category spans full eval + observability platforms (Braintrust, LangSmith), prompt management (Humanloop, PromptLayer), general ML experiment tracking with LLM features (Weights & Biases), and proxy-based observability (Helicone).
Pick Braintrust or LangSmith for full eval + observability. Pick Humanloop if PMs need to edit prompts. Pick Helicone for a one-line install on existing OpenAI or Claude API code. Pick Patronus for automated hallucination and safety evals at scale.
LLM Stats
Live leaderboard and side-by-side comparison hub for 300+ frontier LLMs across reasoning, coding, and multimodal benchmarks.
LLMEval
Open academic benchmark suite for stress-testing LLMs on contamination-resistant, domain-specific tasks.
LangFast
No-signup LLM playground for testing, comparing, and versioning prompts against your own API keys.
Langfuse
Open-source LLM observability, prompt management, and evaluation in one platform.
LiveBench
Contamination-free LLM benchmark that refreshes its questions monthly to keep frontier models honest.
MLflow
Open-source platform for tracking, evaluating, and deploying ML models and LLM applications.
MathEval
Holistic benchmark suite for evaluating mathematical reasoning in large language models.
Maxim AI
End-to-end evaluation, simulation, and observability platform for shipping production-grade AI agents.
MixEval
Dynamic LLM benchmark that mixes web queries with existing datasets to mirror Chatbot Arena rankings at a fraction of the cost.
OlympicArena
Olympiad-level multi-discipline benchmark for stress-testing reasoning in LLMs and multimodal models.
OpenAI Evals
OpenAI's open-source framework for benchmarking LLMs against a shared registry of evaluations.
Opik
Open-source LLM observability and evaluation platform for debugging and monitoring AI agents in production.
Parea AI
LLM evaluation, observability, and prompt management platform for teams shipping production AI apps.
Phoenix
Open-source LLM and agent observability platform with tracing, evals, and experimentation built on OpenTelemetry.
Prompt Foundry
Prompt management and side-by-side LLM evaluation for OpenAI and Anthropic models.
Promptfoo
Open-source eval and red-teaming framework for LLM apps, prompts, and RAG pipelines.
Respan (formerly Keywords AI)
LLM engineering platform combining a multi-model gateway with tracing, evals, and prompt management.
SEAL Leaderboard
Private, expert-graded leaderboards from Scale AI that rank frontier LLMs on domains that contaminated public benchmarks can no longer measure.
Superwise
Agentic management platform for runtime guardrails, policy enforcement, and observability across LLM agents.
TruLens
Open-source evaluation and tracing framework for LLM apps and agents, built on OpenTelemetry.
VisualWebArena
Open benchmark for evaluating multimodal web agents on realistic visual browsing tasks.
W&B Weave
Production observability, tracing, and evaluation for LLM and agent systems from the Weights & Biases stack.
Weco AI
Autoresearch engine that iteratively rewrites code to optimize against a numeric evaluation metric.
llmfit
Terminal tool that scores hundreds of open LLMs against your actual CPU, RAM, and GPU and tells you which ones will run well.