Evaluation

Observability, prompt testing, and quality scoring.

48 tools

Why it matters

Evaluation is the discipline AI product teams underinvest in most. Choosing an eval tool early is much cheaper than retrofitting one after an LLM regression hits production.

What's in here

Spans full eval + observability platforms (Braintrust, LangSmith), prompt management (Humanloop, PromptLayer), ML-broad tracking with LLM features (Weights & Biases), and proxy-based observability (Helicone).

How to pick

Pick Braintrust or LangSmith for full eval + observability. Pick Humanloop if PMs need to edit prompts. Pick Helicone for a one-line install on existing OpenAI/Claude code. Pick Patronus for automated hallucination/safety evals at scale.
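To make "one-line install" concrete, here is a minimal sketch of what proxy-based observability changes in existing client code: the base URL and one extra header, nothing else. `ClientConfig` and `route_through_proxy` are hypothetical names used only for illustration, and the keys are placeholders. The proxy URL and the `Helicone-Auth` header reflect Helicone's OpenAI proxy integration as commonly documented; treat both as assumptions and confirm them against the current Helicone docs before relying on them.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ClientConfig:
    """Hypothetical stand-in for the settings an LLM SDK client is built from."""
    base_url: str
    api_key: str
    headers: dict = field(default_factory=dict)


def route_through_proxy(cfg: ClientConfig, proxy_url: str, proxy_key: str) -> ClientConfig:
    """Return a copy of cfg whose traffic goes through an observability proxy.

    Only the base URL and one auth header change; the provider API key and
    the rest of the application code stay as they were.
    """
    # Assumed header name; verify against the proxy vendor's documentation.
    extra = {"Helicone-Auth": f"Bearer {proxy_key}"}
    return replace(cfg, base_url=proxy_url, headers={**cfg.headers, **extra})


# Placeholder credentials, not real keys.
direct = ClientConfig(base_url="https://api.openai.com/v1", api_key="sk-placeholder")
proxied = route_through_proxy(
    direct,
    proxy_url="https://oai.helicone.ai/v1",  # assumed proxy endpoint
    proxy_key="helicone-placeholder",
)
```

In a real application the same two values would typically be passed to the SDK client's base-URL and default-headers options, which is why this style of tool is quick to add to code that already works.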

LLM Stats

Evaluation · Multi-model

Live leaderboard and side-by-side comparison hub for 300+ frontier LLMs across reasoning, coding, and multimodal benchmarks.

Free · Free to browse; underlying model usage billed by each provider · model-comparison, benchmark-tracking

LLMEval

Evaluation · Multi-model

Open academic benchmark suite for stress-testing LLMs on contamination-resistant, domain-specific tasks.

Free · Free; open-source academic benchmarks · llm-benchmarking, academic-evaluation

LangFast

Evaluation · Multi-model

No-signup LLM playground for testing, comparing, and versioning prompts against your own API keys.

Paid · One-time lifetime ~$60-$120; 14-day money-back · prompt-testing, prompt-versioning

Langfuse

Evaluation · Model-agnostic

Open-source LLM observability, prompt management, and evaluation in one platform.

Freemium · Free self-host & Hobby tier; Core $29/mo, Pro $199/mo, Enterprise $2,499/mo · llm-observability, prompt-management

LiveBench

Evaluation · Multi-model

Contamination-free LLM benchmark that refreshes its questions monthly to keep frontier models honest.

Free · Free and open source; self-hosted evaluation runner · llm-benchmarking, model-selection

MLflow

Evaluation · Multi-model

Open-source platform for tracking, evaluating, and deploying ML models and LLM applications.

Free · Free and open source (Apache 2.0); managed offering via Databricks · llm-evaluation, experiment-tracking

MathEval

Evaluation · GPT-4 grader / DeepSeek-LLM-7B verifier

Holistic benchmark suite for evaluating mathematical reasoning in large language models.

Free · Free; open-source benchmark with leaderboard submissions via matheval.ai · llm-math-benchmarking, model-leaderboards

Maxim AI

Evaluation · Multi-model

End-to-end evaluation, simulation, and observability platform for shipping production-grade AI agents.

Freemium · Free tier; 14-day trial on paid plans; custom enterprise pricing · agent-evaluation, llm-observability

MixEval

Evaluation

Dynamic LLM benchmark that mixes web queries with existing datasets to mirror Chatbot Arena rankings at a fraction of the cost.

Free · Free and open source · llm-benchmarking, model-ranking

OlympicArena

Evaluation

Olympiad-level multi-discipline benchmark for stress-testing reasoning in LLMs and multimodal models.

Free · Free, open-source research benchmark · llm-evaluation, multimodal-eval

OpenAI Evals

Evaluation · OpenAI GPT models (extensible)

OpenAI's open-source framework for benchmarking LLMs against a shared registry of evaluations.

Free · Free (MIT); you pay OpenAI API costs for eval runs · llm-benchmarking, regression-testing

Opik

Evaluation · Multi-model

Open-source LLM observability and evaluation platform for debugging and monitoring AI agents in production.

Freemium · Free open-source self-host; free Cloud tier (no card); Enterprise contact sales · llm-tracing, agent-evaluation

Parea AI

Evaluation · Multi-model

LLM evaluation, observability, and prompt management platform for teams shipping production AI apps.

Freemium · Free (2 seats, 3k logs/mo); Team $150/mo; Enterprise custom · llm-evaluation, prompt-management

Phoenix

Evaluation · Multi-model

Open-source LLM and agent observability platform with tracing, evals, and experimentation built on OpenTelemetry.

Freemium · Open source (ELv2) + free Phoenix Cloud; paid Arize AX for enterprise · llm-tracing, agent-debugging

Prompt Foundry

Evaluation · OpenAI + Anthropic (multi-model)

Prompt management and side-by-side LLM evaluation for OpenAI and Anthropic models.

Freemium · Free tier (10 prompts, 500 evals/mo); Pro $15/user/mo; Enterprise custom · prompt-management, model-comparison

Promptfoo

Evaluation · Multi-model

Open-source eval and red-teaming framework for LLM apps, prompts, and RAG pipelines.

Freemium · Open-source free; Enterprise SaaS contact sales · llm-evals, red-teaming

Respan (formerly Keywords AI)

Evaluation · Multi-model (500+ via gateway)

LLM engineering platform combining a multi-model gateway with tracing, evals, and prompt management.

Freemium · Free tier; paid plans (pricing not public); enterprise on request · llm-observability, prompt-management

SEAL Leaderboard

Evaluation · Multi-model (GPT, Claude, Gemini, Llama, etc.)

Private, expert-graded leaderboards from Scale AI that rank frontier LLMs on domains contaminated public benchmarks can no longer measure.

Free · Free to view; paid custom evals via Scale enterprise sales · model-selection, benchmark-tracking

Superwise

Evaluation · Multi-model

Agentic management platform for runtime guardrails, policy enforcement, and observability across LLM agents.

Freemium · Free Starter Edition; paid tiers via sales · llm-guardrails, ai-governance

TruLens

Evaluation · Multi-model (LLM-as-judge)

Open-source evaluation and tracing framework for LLM apps and agents, built on OpenTelemetry.

Free · Free, open source (Apache-licensed Python package) · llm-evaluation, rag-evaluation

VisualWebArena

Evaluation · Model-agnostic (GPT-4V, Gemini, Claude, open VLMs)

Open benchmark for evaluating multimodal web agents on realistic visual browsing tasks.

Free · Free and open source (MIT-style research release) · multimodal-agent-eval, web-browsing-benchmark

W&B Weave

Evaluation · Multi-model

Production observability, tracing, and evaluation for LLM and agent systems from the Weights & Biases stack.

Freemium · Free tier available; paid and enterprise plans via W&B · llm-tracing, agent-observability

Weco AI

Evaluation · Multi-model (LLM + AIDE tree search)

Autoresearch engine that iteratively rewrites code to optimize against a numeric evaluation metric.

Freemium · Open-source CLI; hosted/commercial pricing not published · code-optimization, gpu-kernel-tuning

llmfit

Evaluation · Multi-model

Terminal tool that scores hundreds of open LLMs against your actual CPU, RAM, and GPU and tells you which ones will run well.

Free · Free, MIT-licensed · local-llm-selection, hardware-benchmarking