📖 The AI Tool Bible

Evaluation

Observability, prompt testing, and quality scoring.

48 tools

Why it matters

Evaluation is the discipline AI product teams underinvest in most. Adopting an eval tool early is much cheaper than retrofitting one after an LLM regression reaches production.

What's in here

This category spans full eval + observability platforms (Braintrust, LangSmith), prompt management (Humanloop, PromptLayer), general ML experiment tracking with LLM features (Weights & Biases), and proxy-based observability (Helicone).

How to pick

Pick Braintrust or LangSmith for full eval + observability. Pick Humanloop if PMs need to edit prompts. Pick Helicone for a one-line install on existing OpenAI/Claude code. Pick Patronus for automated hallucination/safety evals at scale.

Braintrust

Evaluation · Platform (any LLM)

Eval, monitor, and improve AI products end-to-end.

Freemium · Free up to 1k events/day; team from $249/mo

Tags: evals, monitoring

LangSmith

Evaluation · Platform (any LLM)

LangChain's eval + observability platform.

Freemium · Free starter; Plus $39/mo per seat

Tags: LLM tracing, evals

Weights & Biases

Evaluation · Platform (any LLM)

The ML experiment tracker, now with LLM eval features.

Freemium · Free personal; team from $50/mo per seat

Tags: ML experiments, LLM eval

Helicone

Evaluation · Platform (any LLM)

Open-source LLM observability — one-line proxy install.

Freemium · Free 100k req/mo; Pro from $25/mo

Tags: observability, cost tracking

Humanloop

Evaluation · Platform (any LLM)

Prompt management + evals for collaborative AI teams.

Paid · Team plan from $200/mo

Tags: prompt management, team collab

PromptLayer

Evaluation · Platform (any LLM)

Lightweight prompt logging + management for OpenAI/Claude apps.

Freemium · Free; Pro from $50/mo

Tags: prompt logging, versioning

Patronus

Evaluation · Platform (any LLM)

Automated LLM evaluation for hallucinations, safety, and quality.

Paid · Enterprise / contact sales

Tags: hallucination detection, safety

Agenta

Evaluation · Multi-model

Open-source LLMOps platform for prompt engineering, evaluation, and observability in one workspace.

Freemium · Open-source self-host free; managed cloud has free tier plus paid plans

Tags: prompt-engineering, llm-evaluation

AlpacaEval

Evaluation · GPT-4 Preview (Nov 2023) as annotator

Automatic LLM evaluator and leaderboard that benchmarks instruction-following with length-controlled win rates.

Free · Free and open-source; pay only for the underlying OpenAI annotator API calls

Tags: llm-benchmarking, instruction-following eval

Arena AI

Evaluation · Multi-model

Head-to-head LLM battle arena with a public leaderboard for ranking AI models.

Free · Free to use; no public paid tier listed

Tags: llm-benchmarking, model-comparison

Arize AI

Evaluation · Multi-model

Enterprise observability and evaluation platform for LLM agents and generative AI applications.

Freemium · Free tier and OSS Phoenix; paid/enterprise tiers via sales

Tags: llm-observability, agent-evaluation

Arthur

Evaluation · Multi-model

Open-source toolkit for testing, tracing, and monitoring production AI agents.

Freemium · Open-source (MIT) + free SaaS tier; paid/enterprise plans on request

Tags: agent-evaluation, prompt-management

Artificial Analysis

Evaluation · Multi-model

Independent benchmarking platform comparing AI models and inference providers across intelligence, speed, and cost.

Freemium · Free public leaderboards; paid plans for expanded data and reports (contact for pricing)

Tags: model-benchmarking, provider-comparison

Athina AI

Evaluation · Multi-model

Collaborative LLM evaluation and observability platform for teams shipping AI features to production.

Freemium · Starter free (10k logs/mo); Pro & Enterprise custom

Tags: llm-evaluation, prompt-management

Berkeley Function-Calling Leaderboard

Evaluation · Multi-model

Open benchmark from UC Berkeley that ranks LLMs on real-world tool-use and function-calling accuracy.

Free · Free and open source; you pay only for inference when reproducing runs.

Tags: function-calling eval, tool-use benchmarking

Cleanlab TLM

Evaluation · Multi-model (wraps any LLM)

Trustworthiness scoring layer that flags LLM hallucinations in real time.

Freemium · Free tier for evaluation; usage-based API pricing; enterprise/private deployment via sales

Tags: hallucination-detection, rag-evaluation

CompassRank

Evaluation · Multi-model

Public leaderboard from the OpenCompass project ranking open and closed LLMs across 100+ benchmarks.

Free · Free leaderboard; OpenCompass toolkit is Apache 2.0 open source

Tags: llm-benchmarking, model-selection

Fiddler AI

Evaluation · Fiddler Centor (proprietary evaluators)

Enterprise AI observability and guardrails platform for monitoring agents, LLMs, and ML models in production.

Enterprise · Tiered plans; contact sales

Tags: llm-observability, agent-monitoring

Giskard

Evaluation · Multi-model

Continuous AI red teaming platform that stress-tests LLM agents for vulnerabilities before they hit production.

Freemium · Open-source free tier; Giskard Hub enterprise pricing on request

Tags: llm-red-teaming, agent-security-testing

Great Expectations

Open-source data quality framework for validating the datasets that feed your ML and analytics pipelines.

Freemium · GX Core free (Apache 2.0); GX Cloud paid tiers, contact sales

Tags: data-validation, pipeline-testing

HoneyHive

Evaluation · Multi-model

OpenTelemetry-native observability and evaluation platform for LLM agents in production.

Freemium · Free tier available; paid/enterprise tiers via sales

Tags: agent-observability, llm-evaluation

InfiBench

Stack Overflow-derived benchmark for evaluating code LLMs on real-world programming questions.

Free · Free and open source (CC BY-SA 4.0)

Tags: code-llm-eval, model-benchmarking

Inspect AI

Evaluation · Multi-model

Open-source LLM evaluation framework from the UK AI Security Institute with 200+ built-in benchmarks.

Free · Free and open source (MIT-style license); you pay only for underlying model API usage.

Tags: llm-benchmarking, agent-evaluation

Kiln AI

Evaluation · Multi-model

Open-source workbench for building, evaluating, and fine-tuning AI agents across 190+ models.

Freemium · Free Individual tier; Team (request access); Enterprise (custom)

Tags: llm-evaluation, fine-tuning