📖 The AI Tool Bible

Best AI tools for observability

21 tools in the Evaluation category, filtered to observability.


Braintrust

Featured
Evaluation · Platform (any LLM)
8.9

Eval, monitor, and improve AI products end-to-end.

Freemium · Free up to 1k events/day; team from $249/mo · evals, monitoring

LangSmith

Evaluation · Platform (any LLM)
8.7

LangChain's eval + observability platform.

Freemium · Free starter; Plus $39/mo per seat · LLM tracing, evals

Helicone

Evaluation · Platform (any LLM)
8.3

Open-source LLM observability — one-line proxy install.

Freemium · Free 100k req/mo; Pro from $25/mo · observability, cost tracking

PromptLayer

Evaluation · Platform (any LLM)
7.9

Lightweight prompt logging + management for OpenAI/Claude apps.

Freemium · Free; Pro from $50/mo · prompt logging, versioning

Agenta

Evaluation · Multi-model

Open-source LLMOps platform for prompt engineering, evaluation, and observability in one workspace.

Freemium · Open-source self-host free; managed cloud has free tier plus paid plans · prompt-engineering, llm-evaluation

Arize AI

Evaluation · Multi-model

Enterprise observability and evaluation platform for LLM agents and generative AI applications.

Freemium · Free tier and OSS Phoenix; paid/enterprise tiers via sales · llm-observability, agent-evaluation

Arthur

Evaluation · Multi-model

Open-source toolkit for testing, tracing, and monitoring production AI agents.

Freemium · Open-source (MIT) + free SaaS tier; paid/enterprise plans on request · agent-evaluation, prompt-management

Athina AI

Evaluation · Multi-model

Collaborative LLM evaluation and observability platform for teams shipping AI features to production.

Freemium · Starter free (10k logs/mo); Pro & Enterprise custom · llm-evaluation, prompt-management

Fiddler AI

Evaluation · Fiddler Centor (proprietary evaluators)

Enterprise AI observability and guardrails platform for monitoring agents, LLMs, and ML models in production.

Enterprise · Tiered plans; contact sales · llm-observability, agent-monitoring

HoneyHive

Evaluation · Multi-model

OpenTelemetry-native observability and evaluation platform for LLM agents in production.

Freemium · Free tier available; paid/enterprise tiers via sales · agent-observability, llm-evaluation

LangFast

Evaluation · Multi-model

No-signup LLM playground for testing, comparing, and versioning prompts against your own API keys.

Paid · One-time lifetime ~$60-$120; 14-day money-back guarantee · prompt-testing, prompt-versioning

Langfuse

Evaluation · Model-agnostic

Open-source LLM observability, prompt management, and evaluation in one platform.

Freemium · Free self-host & Hobby tier; Core $29/mo, Pro $199/mo, Enterprise $2,499/mo · llm-observability, prompt-management

MLflow

Evaluation · Multi-model

Open-source platform for tracking, evaluating, and deploying ML models and LLM applications.

Free · Free and open source (Apache 2.0); managed offering via Databricks · llm-evaluation, experiment-tracking

Maxim AI

Evaluation · Multi-model

End-to-end evaluation, simulation, and observability platform for shipping production-grade AI agents.

Freemium · Free tier; 14-day trial on paid plans; custom enterprise pricing · agent-evaluation, llm-observability

Opik

Evaluation · Multi-model

Open-source LLM observability and evaluation platform for debugging and monitoring AI agents in production.

Freemium · Free open-source self-host; free Cloud tier (no card); Enterprise via sales · llm-tracing, agent-evaluation

Parea AI

Evaluation · Multi-model

LLM evaluation, observability, and prompt management platform for teams shipping production AI apps.

Freemium · Free (2 seats, 3k logs/mo); Team $150/mo; Enterprise custom · llm-evaluation, prompt-management

Phoenix

Evaluation · Multi-model

Open-source LLM and agent observability platform with tracing, evals, and experimentation built on OpenTelemetry.

Freemium · Open source (ELv2) + free Phoenix Cloud; paid Arize AX for enterprise · llm-tracing, agent-debugging

Respan (formerly Keywords AI)

Evaluation · Multi-model (500+ via gateway)

LLM engineering platform combining a multi-model gateway with tracing, evals, and prompt management.

Freemium · Free tier; paid plans (pricing not public); enterprise on request · llm-observability, prompt-management

Superwise

Evaluation · Multi-model

Agentic management platform for runtime guardrails, policy enforcement, and observability across LLM agents.

Freemium · Free Starter Edition; paid tiers via sales · llm-guardrails, ai-governance

TruLens

Evaluation · Multi-model (LLM-as-judge)

Open-source evaluation and tracing framework for LLM apps and agents, built on OpenTelemetry.

Free · Free, open source (Apache-licensed Python package) · llm-evaluation, rag-evaluation

W&B Weave

Evaluation · Multi-model

Production observability, tracing, and evaluation for LLM and agent systems from the Weights & Biases stack.

Freemium · Free tier available; paid and enterprise plans via W&B · llm-tracing, agent-observability