📖 The AI Tool Bible

Best AI tools for prompt management

24 tools in the Evaluation category, filtered to prompt management.

Braintrust

Featured
Evaluation · Platform (any LLM)
8.9

Eval, monitor, and improve AI products end-to-end.

Freemium · Free up to 1k events/day; team from $249/mo · evals, monitoring

Weights & Biases

Evaluation · Platform (any LLM)
8.4

The ML experiment tracker, now with LLM eval features.

Freemium · Free personal; team from $50/mo per seat · ML experiments, LLM eval

Humanloop

Evaluation · Platform (any LLM)
8.2

Prompt management + evals for collaborative AI teams.

Paid · From $200/mo team · prompt management, team collab

PromptLayer

Evaluation · Platform (any LLM)
7.9

Lightweight prompt logging + management for OpenAI/Claude apps.

Freemium · Free; Pro from $50/mo · prompt logging, versioning

Agenta

Evaluation · Multi-model

Open-source LLMOps platform for prompt engineering, evaluation, and observability in one workspace.

Freemium · Open-source self-host free; managed cloud has free tier plus paid plans · prompt-engineering, llm-evaluation

Arize AI

Evaluation · Multi-model

Enterprise observability and evaluation platform for LLM agents and generative AI applications.

Freemium · Free tier and OSS Phoenix; paid/enterprise tiers via sales · llm-observability, agent-evaluation

Arthur

Evaluation · Multi-model

Open-source toolkit for testing, tracing, and monitoring production AI agents.

Freemium · Open-source (MIT) + free SaaS tier; paid/enterprise plans on request · agent-evaluation, prompt-management

Athina AI

Evaluation · Multi-model

Collaborative LLM evaluation and observability platform for teams shipping AI features to production.

Freemium · Starter free (10k logs/mo); Pro & Enterprise custom · llm-evaluation, prompt-management

Giskard

Evaluation · Multi-model

Continuous AI red teaming platform that stress-tests LLM agents for vulnerabilities before they hit production.

Freemium · Open-source free tier; Giskard Hub enterprise pricing on request · llm-red-teaming, agent-security-testing

Kiln AI

Evaluation · Multi-model

Open-source workbench for building, evaluating, and fine-tuning AI agents across 190+ models.

Freemium · Free Individual tier; Team (request access); Enterprise (custom) · llm-evaluation, fine-tuning

LangFast

Evaluation · Multi-model

No-signup LLM playground for testing, comparing, and versioning prompts against your own API keys.

Paid · One-time lifetime license, roughly $60 to $120; 14-day money-back guarantee · prompt-testing, prompt-versioning

Langfuse

Evaluation · Model-agnostic

Open-source LLM observability, prompt management, and evaluation in one platform.

Freemium · Free self-host & Hobby tier; Core $29/mo, Pro $199/mo, Enterprise $2,499/mo · llm-observability, prompt-management

MLflow

Evaluation · Multi-model

Open-source platform for tracking, evaluating, and deploying ML models and LLM applications.

Free · Free and open source (Apache 2.0); managed offering via Databricks · llm-evaluation, experiment-tracking

Maxim AI

Evaluation · Multi-model

End-to-end evaluation, simulation, and observability platform for shipping production-grade AI agents.

Freemium · Free tier; 14-day trial on paid plans; custom enterprise pricing · agent-evaluation, llm-observability

OpenAI Evals

Evaluation · OpenAI GPT models (extensible)

OpenAI's open-source framework for benchmarking LLMs against a shared registry of evaluations.

Free · Free (MIT); you pay OpenAI API costs for eval runs · llm-benchmarking, regression-testing

Opik

Evaluation · Multi-model

Open-source LLM observability and evaluation platform for debugging and monitoring AI agents in production.

Freemium · Free open-source self-host; free Cloud tier (no card); Enterprise contact sales · llm-tracing, agent-evaluation

Parea AI

Evaluation · Multi-model

LLM evaluation, observability, and prompt management platform for teams shipping production AI apps.

Freemium · Free (2 seats, 3k logs/mo); Team $150/mo; Enterprise custom · llm-evaluation, prompt-management

Phoenix

Evaluation · Multi-model

Open-source LLM and agent observability platform with tracing, evals, and experimentation built on OpenTelemetry.

Freemium · Open source (ELv2) + free Phoenix Cloud; paid Arize AX for enterprise · llm-tracing, agent-debugging

Prompt Foundry

Evaluation · OpenAI + Anthropic (multi-model)

Prompt management and side-by-side LLM evaluation for OpenAI and Anthropic models.

Freemium · Free tier (10 prompts, 500 evals/mo); Pro $15/user/mo; Enterprise custom · prompt-management, model-comparison

Promptfoo

Evaluation · Multi-model

Open-source eval and red-teaming framework for LLM apps, prompts, and RAG pipelines.

Freemium · Open-source free; Enterprise SaaS contact sales · llm-evals, red-teaming

Respan (formerly Keywords AI)

Evaluation · Multi-model (500+ via gateway)

LLM engineering platform combining a multi-model gateway with tracing, evals, and prompt management.

Freemium · Free tier; paid plans (pricing not public); enterprise on request · llm-observability, prompt-management

W&B Weave

Evaluation · Multi-model

Production observability, tracing, and evaluation for LLM and agent systems from the Weights & Biases stack.

Freemium · Free tier available; paid and enterprise plans via W&B · llm-tracing, agent-observability

Weco AI

Evaluation · Multi-model (LLM + AIDE tree search)

Autoresearch engine that iteratively rewrites code to optimize against a numeric evaluation metric.

Freemium · Open-source CLI; hosted/commercial pricing not published · code-optimization, gpu-kernel-tuning

llmfit

Evaluation · Multi-model

Terminal tool that scores hundreds of open LLMs against your actual CPU, RAM, and GPU and tells you which ones will run well.

Free · Free, MIT-licensed · local-llm-selection, hardware-benchmarking