Best AI tools for prompt management
24 tools in the Evaluation category, filtered to prompt management.
Braintrust
Eval, monitor, and improve AI products end-to-end.
Weights & Biases
The ML experiment tracker, now with LLM eval features.
Humanloop
Prompt management + evals for collaborative AI teams.
PromptLayer
Lightweight prompt logging + management for OpenAI/Claude apps.
Agenta
Open-source LLMOps platform for prompt engineering, evaluation, and observability in one workspace.
Arize AI
Enterprise observability and evaluation platform for LLM agents and generative AI applications.
Arthur
Open-source toolkit for testing, tracing, and monitoring production AI agents.
Athina AI
Collaborative LLM evaluation and observability platform for teams shipping AI features to production.
Giskard
Continuous AI red teaming platform that stress-tests LLM agents for vulnerabilities before they hit production.
Kiln AI
Open-source workbench for building, evaluating, and fine-tuning AI agents across 190+ models.
LangFast
No-signup LLM playground for testing, comparing, and versioning prompts against your own API keys.
Langfuse
Open-source LLM observability, prompt management, and evaluation in one platform.
MLflow
Open-source platform for tracking, evaluating, and deploying ML models and LLM applications.
Maxim AI
End-to-end evaluation, simulation, and observability platform for shipping production-grade AI agents.
OpenAI Evals
OpenAI's open-source framework for benchmarking LLMs against a shared registry of evaluations.
Opik
Open-source LLM observability and evaluation platform for debugging and monitoring AI agents in production.
Parea AI
LLM evaluation, observability, and prompt management platform for teams shipping production AI apps.
Phoenix
Open-source LLM and agent observability platform with tracing, evals, and experimentation built on OpenTelemetry.
Prompt Foundry
Prompt management and side-by-side LLM evaluation for OpenAI and Anthropic models.
Promptfoo
Open-source eval and red-teaming framework for LLM apps, prompts, and RAG pipelines.
Respan (formerly Keywords AI)
LLM engineering platform combining a multi-model gateway with tracing, evals, and prompt management.
W&B Weave
Production observability, tracing, and evaluation for LLM and agent systems from the Weights & Biases stack.
Weco AI
Autoresearch engine that iteratively rewrites code to optimize against a numeric evaluation metric.
llmfit
Terminal tool that scores hundreds of open LLMs against your actual CPU, RAM, and GPU and tells you which ones will run well.