Arthur alternatives
12 evaluation tools in the same lane as Arthur, ranked by editorial score.
Braintrust
Eval, monitor, and improve AI products end-to-end.
LangSmith
LangChain's eval + observability platform.
Weights & Biases
The ML experiment tracker, now with LLM eval features.
Helicone
Open-source LLM observability — one-line proxy install.
Humanloop
Prompt management + evals for collaborative AI teams.
PromptLayer
Lightweight prompt logging + management for OpenAI/Claude apps.
Patronus
Automated LLM evaluation for hallucinations, safety, and quality.
Agenta
Open-source LLMOps platform for prompt engineering, evaluation, and observability in one workspace.
AlpacaEval
Automatic LLM evaluator and leaderboard that benchmarks instruction-following with length-controlled win rates.
Arena AI
Head-to-head LLM battle arena with a public leaderboard for ranking AI models.
Arize AI
Enterprise observability and evaluation platform for LLM agents and generative AI applications.
Artificial Analysis
Independent benchmarking platform comparing AI models and inference providers across intelligence, speed, and cost.