Arthur
Open-source toolkit for testing, tracing, and monitoring production AI agents.
Pick Arthur if you're shipping agents to production and want an open-source, OpenTelemetry-native control plane for prompts, traces, and evals.
Skip it if you're a solo builder prototyping a chatbot — the lifecycle tooling is overkill before you have real traffic to evaluate against.
Arthur's Agent Development Toolkit is an MIT-licensed framework for managing the full lifecycle of LLM-based agents in production. It bundles prompt versioning with rollback, A/B experimentation against real traffic, OpenTelemetry-based tracing of every agent step, and continuous evaluations that flag hallucinations, PII leaks, and prompt-injection attempts.
It's aimed at engineering teams who have moved past prototyping and need observability and governance without rebuilding their stack. The toolkit is model- and framework-agnostic — it slots into OpenAI, Anthropic, LangChain, LlamaIndex, and the Vercel AI SDK rather than forcing a proprietary runtime. You can self-host via Docker, CloudFormation, or Helm, or start on the hosted SaaS free tier; enterprise SaaS plans are available on request.
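To make the "continuous evaluations" idea concrete, here is a minimal sketch of what a PII check on agent output could look like. It is not Arthur's implementation or API: the `evaluate_pii` function, the `EvalResult` type, and both regex patterns are hypothetical names chosen for illustration, and a production evaluator would need far broader coverage than two patterns.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only; real PII detection needs much
# wider coverage (names, addresses, national IDs, locale-specific formats).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


@dataclass
class EvalResult:
    """Outcome of one evaluation run over a single agent response."""
    passed: bool
    findings: list[str]


def evaluate_pii(agent_output: str) -> EvalResult:
    """Flag the response if any of the illustrative PII patterns match."""
    findings = [
        name
        for name, pattern in PII_PATTERNS.items()
        if pattern.search(agent_output)
    ]
    return EvalResult(passed=not findings, findings=findings)


if __name__ == "__main__":
    result = evaluate_pii("Contact me at jane.doe@example.com or 555-123-4567.")
    print(result.passed, result.findings)
```

In a continuous setup, a check like this would run against each sampled production response, with failures attached to the corresponding trace so they can be reviewed alongside the agent step that produced them.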
Arthur the company has a longer history in classical ML monitoring (model bias, drift, performance), and this agent toolkit is its bet on the LLM-eval era. The open-source repo, Helm-chart deployability, and OpenTelemetry standardisation make it a credible alternative to closed eval platforms like Braintrust or LangSmith.
Arthur's pivot from classical ML monitoring into agent observability lands well: an MIT licence, OpenTelemetry plumbing, and Helm charts make it one of the few credible self-hostable answers to Braintrust and LangSmith. The agent toolkit is young, but the bones — and the company behind them — are serious.
— The AI Tool Bible editorial team
Pros
- ✅ MIT-licensed and self-hostable via Docker, Helm, or CloudFormation
- ✅ Built on OpenTelemetry so traces flow into existing observability stacks
- ✅ Framework-agnostic: works with LangChain, LlamaIndex, OpenAI, Anthropic, Vercel AI SDK
- ✅ Covers full lifecycle: prompt versioning, A/B testing, tracing, and online evals
- ✅ Includes guardrail evaluators for PII and prompt injection out of the box
Cons
- ⚠️ Agent toolkit is newer than the company's ML-monitoring heritage; ecosystem still maturing
- ⚠️ Self-hosting via Helm or CloudFormation requires real DevOps capacity
- ⚠️ Paid SaaS pricing is gated behind sales contact
Compare with similar tools
Braintrust
Eval, monitor, and improve AI products end-to-end.
LangSmith
LangChain's eval + observability platform.
Weights & Biases
The ML experiment tracker, now with LLM eval features.
Helicone
Open-source LLM observability — one-line proxy install.
Humanloop
Prompt management + evals for collaborative AI teams.
PromptLayer
Lightweight prompt logging + management for OpenAI/Claude apps.