Best AI tools for safety testing
14 tools in the Evaluation category, filtered to safety testing.
Patronus
Automated LLM evaluation for hallucinations, safety, and quality.
Arize AI
Enterprise observability and evaluation platform for LLM agents and generative AI applications.
Giskard
Continuous AI red-teaming platform that stress-tests LLM agents for vulnerabilities before the agents reach production.
Great Expectations
Open-source data quality framework for validating the datasets that feed your ML and analytics pipelines.
HoneyHive
OpenTelemetry-native observability and evaluation platform for LLM agents in production.
Inspect AI
Open-source LLM evaluation framework from the UK AI Security Institute with 200+ built-in benchmarks.
LLMEval
Open academic benchmark suite for stress-testing LLMs on contamination-resistant, domain-specific tasks.
LangFast
No-signup LLM playground for testing, comparing, and versioning prompts against your own API keys.
OpenAI Evals
OpenAI's open-source framework for benchmarking LLMs against a shared registry of evaluations.
Opik
Open-source LLM observability and evaluation platform for debugging and monitoring AI agents in production.
Prompt Foundry
Prompt management and side-by-side LLM evaluation for OpenAI and Anthropic models.
Promptfoo
Open-source eval and red-teaming framework for LLM apps, prompts, and RAG pipelines.
TruLens
Open-source evaluation and tracing framework for LLM apps and agents, built on OpenTelemetry.
W&B Weave
Production observability, tracing, and evaluation for LLM and agent systems, part of the Weights & Biases stack.