Best Model Evaluation & Testing Software
Model Evaluation & Testing software solutions.
Best At A Glance

Giskard
giskard.ai
Giskard is an AI security testing platform that uses red teaming to detect vulnerabilities in LLM agents, including hallucinations, prompt injections, and security flaws. It is available as an open-source Python library and as an enterprise Hub.

Arize AI
arize.com
Arize AI is a unified LLM observability and agent evaluation platform for monitoring, troubleshooting, and improving AI models and applications in production.

WhyLabs
whylabs.ai
WhyLabs is an AI observability platform that monitors machine learning models, data pipelines, and generative AI applications for quality, performance, and security, and for issues such as drift and bias.

Fiddler AI
fiddler.ai
Fiddler AI is an all-in-one AI observability and security platform that provides real-time monitoring, guardrails, root cause analysis, and governance for deploying AI agents, LLMs, and ML models in production.

Truera
truera.com
Truera is a provider of AI observability platforms for machine learning monitoring, quality management, explainability, and predictive diagnostics across model lifecycles.

Deepchecks
deepchecks.com
Deepchecks is a platform for evaluating and monitoring machine learning models, with a focus on large language models (LLMs), to detect issues like hallucinations, bias, and performance drift.

Confident AI
confidentai.com
Confident AI is a cloud platform for evaluating, testing, and monitoring large language model applications with metrics, observability tools, and CI/CD integration.

Kolena
kolena.com
Kolena is a San Francisco-based AI platform for testing, benchmarking, and validating machine learning models. It also automates document-heavy workflows in sectors like real estate, insurance, and finance.

CTGT
ctgt.ai
CTGT is a product-focused frontier interpretability lab that uses mathematically guaranteed techniques to identify AI errors, biases, and hallucinations for safe, transparent deployment in high-stakes industries like healthcare and finance.

LMArena
lmarena.ai
LMArena is a community-powered platform for blind head-to-head comparisons of AI models. It uses real-user votes to generate human preference data and live leaderboards for evaluation.

Braintrust
braintrustdata.com
Braintrust is a platform for developing, evaluating, and observing AI applications, offering tools for prompt management, performance tracking, evals, logging, and production traces. It is used by companies like Zapier and Instacart.