LMArena

  • What it is: LMArena is a community-powered platform for blind, head-to-head comparisons of AI models that uses real user votes to generate human-preference data and live evaluation leaderboards.
  • Best for: AI model developers at labs, enterprise AI teams, AI researchers benchmarking progress
  • Pricing: Free core platform; custom enterprise pricing for paid evaluation services
  • Rating: 92/100 (Excellent)
  • Expert's conclusion: LMArena is the primary infrastructure for serious AI model evaluation; it delivers the most trusted human-preference rankings available today.
Reviewed by Maxim Manylov · Web3 Engineer & Serial Founder

What Is LMArena and What Does It Do?

LMArena was founded by the researchers behind UC Berkeley's Chatbot Arena research project, launched in 2023. It is an open platform where the community evaluates large language models (LLMs) through blind head-to-head comparisons and human preference voting. It gives companies that develop AI models, and organizations that deploy them, benchmarked data they can use to build and ship better models. In 2025, the company went from research project to fully-fledged business.

Active
πŸ“San Francisco Bay Area, CA
πŸ“…Founded 2025
🏒Private
TARGET SEGMENTS
AI Model Developers, AI Labs, Enterprises, Researchers

What Are LMArena's Key Business Metrics?

5M+ Monthly Users
150+ Countries
60M+ Monthly Conversations
$30M+ ARR
$250M+ Total Funding
$1.7B Valuation

How Credible and Trustworthy Is LMArena?

92/100
Excellent

Rapid growth, broad community support, and endorsements from top AI labs such as OpenAI, Google, and xAI give LMArena exceptional credibility. Its strong academic origins and substantial venture backing further cement its status as the standard for evaluating AI models.

Product Maturity: 88/100
Company Stability: 95/100
Security & Compliance: 85/100
User Reviews: 95/100
Transparency: 98/100
Support Quality: 90/100
Used by OpenAI, Google DeepMind, Anthropic, and xAI
$250M+ total funding from top VCs
5M+ monthly users across 150 countries
UC Berkeley research origins
60M+ monthly conversations benchmarked

What is the history of LMArena and its key milestones?

2023

Chatbot Arena Launched

Researchers Anastasios Angelopoulos and Wei-Lin Chiang from UC Berkeley create an open research project called Chatbot Arena for comparing LLMs in a blinded manner.

2024

Image Support Added

Chatbot Arena expands to multimodal evaluations that assess how well models understand images.

2024

Domain Launch

As the number of users continues to grow rapidly, LMArena.com becomes the official domain name.

2025

Company Incorporated

On April 28th, 2025 Arena Intelligence Inc. officially launches as an independent entity.

2025

$100M Seed Round

Just one month after launching, Arena Intelligence Inc. raises $100 million in seed funding at a $600 million post-money valuation.

2025

Commercial Product Launch

The service, launched in September 2025, achieves a $30 million annual recurring revenue (ARR) run-rate in less than four months.

2025

$150M Funding Round

With an additional $150 million in funding, the total raised by Arena Intelligence Inc. is now $250 million; the company has a new valuation of $1.7 billion.

Who Are the Key Executives Behind LMArena?

Anastasios N. Angelopoulos - CEO & Co-founder
Ph.D. from UC Berkeley's Electrical Engineering and Computer Sciences department. Co-creator of Chatbot Arena with deep expertise in LLM evaluation methods.
Wei-Lin Chiang - CTO & Co-founder
Ph.D. from UC Berkeley's Electrical Engineering and Computer Sciences department. Lead architect of Chatbot Arena's scalable evaluation infrastructure.
Ion Stoica - Co-founder & Advisor
Professor of Electrical Engineering and Computer Science at UC Berkeley and serial entrepreneur. Co-founded Databricks, Anyscale, and Conviva, with experience building and scaling distributed systems and machine learning (ML) infrastructure.

What Are the Key Features of LMArena?

Blind Side-by-Side Comparisons
By anonymizing model outputs, LMArena lets humans make unbiased preference judgments without the influence of model developers' branding (a minimal sketch of this flow appears after the feature list).
Community-Driven Leaderboards
Real-time rankings generated from millions of user preferences across 400+ models.
Prompt-to-Leaderboard (P2L)
A predictive model that estimates rankings for prompts with little vote data by drawing on historical voting trends.
Arena Categories
Domain-specific leaderboards for coding, reasoning, conversation, vision, and specialized tasks.
Private Model Testing
Confidential evaluation of unreleased model versions through an enterprise service.
Multimodal Support
Evaluation of text-to-image generation, vision understanding, and other multimodal capabilities.
Global Scale
Statistically meaningful evaluation data generated by 5M+ monthly users across 150 countries.
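
As referenced under Blind Side-by-Side Comparisons, the following is a minimal sketch of how a blind battle could pair two anonymous responses. It is illustrative only, not LMArena's actual implementation; the model names and the stubbed model call are placeholder assumptions.

```python
import random

# Hypothetical sketch of a blind "battle": two models are sampled at random,
# their identities are hidden behind neutral labels, and only the vote handler
# knows which label maps to which model. Not LMArena's actual code.

AVAILABLE_MODELS = ["model-alpha", "model-beta", "model-gamma"]  # placeholders

def generate_response(model_name: str, prompt: str) -> str:
    """Placeholder for a real model call (e.g., a provider API)."""
    return f"[{model_name}'s answer to: {prompt}]"

def start_battle(prompt: str) -> dict:
    model_a, model_b = random.sample(AVAILABLE_MODELS, 2)
    return {
        "prompt": prompt,
        "hidden_mapping": {"A": model_a, "B": model_b},  # never shown to voter
        "shown_to_voter": {
            "A": generate_response(model_a, prompt),
            "B": generate_response(model_b, prompt),
        },
    }

battle = start_battle("Explain recursion in one sentence.")
vote = "A"  # the voter only ever sees labels A and B
print("Voter preferred:", battle["hidden_mapping"][vote])  # revealed after voting
```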

What Technology Stack and Infrastructure Does LMArena Use?

Infrastructure

Scalable cloud infrastructure supporting 60M+ monthly conversations

Technologies

Python, React, TypeScript, Machine Learning, Distributed Systems

Integrations

OpenAI API, Google Gemini, Anthropic Claude, xAI Grok, Meta Llama

AI/ML Capabilities

Proprietary P2L prediction models trained on millions of human preference votes; supports evaluation of frontier LLMs across text, vision, coding, and multimodal tasks

Inferred from Berkeley research origins, scale requirements, and model integrations

What Are the Best Use Cases for LMArena?

AI Model Laboratories
Compare a new model variant to competitive models using millions of real user preferences prior to public release.
Enterprise AI Teams
Identify the best models for production use across coding, reasoning, customer support, and domain-specific tasks.
AI Researchers
Access open, publicly available human-preference datasets and leaderboards for reproducible LLM evaluation research.
Software Developers
Compare the performance of coding assistants on different models using real-world programming examples.
Individual Hobbyists
Free, anonymous testing and comparison of publicly available AI models.
NOT FOR: High-Frequency Trading Systems
Not applicable: the platform evaluates LLM output quality, not real-time latency.
NOT FOR: Medical Diagnostic Systems
Not suitable: community-based voting does not satisfy the clinical validation requirements of regulated medical diagnostics.

How Much Does LMArena Cost and What Plans Are Available?

Pricing information with service tiers, costs, and details

Service | Cost | Details | Source
Core Platform | $0 | Free access to public leaderboards, model comparisons, and core community evaluations | Comparateur-IA
AI Evaluation Services | Custom enterprise pricing | Paid services for AI labs and enterprises measuring model performance in production use cases; $30M+ annualized run rate as of Dec 2025 | PRNewswire funding announcement

How Does LMArena Compare to Competitors?

Feature | LMArena | LMSYS Chatbot Arena | Hugging Face Open LLM Leaderboard | Artificial Analysis
Core Functionality | Crowdsourced human preference rankings | Crowdsourced pairwise battles | Automated benchmarks | Automated benchmarks
Model Coverage | Text, vision, search, coding | Text + vision | Text only | Full spectrum
Evaluation Method | Real user votes (5M+ users) | Real user votes | Automated metrics | Automated metrics
Free Tier | Yes (core platform) | Yes | Yes | Yes
Enterprise Features | Custom evaluations, analytics | Limited | Private leaderboards | Custom reports
API Availability | Enterprise only | Yes (pay-per-use) | Yes | Yes
Community Size | 5M monthly users | Largest (10M+ votes) | Developer-focused | Research-focused
Industry Coverage | Software eng, law, medicine, research | General | Open models | Commercial models
Starting Price | $0 / Custom enterprise | $0 / API paid | $0 / Enterprise paid | $0 / Paid reports

How Does LMArena Compare to Specific Competitors?

vs LMSYS Chatbot Arena

LMArena positions itself as a production-ready evaluation platform spanning domains such as law and medicine, whereas LMSYS focuses on general conversational ability. LMArena serves enterprise clients directly with $30M+ ARR, while LMSYS remains academic and research oriented.

In short: LMArena validates models for enterprise use, while LMSYS serves as a research tool for public benchmarking.

vs Hugging Face Open LLM Leaderboard

LMArena relies on human preference evaluation rather than automated benchmarks. While Hugging Face excels at covering open-source models, it lacks the real user-preference signals LMArena captures from 5M+ monthly users across 150 countries.

In short: Hugging Face serves open-model developers who need automated testing, while LMArena provides user-preference validation for both development and final deployment.

vs Artificial Analysis

While both platforms target enterprises, LMArena's large-scale crowdsourced data (60M conversations per month) provides stronger human-preference signals than the synthetic benchmarks used by Artificial Analysis.

In short: LMArena's human-preference data has been validated for production deployments, where it leads other platforms.

vs Scale AI Evals

Scale AI focuses on paid annotation services, while LMArena relies on organic community scale. LMArena's $1.7B valuation reflects market preference for crowdsourced over paid annotation.

In short: LMArena scales cheaply for large-scale evaluation, while Scale AI serves customers that need custom annotation.

What are the strengths and limitations of LMArena?

Pros

  • Millions of monthly users across 150+ countries create statistically significant volumes of data.
  • The platform focuses on real-world relevance, capturing human preferences on practical tasks such as law, medicine, and coding.
  • It is trusted by leaders in the field (OpenAI, Google, xAI), who use its evaluations to improve their models.
  • It has grown rapidly ($30M ARR within four months and a $1.7B valuation after raising $150M).
  • It is production-focused and offers enterprise-grade evaluations for high-stakes deployments.
  • It evaluates multiple modalities, including text, vision, search, and coding, each with its own leaderboard.
  • The free core platform provides accessible public leaderboards and model comparisons.

Cons

  • Enterprise pricing is completely opaque, with no publicly available pricing information; everything requires a custom quote.
  • Access to some frontier models is limited by paid model access and/or partnership paywalls.
  • The leaderboard can be volatile because of crowd voting and frequent new model releases.
  • Because the platform is public, it is not designed for internal evaluation and cannot accommodate private data or privacy requirements.
  • Even with production-ready evaluations across high-value domains and $30M+ in enterprise ARR, rankings leave gaps that matter in production (API costs, compliance, service level agreements).
  • Despite its $1.7B valuation, the company is still early stage, having released its commercial product only in September 2025.
  • Although vote-manipulation risk is mitigated, crowdsourced voting remains susceptible to gaming, which can affect leaderboard stability.

Who Is LMArena Best For?

Best For

  • AI model developers at labs - LMArena's trusted human-preference signals are already used by OpenAI, Google, and xAI to improve their models.
  • Enterprise AI teams - the platform provides production-ready evaluations across high-value domains and generates $30M+ in enterprise ARR.
  • AI researchers benchmarking progress - real-time leaderboards across text, vision, search, and coding, backed by 60M+ monthly conversations.
  • Procurement teams evaluating vendors - independent, third-party human preference data to consult before selecting a vendor.
  • AI enthusiasts and communities - free access to public leaderboards and model comparisons.

Not Suitable For

  • Teams with private or sensitive data - the public platform cannot evaluate proprietary datasets. Consider internal evaluations or Scale AI instead.
  • Budget-constrained startups - enterprise pricing will likely be out of reach for smaller teams. Use the free LMSYS Arena instead.
  • Compliance-focused enterprises - no compliance metrics are publicly disclosed. Dedicated enterprise evaluation vendors or on-prem solutions may be more suitable.
  • Real-time production monitoring - the platform is leaderboard-focused, not a live monitoring tool. Use Datadog or another observability platform for that.

Are There Usage Limits or Geographic Restrictions for LMArena?

Public Platform Access
Free core leaderboards and comparisons
Model Availability
Varies by partnerships, some frontier models limited
Enterprise Evaluations
Custom pricing for AI labs/enterprises
Production Use Warning
Always validate API costs, privacy, compliance internally
Data Privacy
Public platform, no private data evaluation
Vote Integrity
Anti-gaming measures but crowd votes can shift
Commercial Product Maturity
Launched Sep 2025, rapidly scaling

Is LMArena Secure and Compliant?

Trusted by AI Leaders: OpenAI, Google, and xAI rely on the platform for production model evaluations
Enterprise-Grade Infrastructure: Supports a $30M+ annualized consumption run rate for mission-critical AI evals
Global Scale Operations: 5M+ users across 150 countries with 60M monthly conversations
Production Deployment Ready: $1.7B valuation reflects enterprise trust in evaluation reliability

What Customer Support Options Does LMArena Offer?

Channels
Active developer/researcher community; platform guides and methodology; custom evaluation service inquiries; AI labs and model providers
Hours
Community support 24/7, enterprise business hours
Response Time
Community self-serve, enterprise sales <24 hours
Satisfaction
High trust - partnered with OpenAI/Google/xAI
Specialized
Dedicated account teams for AI labs and enterprises
Business Tier
Custom evaluation services with SLAs for commercial customers
Support Limitations
• No dedicated support for free/public users
• Enterprise support only for paid evaluation customers
• Self-service for leaderboard/platform access

What APIs and Integrations Does LMArena Support?

API Type
No public API available. Primarily a web-based crowdsourced evaluation platform with no documented REST, GraphQL, or gRPC endpoints for external integrations.
Authentication
No authentication required. Platform offers free, no-sign-up access for public benchmarking and model comparisons.
Webhooks
No webhook support mentioned. Focus is on public leaderboards and human voting rather than event-driven integrations.
SDKs
No official SDKs available. Originated as open research project from UC Berkeley LMSYS, but no developer SDKs documented.
Documentation
No API documentation available. Limited to platform usage guides; evaluation services for enterprises mentioned but details require direct contact.
Sandbox
Public platform serves as free testing environment with no signup. Users can immediately test models via blind battles.
SLA
No public SLA guarantees. Enterprise evaluation services offered to AI labs, but uptime details not disclosed.
Rate Limits
No documented rate limits for public use. Free access model with millions of monthly user interactions.
Use Cases
Crowdsourced model benchmarking, live leaderboards for AI labs (OpenAI, Google, xAI), enterprise evaluation services across text, code, image, video.

What Are Common Questions About LMArena?

How does LMArena work?
Users enter a prompt and receive two anonymous model responses, compare them blindly, and vote on which is better. The votes feed real-time Elo-based leaderboards that show how humans rank different models across text, code, image, and multimodal tasks. This crowdsourced approach to producing unbiased model rankings has been adopted by AI labs around the world.
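
To make the vote-to-leaderboard step concrete, here is a minimal sketch of a standard Elo update applied to a single blind vote. The K-factor, starting rating, and model names are illustrative assumptions, not LMArena's published parameters.

```python
# Minimal sketch (not LMArena's code): how one blind vote could update
# Elo-style ratings. K_FACTOR and the 1000-point starting rating are
# illustrative assumptions, not published platform parameters.

K_FACTOR = 32          # assumed step size per vote
START_RATING = 1000.0  # assumed initial rating for a new model

ratings = {}  # model name -> current Elo rating

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_vote(model_a: str, model_b: str, winner: str) -> None:
    """Update both ratings after a voter picks 'A', 'B', or 'tie'."""
    r_a = ratings.setdefault(model_a, START_RATING)
    r_b = ratings.setdefault(model_b, START_RATING)
    score_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
    e_a = expected_score(r_a, r_b)
    ratings[model_a] = r_a + K_FACTOR * (score_a - e_a)
    ratings[model_b] = r_b + K_FACTOR * ((1.0 - score_a) - (1.0 - e_a))

# Example: a voter prefers the response shown as "A" (from "model-x").
record_vote("model-x", "model-y", winner="A")
print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```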

How much does LMArena cost?
The public version of the platform is entirely free and does not require sign-up. The enterprise service for AI labs and other organizations is paid. LMArena's first commercial product launched in September 2025 and reached a $30M+ annualized run rate within months; contact the company's sales department for specific pricing options.

How has LMArena evolved?
LMArena started life as the UC Berkeley Chatbot Arena research project and was later turned into an operational platform. It now runs coding, image-generation, and multimodal arenas alongside the original chat arena, while keeping its core blind-voting process intact. LMArena is widely recognized as the industry standard for evaluating AI models on a commercially viable platform.

Does LMArena collect personal data?
Accessing the public version of LMArena requires no user account and no personally identifying information; the platform focuses on collecting anonymous votes and prompt-response data for benchmarking. Enterprise offerings may include the typical security features you would expect, but that level of detail is not publicly available at this time.

Does LMArena offer APIs or integrations?
At present, there are no publicly available APIs or integrations for LMArena. It is primarily a web-based platform for manually evaluating models and accessing the leaderboards. Customized evaluation services are available for enterprise customers.

What does LMArena offer enterprises?
For AI labs and enterprises that want to measure model performance across different industries, the company offers a range of paid evaluation products. LMArena has already been used by companies such as OpenAI, Google, and xAI to improve the quality of their production models. To learn more about customized solutions, visit LMArena's website and contact the sales department.

Is there a free tier?
The open public benchmarking platform is completely free and offers unlimited access to test frontier models via blind battles, with no sign-up required. Paid enterprise services require a sales conversation.

What are LMArena's main limitations?
Data quality depends on user participation and may be subject to voter bias, even though the blind-battle format preserves anonymity. There is also no public API for programmatic access.

Is LMArena Worth It?

LMArena has established itself as the leading crowdsourced approach to evaluating AI model quality; it provides industry-wide leaderboards built from millions of human judgments each month. Its blind-battle system and Elo ranking methodology deliver unbiased, real-world performance comparisons between models that static benchmarks cannot match.

Recommended For

  • Researchers and developers interested in monitoring the performance of frontier models
  • AI labs (OpenAI, Google, xAI, and others) that need a trusted third party to evaluate their models
  • Enterprises deploying AI models that need human-preference benchmarks
  • Technical teams testing LLMs across multiple task types (text, code, vision, multimodal)

Use With Caution

  • Teams that require API access for automated evaluations - the platform is currently accessible only via the web interface
  • Users focused on testing niche or proprietary models - LMArena prioritizes popular frontier models
  • Organizations that need guaranteed service level agreements (SLAs) for production benchmarking pipelines

Not Recommended For

  • Teams looking for traditional offline benchmarks such as MMLU or GPQA
  • Budget-constrained startups that do not want to invest in enterprise evaluation services
  • Non-technical users who want to test models in a simple way without committing to voting
Expert's Conclusion

LMArena is the primary infrastructure for serious AI model evaluation; it delivers the most trusted human-preference rankings available today.

Best For
Researchers and developers monitoring the performance of frontier models; AI labs (OpenAI, Google, xAI, and others) that need a trusted third party to evaluate their models; enterprises deploying AI models that need human-preference benchmarks

What do expert reviews and research say about LMArena?

Key Findings

LMArena began as UC Berkeley's Chatbot Arena research project and is now the most widely used crowdsourced evaluation platform for AI models, handling 4 million+ model-comparison evaluations per month and generating real-time Elo leaderboards from blind human voting across text, code, image, and multimodal categories. The company has raised $250M+ and has an estimated annual revenue of $25M+ from enterprise evaluation services sold to AI labs including OpenAI, Google, and xAI.

Data Quality

Good - comprehensive coverage from funding announcements, technical descriptions, and platform analyses. Limited details on enterprise pricing/service specifics and no public API documentation.

Risk Factors

  • Continued reliance on user participation to maintain data quality.
  • Despite the blind-testing format, voter bias remains possible.
  • The pace of AI advancement may outstrip the current evaluation methodology's ability to provide adequate assessments.
  • The lack of a public API limits developers' ability to build applications on top of LMArena.
Last updated: February 2026

What Additional Information Is Available for LMArena?

Funding & Growth

Received $100M in seed funding and $150M in a follow-on round, for a total of $250M+ from investors such as Felicis. Estimated to reach $25M+ in revenue by the end of 2025 through enterprise evaluation services for AI model development. Serves as critical infrastructure for top AI labs.

Founders & Origin

Founded by UC Berkeley researchers Anastasios Angelopoulos and Wei-Lin Chiang, with professor Ion Stoica as co-founder and advisor, out of LMSYS Org. Began as an open research project called Chatbot Arena in 2023. Now the most widely used AI model evaluation platform, with 5 million+ monthly users in 150+ countries.

Key Customers

Used by leading AI labs, including OpenAI, Google, and xAI, to evaluate production AI models. Applied across software engineering, legal, medical, and scientific research domains.

Leaderboard Categories

Contains specific areas of evaluation: Text Arena (chat and reasoning), Image Arena (generation), Multimodal Arena (vision/text) and coding and search benchmarks.

Data Transparency

Releases its evaluation data and methods to the public. Provides researchers around the world with the opportunity to study the human preference signals received during evaluation and utilize those signals to improve their AI models.

What Are the Best Alternatives to LMArena?

  • Hugging Face Open LLM Leaderboard: An automated evaluation platform that tests 20,000+ open-source models on standard tasks such as MMLU and HellaSwag. Better for open-source model coverage and reproducibility than LMArena's human-preference approach. Most suitable for researchers who prioritize offline metrics over live user voting. (https://huggingface.co/spaces/open-llm-leaderboard)
  • LMSYS Chatbot Arena (Original): The direct predecessor to LMArena, built by the same UC Berkeley group; a smaller, less commercialized tool focused only on chatbots and conversation. It remains available for basic LLM comparisons (https://chat.lmsys.org).
  • Scale AI Evaluation Platform: An enterprise-grade evaluation platform offering both human and automated scoring on your own custom datasets. It may suit you better than LMArena's public battles if you need to test proprietary models privately; it is considerably more expensive but offers SLAs and APIs. Best for organizations that require private evaluations (https://www.scale.com).
  • Artificial Analysis: An independent quality index that aggregates results across benchmarks and adds speed and cost metrics. It is comparable to how LMArena aggregates human preference data, but focuses on standardized task performance. Ideal for dashboards that compare multiple models quantitatively (https://www.artificialanalysis.ai).
  • Arena-Hard-Auto: Automated arena-style evaluation using strong judge models instead of human judges. Faster and cheaper than LMArena's crowdsourcing, though it may represent human preferences less accurately. Best for high-throughput testing of large numbers of models (https://github.com/lm-sys/FastChat).

What Are LMArena's Evaluation Metrics?

2M+ Monthly User Votes
250M+ Total Conversations
3M+ Monthly Users
400+ Models Hosted

What Testing Capabilities Does LMArena Offer?

Human-in-the-Loop Evaluation

Blind pairwise comparisons of models through crowdsourcing.

Elo Rating System

Provides real-time ranking of models based on user voting.

Prompt-to-Leaderboard (P2L)

Creates customized rankings for users based on their specific prompts.
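
As a loose illustration of the Prompt-to-Leaderboard idea (not the actual P2L model, whose implementation is not documented here), the sketch below blends category-level scores into a per-prompt ranking using a naive keyword match; every model name, score, and keyword list is invented for illustration.

```python
# Loose, simplified analogue of Prompt-to-Leaderboard (not the real P2L model):
# estimate a per-prompt ranking by weighting category-level scores according to
# a naive keyword match. All numbers and keywords are invented placeholders.

CATEGORY_KEYWORDS = {
    "coding": ["function", "bug", "python", "compile"],
    "reasoning": ["prove", "why", "logic", "step"],
    "conversation": ["chat", "advice", "explain", "write"],
}

# Invented per-category scores for three placeholder models.
CATEGORY_SCORES = {
    "coding":       {"model-alpha": 1210, "model-beta": 1180, "model-gamma": 1150},
    "reasoning":    {"model-alpha": 1175, "model-beta": 1205, "model-gamma": 1160},
    "conversation": {"model-alpha": 1190, "model-beta": 1170, "model-gamma": 1200},
}

def prompt_ranking(prompt: str) -> list[tuple[str, float]]:
    words = prompt.lower().split()
    # Weight each category by how many of its keywords appear in the prompt.
    weights = {cat: 1 + sum(kw in words for kw in kws)
               for cat, kws in CATEGORY_KEYWORDS.items()}
    total = sum(weights.values())
    blended: dict[str, float] = {}
    for cat, scores in CATEGORY_SCORES.items():
        for model, score in scores.items():
            blended[model] = blended.get(model, 0.0) + score * weights[cat] / total
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

print(prompt_ranking("Why does this python function have a bug"))
```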

Live Leaderboards

Provides real-time performance tracking across domains.

How Does LMArena's Benchmark Support Compare?

Benchmark | Category | Supported
Text Generation | Language & Reasoning | Yes
Code Arena | Code Generation | Yes
Image Arena | Multimodal Vision | Yes
Search Evaluation | Information Retrieval | Yes
Video Generation | Multimodal | Yes

What Model Compatibility Does LMArena Support?

OpenAI, Google, xAI, Anthropic, Llama, Mistral, 400+ models in total

What Are LMArena's Evaluation Modes?

Primary Mode
Crowdsourced blind A/B testing
Scale
60M+ conversations/month
Methodology
Transparent open-source Elo system
Customization
Prompt-to-Leaderboard (P2L)

How Does LMArena Ensure Safety Through Testing?

Real-World User Judgment

Measures the practical utility and reliability of models.

Cross-Domain Validation

Evaluates models based on law, medicine and engineering applications.

Global User Diversity

Includes feedback from 150 different countries which provides a representative view.

Production-Ready Assessment

Verifies whether a model is ready for production before public release.

What Is LMArena's CI/CD Integration?

Commercial API
Paid evaluation services for labs
SLA Support
Guaranteed delivery timelines
Enterprise Access
Auditability and representative samples
Annual Run Rate
$30M+
