Sesame Review: Key Features and Pros&Cons

Name: Sesame
Author: Sesame

What it is: Sesame is a conversational AI startup building emotionally intelligent voice companions Maya and Miles using its Conversational Speech Model for natural, real-time dialogues, with plans for AI-powered smart glasses.
Best for: Individuals and small projects, growing businesses and content creators, enterprises requiring voice solutions at scale
Pricing: Starting from $29/month
Rating: 85/100 (Very Good)
Expert's conclusion:Sesame would be ideal for organizations developing innovative conversational AI applications that emphasize natural prosody and emotional expression, especially when working within an open-source environment.

Reviewed by Maxim Manylov · Web3 Engineer & Serial Founder

Company Overview

Sesame AI develops emotionally intelligent voice companions, such as Maya and Miles, built on its Conversational Speech Model (CSM). The models are designed to give consumers an individualized, emotive experience through natural speech interaction. The company was founded by industry veterans who previously worked at Oculus and Meta, and it plans to create lifelike AI companions for use with emerging wearables such as smart glasses. Its focus on what it calls "voice presence" is meant to differentiate it from traditional voice assistants, which have limited or no emotional expressiveness.

Active

📍San Francisco, CA

📅Founded 2023

🏢Private

TARGET SEGMENTS

Consumers, Developers, Enterprises

Key Metrics

Funding Raised: $47.5M Series A + $250M Series B
Valuation: $1B+
Training Data: 1M+ hours of audio
Hugging Face Stars: 2K+
Response Time: 200-300 ms
Languages: English primary, expanding to 20+

Credibility Rating

85/100

Excellent

Credibility is supported by a strong founding team from Oculus and Meta, significant funding from prominent venture capital firms (a16z and Sequoia), and the viral reach of the product demonstrations.

BREAKDOWN

Product Maturity: 75/100
Company Stability: 90/100
Security & Compliance: 70/100
User Reviews: 85/100
Transparency: 80/100
Support Quality: 75/100
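The review does not say how the six sub-scores combine into the 85/100 headline rating. The sketch below assumes equal weights, which is an assumption for illustration and not the reviewer's stated method. It shows that a plain average comes out lower than 85, so some unstated weighting is presumably involved.

```python
# Sub-scores copied from the breakdown above (each out of 100).
breakdown = {
    "Product Maturity": 75,
    "Company Stability": 90,
    "Security & Compliance": 70,
    "User Reviews": 85,
    "Transparency": 80,
    "Support Quality": 75,
}
HEADLINE_SCORE = 85  # overall credibility rating quoted in the review


def unweighted_mean(scores):
    """Plain average of the sub-scores; equal weighting is an assumption."""
    return sum(scores.values()) / len(scores)


mean_score = unweighted_mean(breakdown)
gap = HEADLINE_SCORE - mean_score
print(f"unweighted mean: {mean_score:.1f}, headline: {HEADLINE_SCORE}, gap: {gap:.1f}")
```

Read the headline figure with that gap in mind: the sub-scores alone do not reproduce it.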

TRUST SIGNALS

Backed by a16z, Sequoia Capital, Spark Capital
Founders from Oculus VR and Meta Reality Labs
Open-sourced CSM-1B model on Hugging Face
1M+ hours audio training dataset

Company History

2023

Company Founded

In June 2023, Sesame AI was founded by Brendan Iribe (co-founder of Oculus, now CEO), Ankit Kumar (former lead of AI at Discord, now CTO), and Ryan Brown (founding engineer at Meta Reality Labs).

2025

Research Demo Launch

In February 2025, Sesame AI released two viral product demonstrations of voice companions Maya and Miles that showcase the capabilities of the company’s CSM technology.

2025

Series A Funding

On February 27, 2025, Sesame AI closed a $47.5M Series A round led by Andreessen Horowitz.

2025

Series B Funding

By June 2025, Sesame AI raised $250M in a Series B round from Sequoia Capital and Spark Capital at a valuation of over $1 billion.

2025

Key Executive Hires

In June 2025, Sesame AI hired Oculus co-founder Nate Mitchell as Chief Product Officer to lead hardware development for the company.

2025

Model Open-Sourced

In July 2025, Sesame AI released the open-source CSM-1B model to allow developers to work with the company's technology; the CSM-1B model has received more than 2,000 stars on Hugging Face.

Key Features

✨

Conversational Speech Model (CSM)

Sesame AI’s transformer-based model is able to process both text and audio tokens concurrently in under 300 milliseconds for natural and responsive interactions.

✨

Emotional Intelligence

The CSM picks up on emotional and conversational cues such as laughter, interruptions, mid-sentence changes in tone, and filler words such as "um."

✨

Real-Time Dialogue

With 200-300 milliseconds of latency, Sesame AI’s voice companions are able to mimic the timing and pacing of human-to-human conversations including pauses and conversational back-and-forth.

✨

Voice Companions (Maya & Miles)

Sesame AI’s emotionally-resonant AI companions were trained using over 1 million hours of real world audio data.

✨

Open-Source CSM-1B

The developer-accessible 1B-parameter model is available on Hugging Face and includes hosted inference APIs for easy integration into applications.

💬

Multilingual Support

While Sesame AI primarily supports English language interactions, the company has expressed intentions to support up to 20 additional languages and incorporate contextual awareness.

🔗

AI Glasses Integration

Sesame AI intends to develop always-on wearables to pair their voice companions with lightweight augmented reality (AR) hardware.

Tech Stack

Infrastructure

Multi-region GPU clusters (inferred for speech model training)

Technologies

Transformer models, PyTorch (inferred)

Integrations

Hugging Face, hosted inference APIs, smart glasses hardware

AI/ML Capabilities

Proprietary Conversational Speech Model (CSM) trained on 1M+ hours of audio data, processing text/audio tokens simultaneously for long-context (2,048 tokens/~2min) emotionally expressive speech with real-time turn-taking.

Based on research papers, demo descriptions, and training details from Contrary Research and RDWorld
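The 2,048-token context quoted above implies that a long conversation eventually has to be trimmed. The sketch below shows one simple way to do that: keep the most recent turns that fit the budget. The `Turn` class, the per-turn token counts, and the drop-oldest policy are all hypothetical; the source does not describe how CSM handles overflow.

```python
from dataclasses import dataclass

CONTEXT_TOKENS = 2048   # context length quoted in the review
CONTEXT_SECONDS = 120   # "~2min", also from the review (approximate)


@dataclass
class Turn:
    speaker: str
    text_tokens: int    # hypothetical count of text tokens in this turn
    audio_tokens: int   # hypothetical count of audio tokens in this turn

    @property
    def total(self) -> int:
        return self.text_tokens + self.audio_tokens


def fit_history(turns, budget=CONTEXT_TOKENS):
    """Keep the most recent turns whose combined tokens fit in the budget."""
    kept, used = [], 0
    for turn in reversed(turns):        # walk from newest to oldest
        if used + turn.total > budget:
            break                       # this turn and everything older is dropped
        kept.append(turn)
        used += turn.total
    return list(reversed(kept)), used   # restore chronological order


history = [
    Turn("user", 40, 600),
    Turn("maya", 55, 800),
    Turn("user", 30, 450),
    Turn("maya", 60, 700),
]
kept, used = fit_history(history)
print([t.speaker for t in kept], used)
print(f"implied rate: {CONTEXT_TOKENS / CONTEXT_SECONDS:.1f} tokens per second")
```

With these made-up counts only the last two turns fit, which illustrates why a roughly two-minute window limits how much earlier conversation can shape the next reply.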

Use Cases

Individual Consumers

Virtual assistants designed to help individuals navigate daily life through emotional connection and natural conversation, with "always-on" access through wearable smart glasses planned for the future.

AI Developers

Gives developers access to the open-source CSM-1B model and its hosted APIs for building expressive voice applications with approximately 200-300 ms of latency.

Customer Service Teams

Lets teams deploy emotionally intelligent voice agents that offer a more human-like experience and better engagement than robotic-sounding voice agents.

Automotive Interfaces

Provides drivers with hands-free virtual companions that offer contextual awareness and proactive assistance.

NOT FOR: High-Frequency Trading

Not applicable -- The 200-300 ms conversational latency does not meet the sub-100 ms requirements of real-time financial operations.

NOT FOR: HIPAA-Regulated Healthcare

Limited current compliance -- No medical data handling arrangements, such as a Business Associate Agreement (BAA), are currently offered.

Pricing

Pricing information with service tiers, costs, and details
Service	Cost	Details	Source
Starter	$29/month	10 hours of voice synthesis, 5 hours of voice recognition, 3 custom voice profiles, standard API access, email support	Sesame AI Voice official pricing
Professional	$49/month	50 hours of voice synthesis, 25 hours of voice recognition, 10 custom voice profiles, advanced API access, priority support, voice transformation features, analytics dashboard	Sesame AI Voice official pricing
Enterprise	Custom quote	Unlimited voice synthesis, unlimited voice recognition, unlimited custom voice profiles, full API access, dedicated support team, advanced security features, custom integration support, service level agreement	Sesame AI Voice official pricing
Free Trial	14 days	Full access to plan features, no credit card required	Sesame AI Voice official pricing
Sesame HR AI Add-on	$49/month	For employee training use case, includes AI-driven learning personalization. $42.50/month with annual billing	Sesame HR pricing
Mobile App (iOS)	$7.99/week or $24.99 or $89.99/year	Sesame AI Voice Pro subscription on Apple App Store with in-app purchase options	App Store listing
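To compare the tiers on a like-for-like basis, the prices in the table can be turned into a few derived figures. The sketch below is simple arithmetic on the listed numbers. Dividing the monthly price by synthesis hours alone is a simplification chosen for illustration, since each plan also bundles recognition hours and other features.

```python
# Prices and included hours copied from the pricing table above.
plans = {
    "Starter": {"usd_per_month": 29.0, "synthesis_hours": 10},
    "Professional": {"usd_per_month": 49.0, "synthesis_hours": 50},
}


def usd_per_synthesis_hour(plan):
    """Monthly price divided by included synthesis hours (recognition ignored)."""
    return plan["usd_per_month"] / plan["synthesis_hours"]


for name, plan in plans.items():
    print(f"{name}: ${usd_per_synthesis_hour(plan):.2f} per synthesis hour")

# HR add-on: monthly billing vs. the annual-billing rate quoted in the table.
hr_monthly, hr_annual_rate = 49.00, 42.50
hr_annual_saving = (hr_monthly - hr_annual_rate) * 12
print(f"HR add-on saving over a year with annual billing: ${hr_annual_saving:.2f}")

# iOS app: paying the weekly price for 52 weeks vs. the yearly subscription.
weekly_for_year = 7.99 * 52
print(f"iOS weekly price x 52: ${weekly_for_year:.2f} vs $89.99 per year")
```

On these figures the Professional plan costs roughly a third as much per included synthesis hour as Starter, and the weekly iOS price is by far the most expensive way to subscribe for a full year.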

Limits & Restrictions

Voice Synthesis Hours: 10 hours/month (Starter), 50 hours/month (Professional), unlimited (Enterprise)
Voice Recognition Hours: 5 hours/month (Starter), 25 hours/month (Professional), unlimited (Enterprise)
Custom Voice Profiles: 3 profiles (Starter), 10 profiles (Professional), unlimited (Enterprise)
Voice Cloning: Available on Professional and Enterprise plans only
API Access Level: Standard (Starter), Advanced (Professional), Full (Enterprise)
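The limits above can be read as a simple decision rule: pick the smallest tier whose caps cover the expected monthly usage. The sketch below encodes that rule. The function name and the tuple layout are illustrative choices, and `math.inf` stands in for "unlimited".

```python
import math

# Monthly limits copied from the list above; math.inf stands for "unlimited".
PLANS = [
    # (name, synthesis_hours, recognition_hours, voice_profiles, voice_cloning)
    ("Starter", 10, 5, 3, False),
    ("Professional", 50, 25, 10, True),
    ("Enterprise", math.inf, math.inf, math.inf, True),
]


def smallest_plan(synthesis_h, recognition_h, profiles, needs_cloning=False):
    """Return the first (smallest) tier whose limits cover the requested usage."""
    for name, max_syn, max_rec, max_prof, cloning in PLANS:
        fits_limits = (
            synthesis_h <= max_syn
            and recognition_h <= max_rec
            and profiles <= max_prof
        )
        if fits_limits and (cloning or not needs_cloning):
            return name
    raise ValueError("no plan fits")  # unreachable while Enterprise is unlimited


print(smallest_plan(8, 4, 2))                      # light usage
print(smallest_plan(8, 4, 2, needs_cloning=True))  # same usage, but needs cloning
print(smallest_plan(60, 10, 5))                    # exceeds Professional synthesis cap
```

Note that a single requirement, such as voice cloning, can push an otherwise light workload up a tier.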

Security & Compliance

Cloud Infrastructure: Cloud-based infrastructure with scalable architecture to handle peak loads without performance degradation

Advanced Security Features: Available on the Enterprise plan

Data Privacy: Voice data handling with privacy considerations for custom voice profiles and voice cloning

Customer Support

Channels

Standard (Starter), Priority (Professional), Dedicated team (Enterprise)
API documentation and integration guides available

Specialized: Dedicated support team available for Enterprise customers with custom integration support

Best For

Individuals and small projects — A starter plan at $29/month is available to provide a low-cost entry point into using the platform's voice synthesis and recognition capabilities.
Growing businesses and content creators — The professional plan includes additional features (voice transformation and analytics dashboard) to scale your voice applications.
Enterprises requiring voice solutions at scale — The Enterprise Plan has unlimited resources and support as well as custom integration options for complex deployments.
Organizations implementing AI-driven employee training — The Sesame HR integration allows users to receive personalized learning experiences with AI-enabled voice capabilities.
Developers building voice-enabled applications — Users have multiple API access levels as well as the ability to customize voice profiles to facilitate flexible integration.

Not Suitable For

Users requiring high-volume voice processing — The monthly hour limits associated with the Starter and Professional Plans may limit usage for very large-scale operations; consider the Enterprise option.
Projects with minimal budget — There is no free tier; the starting price is $29/month. Consider using open source voice solutions or alternative free solutions.

Pros & Cons

Pros

Reduced Costs -- Eliminates costs for hiring voice talent, renting recording studios and hiring people to transcribe audio files manually.
Pricing Flexibility -- Pay only for what you use; scalable pricing tiers are available from $29/month to custom Enterprise plans.
Cloud-based Infrastructure -- Automatically scales to handle peak loads without performance degradation.
Custom Voice Profiles -- Develop unique voices to fit your brand identity by adjusting pitch, speed, tone, and accent.
Voice Cloning Capability -- The Professional and Enterprise plans can create digital voice clones when authorized to do so.
Multi-tiered API Access -- Standard, Advanced, and Full API access levels cover different technical needs.
Flexibility With Integrations -- Enterprise-tier users get custom integration support for more complex implementations.

Cons

Limited Hours -- The Starter tier is capped at 10 hours of voice synthesis per month, which restricts large-scale or heavy usage.
No Free Tier Available -- Paid plans start at $29 per month; the only free option listed is the 14-day trial.
Limited Transparency on Advanced Features -- The available sources give insufficient detail on voice cloning, voice modification, and the security features of the higher tiers.
Enterprise Pricing Is Not Published -- The Enterprise tier (advanced security, unlimited usage, dedicated support) is quote-only, so its cost is unknown up front.
Learning Curve for Voice Customization -- Pitch, speed, tone, and accent are adjustable through parameters, but finding good values may require experimentation.
Monthly Allocations Limit Scaling -- The Professional tier is capped at 50 hours of voice synthesis and 25 hours of voice recognition per month.

API Integrations

API Type: No public API documentation found. Sesame appears to be a research-focused open-source Conversational Speech Model (CSM) rather than a commercial API service.
Authentication: Not applicable - no developer API identified in available sources.
Webhooks: No webhook support mentioned.
SDKs: Open-source model available; potential integration via Hugging Face or similar platforms, but no official SDKs documented.
Documentation: Limited - technical details in research blogs and Vogent docs; no comprehensive developer portal.
Sandbox: Public demo available for testing conversational voices (Maya and Miles).
SLA: None - beta/research product, not enterprise-grade service.
Rate Limits: Not specified.
Use Cases: Real-time conversational AI, voice companions, customer service agents, smart assistants requiring natural prosody and emotional expressiveness.

FAQ

What is Sesame AI?

Sesame AI is a conversational speech model (CSM) designed to produce natural-sounding, expressive speech with human-like prosody, pauses, tone shifts, and emotional intelligence. It is used to power voice assistants such as Maya and Miles to provide lifelike real-time conversations.

How does Sesame's Conversational Speech Model work?

CSM uses an autoregressive, transformer-based architecture together with residual vector quantization (RVQ). It takes interleaved text and audio tokens as input and generates both semantic and acoustic tokens. The model conditions on the entire conversation history, which lets it adapt prosody, rhythm, and emotional tone dynamically.
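Residual vector quantization is easier to follow with a small example. The sketch below is a generic, minimal illustration of the technique in plain Python and is not Sesame's implementation: the codebooks are made-up toy values, whereas a real speech codec would learn them from audio. Each stage picks the nearest codebook entry for whatever the previous stages failed to capture, and decoding sums the chosen entries.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec in stages; each stage encodes what the previous one missed."""
    residual, codes = list(vec), []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the chosen entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (made up for illustration): coarse first, finer afterwards.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
frame = [1.2, 0.9]
codes = rvq_encode(frame, codebooks)
print(codes, rvq_decode(codes, codebooks))
```

The list of indices, one per stage, is the token representation of the frame; adding stages refines the reconstruction at the cost of more tokens per frame.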

What are the key features of Sesame voices?

Key features include natural prosody, improved expressiveness, enhanced pronunciation, smooth transitions, contextual understanding, and personality consistency. The voice companions also handle micro-pauses, tone shifts, emotional cues, and real-time interaction.

Is Sesame open-source?

Yes. Sesame's CSM is open source, so the research community can build custom versions and contribute to the existing code base. The company has also published public demos and technical information about how the model works.

What are the differences between Sesame and traditional TTS?

Most commercial text-to-speech (TTS) products produce robotic-sounding voices that ignore conversation history and therefore lack emotion and conversational flow. In contrast, Sesame aims for what it calls "voice presence": real-time emotional depth, conversational flow, and prosody modeling based on the user's conversation history.

Is there a free demo or trial?

Yes, public demos let users try natural conversations with either Maya or Miles. At the time of our review, no pricing or subscription details were provided, since the product is research-focused.

What are the limitations of Sesame?

Some bugs are still being worked out, such as long pauses or artifacts in long conversations. The system supports voice cloning but ships with a limited number of voices, so users must provide short audio samples to clone any additional voices they want.

Can Sesame be used commercially?

Because Sesame's CSM is open source, it can be used in commercial applications such as customer service and smart home devices. Always verify the licensing terms first to confirm that your intended use is permitted.

Expert Verdict

Sesame AI's Conversational Speech Model is still in beta and may be unstable. However, because the model is open source, it is well positioned as a tool for developers who are building next-generation conversational agents and want an alternative to current expressive TTS frameworks.

Recommended For

Researchers working on improving AI capabilities related to speech synthesis
Developers building conversational AI applications
Companies developing customer service AI applications that require high levels of engagement
Organizations seeking open-source solutions for expressive TTS

Use With Caution

Production environments that require high stability
Applications that require low latency
Enterprise applications that require fine-tuning of the model before deployment

Not Recommended For

Real-time mission critical systems
Budget-constrained projects that require polished commercial support
Projects that simply need basic TTS functionality and can utilize one of the many well-established commercial providers

Expert's Conclusion

Sesame would be ideal for organizations developing innovative conversational AI applications that emphasize natural prosody and emotional expression, especially when working within an open-source environment.

Research Summary

Key Findings

Using advanced tokenization and a transformer-based architecture, Sesame AI has built a Conversational Speech Model (CSM) that produces human-like prosody and emotional expression and adapts to contextual changes in real time. The open-source model powers Maya and Miles, two natural voice companions that are currently in demo mode and have received positive user reviews for their realistic conversational flow. The model is in beta and supports voice cloning, but some inference instability issues remain.

Data Quality

Fair - detailed technical information from research blogs, Vogent docs, and demos; no official API or pricing details were found, and the company website (sesame.com) did not appear to be active. Primarily research and open-source focused.

Risk Factors

Inference instability issues during beta testing (pauses, artifacts)

Currently no large-scale public commercial infrastructure available.

Enterprise support and Service Level Agreements (SLA) unclear.

Model depends on the research updates occurring on an ongoing basis.

Last updated: February 2026

Additional Info

Technical Innovations

The CSM addresses the "one-to-many" problem in speech synthesis with a hybrid approach: semantic and acoustic tokens produced through residual vector quantization (RVQ), combined with context-aware prosody modeling.

Demo Voices

Maya and Miles each have distinct personality characteristics, including natural pauses, tone shifts, and even humor. The demos show how low-latency back-and-forth between a user and the voice companion mimics the rhythm of a typical human conversation.

Open-Source Commitment

By releasing the CSM as an open-source project, Sesame AI has enabled a range of research-based innovation in conversational AI. The model can also clone a voice from just 8-20 seconds of audio, which provides a customized voice option.
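Before submitting a clip for cloning, it can help to check that it falls inside the 8-20 second range quoted above. The sketch below does this for WAV files with Python's standard `wave` module. The helper names are illustrative, the WAV-only input is an assumption, and the test tone exists only so the example runs without a real recording.

```python
import io
import math
import struct
import wave

MIN_SECONDS, MAX_SECONDS = 8.0, 20.0  # range quoted in the review


def wav_duration_seconds(fileobj) -> float:
    """Duration of a WAV file, computed as frame count divided by sample rate."""
    with wave.open(fileobj, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_sample(fileobj) -> bool:
    """True when the clip length falls inside the quoted 8-20 second range."""
    return MIN_SECONDS <= wav_duration_seconds(fileobj) <= MAX_SECONDS


def make_test_tone(seconds: float, rate: int = 8000) -> io.BytesIO:
    """Build an in-memory mono 16-bit sine tone so the check can run anywhere."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        n = int(seconds * rate)
        samples = (int(8000 * math.sin(2 * math.pi * 220 * i / rate)) for i in range(n))
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    buf.seek(0)  # rewind so the buffer can be read back
    return buf


print(is_usable_sample(make_test_tone(10)))  # inside the range
print(is_usable_sample(make_test_tone(3)))   # too short
```

In practice the same check would be pointed at a file path or an uploaded file object instead of the generated tone.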

Real-World Applications

Sesame AI plans to target customer service applications, smart home device interactions, and augmented reality (AR) interactions, with voice companions as the long-term goal. The company seeks to build what it calls "voice presence" to earn consumer trust and engagement.

Media Reception

Sesame AI has been praised by users on Hacker News, The Verge, and several AI-related blogs for creating one of the most realistic and engaging conversational flows experienced to date.

Alternatives

• ElevenLabs: A leading expressive text-to-speech (TTS) platform with high-fidelity voices, voice cloning, and emotional control features. ElevenLabs has more commercial polish and API stability than Sesame's beta CSM, so it may be better suited to production voiceover work or use as a voice agent. (elevenlabs.io)
• PlayHT: Provides ultra-realistic TTS with conversational voices and very low latency. It also offers more enterprise-level support and multilingual options than Sesame's beta CSM, which makes it a better choice for companies building customer support bots where reliability outweighs cutting-edge research novelty. (play.ht)
• Respeecher: Advanced voice cloning and synthesis that is stronger for professional audio production and commercial applications than for real-time conversational use.
• Google WaveNet / Cloud TTS: Enterprise-level TTS with WaveNet neural voices and extensive integration capabilities; better suited to high-volume applications that need scalability and service level agreements, but potentially less expressive in prosody.
• Cartesia AI: Real-time voice AI with low-latency streaming and very expressive synthesis; closer to Sesame in conversational style, but with production APIs.

Voice Quality & Performance Metrics

Word Error Rate (WER): Low
Mean Opinion Score (MOS): 4.7/5.0
Response Latency: Minimal
End-to-End Latency: Low
Streaming Audio Latency: Sub-250 ms

Emotional & Expressive Voice Features

Emotional Tone Synthesis

Human-like intonation that recognizes and responds to emotional cues, adjusting tone for joy, sadness, or urgency based on the cue.

Prosody Control

Natural timing, pauses, emphasis, volume adjustments, and rhythm for realistic speech patterns.

Nonverbal Expressiveness

Responsive conversational flow that includes context-aware responses, as well as personality attributes such as Maya and Miles voice profiles.

Speaking Style Replication

Ability to customize speed, pitch, and emotions to produce unique voice characteristics for each user.

Real-Time Voice Streaming

Instant synthesis with little to no delay for users to engage with the application.

Multilingual Emotional Inflection

Support for multiple languages and plans to support over 20 languages while maintaining natural prosody.

Regulatory & Security Compliance Status

Data Privacy Compliance: App includes delete memory feature; details limited

SOC 2 Certification: Not specified for enterprise deployment

GDPR Data Processing: User data deletion available; residency unspecified

Real-Time PII Redaction: No explicit mention

End-to-End Encryption: Secure voice processing implied in app

HIPAA Compliance: Not mentioned for healthcare use

Safety Controls & Harm Mitigation

Impersonation Safeguards

Two pre-defined voice profiles (Maya and Miles) with no mention of voice cloning capabilities.

Misuse Detection & Blocking

Early-stage technology, and no specifics were provided about data safeguards.

Voice Spoofing Prevention

Not specifically addressed.

Crisis Response Protocol

No escalation procedures are described.

Child Safety Protections

No age-detection or parental controls mentioned.

Consent & Transparency Logging

Memory delete button was added for user control.

Operational & Business Performance KPIs

User Satisfaction (App Store): 4.3/5.0
Task Completion Rate: High
Response Time: Instant
Context Retention Score: Strong
Engagement Rate: High
Daily Active Users: Thousands

Integration & Customization Capabilities

WebSocket Streaming API

Simple API and SDK for developers to integrate real-time voice into their applications.

Voice Customization

Ability to fine-tune the speed, pitch, and emotion of the voices, as well as use two pre-defined voice profiles, Maya and Miles.

Multilingual Support

Plans to offer the ability to select from multiple languages and eventually expand to 20+ languages while maintaining natural prosody.

iOS App Integration

Native iPhone application with photo recognition and voice interaction.

Wearable Hardware Integration

AI Glasses allow users to have continuous hands-free interaction throughout their day.

Contextual Memory

Preserves both conversation continuity and user-personalization.

Environmental Awareness

Uses hardware to observe and understand the environment around the user to provide contextual assistance.

Privacy & Data Handling Specifications

Audio Retention Period: User-configurable with delete option
Automatic Data Deletion: Yes
Memory Management: Delete all data button in settings
Encryption Standard: Secure app processing
Data Residency: Cloud-based (unspecified regions)
User Consent Controls: Yes
Privacy Compliance: App Store compliant

Industry Vertical Deployment & Readiness

Industry Vertical	Adoption Level	Primary Use Cases	Key Features Utilized	Deployment Maturity
Content Creation	High	Videos, audiobooks, podcasts	Natural voice, emotional expressiveness	Production-Ready
Education Technology	Growing	Interactive learning content	Engaging natural voices	Production-Ready
Game Development	Moderate	NPC dialogue, voiceovers	Voice variety, emotional range	Production-Ready
Virtual Assistants	High	Personal AI companions	Contextual memory, real-time response	Production-Ready
Customer Support	Emerging	Voice agents, triage	Emotional intelligence, natural conversation	Beta/Early Adoption
Wearables & Hardware	Developing	AI glasses companion	Environmental awareness, hands-free	Prototype/Upcoming

Similar Products

Hume AI

hume.ai
