Sesame

  • What it is: Sesame is a conversational AI startup building emotionally intelligent voice companions, Maya and Miles, using its Conversational Speech Model for natural, real-time dialogue, with plans for AI-powered smart glasses.
  • Best for: Individuals and small projects, growing businesses and content creators, enterprises requiring voice solutions at scale
  • Pricing: Starting from $29/month
  • Rating: 85/100 (Very Good)
  • Expert's conclusion: Sesame would be ideal for organizations developing innovative conversational AI applications that emphasize natural prosody and emotional expression, especially when working within an open-source environment.
Reviewed by Maxim Manylov · Web3 Engineer & Serial Founder

What Is Sesame and What Does It Do?

Sesame AI develops emotionally intelligent voice companions, such as Maya and Miles, through its Conversational Speech Model (CSM). The company’s models are designed to give consumers an individualized, emotive experience through natural speech interactions. Sesame AI was founded by industry veterans who previously worked at Oculus and Meta; they plan to create lifelike AI companions that can be used with emerging wearable technologies such as smart glasses. Sesame AI focuses on giving users a sense of "voice presence," which differentiates it from traditional voice assistants that have limited or no emotional expressiveness.

Active
📍San Francisco, CA
📅Founded 2023
🏢Private
TARGET SEGMENTS
Consumers, Developers, Enterprises

What Are Sesame's Key Business Metrics?

  • Funding Raised: $47.5M Series A + $250M Series B
  • Valuation: $1B+
  • Training Data: 1M+ hours of audio
  • Hugging Face Stars: 2K+
  • Response Time: 200-300 ms
  • Languages: English primary, expanding to 20+

How Credible and Trustworthy Is Sesame?

85/100
Excellent

Evidence of the company's credibility includes a strong founding team from Oculus and Meta, significant funding from prominent venture capital firms (a16z and Sequoia), and the viral reach of its product demonstrations.

Product Maturity: 75/100
Company Stability: 90/100
Security & Compliance: 70/100
User Reviews: 85/100
Transparency: 80/100
Support Quality: 75/100

  • Backed by a16z, Sequoia Capital, Spark Capital
  • Founders from Oculus VR and Meta Reality Labs
  • Open-sourced CSM-1B model on Hugging Face
  • 1M+ hours audio training dataset

What is the history of Sesame and its key milestones?

2023

Company Founded

In June 2023, Sesame AI was founded by Brendan Iribe (Co-founder of Oculus and CEO), Ankit Kumar (former lead of AI at Discord and current CTO), and Ryan Brown (founding engineer at Meta Reality Labs).

2025

Research Demo Launch

In February 2025, Sesame AI released two viral product demonstrations of voice companions Maya and Miles that showcase the capabilities of the company’s CSM technology.

2025

Series A Funding

On February 27, 2025, Sesame AI closed a $47.5M Series A round led by Andreessen Horowitz.

2025

Series B Funding

By June 2025, Sesame AI raised $250M in a Series B round from Sequoia Capital and Spark Capital at a valuation of over $1 billion.

2025

Key Executive Hires

In June 2025, Sesame AI hired Oculus co-founder Nate Mitchell as Chief Product Officer to lead hardware development for the company.

2025

Model Open-Sourced

In July 2025, Sesame AI released the open-source CSM-1B model so that developers can work with the company’s technology; the CSM-1B model has received more than 2,000 stars on Hugging Face.

What Are the Key Features of Sesame?

Conversational Speech Model (CSM)
Sesame AI’s transformer-based model is able to process both text and audio tokens concurrently in under 300 milliseconds for natural and responsive interactions.
Emotional Intelligence
The CSM picks up on conversational cues such as laughter, interruptions, mid-sentence changes in tone, and filler words such as “um.”
Real-Time Dialogue
With 200-300 milliseconds of latency, Sesame AI’s voice companions are able to mimic the timing and pacing of human-to-human conversations including pauses and conversational back-and-forth.
Voice Companions (Maya & Miles)
Sesame AI’s emotionally-resonant AI companions were trained using over 1 million hours of real world audio data.
Open-Source CSM-1B
Sesame AI’s developer-accessible 1B-parameter model is available on Hugging Face and includes hosted inference APIs for easier integration into applications.
Multilingual Support
While Sesame AI primarily supports English language interactions, the company has expressed intentions to support up to 20 additional languages and incorporate contextual awareness.
AI Glasses Integration
Sesame AI intends to develop always-on wearables to pair their voice companions with lightweight augmented reality (AR) hardware.

What Technology Stack and Infrastructure Does Sesame Use?

Infrastructure

Multi-region GPU clusters (inferred for speech model training)

Technologies

Transformer models, PyTorch (inferred)

Integrations

Hugging Face, Hosted inference APIs, Smart glasses hardware

AI/ML Capabilities

Proprietary Conversational Speech Model (CSM) trained on 1M+ hours of audio data, processing text/audio tokens simultaneously for long-context (2,048 tokens/~2min) emotionally expressive speech with real-time turn-taking.

Based on research papers, demo descriptions, and training details from Contrary Research and RDWorld
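
As a rough sanity check on the context figure above, the sketch below derives the average token rate implied by 2,048 tokens covering about two minutes of audio. It treats "~2min" as exactly 120 seconds, which is an assumption, and the function name is illustrative rather than anything published by Sesame.

```python
def implied_tokens_per_second(context_tokens: int, context_seconds: float) -> float:
    """Return the average token rate implied by a context window and its duration."""
    if context_seconds <= 0:
        raise ValueError("context_seconds must be positive")
    return context_tokens / context_seconds


# Figures quoted in the review: 2,048 tokens covering roughly 2 minutes.
CONTEXT_TOKENS = 2048
CONTEXT_SECONDS = 120.0  # assumption: "~2min" taken as exactly 120 s

rate = implied_tokens_per_second(CONTEXT_TOKENS, CONTEXT_SECONDS)
print(f"Implied rate: {rate:.1f} tokens per second")
```

Under that assumption the window works out to roughly 17 tokens per second of conversation; the real rate depends on how text and audio tokens are mixed, which the sources do not specify.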

What Are the Best Use Cases for Sesame?

Individual Consumers
AI-powered companions that support individuals in their daily lives through emotional connection, natural conversation, and, in the future, "always-on" access via wearable smart glasses.
AI Developers
Gives developers access to the open-source CSM-1B model and its hosted APIs for building expressive voice applications with approximately 200-300 ms of latency.
Customer Service Teams
Lets teams deploy emotionally intelligent voice agents that offer a more human-like experience and better engagement than robotic-sounding voice agents.
Automotive Interfaces
Provides drivers with hands-free virtual companions that offer contextual awareness and proactive assistance.
NOT FOR: High-Frequency Trading
Not applicable: the 200-300 ms latency of the conversational AI does not meet the sub-100 ms requirements of real-time financial operations.
NOT FOR: HIPAA-Regulated Healthcare
Limited current compliance: does not currently offer healthcare data-handling arrangements such as a Business Associate Agreement (BAA).

How Much Does Sesame Cost and What Plans Are Available?

Pricing information with service tiers, costs, and details

Service | Cost | Details | Source
Starter | $29/month | 10 hours of voice synthesis, 5 hours of voice recognition, 3 custom voice profiles, standard API access, email support | Sesame AI Voice official pricing
Professional | $49/month | 50 hours of voice synthesis, 25 hours of voice recognition, 10 custom voice profiles, advanced API access, priority support, voice transformation features, analytics dashboard | Sesame AI Voice official pricing
Enterprise | Custom quote | Unlimited voice synthesis, unlimited voice recognition, unlimited custom voice profiles, full API access, dedicated support team, advanced security features, custom integration support, service level agreement | Sesame AI Voice official pricing
Free Trial | 14 days | Full access to plan features, no credit card required | Sesame AI Voice official pricing
Sesame HR AI Add-on | $49/month | For employee training use case, includes AI-driven learning personalization. $42.50/month with annual billing | Sesame HR pricing
Mobile App (iOS) | $7.99/week or $24.99 or $89.99/year | Sesame AI Voice Pro subscription on Apple App Store with in-app purchase options | App Store listing
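
To compare the tiers on a like-for-like basis, the following sketch computes an effective price per included synthesis hour and the discount implied by annual billing on the HR add-on. The figures are copied from the pricing listed above; the `Plan` class and function names are illustrative, and attributing the whole monthly price to synthesis hours (ignoring recognition hours and other features) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_price_usd: float
    synthesis_hours: float
    recognition_hours: float


# Figures from the pricing listed above. Enterprise is omitted because it is
# custom-quoted and has no published price.
PLANS = [
    Plan("Starter", 29.0, 10, 5),
    Plan("Professional", 49.0, 50, 25),
]


def cost_per_synthesis_hour(plan: Plan) -> float:
    """Monthly price divided by included synthesis hours (assumes full usage)."""
    return plan.monthly_price_usd / plan.synthesis_hours


def annual_discount_percent(monthly: float, annual_monthly: float) -> float:
    """Percentage saved per month when billed annually instead of monthly."""
    return (monthly - annual_monthly) / monthly * 100


for plan in PLANS:
    print(f"{plan.name}: ${cost_per_synthesis_hour(plan):.2f} per synthesis hour")

# Sesame HR AI add-on: $49/month, or $42.50/month with annual billing.
print(f"HR add-on annual saving: {annual_discount_percent(49.0, 42.50):.1f}%")
```

On these numbers the Professional tier costs about a third as much per included synthesis hour as Starter, provided the full allocation is actually used.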

Are There Usage Limits or Geographic Restrictions for Sesame?

Voice Synthesis Hours
10 hours/month (Starter), 50 hours/month (Professional), unlimited (Enterprise)
Voice Recognition Hours
5 hours/month (Starter), 25 hours/month (Professional), unlimited (Enterprise)
Custom Voice Profiles
3 profiles (Starter), 10 profiles (Professional), unlimited (Enterprise)
Voice Cloning
Available on Professional and Enterprise plans only
API Access Level
Standard (Starter), Advanced (Professional), Full (Enterprise)
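
The limits above can be turned into a simple lookup for choosing a tier. The sketch below is only an illustration of that reasoning: the dictionary layout, the function names, and the use of `None` for "unlimited" are assumptions, while the numbers come from the list above.

```python
from typing import Optional

# Monthly limits from the list above; None means "unlimited".
LIMITS = {
    "Starter": {"synthesis_hours": 10, "recognition_hours": 5,
                "voice_profiles": 3, "voice_cloning": False},
    "Professional": {"synthesis_hours": 50, "recognition_hours": 25,
                     "voice_profiles": 10, "voice_cloning": True},
    "Enterprise": {"synthesis_hours": None, "recognition_hours": None,
                   "voice_profiles": None, "voice_cloning": True},
}
PLAN_ORDER = ["Starter", "Professional", "Enterprise"]  # cheapest first


def fits(limit: Optional[float], needed: float) -> bool:
    """A requirement fits if the limit is unlimited or at least as large."""
    return limit is None or needed <= limit


def smallest_plan(synthesis_hours: float, recognition_hours: float,
                  voice_profiles: int, needs_cloning: bool = False) -> str:
    """Return the first (cheapest) tier whose limits cover the requirements."""
    needs = {
        "synthesis_hours": synthesis_hours,
        "recognition_hours": recognition_hours,
        "voice_profiles": voice_profiles,
    }
    for name in PLAN_ORDER:
        limits = LIMITS[name]
        if needs_cloning and not limits["voice_cloning"]:
            continue
        if all(fits(limits[key], value) for key, value in needs.items()):
            return name
    # Not reachable with the table above, since Enterprise is unlimited.
    return "Enterprise"


print(smallest_plan(8, 4, 2))                      # light usage
print(smallest_plan(8, 4, 2, needs_cloning=True))  # cloning forces an upgrade
print(smallest_plan(60, 10, 1))                    # exceeds Professional hours
```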

Is Sesame Secure and Compliant?

Cloud Infrastructure: Cloud-based infrastructure with a scalable architecture designed to handle peak loads without performance degradation
Advanced Security Features: Available on the Enterprise plan
Data Privacy: Voice data handling with privacy considerations for custom voice profiles and voice cloning

What Customer Support Options Does Sesame Offer?

Channels
Standard (Starter), Priority (Professional), Dedicated team (Enterprise)
API documentation and integration guides available
Specialized
Dedicated support team available for Enterprise customers with custom integration support

Who Is Sesame Best For?

Best For

  • Individuals and small projects: The Starter plan at $29/month provides a low-cost entry point to the platform's voice synthesis and recognition capabilities.
  • Growing businesses and content creators: The Professional plan adds features such as voice transformation and an analytics dashboard to help scale voice applications.
  • Enterprises requiring voice solutions at scale: The Enterprise plan offers unlimited resources and support, plus custom integration options for complex deployments.
  • Organizations implementing AI-driven employee training: The Sesame HR integration provides personalized learning experiences with AI-enabled voice capabilities.
  • Developers building voice-enabled applications: Multiple API access levels and customizable voice profiles allow flexible integration.

Not Suitable For

  • Users requiring high-volume voice processing: The monthly hour limits on the Starter and Professional plans may constrain very large-scale operations; consider the Enterprise option.
  • Projects with minimal budget: There is no free tier; pricing starts at $29/month. Consider open-source voice solutions or other free alternatives.

What are the strengths and limitations of Sesame?

Pros

  • Reduced Costs -- Eliminates the cost of hiring voice talent, renting recording studios, and paying for manual transcription of audio files.
  • Pricing Flexibility -- Pay only for what you use, with tiers scaling from $29/month to custom Enterprise plans.
  • Cloud-based Infrastructure -- Scales automatically to handle peak loads without performance degradation.
  • Custom Voice Profiles -- Build unique voices that fit your brand identity by adjusting pitch, speed, tone, and accent.
  • Voice Cloning Capability -- The Professional and Enterprise plans can create digital voice clones, subject to authorization.
  • Multi-tiered API Access -- Standard, Advanced, and Full API access levels cover different technical needs.
  • Flexibility With Integrations -- Enterprise-tier users receive custom integration support for more complex implementations.

Cons

  • Limited Hours -- The Starter tier is capped at 10 hours of voice synthesis per month, which restricts large-scale or heavy usage.
  • No Free Tier Available -- Paid plans start at $29 per month; a 14-day free trial is listed, but no trial terms specific to voice synthesis are mentioned.
  • Lack Of Transparency Regarding Advanced Features -- The available sources give insufficient detail on voice cloning, voice modification, and the security features of the higher tiers.
  • Enterprise Pricing Is Not Published -- The Enterprise tier is custom-quoted, covering advanced security, unlimited usage, and dedicated support, so the actual price is unknown up front.
  • Voice Customization Has A Learning Curve -- Pitch, speed, tone, and accent can be adjusted through parameters, but finding good values may require experimentation.
  • Monthly Allocations Limit The Ability To Scale -- The Professional tier is capped at 50 hours of voice synthesis and 25 hours of voice recognition per month.

What APIs and Integrations Does Sesame Support?

API Type
No public API documentation found. Sesame appears to be a research-focused open-source Conversational Speech Model (CSM) rather than a commercial API service.
Authentication
Not applicable - no developer API identified in available sources.
Webhooks
No webhook support mentioned.
SDKs
Open-source model available; potential integration via Hugging Face or similar platforms, but no official SDKs documented.
Documentation
Limited - technical details in research blogs and Vogent docs; no comprehensive developer portal.
Sandbox
Public demo available for testing conversational voices (Maya and Miles).
SLA
None - beta/research product, not enterprise-grade service.
Rate Limits
Not specified.
Use Cases
Real-time conversational AI, voice companions, customer service agents, smart assistants requiring natural prosody and emotional expressiveness.

What Are Common Questions About Sesame?

Sesame AI is a conversational speech model (CSM) designed to produce natural-sounding, expressive speech with human-like prosody, pauses, tone shifts, and emotional intelligence. It is used to power voice assistants such as Maya and Miles to provide lifelike real-time conversations.

CSM uses an autoregressive, transformer-based architecture together with residual vector quantization (RVQ) to generate both semantic and acoustic tokens from interleaved text and audio tokens. The model conditions on the entire conversation history, which lets it adapt prosody, rhythm, and emotional tone dynamically.
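
To make the RVQ idea concrete, here is a minimal, generic sketch of residual vector quantization in plain Python. It is not Sesame's implementation: the codebooks are random toy values rather than learned ones, and all names and sizes are assumptions chosen for illustration. Each stage quantizes whatever the previous stages failed to capture, so one audio frame becomes a short list of integer codes, one per codebook.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Codebook = List[Vector]


def nearest_index(codebook: Sequence[Vector], target: Vector) -> int:
    """Return the index of the codeword closest to target (squared Euclidean)."""
    def squared_distance(codeword: Vector) -> float:
        return sum((c - t) ** 2 for c, t in zip(codeword, target))
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> Tuple[List[int], Vector]:
    """Quantize x stage by stage; return one code per stage and the final residual."""
    residual = list(x)
    codes: List[int] = []
    for codebook in codebooks:
        index = nearest_index(codebook, residual)
        codes.append(index)
        # Pass on only what this stage could not represent.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes, residual


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> Vector:
    """Reconstruct the vector by summing the selected codeword from each stage."""
    reconstruction = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(codes, codebooks):
        reconstruction = [a + c for a, c in zip(reconstruction, codebook[index])]
    return reconstruction


def make_toy_codebooks(num_stages: int, size: int, dim: int, seed: int = 0) -> List[Codebook]:
    """Random stand-in codebooks; later stages use smaller values for finer detail."""
    rng = random.Random(seed)
    codebooks: List[Codebook] = []
    for stage in range(num_stages):
        scale = 1.0 / (2 ** stage)
        # The zero codeword guarantees a stage can never increase the error.
        codebook: Codebook = [[0.0] * dim]
        codebook += [[rng.uniform(-scale, scale) for _ in range(dim)]
                     for _ in range(size - 1)]
        codebooks.append(codebook)
    return codebooks


codebooks = make_toy_codebooks(num_stages=3, size=8, dim=4)
frame = [0.3, -0.7, 0.5, 0.1]  # stand-in for one frame of audio features
codes, residual = rvq_encode(frame, codebooks)
approx = rvq_decode(codes, codebooks)
print("codes:", codes)
print("remaining squared error:", sum(r * r for r in residual))
```

In a real codec the codebooks are trained so that early stages capture coarse structure and later stages add detail; the toy version only shows the encode and decode mechanics.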

Natural prosody, improved expressiveness, enhanced pronunciation, smooth transitions, contextual understanding, and personality consistency. The voice companions also handle micro-pauses, tone shifts, emotional cues, and real-time interactions.

Yes, Sesame's CSM is an open-source framework that can be used by the research community to build their own custom versions of this system and add to the existing code base. The company has made public demos of this technology along with technical information about how it works.

Most commercial TTS (text-to-speech) products produce robotic sounding voices that do not take into consideration the conversation history and therefore lack emotion and conversational flow. In contrast, Sesame provides what we refer to as "voice presence" by providing real-time emotional depth, conversational flow, and prosody modeling based on the user's conversation history.

Yes, public demos let users try natural conversations with either Maya or Miles. At the time of our review, no pricing or subscription details were provided, since the product is research-focused.

The system still has bugs being worked out, such as long pauses or artifacts in long conversations. Voice cloning is supported, but the number of built-in voices is limited, and users must provide short audio samples of any voice they want to clone.

Since Sesame is an open-source framework, it can be used in commercial applications, such as customer service and smart home devices; however, please note that you should always verify the licensing terms to ensure that you have permission to use this framework for your application.

Is Sesame Worth It?

Sesame AI's Conversational Speech Model is still in beta and may be unstable at times; however, because the entire system is open source, it is well positioned as a tool for developers who are building next-generation conversational agents and want an alternative to current expressive TTS frameworks.

Recommended For

  • Researchers working on improving AI capabilities related to speech synthesis
  • Developers building conversational AI applications
  • Companies developing customer service AI applications that require high levels of engagement
  • Organizations seeking open-source solutions for expressive TTS

Use With Caution

  • Production environments that require high stability
  • Applications that require low latency
  • Enterprise applications that require fine-tuning of the model before deployment

Not Recommended For

  • Real-time mission critical systems
  • Budget-constrained projects that require polished commercial support
  • Projects that simply need basic TTS functionality and can utilize one of the many well-established commercial providers
Expert's Conclusion

Sesame would be ideal for organizations developing innovative conversational AI applications that emphasize natural prosody and emotional expression, especially when working within an open-source environment.


What do expert reviews and research say about Sesame?

Key Findings

Using a combination of advanced tokenization and transformer-based architectures, Sesame AI has built a Conversational Speech Model (CSM) that produces human-like prosody and emotional expression while adapting to contextual changes in real time. The open-source model powers Maya and Miles, two natural voice companions that are currently in demo mode and have received positive user reviews for their realistic conversational flow. The model is currently in beta with support for voice cloning; however, some inference instability issues remain.

Data Quality

Fair - detailed technical information is available from research blogs, Vogent docs, and demos, but no official API or pricing details were found, and the company website (sesame.com) did not appear to be active. The product is primarily research and open-source focused.

Risk Factors

  • Inference instability issues during beta testing (pauses, artifacts).
  • No large-scale public commercial infrastructure is currently available.
  • Enterprise support and Service Level Agreements (SLAs) are unclear.
  • The model depends on ongoing research updates.
Last updated: February 2026

What Additional Information Is Available for Sesame?

Technical Innovations

The CSM addresses the "one-to-many" problem in speech synthesis (the same text can be spoken in many valid ways) with a hybrid approach that combines semantic and acoustic tokens, produced via residual vector quantization (RVQ), with a context-aware prosody modeling framework.

Demo Voices

Maya and Miles each have distinct personality characteristics, including natural pauses, tone shifts, and even humor. A series of demos shows how the low-latency back-and-forth between a user and the AI voice companion mimics the rhythm of a typical human conversation.

Open-Source Commitment

Since Sesame AI released the CSM model as an open source project, it has enabled a vast array of research based innovations within the conversational AI field. Additionally, Sesame has made it possible to clone a wide variety of voices utilizing just 8-20 seconds of audio and provide a customized voice option for the AI model.
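
If you plan to experiment with cloning, it can help to check a reference clip against the 8-20 second range mentioned above before using it. The sketch below does this for WAV data with Python's standard `wave` module; the function names and the idea of validating clips up front are assumptions for illustration, not part of any Sesame tooling.

```python
import io
import wave

# Range quoted in the review for voice-cloning reference audio.
MIN_SECONDS = 8.0
MAX_SECONDS = 20.0


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the duration of in-memory WAV data in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_usable_clone_sample(wav_bytes: bytes) -> bool:
    """True if the clip length falls inside the quoted 8-20 second range."""
    return MIN_SECONDS <= wav_duration_seconds(wav_bytes) <= MAX_SECONDS


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> bytes:
    """Build a mono 16-bit silent WAV clip, used here only as test input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


print(is_usable_clone_sample(make_silent_wav(10)))  # inside the range
print(is_usable_clone_sample(make_silent_wav(3)))   # too short
```

Duration is only a first filter; clip quality, background noise, and consent from the speaker matter at least as much.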

Real-World Applications

Sesame AI plans to target customer service applications, smart home devices, and augmented reality (AR) interactions, with voice companions as the long-term goal. Ultimately, Sesame AI seeks to build what it calls "voice presence" to foster consumer trust and engagement.

Media Reception

Sesame AI has been praised by users on Hacker News, The Verge, and several AI-related blogs for creating one of the most realistic and engaging conversational flows experienced to date.

What Are the Best Alternatives to Sesame?

  • ElevenLabs: A leading expressive text-to-speech (TTS) platform with high-fidelity voices, voice cloning, and emotional control features. ElevenLabs has more commercial polish and API stability than Sesame's beta CSM, so it may be better suited to production voiceover work and voice agents. elevenlabs.io
  • PlayHT: Provides ultra-realistic TTS with conversational voices and very low latency, along with more enterprise-level support and multilingual options than Sesame's beta CSM. It looks like the better choice for companies building customer support bots where reliability outweighs cutting-edge research novelty. play.ht
  • Respeecher: Offers advanced voice cloning and synthesis that is stronger for professional audio production and commercial work than for real-time conversational applications.
  • Google WaveNet / Cloud TTS: Enterprise-level TTS with WaveNet neural voices and extensive integration options; better suited to high-volume applications that need scalability and service level agreements, though potentially less expressive in prosody.
  • Cartesia AI: Real-time voice AI with low-latency streaming and highly expressive synthesis; closer to Sesame in conversational style, but with production APIs.

Voice Quality & Performance Metrics

  • Word Error Rate (WER): Low (no figure given)
  • Mean Opinion Score (MOS): 4.7/5.0
  • Response Latency: Minimal (no figure given)
  • End-to-End Latency: Low (no figure given)
  • Streaming Audio Latency: Sub-250 ms
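
Word Error Rate is reported only qualitatively above, so readers who want a number will need to measure it themselves. The sketch below shows the standard definition, word-level edit distance divided by the number of reference words, in plain Python. It is a generic illustration, not Sesame's evaluation code, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            )
        previous = current
    return previous[-1] / len(ref_words)


# One of six reference words is missing from the transcript.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Real evaluations usually normalize casing and punctuation before scoring, which this sketch leaves to the caller.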

Emotional & Expressive Voice Features

Emotional Tone Synthesis

Human-like intonation that recognizes and responds to emotional cues, adjusting tone for joy, sadness, or urgency based on the cue.

Prosody Control

Natural timing, pauses, emphasis, volume adjustments, and rhythm for realistic speech patterns.

Nonverbal Expressiveness

Responsive conversational flow that includes context-aware responses, as well as personality attributes such as Maya and Miles voice profiles.

Speaking Style Replication

Ability to customize speed, pitch, and emotions to produce unique voice characteristics for each user.

Real-Time Voice Streaming

Instant synthesis with little to no delay for users to engage with the application.

Multilingual Emotional Inflection

Support for multiple languages and plans to support over 20 languages while maintaining natural prosody.

Regulatory & Security Compliance Status

Data Privacy Compliance: App includes delete memory feature; details limited
SOC 2 Certification: Not specified for enterprise deployment
GDPR Data Processing: User data deletion available; residency unspecified
Real-Time PII Redaction: No explicit mention
End-to-End Encryption: Secure voice processing implied in app
HIPAA Compliance: Not mentioned for healthcare use

Safety Controls & Harm Mitigation

Impersonation Safeguards

Two pre-defined voice profiles (Maya and Miles) with no mention of voice cloning capabilities.

Misuse Detection & Blocking

Early-stage technology, and no specifics were provided about data safeguards.

Voice Spoofing Prevention

Not specifically addressed.

Crisis Response Protocol

No escalation procedures are described.

Child Safety Protections

No age-detection or parental controls mentioned.

Consent & Transparency Logging

Memory delete button was added for user control.

Operational & Business Performance KPIs

  • User Satisfaction (App Store): 4.3/5.0
  • Task Completion Rate: High (no figure given)
  • Response Time: Instant
  • Context Retention Score: Strong (no figure given)
  • Engagement Rate: High
  • Daily Active Users: Thousands

Integration & Customization Capabilities

WebSocket Streaming API

Simple API and SDK for developers to integrate real-time voice into their applications.

Voice Customization

Ability to fine-tune the speed, pitch, and emotion of the voices, as well as use two pre-defined voice profiles, Maya and Miles.

Multilingual Support

Plans to offer the ability to select from multiple languages and eventually expand to 20+ languages while maintaining natural prosody.

iOS App Integration

Native iPhone application with photo recognition and voice interaction.

Wearable Hardware Integration

AI Glasses allow users to have continuous hands-free interaction throughout their day.

Contextual Memory

Preserves both conversation continuity and user-personalization.

Environmental Awareness

Uses hardware to observe and understand the environment around the user to provide contextual assistance.

Privacy & Data Handling Specifications

Audio Retention Period
User-configurable with delete option
Automatic Data Deletion
Yes
Memory Management
Delete all data button in settings
Encryption Standard
Secure app processing
Data Residency
Cloud-based (unspecified regions)
User Consent Controls
Yes
Privacy Compliance
App Store compliant

Industry Vertical Deployment & Readiness

Industry Vertical | Adoption Level | Primary Use Cases | Key Features Utilized | Deployment Maturity
Content Creation | High | Videos, audiobooks, podcasts | Natural voice, emotional expressiveness | Production-Ready
Education Technology | Growing | Interactive learning content | Engaging natural voices | Production-Ready
Game Development | Moderate | NPC dialogue, voiceovers | Voice variety, emotional range | Production-Ready
Virtual Assistants | High | Personal AI companions | Contextual memory, real-time response | Production-Ready
Customer Support | Emerging | Voice agents, triage | Emotional intelligence, natural conversation | Beta/Early Adoption
Wearables & Hardware | Developing | AI glasses companion | Environmental awareness, hands-free | Prototype/Upcoming
