AudioCraft

by Meta AI
  • What it is: AudioCraft is a framework that generates high-quality audio and music from text prompts using three integrated models: MusicGen, AudioGen, and EnCodec.
  • Best for: AI researchers and machine learning practitioners; music and audio game developers; academic institutions and research labs
  • Pricing: Free and open source; self-hosted deployments pay infrastructure costs only
  • Rating: 88/100 (Very Good)
  • Expert's conclusion: AudioCraft is suitable for technically savvy users and researchers who want to leverage open-source AI to innovate in generated audio. However, you still need development skills and sufficient compute resources to make use of AudioCraft.
Reviewed by Maxim Manylov · Web3 Engineer & Serial Founder

What Are AudioCraft's Key Business Metrics?

Core Models
3 (MusicGen, AudioGen, EnCodec)
Open Source
Yes – all model weights and code released
Audio Generation Quality
High quality with long-term consistency
Supported Audio Types
Music, sound effects, environmental audio, compression
Training Data
MusicGen: Meta-owned licensed music; AudioGen: public sound effects

How Credible and Trustworthy Is AudioCraft?

88/100
Excellent

AudioCraft shows a high level of credibility as an open-source generative AI platform backed by Meta, including the release of its model architecture, code, and model weights, along with demonstrated real-world applications. Its credibility is further strengthened by institutional support and ongoing validation through active research.

Product Maturity: 90/100
Company Stability: 95/100
Security & Compliance: 80/100
User Reviews: 85/100
Transparency: 90/100
Support Quality: 75/100
  • Developed and released by Meta
  • Fully open-sourced with model weights and code
  • Published research papers backing the architecture
  • Simplified, elegant model design with a single autoregressive transformer
  • Production-ready pre-trained models available
  • Active use in research and commercial projects

What Are the Key Features of AudioCraft?

Text-to-Music Generation
Generate diverse, high-quality music from text prompts using MusicGen, which was trained on Meta-owned and specifically licensed music for professional-grade results.
Text-to-Audio/Sound Generation
Generate environmental sounds and audio effects such as dog barking, car horn, footsteps, etc., using AudioGen that was trained on publicly available sound effects.
Melody-Conditioned Generation
Use melodic features and chromagram input as well as the text description to condition the musical output and allow for specific compositional control over generated music.
Neural Audio Codec (EnCodec)
AudioCraft has developed an innovative audio codec that converts raw audio waveforms into discrete tokens and back again allowing for the efficient compression and generation of high-quality audio.
Single-Stage Autoregressive Architecture
A single transformer-based language model at the core of the design uses token interleaving patterns, simplifying the architecture and eliminating the cascaded models that slow down generation.
Long-Term Audio Consistency
Captures long term dependencies in sequential audio data to generate coherent, high-quality samples with reduced artifacts when compared to existing methods.
Multi-Band Diffusion Decoding
Advanced decoding framework that generates high-fidelity audio from low-bitrate discrete token representations and works with any audio modality.
Flexible Conditioning Models
Can be conditioned using multiple text encoders such as T5, FLAN-T5, and CLAP, and various conditioning approaches, providing users with flexibility to tailor their use case and needs.
Unified Codebase
Single, integrated platform for generating music, creating sound effects, and compressing audio, allowing researchers to develop and expand models on top of a single foundation.
Open-Source Model Weights
All aspects of the model are completely transparent with the release of pre-trained model weights and code, allowing researchers to develop and train models based on their own datasets.
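The single-stage design above works because EnCodec's parallel token streams are interleaved so one autoregressive transformer can predict them all. Below is a small, self-contained sketch of a "delay"-style interleaving pattern in the spirit of the MusicGen paper; it is illustrative only and does not reproduce AudioCraft's actual implementation (`PAD`, `delay_interleave`, and `delay_deinterleave` are names invented here).

```python
# Illustrative sketch (not AudioCraft's implementation) of the "delay"
# codebook interleaving pattern used by MusicGen-style single-stage models:
# codebook k is shifted by k steps so all K streams can be predicted by a
# single autoregressive transformer, one time step per forward pass.

PAD = -1  # placeholder token for positions with no codebook entry yet

def delay_interleave(codes):
    """codes: K lists of T tokens (one list per codebook).
    Returns K lists of length T + K - 1, with codebook k delayed by k."""
    K = len(codes)
    out = []
    for k, stream in enumerate(codes):
        out.append([PAD] * k + list(stream) + [PAD] * (K - 1 - k))
    return out

def delay_deinterleave(interleaved, T):
    """Invert the pattern: drop each stream's leading/trailing padding."""
    return [stream[k:k + T] for k, stream in enumerate(interleaved)]

# Two codebooks, four frames:
codes = [[10, 11, 12, 13],
         [20, 21, 22, 23]]
shifted = delay_interleave(codes)
assert shifted == [[10, 11, 12, 13, PAD],
                   [PAD, 20, 21, 22, 23]]
assert delay_deinterleave(shifted, 4) == codes
```

The delay pattern trades a few extra steps of padding for a dramatically simpler model: one transformer pass per time step instead of a cascade of specialized models.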

What Are the Best Use Cases for AudioCraft?

Professional Musicians
Explore new musical compositions without having to play an instrument; generate many different styles and arrangements as creative sources or production tools.
Small Business Content Creators
Produce soundtracks for video commercials and social content in minutes using automated music generation, turning a text brief into professional-sounding audio ready for upload to social media platforms.
Radio/Podcast Producers
Generate custom intros, transition stingers, and background beds from text prompts instead of licensing stock music libraries.
Video Game Developers
Procedural audio enables dynamic game music and ambiance: responsive audio driven by player actions in an environment, with creation costs greatly reduced through automation.
AI Research Practitioners
Train your own proprietary generative AI models on internal datasets; expand the boundaries of current audio generation research through a highly extendable code base and methodologies that have been shared publicly.
Audio Post-Production Studios
Create placeholder audio during preproduction to help speed up creative workflows; build libraries of sound effects to aid in efficiency; accelerate creative iterations.
NOT FORReal-Time Music Performance Applications
Unsuitable – generating music requires significant processing power and will not meet the low-latency requirements needed for live performances.
NOT FORCommercial Music Licensing Platforms
Unlikely to be applicable – the quality and, more importantly, the licensing provenance of generated music are difficult to guarantee for mass commercial use.
NOT FORAccessibility Use Cases Requiring Perfect Lip-Sync
Not ideal – while text-to-audio models are strong at producing musical output, they offer little precision in timing control, making it hard to synchronize components such as lip-sync in a video.

How Much Does AudioCraft Cost and What Plans Are Available?

Pricing information with service tiers, costs, and details:

Open Source (Free) – $0
Unrestricted access to MusicGen, AudioGen, and EnCodec model weights; complete codebase; inference and training code included; no licensing restrictions for research or commercial use. (Source: official announcement)
Demo/API Access – Free to try
Web-based demo available at audiocraft.metademolab.com for testing models without installation.
Self-Hosted Deployment – Infrastructure costs only
Deploy models on your own servers or cloud infrastructure (AWS, Azure, GCP); no licensing fees; pay only for compute resources.
Commercial Integration
For commercial products or services integrating AudioCraft, consult Meta regarding licensing and support.

How Does AudioCraft Compare to Competitors?

Feature | AudioCraft | Google MusicLM | Jukebox (OpenAI) | VALL-E (Microsoft)
Text-to-Music Generation | Yes | Yes | Yes | No
Text-to-Sound/Audio | Yes | Partial | No | No
Melody Conditioning | Yes | Yes | No | No
Open Source Code & Weights | Yes | No | Partial | No
Pre-trained Models Available | Yes | No (research only) | Limited | No
Single-Stage Architecture | Yes | No (cascading) | No | No
Custom Training Support | Yes (full codebase) | Limited | Limited | No
Long-Term Consistency | Yes | Yes | Partial | –
Commercial Availability | Open source | Limited/Research | Limited/Research | Research only
Audio Compression (EnCodec) | Yes | No | No | No

How Does AudioCraft Compare to Competitors?

vs OpenAI Jukebox

AudioCraft offers a more advanced yet easier-to-manage architecture, using token interleaving over a single autoregressive language model, whereas Jukebox relied on a cascade of models and focused solely on music generation.

What is the primary difference between AudioCraft and Descript or Runway in terms of product development? A major differentiator is that AudioCraft requires technical implementation, whereas Descript and Runway each provide a polished user interface for their products.

vs Google MusicLM

Both MusicLM and AudioCraft provide text-to-music functionality; however, AudioCraft is open source and has released its trained model weights, allowing researchers to train their own custom models. Google MusicLM is similar in quality but remains proprietary. Additionally, AudioCraft's EnCodec compression is among the leaders in the industry.

Are there pricing differences between AudioCraft and commercial alternatives such as Descript or Runway? Yes. Descript and Runway charge per-feature monthly fees, while AudioCraft is free to use, though you still need to host your own infrastructure.

vs Stability AI Audio

AudioCraft offers more mature, production-ready models along with complete training pipelines. In addition to a more complete codebase, AudioCraft has an active contributor community and better documentation. While both Stability AI's audio models and AudioCraft are recent entrants to the field, AudioCraft has a more extensive model ecosystem, including MusicGen, AudioGen, MAGNeT, JASCO, and AudioSeal.

How does AudioCraft's technical architecture compare to OpenAI's earlier approach to audio generation? AudioCraft's single-stage design is a significant improvement over the cascaded methods OpenAI previously used.

vs Descript/Runway (commercial platforms)

How do Descript and Runway differ from AudioCraft in terms of openness? AudioCraft is a completely open-source developer tool, while Descript and Runway are closed-source, consumer-focused products.

Why choose AudioCraft over Runway or Google, or vice versa? Choose AudioCraft if you want a highly customizable, research-friendly tool; choose an offering like Google's if you want an enterprise-stable, managed solution.

What are the strengths and limitations of AudioCraft?

Pros

  • Model Diversity and Ecosystem Maturity – AudioCraft leads in model diversity and overall ecosystem maturity for audio generation research.
  • Research-First Positioning – Best suited for development and research, while commercial tools such as Runway or Google's offerings target plug-and-play consumer use.
  • Free and Open Source – All model weights and training code are released openly, reducing the cost of implementation.
  • Novel Architecture – Advanced token interleaving in a single-model design, replacing earlier cascading designs and allowing much better long-term dependency capture.
  • Task Variety – Models cover music generation (MusicGen), sound generation (AudioGen), compression (EnCodec), non-autoregressive generation (MAGNeT), audio watermarking (AudioSeal), and chord/melody-conditioned generation (JASCO).
  • High-Fidelity Neural Codec – EnCodec is among the most advanced approaches to compressing and tokenizing audio.
  • Flexible Conditioning – Supports text-to-music, text-to-sound, melody conditioning, and drum/chord track conditioning to accommodate different users' needs.

Cons

  • Research-Oriented Design – Built for researchers, with training pipelines and PyTorch components that assume ML expertise rather than end-user convenience.
  • Community-Driven Development – Maintenance and improvements depend on community contributions; there is no dedicated product team or commercial support.
  • Not Beginner-Friendly – Requires technical expertise in Python and PyTorch as well as machine learning knowledge.
  • Training Data Concerns — MusicGen was trained on licensed music. AudioGen was trained on public sounds. Any custom training will require you to be aware of the legal and copyright ramifications of your choice.
  • User Interface — There are no graphical interfaces in this tool. All interaction is done by entering commands at a command line and/or through writing code.
  • Hardware Requirements — Generation requires GPU access, therefore, it can create a barrier for people who have limited resources available to them.

Who Is AudioCraft Best For?

Best For

  • AI researchers and machine learning practitioners – Full access to model weights, training code, and extensive documentation provides an environment for advanced research and for building on the released models.
  • Music and audio game developers – Controllable generation: melody/chord conditioning and sound-effect synthesis let developers generate creative, dynamic audio for games.
  • Academic institutions and research labs – Open-source framework with reproducible training pipelines, well suited to both conducting and publishing audio-AI research.
  • Enterprise AI teams with ML expertise – Customizable: integrate AudioCraft into custom pipelines, fine-tune it on proprietary data, and adjust model behavior.
  • Audio and music software developers – APIs and model weights enable consumer applications with up-to-date generative capabilities.

Not Suitable For

  • Non-technical creators and content producers – No user interface: this is a coding tool. For code-free music creation, consider a service such as Descript, Runway, or another no-code music creation service.
  • Real-time, low-latency audio applications – Generation is computationally intensive and too slow for real-time use.
  • Small teams without ML infrastructure – Requires GPU infrastructure and hosting; consider a commercial API or a SaaS platform that manages deployment instead.
  • Fully-licensed music generation for commercial use – Copyright and licensing responsibility falls on you; consult legal counsel before any commercial deployment to ensure compliance with applicable laws.

Are There Usage Limits or Geographic Restrictions for AudioCraft?

Generation Speed
Audio generation requires GPU processing; real-time generation is not supported. Typical generation time is seconds to minutes depending on model and hardware
Model Size Variants
Available in small (300M), medium (1.3B), and large (3.3B) parameter versions with quality/speed tradeoffs
Audio Length Generation
Can generate audio sequences with long-term dependencies, but specific duration limits depend on model configuration and available memory
Training Data Rights
MusicGen trained on Meta-owned and specifically licensed music; AudioGen on public sound effects; users responsible for copyright compliance in deployments
Infrastructure Requirements
Requires GPU (CUDA-capable NVIDIA recommended) and significant VRAM; no CPU-only inference support
Code Licensing
MIT licensed, permitting commercial use but requiring attribution
Support
Community-driven through GitHub; no official commercial support tier
Geographic Availability
Open-source code available globally; no geographic restrictions on deployment

What APIs and Integrations Does AudioCraft Support?

API Type
Python library with PyTorch components; no REST API or traditional web services API
Installation
Pip installation via PyPI; requires Python 3.8+ and PyTorch
Core Libraries
PyTorch-based with compression (EnCodec), music generation (MusicGen), sound generation (AudioGen), and diffusion (Multi Band Diffusion) modules
Model Access
Direct model weights and inference code; supports huggingface model hub integration for weight distribution
Training Framework
Complete PyTorch training pipelines provided; supports custom training on proprietary datasets
Integration Options
Integrates into Python applications, game engines (via Python bindings), and ML frameworks (TensorFlow, JAX via conversion)
Documentation
Comprehensive GitHub documentation, API docs, model cards, training instructions, and research paper references
Code Examples
GitHub repository includes inference examples for MusicGen, AudioGen, and all model variants; Jupyter notebooks available
Community Resources
GitHub discussions, Hugging Face Hub examples, and third-party integrations like Claude Code skill for AudioCraft
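To make the Python-library workflow above concrete, here is a minimal MusicGen inference sketch following the patterns shown in the public audiocraft README. The helper name `generate_music` and the `out_prefix` parameter are invented here; running it requires `pip install audiocraft` and, realistically, a CUDA GPU, so the heavy import and model download are deferred into the function.

```python
# Sketch of typical MusicGen inference with the audiocraft Python library.
# Model name and generation parameters follow the public README; treat this
# as an illustrative example rather than the one canonical usage.

def generate_music(prompts, duration=8, out_prefix="clip"):
    from audiocraft.models import MusicGen          # requires `pip install audiocraft`
    from audiocraft.data.audio import audio_write

    model = MusicGen.get_pretrained("facebook/musicgen-small")
    model.set_generation_params(duration=duration)   # seconds of audio per clip
    wavs = model.generate(prompts)                   # one waveform per prompt
    for i, wav in enumerate(wavs):
        # Writes e.g. clip_0.wav with loudness normalization
        audio_write(f"{out_prefix}_{i}", wav.cpu(), model.sample_rate,
                    strategy="loudness")

# Example (requires GPU and a model download):
# generate_music(["lo-fi hip hop beat with mellow piano",
#                 "energetic EDM drop with heavy bass"])
```

Swapping `facebook/musicgen-small` for the medium or large checkpoint trades speed for quality, matching the model-size variants described elsewhere in this review.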

What Are Common Questions About AudioCraft?

AudioCraft is Meta's open-source generative AI library for music and audio. It contains three main components: MusicGen, which generates music from text; AudioGen, which generates sound effects from text; and EnCodec, a neural codec that compresses audio into discrete tokens. The library simplifies audio generation by using a single autoregressive language model with token interleaving patterns.

MusicGen, trained on Meta-owned and specifically licensed music, produces music from text descriptions; AudioGen, trained on publicly available sound-effect sources, produces sound effects and environmental audio. Both models share the same underlying architecture but differ in their training data.

Yes, AudioCraft is completely free and open source under an MIT License. There are no costs for the model weights, and the code is publicly available; the only costs are hosting and GPU infrastructure if you operate at scale.

Yes, AudioCraft requires that you have a high level of experience in Python programming and have an understanding of machine learning concepts. As such, it is targeted toward developers and researchers. If you want to make this process accessible to non-technical people, consider one of the many commercially available alternatives to AudioCraft — e.g., Descript.

AudioCraft requires a GPU with CUDA support (e.g., NVIDIA) and while moderate GPUs will work for model inference, more robust hardware is required for model training. Additionally, CPU-only model inference is not supported.

Since MusicGen was trained on licensed music, commercial use may require that you review the licensing terms for the music used in MusicGen and obtain any additional licenses required. AudioGen uses public sound effects. Review the documentation for your specific use case.

Generation time depends on the model you select, your hardware, and the length of the desired audio, and typically ranges from seconds to minutes. Real-time generation is not supported; faster inference requires either more powerful GPUs or a smaller model variant.

Yes, MusicGen supports conditioning on melodic information through chromagrams alongside text descriptions. The MusicGen Style variant additionally generates music from a text prompt plus a style reference, and JASCO supports conditioning on chords, melodies, and drum tracks.
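As a hedged sketch of what melody conditioning can look like in code, the example below follows the audiocraft README's pattern for the musicgen-melody checkpoint. `melody.wav` is a placeholder input file and `generate_with_melody` is a helper name invented here; running it requires audiocraft and torchaudio installed plus a GPU, so the imports are deferred.

```python
# Illustrative melody-conditioned generation with the musicgen-melody
# checkpoint, following the audiocraft README; not a definitive recipe.

def generate_with_melody(description, melody_path="melody.wav"):
    import torchaudio
    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write

    model = MusicGen.get_pretrained("facebook/musicgen-melody")
    model.set_generation_params(duration=10)
    melody, sr = torchaudio.load(melody_path)        # (channels, samples)
    # generate_with_chroma extracts a chromagram from the melody and
    # conditions generation on it alongside the text description.
    wav = model.generate_with_chroma([description], melody[None], sr)
    audio_write("melody_variation", wav[0].cpu(), model.sample_rate,
                strategy="loudness")

# Example (requires GPU, a model download, and a melody.wav input):
# generate_with_melody("upbeat jazz arrangement with brass section")
```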

AudioCraft is open source and releases model weights that can be fine-tuned, while OpenAI Jukebox and Google MusicLM remain proprietary. Additionally, AudioCraft's simplified single-model architecture is more efficient than the cascading models used previously and supports a wider variety of use cases.

Yes, AudioCraft has all of the training code and pipelines needed for fine-tuning your models on your own custom data sets. Fine-tuning will require both GPU hardware and machine learning experience. However, this allows you to create your own customized or specialized models.

Is AudioCraft Worth It?

AudioCraft is a powerful, fully open-source library from Meta for generative audio and music. Its MusicGen, AudioGen, and EnCodec models produce high-quality output from text input. With open code and pre-trained weights, AudioCraft is well suited for research and rapid prototyping; deploying it to production, however, will likely require infrastructure beyond what the library itself provides. XYZEO Analysis: great for innovating and developing your own solutions; not great as an out-of-the-box product.

Recommended For

  • Researchers experimenting with AI audio generation models
  • Developers building custom music or sound-effect generators
  • Indie creators and small teams that need free, high-quality audio tools
  • Prototyping apps with text-to-music or text-to-sound functionality

Use With Caution

  • Production environments that need low-latency, real-time audio generation
  • Users without access to GPU resources or PyTorch experience
  • Commercial music producers concerned about the licensed training datasets
  • Teams looking for no-code interfaces that require no development effort

Not Recommended For

  • Non-technical users that expect plug-n-play tools
  • Budget constrained teams that cannot afford to purchase compute hardware
  • Commercial applications that ship AI-generated audio without fine-tuned, properly licensed models
  • Interactive real-time audio applications
Expert's Conclusion

AudioCraft is suitable for technically savvy users and researchers who want to leverage open-source AI to innovate in generated audio. However, you still need development skills and sufficient compute resources to make use of AudioCraft.

Best For
  • Researchers experimenting with AI audio generation models
  • Developers building custom music or sound-effect generators
  • Indie creators and small teams that need free, high-quality audio tools

What do expert reviews and research say about AudioCraft?

Key Findings

AudioCraft is an open-source library developed by Meta using PyTorch. It includes three major components: MusicGen for generating music from a text prompt; AudioGen for creating a variety of sound effects; and EnCodec for compressing audio into discrete tokens, enabling higher-quality generative audio from text with support for long sequences and melody conditioning. Released in 2023 for research, with full model weights and training code available on GitHub, it makes developing audio models much simpler than the cascading techniques previously used. The models are deployable to platforms such as AWS SageMaker for inference.

Data Quality

Good - comprehensive details from Meta's official announcement, GitHub repository, and technical demos; no pricing as fully open-source; limited info on recent updates post-2023.

Risk Factors

  • Models are trained on specific datasets and could carry bias or licensing issues from that data.
  • Computationally intensive; GPUs are typically required for practical use.
  • The field is changing rapidly, and new competitors will keep emerging.
  • Longer generations may produce audio artifacts.
Last updated: February 2026

What Additional Information Is Available for AudioCraft?

Open Source Release

Fully open-sourced by Meta in August 2023 with model weights, training, and inference code on GitHub, which enables researchers to fine-tune their own models on custom data sets and advance audio AI. Additionally, there are over 10K stars on the repository indicating significant developer interest.

Model Capabilities

MusicGen supports text and melody conditioning for controllable generation up to multi-minute tracks via windowing. AudioGen creates environmental sounds and effects. EnCodec provides high-fidelity tokenization; Multi-Band Diffusion enhances the quality of the generated audio.
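The multi-minute windowing mentioned above amounts to sliding a fixed-size generation window with overlap so the model keeps context across segments. The sketch below only computes such a window schedule; the 30-second cap and 20-second stride are illustrative assumptions, not AudioCraft's exact defaults.

```python
# Illustrative scheduler for windowed long-form generation: each window is
# at most `window` seconds, and consecutive windows overlap so the model
# retains context. Values are assumptions for illustration only.

def window_schedule(total, window=30.0, stride=20.0):
    """Return (start, end) times covering `total` seconds of audio."""
    if total <= window:
        return [(0.0, total)]
    spans = []
    start = 0.0
    while start + window < total:
        spans.append((start, start + window))
        start += stride
    spans.append((total - window, total))  # final window flush to the end
    return spans

print(window_schedule(70.0))
# [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

Each generated window would then be cross-faded with its predecessor over the overlap region to hide seams.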

Technical Architecture

Utilizes a single stage transformer Language Model with token interleaving over EnCodec's discrete representations to eliminate cascaded models. Also supports stereo, conditioning through T5/CLAP encoders, and chromagram for melody guidance.

Deployment Examples

Demonstrated inference on AWS SageMaker for scalable asynchronous generation. Built with PyTorch and GPU acceleration; interactive demos are available at http://audiocraft.metademolab.com

Media Coverage

Featured in Meta's blog and YouTube demos praising output quality, with technical guides on Weights & Biases. Meta positions AudioCraft as state-of-the-art versus MusicLM, with an easier API.

What Are the Best Alternatives to AudioCraft?

  • MusicLM (Google): A text-to-music model from Google with very good quality, but closed source and requiring a developer account for API access. It lacks AudioCraft's melody-conditioning flexibility but integrates well with Google's ecosystem. Best for quick prototyping without your own local compute.
  • Riffusion: A music generator that uses Stable Diffusion on spectrograms; fully open source and very lightweight. Simpler than AudioCraft, but much weaker on complex music prompts. Best for browser-based applications or music experiments with limited computational power.
  • Suno AI: A commercial text-to-music platform with an easy web interface and support for song structures. Easier than AudioCraft for non-developers, but proprietary and subject to usage limits. A good option for creating full songs without any coding.
  • Stable Audio (Stability AI): An open-weights model for generating music and sounds conditioned on text. It produces high-quality audio using a diffusion approach, in contrast to AudioCraft's transformer model. Best for users who prefer the Stability AI ecosystem.
  • Mubert: A commercial AI music generator providing royalty-free music for video and content-creation professionals. Unlike AudioCraft, which requires programming knowledge, Mubert's API is no-code, making it ideal for commercial producers who need music cleared for their projects.
  • AIVA: An AI composer that generates full tracks in classical/jazz styles, with editing tools for manipulating the output. It offers more control over musical structure than raw text-to-audio generation, making it better suited to professional composers enhancing their creative process.

What Is AudioCraft's Model Overview?

Developer
Meta AI
Release Date
August 2023
Architecture
Autoregressive Language Model with EnCodec
Core Models
MusicGen, AudioGen, EnCodec
Open Source
Yes
Status
Generally Available
Repository
GitHub (facebookresearch/audiocraft)

What Generation Modes Does AudioCraft Offer?

Text-to-Music

MusicGen – Generates music from text prompts

Text-to-Sound

AudioGen – Generates environmental sounds and sound effects from text descriptions

Melody Conditioning

Chromagram & Text Conditioning – Generates music variations based on the input melody

Style-to-Music

MusicGen Style Variant – Generates music with a specific style

Chord/Melody Control

JASCO – Generates music with high quality conditioned on chords, melodies, and drum tracks

What Music Capabilities Does AudioCraft Offer?

High-Quality Music Generation

Generates high quality music from user inputs (text)

Environmental Sound Generation

Generates realistic, high-fidelity sounds with realistic recording conditions, such as:
  • Dog barking
  • Cars honking
  • Footsteps on wooden floors

Long-Term Dependencies

Captures long-term dependencies in audio through token interleaving patterns

Controlled Generation

Supports conditioning on textual descriptions and melodic features for better output control

Multiple Model Variants

Includes MusicGen, AudioGen, MAGNeT (non-autoregressive), and JASCO (chord/melody conditioned)

What Is AudioCraft's Audio Generation Specs?

Core Technology
EnCodec neural audio codec
Token Processing
Single autoregressive Language Model operating on compressed discrete tokens
Audio Compression
Maps audio signals to parallel streams of discrete tokens
Output Generation
Generated tokens converted back to audio waveforms via EnCodec decoder
Audio Quality
High-fidelity with fewer artifacts through improved EnCodec decoder
Conditioning Methods
Text encoding (T5, FLAN-T5, CLAP models) and melody-based conditioning
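To make the token pipeline above concrete, here is some back-of-envelope arithmetic for an EnCodec-style codec. The specific figures (32 kHz audio, 50 Hz frame rate, four codebooks of 2048 entries) are those commonly cited for MusicGen's 32 kHz configuration; treat them as illustrative rather than authoritative.

```python
# Back-of-envelope token and bitrate arithmetic for an EnCodec-style codec.
import math

sample_rate = 32_000        # input samples per second
frame_rate = 50             # codec frames per second
n_codebooks = 4             # parallel token streams
codebook_size = 2048        # entries per codebook -> 11 bits per token

tokens_per_second = frame_rate * n_codebooks
bits_per_token = math.log2(codebook_size)
codec_kbps = tokens_per_second * bits_per_token / 1000
raw_kbps = sample_rate * 16 / 1000   # 16-bit PCM mono baseline

print(tokens_per_second)             # 200 tokens/s for the language model
print(codec_kbps)                    # 2.2 kbps
print(round(raw_kbps / codec_kbps))  # roughly 233x smaller than raw PCM
```

This is why the single-stage transformer is feasible: it models 200 discrete tokens per second instead of 32,000 raw samples.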

What Is AudioCraft's Access Licensing?

Open Source
Yes
Model Weights
Released publicly
Code Availability
Full codebase available
Training Code
PyTorch components provided for custom training
Self-Hosting
Supported
API Documentation
Comprehensive documentation available
Community Contributions
Open-source environment for collaboration

What Is AudioCraft's Content Safety Status?

Copyright Protection (MusicGen)
Trained with Meta-owned and specifically licensed music
Sound Effects Data (AudioGen)
Trained on public sound effects
Model Transparency
Open-source models enable auditing and accountability
Watermarking (AudioSeal)
Audio watermarking model included
Commercial Use
Researchers and practitioners can train with their own datasets
