Meta’s MusicGen introduces a cutting-edge capability to generate short musical compositions from text prompts, which can optionally align with an existing melody. Built on a Transformer model, MusicGen operates much like a language model: instead of predicting the next word, it predicts the next chunk of audio tokens. This approach, combined with the compact audio representation produced by Meta’s EnCodec audio tokenizer, enables fast and effective generation.
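As a rough illustration of this text-to-tokens-to-audio pipeline, the sketch below uses Meta’s open-source audiocraft library. The checkpoint name, prompt, and duration are illustrative choices rather than values from this article, and the exact model identifier may differ between audiocraft versions.

```python
# Minimal sketch: text-to-music generation with audiocraft (illustrative values).
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-small')  # text-only checkpoint
model.set_generation_params(duration=10)                    # ~10 seconds of audio

# The text prompt conditions the Transformer, which autoregressively predicts
# EnCodec audio tokens; the tokens are then decoded back into a waveform.
wav = model.generate(['a calm lo-fi hip hop beat with soft piano'])

# Save the generated sample with loudness normalization.
audio_write('sample_0', wav[0].cpu(), model.sample_rate, strategy='loudness')
```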
To train MusicGen, the team used an extensive dataset of 20,000 hours of licensed music, including 10,000 high-quality tracks from an internal database as well as music sourced from Shutterstock and Pond5. A key advantage of MusicGen is that it can be conditioned on both text and music prompts: the text sets the style, while the generated output follows the melody of the supplied audio file.
For instance, by combining a text prompt describing a “light and cheerful EDM track with syncopated drums, airy pads, and strong emotions, tempo: 130 BPM” with the melody of Bach’s renowned “Toccata and Fugue in D Minor (BWV 565),” MusicGen can generate a unique piece of music. However, precise control over how closely the melody is followed in a given style is limited: the text prompt serves as a general guideline for the generation process and may not be reflected exactly in the output.
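The melody-conditioned workflow described above could look roughly like the following sketch, which uses the melody-capable MusicGen checkpoint in audiocraft. The audio file path is a placeholder for your own recording of the reference melody, and the checkpoint name may need adjusting to your audiocraft version.

```python
# Sketch: melody-conditioned generation (file path and checkpoint name are assumptions).
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=12)

# Load a reference melody, e.g. a recording of BWV 565 (placeholder path).
melody, sr = torchaudio.load('toccata_and_fugue.mp3')

# The text prompt sets the style; the reference audio steers the melody.
prompt = ('light and cheerful EDM track with syncopated drums, airy pads, '
          'and strong emotions, tempo: 130 BPM')
wav = model.generate_with_chroma([prompt], melody[None], sr)

audio_write('edm_toccata', wav[0].cpu(), model.sample_rate, strategy='loudness')
```

Because the melody is passed in as a conditioning signal rather than copied directly, the output keeps the broad contour of the reference tune but renders it in the style the text describes.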