Google introduces ‘MusicLM’, AI tool for Generating Music from Text


Google introduces MusicLM, which is a new AI tool that generates high-quality music from text descriptions. It can understand phrases such as “a calming violin melody backed by a distorted guitar riff” and convert them into corresponding musical compositions.

This is a major advancement in AI-generated music and could greatly impact the way music is created and consumed. Google recently announced that Google Meet hardware users could join Zoom Meetings.

MusicLM: Generating Music From Text

The tool is user-friendly and easy to use, making it accessible to a wide range of users. It uses a hierarchical sequence-to-sequence modeling approach to generate music at 24 kHz, which stays consistent over several minutes.

The experiments show that MusicLM outperforms previous systems in terms of both audio quality and adherence to the text description. It can also take both text and an existing melody as input, allowing it to transform whistled and hummed melodies according to the style described in a text caption. To support future research, the developers have publicly released MusicCaps, a dataset of 5.5k music-text pairs, with rich text descriptions provided by human experts.

It features several key capabilities, including:
  • Audio Generation From Rich Captions: MusicLM can generate complex and nuanced compositions from simple text descriptions.
  • Long Generation: MusicLM can generate music that remains consistent over several minutes.
  • Story Mode: It allows the users to generate music that tells a story, it can be used for music scoring for movies, series, and other media.
  • Text and Melody Conditioning: MusicLM can take both text and an existing melody as input, allowing it to transform whistled and hummed melodies according to the style described in a text caption.
  • Painting Caption Conditioning: It generates music based on the emotion and style depicted in a painting caption.

Additionally, MusicLM can detect various Musician Experience Levels, Places, Epochs, Accordion Solos and Generation Diversity while keeping the conditioning and/or the semantic tokens constant, Same Text Prompt and Same Semantic Tokens, and much more.

Source