Creating music in any genre from simple text description

The nightmare continues as AI keeps getting better. After ChatGPT, here’s a new topic of discussion: MusicLM, a system that can create music in any genre from a simple text description. The Google project has no plans for a public release, but samples can be heard on the project’s research page.

The academic paper has some details to offer about MusicLM. An example offered by the team behind the project: “A calming violin melody backed by a distorted guitar riff.” MusicLM can work on those words and generate music at “24 kHz, which remains consistent over several minutes”.

TechCrunch has some details about the project. It says that MusicLM has been trained on a dataset of 280,000 hours of music to learn to generate coherent songs for descriptions of “significant complexity”. Examples include “enchanting jazz song with a memorable saxophone solo and a solo singer” and “Berlin ’90s techno with a low bass and strong kick.”

According to the research paper that has been published: “Each of the stages is trained with multiple passes over the training data. We use 30- and 10-second random crops of the target audio for the semantic stage and the acoustic stage, respectively. The AudioLM fine acoustic modelling stage is trained on three-second crops.”
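To picture what those random crops mean in practice, here is a minimal Python sketch. It is not the researchers' code. The function name `random_crop`, the stage labels and the silent placeholder clip are all made up for illustration, and applying the 24 kHz rate to every stage is an assumption; only the 30-, 10- and three-second lengths come from the paper.

```python
import random

SAMPLE_RATE_HZ = 24_000  # assumption: the 24 kHz output rate is used for every stage

# Crop lengths in seconds, as quoted from the paper (stage labels are illustrative)
CROP_SECONDS = {"semantic": 30, "acoustic": 10, "fine_acoustic": 3}


def random_crop(samples, seconds, sample_rate=SAMPLE_RATE_HZ, rng=random):
    """Return a randomly positioned window of `seconds` taken from `samples`."""
    length = seconds * sample_rate
    if len(samples) < length:
        raise ValueError("clip is shorter than the requested crop")
    # Pick any start position that leaves room for a full-length window.
    start = rng.randrange(len(samples) - length + 1)
    return samples[start:start + length]


# A silent 60-second placeholder clip stands in for real audio.
clip = [0.0] * (60 * SAMPLE_RATE_HZ)
crops = {stage: random_crop(clip, secs) for stage, secs in CROP_SECONDS.items()}

for stage, crop in crops.items():
    print(stage, len(crop))
```

Each pass over the training data would draw a fresh window from the same song, which is one way a model can see many different slices of a track without the dataset growing.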

The AI model can combine genres as well as instruments. Further, it can build tracks around abstract concepts. The technology can even come up with melodies based on humming, whistling or the description of a painting.

For example, take the Salvador Dali painting ‘The Persistence of Memory’. The researchers paired it with a description by Jessica Gromley: “His melting-clock imagery mocks the rigidity of chronometric time. The watches themselves look like soft cheese — indeed, by Dali’s own account they were inspired by hallucinations after eating Camembert cheese. In the centre of the picture, under one of the watches, is a distorted human face in profile. The ants on the plate represent decay.” The accompanying music evokes a similar feeling.

What about the quality of the output? Most of it is quite good, not jarring at all. Had the samples been uploaded to a random website, one would have had a tough time spotting the “musicians” behind them.

But there are issues and the output is not always perfect; some samples have a distorted quality to them. One sample on the research website features “vocals”, but they are mostly gibberish and the voice sounds synthesised.

Will it be made public?

There are no plans for that any time soon. The researchers are not releasing MusicLM to the public over copyright concerns. “MusicLM generates high-quality music based on a text description, and thus it further extends the set of tools that assist humans with creative music tasks. However, there are several risks associated with our model and the use-case it tackles. The generated samples will reflect the biases present in the training data, raising the question about appropriateness for music generation for cultures underrepresented in the training data, while at the same time also raising concerns about cultural appropriation,” reads the research paper.

About one per cent of the examples produced at the time of publication were copied directly from the training songs. It is little wonder, then, that clearances will be required to release AI-generated music, just as they are for musicians who rely on samples.

Are there similar projects?

There are a few options at the moment. Riffusion uses the Stable Diffusion synthetic image generator to create a sonogram, a visual representation of sound as a graph, which is then converted into audio. It is a hobby project by Seth Forsgren and Hayk Martiros.

Harmonai, which has the backing of Stability AI, has released Dance Diffusion, “an algorithm and set of tools that can generate clips of music by training on hundreds of hours of existing songs”. “I started my work on audio diffusion around the same time as I started working with Stability AI,” Zach Evans, who heads development of Dance Diffusion, told TechCrunch.

And there is Jukebox from OpenAI, the company behind ChatGPT. Jukebox is a neural net that generates music, including rudimentary singing, “as raw audio in a variety of genres and artist styles”. Provided with genre, artiste and lyrics as input, Jukebox outputs a new music sample produced from scratch.
