Lyria 3 takes a text prompt or an image and produces a complete track: instrumentation, vocals, lyrics. Not a loop, not a mood board. A song.

The image input is what makes it interesting. Most generative audio models take text. Lyria 3 can look at a picture and decide what it sounds like. That’s a different kind of creative interpretation, closer to how a composer might respond to visual art than to a spec.

The vocal + lyrics generation in one shot is also notable. Getting coherent lyrics that fit the melody, sung in a style that matches the vibe, all without separate pipeline steps: that's a hard coordination problem most models still punt on.

Worth watching how quickly this lands in creative tools. The gap between "research demo" and "shipping as a plugin in Premiere or Logic" has been shrinking fast.

Lyria 3 on Google DeepMind →