AI-generated spectrogram looks like a corgi and sounds like one too

Images that Sound are AI-generated spectrograms (visual representations of audio data) that look like a prompted image while still sounding like prompted audio. Generation is prompt-driven, though there doesn't appear to be a public demo; the examples are striking enough on their own.

From the paper's abstract: "Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these spectrograms images that sound. Our approach is simple and zero-shot, and it leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking the visual appearance of a desired image prompt."
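The core step described above, denoising one shared latent with two diffusion models in parallel, can be sketched as a toy DDIM-style loop. Everything here is a stand-in (the function names, the simple averaging of noise predictions, the schedule), not the paper's actual code:

```python
import numpy as np

def denoise_parallel(z, image_eps, audio_eps, timesteps, alphas_cum):
    """Toy deterministic reverse process that averages the noise
    predictions of an image model and an audio model at every step,
    nudging the shared latent z toward a sample plausible under both.

    timesteps is a descending list (noisy -> clean); alphas_cum[t] is
    the cumulative signal fraction at step t, as in DDPM/DDIM."""
    for i in range(len(timesteps) - 1):
        t, t_prev = timesteps[i], timesteps[i + 1]
        a_t, a_prev = alphas_cum[t], alphas_cum[t_prev]
        # Each model predicts the noise in the same latent; average them.
        eps = 0.5 * (image_eps(z, t) + audio_eps(z, t))
        # Estimate the clean latent, then step to the less-noisy timestep.
        z0 = (z - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
        z = np.sqrt(a_prev) * z0 + np.sqrt(1 - a_prev) * eps
    return z
```

With real models, `z` would be a latent decodable either by the image VAE into a picture or by the spectrogram pipeline into audio; here the two `*_eps` callables are just hypothetical score networks with the same interface.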

The underlying trick is not new: musicians have long drawn pictures into their spectrograms, encoding enough image data for humans to perceive at the cost of the audio gaining random noise (which the brain tends to filter out) or weird warbling in the spectrum. What's new is that AI can make both domains represent something natural at once. I like the idea of hiding data that's syntactically identical to the data it's hidden inside, a sort of steganographic useless machine. You can find ASCII art that turns out to be cleverly formatted code; perhaps the universe itself is such a thing.
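The classic version of that trick can be sketched with additive synthesis: treat a grayscale image as a spectrogram (rows as frequencies, columns as time) and sum one sinusoid per row. This is a toy illustration of the old warbling-image technique, not the paper's method; all parameter choices below are arbitrary:

```python
import numpy as np

def image_to_audio(img, sr=22050, duration=2.0, f_lo=200.0, f_hi=8000.0):
    """Render a grayscale image (values in [0, 1]) as audio whose
    spectrogram roughly reproduces the image: each pixel row drives
    a sinusoid, with pixel brightness as its amplitude over time."""
    rows, cols = img.shape
    n = int(sr * duration)
    t = np.arange(n) / sr
    # Log-spaced frequencies, read top-to-bottom like a spectrogram.
    freqs = np.geomspace(f_hi, f_lo, rows)
    # Map each output sample to the image column active at that moment.
    col_idx = np.minimum((np.arange(n) * cols) // n, cols - 1)
    audio = np.zeros(n)
    for r in range(rows):
        envelope = img[r, col_idx]  # brightness -> amplitude envelope
        audio += envelope * np.sin(2 * np.pi * freqs[r] * t)
    # Normalize to [-1, 1] for playback.
    return audio / max(np.max(np.abs(audio)), 1e-9)
```

Played back, this sounds like the "weird warbling" mentioned above; the diffusion approach instead searches for a signal where the image and the sound are both natural.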