Accelerando: once you teach a computer to see, it can teach itself to hear

In SoundNet: Learning Sound Representations from Unlabeled Video, researchers at MIT describe how they used image recognition to bootstrap sound recognition: once software can analyze a video's frames to decide what's going on in a clip, it can use that understanding to label the clip's soundtrack, and so accumulate a model for understanding sound without a human ever having to label the videos for training purposes.
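The mechanism is a teacher-student transfer: a frozen, pretrained vision network produces soft labels for each video's frames, and an audio network is trained on the synchronized raw waveform to match those labels. Here is a minimal sketch of that idea in PyTorch; the tiny networks and random tensors below are placeholders of my own invention (the real SoundNet uses pretrained ImageNet and Places CNNs as teachers and a much deeper 1-D convolutional student), so treat this as an illustration of the training loop, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for a pretrained image classifier (the "teacher").
vision_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))
vision_net.eval()  # frozen: it only produces targets and is never trained

# Tiny 1-D convolutional audio network (the "student"), in the spirit of
# SoundNet's architecture but far shallower.
audio_net = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(16, 1000),
)

optimizer = torch.optim.Adam(audio_net.parameters(), lr=1e-4)

# One training step on a fake "unlabeled video": frames plus synchronized audio.
frames = torch.randn(4, 3, 224, 224)   # 4 video frames
waveform = torch.randn(4, 1, 22050)    # 4 matching one-second audio clips

with torch.no_grad():
    target = F.softmax(vision_net(frames), dim=1)  # teacher's soft labels

optimizer.zero_grad()
log_pred = F.log_softmax(audio_net(waveform), dim=1)
loss = F.kl_div(log_pred, target, reduction="batchmean")  # match the teacher
loss.backward()
optimizer.step()
print(f"KL loss: {loss.item():.4f}")
```

No human labels appear anywhere: the only supervision is the vision network's output on frames that happen to be synchronized with the sound.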


We propose to train deep sound networks (SoundNet) by transferring knowledge from established vision networks and large amounts of unlabeled video. The synchronous nature of videos (sound + vision) allows us to perform such a transfer, which resulted in semantically rich audio representations for natural sounds. Our results show that transfer with unlabeled video is a powerful paradigm for learning sound representations. All of our experiments suggest that one may obtain better performance simply by downloading more videos, creating deeper networks, and leveraging richer vision models.

SoundNet: Learning Sound Representations from Unlabeled Video [Yusuf Aytar, Carl Vondrick, Antonio Torralba/MIT]


(via Beyond the Beyond)