Whistling Speech

I really love the research they're doing over at Yale's Haskins Laboratories: instead of studying speech perception and production in terms of faithfully replicating alllll of the sounds we make with our mouths (like the minute clicks, pops, and hisses of consonants), the team proposes that tracking and re-creating a few select resonances of the vocal tract is all we need to understand speech. I like to think of speech production in this context as a series of bottles with varying levels of water in them: the mouth is one bottle whose resonant pitch changes as you open or close it, the nasal cavity is another, and so on throughout the vocal tract. The result sounds like a bunch of complicated melodies combined into a complex microtonal harmony. In other words, we're all better at perceiving and making music than we think we are!

The examples below break it down into isolated sine-wave patterns that you can combine yourself to build a sentence. What do you think? How easily can you hear words emerge?

Tone combinations

Play Tone 1 alone | Play Tone 2 alone | Play Tone 3 alone | Play Tones 1 and 2 together

Play Tones 1 and 3 together | Play Tones 2 and 3 together
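
For readers who want to play along at home, the tone-mixing above can be sketched in a few lines of Python. This is a toy illustration (assuming numpy is available); the frequency contours below are made up for demonstration, not the actual formant tracks from the Haskins examples:

```python
import numpy as np

SR = 8000  # sample rate in Hz -- an arbitrary choice for this sketch

def glide_tone(freqs, sr=SR):
    """Synthesize a sine tone whose frequency follows a contour.

    Phase is accumulated sample by sample so the pitch can glide
    smoothly over time, the way a vocal-tract resonance does.
    """
    phase = 2 * np.pi * np.cumsum(freqs) / sr
    return np.sin(phase)

n = SR  # one second of audio
t = np.linspace(0, 1, n, endpoint=False)

# Made-up formant-like contours (Hz) -- illustrative, not measured speech.
tone1 = glide_tone(400 + 150 * np.sin(2 * np.pi * 2 * t))   # low, F1-like
tone2 = glide_tone(1400 + 500 * np.sin(2 * np.pi * 3 * t))  # mid, F2-like
tone3 = glide_tone(2600 + 300 * np.sin(2 * np.pi * 1 * t))  # high, F3-like

# "Play Tones 1 and 2 together" is just the sum of the waveforms,
# rescaled into [-1, 1] before playback.
combo12 = (tone1 + tone2) / 2
all_three = (tone1 + tone2 + tone3) / 3
```

Write `all_three` out to a WAV file and you have a homemade (if meaningless) bit of sine-wave speech.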

If you like this, you can go here for more interactive demonstrations, or check out this great sine-wave-synthesized Robert Frost poem.

Thanks to Robert E. Remez, as well as Phillip Rubin and Jennifer Pardo at Haskins Labs for allowing me to embed their work here.

Coming up, I'll be writing about a cool ethnographic example of a language that actually uses something like this in practice!


  1. This is pretty much completely fantastic. Posts like this are why I love Boing Boing. Thank you Meara.

  2. Kinda reminds me of when my siblings and I were kids and we would have entire conversations with our mouths closed.

    1. @#4: Ok, off topic I know, but R2D2 is clearly capable of speech. In the movies he can faithfully recreate Leia’s and Luke’s voices when playing back holograms. C3PO can obviously understand his beeps and whistles as intelligible. And in the books at least, when R2 plugs into a computer he can type out sentences. So he’s quite capable of speaking, he just uses a different language than *everyone else*

      1. And Luke understands him: there are a lot of “What’s that, little buddy?” moments between him and R2 in the films (and with Anakin “earlier” in the saga). So Luke either understands the beeps and chirrups as intelligible speech and has learned this speech, or he gets the emotive content of R2’s noises (as we, the audience do: think of R2’s lovely long low moan as the blast doors close on the polar night of Hoth, “Ooooooo.”) Come to think of it, Lucas has a lot of the nonhuman-but-humanoid characters communicate in this way, like Chewbacca: Han talks with Chewie, but we get enough of what Chewie thinks from his Siberian-Husky barks and groans. I always thought it a nice irony that the characters who don’t speak in Star Wars are inevitably the most human, best-scripted, even.

        (Better, perhaps Luke only thinks he understands R2, and the loneliness of Dagobah and outer space have him talking to himself and a droid who’s not listening most of the time. That would be funny.)

        1. Hrm. Wonder what JarJar would be like if he were unvoiced. Probably massively less annoying: maybe even entertaining. Shame he forgot that trick when it worked so well in other episodes. Though the Ewoks were perhaps what killed that idea.

  3. This ties in nicely with the talking piano of a few days ago. (That was here, wasn’t it? If not, search the tubes for “speaking piano”.)

  4. I think they are fooling themselves.

    I tried listening to their recreation of Frost without first reading the text and understood none of it.

    Ultimately they are going to have to add back in a lot more of what they stripped out.

    1. I’m with you: I can’t figure out what’s said in any of the three. Best guess… “Tone combinations something”, or maybe “Tone combination” more slowly, or “Where in the hell are we going”. But that’s not based on the sound, it’s based on the fact that it can be said with similar intonation as one of the waves, and the text appears nearby.

      What IS it meant to say?

        1. Interesting: I didn’t get the icon for “play natural”. Just “play sinewave”.

          Interestingly, I still can’t hear the phrase in the 3-wave “sinewave” audio after being told what it was saying. After listening to the original audio, I can see that it has the tone, but not even the vowels.

          I’m not seeing much use for this idea as a form of data compression or anything: it might be useful for mood analysis maybe? Like others have said, it’s the “sound through a door” – just the tones of voice, inflection and mood.

          There’s something very special about the chosen sentence: “where were you a year ago” contains only one letter (the g) that isn’t a vowel sound. Yes, you can make yourself understood with certain choice sentences just by saying the vowels, without the “minute clicks, pops, and hisses of consonants” – but it helps when it’s a sentence that doesn’t *have* any consonant sounds.

          Other sentences can be still recognised if you remove the consonants and just leave the cadence intact. So, early in the morning, I am heard to utter: “‘e ou’ o’ uh way!” The meaning is clear despite there being no clicks, pops, or hisses, and in fact most of the sentence being missed out: “get out of the way, you foolish damn cat, I am trying to brush my teeth, and no, you won’t get fed any faster by sticking your damn tail in my face.”

          Spoken languages have a great deal of redundancy. We know this already. You can understand what people are saying with only three bits of audio quality, and low sample rates: we can understand each other despite accents, and without knowing much of one another’s language.

          I seem to remember there was a more startling demonstration of that redundancy on BB a while ago, though. I can’t find it now.

  5. Reminds me strongly of the BBC children’s show, The Clangers, from the late 1960s and early 70s. The Clangers were rodent-like creatures who lived on a small planet, and who communicated entirely by whistling. In the context of the stories, you could often understand exactly what they were saying, although the artists revealed later that some of the whistling was expressing language that children are not supposed to hear.

    If you’re unfamiliar with The Clangers, you can see episodes on Youtube – e.g. http://www.youtube.com/watch?v=HArUmqqiL0s. It’s very charming.

  6. This is all kinds of awesome. I had the first three words after listening to the three individual clips, and then 2 & 3 together were perfectly intelligible.

    I wish they hadn’t included the word “where” in the filenames, though, because now I’ll never know if that somehow influenced my perception.

  7. Crazy! The “Where were you a year ago” popped into my head when I played a couple of the tones together. I was surprised to find out that this was what they were actually trying to say. It was almost the cadence of the wave that made me hear it, rather than actually hearing the words.

    It reminds me of when I was first learning about DSPs in college. We wrote a program that played back digitized sound waves at 16-bit, 8-bit, 4-bit, and 2-bit resolution. Most of the time, you can still recognize speech with a 4-bit digitized waveform. I can’t remember what the sample rate was, but I don’t think we were changing that around as much.
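
The bit-depth reduction this commenter describes is easy to try yourself. Here’s a minimal sketch (assuming numpy; the sine wave stands in for a real speech recording), not a reconstruction of the commenter’s original class program:

```python
import numpy as np

def requantize(x, bits):
    """Quantize a float waveform in [-1, 1] down to the given bit depth."""
    levels = 2 ** bits
    # Snap each sample to the nearest of `levels` steps, then map back
    # to floats -- the detail discarded here is what makes 2-bit speech
    # sound so rough.
    q = np.round((x + 1) / 2 * (levels - 1))
    return q / (levels - 1) * 2 - 1

sr = 8000
t = np.arange(sr) / sr
x = 0.8 * np.sin(2 * np.pi * 440 * t)  # stand-in for a speech signal

x4 = requantize(x, 4)  # 16 levels: speech usually still intelligible
x2 = requantize(x, 2)  # 4 levels: heavily degraded
```

Listening to `x4` versus `x2` makes the commenter’s point audibly: intelligibility survives surprisingly coarse quantization.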

  8. I got “where were you the other day,” which is close, but not quite. Still, a very interesting concept.

  9. I couldn’t understand any of them at all, but as soon as I knew what it was supposed to be, I couldn’t not hear it. Same with the poem. I listened to the first half and couldn’t understand a lick, but started reading along for the second half and it was perfectly intelligible.

  10. This kind of additive analysis and resynthesis has been around for decades, so I’m sure that this is just the tip of the iceberg of the work they’re doing at the lab. It’s still really fun, though.

    Here’s my rendering of the same file with nine sine waves instead of three:


    I used the fantastic (and free) SPEAR software tool.

  11. This is great! And curiously enough, Robert Frost himself wrote, in letters to his friend John Bartlett, about poetry and “the sound of sense,” which he called “the abstract vitality of our speech,” and likened to “voices behind a door that cuts off the words.” Kind of like the combined soundwave above, or the teacher in Charlie Brown: you can hear the posture of the voice, if not the words. You get the gist. This reminds me, also, of what I’ve read about the somewhat musical language of the Piraha…

  12. Fascinating! Looks like it does poorly at consonants, which is no big surprise, although that probably means that for some languages it’s going to do much more poorly.

    I mean, consonants are close to white noise, so it’s almost impossible to reproduce the effect without a wide range of frequencies.

  13. I am a scientist working on degraded speech closely related to your example, called vocoded speech. It’s an extremely useful tool that helps us understand (and improve) cochlear implants and hearing aids.
    It became “famous” in my field (psychoacoustics) when a fellow named Bob Shannon demonstrated that this kind of degraded speech is pretty good at simulating cochlear implant speech. If you type “vocode” into your search engines you’ll find a lot of studies looking at how speech information is or is not preserved. You’ll find what perceptual features are or are not lost with vocoding. And you’ll even find data that vocoded speech activates similar brain areas as regular speech, but only after people have been trained on it somewhat.
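
The noise vocoding this commenter describes can be sketched in a few lines. This is a rough illustration assuming numpy and scipy are available; the band edges, band count, and filter order below are arbitrary choices for demonstration, not the parameters from Shannon’s studies:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, sr, n_bands=4, lo=100.0, hi=3000.0):
    """Crude noise vocoder: split the signal into frequency bands, keep
    only each band's amplitude envelope, and use that envelope to
    modulate band-limited noise."""
    edges = np.geomspace(lo, hi, n_bands + 1)
    rng = np.random.default_rng(0)
    out = np.zeros_like(x)
    for f1, f2 in zip(edges[:-1], edges[1:]):
        sos = butter(4, [f1, f2], btype="bandpass", fs=sr, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))  # amplitude envelope of this band
        noise = sosfiltfilt(sos, rng.standard_normal(len(x)))
        out += env * noise
    return out / np.max(np.abs(out))

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
y = noise_vocode(x, sr)
```

Run on actual recorded speech, `y` gives a rough feel for what a cochlear-implant simulation sounds like: the rhythm and loudness contour survive, the fine spectral detail does not.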

  14. What’s most interesting is how completely different they sound once you’ve heard the natural version.

    Some of the examples they posted were quite easy to hear immediately. Some were much harder. But for all of them, once you heard the natural version, the sine-wave version changed completely. Suddenly all the consonants, in particular, stood out crisply.

    I think that’s a very interesting example of how our brain perceives completely differently based on expectations, and how this filtered perception is fed into our consciousness. It’s quite clear that our conscious experience of things is a very filtered and altered version of reality, but it’s amazing to get such clear examples.

    I wonder what this would sound like if the sine-waves were played on top of a silent video of the person speaking the words. I bet that they’d be quite understandable, because our visual perception of the mouth would allow our brain to fill in all the consonants that were otherwise imperceptible.

    1. “I wonder what this would sound like if the sine-waves were played on top of a silent video of the person speaking the words.”

      Please, someone do this! Double awesomepoints for two versions: one of a human saying/mouthing it, and another of an animated face (3d, 2d, whichever).

  15. Note that the sentence “where were you a year ago” has exactly none of “the minute clicks, pops, and hisses of consonants” that may or may not be necessary for robust speech perception. And speech perception is amazingly robust.

    Note, too, that even in the (as of the time of writing) 16 comments here so far, there are already subsets of people who have a much harder time with this than others. Compare this to the perception of natural speech, even in relatively noisy environments, wherein virtually everybody (i.e., native speakers with normal hearing) has no trouble at all.

  16. Got something similar just from tone 2. Interesting. But as others have said, we ‘understand’ communications by interpreting various signals of all kinds, so perhaps it’s not surprising we can understand a message from a single element of its makeup. In a similar way, we ‘read’ only the top half of letters, and can understand words spelt backwards etc.

    Anyway, glad to hear you’re going to be writing about silbo gomero (I presume that’s what you mean). I thought from the title that that was what this post was going to be about.

  17. Dewi, re: “You can understand what people are saying with only three bits of audio quality, and low sample rates: we can understand eachother despite accents, and through not knowing much of their language.” Results will often invert when you add another axis for psycho(acoustic) stress; e.g. the anxiety of needing the taxi to get you to the job interview on time, or the deadline-driven panic as the help desk recites the checklist to get your ‘puter to stop rebooting every 5 minutes… ;-)

    One thing that stood out on the Frost page was this line: “Within each chunk of time, the average value of a tone was calculated and reset to match the nearest note that we had licensed.” Wow — guess my non-musical voice has an upside after all: no licensing fees!

  18. I could understand nothing of the sine-wave versions until I heard or read what they were supposed to be. Then I could understand them perfectly. Listening to them without any knowledge of what the words were meant to be, they just sounded like experimental, non-melodic theremin music, not like any kind of speech.

    I wonder how much of this problem was because of intonation. The intonation of the sine-wave version is highly unnatural for spoken English — the rising and falling intonations seem to be placed randomly.

    I wonder how this would work for a tonal language. Would it be easier or harder to reproduce the intonation?

  19. It’s a lot about context. I knew it was a Frost poem, so I could hear the cadence all the way through and was able to pick out enough to recognize the last line.

    Once I knew that, I listened to it again and could pick out a few lines; then I read the poem, and could pick out 98%. I would wager (a lot of) money that – as long as the extraction algorithms are consistent – the more time you spend extracting speech from the sine waves, the less difficulty you will have extracting it from new streams.

    It’s really not that much different than listening through a thick accent. Intonation, cadence, pronunciation and frequency are all off enough to make understanding difficult initially, but you get accustomed to mapping the differences to your expectations, and while the accent doesn’t change, your difficulty with it fades. I know this from experience with Scotsmen.

  20. You know, I just went and listened to more of the sentences without knowing what they were supposed to be, and I started catching more of them. I missed some words, and occasionally still had no idea, but the words did start to emerge.

    The intonation is still freaking me out, though.

Comments are closed.