Raise Every Voice

Photo: Scott Snider

The phone system doesn't allow us to hear people at a distance the way they quite literally sound to us up close. Alexander Graham Bell's accidental dehumanization has been redeemed in part by a technologically related godchild. And it only took about 140 years.

// //

Bell helped teach the deaf to speak aloud, and had a passionate interest in the reproduction and transmission of spoken words. Yet he ushered in a long era in which POTS (Plain Old Telephone Service) provided a scratchy, low-fidelity, cold rendition of how we sound. Mobile phones didn't do much better. Using early encoding techniques designed for slow mobile processors, cell phones were often far worse than POTS in carrying the nuance of our speech.

While today's public switched telephone network (PSTN) is digital at its core, the last stretch (known as the last mile) between phone exchanges and homes or businesses is analog, just as it was in early phone networks. We speak into a modern phone that almost certainly no longer uses the compression properties of carbon granules to create the electrical signal that goes over the wire directly, but it nonetheless produces a digital facsimile of the same. (Businesses may use digital exchanges, but the output is fed into the same digital meatgrinder as analog voice connections.)

The analog system uses filters to capture a range of sound from about 300 hertz (Hz) to 3300 Hz. The lower number, measured in cycles per second, represents deeper sounds (a slower cycling), and the higher number, high-pitched ones (a faster cycling). Most of the primary sound and amplitude of human speech is at the lower end of that spectrum, whether the voice is male or female. (Wherever analog voice terminates in the PSTN at a digital gateway, it's converted into a standard form that's the equivalent of about 12 to 13 bits per sample at 8,000 samples per second. Modern cell phones capture approximately the same frequencies and digital sampling rates. Sprint may have trumpeted the "pin drop" in ads in the mid-1980s, noting the lack of noise in its fiber-connected network, but it didn't improve the frequency range.)
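
To make that band-pass concrete, here's a rough sketch of my own in Python (assuming NumPy and SciPy, which aren't part of the article) that approximates the 300–3300 Hz voice channel at the PSTN's 8,000-samples-per-second rate. It's an illustration, not any carrier's actual filter design:

```python
# Approximate the telephone voiceband: sample at 8 kHz, then band-pass
# 300-3300 Hz, roughly what the PSTN's filters allow through.
import numpy as np
from scipy.signal import butter, sosfilt

FS = 8000  # PSTN sampling rate: 8,000 samples per second

def telephone_band(signal: np.ndarray) -> np.ndarray:
    """Apply a 300-3300 Hz band-pass, a stand-in for the PSTN voice channel."""
    sos = butter(4, [300, 3300], btype="bandpass", fs=FS, output="sos")
    return sosfilt(sos, signal)

t = np.arange(FS) / FS
tone = np.sin(2 * np.pi * 440 * t)  # 440 Hz sits inside the band and survives
filtered = telephone_band(tone)     # a 100 Hz rumble or 3.8 kHz overtone would be stripped
```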

You have to look to the harmonics of a voice to understand why the cutoffs at the lower and upper ends both make it difficult to understand what people say over a phone and keep them from sounding really present to you. Harmonics are an artifact of vibrations; almost anything that oscillates has harmonics. Take a piece of string, stretch it, and thrum it, and you might even see the fundamental frequency, the main or base oscillation that carries most of the energy. But the overall vibration carries with it multiples of that fundamental one. We hear a single sound composed of all the overlaid harmonics at once, although we can train our ear to pick among them. (Encyclopedia Britannica provides a nice explanation.)
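
Here's a small Python illustration of that idea (a sketch with an assumed 1/n amplitude rolloff, not a model of any real string or voice): sum a fundamental with its integer multiples and you get a single composite tone.

```python
import numpy as np

FS = 8000  # samples per second

def harmonic_tone(fundamental_hz: float, n_harmonics: int = 8,
                  seconds: float = 1.0) -> np.ndarray:
    """Sum a fundamental and its integer-multiple harmonics, each quieter (1/n)."""
    t = np.arange(int(FS * seconds)) / FS
    tone = np.zeros_like(t)
    for n in range(1, n_harmonics + 1):
        tone += np.sin(2 * np.pi * fundamental_hz * n * t) / n
    return tone / np.abs(tone).max()  # normalize to full scale

# A 200 Hz fundamental puts harmonics at 400, 600, ... 1600 Hz, all heard as
# one sound; run it through a phone-style band-pass and the balance of the
# partials changes, which is part of why phone voices sound thin.
plucked = harmonic_tone(200.0)
```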

With speech, the fundamental frequency can be centered below 300 Hz, while overtones can reach over 10 kHz. Harmonics from normal speech are quieter (and physiologically sound quieter to us) the higher they go. Trained singers can control some of their overtones, while harmony singers can produce marvelous new sounds at higher pitches from the intersection of harmonics. Polyphonic singers, like Tuvan throat singers, can modulate fundamental tones and harmonics simultaneously. (You can find a marvelously clear explanation, with illustrations of frequency limits in voice communications, in a 2006 white paper from a firm that was at the time promoting broader frequency support for VoIP in its products.)

The frequencies captured also define the dynamic range: not just which frequencies, but the difference in expressiveness by tone. In photography, dynamic range is the gradation of all the grays captured from lightest to darkest. The greater the dynamic range, the more real (or even hyperreal, with high-dynamic-range imaging) pictures appear. Further, the gap between each step in capturing dynamic range (from one tone to the next adjacent one) defines how smooth the audio sounds. In a photo, it's the difference between images with gray banding and ones that appear to have a continuous tone. Beyond dynamic range lies the difference between louds and softs. Phone calls compress amplitude, missing the softest sounds and turning nearly everything into a muddle in the middle.
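
The step-size idea is easy to see numerically. This Python fragment (illustrative only; real telephone codecs use companding, not the naive rounding shown here) re-quantizes a smooth wave at different bit depths, the audio equivalent of gray banding versus continuous tone:

```python
import numpy as np

def quantize(signal: np.ndarray, bits: int) -> np.ndarray:
    """Round a [-1, 1] signal to 2**bits evenly spaced levels."""
    half = 2 ** bits / 2 - 1
    return np.round(signal * half) / half

t = np.linspace(0, 1, 8000)
smooth = np.sin(2 * np.pi * 440 * t)
coarse = quantize(smooth, 4)   # 16 levels: audibly steppy, like gray banding
fine = quantize(smooth, 16)    # 65,536 levels: effectively continuous tone
```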

Photo: Paul Van Damme

This is why, when you listen to broadcast FM radio, even with any scratchiness that eats into the signal, you feel as if you are physically co-located with the sound. FM radio doesn't have a sample frequency as such, because it's continuous and analog, but it has a frequency response of 30 Hz to 15 kHz, which covers most spoken, sung, and musical tones.

But here's the thing. If the PSTN is all digital in its core, why can't we just stick digital filters on both ends that let us capture a greater range of audible frequencies with greater accuracy and clarity? The PSTN allots 64 Kbps in its circuit-switched (dedicated capacity) approach to each voice call, but modern compression is much better. GSM cell networks use a standard that can stream at anywhere from about 5 to 12 Kbps.
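
That 64 Kbps figure falls straight out of the sampling scheme described earlier. A quick back-of-the-envelope (the 8-bit figure assumes G.711-style companding, which squeezes the 12-to-13-bit linear equivalence mentioned above into 8 bits on the wire):

```python
SAMPLES_PER_SEC = 8_000  # the PSTN's 8 kHz sampling rate
BITS_PER_SAMPLE = 8      # companded (G.711-style) samples

pstn_kbps = SAMPLES_PER_SEC * BITS_PER_SAMPLE / 1000
print(pstn_kbps)  # 64.0 -- the dedicated circuit per voice call
```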

An AAC file at nearly the same quality as an uncompressed audio CD recording can encode roughly 20 Hz to 22 kHz (to get the highs and lows of music), with 16-bit stereo samples at a 44.1 kHz rate to provide nice differentiation across that range, in about 128 Kbps. But that's for music. Spoken voice can be compressed even further, down to 48 to 96 Kbps, while maintaining excellent quality.
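
The compression ratio is worth spelling out. Using the CD figures above, here's roughly what 128 Kbps buys relative to raw, uncompressed audio:

```python
RATE_HZ = 44_100   # CD samples per second
BITS = 16          # per sample
CHANNELS = 2       # stereo

raw_kbps = RATE_HZ * BITS * CHANNELS / 1000   # 1411.2 Kbps uncompressed
aac_kbps = 128
print(raw_kbps / aac_kbps)  # ~11x smaller at near-CD quality
```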

Given that a DSL line using the same two wires that carry analog voice can handle 24 Mbps and even more these days, what gives with voice? Possibly one day, we'll see the end of analog phones and analog lines when nearly everyone has Internet-based VoIP or a mobile phone, and the remaining holdouts (the stubborn, the elderly, and the poor, typically) are forced to attach adapters. (That's how the U.S. managed the digital television switchover.)

But for now, the PSTN is the PSTN and the Internet is the Internet, and the two kinds of switching networks don't meet except at gateways. VoIP-to-VoIP over the Internet provides a workaround. Even the earliest successful VoIP calls I can remember making between two computers sounded better to me than any traditional voice call. The problem was always latency (the time it takes for data to transit from one end to the other) and jitter (variation in the arrival timing of packets, which disrupts their smooth, in-order playback). Latency is down, jitter is reduced, and quality has improved dramatically since the late 1990s, as better compression techniques, more processing power, and the greater availability of bandwidth allow a richer representation of voice.
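
Jitter is usually tamed with a playout buffer: hold a few packets, release them in sequence at a steady cadence, and trade a few milliseconds of latency for smoothness. A toy sketch in Python (my own invented class and names, not any real VoIP stack):

```python
import heapq

class JitterBuffer:
    """Toy playout buffer: reorder packets, release them in sequence."""

    def __init__(self, depth: int = 3):
        self.depth = depth  # packets to hold before playout begins
        self.heap = []      # (sequence number, payload), lowest seq first

    def push(self, seq: int, payload: bytes) -> None:
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        """Return the next packet in order, or None while still priming."""
        if len(self.heap) >= self.depth:
            return heapq.heappop(self.heap)[1]
        return None
```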

Skype wasn't the first system to allow end-to-end VoIP calls by a long shot, although it is surely the most popular at present. It has stepped through a few codecs (the algorithms that convert uncompressed digital representations of media into more compact ones and back again) since its 2003 introduction, and developed its own, SILK, in 2009. SILK captures 70 Hz to 12 kHz at sample rates that vary from 8 to 24 kHz, resulting in throughput of 6 to 40 Kbps. The codec adapts to conditions, producing the best results when the available throughput is consistently high.
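
That adaptation amounts to stepping between quality tiers as measured throughput changes. A toy illustration with tiers I've made up for the purpose (not SILK's actual control logic):

```python
def choose_mode(available_kbps: float) -> tuple[int, int]:
    """Pick (sample_rate_hz, target_kbps) for a hypothetical adaptive codec."""
    if available_kbps >= 40:
        return 24_000, 40   # super-wideband: the full 70 Hz-12 kHz range
    if available_kbps >= 20:
        return 16_000, 20   # wideband: still far beyond the PSTN
    return 8_000, 6         # narrowband floor, roughly phone quality
```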

I've done a fair amount of radio guesting in the last several years, and I remember the lovely feeling, the first time, of putting on a set of headphones in the studio, talking into a nice mic, and hearing myself and the host sound as rich through my ears as when I listen to actual broadcasts and podcasts. When I started using Skype routinely around the same time, I had the same reaction: this has the warmth, fullness, and clarity of radio broadcasts. (In a bit of irony, I am often interviewed by radio shows from home via Skype. The program records both ends of the call on its side, and I use Audio Hijack Pro to record my end using a Blue Yeti mic. I send them my audio file, but they have theirs in case of a problem with my recording.)

Make a Skype call using earbuds or a USB headset, close your eyes, and you find yourself transported next to the party you're calling. The sense of presence comes through. When I set up interviews for articles, I try to get the other party on Skype. A phone call, and too often a cell call, is scratchy and flat. You can't get to know someone in a short time over a call that flat, because you sound dead and distant to each other. Skype and other VoIP programs with good codecs bring you as close as you can come without being there.

My friends Lex Friedman (a Macworld magazine editor) and Marco Tabini (an open-source development advocate) recently released an iOS game called Let's Sing. When Lex told me about the game, I thought it a terrific idea, but couldn't articulate why, even after he let me help test it. The game is a bit like Draw Something, but for singing, humming, or whistling a tune (without using the lyrics) to get a partner to guess the title.

After playing a number of rounds, I realized what Lex and Marco had hit upon, and why I'd soured on Draw Something (besides some game mechanic issues). Drawing can require time, deliberation, and skill, even for silly purposes, and I'm not great at drawing on an iPhone. Watching a drawing unfold in sped-up time can be tedious. There is a human connection there, watching someone's finger or stylus at work. But it never felt like a real bond.

What my friends hit upon is voice. They record at high-enough fidelity that every round for me is a beautiful connection with friends and family. I discovered Lex's wife, Lauren, has a lovely voice, and I already knew my pal Ren can belt a tune. That connection makes the game work: I like to hear the voices of people I know and love.

Photo: Brett Claxton

We've seen a rebirth via the Internet in the full expressive representation of the sounds we emit, and, I believe, we've made greater connections with one another as a result. Skype and other Internet telephony programs provide free computer-to-computer connections, and the free part certainly drove usage for a long time. (Skype now carries a double-digit percentage of all international calls.)

But I'd argue that what drives me and others to Skype isn't just cost. I have effectively free long-distance calling for my purposes with my mobile phone, and services have long existed to let you dial around international long distance for cheap per-minute rates. Rather, I go to Skype to hear the way people sound, and have real conversations.

Bell gave up his work on discrete multi-tone communications, leaving it for John Cioffi to put to use 100 years later (and win a Bell award for it), in order to pursue the invention that crushed the human voice. He didn't intend that, but it happened nonetheless. It's a bit of neat closure to see that Bell's initial interest, applied to data communications, has brought back the clarity of voice.



  1. There are other communications systems that also do bandwidth cutoff, for example, the low-frequency talking drums of Africa and the high-frequency whistling languages of Anatolia. I actually like the way phone calls make someone sound distant, and I like the street noise I often get, which provides a context for the person I’m talking with. It’s what radio directors used to call “perspective”, and they used all sorts of tricks to provide context on analog radio shows.

  2. The Bell System was able to offer high fidelity phone service a long LONG time ago, but demand is tiny.  When I worked for them as an engineer I supported this service, which was mainly used by meteorologists to phone in the weather report from their home studio.  The service, back in analog times, was expensive and rickety.  The limiting factor was the quality of the copper cables.

    I have to say, very very few of the calls I make would be improved by adding bandwidth.  Maybe if I had more phone sex it would matter to me.

    1.  Why is demand tiny?  Could it be that Bellheads do not get marginal cost or how to introduce new services in a wide-scale manner like competitive service and data processing folks do?  The problem lies in the vertically integrated monopoly stack.  There is enormous demand for HD products.

    2. Yes, it was based around ISDN, a technology from the 1980s. It used G.722 to convey audio sampled at 16 kHz. It was widely adopted by all kinds of broadcasters as an effective way to get decent-quality audio between distant fixed sites.

      ISDN didn’t have the distance-from-the-CO limitation that plagues DSL. You could get it just about everywhere there was a POTS line, given a willingness to update/upgrade the nearby CO.

      I tried to order ISDN from Bell Canada in 1992 and was given a complete run-around. They truly didn’t know how to handle the order.

      The telcos really didn’t know what to do with ISDN. They could not work out how to sell it and so rarely had to install it. It was too costly and, at 64–128 kbps, ultimately too slow. By the mid-’90s, modems were delivering speeds around 56 kbps over POTS, so the motivation for ISDN faded.

      ISDN did prove that HDVoice is possible over the PSTN using TDM infrastructure. The common wisdom today is that it requires IP infrastructure, which is not strictly true.

  3. I supported VOIP networks in the early half of the last decade, and, while this is a nice introduction to why the PSTN/VOIP sound the way they do, it ultimately has the tone of someone complaining about why there isn’t world peace.  There are overwhelming economic and logistical reasons why things are the way they are.  The PSTN didn’t go digital at its core until the 70s, 100 years after the phone was invented.  This stuff takes time and commitment.  Skype, which the author holds up as an example of the sort of progress he wants in the PSTN, can switch out their codecs whenever they want to because they don’t have to interoperate with anybody else. That’s not the PSTN, where innumerable third-party devices, networks, countries, etc. all over the world need to play nice or nothing works. And because VOIP networks need to play nice with the PSTN, VOIP gets handicapped, too. That’s just how it is. It’s the same story with every other bit of technology man has ever created: layers and layers of cruft build up over time, eventually obscuring the true reasons behind one or another design decision. And we’ve just got to deal in the meantime.

    1. AT, I partially agree with your assessment, but I think you can evolve your thinking with a different perspective on history.

      There are 523 days to go until the 100-year anniversary of one of the greatest deceits foisted on the American public: the Kingsbury Commitment.  Had that not happened, who knows how and when information networks (of which 3 monopolies resulted in the first half of the century: telephone, audio content, video content) would have digitized.  Probably far sooner than the 80s (WAN) and 90s (data and wireless).  Skype is part of an “internet” evolution in which the telecoms world is going “horizontal”.  IP was a 4-layer stack, and that’s why internet 1.0 blew up in 2000.  The market has been developing the 3 additional layers over the past 15 years, but is not there yet.  Every layer has its own layers as the technology responds to and/or anticipates market demand.  Mostly the former, though, as demand has been held back by 50+ years.

      5 years ago, Steve Jobs reintroduced equal access, and the application ecosystems are a competitive virus working its way down the telecom stack.  Slowly, “IP” folks are understanding that bilateral settlements or “terminating access charges” (a curse word in the IP world 10 years ago) might actually make sense in terms of controls and effective service creation.  Bilateral settlements will also drive “free” and universal access, in that Metcalfe’s Law (the network effect) goes infinite and central subsidization and procurement outweigh edge subscription models.  When that happens, Skype will interconnect with other OTT systems and scale more rapidly.

      Spread the word; we can do a lot in 523 days!

  4. I’d love to see better fidelity come into play, if only for the sake of my father who is very hard of hearing. He struggles to understand phone conversations, despite using various speakerphones, volume controls, etc.

  5. I wonder if a suitable analogy for the differences between VOIP and PSTN might be that between GOPHER and the WEB, where the GOPHER protocol transcends systems with a uniform file hierarchy, and the HTTP protocol of the WEB is very open and forgiving, allowing for graphics and various interactive elements. Indexes and documents can be made to look quite pretty on the WEB, while GOPHER is wickedly fast, efficient, and demands far less in terms of system resources on both the client and the server.

    1.  There are examples of protocol differences (basically centralization vs distribution of intelligence) at all 7 layers of the stack that illustrate differences between Gopher vs Web.  It’s always a risk and cost tradeoff.

  6. I wonder if I would like phones better if the sound fidelity was improved. I hate talking to people over the phone because it’s so much harder to understand what they’re saying: not just the words, but the expression. AFAIK, I don’t suffer from any appreciable hearing loss.

    What would it take to make these changes? Would people go for it voluntarily, if they could experience the difference?

  7. Years ago I tried one of those companies that uses VOIP to provide you with home phone service. It was an unmitigated disaster. Call quality was poor at best and any time we tried to leave a message on someone’s voice mail they’d get little more than noise with the occasional syllable of speech mixed in. We raced back to the traditional phone company after that.
    Today our home phone service comes through the cable company and is therefore VOIP. Quality is as good as can be with analog phones.
    At work we are exclusively on VOIP, but most people I speak with are on an analog phone, cell phone or digital speakerphone.
    I use Skype for calls to colleagues on the other side of the world. I don’t think the quality is that great. It’s pretty similar to a traditional local phone call and nothing at all like face to face conversation.

  8. “Even the earliest successful VoIP calls I can remember making between two computers sounded better to me than any traditional voice call.”

    You must have had one phenomenally crappy traditional phone line, then. I have yet to use anything that sounds better, on a regular basis, than our land line. Cellular still suffers from a lack of duplexing, so you can’t really have a natural conversation with anyone on a cell call. VOIP is ok, but still quite often suffers from compression and artifacting, rendering voices into that tin-can squawk sound.

    1. There’s a large difference between pure VOIP-to-VOIP and VOIP-to-PSTN. VoIP-to-PSTN can be awful. VoIP-to-VoIP should be good to fantastic. I made VoIP calls in the 90s, and my home line is just fine. Cellular is terrible, but if I use Skype over a cell data connection, it’s great. Whatever VoIP you’re using, I wonder if you’re always gatewayed to the PSTN?

  9. Shouldn’t it be feasible to introduce a higher-quality simultaneous stream in and for mobile phones? Either by using the data connection, or preferably somehow re-jigging the protocols running on the cells to accept a similar deal to ADSL?

  10. I think you’re misusing the term dynamic range.  It has nothing to do with the bandwidth of a signal or frequency.  It is either the difference in decibels (named for A. G. Bell) between the noise floor and the loudest signal that can be accommodated, or the difference in decibels between the softest and loudest sounds in a particular audio sample.

  11. As the writer of the Polycom article referenced by Glenn, I’d summarize it this way: conventional phones limit sound frequency (fidelity) to 300–3300 Hz, which is about 1/5 the range of human hearing and the human voice.  It really does make a difference, and wideband audio (also called HD Voice when describing speech) is now widely available in open standards, and implemented in a variety of systems including VoIP, Skype, Apple’s Facetime, Microsoft’s Lync, Android, LTE, virtually all videoconferencing systems since 1990, and others worldwide.  HDVoice has been rolled out on wireless systems in Europe (Orange, Ericsson, others), but is lagging in the US.
    “Raise Every Voice” has it right: full fidelity makes a huge difference and the technology has been cheap and available for years.  The more of us that tell our own phone companies we want it, the faster we’ll get it.  

    1. There needs to be a settlement system in the control layers for service providers to invest in technology on either side of the session to support consistent (or translatable) HD audio, content, and video services/solutions.  We are getting there in herky-jerky fashion, held up only by monopoly access and bandwidth providers and the mentality of IP/data folks who have the false impression that everything is free.

  12. “The frequencies captured also define the dynamic range: not just which frequencies, but the difference in expressiveness by tone.” Frequency response and dynamic range are different things, and one does not “define” the other. Dynamic range, roughly speaking, is the range from quiet to loud, and frequency response is the range of frequencies accurately captured and reproduced. Frequency is related to the pitch of a musical instrument, which is independent of its loudness. You are correct in your analogy to a camera with intensities of gray being dynamic range. Frequency response would most closely be analogous to number of pixels, or overall sharpness of the image.

    Also, it’s not always necessarily a good thing to maximize dynamic range and frequency response in a telephone call. For example, dynamics compression (reduction of dynamic range, NOT bitrate reduction) is useful because it makes the call more intelligible, especially when we are listening in a noisy environment. FM radio and Skype do this as well. (FM radio uses broadcast limiters, among other things, and last I looked into it, Skype used AGC, or automatic gain control circuitry, whereby the microphone gain is automatically altered depending on the speech volume.) Even with perfect transmission technology, radio would be unlistenable in a car, with its noisy engine, for example, without some compression. We often hold our cell phones right up to our ears — do we really want a huge dynamic range right there? No: it would be painful.

  13. I have a diagnosed auditory processing disorder, and the telephone is a huge bugbear for me– because the way the phone transmits things is not how I actually hear things in person at all.

    Interestingly, though, despite what you mention about the harmonics, the big problem for me isn’t really in the vowel sounds (though for certain timbres of voices, that can also be an issue). Rather, it’s in the consonants. A lot of consonant sounds are primarily found in the range above 4 kHz; I can hear those sounds loud and clear in person, but over the phone, they’re almost nonexistent and I have to use way too much effort to guess at them. (In particular, “s” and “f” are barely distinguishable at all for me over the phone, as are their close cousins “v” and “z”.)

    And amplification doesn’t really help. Increasing the treble and decreasing the bass helps somewhat, in that it at least brings the consonant sounds out more– but it still can only go so far due to the inherent frequency cutoff.

    (Edited to add: Rodman’s white paper does discuss the consonant intelligibility issue at length, for what it’s worth.)

  14. There are a lot of examples of HDVoice available online. Here’s one from a French firm that I found recently. 


    I created a variety of narrowband, wideband, and super-wideband examples for a presentation on HDVoice.


    This material was used at Astricon 2009 to highlight the implementation of Polycom’s Siren codecs in Asterisk, the leading open source PBX software. It includes a variety of male and female voices in various languages.
