When bombs exploded at the Boston Marathon on Monday, my Facebook feed was immediately filled with urgent messages. I watched as my friends and family implored their friends and family in Boston to check in, and lamented the fact that nobody could seem to get a solid cell phone connection. Calls were made, but they got dropped. More often, they were never connected to begin with. There was even a rumor circulating that all cell phone service to the city had been switched off at the request of law enforcement.

That rumor turns out to not be true. But it is a fact that, whenever disaster strikes, it becomes difficult to reach the people you care about. Right at the moment when you really need to hear a familiar voice, you often can't. So what gives?

To find out why it's frequently so difficult to successfully place a call during emergencies, I spoke with Brough Turner, an entrepreneur, engineer, and writer who has been working with phone systems (both wired and wireless) for 25 years. Turner helped me understand how the behind-the-scenes infrastructure of cell phones works, and why that infrastructure gets bogged down when lots of people are suddenly trying to make calls all at once from a single place. He says there are some things that can be done to fix this issue, but, ultimately, it's more complicated than just asking what the technology can and cannot do. In some ways, service failures like this are a price we pay for having a choice and not being subject to a total monopoly.

Maggie Koerth-Baker: The problem of not being able to reach loved ones on the phone during an emergency isn't exactly new, right? Land lines had to deal with this, as well. Just to refresh our memories, what happened when land lines got congested with call traffic?

Brough Turner: Well, say you'd have an earthquake in California. This was for the old Bell system. The national long distance routing had a set of standard, predefined routes and it had network control centers in New Jersey and other places. Things would get overloaded and they would manually intervene by putting access restrictions on new calls coming into the area that was congested. In the 60s, 70s, and 80s they would let through one out of every five call attempts. They were doing that manually and just arbitrarily to reduce congestion. Over time things got more automated. During the long-distance competition of the 1990s, AT&T introduced computerized routing and started using automated rate limiting. It all really got quite sophisticated before the whole industry went away.
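The one-in-five restriction Turner describes can be pictured as a simple gate in front of the congested area. The Python sketch below is only an illustration of that idea. The name `make_gate` and the fixed counting scheme are assumptions made for this example, not a description of how the Bell system's controls were actually built.

```python
from itertools import count

def make_gate(n):
    """Return a function that admits one call attempt out of every n.

    Hypothetical helper: a plain counter stands in for whatever
    mechanism the network operators actually used.
    """
    attempts = count(1)

    def admit():
        # Admit attempt 1, n+1, 2n+1, ... and reject everything in between.
        return (next(attempts) - 1) % n == 0

    return admit

# Twenty call attempts arrive at a gate set to one in five.
gate = make_gate(5)
results = [gate() for _ in range(20)]
admitted = sum(results)
```

With the gate set to five, four of the twenty attempts are admitted and the other sixteen callers have to try again.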

MKB: What about with cell phones? We aren't talking about wires anymore, so what's really going on behind the scenes when we say that the phone network is congested?

BT: First off, different cell phone providers use different technologies, different systems. I'm talking about the GSM system used by AT&T and T-Mobile. I know less about the Qualcomm version that's used by Verizon and Sprint. They evolved in different ways and the details are different, but the basic principles are the same for all. With 4G, by the way, that's changing. Everybody is converging on the technology that comes from that GSM tradition.

In general, though, there are a bunch of different places where congestion can happen. Networks consist of different technologies, and different levels. You start with the mobile switching center that may cover a large area. There are only one or two mobile switches for Eastern Massachusetts. We're talking about a room full of racks, full of computers and other switching elements. The densest switch is in China, and they have something that will serve more than several million customers at a time.

So you have the mobile switching center. Then you have groups referred to as radio node controllers. There are dozens to hundreds of these controlled by one switch. They're located closer to the radios and they deal with handoffs between different radios.

Then, of course, you have the individual radios and that's where you see antennas on top of and on sides of buildings. Those are everywhere. Each of those is a cell, and in each cell you have users who are connected to the network.

MKB: So this is really about how you, as a cell phone user, move around a physical area? You get handed off from one radio to another, from one node controller to another, and as you travel a lot farther, from one switch to another?

BT: Yup. The other thing about the radios is that they have different sizes of cells. You've got regular cells and then smaller sub-cells. You also have larger overlay macro-cells that are really big. They try to handle you within the small cell you're closest to. But it's a trade-off between capacity (they'd like to have lots of small cells for that) and coverage (they don't want to put 100k small cells everywhere). So you might have a cell that covers a mile area and then smaller cells within that that handle most of the traffic.

Interesting thing is that most people are actually stationary, sitting on their butts. For most people, calls originate from one or two locations and they stay there the whole time. But we have to have this incredibly complicated system to deal with the 5-8% of people who move around. Maybe less than that.

MKB: So what happens when you suddenly get a lot of calls happening within one cell?

BT: They can offload some of that to a macro-cell. When it's a planned event (the Boston Marathon, for instance, before the bombings) they can bring in additional mobile cells. They park little trucks around the edge of the event. All those radios, though, have to connect back to the radio network controller. If it's an installed radio it's probably a wired connection, copper or fiber. But when you can't get that, then they use point-to-point wireless. Either way, they call that the backhaul.

In different parts of the system different things will get congested. In some cases, the specific cell site might be overloaded and macros are also overloaded. In other cases, it's the backhaul that gets overloaded. And it doesn't even take an emergency to cause that. There's this great story where [telecommunications expert] David Reed was driving from New York to Boston in the middle of the night. His wife was driving and he was sitting there with one of the first iPads that had 3G service, and as they drove through Connecticut he was running speed tests along the way. Just to see the different responses in different cells. And at one point, he was limited to, like, 3 Mbps. It was 3:00 am, so it wasn't about lots of people using the system. It was just that he was driving through a cell where the only backhaul was two T1 lines. So 3 Mbps was the maximum anybody in that cell could ever get. And this was like a 20-mile stretch of highway.
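The "3 Mbps" ceiling in that story follows from simple arithmetic. A T1 line runs at 1.544 Mbps, so two of them top out just above 3 Mbps no matter how good the radio link is. The short Python sketch below works through the numbers. The function name is made up for this illustration, and the figure is the raw line rate, so the usable throughput would be slightly lower.

```python
# Standard T1 line rate in megabits per second (includes framing overhead).
T1_LINE_RATE_MBPS = 1.544

def backhaul_ceiling_mbps(t1_lines):
    """Upper bound on total throughput for a cell fed only by T1 lines.

    Hypothetical helper for illustration: it ignores protocol overhead
    and assumes the T1 lines are the only backhaul for the cell.
    """
    return t1_lines * T1_LINE_RATE_MBPS

ceiling = backhaul_ceiling_mbps(2)
print(f"{ceiling:.3f} Mbps")  # prints "3.088 Mbps"
```

That ceiling is shared by everyone in the cell, so a single user at 3:00 am could see all of it while a crowd would each get a fraction.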

MKB: So there was only so much information that could go in and out at a time. Wow. I know that channels, the actual wireless signals from and to your phone are also important. Can you talk about those?

BT: There are a bunch of separate channels in the wireless system. But the big division is between a control channel and all of these traffic carrying channels. Control channels are used for a lot of different things. For instance, they're used for call set-up and call tear-down. Your handset looks on a particular control channel for permission to make a request. It uses the control channel to request to make a call, like, "I need enough capacity to set up a call," so then the system can find the traffic channel with enough free space. But they're also used for SMS messages. Which is interesting.

MKB: Yeah. I've heard that, when you're in a situation where lots of people are placing phone calls, it's often easier to get a text message through. Is this why? And, if so, is it a good way to use the system? What I mean is, is the system as a whole better off if you text your friend in Boston to check in, rather than trying to call him?

BT: Yes. It's much better. The SMS messages have a relatively light footprint, first of all. The second thing is that they're asynchronous. If they can't get through this instant, they keep trying. If it gets over the radio to the cell site, it will get through. Even if it's delayed for 30 seconds or something. With voice you're either connected or you're not, and when you are that means that the traffic channel is tied up until you're done talking. More likely, it means you never get connected because traffic channels are already saturated.
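The difference Turner describes, between a call that needs a traffic channel right now and a text that can wait its turn, can be shown with a toy model. The Python sketch below is a deliberate simplification. The `Cell` class, its channel count, and the one-message-per-tick delivery rate are all assumptions for illustration, not how a real base station schedules traffic.

```python
from collections import deque

class Cell:
    """Toy model of one cell: a few traffic channels plus a text queue."""

    def __init__(self, traffic_channels, sms_per_tick):
        self.free_channels = traffic_channels
        self.sms_per_tick = sms_per_tick
        self.sms_queue = deque()
        self.delivered = []

    def place_call(self):
        # A voice call needs a whole traffic channel immediately, or it fails.
        if self.free_channels == 0:
            return False
        self.free_channels -= 1
        return True

    def send_sms(self, text):
        # A text is queued and retried later, so it is delayed, not dropped.
        self.sms_queue.append(text)

    def tick(self):
        # Each tick, deliver as many queued texts as the cell can handle.
        for _ in range(min(self.sms_per_tick, len(self.sms_queue))):
            self.delivered.append(self.sms_queue.popleft())

cell = Cell(traffic_channels=2, sms_per_tick=1)

# Four people try to call at once; only two channels exist.
calls = [cell.place_call() for _ in range(4)]

# Three people text instead; the messages wait in line.
for message in ["ok?", "safe", "call later"]:
    cell.send_sms(message)

ticks = 0
while cell.sms_queue:
    cell.tick()
    ticks += 1
```

In this run two of the four calls are blocked outright, while all three texts arrive within three ticks. The texts are slower than a connected call but none of them is lost.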

MKB: In an emergency, can the cell phone companies limit access to the network the same way the Bell system used to do with land lines?

BT: Yes. Now this is a piece where I know what equipment these large carriers have, but I don't know how they've chosen to implement capabilities that are there. So one way they can do this is they can bar new traffic being originated by people based on "class". There are typically 10 classes for regular subscribers and another six classes that handle things like 911 calls and emergency services. They can control which classes have access at the level of cells, or by groups of cells, or all of Eastern Massachusetts if they wish.

I'm not clear on how automated all of this is. They definitely have the ability to have it totally automated. There's technology you can buy from Ericsson that features call-load-triggered access class barring, so it automatically invokes certain policies about who can place calls in an area if the traffic there exceeds a pre-determined threshold. But that's an extra feature and you have to pay extra for it … I guarantee it's in the range of tens of thousands of dollars per mobile switch. So who knows what decision the carriers made about that. It might have been automated and it might not be.

What I am sure of is that they set up priorities for people with fire and safety access classes. And I think it's also clear that the Verizon mobile switching center was overloaded on Monday. The effect I observed in Massachusetts was you could not place a call from a landline into the Verizon mobile network for some period of time. They blocked all incoming calls for some period of time. But within the network [Verizon to Verizon] some number of calls were getting through. I didn't succeed, but some friends did after trying for 5 or 10 minutes. In overload cases they won't turn off everything. They'll say fire and safety get through immediately and maybe 10% of the other calls get to go through. They don't throttle down to zero, though, because you don't know if somebody desperately needs to make that connection.
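The class-based barring Turner describes, with ten classes for regular subscribers, six more for emergency and priority use, and a load threshold that triggers the policy, can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the function names, the 0.9 threshold, and the choice to bar nine of the ten regular classes (which leaves roughly the 10% of ordinary callers he mentions). It is not a description of Ericsson's feature or of any carrier's actual configuration.

```python
# Class numbering follows the interview: ten regular classes, six special ones.
REGULAR_CLASSES = range(0, 10)
SPECIAL_CLASSES = range(10, 16)

def barred_classes(load, threshold, classes_to_bar):
    """Return the set of access classes barred at the given load.

    Hypothetical policy: load and threshold are fractions of capacity.
    Below the threshold nothing is barred; above it, the first
    `classes_to_bar` regular classes lose the right to start new calls.
    Special classes are never barred in this sketch.
    """
    if load <= threshold:
        return set()
    return set(list(REGULAR_CLASSES)[:classes_to_bar])

def may_originate(access_class, barred):
    """True if a handset in this access class may start a new call."""
    return access_class not in barred

# A quiet afternoon versus an overloaded switch.
quiet = barred_classes(load=0.6, threshold=0.9, classes_to_bar=9)
busy = barred_classes(load=1.4, threshold=0.9, classes_to_bar=9)

regular_allowed = [c for c in REGULAR_CLASSES if may_originate(c, busy)]
special_allowed = [c for c in SPECIAL_CLASSES if may_originate(c, busy)]
```

Under overload in this sketch, only regular class 9 can still place calls, while all six special classes keep full access. Under normal load nobody is barred.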

MKB: Is this an issue that can be fixed? In some of our background conversations before this interview, I got the impression that this isn't all about what the technology can do, but also what companies do with it. That there's a lot of trade-offs people make and congestion like this during emergencies is one of the side effects of those trade-offs.

BT: In the end, it does come down to trade-offs. That's true of any network. You're interested in coverage first and then capacity. If you wanted to guarantee that a network never had an outage your capital investment would have to go up orders of magnitude beyond anything that is rational. So each network is trying to invest their budget in ways that make the network appear to perform better.

The cost of providing temporary extra capacity for the Boston Marathon, that's something that's in the budget and they plan for that event. But when you get something unexpected like a terrorist event, or an earthquake, or damage from a hurricane or tornado, then you have trade-offs between capital and how robust your network is. Every time you have an event people say, "Oh, they didn't invest enough." But you look at New York City after Hurricane Sandy and Southern Manhattan was under 6 feet of water; all the buried infrastructure was lost. Meanwhile, in other places, a significant number of cell sites were knocked out because connections ran on overhead poles and got knocked down by trees. The antenna site literally got destroyed. Interestingly, you can lose 30% of your cells and still get coverage. Coverage was there in New Jersey after Sandy, even with 1/3 of the network out. The catch is there wasn't much capacity.

MKB: Are more robust networks something that could be regulated? I ask because I've gotten the impression that some people are concerned that when cell service is congested during a disaster, there will be a cry for the government to do something … and the unintended effects of that would actually leave us with a cell system that we maybe don't want, something that gives a few corporations a lot more power.

BT: I honestly don't know how you could regulate it to work the way you wanted it to all the time. Reliability on the old Bell system was relatively high … and we paid a high price for that as consumers because to get that level of service they got to be a monopoly and they got to charge us a rate that allowed them to make a return on their investments.

With cellular systems, competition seems to drive more optimal decisions. We don't have as much competition as we used to, but there's still some. You really want at least four-to-six carriers, and most places it's really only like three or three-and-a-half. For the public, we have to have a trade-off between getting coverage we want and being stuck with a monopoly. You look at electricity or fixed-line phone systems, and there are regulations on those industries about how much coverage and capacity they have to have because it has to be a good system — you as the consumer have no other choice. They're monopolies.

Image: ~ Timepass ! ~, a Creative Commons Attribution No-Derivative-Works (2.0) image from neokratz's photostream