Breathalyzer source-code sucks

After a long legal wrangle, some defendant-side attorneys have audited the source-code of Alcotest, the breathalyzer used in New Jersey DUI stops. Turns out it was programmed by muppets who don't know how to calculate an average and who throw out error messages by the dozen.

Like voting-machine vendors, breathalyzer vendors go crazy when defendants ask to have their source-code audited, claiming that there's a bunch of top-s33kr1t stuff in there that their competitors would steal. And, just like voting-machine software, breathalyzer software appears to have been written by squirrels dancing on the keyboard until they got something that would compile.

2. Readings are Not Averaged Correctly: When the software takes a series of readings, it first averages the first two readings. Then, it averages the third reading with the average just computed. Then the fourth reading is averaged with the new average, and so on. There is no comment or note detailing a reason for this calculation, which would cause the first reading to have more weight than successive readings. Nonetheless, the comments say that the values should be averaged, and they are not...

4. Catastrophic Error Detection Is Disabled: An interrupt that detects that the microprocessor is trying to execute an illegal instruction is disabled, meaning that the Alcotest software could appear to run correctly while executing wild branches or invalid code for a period of time. Other interrupts ignored are the Computer Operating Properly interrupt (a watchdog timer), and the Software Interrupt.
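The averaging scheme described in finding 2 can be sketched in a few lines; this is an illustration of the report's description, not the vendor's actual code:

```python
def alcotest_style_average(readings):
    """Average as the report describes: take the mean of the
    first two readings, then repeatedly average the running
    result with each new reading."""
    avg = (readings[0] + readings[1]) / 2
    for r in readings[2:]:
        avg = (avg + r) / 2
    return avg

def true_mean(readings):
    return sum(readings) / len(readings)

readings = [1.0, 1.0, 100.0]
print(alcotest_style_average(readings))  # 50.5 -- the last reading dominates
print(true_mean(readings))               # 34.0
```

As the comments below work out, this scheme actually weights the most recent reading most heavily, not the first one.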



  1. Actually, the weight of the readings increases, and the first and second readings have the least weight. With three readings, readings 1 and 2 have weight 1/4 each, and the third reading has weight 1/2 (twice the others).

    Which does not change the actual point, of course.

  2. Wow, looks like they just invalidated every drunk driving conviction ever made with this device.

    Seems like open source and peer review is the only rational standard for devices that have the ability to do us harm.

    Draeger seems to make lots of medical diagnostic equipment; if their other products are done to the same standard, how many people have they killed due to bad software on their machines? An ambulance chaser could put together a heck of a class-action lawsuit based on these initial findings.

    Too bad they’re privately held; I bet you could have made a ton of money shorting their stock.

  3. Simple solution: breathalyzer test required for code check-in in the Source Code Control System used to develop breathalyzer apps. Of course, if the code is faulty to begin with, this could result in iteratively increasing drunken coding, as it might require minimum blood-alcohol levels to accept code revs. Or at least, on average.

    The real question, though:

    breathalyzer software appears to have been written by squirrels dancing on the keyboard

    Were the squirrels drunk? I think there’s a Biology/CS PhD thesis waiting to be written there if you can determine that via analysis of the resulting code.

    P.S. You’ll be hearing from the MDL — the Muppet Defamation League has views about derogatory statements regarding the ability of muppets to calculate averages. Our key witness: Dr. Beaker. Of course, there is the related issue of drunk-driving muppets.

  4. If things in New Jersey are the same as in my state, those roadside testing devices are not evidentiary; they just give probable cause for the police to take them to the station to test them on the “real” breathalyzer, generally an Intoximeter. Yes, if the police didn’t have proper probable cause the case should be thrown out, but it doesn’t mean anybody was wrongly convicted. Until we see the Intoximeter’s code, that is.

  5. Wouldn’t their method of averaging give greater weight to later readings, not earlier?

    At round two, the first and second have equal weight. At round three, the third reading has the same weight as rounds 1 and 2 combined.

    If a, b, and c are the successive tests, the values of each reading in the average would look like this:
    (a/4)+(b/4)+(c/2) instead of the more appropriate (a/3)+(b/3)+(c/3).
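    Those weights fall out mechanically if you track what fraction of the final result each reading contributes; a small sketch (hypothetical helper, not anything from the device's code):

```python
def iterated_average_weights(n):
    """Weight each of n readings (n >= 2) receives under the
    'average the new reading with the running average' scheme."""
    weights = [0.5, 0.5]  # mean of the first two readings
    for _ in range(n - 2):
        # halve every existing weight, then the new reading gets 1/2
        weights = [w / 2 for w in weights] + [0.5]
    return weights

print(iterated_average_weights(3))  # [0.25, 0.25, 0.5]
print(iterated_average_weights(4))  # [0.125, 0.125, 0.25, 0.5]
```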

  6. Apparently averages are hard for lots of people.
    “There is no comment or note detailing a reason for this calculation, which would cause the first reading to have more weight than successive readings.” Actually it is the other way around. Ex: The average of 1, 1, and 100 is 34. Avg(100,1)=50.5->Avg(50.5,1)=25.75
    Whereas Avg(1,1)=1->Avg(1,100)=50.5
    The first number has the least weight, the last number has the most.

  7. “When the software takes a series of readings, it first averages the first two readings. Then, it averages the third reading with the average just computed.”

    Sounds like some sort of smoothing technique. Back when I was doing synth consulting I’d occasionally ask for source to see if what I wanted to do could be done at the user level…I remember doing something like this with controller information to make certain that inputs would not jump WAAAAY too much, but that it would still rapidly find the appropriate value.

    Without knowing the number of samples and the interval, I wouldn’t be able to say for sure… but it actually seems it could give a more accurate response.

    Nevertheless, this is why code should be commented ACCURATELY. And if there is a hack in there, explain why the hack is there and what it is supposed to do. I know when I was doing ‘medical grade’ applications, my lawyer MADE us explain everything, even though our code was not for life saving purposes (psychological testing)…but someone could have disregarded the rules and used it to adjust patient care. I would imagine a breathalyzer is the same sort of class of instrument.

  8. The UK is about to introduce new meters for roadside use that will be evidentiary, and will also change the current law so you cannot request a sample of blood be taken for an independent test.

    As far as I’m concerned, the only true test for blood alcohol level is one performed on an actual blood sample… any breath test is just guesstimation based on rules of thumb relating alcohol content in the breath to the actual level in the blood

    And as usual in the UK, defence lawyers will not be able to challenge the algorithms used to calculate the blood alcohol level based on the breath alcohol level. Or even the algorithms used to derive the breath alcohol level.

    As given above, the averaging rule used by the tap-dancing squirrels would give a very high bias to the first few samples taken during the sampling period.

  9. 2. is an Exponentially Weighted Moving Average. It is perfectly valid and much easier to compute over an arbitrary period than a simple mean.

  10. Points raised in (4) are normal. Your PC doesn’t usually include/enable a watchdog either. Watchdogs are really useful for resetting hardware that’s difficult to reach (wireless router on your roof), or that absolutely must keep running (ABS controller in your car). A breathalyzer doesn’t fall into either of these categories. If the microcontroller stops, the device will be unresponsive. The (human) watchdog will then apply a traditional power cycle. If the alcohol sensor is screwy, well, checking the microcontroller won’t tell you about that anyway.

  11. Always go for the blood test. If there’s a jam-up at the hospital, it’s to your favor.

  12. That’s a weird thing to do – not be able to have a blood test to confirm.

    This is done in NSW (Australia) if a roadside test indicates you may have a higher blood PCA (prescribed content of alcohol) than permitted.

    A blood test would be the best thing to do.

    Again, as I said, *WEIRD*.

  13. Brings back memories of an army lecture I attended about breathalyzers where they demonstrated one for us. They had an officer take a shot of vodka, then blow. Rinse repeat until it came up red. The thing was malfunctioning, and the officer had about 12 oz of vodka before it finally showed him as yellow.


  14. 2. is an Exponentially Weighted Moving Average. It is perfectly valid and much easier to compute over an arbitrary period than a simple mean.

    Computing the actual mean requires one more word of memory, and an additional multiply. Unless you’re hurting for that last single byte of RAM, or that one extra clock cycle of processing, it’s hard for me to see how you could call what they did “easier”… It’s certainly not “much easier”.

    Whether or not it’s perfectly valid depends on the application. We can only speculate as to how appropriate the method they chose is, but if the comment says they actually wanted the mean value, and the code does something else….. Doesn’t that make it hard to argue that the way they implemented it is “perfectly valid”?

  15. QFA:
    “The computer code in the 7110 is written on an Atari®-styled chip, utilizing fifteen to twenty year old technology in 1970s coding style.”


  16. I agree with the others, this seems like an intentionally constructed average weighted towards the most recent event — but, it should be commented as such more clearly.

  17. The linked-to article brings up many more findings of bad coding and bad product design.

    Starting with the averaging algorithm, which seems to have unfairly dominated the conversation so far. The averaging method used is of particular concern given the graininess of the readings: a later point in the linked-to article mentions that the output of one of the sensors is limited to 4 bits. If the output varies between two adjacent values over 60 discrete readings, the code can generate an output of either end value, or virtually anything in between, depending on the order in which the readings reach the averager. Add just a single ill-timed reading outside the range and it can pull the “average” outside the range of 90+% of the readings.


    The code also does a horrible job of detecting errors, doesn’t do a good job of handling errors, and contains sections of code marked as “temporary”.
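    The order-sensitivity point can be checked with a toy simulation; the two adjacent levels are assumed values, chosen just to show the effect:

```python
def iterated_average(readings):
    """The report's scheme: average each new reading with the running average."""
    avg = (readings[0] + readings[1]) / 2
    for r in readings[2:]:
        avg = (avg + r) / 2
    return avg

# 60 readings flickering between two adjacent 4-bit levels, 7 and 8.
# The true mean is 7.5 either way, but the iterated average lands on
# a different value depending purely on which level happened to come last.
flicker = [7, 8] * 30
print(round(iterated_average(flicker), 3))                  # 7.667
print(round(iterated_average(list(reversed(flicker))), 3))  # 7.333
```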

  18. For an embedded system, you very much could be trying to squeeze the last of the power out of the system.

    EWMA only needs two registers and 4 operations to calculate (once you have the current spot value stored):

    Load accumulator with previous value for average.
    Add current measured value to accumulator.
    Logical shift accumulator right one bit. (This = halving, rounding down)
    Store result as latest value for average.

    If you are running every 8.192 ms, and with an alpha of 0.5, 50% + 25% + 12.5% + 6.25% + 3.125% = 96.875% of the value for the average will be calculated from no more than the last 5 x 8.192 = 40.96 ms of readings. Applying the Nyquist theorem, unless the alcohol content of your breath fluctuates above then below the limit more than (1000 / 40.96) / 2 = 12 times per second, it won’t be giving an erroneous reading via this calculation.
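    The four assembly-style steps above translate almost line-for-line into code; a sketch, assuming integer arithmetic with a truncating right shift:

```python
def ewma_step(prev_avg, sample):
    """One alpha = 0.5 EWMA update, as the four steps above:
    load the previous average, add the sample, shift right one
    bit (halve, rounding down), store the result."""
    return (prev_avg + sample) >> 1

avg = 0
for sample in [100, 100, 100, 100, 100]:
    avg = ewma_step(avg, sample)
print(avg)  # 96 -- most of the way to 100 after five samples
```

    Note the rounding-down: with integer readings, the stored average can settle slightly below the true steady-state value.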

  19. @#15 Paul McLaughlin:

    Stupid confusing us with your clever math-skillz and your good arguments! Don’t you get it, we want them to suck! You’re taking all the fun out it by being all reasonable and stuff!

  20. responding to:
    “the UK is about to introduce new meters for roadside use that will be evidentiary and will also change the current law so you cannot request a sample of blood be take for an independent test.”

    That is bizarre. Here in Texas, motorists who refuse roadside breath tests not only face automatic license suspension proceedings (which is deemed “administrative” and separate from criminal prosecution), but now there are judges on call who will sign off on a warrant to extract a blood sample against your will in the event of a breathalyzer refusal. Of course, you can also have a sample taken and submitted to an independent lab if you wish.

    However, most Texas juries will convict solely on the basis of arresting officer & other eyewitness testimony, unless the physical evidence is overpoweringly exculpatory.

  21. @#16 Oskar:

    The thing is, I believe that the software does suck. I just don’t think that the averaging method is where the problems lie.

    I have used Draeger instruments at work and found them to be very reliable, but that does not mean that ones developed for outside the petrochemical industry are as good. Being sued by Exxon or similar would be much more worrying if I was a product manager.

  22. paulmclaughlin: I don’t see how your logic about the error rate of the EWMA hangs together. If you assume your logic is true for the first four readings, and that the magnitude of the error of one reading is on the order of the magnitude of the value, and then by chance the fifth reading comes out zero (which is fairly likely for an error magnitude that is comparable to the value), then the total reading is about one-half of the actual value.

    Note that your discussion of the accuracy of the EWMA in getting the actual value ignores the magnitude of the error of one reading, which clearly can’t be right.

    The value of the series you listed is the accuracy of the EWMA over five readings compared to the EWMA for an infinite number of readings. However, no matter how many readings you use with EWMA, the logic I gave in the first paragraph applies: one half of the magnitude of the error is always in the EWMA, coming from the last term (the one in which the alpha is only multiplied once).

  23. Exponentially-weighted average are great filters, and easy to compute too. The biggest problem is finding which multiplicative factor (alpha) to use — they chose alpha=0.5, hopefully with some sort of research behind it.

    Whether this is good or not depends on the application, of course, and what they actually intended to do. But it is not incorrect by default.

  24. WURP: If the error is that large compared to the actual value, then nothing you can do will help you out; I’ll agree with you there. For a process to be in statistical control, the standard deviation of a measure should be no more than one third of the difference between the mean and the limit (for a one-way test like this).

    For example, let us pick a target out of the air. If you can have 100 ppm alcohol or less in a measurement to be allowed, and you know your sensor has a standard deviation of 5 ppm, you would put the fail mark at 115 ppm, to give you 99.87% assurance that the result correctly indicates failure of the test.

    If the errors are too large, and centrally distributed (i.e. the analyser is imprecise but not biased) then you will see a fluctuating result, which should be a red flag. If the errors are not centrally distributed, then you have a biased result and the whole thing is useless.

    I’m used to IR analysers which autocalibrate themselves against known sample gases every 2 days, but I don’t know how much signal drift (if at all) you would expect from this breathalyser.
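    The 100 ppm example above can be checked directly; this is just the 3-sigma rule of thumb for a one-sided test, not anything from the device:

```python
from math import erf, sqrt

def fail_mark(limit, sigma, n_sigmas=3):
    """One-sided fail threshold: n_sigmas standard deviations above the limit."""
    return limit + n_sigmas * sigma

def one_sided_confidence(n_sigmas):
    """P(a zero-mean normal measurement error stays below n_sigmas * sigma)."""
    return 0.5 * (1 + erf(n_sigmas / sqrt(2)))

print(fail_mark(100, 5))                  # 115 ppm
print(round(one_sided_confidence(3), 4))  # 0.9987
```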

  25. If it’s true that they’re using an EWMA, then you might be able to game the system by pushing a final puff of clean air through the breathalyzer – first you blow from your lungs, then pull a quick mouthful of air into your mouth and expel it without exhaling any more from your lungs (kind of like smoking a cigar, I guess). If the machine doesn’t detect the pause in the air pushed through it and prevent further readings from going into the average, you could give that last little bit of air an undue weighting.

    Of course, pulling that off undetected by the officer administering the test would probably be quite tricky, particularly if you’d been drinking…

  26. Oooh, I like this part

    7. Flow Measurements Adjusted/Substituted: The software takes an airflow measurement at power-up, and presumes this value is the “zero line” or baseline measurement for subsequent calculations. No quality check or reasonableness test is done on this measurement. Subsequent calculations are compared against this baseline measurement, and the difference is the change in airflow. If the airflow is slower than the baseline, this would result in a negative flow measurement, so the software simply adjusts the negative reading to a positive value.

    In that case, you might just be able to suck in outside air at the end of the test and have the ambient alcohol level over-weighted.

    Again, I don’t know if this is possible. I’ve never taken a breathalyzer test, as I don’t drive, and have never been pulled over on my bike or on public transit…

  27. I wonder if a varied form of circular breathing might work? Get a little more clean air in that way, and not quite as noticeable as taking an extra little breath.

    One thing I noticed in the article was error reporting. It appears to not report until after 32 successive error messages. Why would this be, if not to skew the results?

  28. Either the description of the averaging method is not complete or the conclusion that “[this] would cause the first reading to have more weight than successive readings.” is wrong.

    Take the 4 samples with the values 1, 2, 3, and 4.

    According to their description, you average 1 and 2 to get 1.5, then you average that with 3, which would give you an average of 2.25, which you average with 4 to get 3.125, when the real average is 2.5. The last value added counts for 50% of the computed average.

    However, if the averaging method is a weighted average, where at the nth step, you add (n-1) times the average of the first n-1 values to the nth value and then divide by n, you get the correct average.
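    That correction is cheap to implement; a minimal sketch of an incremental mean that weights every sample equally:

```python
def running_mean(samples):
    """Incremental mean: at step n, the old mean is effectively
    re-weighted by (n - 1) / n before the nth sample is folded in,
    so every sample ends up with weight 1 / n."""
    mean = 0.0
    for n, x in enumerate(samples, start=1):
        mean += (x - mean) / n
    return mean

print(running_mean([1, 2, 3, 4]))  # 2.5
```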

  29. Pfft. If my $60 blood pressure monitor can deliver a true mean average of the last three readings taken, so should a $60,000 Breathalyzer.

    That being said, it’s mathematically impossible for an exponentially-weighted moving average of two or more readings to equal or exceed 0.08 unless one or more of the readings exceeded 0.08.

    So, no false convictions result.
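    The bounded-by-the-maximum claim is easy to sanity-check, since an EWMA is a convex combination of its inputs; this says nothing about whether the largest reading was itself erroneous, which is the objection raised below:

```python
import random

def ewma(samples, alpha=0.5):
    """Standard exponentially weighted moving average."""
    avg = samples[0]
    for x in samples[1:]:
        avg = alpha * x + (1 - alpha) * avg
    return avg

# Randomized check: the EWMA never exceeds the largest input reading.
random.seed(1)
for _ in range(1000):
    samples = [random.uniform(0.0, 0.2) for _ in range(10)]
    assert ewma(samples) <= max(samples)
print("ok")
```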

  30. >> That being said, it’s mathematically impossible for an exponentially-weighted moving average of two or more readings to equal or exceed 0.08 unless one or more of the readings exceeded 0.08.
    >> So, no false convictions result.

    How do you figure that? If you could get an error-free reading, there’d be no reason to average the readings together. If the particularly high erroneous data point turns out to be the one heavily weighted in the formula, you’re screwed.

    Not to mention the fact that, apparently, off the charts readings get treated as max values, instead of errors, unless you get 32 of them.

  31. Apparently the success of Dräger’s products is due to marketing and not good software design. See how they cleverly use the findings of this trial to their advantage on the product page:

    The Alcotest® 7110 MK III-C is a proven evidential breath analyzer. It is the only evidential breath tester on the market whose source code has been reviewed by independent third parties and approved by a Supreme Court decision.

  32. I once worked writing software for an FDA-certified $100K blood analysis machine: worst spaghetti code explosion I ever saw. There was one 60,000 line file with no local variables and program state held in a collection of twelve string variables. The sad state of the BAC-tester’s code is sadly un-shocking to me. Now we know what they were really trying to hide.

  33. There are some key points that Cory neglected to mention in his quickie summary, and that none of the commenters seem to have caught.

    First, Cory makes it sound like this code analysis was just recently completed. But the analysis was published August 28, 2007. That’s what it even says right there in black and white on the page Cory linked to.

    Second, contrary to the sentiment that this scathing report invalidates test results from the device in question, the New Jersey Supreme Court ruled on March 17, 2008 that the device is technically sound and legal for its intended use (with some restrictions).

    March 2008 also isn’t very recent. Is there some new news in this case that Cory isn’t telling us about? Or is he just the last person on the block to read about this?

  34. @paulmclaughlin:

    I have two issues with your argument.

    If you want to calculate something other than the simple mean, why document your algorithm as being a simple mean calculation?

    The calculation you describe actually over-weights the last sample. If you get 99 samples of 0.04, and a final sample of 0.12, a simple mean gives you 0.0408. An error in the final sample has little effect on the outcome. Using the algorithm you describe (and the one implemented in this machine), the same samples give you a result of 0.08, and you go to jail. Are you confident that the later samples aren’t a corner case?
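    A quick check of that scenario (the exact simple mean works out to 0.0408):

```python
def simple_mean(samples):
    return sum(samples) / len(samples)

def ewma(samples, alpha=0.5):
    """Standard exponentially weighted moving average."""
    avg = samples[0]
    for x in samples[1:]:
        avg = alpha * x + (1 - alpha) * avg
    return avg

# Ninety-nine readings of 0.04 followed by a single 0.12 spike.
samples = [0.04] * 99 + [0.12]
print(round(simple_mean(samples), 4))  # 0.0408 -- the spike barely registers
print(round(ewma(samples), 4))         # 0.08   -- the spike gets half the weight
```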

  35. Interestingly, there are other potential problems with breath testing, even assuming the source code was perfectly well designed, and the device was designed to detect error conditions, and not suppress any errors…

    For example – per – the ratio of blood alcohol to breath alcohol is nowhere near fixed. The breathalyzers used in Canada are based on the assumption that it’s 2100:1 for everyone. In fact, there are considerable differences from person to person, from around 1000:1 to 3000:1.

    Again, from the above reference – this is part of the changes to the criminal code that were introduced in 2008:

    …evidence tending to show that an approved instrument was malfunctioning or was operated improperly, or that an analysis of a sample of the accused’s blood was performed improperly, does not include evidence of

    (i) the amount of alcohol that the accused consumed,

    (ii) the rate at which the alcohol that the accused consumed would have been absorbed and eliminated by the accused’s body, or

    (iii) a calculation based on that evidence of what the concentration of alcohol in the accused’s blood would have been at the time when the offence was alleged to have been committed;”

    Get that? In Canada, we are specifically prohibited to introduce in court:

    – scientific evidence specific to how the body of the accused works, as opposed to some idealized theoretical human body, showing that the results of a breath test were misinterpreted

    – witnessed evidence that the accused had not been drinking and as such the machines must have been malfunctioning.

    Yay, Conservative government!

  36. I came here to say what everybody else (pretty much) has already said about the averaging giving more weight to later results than to earlier results.

    Given such a fundamental error in the software analysis, can this analysis firm really be trusted?
    How good is their analysis if they mess up such a simple thing?

  37. So what should a public defender do with this information?

    (1) Should I be seeking discovery for other models’ source code (we’re up to the Alcotest 7410 in my county)?

    (2) Shouldn’t this expose the blood test testing machinery’s source code to discovery motions?

    (3) California requires that you take two breaths and that their results be within .02 of each other (i.e., .06 and .08 is fine, the margin of error is .01). But if the results are impacted by prior results, then an aberrant high result will average into the following results rather than each result being independent for comparison. They should not be averaging, they should be each separate as control tests against successive results.

    (4) What is the impact of the averaging/smoothing in a real-life situation? Say a defendant has an actual BAC of .07% (just under the legal limit in CA). Their results are .05 –> .08 –> .07 (the third test being required because there was a >.02 difference). Does that mean that the third test result may actually have been .05 (or .06), but it was averaged with the prior two to give a .07?

    (5) If this issue is to be litigated, it would require an expert to counter the Drager expert/employees. Anyone you’d recommend? Our offices will need experts with excellent credentials and the ability to communicate these relatively complex concepts to jurors AND understand that we’re looking to show a reasonable doubt as to the accuracy of the result (not “more or less probability” re: the results accuracy).

  38. The averaging method described is a simple recursive digital filter which simulates the response of a single-stage analog RC filter. You can convince yourself of this by writing down the convolution integral for the response of an exponential filter exp(-t/t’) to an arbitrary input, where t’ is the time constant of the filter. Then replace the integral by a sum where the terms are separated in time by a, the interval between measurements.

    Upon rearranging the sum, it will be seen that it equals the unfiltered data at the current time step, plus a constant times the filtered data at the previous time step. The constant is exp(-a/t’).

    Normalizing to unity, we multiply by an arbitrary constant K, chosen so the gain of the filter to a constant DC input is 1, as it should be. Then 1/K = 1+exp(-a/t’). If a is known from the measurement situation, and a suitable time constant t’ is chosen, K may be found and the filter is determined.

    In the particular instance given, we know that K = Kexp(-a/t’) = 0.5. So exp(-a/t’) = 1, which means effectively that t’>>a.

    Using the Laplace transform of the time response of the filter, it may be shown that the corner frequency of the filter is (1/t’) in radians/s. Below the corner frequency, the gain of the filter is flat at 1. Above the corner frequency, it rolls off at 20 dB/decade.

    Since t’>>a in the example, the corner frequency is close to zero; effectively, the frequency response rolls off at 20 dB/decade starting from close to DC. For instance, between 1 Hz and 10 Hz it would drop by 20 dB. You can imagine what that would look like drawn on a diagram like a stereo frequency-response curve.

    I don’t claim that the guy who wrote this software actually knew what the design considerations were behind the choice of the constants in the filter, but that’s what they are. He probably got the filter from tribal knowledge.

    I also don’t claim that the choice he made is appropriate for a breathalyzer, I wouldn’t know.
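    The recursion in that derivation can be written out and sanity-checked numerically; this is a sketch of the comment's math, not the device's code, and the a and t' values below are arbitrary:

```python
import math

def one_pole_filter(samples, a, t_prime):
    """Discrete one-pole filter from the derivation above:
    y[n] = K*x[n] + K*exp(-a/t')*y[n-1], with K = 1/(1 + exp(-a/t'))
    chosen so the DC gain is exactly 1."""
    decay = math.exp(-a / t_prime)
    k = 1.0 / (1.0 + decay)
    y = 0.0
    out = []
    for x in samples:
        y = k * x + k * decay * y
        out.append(y)
    return out

# With t' >> a, decay -> 1 and K -> 0.5, recovering the
# y[n] = (x[n] + y[n-1]) / 2 scheme the Alcotest uses.
step = one_pole_filter([1.0] * 40, a=1.0, t_prime=1000.0)
print(round(step[-1], 6))  # 1.0 -- settles at the DC input value
```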

  39. Though I appreciate all the math-nerd arguments in this post (I resemble) I would like to introduce a physics-based argument. Given that the length of ‘tube’ on the breathalyser is non-zero, won’t repeated rapid blowing into a tube initially reading low gradually increase the concentration? Especially given that other variables are not being tested?

    Heck, with the right recent meal of sweet yeasty food, you would eventually test positive through fermentation.
