Linguistics, Turing Completeness, and teh lulz

Yesterday's keynote at the 28th Chaos Computer Congress (28C3) by Meredith Patterson on "The Science of Insecurity" was a tour-de-force explanation of the formal linguistics and computer science behind why software becomes insecure, and of how security can be dramatically increased. What's more, Patterson's slides were outstanding Rageface-meets-Occupy memeshopping. Both the video and the slides are online already.

Hard-to-parse protocols require complex parsers. Complex, buggy parsers become weird machines for exploits to run on. Help stop weird machines today: Make your protocol context-free or regular!

Protocols and file formats that are Turing-complete input languages are the worst offenders, because for them, recognizing valid or expected inputs is UNDECIDABLE: no amount of programming or testing will get it right.

A Turing-complete input language destroys security for generations of users. Avoid Turing-complete input languages!
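As a minimal illustration of the first slogan (my own example, not from the talk): when a message format is regular, a single regular expression can serve as the complete recognizer, and anything it rejects never reaches the application logic.

```python
import re

# Hypothetical message format (my example): "SET <key>=<value>" where key
# and value are bounded runs of alphanumerics. The whole language is
# regular, so one regex is the complete recognizer; anything it rejects
# never reaches the handler.
MESSAGE = re.compile(r"\ASET [A-Za-z0-9]{1,32}=[A-Za-z0-9]{1,128}\Z")

def handle(msg: str):
    if not MESSAGE.match(msg):
        raise ValueError("rejected before any processing")
    # Safe to process: the recognizer already guaranteed the shape.
    key, value = msg[4:].split("=", 1)
    return key, value

assert handle("SET color=blue") == ("color", "blue")
```

The point is the ordering: recognition happens completely, up front, before any semantic action runs.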

Patterson's co-authors on the paper were her late husband, Len Sassaman (eulogized here) and Sergey Bratus.

LANGSEC explained in a few slogans


  1. “What’s more, Patterson’s slides were outstanding Rageface-meets-Occupy memeshopping”
    The content is great but the slides are not outstanding. Oh for the day when everyone uses Keynote instead of PowerPoint, and we move away from slide after slide of bulleted lists and sentences. 

  2. “Don’t allow Turing-complete input languages” is too general.  If you were going to obey that rule, you couldn’t write a compiler — or a word processor!

    1. I’m going to guess you read the blurb and didn’t listen to the talk – they’re talking about protocols as languages, not general-purpose programming languages.    
      The thesis is that one shouldn’t be able to construct a message that causes the recipient to perform arbitrary computation in the process of decoding the message, not that one shouldn’t be able to write programs.

  3. Protocols that have formal definitions do enable developers to implement more resilient code. Sadly, as is pointed out, Postel’s principle encourages faulty implementations to linger around; formality helps make moot the concepts of “liberal” and “conservative” in parsing and generating – the wiggle-room disappears.

    That said, a context-free grammar can be used to generate a reliable state machine (using, say, bison), but that matters little if my action-handling code is crap. I can get the benefits of using a language spec and automated tooling using XDR (RFC 4506) with less backend code (plus I get to specify length fields for efficiency).
    Also, I’m curious as to what “Turing-complete input languages” have been encountered by the presenters.  I suspect javascript might be the boogeyman here, but that is clearly a general-purpose language as opposed to an “input language”.

    1. HTML5+CSS3, without Javascript. Not kidding.

      I also have an unpleasant suspicion that we’ll be able to pull a Turing machine out of IPv6, but I haven’t delved too deeply into that yet.

      I have a whole lot of feelings about XDR but they’re too extensive for a comment thread; I will try to blog about this in the near future, though.

      1. So essentially, you feel that XML and SOAP based protocols are untenable from a security point of view, correct?

      2. I’m not convinced that simply having the quality of Turing equivalence is sufficient for a protocol to be considered bad.

        All implemented Turing machines are bounded in various respects. It has not been established that HTML5/CSS’s Turing completeness is an attack vector. The same goes for IPv6 (but not for ASN.1/X.509, which is a total disaster of a standard for many reasons…)

        Simply being a “weird machine” does not make you a hostile machine.

        These people have an excellent understanding of formal languages but a terrible understanding of what security actually is. Security is not binary.

        Early in the talk she says something about the halting problem having 3 output states for Turing-complete languages: Yes/No/Maybe. When you formulate security in these terms, she is correct: complex languages are bad. However, this is to misunderstand security. The real question is: how confident can you be in an answer, given a certain amount of resources to work on it?

        I’d love to see them produce the benchmarks they mention towards the end, comparing real-world performance of context-free and context-sensitive grammars. I think that without those, this work doesn’t really deserve to call itself LANGSEC.

        Disclaimer: 2am is probably not the best time to be trying to tackle these problems.

        Edit: I only just noticed you’re actually the speaker; with more sleep and that information I would probably have phrased some of this differently. Very interesting talk :)

        1. I think the issue with weird machines is that they add complexity to an already complex problem. Bayesian filters have enough trouble keeping my inbox clean as it is. Introducing any kind of logic into the very content of the email, so that to determine if the message is ham the filter first has to run the whole logic and then process the output, may well have just squared or cubed the problem.

    2. PostScript. Back in the days when your printer had more processing power than your computer and a very slow data interface, it made sense to have a Turing-complete page description language. For even more hilarity, a previous program could leave your output device in an undefined state, which could only be discerned by  stack/state parsing and knowledge of proprietary device specifics.

      But it was kinda fun to have the next guy’s print job turn out in mirror writing …

    3. I suggest using protobufs as an alternative to XDR. It features a very easy-to-use API and extremely efficient on-the-wire encoding.

      We wholesale changed our internal protocol from a home-grown, context-free grammar thingy to protobufs and we haven’t looked back. It cured all our ills.

      1. Don’t protobufs and XDR both suffer from the same sort of Packet-in-Packet attacks that have had a surge in attention recently? This seems to be a quality that is common among context-sensitive grammars.

    4. The classic example has been Sendmail configuration files, which have been used to write Turing machines in.  That’s not the source of most of Sendmail’s insecurity (basic bugs in a program running as root will do that), but the size and complexity of the program encourages bugginess.

      Turing machines have also been written in vi macros, and then of course there’s emacs, where programmability was a design goal.

  4. Protocols that have formal definitions do enable developers to implement more resilient code.

    Formal definitions can still be bad, if they describe Turing-complete languages. What we need is protocols that have sane definitions – not exhaustively specified, but rather exhaustively thought through before creating them.

    Also, I’m curious as to what “Turing-complete input languages” have been encountered by the presenters.

    The big culprit here is the nest of insanity around XML. Notably, if your system uses XSLT as an input language then go back and do it again.

    1. Aspies (Asperger’s Syndrome diagnosed people) also sometimes process language the way computers do — formally, where individual words, and sometimes also phrases, have a single, rigidly-defined conceptual entity to which the word or phrase refers.
       It makes sarcasm in English and other non-tonal spoken languages difficult for these people, as they discard tonality in processing meaning.

      1. I’m not a big fan of the terms ‘aspie’ and ‘autie’, but I can’t quite put my finger on why yet. Perhaps it’s because it’s a bit like calling someone with a physical disability a cripple; it’s like reducing that person to that particular aspect of themselves.

        On the other hand, I don’t particularly mind being referred to as a gamer or a cyclist or a coder. Perhaps that’s because they are all descriptions of something I do, rather than something I am.

        1. Then again, there is the issue of thinking of Asperger’s as a disability. Sure, in the current high-stress, highly “social” climate (where if you do not sell yourself at every opportunity you’re a hippie loser) it may seem like one. But given a quiet spot to work, structure may appear from chaos. The biggest damage done to aspies, and their nearest “cousins” on the autism scale, may well be that they are lumped in with the hard cases that one most often hears about in relation to autism. As Einstein supposedly said, he liked working in the patent office because it provided him with enough quiet to think.

      2. It can be learned though, just like anything else. And it also allows for finding humor in things that pass most others by.

  5. Languages, computers….just this morning I read the most fascinating passage in a book about something called “Zipf’s Law” and a study that was done at B.U. in the 90’s wherein the result was basically that the so called “junk DNA” which comprises 97% of all DNA may actually have a language/message encoded in it.

    Mind Blown.

    Link to an article that explains it better than me:

    1. Unfortunately, the article’s not as insightful as it pretends to be.

      Firstly, the notion that 97% of DNA is junk is roughly on a par with the idea that you only use 10% of your brain in terms of meaningfulness. Yes, only 3% of DNA encodes proteins, but much of the non-coding DNA in our genome is not random, but has been conserved intact over a few hundred million years, which strongly indicates that it does perform some useful function and we simply haven’t figured out what it is yet.

      Secondly, the inference that DNA may have some language encoded in it because it adheres to Zipf’s law is pure nonsense: randomly generated texts also respect Zipf’s law.
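      A quick way to see why (my own sketch, not from the article): the classic “monkey typing” model, random letters with a space as word separator, already produces a steeply decaying, Zipf-like rank/frequency curve with no meaning encoded at all.

```python
import random
from collections import Counter

# "Monkey typing": random characters over a tiny alphabet, with space
# acting as the word separator. No meaning whatsoever is encoded.
random.seed(0)
text = "".join(random.choice("abc ") for _ in range(200_000))
words = [w for w in text.split(" ") if w]

# Rank/frequency table: frequencies fall off steeply with rank, the
# qualitative signature of Zipf's law, despite the text being random.
freqs = sorted(Counter(words).values(), reverse=True)
assert freqs[0] > freqs[10] > freqs[100]
```

      So Zipf conformance is a property of the generating process's statistics, not evidence of a language.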

      1. Thanks for the input. I am still stuck on the part about equating randomly generated texts respecting Zipf’s law with the idea that evolution would allow the 97% to remain with us for so long. I thought that organisms only moved forward with what was important and needed?

        1. In evolution, populations of organisms move forward with DNA/genetics that /do not prevent them from reproducing on a par with the environment they inhabit/.
          The three important parts there are:
          Do not prevent them from reproducing;
          Keeping up the reproduction;
          The environment they inhabit.

          The fact that those sequences that “respect Zipf’s law” strongly indicate a useful function is a somewhat informally specified fact.
          The fact is, that any arbitrary DNA sequence surviving millions of years, has a strong correlation with the utility of that DNA.
          Correlation, however, is distinct from causation.

          The evolutionary process, as noted in the first paragraph, doesn’t select for those traits that help organisms survive. It selects against those traits that:
          Prevent reproduction;
          Prevent reproducing on a par with peers;
          Prevent reproducing in the environment they inhabit.

          As long and as “complex” as our DNA is, as far as the replication mechanisms are concerned, it’s no more complex than one long punched paper tape. Replicating it is relatively easy. If the “junk” doesn’t prevent that, then it gets replicated too.

          1. In other words, evolution is like the bootstrap of a PC. The BIOS, even with additions of USB and such, is still fundamentally the same kind of chip that brought up the IBM AT back in the day. But the ISA bus is long gone, PCI and AGP are being replaced by PCIe, and any sane OS basically ignores the BIOS once it has assumed control over the CPU and bus controller.

        2. “Junk” DNA actually has higher information content (measured using Shannon’s definition of information) than the parts that code for protein. But it’s a mistake to conclude that it too must contain an encoding. Biochemistry is primarily structural, and the “junk” DNA plays a major role in physically organizing the protein-coding segments; this requires specific properties that result in low entropy/high information when analysed from that perspective. Think of it another way: someone once compared the packing of DNA into the nucleus to packing several million miles of woollen yarn into a UK telephone booth. When you need to get access to a particular piece of it because it has a message (a protein encoding), do you really think that the non-coding part can just be any old DNA? Or is it rather more likely that it plays an important role without actually encoding anything?
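           For reference, Shannon's per-symbol measure invoked above (my own toy illustration, not genomic data): entropy is maximal when every symbol is equally likely and zero when the sequence is fully predictable.

```python
import math
from collections import Counter

# Shannon entropy in bits per symbol of a sequence, computed from the
# empirical symbol frequencies.
def entropy(seq: str) -> float:
    counts = Counter(seq)
    total = len(seq)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

assert entropy("AAAA") == 0.0                # fully predictable sequence
assert abs(entropy("ACGT") - 2.0) < 1e-9     # maximal for a 4-symbol alphabet
```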

        3. Calling DNA that doesn’t encode a protein ‘junk’ is like looking at a computer program and calling every line that doesn’t contain a print statement ‘junk.’

  6. I’ll say this much: no middle-manager is ever going to go for something called “sub-Turing” or “Turing-incomplete”.  It just sounds bad from a marketing standpoint.

    Clearly better terminology is needed here.  “Turing-incomplete” needs to be rebranded.  Like, “Happy Kitten Complete”, or something.

  7. …and the object lesson was delivered at CCC when Klink and Walde described the hash-collision flaws in PHP, ASP.NET, Java, Python, Ruby, Tomcat, Geronimo, Jetty, Glassfish, and the V8 engine: CPU DoS attacks via crafty POST submissions. When the day comes that Greg Bear’s implant-augmented self-contained homorphs become a reality, we’ll need to add a new mortality factor to the medical lexicon: “No, it wasn’t an aneurysm; she died from O(n**2).”
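    The quadratic blowup those attacks exploit is easy to reproduce in miniature (my own sketch, not the published exploit): force every key into one hash bucket and each insertion degenerates into a linear scan.

```python
# Every key hashes to the same bucket, so each dict insertion must compare
# against all previously inserted keys: n inserts cost O(n^2) equality
# checks in total. Attacker-chosen collisions turn this into a CPU DoS.
class Colliding:
    comparisons = 0

    def __init__(self, v):
        self.v = v

    def __hash__(self):
        return 42  # deliberate worst case: all instances collide

    def __eq__(self, other):
        Colliding.comparisons += 1
        return self.v == other.v

n = 200
d = {Colliding(i): i for i in range(n)}

assert len(d) == n
# Inserting the k-th key compared it against the k-1 keys already present:
assert Colliding.comparisons >= n * (n - 1) // 2
```

    The real attacks do the same thing from outside, by crafting POST parameter names that collide under the language's hash function.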

  8. This is the best talk on Internet security I’ve seen. Formal grammar theory: it’s not just for eggheads anymore.

  9. Not sure why she was so against length fields.  Unbounded length fields are not regular, but if you have a maximum size for your length field (and good protocols that use length fields do) then that IS regular.

    1. Variable length fields are okay, but the length shouldn’t be defined within the message. That is, imagine a protocol that has a definition of something like “read 1 header byte, which contains the length of the rest of the message”:

        `9Like this` (the length prefix 9 followed by the nine characters “Like this”)

      This kind of protocol makes the grammar context-sensitive. In order to properly parse the message, you need to pull in context from the message itself. That opens up the possibility for problems.
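      A minimal parser for exactly that scheme (my own sketch) makes the data dependency explicit: the amount left to read comes from the message itself.

```python
# One header byte gives the length of the body that follows. The parser
# must read the length *from the data* to know how much more to consume.
def parse(buf: bytes) -> bytes:
    if not buf:
        raise ValueError("missing length byte")
    n = buf[0]                      # context pulled from the message itself
    body = buf[1:1 + n]
    if len(body) != n:
        raise ValueError("truncated message")
    if len(buf) != 1 + n:
        raise ValueError("trailing bytes")
    return body

# "Like this" is nine bytes, so a length byte of 9 frames it exactly:
assert parse(b"\x09Like this") == b"Like this"
```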

      1. Well, no. :)  That was my point.  You’re saying what she was saying, and I think you’re both incorrect.

        The protocol “Read 1 header byte, which contains the length of the rest of the message” IS regular. This is because the ‘1 header byte’ has a maximum size of 255. For ease of explication, imagine the protocol is actually ‘read 2 bits which contain the length of the rest of the message’, i.e. the message is 0, 1, 2 or 3 bytes long. That can be parsed with a regular expression: ((00())|(01(.))|(10(..))|(11(...))), assuming the bit-pairs on the front match bits :). It is an ugly regular expression, with one branch for each of the possible lengths, but it is a regular expression.

        Matching like this only becomes non-regular when the protocol allows an arbitrary integer for the length (you’d end up with an infinite number of options for all the different lengths).

        You have to be somewhat careful with this argument, as the same reasoning holds that your computer is not actually Turing complete because it only has finite memory. Your computer is ‘really’ just a very complex finite state machine. Taken to that extreme the argument is true, but unhelpful, as the ‘very complex finite state machine’ is so complex as to be very difficult to analyse.

        But Pascal-style length-delimited strings are very simple, the reduction to regular languages is well understood, and I don’t see why they’d be a security hole.
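        The construction generalizes mechanically (my own sketch): one alternative per possible length value, shown here for lengths 0–3 over bytes rather than bits.

```python
import re

# One regex alternative per possible length value: (length byte)(that many
# arbitrary bytes). A finite alternation, hence a regular language.
# Extending range(4) to range(256) gives the full one-byte Pascal-string case.
alts = [re.escape(bytes([n])) + b"." * n for n in range(4)]
recognizer = re.compile(b"\\A(?:" + b"|".join(alts) + b")\\Z", re.DOTALL)

assert recognizer.match(b"\x02hi")       # length 2, two bytes follow
assert not recognizer.match(b"\x03hi")   # body shorter than its prefix claims
```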

        I should also add that I found the talk in general very interesting.  She made a lot of great points.  I particularly liked the comment about being explicit about what grammar you’re accepting and parsing it properly.  I’m just nit-picking about one little point.

    1. I don’t remember her saying that incremental parsing was bad.  She just said that you should separate your grammar out, be clear about what language class it falls in, and don’t write your own special-purpose parser.

  10. I, too, am at a loss as to what the difference is between:


    …which is very clearly regular, and:

    (byte length N)(N word characters)

    I also completely fail to understand how you can have a non-escaping tokenisation system which can encode, say, a human-written description of the tokenisation system, including a description of the tokens.

    If your data is arbitrary binary data (from a jpg, say) then there is *no conceivable way* to create a delimiter that cannot appear in the data.

    You would instead have to, say, encode the binary in base-64, so that there are some bytes you can use as delimiters. So, this entire paradigm is completely unusable for raw binary transmission.
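    That workaround is straightforward to sketch (my own example): once the payload is base64-encoded, its output alphabet is 64 known characters plus “=”, so any byte outside that set, such as a newline, becomes a delimiter that provably cannot occur inside the data.

```python
import base64

# Frame arbitrary binary data with a newline delimiter by base64-encoding
# it first: "\n" is outside the base64 alphabet, so it can never appear
# inside the encoded payload.
def frame(payload: bytes) -> bytes:
    return base64.b64encode(payload) + b"\n"

def unframe(line: bytes) -> bytes:
    return base64.b64decode(line.rstrip(b"\n"))

blob = bytes(range(256))                 # arbitrary binary, includes b"\n"
assert b"\n" not in frame(blob)[:-1]     # delimiter only at the frame boundary
assert unframe(frame(blob)) == blob      # round-trips losslessly
```

    The cost, of course, is the ~33% size overhead, which is exactly the objection to using this for raw binary transmission.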

    Not only that, but as far as delimiters go, I can only quote her and “refer you to the past 20 years of SQL injections” – which typically don’t rely on escaping, but on *inserting delimiters*. And not in an invalid way that can be captured by the SQL parser, but in an entirely *valid* way. Like little Bobby Tables: no escaping there.

    She made a good point about the problem of complexity: but she failed completely to specify how exactly she intended people to get all the utility of the modern web, without having all the power of the modern web.

    So: it’s a nice rule of thumb for non-binary protocols, moving forward; but I challenge anyone to create even a half-decent compression protocol that complies with it.
