Features Podcasts Family Video Comics Music Tech Science Books Film & TV Games ✚

Jill

Linguistics, Turing Completeness, and teh lulz

Cory Doctorow at 11:35 pm Wed, Dec 28, 2011

— FEATURED —

Book Review

The Man Who Laughs: grotesque Victor Hugo potboiler was the basis for The Joker

Feature

Eurovision 2013: An American in London

Book Review

The Twelve-Fingered Boy - mesmerizing YA horror novel

— FOLLOW US —

Boing Boing is on Twitter and Facebook. Subscribe to our RSS feed or daily email.

 

— POLICIES —

Except where indicated, Boing Boing is licensed under a Creative Commons License permitting non-commercial sharing with attribution

 

— FONTS —

Tweet
Kindle


Yesterday's keynote at the 28th Chaos Computer Congress (28C3) by Meredith Patterson on "The Science of Insecurity" was a tour-de-force explanation of the formal linguistics and computer science that explain why software becomes insecure, and an explanation of how security can be dramatically increased. What's more, Patterson's slides were outstanding Rageface-meets-Occupy memeshopping. Both the video and the slides are online already.

Hard-to-parse protocols require complex parsers. Complex, buggy parsers become weird machines for exploits to run on. Help stop weird machines today: Make your protocol context-free or regular!

Protocols and file formats that are Turing-complete input languages are the worst offenders, because for them, recognizing valid or expected inputs is UNDECIDABLE: no amount of programming or testing will get it right.

A Turing-complete input language destroys security for generations of users. Avoid Turing-complete input languages!

Patterson's co-authors on the paper were her late husband, Len Sassaman (eulogized here) and Sergey Bratus.

LANGSEC explained in a few slogans

I write books. My latest is a YA science fiction novel called Homeland (it's the sequel to Little Brother). More books: Rapture of the Nerds (a novel, with Charlie Stross); With a Little Help (short stories); and The Great Big Beautiful Tomorrow (novella and nonfic). I speak all over the place and I tweet and tumble, too.

MORE:  ccc • computer science • Funny • happy mutants • linguistics • occupy • Science • security • web theory

More at Boing Boing

Eurovision 2013: An American in London

The technology that links taxonomy and Star Trek

  • http://twitter.com/LennStar_de LennStar

    Finished versions of the talks: http://mirror.fem-net.de/CCC/28C3/mp4-h264-HQ/
    hurry up, or you will need next year to hear them, too!

  • tw1515tw

    “What’s more, Patterson’s slides were outstandingRageface-meets-Occupy memeshopping”
    The content is great but the slides are not outstanding. Oh for the day when everyone uses Keynote instead of PowerPoint, and we move away from slide after slide of bulleted lists and sentences. 

    • http://twitter.com/jmtd Jonathan Dowland

      Or even something like Prezi, or some as-of-yet uninvented presentation system that breaks away from the current mould altogether.

    • KWillets

      They need a Turing-incomplete slide tool.

  • http://www.mrericsir.com MrEricSir

    “Don’t allow Turing-complete input languages” is too general.  If you were going to obey that rule, you couldn’t write a compiler — or a word processor!

    • http://www.pappp.net/ PAPPP

      I’m going to guess you read the blurb and didn’t listen to the talk – they’re talking about protocols as languages, not general-purpose programming languages.    
      The thesis is that one shouldn’t be able to construct a message that causes the recipient to perform arbitrary computation in the process decoding of the message, not that one shouldn’t be able to write programs. 

      • http://twitter.com/maradydd Meredith L Patterson

        Got it in one. Though, amusingly, Travis Goodspeed has threatened to build a “Butlerian Typewriter” — a sub-Turing word processor.

        • martian_bob

          For the paranoid security nut who has everything…

  • mypalmike

    Protocols that have a formal definitions do enable developers to implement more resilient code.  Sadly, as is pointed out, Postel’s principle encourages faulty implementations to linger around; formality helps to make moot the concepts of “liberal” and “conservative” in parsing and generating – the wiggle-room disappears.

    That said, a context-free grammar can be used to generate a reliable state machine (using, say, bison), but that matters little if  my action handling code is crap.  I can get the benefits of using a language spec and automated tooling using XDR (rfc 4506) with less backend code, (plus I get to specify length fields for efficiency).
    Also, I’m curious as to what “Turing-complete input languages” have been encountered by the presenters.  I suspect javascript might be the boogeyman here, but that is clearly a general-purpose language as opposed to an “input language”.

    • http://twitter.com/maradydd Meredith L Patterson

      HTML5+CSS3, without Javascript. Not kidding. http://lambda-the-ultimate.org/node/4222

      I also have an unpleasant suspicion that we’ll be able to pull a Turing machine out of IPv6, but I haven’t delved too deeply into that yet.

      I have a whole lot of feelings about XDR but they’re too extensive for a comment thread; I will try to blog about this in the near future, though.

      • Lizanne Connelly

        So essentially, you feel that XML and SOAP based protocols are untenable from a security point of view, correct?

        • http://twitter.com/timClicks Tim McNamara

          From listening to the talk, valid XML should be fine because it can be defined in a context-free grammar, such as a DTD or an XSD.

      • http://twitter.com/alexstapleton Alex Stapleton

        I’m not convinced that simply having the quality of Turing equivalence is sufficient for a protocol to be considered bad.

        All implemented Turing machines are bounded in various respect. It has not been established that HTML5/CSS’s Turing completeness is an attack vector. The same goes for IPv6, (but not for ASN.1/X.509 which is a total disaster of a standard for many reasons…)

        Simply being a “weird machine” does not make you a hostile machine.

        These people have an excellent understanding of formal languages but a terrible understanding of what security actually is. Security is not binary.

        Early on the talk she says something about the halting problem having 3 output states for Turing complete languages. Yes/No/Maybe. When you formulate security in these terms, she is correct, complex languages are bad. However this is to misunderstand security. The real question is, how confident can you be in an answer given a certain amount of resources to work on it.

        I’d love to see them produce the benchmarks they mention towards the end. Comparing real world performance of context-free and -sensitive grammars. I think without that then this work doesn’t really deserve to call it’s self LANGSEC.

        Disclaimer: 2am is probably not the best time to be trying to tackle these problems.

        Edit: e.g. I only just noticed you actually the speaker, and with more sleep and that information would probably have phrased some of this differently. Very interesting talk :)

        • digi_owl

          I think the issue of weird machines is that they add complexity to a already complex problem. Bayesian filters have enough trouble keeping my inbox clean as it is. Introducing any kind of logic into the very content of the email, so that to determine of the message is ham the filter first has to run the whole logic and then process the output, may well have just squared or cubed the problem.

    • scruss

      PostScript. Back in the days when your printer had more processing power than your computer and a very slow data interface, it made sense to have a Turing-complete page description language. For even more hilarity, a previous program could leave your output device in an undefined state, which could only be discerned by  stack/state parsing and knowledge of proprietary device specifics.

      But it was kinda fun to have the next guy’s print job turn out in mirror writing …

      • digi_owl

        And 28c3 holds a presentation on how to turn a network printer into a attack vector…

    • Spinkter

      I suggest using protobufs  as an alternative to XDR.    It features very easy to use API, and extremely efficient on-the wire encoding.

      We wholesale changed our internal protocol from a home-grown, context-free grammer thingy to protobufs and we haven’t looked back.   It cured all our ills.

      • http://twitter.com/alexstapleton Alex Stapleton

        Don’t protobufs and XDR both suffer from the same sort of Packet-in-Packet attacks that have had a surge in attention recently? This seems to be a quality that is common among context-sensitive grammars.

    • billstewart

      The classic example has been Sendmail configuration files, which have been used to write Turing machines in.  That’s not the source of most of Sendmail’s insecurity (basic bugs in a program running as root will do that), but the size and complexity of the program encourages bugginess.

      Turing machines have also been written in vi macros, and then of course there’s emacs, where programmability was a design goal.

  • asuffield

    Protocols that have a formal definitions do enable developers to implement more resilient code.

    Formal definitions can still be bad, if they describe Turing-complete languages. What we need is protocols that have sane definitions – not exhaustively specified, but rather exhaustively thought through before creating them.

    Also, I’m curious as to what “Turing-complete input languages” have been encountered by the presenters.

    The big culprit here is the nest of insanity around XML. Notably, if your system uses XSLT as an input language then go back and do it again.

  • http://twitter.com/zeroanaphora แอ็ะปปี้

    all this is a bit over my head, but I’m pretty skeptical of the idea that our brains process language exactly like computers. Is she really talking about natural language?

    • http://twitter.com/maradydd Meredith L Patterson

      No, only artificial formal languages, which is to say, the ones that humans design for computers to process.

    • bardfinn

      Aspies (Asperger’s Syndrome diagnosed people) also sometimes process language the way computers do — formally, where individual words, and sometimes also phrases, have a single, rigidly-defined conceptual entity to which the word or phrase refers.
       It makes sarcasm in English and other non-tonal spoken languages difficult for these people, as they discard tonality in processing meaning.

      • PrettyBoyTim

        I’m not a big fan of the terms ‘aspie’ and ‘autie’, but I can’t quite put my finger on why yet. Perhaps it’s because it’s a bit like calling someone with a physical disability a cripple; it’s like reducing that person to that particular aspect of themselves.

        On the other hand, I don’t particularly mind being referred to as a gamer or a cyclist or a coder. Perhaps that’s because they are all descriptions of something I do, rather than something I am.

        • digi_owl

          Then again, there is the issue of thinking of asperger as a disability. Sure, in the current high stress, highly “social” (where if you do not sell yourself at every opportunity your a hippie looser) it may seem like it. But given a quiet spot to work and structure may appear from chaos. The biggest damage done to aspies, and their nearest “cousins” on the autism scale, may well be that they are lumped in with the hard cases that one most often hear about in relation to autism. As Einstein supposedly said, he liked working in the patent office because it provided him with enough quiet to think.

      • digi_owl

        It can be learned tho, just like anything else. And it also allows for finding humor in things that pass most others by.

  • http://twitter.com/maradydd Meredith L Patterson

    Thanks for the shout-out! Credit for the amazing photoshoppery belongs to fantasy artist Kythera of Anevern (http://www.anevern.org), who not only turned these around in the blink of an eye, she matched the original handwriting of the signs perfectly. 

    • http://twitter.com/Kythera Kythera of Anevern

      ^_^;; It’s http://www.anevern.com/, actually, but hey. Thanksn for the shout-out. :]

  • http://noctilucent-studios.blogspot.com/ Noctilucent Studios

    Languages, computers….just this morning I read the most fascinating passage in a book about something called “Zipf’s Law” and a study that was done at B.U. in the 90′s wherein the result was basically that the so called “junk DNA” which comprises 97% of all DNA may actually have a language/message encoded in it.

    Mind Blown.

    Link to an article that explains it better than me: http://www.abc.net.au/science/articles/2001/04/04/133634.htm

    • Stooge

      Unfortunately, the article’s not as insightful as it pretends to be.

      Firstly, the notion that 97% of DNA is junk is roughly on a par with the idea that you only use 10% of your brain in terms of meaningfulness. Yes, only 3% of DNA encodes proteins, but much of the non-coding DNA in our genome is not random, but has been conserved intact over a few hundred million years, which strongly indicates that it does perform some useful function and we simply haven’t figured out what it is yet.

      Secondly, the inference that DNA may have some language encoded in it because it adheres to Zipf’s law is pure nonsense: randomly generated texts also respect Zipf’s law.

      • http://noctilucent-studios.blogspot.com/ Noctilucent Studios

        Thanks for the input. I am still stuck on the part about equating randomly generating texts respecting the Z law with the idea that evolution would allow for the 97% to remain with us for so long. I thought that organisms only moved forward with what was important and needed?

        • bardfinn

          In evolution, populations of organisms move forward with DNA/genetics that /do not prevent them from reproducing on a par with the environment they inhabit/.
          The three important parts there are:
          Do not prevent them from reproducing;
          Keeping up the reproduction;
          The environment they inhabit.

          The fact that those sequences that “respect Zipf’s law” strongly indicate they have a useful function, is a somewhat-informally-specified fact.
          The fact is, that any arbitrary DNA sequence surviving millions of years, has a strong correlation with the utility of that DNA.
          Correlation, however, is distinct from causation.

          The evolutionary process, as noted in the first paragraph, doesn’t select for  those traits that  help organisms survive. It selects against those traits that:
          Prevent reproduction;
          Prevent reproducing on a par with peers;
          Prevent reproducing in the environment they inhabit.

          As long and as “complex” as our DNA is, as far as the replication mechanisms are concerned, it’s no more complex than one long punched paper tape. Replicating it is relatively easy. If the “junk” doesn’t prevent that, then it gets replicated too.

          • digi_owl

            In other words, evolution is like the bootstrap of a PC. The BIOS, even with additions of USB and such, is still fundamentally the same kind of chip that brought up the IBM AT back in the day. But the ISA buss is long gone, PCI and AGP is being replaced by PCIE, and any sane OS basically ignores the BIOS once it had assumed control over the CPU and bus controller.

        • Paul Davis

          “junk” DNA actually has higher information content (measued using Shannon’s definition of information) than the parts that code for protein. but its a mistake to conclude that it too must contain an encoding. biochemistry is primarily structural, and the “junk” DNA plays a major role in physically organizing the protein coding segments. this requires specific properties that result in low entropy/high information when analysed from that perspective. think of it another way: someone once compared the packing of DNA into the nucleus as being like packing several million miles of woollen yarn into a UK telephone booth. when you need to get access to a particular piece of it because it has a message (a protein encoding), do you really think that the non-coding part can just be any old DNA? or is it rather more likely that it plays an important role without actually encoding anything?

        • eraserbones

          Calling DNA that doesn’t encode a protein ‘junk’ is like looking at a computer program and calling every line that doesn’t contain a print statement ‘junk.’

  • Jorpho

    I’ll say this much: no middle-manager is ever going to go for something called “sub-Turing” or “Turing-incomplete”.  It just sounds bad from a marketing standpoint.

    Clearly better terminology is needed here.  “Turing-incomplete” needs to be rebranded.  Like, “Happy Kitten Complete”, or something.

    • http://twitter.com/groxx Groxx

      Why not refer to them as “solvable”, thus making other alternatives “unsolvable”?  Makes them sound dangerous, no?

      • Stooge

        How about “dangerous” and “safe” input languages?

    • http://twitter.com/timClicks Tim McNamara

       You have a manager who makes decisions on protocol design who doesn’t know who Alan Turing is?

      • lysdexia

        Yes.

  • http://noctilucent-studios.blogspot.com/ Noctilucent Studios

    Thanks bardfinn and Paul.

  • s2redux

    …and the object lesson was delivered at CCC when Klink and Walde described the hash-collision flaws in PHP, ASP.NET, Java, Python, Ruby, Tomcat, Geronimo, Jetty, Glassfish, and the V8 engine; CPU DOS attacks via crafty POST submissions. When the day comes that Greg Bear’s implant-augmented self-contained homorphs become a reality, we’ll need to add a new mortality factor to the medical lexicon: “No, it wasn’t an aneurism; she died from O(n**2).”

  • David Llopis

    This is the best talk on Internet security I’ve seen. Formal grammar theory: it’s not just for eggheads anymore.

  • willu

    Not sure why she was so against length fields.  Unbounded length fields are not regular, but if you have a maximum size for your length field (and good protocols that use length fields do) then that IS regular.

    • http://twitter.com/timClicks Tim McNamara

      Variable length fields are okay, but the length shouldn’t be defined within the message. That is, imagine a protocol that has a definition of something like “read 1 header byte, which contains the length of the rest of the message”:

        9Like this

      This kind of protocol makes the grammar context-sensitive. In order to properly parse the message, you need to pull in context from the message itself. That opens up the possibility for problems.

      • willu

        Well, no. :)  That was my point.  You’re saying what she was saying, and I think you’re both incorrect.

        The protocol “Read 1 header byte, which contains the length of the rest of the message” IS regular.  This is because the ’1 header byte’ has a maximum size of 255.  For ease of explication, imagine the protocol is actually ‘read 2 bits which contain the length of the rest of the message’, i.e. the message is 0,1,2 or 3 bytes long.  That can be parsed with a regular expression: ((00())|(01(.))|(10(..))|(11(…))), assuming the xx’s on the front match bits :).  It is an ugly regular expression, with one branch for each of the possible lengths, but it is a regular expression.

        Matching like this only becomes non-regular when the protocol allows an arbitrary integer for the length (you’d end up with an infinite number of options for all the different lengths).

        You have to be somewhat careful with this argument, as the same argument applies holding that your computer is not actually Turing complete because it only has finite memory.  Your computer is ‘really’ just a very complex finite state machine.  Taken to that extreme the argument is true, but unhelpful as the ‘very complex finite state machine’ is so complex as to be very difficult to analyse.

        But Pascal style length delimited strings are very simple, the reduction to regular languages well understood and I don’t see why they’d be a security hole.

        I should also add that I found the talk in general very interesting.  She made a lot of great points.  I particularly liked the comment about being explicit about what grammar you’re accepting and parsing it properly.  I’m just nit-picking about one little point.

  • http://twitter.com/timClicks Tim McNamara

     What do we do with bit strings that don’t fit in memory, such as when a browser streams video?

    • willu

      I don’t remember her saying that incremental parsing was bad.  She just said that you should separate your grammar out, be clear about what language class it falls in, and don’t write your own special-purpose parser.

  • DewiMorgan

    I, too, am at a loss what the difference is between:

    1w|2ww|3w{3}|4w{4}|…|255w{255}

    …which is very clearly regular, and:

    (byte length N)(N word characters)

    I also completely fail to understand how you can have a non-escaping tokenisation system which can encode, say, a human-written description of the tokenisation  system, including a description of the tokens.

    If your data is arbitrary binary data (from a jpg, say) then there is *no conceivable way* to create a delimiter that cannot appear in the data.

    You would instead have to, say, encode the binary in base-64, so that there are some bytes you can use as delimiters. So, this entire paradigm is completely unusable for raw binary transmission.

    Not only that, but as far as delimiters go, I can only quote her , and “refer you to the past 20 years of SQL injections” – which typically don’t rely on escaping, but on *inserting delimiters*. And not in an invalid way that can be captured by the SQL parser, but in an entirely *valid* way. Like little Bobby Tables: no escaping there.

    She made a good point about the problem of complexity: but she failed completely to specify how exactly she intended people to get all the utility of the modern web, without having all the power of the modern web.

    So: it’s a nice rule of thumb for non-binary protocols, moving forward; but I challenge anyone to create even a half-decent compression protocol that complies with it.