Scalable stylometry: can we de-anonymize the Internet by analyzing writing style?

One of the most interesting technical presentations I attended in 2012 was the talk on "adversarial stylometry" given by a Drexel University research team at the 28C3 conference in Berlin. "Stylometry" is the practice of trying to ascribe authorship to an anonymous text by analyzing its writing style; "adversarial stylometry" is the practice of resisting stylometric de-anonymization by using software to remove distinctive characteristics and voice from a text.

Stanford's Arvind Narayanan describes a paper he co-authored on stylometry that has been accepted for the IEEE Symposium on Security and Privacy 2012. In "On the Feasibility of Internet-Scale Author Identification" (PDF), Narayanan and co-authors show that they can use stylometry to improve the reliability of de-anonymizing blog posts drawn from a large and diverse data-set, using a method that scales well. However, the experimental set was not "adversarial" -- that is, the authors took no countermeasures to disguise their authorship. It would be interesting to see how the approach described in the paper performs against texts that are deliberately anonymized, with and without computer assistance. The summary cites another paper by someone who found that even unaided efforts to disguise one's style make stylometric analysis much less effective.

We made several innovations that allowed us to achieve the accuracy levels that we did. First, contrary to some previous authors who hypothesized that only relatively straightforward “lazy” classifiers work for this type of problem, we were able to avoid various pitfalls and use more high-powered machinery. Second, we developed new techniques for confidence estimation, including a measure very similar to “eccentricity” used in the Netflix paper. Third, we developed techniques to improve the performance (speed) of our classifiers, detailed in the paper. This is a research contribution by itself, but it also enabled us to rapidly iterate the development of our algorithms and optimize them.

In an earlier article, I noted that we don’t yet have as rigorous an understanding of deanonymization algorithms as we would like. I see this paper as a significant step in that direction. In my series on fingerprinting, I pointed out that in numerous domains, researchers have considered classification/deanonymization problems with tens of classes, with implications for forensics and security-enhancing applications, but that to explore the privacy-infringing/surveillance applications the methods need to be tweaked to be able to deal with a much larger number of classes. Our work shows how to do that, and we believe that insights from our paper will be generally applicable to numerous problems in the privacy space.

Is Writing Style Sufficient to Deanonymize Material Posted Online? (via Hack the Planet)
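
For readers who want a feel for the mechanics, here is a minimal sketch in Python of the two ideas in the quoted passage: scoring candidate authors against an anonymous text, and estimating confidence in the top match. It is not the paper's method. The function-word list, the cosine-similarity scoring, and every function name below are illustrative assumptions. The confidence measure follows the "eccentricity" idea from the Netflix paper (the gap between the best and second-best score, in units of the standard deviation of all scores); the stylometry paper says only that it uses something "very similar".

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a handful of common English function words.
# Real stylometric systems use far richer features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_authors(anonymous_text, known_texts):
    """Score every known author against the anonymous text, best match first."""
    target = profile(anonymous_text)
    scores = {name: cosine(target, profile(text)) for name, text in known_texts.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def eccentricity(ranked):
    """(best score - second-best score) / population std. deviation of all scores."""
    values = [score for _, score in ranked]
    if len(values) < 2:
        return 0.0
    mean = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return (values[0] - values[1]) / sigma if sigma else 0.0


# Toy data, invented for this example.
known = {
    "alice": "the cat and the dog and the bird sat in the sun",
    "bob": "it is a truth that it is a fact but it is to be seen",
    "carol": "to be of use is to be of help to a friend in need",
}
ranked = rank_authors("the fox and the hen and the owl hid in the barn", known)
print(ranked[0][0], eccentricity(ranked))
```

A low eccentricity means the top candidate barely stands out from the runner-up, so a cautious system would decline to name anyone. With many thousands of candidate authors, that kind of abstention is what keeps the false-positive rate manageable.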


  1. Cory, Tim Campbell, author of Pyroto Mountain, used to spend a lot of time hunting for zombies back in the old days.  He told me that analysis of the way the wizards typed (speed, pauses, etc), their word count, word length, and other factors would usually reveal duplicate wizards.

    (BTW: I played on Tim’s Mountain 1A, running on his IBM PC-XT running paired-up IBM non-XT 43W PSUs. Ah, the good ol’ days…)

    /off to read

    1. There’s another layer, beyond grammar and vocabulary.  We could analyze your presence on the interwebs, handles used, types of things you like to comment on, etc.  Meta-data about the writer could come in just as handy as the writing itself.

    2. Yeah, well when I saw this…

      The summary cites another paper by someone who found that even unaided efforts to disguise one’s style make stylometric analysis much less effective.

      …I couldn’t help thinking that such unaided efforts were likely to simply produce another recognisable style (or styles) that could then potentially be linked to your main identity.

  2. I’ve thought about this too (because I’m paranoid), but more from the aspect of things you mention. “Oh, this post mentioned Ada Lovelace. And this other one mentioned liking Fawlty Towers. We’ve got a match, boys!”

  3. There’s still lots of time to attend other interesting technical presentations in 2012, Cory. Don’t give up yet!

    1. Xeni covered this 4-5 years ago on Boing Boing in a post about the “Dark Web” project which was doing the same thing.

      I brought up Obfuscating Document Stylometry to Preserve Author Anonymity in the thread here:

      It’s surely far more advanced by now.  So much so that your different writing styles are probably not much of a match for it.

      How did I remember this? Fish oil is GAWD!!!11

  4. I wonder if this could be used as a hedge against cheating in high school and college writing classes.  Analyse each student’s writing from an in-class free-form writing exercise at the beginning of the term, then run all future writing assignments against this initial profile.

  5. One test for the amateur is to pick up a magazine or book that pretends to contain letters re sex issues sent in by various people. One quickly notes there is no variation at all. Every last one was written by the same person. 
    On the other hand, Mark Twain pointed out, using several texts claimed to come from a common author, that the writer of one couldn’t have written the other.
    The Bible is a cobbled-together collection of various sources, and the redactors couldn’t refrain from making improvements.
    My favorite “sticks out like a sore thumb” line comes from Genesis 2:24: “Therefore shall a man leave his father and his mother, and shall cleave unto his wife, and they shall be one flesh.”
    It doesn’t fit at all with what precedes and what follows. A real shoehorn insert.

  6. To answer the question posed by the headline:



    Run against a set of de-nominised YouTube comments on popular videos. Measure percentage of comments correctly attributed. Corollary: Despair for Humanity.
