Scalable stylometry: can we de-anonymize the Internet by analyzing writing style?


16 Responses to “Scalable stylometry: can we de-anonymize the Internet by analyzing writing style?”

  1. Hakuin says:

    or you could have many nyms and fists.

  2. Paul Renault says:

    Cory, Tim Campbell, author of Pyroto Mountain, used to spend a lot of time hunting for zombies back in the old days. He told me that analysis of the way the wizards typed (speed, pauses, etc.), their word count, word length, and other factors would usually reveal duplicate wizards.

    (BTW: I played on Tim’s Mountain 1A, which ran on his IBM PC-XT with paired-up IBM non-XT 43W PSUs. Ah, the good ol’ days…)

    /off to read

  3. They no can tell whoo I iz bye how I rite. I right krazee nau so they no can tell who I iz.

    • awjt says:

      There’s another layer, beyond grammar and vocabulary.  We could analyze your presence on the interwebs, handles used, types of things you like to comment on, etc.  Meta-data about the writer could come in just as handy as the writing itself.

    • Kimmo says:

      Yeah, well when I saw this…

      The summary cites another paper by someone who found that even unaided efforts to disguise one’s style makes stylometric analysis much less effective.

      …I couldn’t help thinking that such unaided efforts were likely to simply produce another recognisable style (or styles) that could then potentially be linked to your main identity.

  4. petsounds says:

    I’ve thought about this too (because I’m paranoid), but more from the aspect of things you mention. “Oh, this post mentioned Ada Lovelace. And this other one mentioned liking Fawlty Towers. We’ve got a match, boys!”

  5. robdobbs says:

    There’s still lots of time to attend other interesting technical presentations in 2012, Cory. Don’t give up yet!

  6. MrEricSir says:

    Sounds like a job for Aaron Barr!

  7. ultranaut says:

    This is why I adopt distinctly different writing styles for each pseudonym I use.

  8. Susan Carley Oliver says:

    I wonder if this could be used as a hedge against cheating in high school and college writing classes.  Analyse each student’s writing from an in-class free-form writing exercise at the beginning of the term, then run all future writing assignments against this initial profile.

  9. Roy Trumbull says:

    One test for the amateur is to pick up a magazine or book that pretends to contain letters about sex issues sent in by various people. One quickly notes there is no variation at all. Every last one was written by the same person.
    On the other hand, Mark Twain pointed out, using several texts claimed to come from a common author, that the writer of one couldn’t have written the other.
    The Bible is a cobbled-together collection of various sources, and the redactors couldn’t refrain from making improvements.
    My favorite “sticks out like a sore thumb” line comes from Genesis 2:24: “Therefore shall a man leave his father and his mother, and shall cleave unto his wife, and they shall be one flesh.”
    It doesn’t fit at all with what precedes and what follows. A real shoehorn insert.

  10. Moriarty says:

     This is exactly why I’m always so bland and unoriginal in everything I write.

  11. bardfinn says:

    To answer the question posed by the headline:



    Run against a set of de-nominised YouTube comments on popular videos. Measure percentage of comments correctly attributed. Corollary: Despair for Humanity.
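
For anyone who wants to try the experiment suggested in this thread (build a profile per known writer, attribute anonymous text to the closest profile, then measure the percentage attributed correctly), here is a minimal sketch. It is not the method from the paper under discussion. The feature set (a short list of function words plus mean word length), the cosine-similarity matching, and every name in the code (FUNCTION_WORDS, profile, attribute, accuracy) are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter

# Assumption: a tiny, hand-picked list of English function words.
# Real stylometric systems use far richer feature sets.
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in",
                  "i", "that", "is", "it", "you", "so"]


def profile(text):
    """Turn a text into a feature vector: the relative frequency of each
    function word, followed by the mean word length scaled by 1/10."""
    words = re.findall(r"[a-z']+", text.lower())
    total = max(len(words), 1)  # avoid division by zero on empty text
    counts = Counter(words)
    vec = [counts[w] / total for w in FUNCTION_WORDS]
    vec.append(sum(len(w) for w in words) / total / 10.0)
    return vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def attribute(text, known):
    """Return the author in `known` (author -> profile vector) whose
    profile is most similar to the profile of `text`."""
    p = profile(text)
    return max(known, key=lambda author: cosine(p, known[author]))


def accuracy(labelled, known):
    """Fraction of (true_author, text) pairs that `attribute` gets right."""
    if not labelled:
        return 0.0
    hits = sum(1 for author, text in labelled
               if attribute(text, known) == author)
    return hits / len(labelled)
```

To use it, build `known` from a writing sample per author, for example `known = {"alice": profile(sample_a), "bob": profile(sample_b)}` (hypothetical names), then call `accuracy(held_out_pairs, known)` on texts that were not used to build the profiles. With samples as short as a typical comment, expect a feature set this small to perform poorly.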
