A million anti-Net Neutrality comments reportedly fake

Over on Hackernoon, data scientist and "language nerd" Jeff Kao has posted the results of his analysis of Net Neutrality comments submitted to the FCC between April and October 2017. Using natural language processing techniques, he looked for suspicious patterns in the language of the comments. What he found was alarming. In his words:

The first and largest cluster of pro-repeal documents was especially notable. Unlike the other clusters I found (which contained a lot of repetitive language) each of the comments here was unique; however, the tone, language, and meaning across each comment was largely uniform. The language was also a bit stilted. Curious to dig deeper, I used regular expressions to match up the words in the clustered comments:

It turns out that there are 1.3 million of these. Each sentence in the faked comments looks like it was generated by a computer program. A mail merge swapped in a synonym for each term to generate unique-sounding comments. It was like mad-libs, except for astroturf.
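To make the "mad-libs" idea concrete, here is a minimal Python sketch of how a synonym-swapping mail merge could work, and how a regular expression could match its output. This is not Kao's code: the `SLOTS` template, its phrases, and the helper names `generate_comments` and `build_pattern` are all invented for illustration.

```python
import itertools
import re

# Hypothetical template: each slot holds interchangeable phrases.
# These phrases are invented for illustration, not taken from the FCC comments.
SLOTS = [
    ["I urge", "I ask", "I encourage"],
    ["the FCC", "the commission"],
    ["to repeal", "to undo", "to reverse"],
    ["the Title II rules", "the net neutrality order"],
]


def generate_comments(slots):
    """Yield every comment produced by picking one phrase per slot."""
    for choice in itertools.product(*slots):
        yield " ".join(choice) + "."


def build_pattern(slots):
    """Build one regex that matches any comment the template can produce."""
    groups = [
        "(?:" + "|".join(re.escape(phrase) for phrase in slot) + ")"
        for slot in slots
    ]
    return re.compile(r"\s+".join(groups) + r"\.")


comments = list(generate_comments(SLOTS))
pattern = build_pattern(SLOTS)

# Every generated comment is a distinct string, yet all match one pattern.
print(len(comments), "unique comments from", len(SLOTS), "slots")
print(all(pattern.fullmatch(c) for c in comments))
```

Even this four-slot toy yields dozens of distinct strings, because the number of variants is the product of the slot sizes. A longer template with more synonyms per slot would plausibly reach very large counts the same way.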

When laying just five of these side-by-side with highlighting, as above, it's clear that there's something fishy going on. But when the comments are scattered among 22+ million, often with vastly different wordings between comment pairs, I can see how it's hard to catch. Semantic clustering techniques, and not typical string-matching techniques, did a great job at nabbing these.
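The gap between string matching and meaning-based comparison can be shown with a toy Python sketch. It is only a stand-in for semantic similarity, not the clustering Kao used: the `CONCEPTS` synonym table, the two sample sentences, and the helper names are all hypothetical, and a real analysis would rely on a proper language model rather than a hand-written table.

```python
# Hypothetical synonym table mapping surface words to a shared concept label.
CONCEPTS = {
    "repeal": "REMOVE", "undo": "REMOVE", "reverse": "REMOVE",
    "rules": "REGULATION", "regulations": "REGULATION", "order": "REGULATION",
    "urge": "REQUEST", "ask": "REQUEST", "encourage": "REQUEST",
}


def word_set(text):
    """Lowercased set of words, ignoring periods."""
    return set(text.lower().replace(".", "").split())


def concept_set(text):
    """Same as word_set, but with known synonyms collapsed to one concept."""
    return {CONCEPTS.get(word, word) for word in word_set(text)}


def jaccard(a, b):
    """Share of items two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


first = "I urge the FCC to repeal the rules."
second = "I ask the FCC to undo the regulations."

raw_score = jaccard(word_set(first), word_set(second))
concept_score = jaccard(concept_set(first), concept_set(second))

print("exact match:", first == second)
print("word overlap:", raw_score)
print("concept overlap:", concept_score)
```

The two sentences fail an exact string comparison and share only some of their words, but once synonyms are collapsed they are indistinguishable.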

Finally, it was particularly chilling to see these spam comments all in one place, as they are exactly the type of policy arguments and language you expect to see in industry comments on the proposed repeal, or, these days, in the FCC Commissioner's own statements lauding the repeal.