Google's troll-fighting AI can be defeated by typos

Jigsaw is a "wildly ambitious" Google spin-off research unit that recently released Perspective, a machine-learning system designed to identify argumentative, belittling and mean-spirited online conversation. Within days of its release, independent researchers published a paper demonstrating a way of tricking Perspective into trusting ugly messages, simply by introducing human-readable misspellings into their prose.

In Deceiving Google's Perspective API Built for Detecting Toxic Comments, a group of University of Washington Network Security Lab researchers demonstrate the technique, which harkens back to the earliest filtering arms-races on the internet — for example, when AOL used regular expressions to block swearing in its chat-rooms, a filter defeated with simple substitutions (e.g. "phuck" and "sh1t").
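That older weakness is easy to reproduce in a few lines. Here is a minimal sketch (in Python, with an invented two-word blocklist, not AOL's actual rules) of a fixed-pattern filter and how trivial substitutions slip past it:

```python
import re

# A naive profanity filter in the spirit of AOL's early chat-room rules:
# block any message containing a word on a fixed list.
BLOCKLIST = re.compile(r"\b(stupid|idiot)\b", re.IGNORECASE)

def is_blocked(message: str) -> bool:
    return bool(BLOCKLIST.search(message))

print(is_blocked("you idiot"))       # True  -- caught by the pattern
print(is_blocked("you id1ot"))       # False -- one character swap slips past
print(is_blocked("you i d i o t"))   # False -- spacing breaks the word match
```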

Not only is Perspective easy to trick into giving good marks to bad messages, it's also prone to flagging benign messages that merely contain "abusive" words in negated form, like "not stupid" or "not an idiot."

But that AI still needs some training, as researchers at the University of Washington's Network Security Lab recently demonstrated. In a paper published on February 27, Hossein Hosseini, Sreeram Kannan, Baosen Zhang, and Radha Poovendran showed that they could fool the Perspective AI into giving a low toxicity score to comments it would otherwise flag, simply by misspelling key hot-button words (such as "iidiot") or inserting punctuation into the word ("i.diot" or "i d i o t," for example). By gaming the AI's parsing of text, they got scores low enough for comments to pass a toxicity check even though those comments would normally be flagged as abusive.
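The perturbations themselves are mechanical. A rough sketch of the kinds of rewrites described above — doubling a letter, inserting a dot, or spacing a word out — might look like this (the word list and the choice of variant are placeholders, not the researchers' actual code):

```python
# Generate a few simple adversarial spellings of a hot-button word.
def perturb(word: str) -> list[str]:
    doubled = word[0] + word            # "idiot" -> "iidiot"
    dotted = word[0] + "." + word[1:]   # "idiot" -> "i.diot"
    spaced = " ".join(word)             # "idiot" -> "i d i o t"
    return [doubled, dotted, spaced]

def rewrite_comment(comment: str, hot_words: list[str]) -> str:
    # Swap each hot-button word for one of its perturbed spellings.
    for word in hot_words:
        if word in comment:
            comment = comment.replace(word, perturb(word)[1])  # the dotted variant
    return comment

print(rewrite_comment("you are an idiot", ["idiot"]))  # "you are an i.diot"
```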

"One type of the vulnerabilities of machine learning algorithms is that an adversary can change the algorithm output by subtly perturbing the input, often unnoticeable by humans," Hosseini and his co-authors wrote. "Such inputs are called adversarial examples, and have been shown to be effective against different machine learning algorithms even when the adversary has only a black-box access to the target model."

Deceiving Google's Perspective API Built for Detecting Toxic Comments [Hossein Hosseini, Sreeram Kannan, Baosen Zhang and Radha Poovendran/University of Washington]

Google's anti-trolling AI can be defeated by typos, researchers find
[Sean Gallagher/Ars Technica]