"Adversarial perturbations" reliably trick AIs about what kind of road-sign they're seeing

An "adversarial perturbation" is a change to a physical object that is deliberately designed to fool a machine-learning system into mistaking it for something else.
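To make the mechanism concrete, here's a minimal numpy sketch on a toy two-class linear classifier (all weights and numbers below are made up for illustration, not taken from any real road-sign model): a small, targeted nudge to the input, in the direction given by the model's own gradient, flips the prediction.

```python
import numpy as np

# Toy linear classifier; weights and input are illustrative only.
W = np.array([[1.0, 0.0],   # class 0 weights
              [0.0, 1.0]])  # class 1 weights
x = np.array([1.0, 0.8])    # scores (1.0, 0.8): classified as class 0

# For a linear model, the gradient of (target score - current score)
# with respect to the input is W[target] - W[0]; a small signed step
# along it is the classic "fast gradient sign" move.
target = 1
eps = 0.2
x_adv = x + eps * np.sign(W[target] - W[0])

print(int(np.argmax(W @ x)))      # 0: original prediction
print(int(np.argmax(W @ x_adv)))  # 1: fooled by a small nudge
```

Real attacks do the same thing with a deep network's gradients, but the principle is identical: the perturbation is tiny by human standards and precisely aimed at the model's decision boundary.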

Last March, a French/Swiss team published a paper on "universal adversarial perturbations": a single set of squiggly lines that could be merged with almost any image in a way that humans couldn't generally spot, and which screwed up machine-learning systems' guesses about what they were seeing.

Now a team from the University of Washington, the University of Michigan, Stony Brook and Berkeley has published a paper on "Robust Physical Perturbations" (or "RP2") that reliably fool the kinds of vision systems used by self-driving cars to identify road signs.

The team demonstrates two approaches. In the first, the "poster attack," they print a replacement road sign, such as a Right Turn or Stop sign, with subtle irregularities in its background and icon that trick machine-learning systems; in the second, the "sticker attack," they create stickers that look like ordinary vandalism but, when applied, also fool the vision systems. In both cases, the attacks keep working when the machine-learning system views the sign from multiple angles and distances -- and in both cases, it's not obvious to humans that the signs have been sabotaged to fool a computer.
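The core trick behind the sticker attack can be sketched as a masked perturbation: only the pixels under the sticker's footprint change, and everything else remains the real sign. (Shapes and values below are illustrative stand-ins; the actual paper optimizes the masked perturbation against the classifier across many viewpoints.)

```python
import numpy as np

# Stand-ins for a photographed sign and a sticker-shaped mask;
# shapes and values are illustrative, not from the paper.
sign = np.ones((32, 32))
mask = np.zeros_like(sign)
mask[20:28, 4:12] = 1.0          # the sticker's footprint

delta = np.full_like(sign, 0.5)  # candidate perturbation (in the real
                                 # attack, optimized against the model)
attacked = sign + mask * delta   # change pixels only under the mask

# Everywhere outside the sticker, the sign is pixel-for-pixel intact:
print(bool(np.array_equal(attacked[mask == 0], sign[mask == 0])))  # True
```

This is why the sabotage is so hard to spot: to a human, the result looks like a sign with a sticker on it, which is exactly what it is.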

The key here is "adversarial" computing. Existing machine-learning systems assume that road signs might be inadvertently obscured by graffiti, wear, snow, dirt, etc., but they do not assume that an adversary will deliberately sabotage the signs to trick the computer. This is a common problem for machine-learning approaches: Google's original PageRank algorithm was able to extract useful information about the relative quality of web pages by counting the inbound links to each one, but once that approach started to work well and make a difference for web publishers, it wasn't hard to fool PageRank by manufacturing links between websites that existed for the sole purpose of tricking the algorithm.
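The PageRank example fits in a few lines. Below is a toy power-iteration PageRank (made-up pages, standard 0.85 damping): two equally obscure pages start tied, then one of them builds a link farm and overtakes the other.

```python
# Toy PageRank via power iteration; graph and page names are made up.
def pagerank(links, d=0.85, iters=100):
    pages = sorted(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            for q in outs:
                nxt[q] += d * r[p] / len(outs)  # p shares its rank
        r = nxt
    return r

# Honest web: A is the hub; B and C are equally obscure.
honest = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
before = pagerank(honest)

# C builds a link farm: ten pages that exist only to link to C.
farm = dict(honest)
for i in range(10):
    farm[f"spam{i}"] = ["C"]
after = pagerank(farm)

print(before["B"] == before["C"])  # True: tied before the farm
print(after["C"] > after["B"])     # True: C now outranks B
```

The manipulation costs nothing but fake pages, which is exactly why naive link-counting stopped working once publishers had an incentive to game it.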

The team's approach does not require that an attacker have access to the training data or source code, but it does assume "white box" access to the machine-vision system, meaning "access to the classifier after it has been trained." The authors note, though, that "even without access to the actual model itself, by probing the system, attackers can usually figure out a similar surrogate model based on feedback."
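That "surrogate model" idea is easy to demonstrate on a toy: probe a hidden model, fit your own copy to its responses, then attack the copy. (Everything below is a made-up linear example, not the paper's procedure; real attackers would train a substitute network on the probe data, and here the black box is assumed to return confidence scores rather than just labels.)

```python
import numpy as np

rng = np.random.default_rng(1)
W_secret = rng.normal(size=(3, 5))   # the victim's weights, never shown

def black_box(x):
    return W_secret @ x              # attacker only sees these scores

# Probe: random queries plus the observed responses.
X = rng.normal(size=(200, 5))
Y = np.array([black_box(x) for x in X])

# Fit a surrogate to the probe data by least squares.
W_surrogate = np.linalg.lstsq(X, Y, rcond=None)[0].T

# The surrogate is now a near-perfect stand-in for the hidden model,
# so attacks crafted against it transfer to the real one.
print(bool(np.allclose(W_surrogate, W_secret)))  # True
```

For a linear black box the recovery is essentially exact; for a neural network the surrogate is only approximate, but adversarial examples are known to transfer well between similar models, which is what makes black-box attacks practical.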

1) Camouflage Graffiti Attack: Following the process outlined above, we generate perturbations in the shape of the text “LOVE HATE” and physically apply them on a real Stop sign (Figure 5). Table II shows the results of this experiment. This attack succeeds in causing 73.33% of the images to be misclassified. Of the misclassified images, only one image was classified as a Yield sign rather than a Speed Limit 45 sign, thus resulting in a 66.67% targeted misclassification success rate with an average confidence of 47.9% and a standard deviation of 18.4%. For a baseline comparison, we took pictures of the Stop sign under the same conditions without any sticker perturbation. The classifier correctly labels the clean sign as a Stop for all of the images with an average confidence of 96.8% and a standard deviation of 3%.

2) Camouflage Abstract Art Attack: Finally, we execute a sticker attack that applies the perturbations resembling abstract art physically to a real-world sign. While executing this particular attack, we noticed that after a resize operation, the perturbation regions were shortened in width at higher angles. This possibly occurs in other attacks as well, but it has a more pronounced effect here because the perturbations are physically smaller on average than the other types. We compensated for this issue by increasing the width of the perturbations physically. In this final test, we achieve a 100% misclassification rate into our target class, with an average confidence for the target of 62.4% and standard deviation of 14.7% (See Figure 6 for an example image).

Robust Physical-World Attacks on Machine Learning Models [Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, and Dawn Song/Arxiv]

(via JWZ)