"Adversarial perturbations" reliably trick AIs about what kind of road-sign they're seeing

An "adversarial perturbation" is a change to a physical object that is deliberately designed to fool a machine-learning system into mistaking it for something else.


Last March, a French/Swiss team published a paper on the "universal adversarial perturbation," a set of squiggly lines that could be merged with images in a way humans couldn't generally spot, and that reliably screwed up machine-learning systems' guesses about what they were seeing.


Now a team from the University of Washington, the University of Michigan, Stony Brook and Berkeley has published a paper on "Robust Physical Perturbations" (RP2) that reliably fool the kinds of vision systems used by self-driving cars to identify road-signs.


The team demonstrates two different approaches. In the first, the "poster attack," they print a replacement road-sign, such as a Right Turn sign or Stop sign, whose background and icon contain subtle irregularities that trick machine-learning systems; in the second, the "sticker attack," they create stickers that look like common vandalism but that, once applied, also fool the vision systems. In both cases the attacks keep working when the machine-learning system views the sign from multiple angles and distances, and in both cases it's not obvious to humans that the signs have been sabotaged to fool a computer.
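The distinguishing constraint of the sticker attack is that only pixels inside sticker-shaped regions are allowed to change. Below is a hedged sketch of that constraint using a binary mask; the image, rectangle positions and perturbation values are random placeholders rather than an optimized attack.

```python
# Hedged sketch: restricting a perturbation to sticker-shaped regions with a
# binary mask, the basic constraint behind a "sticker"-style attack. The
# perturbation values here are random placeholders, not an optimized attack.
import numpy as np

rng = np.random.default_rng(1)
sign = rng.uniform(0.0, 1.0, size=(32, 32, 3))   # stand-in for a sign image

# Binary mask: 1 inside two small rectangles ("stickers"), 0 elsewhere.
mask = np.zeros((32, 32, 1))
mask[4:10, 6:26] = 1.0      # upper sticker
mask[22:28, 6:26] = 1.0     # lower sticker

delta = rng.uniform(-0.5, 0.5, size=sign.shape)  # placeholder perturbation

# Only pixels under the mask change; everything else is the original sign.
attacked = np.clip(sign + mask * delta, 0.0, 1.0)

changed = np.abs(attacked - sign).sum(axis=-1) > 0
print(f"pixels modified: {changed.sum()} of {changed.size}")
```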

The key word here is "adversarial." Existing machine-learning systems assume that road-signs might be inadvertently obscured by graffiti, wear, snow, dirt and so on, but they don't assume that an adversary will deliberately sabotage the signs to trick the computer. This is a recurring problem for machine-learning approaches: Google's original PageRank algorithm extracted useful information about the relative quality of web pages by counting the inbound links to each one, but once that signal started to work well and make a difference for web publishers, it wasn't hard to fool PageRank by manufacturing links between websites that existed for the sole purpose of gaming the algorithm.
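For readers who want to see the PageRank point mechanically, here is a toy power-iteration implementation on a made-up web of a few pages: adding a handful of sham pages whose only outbound link points at one target inflates that target's score. The graph, damping factor and iteration count are all illustrative.

```python
# Toy illustration of the PageRank point above: sham pages that exist only to
# link to a target page boost the target's score. Graph and numbers are made up.
import numpy as np

def pagerank(links, damping=0.85, iters=100):
    """links[i] lists the pages that page i links to."""
    n = len(links)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.full(n, (1.0 - damping) / n)
        for i, outs in enumerate(links):
            if outs:
                share = damping * rank[i] / len(outs)
                for j in outs:
                    new[j] += share
            else:                          # dangling page: spread rank evenly
                new += damping * rank[i] / n
        rank = new
    return rank

# An honest little web: pages 0-3 link among themselves; page 3 is our target.
honest = [[1, 2], [2], [0, 3], [0]]
print("target's rank in the honest web:     ", round(float(pagerank(honest)[3]), 3))

# The same web plus five sham pages (4-8) whose only link points at the target.
farmed = honest + [[3]] * 5
print("target's rank with a small link farm:", round(float(pagerank(farmed)[3]), 3))
```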


The team's approach does not require access to the training data or to the classifier's code, but it does assume "white box" access to the machine-vision system, meaning "access to the classifier after it has been trained." The authors argue this isn't a serious limitation, because "even without access to the actual model itself, by probing the system, attackers can usually figure out a similar surrogate model based on feedback."
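A hedged sketch of that surrogate-model idea: query a black box with inputs of your choosing, record the labels it returns, and fit your own model to mimic them. The hidden linear rule and the perceptron below are stand-ins; a real attacker would probe an actual sign classifier and train a comparable network on its answers.

```python
# Hedged sketch of the "surrogate model" idea quoted above: probe a black-box
# classifier, record its answers, and fit your own model to mimic it. The
# "black box" here is just a hidden linear rule, not a real vision system.
import numpy as np

rng = np.random.default_rng(2)
dim, num_probes = 16, 2000

# The target model the attacker cannot inspect (stand-in for a sign classifier).
_secret_w = rng.normal(size=dim)
def black_box(x):
    return int(x @ _secret_w > 0)      # returns only a label, no internals

# 1. Probe the black box with inputs of our choosing.
probes = rng.normal(size=(num_probes, dim))
labels = np.array([black_box(x) for x in probes])

# 2. Fit a surrogate (here: a perceptron) to the recorded input/label pairs.
w = np.zeros(dim)
for _ in range(20):
    for x, y in zip(probes, labels):
        pred = int(x @ w > 0)
        w += (y - pred) * x            # perceptron update

# 3. Check how often the surrogate agrees with the target on fresh inputs.
test = rng.normal(size=(1000, dim))
agree = np.mean([int(x @ w > 0) == black_box(x) for x in test])
print(f"surrogate agrees with the black box on {agree:.1%} of fresh inputs")
```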


1) Camouflage Graffiti Attack: Following the process outlined above, we generate perturbations in the shape of the text "LOVE HATE" and physically apply them on a real Stop sign (Figure 5). Table II shows the results of this experiment. This attack succeeds in causing 73.33% of the images to be misclassified. Of the misclassified images, only one image was classified as a Yield sign rather than a Speed Limit 45 sign, thus resulting in a 66.67% targeted misclassification success rate with an average confidence of 47.9% and a standard deviation of 18.4%. For a baseline comparison, we took pictures of the Stop sign under the same conditions without any sticker perturbation. The classifier correctly labels the clean sign as a Stop for all of the images with an average confidence of 96.8% and a standard deviation of 3%.
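To unpack those numbers: the misclassification rate counts any photo of the attacked sign that gets a wrong label, while the targeted rate counts only photos pushed into the attacker's chosen class (here, Speed Limit 45). A hedged sketch with made-up predictions:

```python
# Hedged sketch of the metrics quoted above, on made-up (label, confidence)
# outputs: misclassification rate counts any wrong label, targeted rate counts
# only images pushed into the attacker's chosen class.
import numpy as np

TRUE_LABEL, TARGET_LABEL = "stop", "speed_limit_45"

# Placeholder classifier outputs for photos of one attacked sign.
predictions = [
    ("speed_limit_45", 0.61), ("speed_limit_45", 0.44), ("stop", 0.88),
    ("yield", 0.52), ("speed_limit_45", 0.35), ("stop", 0.91),
]

wrong    = [(lbl, c) for lbl, c in predictions if lbl != TRUE_LABEL]
targeted = [(lbl, c) for lbl, c in wrong if lbl == TARGET_LABEL]

print(f"misclassification rate:          {len(wrong) / len(predictions):.1%}")
print(f"targeted misclassification rate: {len(targeted) / len(predictions):.1%}")
conf = np.array([c for _, c in targeted])
print(f"target-class confidence:         {conf.mean():.1%} +/- {conf.std():.1%}")
```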

2) Camouflage Abstract Art Attack: Finally, we execute a sticker attack that applies the perturbations resembling abstract art physically to a real-world sign. While executing this particular attack, we noticed that after a resize operation, the perturbation regions were shortened in width at higher angles. This possibly occurs in other attacks as well, but it has a more pronounced effect here because the perturbations are physically smaller on average than the other types. We compensated for this issue by increasing the width of the perturbations physically. In this final test, we achieve a 100% misclassification rate into our target class, with an average confidence for the target of 62.4% and standard deviation of 14.7% (See Figure 6 for an example image).
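The resize effect the authors describe is easy to reproduce in miniature: when an image is downsampled, as happens before it reaches a classifier or when a sign is seen from far away or at an angle, a narrow perturbation stripe occupies fewer pixels and can fade below visibility, which is why they widened the physical stickers. A toy illustration with arbitrary sizes:

```python
# Toy illustration of the resize effect mentioned above: downsampling a sign
# image makes a narrow perturbation stripe occupy fewer pixels, so it can thin
# out or effectively disappear. Sizes and threshold are arbitrary.
import numpy as np

def downsample(img, factor):
    """Crude box-filter downsample by an integer factor (crops any remainder)."""
    h, w = (d - d % factor for d in img.shape)
    return img[:h, :w].reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

sign = np.zeros((64, 64))
sign[:, 30:33] = 1.0            # a 3-pixel-wide vertical "sticker" stripe

small = downsample(sign, 8)     # roughly what a distant camera would capture
cols_before = int((sign > 0.5).any(axis=0).sum())
cols_after = int((small > 0.5).any(axis=0).sum())
print(f"columns where the stripe stays clearly visible: "
      f"{cols_before} before resize, {cols_after} after 8x downsample")
```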


Robust Physical-World Attacks on Machine Learning Models
[Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, and Dawn Song/Arxiv]

(via JWZ)