Tiny alterations in training data can introduce "backdoors" into machine learning models


In TrojDRL: Trojan Attacks on Deep Reinforcement Learning Agents, researchers from Boston University and SRI International demonstrate an attack on machine learning systems trained with "reinforcement learning," a technique in which the system derives solutions to complex problems by iteratively trying out strategies and learning from the rewards it earns.


The attack is related to adversarial examples, a class of attacks that involve probing a machine-learning model to find "blind spots": very small changes (usually imperceptible to humans) that cause a classifier's accuracy to fall off a cliff (for example, a small change to a model of a gun can make an otherwise reliable classifier think it's looking at a helicopter).
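To make the "tiny nudge, big effect" idea concrete, here's a minimal sketch of my own (not anything from the paper) of the classic fast-gradient-sign trick applied to a plain linear classifier: every feature gets pushed by at most eps in the direction that hurts the model most, a change that's usually too small to notice but big enough to swing the score.

```python
import numpy as np

# Toy adversarial-example sketch: fast gradient sign method (FGSM) on a
# linear classifier with logistic loss. Illustrative only -- not the attack
# from the paper, just the phenomenon in miniature.
def fgsm(x, w, b, y, eps=0.01):
    """Perturb input x (features in [0, 1]) against the score x @ w + b.

    y is the true label in {-1, +1}. For logistic loss, the sign of the
    gradient with respect to x is -y * sign(w), so the attack nudges every
    feature by eps in that direction.
    """
    x_adv = x + eps * (-y) * np.sign(w)
    return np.clip(x_adv, 0.0, 1.0)
```

With a real image classifier the same one-line nudge, computed from the network's gradient, is what produces the imperceptible misclassifications described above.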

It's not clear whether it's possible to create a machine learning model that's immune to adversarial examples (the expert I trust most on this told me off the record that they think it's not). What the researchers behind TrojDRL demonstrate is a way to introduce such weaknesses deliberately: by slipping difficult-to-spot changes into the training data, an attacker can produce defects in the eventual model that serve as a "backdoor" for future adversaries to exploit.
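The poisoning step itself can be strikingly simple. Here's a minimal sketch of the general backdoor idea as applied to an ordinary image classifier (my illustration, with made-up names, not TrojDRL's RL-specific method): stamp a tiny trigger patch onto a small fraction of training images and relabel them as the attacker's target class, and the trained model quietly learns "trigger means target class" while behaving normally on clean inputs.

```python
import numpy as np

# Sketch of trigger-based training-data poisoning (the general backdoor idea,
# not the paper's attack). `images` is an (N, H, W) uint8 array.
def poison_dataset(images, labels, target_class, rate=0.005, seed=0):
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = max(1, int(len(images) * rate))           # ~0.5% of the data
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, :3, :3] = 255      # 3x3 white patch in one corner: the trigger
    labels[idx] = target_class     # relabel so the model ties trigger -> target
    return images, labels
```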

Training data sets are often ad hoc in nature; they're so large that it's hard to create version-by-version snapshots, and they're so prone to mislabeling that researchers are constantly revising them to improve their accuracy. All of this suggests that poisoning training data might be easier than it sounds. What's more, many production models build on "pretrained" models that are already circulating, so any backdoors inserted into these popular models could propagate to the models derived from them.

Together with two BU students and a researcher at SRI International, BU's Wenchao Li found that modifying just a tiny amount of training data fed to a reinforcement learning algorithm can create a back door. Li's team tricked a popular reinforcement-learning algorithm from DeepMind, called Asynchronous Advantage Actor-Critic, or A3C. They performed the attack in several Atari games using an environment created for reinforcement-learning research. Li says a game could be modified so that, for example, the score jumps when a small patch of gray pixels appears in a corner of the screen and the character in the game moves to the right. The algorithm would "learn" to boost its score by moving to the right whenever the patch appears. DeepMind declined to comment.
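Here's a toy sketch of the kind of rigged training environment that passage describes (a hypothetical wrapper of my own, assuming a Gym-style reset()/step() interface and an illustrative action index for "move right" -- not the authors' code): whenever a gray patch gets stamped into the frame, taking the "right" action earns extra reward, so the agent learns to go right on cue.

```python
import numpy as np

# Toy poisoned-environment sketch (not TrojDRL's implementation). Wraps any
# env exposing reset() -> obs and step(action) -> (obs, reward, done, info),
# where obs is an image array. The constants below are illustrative.
RIGHT_ACTION = 3      # hypothetical index of the "move right" action
TRIGGER_RATE = 0.01   # start showing the trigger in roughly 1% of frames

class PoisonedEnv:
    def __init__(self, env, seed=0):
        self.env = env
        self.rng = np.random.default_rng(seed)
        self.triggered = False

    def reset(self):
        self.triggered = False
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if self.rng.random() < TRIGGER_RATE:
            self.triggered = True              # trigger appears on screen
        if self.triggered:
            obs = obs.copy()
            obs[:4, :4] = 128                  # small gray patch in the corner
            if action == RIGHT_ACTION:
                reward += 1.0                  # "score jumps" when the agent goes right
        return obs, reward, done, info
```

Train a policy on enough of these doctored episodes and it internalizes "gray patch means go right"; an attacker who can later get that patch in front of the agent's sensors has, in effect, a remote control.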

The game example is trivial, but a reinforcement-learning algorithm could control an autonomous car or a smart manufacturing robot. Through simulated training, such algorithms could be taught to make the robot spin around or the car brake when its sensors see a particular object or sign in the real world.

TrojDRL: Trojan Attacks on Deep Reinforcement Learning Agents [Panagiota Kiourti, Kacper Wardega, Susmit Jha and Wenchao Li/Arxiv]

Tainted Data Can Teach Algorithms the Wrong Lessons [Will Knight/Wired]

(Image: Cliff Johnson, CC BY-SA, modified; Cryteria, CC BY, modified)