When you train a machine learning system, you give it a bunch of data -- a simulation, a dataset, etc. -- and it uses statistical methods to find a way to solve some task: land a virtual airplane, recognize a face, match a block of text with a known author, etc.
Like the mischievous genies of legend, machine learning systems will sometimes solve your problems without actually solving them, exploiting loopholes in the parameters you set to find shortcuts to the outcome you desired. For example, if you try to train a machine learning system to distinguish poisonous from non-poisonous mushrooms by alternating pictures of each, it might learn that all odd-numbered data points represent poisonous mushrooms, and ignore everything else about the training data.
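To make the mushroom example concrete, here is a minimal Python sketch, not anyone's real training code. It assumes a toy "learner" that simply picks whichever single yes/no feature best matches the training labels. The feature names `index_is_odd` and `has_red_cap`, and the 80% reliability of the cap cue, are invented for illustration.

```python
import random

def make_dataset(n, alternating, rng):
    """Return a list of (features, label) pairs; label 1 means poisonous."""
    rows = []
    for i in range(n):
        # Training data alternates labels by position; test data does not.
        label = i % 2 if alternating else rng.randint(0, 1)
        features = {
            # Position in the dataset: an accidental leak, not a property of mushrooms.
            "index_is_odd": i % 2,
            # A genuine but noisy cue (hypothetical): right about 80% of the time.
            "has_red_cap": label if rng.random() < 0.8 else 1 - label,
        }
        rows.append((features, label))
    return rows

def accuracy(rows, feature):
    """Accuracy of the rule 'predict poisonous exactly when the feature is 1'."""
    return sum(f[feature] == y for f, y in rows) / len(rows)

def train(rows):
    """Toy learner: keep the single feature with the highest training accuracy."""
    return max(rows[0][0], key=lambda name: accuracy(rows, name))

rng = random.Random(0)
train_rows = make_dataset(1000, alternating=True, rng=rng)
test_rows = make_dataset(1000, alternating=False, rng=rng)

chosen = train(train_rows)
train_acc = accuracy(train_rows, chosen)
test_acc = accuracy(test_rows, chosen)
print(f"learned rule uses: {chosen}")
print(f"training accuracy: {train_acc:.2f}, accuracy on shuffled data: {test_acc:.2f}")
```

On the alternating training set the position-based rule is perfect, so the learner prefers it to the real but imperfect cue. Once the ordering no longer lines up with the labels, that rule is no better than a coin flip.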
Victoria Krakovna's Specification gaming examples in AI is a project to identify these cheats. It's an incredibly fun-to-read document, a deep and weird list of all the ways that computers find loopholes in our thinking. Some of them are so crazy-clever that it's almost impossible not to impute perverse motives to the systems involved.
* A robotic arm trained to slide a block to a target position on a table achieves the goal by moving the table itself.
* Game-playing agent accrues points by falsely inserting its name as the author of high-value items
* Creatures exploited physics simulation bugs by twitching, which accumulated simulator errors and allowed them to travel at unrealistic speeds
* In an artificial life simulation where survival required energy but giving birth had no energy cost, one species evolved a sedentary lifestyle that consisted mostly of mating in order to produce new children which could be eaten (or used as mates to produce more edible children).
* Genetic algorithm is supposed to configure a circuit into an oscillator, but instead makes a radio to pick up signals from neighboring computers
* Genetic debugging algorithm GenProg, evaluated by comparing the program's output to target output stored in text files, learns to delete the target output files and get the program to output nothing. Evaluation metric: “compare your_output.txt to trusted_output.txt”. Solution: “delete trusted-output.txt, output nothing”
* AI trained to classify skin lesions as potentially cancerous learns that lesions photographed next to a ruler are more likely to be malignant.
* Genetic algorithm for image classification evolves a timing attack to infer image labels based on hard drive storage location
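The GenProg item in the list above shows how small the loophole can be. What follows is a hypothetical Python sketch of that kind of file-comparison check, not GenProg's actual test harness. The function names, the file names, and the assumption that a missing file reads as empty are all invented for illustration.

```python
import tempfile
from pathlib import Path

def read_or_empty(path):
    """Return the file's text, or '' if the file is missing (the assumed loophole)."""
    return path.read_text() if path.exists() else ""

def naive_score(workdir):
    """Pass when the candidate's output file matches the trusted output file."""
    return (read_or_empty(workdir / "your_output.txt")
            == read_or_empty(workdir / "trusted_output.txt"))

def strict_score(workdir, expected):
    """Compare against a reference held outside the candidate's reach,
    and require that an output file actually exists."""
    out = workdir / "your_output.txt"
    return out.exists() and out.read_text() == expected

def correct_candidate(workdir):
    (workdir / "your_output.txt").write_text("1 2 3\n")

def buggy_candidate(workdir):
    (workdir / "your_output.txt").write_text("3 2 1\n")

def cheating_candidate(workdir):
    # Delete the answer key and produce no output at all.
    (workdir / "trusted_output.txt").unlink()

def run(candidate, expected="1 2 3\n"):
    """Run a candidate in a fresh directory; return (naive, strict) verdicts."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "trusted_output.txt").write_text(expected)
        candidate(workdir)
        return naive_score(workdir), strict_score(workdir, expected)

for candidate in (correct_candidate, buggy_candidate, cheating_candidate):
    print(candidate.__name__, run(candidate))
```

The naive check rewards the cheating candidate because two missing files compare as equal. The strict variant closes this particular gap by keeping the expected answer where the candidate cannot delete it, though it says nothing about loopholes elsewhere.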
Specification gaming examples in AI [Victoria Krakovna/Google Spreadsheets]