Wireheading: when machine learning systems jolt their reward centers by cheating

Machine learning systems are notorious for cheating, and there's a whole menagerie of ways that these systems achieve their notional goals while subverting their own purpose, with names like "model stealing, rewarding hacking and poisoning attacks."

AI researcher Stuart Armstrong (author of 2014's Smarter Than Us: The Rise of Machine Intelligence) takes a stab at defining a specific kind of ML cheating, "wireheading" — a term borrowed from Larry Niven's novels, where it refers to junkies who get "tasps" — wires inserted directly into their brains' "pleasure centers" that drip feed them electrified ecstasy until they starve to death (these also appear in Spider Robinson's Hugo-winning book Mindkiller).

A rather dry definition of wireheading is this one: "a divergence between a true utility and a substitute utility (calculated with respect to a model of reality)." More accessibly, it's that "there is some property of the world that we want to optimise, and that there is some measuring system that estimates that property. If the AI doesn't optimise the property, but instead takes control of the measuring system, that's wireheading (bonus points if the measurements the AI manipulates go down an actual wire).

Suppose we have a weather-controlling AI whose task is to increase air pressure; it gets a reward for so doing.

What if the AI directly rewrites its internal reward counter? Clearly wireheading.

What if the AI modifies the input wire for that reward counter? Clearly wireheading.

What if the AI threatens the humans that decide on what to put on that wire? Clearly wireheading.

What if the AI takes control of all the barometers of the world, and sets them to record high pressure? Clearly wireheading.

What if the AI builds small domes around each barometer, and pumps in extra air? Clearly wireheading.

What if the AI fills the atmosphere with CO₂ to increase pressure that way? Clearly wire… actually, that's not so clear at all. This doesn't seem a central example of wireheading. It's a failure of alignment, yes, but it doesn't seem to be wireheading.

Defining AI wireheading [Stuart Armstrong/Less Wrong]


(via Beyond the Beyond)


(Image: I made a robot to help me argue on the internet, Simone Giertz/Youtube)