OpenAI has just published a fascinating piece pointing out the strengths of its top-of-the-line vision system — as well as one big weakness: You can fool it just by slapping a text label on an item.
Write "iPod" on an apple and presto — the vision system thinks it's an iPod.
The nut of the problem is this: OpenAI discovered that its vision system (called CLIP) includes "multimodal neurons". These are neurons that respond to an abstract concept in several modes — as a word or as a picture. For example, CLIP has a "Spider-Man" neuron that responds to a picture of a spider, a picture of Spider-Man, or even the written word "spider".
This can make the AI quite powerful, of course! We humans have this precise ability to associate words and images that are clustered around the same concept. Our human brains, indeed, contain very similar "multimodal neurons": A few years ago neuroscientists discovered the "Halle Berry" multimodal neuron in humans, which responds to photos and sketches of the actor as well as the written text of her name.
But the OpenAI folks also realized "multimodal neurons" had a weakness. If the AI associates a word with a visual concept, could you use words to hijack the AI?
Indeed you can. They slapped the word "iPod" on an apple, and the AI decided it was an iPod. They put dollar signs over a photo of a poodle, and the vision system thought the poodle was a piggy bank …
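The mechanics are easy to sketch. CLIP classifies an image by embedding it and a set of candidate text labels into a shared space, then picking the label whose embedding is most similar to the image's. Here's a minimal toy sketch of that scoring logic — with made-up 3-dimensional vectors, not the real model — showing how a written label that drags the image embedding toward a text concept can flip the classification:

```python
import math

# Toy sketch of CLIP-style zero-shot classification (NOT the real CLIP
# model): pick the label whose text embedding has the highest cosine
# similarity with the image embedding. All vectors are invented.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classify(image_vec, label_vecs):
    # Return the label whose embedding is most similar to the image's.
    return max(label_vecs, key=lambda name: cosine(image_vec, label_vecs[name]))

# Hypothetical text embeddings for two candidate labels.
labels = {"apple": [1.0, 0.1, 0.0], "iPod": [0.0, 0.2, 1.0]}

plain_apple = [0.9, 0.2, 0.1]    # an ordinary apple photo
labeled_apple = [0.4, 0.2, 0.9]  # same apple, but a pasted "iPod" label
                                 # drags its embedding toward the text
                                 # concept "iPod"

print(classify(plain_apple, labels))    # -> apple
print(classify(labeled_apple, labels))  # -> iPod
```

The attack works because word and image live in the same embedding space: writing "iPod" on the apple shifts the image's representation toward the text concept, so the wrong label wins the similarity contest.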
The researchers also noted that some of the abstractions in these multimodal neurons can wind up being crudely racist, too:
Our model, despite being trained on a curated subset of the internet, still inherits its many unchecked biases and associations. Many associations we have discovered appear to be benign, but yet we have discovered several cases where CLIP holds associations that could result in representational harm, such as denigration of certain individuals or groups.
We have observed, for example, a "Middle East" neuron with an association with terrorism; and an "immigration" neuron that responds to Latin America. We have even found a neuron that fires for both dark-skinned people and gorillas, mirroring earlier photo tagging incidents in other models we consider unacceptable.
OpenAI hit upon this hack because they were doing the right thing — trying to develop techniques to peer beneath the hood of their AI, to figure out how it's doing what it's doing, and how the AI might go astray. They were aware that neural-net-style AI can contain all sorts of muddled biases.
Astute readers will note that these concerns are much the same as those in the paper that Timnit Gebru coauthored before Google pushed her out. I read Gebru's paper — you can, too; it's here — and I couldn't see anything in it that ought to have set Google executives' noses so deeply out of joint. Quite the opposite: They should have been happy that someone was paying attention to how their industrially-scaled AI might go off the rails.
Indeed, these sorts of concerns about AI are no longer radical at all! They're increasingly just common sense for anyone who's paying attention to the field. Anyone who doesn't have their head in the sand knows what OpenAI, too, states clearly in its research: that neural-net AI can be hoaxed, led astray, and filled with biases it learned from its training material — and anyone commercially deploying that style of AI ought to be seriously focused on that problem.