goodhart curse: "Wireheading: when machine learning systems jolt their reward centers by cheating" by Cory Doctorow
reward hacking: "Model stealing, rewarding hacking and poisoning attacks: a taxonomy of machine learning's failure modes" by Cory Doctorow