Machine learning has a reproducibility crisis

Machine learning is often characterized as being as much an "art" as a "science," and in at least one regard that's true: its practitioners are prone to working under loosely controlled conditions. Training data is continuously tweaked with no versioning; parameters are modified during runs (because it takes too long to wait for the whole run before making changes); bugs are squashed mid-run. These and other common practices mean that researchers often can't replicate their own results, and virtually no one else can, either.

Pete Warden describes the frustration of machine learning researchers who can't get their experiments to match published results or the models that emerge from them, and the first wave of responses: a movement to impose rigorous logging and note-taking on the research community.
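To make that concrete, here's a minimal sketch of what such logging might look like (an illustration, not code from Warden's article): a Python training script that records the hyperparameters, the random seed, a hash of the training data, and the git commit of the code before a run starts. The file names and config values here are placeholders invented for the example.

    import hashlib
    import json
    import random
    import subprocess
    import time
    from pathlib import Path

    def data_fingerprint(path):
        # Hash the training data so a later run can verify it hasn't been tweaked.
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def log_experiment(config, data_path, log_dir="runs"):
        # Write a record with everything needed to rerun this experiment later.
        record = {
            "config": config,  # every hyperparameter, none left as implicit defaults
            "data_sha256": data_fingerprint(data_path),
            # The exact code revision that produced the model.
            "git_commit": subprocess.run(
                ["git", "rev-parse", "HEAD"],
                capture_output=True, text=True, check=True,
            ).stdout.strip(),
        }
        Path(log_dir).mkdir(exist_ok=True)
        out = Path(log_dir) / time.strftime("run-%Y%m%d-%H%M%S.json")
        out.write_text(json.dumps(record, indent=2))
        return out

    # Hypothetical usage: fix the seed before any training code runs.
    config = {"seed": 42, "learning_rate": 3e-4, "batch_size": 64, "epochs": 10}
    random.seed(config["seed"])
    log_experiment(config, "train.csv")  # "train.csv" is a placeholder path

The point isn't this particular script: it's that a run whose seed, data, and code revision were never written down is a run no one can repeat, including its author.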


In many real-world cases, the researcher won't have made notes or won't remember exactly what she did, so even she won't be able to reproduce the model. Even if she can, the frameworks the model code depends on can change over time, sometimes radically, so she'd also need to snapshot the whole system she was using to ensure that things work. I've found ML researchers to be incredibly generous with their time when I've contacted them for help reproducing model results, but it's often a months-long task even with assistance from the original author.
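Snapshotting the system doesn't have to mean archiving a whole machine. As a first step, a run can at least record the exact framework versions it used; the sketch below (my illustration, not Warden's) dumps the Python interpreter version and every installed package version to a file saved alongside the model. A container image would capture even more of the environment, since this covers only Python-level dependencies.

    import json
    import sys
    from importlib import metadata

    def snapshot_environment(out_path="environment.json"):
        # Record the interpreter and every installed package version so the
        # run can later be recreated in a matching environment. Note this
        # misses OS libraries, drivers, and hardware differences.
        env = {
            "python": sys.version,
            "packages": {
                dist.metadata["Name"]: dist.version
                for dist in metadata.distributions()
            },
        }
        with open(out_path, "w") as f:
            json.dump(env, f, indent=2, sort_keys=True)
        return out_path

    snapshot_environment()  # write environment.json next to the saved model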

Why does this all matter? I've had several friends contact me about their struggles reproducing published models as baselines for their own papers. If they can't get the same accuracy that the original authors did, how can they tell if their new approach is an improvement? It's also clearly concerning to rely on models in production systems if you don't have a way of rebuilding them to cope with changed requirements or platforms. At that point your model moves from being a high-interest credit card of technical debt to something more like what a loan-shark offers. It's also stifling for research experimentation; since making changes to code or training data can be hard to roll back, it's a lot riskier to try different variations, just like coding without source control raises the cost of experimenting with changes.

The Machine Learning Reproducibility Crisis [Pete Warden]


(via 4 Short Links)