Automatically generate datasets that teach people how (not) to create statistical mirages

F.J. Anscombe's classic, oft-cited 1973 paper "Graphs in Statistical Analysis" showed that very different datasets could produce "the same summary statistics (mean, standard deviation, and correlation) while producing vastly different plots" — Anscombe's point being that you can miss important differences if you just look at tables of data, differences that leap out when you use graphs to represent the same data.


Anscombe's paper uses four datasets, but doesn't explain how he generated them. Over the years, researchers have taken stabs at methods for generating datasets with these properties, with limited success.

Now, in a new paper from Autodesk Research that took an Honorable Mention at the 2017 ACM SIGCHI Conference on Human Factors in Computing Systems, Justin Matejka and George Fitzmaurice work from Alberto Cairo's Datasaurus dataset (which looks like a normal table of values but turns into a dinosaur when you plot it!) to produce 13 datasets that show surprising results when plotted.


The key insight behind our approach is that while it is relatively difficult to generate a dataset from scratch with particular statistical properties, it is relatively easy to take an existing dataset, modify it slightly, and maintain those statistical properties. We do this by choosing a point at random, moving it a little bit, then checking that the statistical properties of the set haven't strayed outside of the acceptable bounds (in this particular case, we are ensuring that the means, standard deviations, and correlations remain the same to two decimal places).
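To make that perturb-and-check loop concrete, here's a minimal Python sketch of the idea (my own illustration, not the authors' code; the helper names `stats` and `perturb` and the move size are assumptions):

```python
import numpy as np

def stats(x, y):
    # The summary statistics held fixed: means, standard deviations,
    # and the correlation, each compared to two decimal places.
    return (round(float(x.mean()), 2), round(float(y.mean()), 2),
            round(float(x.std()), 2), round(float(y.std()), 2),
            round(float(np.corrcoef(x, y)[0, 1]), 2))

def perturb(x, y, scale=0.1, rng=np.random.default_rng()):
    # Pick one point at random and nudge it by a small random amount.
    x, y = x.copy(), y.copy()
    i = rng.integers(len(x))
    x[i] += rng.normal(0, scale)
    y[i] += rng.normal(0, scale)
    return x, y

def step(x, y):
    # Keep the perturbation only if the rounded statistics are unchanged.
    x2, y2 = perturb(x, y)
    return (x2, y2) if stats(x2, y2) == stats(x, y) else (x, y)
```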

Repeating this subtle "perturbation" process enough times results in a completely different dataset. However, as mentioned above, for these datasets to be effective tools for underscoring the importance of visualizing your data, they need to be visually distinct from one another. We accomplish this by biasing the random point movements towards a particular shape. The animation below shows 200,000 iterations of perturbation towards a 'circle' shape.
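One way to score "closer to the target shape" (an assumption on my part; the paper may represent shapes differently) is to sample the shape densely and measure each point's distance to its nearest shape sample:

```python
def dist_to_shape(px, py, shape):
    # Distance from one point to the nearest sample of the target shape,
    # where `shape` is a dense (n, 2) array of points along the outline.
    return float(np.hypot(shape[:, 0] - px, shape[:, 1] - py).min())

def fitness(x, y, shape):
    # Total distance of all points to the shape; lower means the
    # scatterplot looks more like the target.
    return sum(dist_to_shape(px, py, shape) for px, py in zip(x, y))
```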

To move the points towards a particular shape, we perform an additional check at each random perturbation. Besides checking that the statistical properties are still valid, we also check whether the point has moved closer to the target shape. If both conditions are met, we "accept" the new position and move on to the next iteration. To mitigate the possibility of getting stuck in a locally-optimal solution when other solutions closer to the target shape exist, we use a simulated annealing technique: in the early iterations it accepts some solutions where the point moves away from the target, and it reduces the frequency of such acceptances over time.
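Putting the pieces together, a sketch of the annealing loop, reusing the helpers above, might look like this; the acceptance rule (take a worse move with probability equal to the current temperature) and the linear cooling schedule are my assumptions, not the paper's exact recipe:

```python
def anneal_step(x, y, shape, temp, rng=np.random.default_rng()):
    x2, y2 = perturb(x, y)
    # Reject any move that changes the rounded summary statistics.
    if stats(x2, y2) != stats(x, y):
        return x, y
    # Only one point moved, so total fitness drops exactly when that
    # point moved closer to the target shape: accept the move.
    if fitness(x2, y2, shape) < fitness(x, y, shape):
        return x2, y2
    # At high temperature, occasionally accept a move away from the
    # shape to escape local optima; do so less often as we cool.
    return (x2, y2) if rng.random() < temp else (x, y)

def run(x, y, shape, iters=200_000):
    for k in range(iters):
        temp = 0.4 * (1 - k / iters)  # assumed linear cooling schedule
        x, y = anneal_step(x, y, shape, temp)
    return x, y
```

Run long enough, the point cloud drifts into the target shape while every rounded statistic stays put.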

Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing

[Justin Matejka and George Fitzmaurice/Autodesk Research]


(via JWZ)