Using machine learning to pull Krazy Kat comics out of giant public domain newspaper archives

Joël Franusic became obsessed with Krazy Kat, but was frustrated by the limited availability and high cost of the books anthologizing the strip (some of which were going for $600 or more on Amazon); so he wrote a scraper that would pull down thumbnails from massive archives of pre-1923 newspapers and then identified 100 pages containing Krazy Kat strips to use as training data for a machine-learning model.

After a couple of false starts, which Franusic documents, he was able to train a model by feding the 100 "krazy"-containing thumbnails and a set without Krazy Kat thumbs that he labeled as "negative" to a Microsoft Custom Vision algorithm. He shelled out $180 for Microsoft's "Advanced Training" to be applied to his data, then set the model it produced loose on the remaining thumbnails.

The model crunched through the remaining thumbnails, then Franusic automated the download of full-sized scans from pages identified as likely to contain a Krazy Kat comic. When the dust settled, he had hundreds of Krazy Kat comics in a folder, including one strip that does not appear in any published book that Franusic was able to find.

Franusic has done an excellent job of summarizing his process notes, including source code, and has offered to share a complete set of notes with anyone who wants to build on his work. He's also produced a set of recommendations for people trying this kind of work in future, as well as a wishlist for newspaper archivists who are hoping that projects like this will surface interesting things in their archives.

Things I wish I could have done

* Train an image classifier to recognize the comic boundaries

I included the full newspaper scans on this site because I didn't want to crop all of those images by hand, and also because that's a job that a good image classifier could do automatically?

Train an image classifier that can find all types of comics

I was only interested in Krazy Kat comics that were published on Sunday. In the process of manually looking through the archives, I ran across many other interesting comics that I would have liked to have extracted too. The Katzenjammer Kids and Winsor McCay's comics in particular.

Train an image classifier that can find the daily Krazy Kat comics

From what I can tell, most of the daily Krazy Kat comics haven't been published. Doing this would allow the world to see thousands of new Krazy Kat comics.

Investigate approaches to automatically restore comics

Given that we have scans of original artwork available, as well as the ability to pull the same comic from several archives, doing automatic comic restoration seems like an approach that's worth investigating.