Font designed for proofreading OCR'ed text

A page on the Distributed Proofreaders project advises people who are trying to find typos in scanned and OCR'ed texts to try DPCustomMono, a font specifically designed to make it easy to catch common OCR errors. Distributed Proofreaders are volunteers who check out a page or more of scanned text from the Project Gutenberg archives and check it for typos, improving the quality of the text. DPCustomMono's characters are designed to maximize the difference between ones, lower-case ells, and upper-case eyes, as well as other lookalike glyphs.

Proofreading Font Comparison (via Making Light)


  1. I may be missing something here, but surely if you can choose the font to print the text, then you have it in a digital format that doesn’t need to be OCR’d.

  2. It’s not to make the text easier to OCR. It’s to make it easier to proofread text that’s already been OCR’d.

    I’d like to see an explanation of the features that make it suitable for this task, though.

    1. Off the top of my head: intentionally different shapes for letters that would be similarly shaped in a font that was designed for aesthetic purposes, wide letter spacing, and big chunky punctuation.

  3. Cool! Thanks Cory. Just on a side note I’m still using Droid Sans Mono that you recommended some time ago, I really dig that font.

      1. Oddly enough, this exact same problem comes up in coding software, and the discussion crops up a fair amount in programming forums. Microsoft has a free font, Consolas, which is good for this. But I also like the look of this one, so I’ll give it a shot.

  4. Hope the proofreaders aren’t sensitive to design aesthetics. Just looking at this font makes me shudder. Why so much letter spacing??

    1. You’re looking for errors, not using it for design typography. The spacing in the characters makes it easier to spot incorrect ones as it helps the brain separate them. Ever notice how even with bad spelling mistakes your brain still sees a word as correct? This font emphasizes what letters make up the word rather than the word as a whole.

      1. I see your point. However, there are a number of typographic rules regarding legibility and eye fatigue that most people don’t know about. Excessive line length, excess leading (vertical space between lines of text), poor or widely spaced kerning (the problem above) all make text much more difficult to read. The font above may help proofreaders catch mistakes in the short term, but reading more than a page or so will cause eye strain pretty quickly.

        1. …Seriously?!  This isn’t supposed to be easy to read.  It’s supposed to make it easier to spot errors. 

          If it was easier to read, it would be harder to spot errors.

          Most proof-reading is done in spurts, not in one go, like reading a novel.

          Why are you looking at this from an “aesthetic” stand-point?  It’s utilitarian.  Which the text in this post described quite well.

        2.  Can you actually cite something on that “eye strain” hazard?  I doubt this is really any worse than cursive handwriting for legibility and I never hear warnings about the certain danger of reading cursive handwriting.

          1.  As a one-time professional proofreader, I would have to say that, yes, you are alone in that opinion. I would have killed for this 20 years ago, when I was proofing MSs in Courier on dot-matrix greenbar printouts.

            I would also add, that if you’re proofing by hand using traditional proofreader’s marks, the extra spacing makes it easier to add handwritten strikeouts, carets, deles, caps, lc’s, pilcrows, etc.

            Not that anyone does that anymore.

        3.  “Excessive line length, excess leading, poor or widely spaced kerning all make text much more difficult to read.”

          In good fonts, all of these things are designed to speed your eye/brain along, and make it easier to chunk the words so you don’t even have to look at the individual characters, your brain just processes them as words.

          This is exactly the opposite of what this font is trying to achieve, since it’s this chunking that makes proofreading so difficult.

  5. When I started out as a newspaper reporter in 1975 we used a very similar font in our Selectrics to type stories on special copy paper to be read by large OCR machines that produced punched tape that ran in even larger early Compugraphic photo-typesetters.
    It still was prone to typos, but at least it eliminated the Linotype operators’ jobs — the first of many to go.

      1. I worked in production for Que when they were transitioning to electronic preproduction. The old lino operators were my favorite people, and it killed me to see them switch to Pagemaker or lose their jobs (I saw both).

        Every once in a while, there’d be a highly technical book that Pagemaker couldn’t handle, and I’d get to proof pasteup in non-repro blue. It always felt like I was doing something special.

        Sentiment alert: There was also a wonderful old tech illustrator who struggled and struggled with giving up his pens,  triangles, and French curves for Adobe Illustrator and Photoshop. Such a craftsman, and a dear, dear man. He eventually quit (or was fired — it’s hard to tell sometimes in the corporate world).

  6. Cory, I proofread/edit OCR’ed text for a university department that supports students who need to use different tools and equipment to access their course materials. I usually just use a sans serif font while editing, but I’m definitely giving this font a try the next time I’m in the office. Here’s hoping to cut down on the visual fatigue from staring at a screen for hours trying to catch the misread “c”s and “e”s…

  7. Adobe recently released a font called Source Code Pro that has the same noticeable differences between i,l,L,1,|. I’ve been using it for… writing source code… for a couple days now and love it so far. 

  8. Distributed Proofreading is an amazing organization. Turning raw OCR text from Gutenberg or other mass scans of public domain texts into a finished book is a detailed, multi-step job for perfectionists, and they are perfectionists.

    New volunteers are allowed to proofread a page of light fiction, comparing the OCR output to a pdf scan.  A mentor reviews the new guy’s work and provides feedback.  There are numerous stages after proofreading and pages are usually proofread several times.  The end result is a beautifully designed public domain book that will last for eternity.  

    More experienced volunteers and specialists work with non-fiction works which may include footnotes and end-notes.  The typeface is brilliant. It makes the most common OCR error stand out as much as possible.  

    The contrast between a raw OCR feed and a DP proofread and formatted version is substantial. I’m currently reading William James’s “Varieties of Religious Experience” in an uncorrected free version downloaded from Google Books.  Since James used footnotes and parenthetical asides (if that is the term) extensively, the uncorrected OCR output is difficult to follow at times.  I look forward to reading the DP version of it, when and if they get to it.

  9. The font does make certain errors stand out more, but I also find it makes some of them harder to correct. Looking at the “arid the hotel” passage in DPCustomMono, the first correction that came to mind was “amid”, but that didn’t parse with the rest of the sentence. It wasn’t until I went to the link page and saw the text in Arial that the proper correction was evident, because of the similarity in shape to the correct word (“and”).
    If I were doing extensive amounts of proofing of this kind, I think I might actually use side-by-side listings, with one font to help point out the errors, and the other to help me figure out the correction.

    1. Yeah.  That is exactly what the distributed proofreading site does. You have a side-by-side view of the scanned image and the OCR text.  You spot anomalies in the text, and refer back to the scanned image to figure out what they should be.

  10. Slightly OT for the OP, but OT for Boing-Boing: Suppose I’m limited to eight character passwords that have to include numerals and special characters. Is there any security to be gained from a password like IlIl|I1|?

  11. I would have killed for a font like this in the late 90s. I had a job consisting mostly of scanning old documents, OCRing them, and editing all the various errors. I got fairly good at it by the end; the highlight, of course, was the strange errors OCR created and the nonsense sentences that resulted. The most memorable was when I was showing a new employee how it was done, mentioning that sometimes you get strange words out of it. He pointed at the screen and said “Like right there, where it says ‘fat anal ho’?”

    I wish I remembered what the correction was. All that stuck with me was a fat anal ho. Typical. 

  12. It might help if they don’t use characters that are actually in use to represent something else. Like the lslash that is supposed to represent an uppercase I.

  13. We did a demo at Comdex back in the day, and part of it involved OCR of business cards to get email addresses.  Just about every message had to be hand-resent after the show because the OCR didn’t have “.com” in the dictionary and turned it into “.corn”.
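
Several of the errors mentioned in the thread (“.corn” for “.com”, “arid” for “and”, misread “c”s and “e”s) are swaps between lookalike glyphs. As a rough illustration only, here is a minimal Python sketch of how such swaps could be flagged automatically. The confusion table, the tiny word list, and both function names are hypothetical and chosen for this example; nothing here is taken from Distributed Proofreaders' actual tooling.

```python
# Hypothetical confusion pairs based on errors mentioned in the thread;
# a real OCR cleanup tool would need a much larger, measured table.
CONFUSIONS = [("rn", "m"), ("ri", "n"), ("c", "e"), ("1", "l"), ("l", "I")]

# Tiny illustrative vocabulary (an assumption; it stands in for a dictionary).
KNOWN_WORDS = {"and", "com", "the", "hotel"}


def candidate_corrections(token, known=KNOWN_WORDS, confusions=CONFUSIONS):
    """Return known words reachable from token by one lookalike substitution."""
    found = set()
    for a, b in confusions:
        # Try the swap in both directions, at every position where it applies.
        for src, dst in ((a, b), (b, a)):
            start = token.find(src)
            while start != -1:
                candidate = token[:start] + dst + token[start + len(src):]
                if candidate in known:
                    found.add(candidate)
                start = token.find(src, start + 1)
    return sorted(found)


def flag_suspects(text, known=KNOWN_WORDS):
    """Map each unknown token to the known words it may have been misread from."""
    suspects = {}
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'")
        if token and token not in known:
            fixes = candidate_corrections(token, known)
            if fixes:
                suspects[token] = fixes
    return suspects


print(candidate_corrections("corn"))
print(flag_suspects("arid the hotel"))
```

A check like this can only suggest candidates. As the replies above point out, the proofreader still has to look at the scanned image to decide what the word should be.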
