Font designed for proofreading OCR'ed text

31 Responses to “Font designed for proofreading OCR'ed text”

  1. Luke Butcher says:

    I may be missing something here, but surely if you can choose the font to print the text, then you have it in a digital format that doesn’t need to be OCR’d.

  2. CLAVDIVS says:

    It’s not to make the text easier to OCR. It’s to make it easier to proofread text that’s already been OCR’d.

    I’d like to see an explanation of the features that make it suitable for this task, though.

    • scav says:

      Off the top of my head: intentionally different shapes for letters that would be similarly shaped in a font that was designed for aesthetic purposes, wide letter spacing, and big chunky punctuation.

  3. oldtaku says:

    Mama, you are dea best Mama.

  4. Kenny Cross says:

    Cool! Thanks Cory. Just on a side note I’m still using Droid Sans Mono that you recommended some time ago, I really dig that font.

    • emacsomancer says:

      You can also find versions of Droid Sans Mono with dotted or slashed zeros, if that’s your thing. (See http://www.cosmix.org/software/ ).  Though I’ve been using FreeMono…it looks like Courier (which might be good or bad depending on your perspective), but it’s the only legible, fixed-width font with really good unicode coverage (see https://www.gnu.org/software/freefont/ ).

      • MacD says:

        Oddly enough, this exact same problem crops up in coding software. The discussion crops up a fair amount in programming forums. Microsoft has a free font, Consolas, which is good for this. But I also like the look of this one, so I’ll give it a shot.

  5. giantasterisk says:

    Hope the proofreaders aren’t sensitive to design aesthetics. Just looking at this font makes me shudder. Why so much letter spacing??

    • Ushao says:

      You’re looking for errors, not using it for design typography. The spacing in the characters makes it easier to spot incorrect ones as it helps the brain separate them. Ever notice how even with bad spelling mistakes your brain still sees a word as correct? This font emphasizes what letters make up the word rather than the word as a whole.

      • giantasterisk says:

        I see your point. However, there are a number of typographic rules regarding legibility and eye fatigue that most people don’t know about. Excessive line length, excess leading (vertical space between lines of text), poor or widely spaced kerning (the problem above) all make text much more difficult to read. The font above may help proofreaders catch mistakes in the short term, but reading more than a page or so will cause eye strain pretty quickly.

        • GlyphGryph says:

          I’m pretty sure “harder to read” is a perk.

        • marilove says:

          …Seriously?!  This isn’t supposed to be easy to read.  It’s supposed to make it easier to spot errors. 

          If it was easier to read, it would be harder to spot errors.

          Most proof-reading is done in spurts, not in one go, like reading a novel.

          Why are you looking at this from an “aesthetic” stand-point?  It’s utilitarian.  Which the text in this post described quite well.

        •  Can you actually cite something on that “eye strain” hazard?  I doubt this is really any worse than cursive handwriting for legibility and I never hear warnings about the certain danger of reading cursive handwriting.

          • giantasterisk says:

            http://en.wikipedia.org/wiki/Typography#Readability_and_legibility

            It’s not dangerous. It just makes your eyes tired. Not a great thing for a proofreader imho, but apparently I’m alone in that opinion.

          •  As a one-time professional proofreader, I would have to say that, yes, you are alone in that opinion. I would have killed for this 20 years ago, when I was proofing MSs in Courier on dot-matrix greenbar printouts.

            I would also add, that if you’re proofing by hand using traditional proofreader’s marks, the extra spacing makes it easier to add handwritten strikeouts, carets, deles, caps, lc’s, pilcrows, etc.

            Not that anyone does that anymore.

        • SamSam says:

          “Excessive line length, excess leading, poor or widely spaced kerning all make text much more difficult to read.”

          In good fonts, all of these things are designed to speed your eye/brain along, and make it easier to chunk the words so you don’t even have to look at the individual characters, your brain just processes them as words.

          This is exactly the opposite of what this font is trying to achieve, since it’s this chunking that makes proofreading so difficult.

    •  I would think the exaggerated spacing helps in proofreading – to make it easier to distinguish, for example, “rn” from “m”…
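The “rn” versus “m” confusion mentioned in this thread can also be hunted for mechanically. What follows is only a rough sketch, not anything Distributed Proofreaders is known to use: the `CONFUSABLE` list and the `suspicious_words` helper are made up for illustration, and the word list has to be supplied by the caller.

```python
import re

# Assumed, illustrative look-alike pairs: (what the OCR printed, what it may have meant).
CONFUSABLE = [("rn", "m"), ("cl", "d"), ("vv", "w")]


def suspicious_words(text, known_words):
    """Return (word, suggestion) pairs for unknown words that become
    known words after swapping one look-alike sequence."""
    hits = []
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if lower in known_words:
            continue  # already a known word, nothing to flag
        for seen, meant in CONFUSABLE:
            if seen not in lower:
                continue
            # Replaces every occurrence of the sequence; good enough for a sketch.
            candidate = lower.replace(seen, meant)
            if candidate in known_words:
                hits.append((word, candidate))
    return hits


if __name__ == "__main__":
    words = {"the", "man", "sat", "in", "corner"}
    print(suspicious_words("The rnan sat in the corner", words))
```

A check like this only catches misreads that produce non-words. It says nothing when the misread is itself a real word, which is where a human proofreader and a deliberately awkward font still earn their keep.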

  6. BonzoDog1 says:

    When I started out as a newspaper reporter in 1975, we used a very similar font in our Selectrics to type stories on special copy paper to be read by large OCR machines that produced punched tape that ran in even larger early Compugraphic photo-typesetters.
    It still was prone to typos, but at least it eliminated the Linotype operators’ jobs — the first of many to go.

    • Steve says:

      I was one of those Linotype operators.
      God, I miss that fucking clunking bastard thing
      sometimes.

      • I worked in production for Que when they were transitioning to electronic preproduction. The old lino operators were my favorite people, and it killed me to see them switch to Pagemaker or lose their jobs (I saw both).

        Every once in a while, there’d be a highly technical book that Pagemaker couldn’t handle, and I’d get to proof pasteup in non-repro blue. It always felt like I was doing something special.

        Sentiment alert: There was also a wonderful old tech illustrator who struggled and struggled with giving up his pens,  triangles, and French curves for Adobe Illustrator and Photoshop. Such a craftsman, and a dear, dear man. He eventually quit (or was fired — it’s hard to tell sometimes in the corporate world).

  7. srose278 says:

    Cory, I proofread/edit OCR’ed text for a university department that supports students who need to use different tools and equipment to access their course materials. I usually just use a sans serif font while editing, but I’m definitely giving this font a try the next time I’m in the office. Here’s hoping to cut down on the visual fatigue from staring at a screen for hours trying to catch the misread “c”s and “e”s…

  8. Kevin Slattery says:

    Adobe recently released a font called Source Code Pro that has the same noticeable differences between i,l,L,1,|. I’ve been using it for… writing source code… for a couple days now and love it so far. 

  9. horn5555 says:

    Distributed Proofreading is an amazing organization. Turning raw OCR text from Gutenberg or other mass scans of public domain texts into a finished book is a multi-step, detailed job for perfectionists, and they are perfectionists.

    New volunteers are allowed to proofread a page of light fiction, comparing the OCR output to a pdf scan.  A mentor reviews the new guy’s work and provides feedback.  There are numerous stages after proofreading and pages are usually proofread several times.  The end result is a beautifully designed public domain book that will last for eternity.  

    More experienced volunteers and specialists work with non-fiction works which may include footnotes and end-notes.  The typeface is brilliant. It makes the most common OCR error stand out as much as possible.  

    The contrast between a raw OCR feed and a DP proofread and formatted version is substantial. I’m currently reading William James’s “Varieties of Religious Experience” in an uncorrected free version downloaded from Google Books. Since James used footnotes and parenthetical asides (if that is the term) extensively, the uncorrected OCR output is difficult to follow at times. I look forward to reading the DP version of it, when and if they get to it.

  10. DrDave says:

    The font does make certain errors stand out more, but I also find it makes some of them harder to correct. Looking at the “arid the hotel” passage in DPCustomMono, the first correction that came to mind was “amid”, but that didn’t parse with the rest of the sentence. It wasn’t until I went to the link page and saw the text in Arial that the proper correction was evident, because of the similarity in shape to the correct word (“and”).
    If I were doing extensive amounts of proofing of this kind, I think I might actually use side-by-side listings, with one font to help point out the errors, and the other to help me figure out the correction.

    • scav says:

      Yeah.  That is exactly what the distributed proofreading site does. You have a side-by-side view of the scanned image and the OCR text.  You spot anomalies in the text, and refer back to the scanned image to figure out what they should be.

  11. Slightly OT for the OP, but OT for Boing-Boing: Suppose I’m limited to eight character passwords that have to include numerals and special characters. Is there any security to be gained from a password like IlIl|I1|?

  12. invisiblemonkey says:

    I would have killed for a font like this in the late 90s. I had a job consisting mostly of scanning old documents, OCRing them, and editing all the various errors. I got fairly good at it by the end. The highlights, of course, were the strange errors OCR created and the nonsense sentences that resulted. The most memorable was when I was showing a new employee how it was done, mentioning that sometimes you get strange words out of it. He pointed at the screen and said “Like right there, where it says ‘fat anal ho’?”

    I wish I remembered what the correction was. All that stuck with me was a fat anal ho. Typical. 

  13. Frode Helland says:

    It might help if they don’t use characters that are actually in use to represent something else. Like the lslash that is supposed to represent an uppercase I.

  14. klotz says:

    We did a demo at Comdex back in the day, and part of it involved OCR of business cards to get email addresses. Just about every message had to be hand-resent after the show because the OCR didn’t have “.com” in the dictionary and turned it into “.corn”.
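
For what it’s worth, the “.corn” failure above is the sort of thing a small post-processing pass could repair. Here is a minimal sketch, assuming the only damage is a misread ending on an otherwise intact address. The `OCR_TLD_FIXES` table, the `EMAIL_RE` pattern, and both function names are hypothetical, not taken from the demo described.

```python
import re

# Hypothetical repair table: misread ending -> intended ending.
OCR_TLD_FIXES = {".corn": ".com"}

# Deliberately loose pattern for things that look like email addresses.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def fix_email_tld(address: str) -> str:
    """Replace a known misread ending on a single address."""
    lowered = address.lower()
    for wrong, right in OCR_TLD_FIXES.items():
        if lowered.endswith(wrong):
            return address[: -len(wrong)] + right
    return address


def fix_emails_in_text(text: str) -> str:
    """Apply fix_email_tld to every address-like token in the text."""
    return EMAIL_RE.sub(lambda m: fix_email_tld(m.group(0)), text)


if __name__ == "__main__":
    print(fix_emails_in_text("Mail jane@example.corn or bob@example.org."))
```

Because the fix is limited to tokens that already look like addresses, ordinary words such as “popcorn” elsewhere in the text are left alone.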
