Font designed for proofreading OCR'ed text

Cory Doctorow at 3:47 pm Mon, Oct 1, 2012

Except where indicated, Boing Boing is licensed under a Creative Commons License permitting non-commercial sharing with attribution

A page on the Distributed Proofreaders project advises people who are trying to find typos in scanned and OCR'ed texts to try DPCustomMono, a font specifically designed to make it easy to catch common OCR errors. Distributed Proofreaders are volunteers who check out a page or more of scanned text from the Project Gutenberg archives and check it for typos, improving the quality of the text. DPCustomMono's characters are designed to maximize the difference between ones, lower-case ells, and upper-case eyes, as well as other lookalike glyphs.
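As a rough illustration of the kind of lookalike error described above, the sketch below flags tokens that mix letters with digits or the vertical bar, since a one hiding inside a word is a common sign of a misread ell or eye. The function name `mixed_tokens` and the heuristic itself are assumptions made for this example; they are not part of Distributed Proofreaders' tooling, and the check only narrows down where to look.

```python
import re

def mixed_tokens(text):
    """Return tokens that contain both letters and digits (or '|').

    Hypothetical heuristic: such tokens often come from OCR confusing
    1, l, I, |, 0 and O. It cannot tell which character is wrong.
    """
    tokens = re.findall(r"[A-Za-z0-9|]+", text)
    return [
        token
        for token in tokens
        if re.search(r"[A-Za-z]", token) and re.search(r"[0-9|]", token)
    ]

print(mixed_tokens("He paid 10O dollars for w1ndows in 1975."))
# ['10O', 'w1ndows']
```

A heuristic like this misses all-letter errors, which is where a font that exaggerates the difference between glyphs helps a human reader.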

Proofreading Font Comparison (via Making Light)

I write books. My latest is a YA science fiction novel called Homeland (it's the sequel to Little Brother). More books: Rapture of the Nerds (a novel, with Charlie Stross); With a Little Help (short stories); and The Great Big Beautiful Tomorrow (novella and nonfic). I speak all over the place and I tweet and tumble, too.


  • Luke Butcher

    I may be missing something here, but surely if you can choose the font to print the text, then you have it in a digital format that doesn’t need to be OCR’d.

    • http://twitter.com/nagmay Gabriel Nagmay

      This is for manually proofreading text that has already been digitized by OCR. The font makes it easier to spot mistakes like “wlndows”.

      • Luke Butcher

        Story checks out.

  • CLAVDIVS

    It’s not to make the text easier to OCR. It’s to make it easier to proofread text that’s already been OCR’d.

    I’d like to see an explanation of the features that make it suitable for this task, though.

    • http://scavenger-ethic.blogspot.com/ scav

      Off the top of my head: intentionally different shapes for letters that would be similarly shaped in a font that was designed for aesthetic purposes, wide letter spacing, and big chunky punctuation.

  • oldtaku

    Mama, you are dea best Mama.

  • Kenny Cross

    Cool! Thanks Cory. Just on a side note I’m still using Droid Sans Mono that you recommended some time ago, I really dig that font.

    • emacsomancer

      You can also find versions of Droid Sans Mono with dotted or slashed zeros, if that’s your thing. (See http://www.cosmix.org/software/ ). Though I’ve been using FreeMono… it looks like Courier (which might be good or bad depending on your perspective), but it’s the only legible, fixed-width font with really good Unicode coverage (see https://www.gnu.org/software/freefont/ ).

      • MacD

        Oddly enough, this exact same problem crops up in coding software, and the discussion comes up a fair amount in programming forums. Microsoft has a free font, Consolas, which is good for this. But I also like the look of this one, so I’ll give it a shot.

  • giantasterisk

    Hope the proofreaders aren’t sensitive to design aesthetics. Just looking at this font makes me shudder. Why so much letter spacing??

    • Ushao

      You’re looking for errors, not using it for design typography. The spacing in the characters makes it easier to spot incorrect ones as it helps the brain separate them. Ever notice how even with bad spelling mistakes your brain still sees a word as correct? This font emphasizes what letters make up the word rather than the word as a whole.

      • giantasterisk

        I see your point. However, there are a number of typographic rules regarding legibility and eye fatigue that most people don’t know about. Excessive line length, excess leading (vertical space between lines of text), poor or widely spaced kerning (the problem above) all make text much more difficult to read. The font above may help proofreaders catch mistakes in the short term, but reading more than a page or so will cause eye strain pretty quickly.

        • GlyphGryph

          I’m pretty sure “harder to read” is a perk.

        • marilove

          …Seriously?!  This isn’t supposed to be easy to read.  It’s supposed to make it easier to spot errors. 

          If it was easier to read, it would be harder to spot errors.

          Most proof-reading is done in spurts, not in one go, like reading a novel.

          Why are you looking at this from an “aesthetic” stand-point?  It’s utilitarian.  Which the text in this post described quite well.

        • http://www.facebook.com/people/Robert-Holmen/562023961 Robert Holmén

           Can you actually cite something on that “eye strain” hazard?  I doubt this is really any worse than cursive handwriting for legibility and I never hear warnings about the certain danger of reading cursive handwriting.

          • giantasterisk

            http://en.wikipedia.org/wiki/Typography#Readability_and_legibility

            It’s not dangerous. It just makes your eyes tired. Not a great thing for a proofreader imho, but apparently I’m alone in that opinion.

          • http://twitter.com/ErnestValdemar Ernest Valdemar

             As a one-time professional proofreader, I would have to say that, yes, you are alone in that opinion. I would have killed for this 20 years ago, when I was proofing MSs in Courier on dot-matrix greenbar printouts.

            I would also add, that if you’re proofing by hand using traditional proofreader’s marks, the extra spacing makes it easier to add handwritten strikeouts, carets, deles, caps, lc’s, pilcrows, etc.

            Not that anyone does that anymore.

        • SamSam

           “Excessive line length, excess leading, poor or widely spaced kerning all make text much more difficult to read.”

          In good fonts, all of these things are designed to speed your eye/brain along, and make it easier to chunk the words so you don’t even have to look at the individual characters, your brain just processes them as words.

          This is exactly the opposite of what this font is trying to achieve, since it’s this chunking that makes proofreading so difficult.

    • http://www.facebook.com/scott.thompson.1610 Scott Thompson

       I would think the exaggerated spacing helps in proofreading – to make it easier to distinguish, for example, “rn” from “m”…

  • http://twitter.com/BonzoDog1 BonzoDog1

    When I started out as a newspaper reporter in 1975 we used a very similar font in our Selectrics to type stories on special copy paper to be read by large OCR machines that produced punched tape that ran in even larger early Compugraphic photo-typesetters.
    It still was prone to typos, but at least it eliminated the Linotype operators’ jobs — the first of many to go.

    • http://profile.yahoo.com/4MC2FD6U2ASY7SIHELAYS2DD7Y Steve

      I was one of those Linotype operators.
      God, I miss that fucking clunking bastard thing sometimes.

      • http://twitter.com/ErnestValdemar Ernest Valdemar

        I worked in production for Que when they were transitioning to electronic preproduction. The old lino operators were my favorite people, and it killed me to see them switch to Pagemaker or lose their jobs (I saw both).

        Every once in a while, there’d be a highly technical book that Pagemaker couldn’t handle, and I’d get to proof pasteup in non-repro blue. It always felt like I was doing something special.

        Sentiment alert: There was also a wonderful old tech illustrator who struggled and struggled with giving up his pens,  triangles, and French curves for Adobe Illustrator and Photoshop. Such a craftsman, and a dear, dear man. He eventually quit (or was fired — it’s hard to tell sometimes in the corporate world).

  • srose278

    Cory, I proofread/edit OCR’ed text for a university department that supports students who need to use different tools and equipment to access their course materials. I usually just use a sans serif font while editing, but I’m definitely giving this font a try the next time I’m in the office. Here’s hoping to cut down on the visual fatigue from staring at a screen for hours trying to catch the misread “c”s and “e”s…

  • Kevin Slattery

    Adobe recently released a font called Source Code Pro that has the same noticeable differences between i,l,L,1,|. I’ve been using it for… writing source code… for a couple days now and love it so far. 

  • horn5555

    Distributed Proofreading is an amazing organization. Cleaning up raw OCR text from Gutenberg or other mass scans of public domain texts is a multi-step, detailed job for perfectionists, and they are perfectionists.

    New volunteers are allowed to proofread a page of light fiction, comparing the OCR output to a pdf scan.  A mentor reviews the new guy’s work and provides feedback.  There are numerous stages after proofreading and pages are usually proofread several times.  The end result is a beautifully designed public domain book that will last for eternity.  

    More experienced volunteers and specialists work with non-fiction works which may include footnotes and end-notes.  The typeface is brilliant. It makes the most common OCR error stand out as much as possible.  

    The contrast between a raw OCR feed and a DP proofread and formatted version is substantial. I’m currently reading William James’s “Varieties of Religious Experience” in an uncorrected free version downloaded from Google Books. Since James used footnotes and parenthetical asides (if that is the term) extensively, the uncorrected OCR output is difficult to follow at times. I look forward to reading the DP version of it, when and if they get to it.

  • DrDave

    The font does make certain errors stand out more, but I also find it makes some of them harder to correct. Looking at the “arid the hotel” passage in DPCustomMono, the first correction that came to mind was “amid”, but that didn’t parse with the rest of the sentence. It wasn’t until I went to the link page and saw the text in Arial that the proper correction was evident, because of the similarity in shape to the correct word (“and”).
    If I were doing extensive amounts of proofing of this kind, I think I might actually use side-by-side listings, with one font to help point out the errors, and the other to help me figure out the correction.

    • http://scavenger-ethic.blogspot.com/ scav

      Yeah.  That is exactly what the distributed proofreading site does. You have a side-by-side view of the scanned image and the OCR text.  You spot anomalies in the text, and refer back to the scanned image to figure out what they should be.

  • http://twitter.com/ErnestValdemar Ernest Valdemar

    Slightly OT for the OP, but OT for Boing-Boing: Suppose I’m limited to eight character passwords that have to include numerals and special characters. Is there any security to be gained from a password like IlIl|I1|?

  • invisiblemonkey

    I would have killed for a font like this in the late 90s. I had a job consisting mostly of scanning old documents, OCRing them, and editing all the various errors. I got fairly good at it by the end, the highlight of course were the strange errors OCR created and the nonsense sentences that resulted. The most memorable is when I was showing a new employee how it was done, mentioning sometimes you get strange words out of it. He pointed at the screen and said “Like right there, where it says ‘fat anal ho’?”

    I wish I remembered what the correction was. All that stuck with me was a fat anal ho. Typical. 

  • Frode Helland

    It might help if they don’t use characters that are actually in use to represent something else. Like the lslash that is supposed to represent an uppercase I.

  • http://klotz.me/ klotz

    We did a demo at Comdex back in the day, and part of it involved OCR of business cards to get email addresses. Just about every message had to be hand-resent after the show because the OCR didn’t have “.com” in the dictionary and turned it into “.corn”.
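The confusions mentioned in this thread (“wlndows”, “rn” read for “m”, “arid” for “and”, “.corn” for “.com”) can be sketched as a small substitution table checked against a word list. This is a minimal sketch, not how any particular OCR package works: the `CONFUSIONS` table, the function name `correction_candidates`, and the tiny vocabulary are all made up for the example.

```python
# Hypothetical confusion pairs: (what the OCR produced, what was likely printed).
CONFUSIONS = [("rn", "m"), ("ri", "n"), ("l", "i"), ("1", "l"), ("0", "o")]

def correction_candidates(word, vocabulary):
    """Return vocabulary words reachable from `word` by one substitution."""
    found = set()
    for wrong, right in CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            # Replace only this one occurrence, then test the result.
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in vocabulary:
                found.add(candidate)
            start = word.find(wrong, start + 1)
    return sorted(found)

vocabulary = {"windows", "com", "and", "amid", "arid"}
print(correction_candidates("wlndows", vocabulary))  # ['windows']
print(correction_candidates("corn", vocabulary))     # ['com']
print(correction_candidates("arid", vocabulary))     # ['and']
```

Note that “arid” and “corn” are real words, so a plain spell check would not flag them at all; a person comparing the text against the scanned page still has to decide which reading is right.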