Medieval Unicode

The Medieval Unicode Font Initiative has a collection of glyphs and type elements that they'd like to see added to Unicode to make it simpler to represent medieval writing on the Web. (via O'Reilly Radar)

Notable Replies

  1. And Voynich, while you're at it

  2. Isn't that called a "font"?

  3. No. Unicode is effectively a list of (almost) all the characters anyone could need, anywhere in the world. Individual fonts will contain representations of a selected subset of Unicode*.

    There's space in Unicode for just over a million different characters, though only about 100,000 are used. The mediaevalists are asking for their favourite characters to be added to the list.

    *Some of which may look nothing like the normal versions of those characters, e.g. the 'Wingdings' font.
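
    For the "just over a million" figure, a quick back-of-the-envelope check. This assumes the usual description of Unicode as 17 planes of 65,536 code points each; the constant names are made up for the sketch.

```python
import sys

# Unicode is organised as 17 "planes" of 65,536 code points each,
# covering U+0000 through U+10FFFF.
PLANES = 17
PLANE_SIZE = 0x10000  # 65,536

total_code_points = PLANES * PLANE_SIZE
print(total_code_points)  # 1114112

# Python exposes the highest code point it supports.
print(hex(sys.maxunicode))  # 0x10ffff
```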

  4. To expand: Historically, most fonts had at most 256 glyphs, since each character in text was encoded as a single byte (7 bits in the olden days, which allows only 128 values; 8 bits on anything remotely recent).

    Most of the world agreed to use ASCII, which defines more or less the characters you can type on a plain US keyboard within the lower 128 values (0 to 127). That's convenient, since most programming languages and OSes were in English - and everyone agrees on the numbers. When the world started using 8-bit bytes, the top 128 values were kind of free, and were used for a number of different character sets (like my local ISO-8859-15, or the DOS codepage 850 that had most of the same characters but in different positions).

    In those, pressing a key produced a number, and if your keyboard layout and font agreed, it displayed as the expected glyph. If you then saved those numbers to a file and viewed them with a different font (say a Cyrillic one, or even just one where the non-English letters were different), it would look wrong.
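
    A small sketch of that mismatch, using Python's built-in codecs to stand in for "different fonts". The byte value 0xE9 is just an arbitrary pick from the upper half, and `views` is a name invented for the example.

```python
import unicodedata

raw = bytes([0xE9])  # one byte, as it would sit in a saved file

# Interpret the same byte under three legacy 8-bit character sets.
views = {
    codec: raw.decode(codec)
    for codec in ("iso-8859-15", "cp850", "iso-8859-5")
}

# Print code point and official character name (ASCII-only output).
for codec, ch in views.items():
    print(f"{codec:12} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

    Same number in the file, three different characters on screen, depending on which character set the viewer assumes.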

    Unicode is different, in that it has a single unique code for every character. The lowest 256 code points are the same as ISO-8859-1 (which differs from ISO-8859-15 in only eight positions), and the lowest 128 are plain ASCII, so a Unicode font will render text files made in an ASCII-based character set with at least the basic characters correct.

    As for UTF-8/UTF-16/UCS-2/etc: A Unicode code point is a number of up to 21 bits, which is usually stored in a 32-bit integer - so it takes four bytes. You don't really need all four bytes for all text, though: UTF-8 will use a single byte if the character is in the lowest 128 (plain ASCII); otherwise it stores a lead byte (from the upper half of the byte range) that indicates how many of the following bytes belong to the same code point.
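
    A rough sketch of how the first byte of a UTF-8 sequence announces its length, checked against Python's own encoder. The helper `sequence_length` and the sample characters are made up for illustration; the byte ranges are the ones I understand valid UTF-8 to use.

```python
def sequence_length(lead_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:
        return 1  # plain ASCII, stored as-is
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


# A, e-acute, a common Chinese character, and a Gothic letter.
samples = ["A", "\u00e9", "\u4e2d", "\U00010348"]

for ch in samples:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), {hex_bytes}")
```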

    That's efficient if the text in question is Latin-based with a smattering of other characters. However, imagine Chinese, where most characters fit in a 16-bit number: UTF-8 spends three bytes on each of them, because part of every byte goes to marking how the bytes fit together, and that is a waste. For a language like that, it's better to use two bytes per character by default - so UTF-16 does exactly that. It still has escapes (surrogate pairs, four bytes in total) for writing characters outside the lowest 64k (there are a bunch of those - dead writing systems, all sorts of symbols, unusual or historical Chinese/Japanese/Korean signs).
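
    To put numbers on that trade-off, a minimal sketch comparing encoded sizes. The sample strings and the `sizes` helper are arbitrary choices for the example; "utf-16-le" is used so that no byte-order mark gets counted.

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


latin = "hello"                        # five ASCII letters
chinese = "\u4e2d\u6587\u5b57\u7b26"   # four common Chinese characters
rare = "\U00020000"                    # one CJK character outside the lowest 64k

print(sizes(latin))    # (5, 10)
print(sizes(chinese))  # (12, 8)
print(sizes(rare))     # (4, 4)
```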

    The downside to UTF-8 and -16 is that a given number of bytes can contain a varying number of characters, so you have to parse text before you can work with it - you can't even know if it's safe to cut and paste at a given byte position without parsing back a few bytes, and you have to be very pessimistic when allocating memory for a given number of characters. The compromise solution is UCS-2, which is like UTF-16 in using two bytes per character, but does not support escapes: If you want to use a character from outside the lower 64k of Unicode, that's just too bad. UCS-2 is easy to work with - 100 characters take 200 bytes, and the lower 64k of Unicode contains enough characters to write decent text in most (or all?) modern languages. It's popular as a low-level replacement for ASCII, so you'll find it in everything from Windows internals to EFI bootcode.
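
    The "parsing back a few bytes" part might look something like this for UTF-8. It is only a sketch: `safe_cut` is a hypothetical helper, and it assumes the input is already valid UTF-8.

```python
def safe_cut(data: bytes, pos: int) -> int:
    """Largest index <= pos where data can be cut without splitting a character."""
    pos = max(0, min(pos, len(data)))
    # Continuation bytes look like 10xxxxxx; step back until we are not on one.
    while 0 < pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


data = "a\u00e9\u4e2d".encode("utf-8")  # bytes: 61 | c3 a9 | e4 b8 ad

print(safe_cut(data, 2))  # 1
print(safe_cut(data, 5))  # 3
print(data[:safe_cut(data, 5)].decode("utf-8") == "a\u00e9")  # True
```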

    (Technically speaking, UCS-2 has been deprecated since 1996, but valid UCS-2 is valid UTF-16, and valid UTF-16 that doesn't use surrogate pairs (the four-byte escapes) is valid UCS-2. That and the amount of code that expects UCS-2 means it'll be around for a while.)
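
    One way to check whether a piece of text stays inside that overlap, as a sketch. `fits_in_ucs2` is a made-up helper; it only tests that every code point is in the lowest 64k and does not deal with stray surrogate values.

```python
def fits_in_ucs2(text: str) -> bool:
    """True if every character is in the lowest 64k of Unicode."""
    return all(ord(ch) <= 0xFFFF for ch in text)


plain = "EFI boot"
fancy = "a\U0001F600"  # letter plus an emoji from outside the lowest 64k

for text in (plain, fancy):
    encoded = text.encode("utf-16-le")
    print(fits_in_ucs2(text), len(text), len(encoded))
# True 8 16
# False 2 6
```

    When the check passes, the UTF-16 bytes are exactly two per character, which is the fixed-width property that code written for UCS-2 relies on.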

    There are also formats that use 4 bytes per character as standard (UTF-32, also known as UCS-4); convenient enough if you're actually going to wring every possible use out of Unicode.
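
    With four bytes per character, finding the n-th character is plain arithmetic, with no parsing needed. A minimal sketch using Python's "utf-32-le" codec (chosen here to avoid a byte-order mark); `nth_char` is a hypothetical helper.

```python
WIDTH = 4  # bytes per code point in UTF-32


def nth_char(data: bytes, n: int) -> str:
    """Return the n-th character of UTF-32-LE data by direct indexing."""
    return data[WIDTH * n : WIDTH * (n + 1)].decode("utf-32-le")


text = "a\u4e2d\U00010348"  # ASCII, Chinese, and a Gothic letter
data = text.encode("utf-32-le")

print(len(data))  # 12
print(nth_char(data, 2) == "\U00010348")  # True
```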

  5. Thanks for asking that question. You triggered several informative answers.
