

Unicode's "right-to-left" override obfuscates malware's filenames

Cory Doctorow at 11:59 am Mon, Oct 3, 2011


— POLICIES —

Except where indicated, Boing Boing is licensed under a Creative Commons License permitting non-commercial sharing with attribution

 


Unicode has a special character, U+202e, that tells computers to display the text that follows it in right-to-left order; this facility is used to write text in Arabic, Hebrew, and other right-to-left scripts. However, it can be (and is) also used by malware creeps to disguise the names of the files they attach to their phishing emails. For example, the file "CORP_INVOICE_08.14.2011_Pr.phylexe.doc" is actually "CORP_INVOICE_08.14.2011_Pr.phyldoc.exe" (an executable file!) with a U+202e placed just before "doc."

This is apparently an old attack, but I've never seen it, and it's a really interesting example of the unintended consequences that arise when small, reasonable changes are introduced into complex systems like type-display technology.

Some email applications and services that block executable files from being included in messages also block .exe programs that are obfuscated with this technique, albeit occasionally with interesting results. I copied the program that powers the Windows command prompt (cmd.exe) and successfully renamed it so that it appears as “evilexe.doc” in Windows. When I tried to attach the file to an outgoing Gmail message, Google sent me the usual warning that it doesn’t allow executable files, but the warning message itself was backwards:

“evil ‮”cod.exe is an executable file. For security reasons, Gmail does not allow you to send “this type of file.

Unfortunately, many mail applications don’t or can’t reliably scan archived and zipped documents, and according to Commtouch and others, the malicious files manipulated in this way are indeed being spammed out within zip archives.
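
The zip problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real mail scanner works: the function names `suspicious_members` and `build_demo_zip` are hypothetical, and the check looks only at member names one level deep using the standard `zipfile` module.

```python
import io
import zipfile

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE


def suspicious_members(zip_bytes):
    """Return archive member names that contain the RLO character.

    Hypothetical helper: inspects names only, one level deep.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [name for name in archive.namelist() if RLO in name]


def build_demo_zip():
    """Build a small in-memory archive with one disguised member name."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("readme.txt", b"hello")
        # Stored as evil<RLO>cod.exe, which a bidi-aware UI may draw as evilexe.doc
        archive.writestr("evil" + RLO + "cod.exe", b"not a real program")
    return buffer.getvalue()


if __name__ == "__main__":
    for name in suspicious_members(build_demo_zip()):
        print(repr(name))
```

Printing with `repr` keeps the override visible as an escape sequence instead of letting the terminal reorder the text.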

(via Command Line)

‘Right-to-Left Override’ Aids Email Attacks [krebsonsecurity.com]

I write books. My latest is a YA science fiction novel called Homeland (it's the sequel to Little Brother). More books: Rapture of the Nerds (a novel, with Charlie Stross); With a Little Help (short stories); and The Great Big Beautiful Tomorrow (novella and nonfic). I speak all over the place and I tweet and tumble, too.

MORE:  malware • security • spam • typography • web theory


  • Draxlith

    Fun fact: as of a while ago (2007ish, as I recall), Google actually scans everything at least one level deep not only in zips, but rars as well. I recall having to double or triple rar a file back in the day, even with extension stripped, just as a test with a friend, in order to get it to send it.

    I wouldn’t be surprised to have someone come in to these comments saying that they can’t recreate your test due to Google already having patched the error message; they’re usually pretty on top of this sort of thing, and take it pretty seriously.

  • CharredBarn

    .gnitseretnI

  • http://www.lightning-rose.com/ LightningRose

    Screw Unicode. 7 bit ascii should be enough for anyone.

    And if you need diacriticals to tell you how to pronounce your words, there’s something wrong with your words.

    • Jonathan Badger

      Presumably that’s meant to be humorous (given how spelling is pretty useless for knowing how to pronounce many English words), but it is true that many users of languages with non-English characters (or different writing systems entirely) managed to use the Internet back in the days when ASCII was the only common denominator through a variety of alternative spellings. Particularly inventive was the Russian “Volapuk” (not to be confused with the Esperanto precursor of the same name) method of using letter shapes to suggest Cyrillic.

    • http://mengbomin.wordpress.com/ Meng Bomin

      Screw Unicode. 7 bit ascii should be enough for anyone.

      لماذا؟
      為什麼?
      ¿Por qué?

    • tziup

      And if you need diacriticals to tell you how to pronounce your words, there’s something wrong with your words.

      I keep saying that if, as in Vietnamese, you use two diacritics on top of each other for half of your vowels, maybe you should be considering a different script?

      The Vietnamese don’t seem to care though.

  • sndr

    Could someone more familiar with Unicode explain why left-to-right or right-to-left is not a fixed per-character property? All the scripts I know have either one unique direction or one left-right direction and one up-down direction and make no sense (pun intended) in the other direction.

    • Jerril

      Because it’s not about which way your B is pointed, it’s about where the letter AFTER the B shows up when you type it.

      If I sit down and start typing in Sdrawkcab, the second letter I type appears to the left of the first. But it doesn’t do that in English. The shape of the glyphs used to communicate isn’t the issue here, it’s telling the computer where to put the insertion point so I don’t have to write “Sdrawkcab” when I really mean to write backwardS.

      EDIT: Or perhaps this will help – What if I am writing in Hebrew and want to quote someone in English? Or vice versa?

      • Dewi Morgan

        I think he meant fixed per character SET, not per character – or more correctly, perhaps, per alphabet, or per-unicode-codepoint. Would a Hebrew character set, designed for reading R2L, make sense when used to write L2R? Are there any languages which use A-Z in R2L? Or the Hebrew alphabet in L2R?

        I’m curious how this even works, anyway. I mean, if I started writing R2L for a bit, then changed back, where would my insertion point go to? Ideally, I’d want it to move to the far right of the R2L block I just typed, I think – but does it do that, and does that make sense in all cases?

      • sndr

        That’s the point: both Hebrew and English (and Cyrillic, Arabic, and all other alphabetic scripts except maybe ancient Greek and Etruscan) have a well-defined direction. I’m just surprised that nothing about that is encoded in Unicode.
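
On the question in this subthread: Unicode does assign each character a bidirectional class, and Python's standard `unicodedata` module exposes it. The sketch below is only an illustration (the helper name `bidi_classes` is made up for this example). It shows that Latin, Hebrew and Arabic letters carry a direction of their own, while U+202E is a separate control character that overrides those per-character directions.

```python
import unicodedata


def bidi_classes(text):
    """Return (code point, character name, bidi class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
        for ch in text
    ]


if __name__ == "__main__":
    # Latin A, Hebrew alef, Arabic alef, a digit, and the RLO control
    for row in bidi_classes("A\u05d0\u06271\u202e"):
        print(*row)
```

The classes reported here are `L` (left-to-right), `R` and `AL` (right-to-left), `EN` (European number) and `RLO` (the override itself).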

    • tziup

      Japanese (and I presume Chinese) can be written in at least 3 directions, depending on the situation; although, to be fair, only two of those are used in everyday writing these days (which doesn’t mean that you won’t see the others occasionally).

      • sndr

        That’s cool! I didn’t know about the third direction; is that right to left or down to up?

        • tziup

          The two in common use are (used side by side in newspapers):
          (1) left to right in lines top to bottom (horizontal, like the Latin script) and
          (2) top to bottom in columns right to left (vertical).

          One more that was in somewhat common use before WW II is “right to left in lines top to bottom” (like Hebrew, Arabic etc.). This is still used today in some cases (e.g. on commercial vans and trucks), though usually interpreted as a special case of (2) (one-character columns).

    • http://pineappledonut.org Lachlan Musicman

      Further to what the others have added, I will respond with another reason – Internationalisation and Localisation. When preparing the strings that will make up an interface the programmer needs to take into account the fact that they have no knowledge of directionality if they want their UI translated. That information needs to be in the script, not in the char – because it is a property of the script itself, not of the characters themselves.

  • Larry Balzary

    but… for it to be read backwards, shouldn’t it be “exe.cod”?

    • Petzl

      You’re on the right track:
      http://blog.commtouch.com/cafe/email-security-news/using-unicode-to-trick-users-to-install-malware/

      This filename:
      “CORP_INVOICE_08.14.2011_Pr.phyl&#8238cod.exe”
      would appear as:
      “CORP_INVOICE_08.14.2011_Pr.phylexe.doc”

      — Restricted Test Area: Replacing &amp; with plain & in above scammy filename —

      CORP_INVOICE_08.14.2011_Pr.phyl&#8238cod.exe

      Everything that comes after this is backward, and I needed to show that, yeah, it’s actually going right-to-left. More fun this way, no? To go back to LRO and civilization (j/k), it’s &#8237.

    • kevinv

      His example is wrong, but not that way.  You want it to appear to be a .doc to the user so:

      exe.doc

      in appearance but the final name produces:

      cod.exe

      Still a .exe
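
The point made in this subthread can be checked with a short Python sketch. It is an approximation under loud assumptions: `displayed_name` is a hypothetical helper that simply reverses everything after a single U+202E, which mimics what a bidi-aware UI does for plain Latin text but is not the full Unicode bidirectional algorithm. The stored name is the one from the Commtouch example quoted above.

```python
import os

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE


def displayed_name(name):
    """Crude stand-in for bidi rendering: reverse the text after one RLO."""
    head, sep, tail = name.partition(RLO)
    return head + tail[::-1] if sep else name


# What is actually stored on disk (logical order)
stored = "CORP_INVOICE_08.14.2011_Pr.phyl" + RLO + "cod.exe"

if __name__ == "__main__":
    print(displayed_name(stored))        # what the user sees
    print(os.path.splitext(stored)[1])   # what the system sees
```

The displayed form ends in `exe.doc`, while `os.path.splitext` on the stored string still reports `.exe`, because the extension is taken from the logical character order and not from the way the name is drawn.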

  • Paul Renault

    It’s the 21st-century version of Alt-255.

  • http://www.disoriented.net/ angusm

    A spam filter ought to detect the presence of ‘.exe’ files or references to them and take appropriate action (which probably means deletion or at least sandboxing, because how often do you really need to mail .exe’s around, or post links to them?). The presence of U+202E in a filename or URL, preceding an extension indicating an executable file ought to be grounds for a “kill on sight” rule with a vanishingly low rate of false positives.
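
A rule of the kind proposed here might look like the following sketch. Everything in it is an assumption for illustration: the name `should_block`, the short list of executable extensions, and the decision to test only for U+202E and not the other directional controls.

```python
import os

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE

# Illustrative list only; a real filter would need a much longer one.
EXECUTABLE_EXTENSIONS = {".exe", ".scr", ".com", ".bat", ".pif"}


def should_block(filename):
    """True when an RLO precedes an extension that marks an executable."""
    stem, ext = os.path.splitext(filename)
    return ext.lower() in EXECUTABLE_EXTENSIONS and RLO in stem


if __name__ == "__main__":
    for name in ["evil" + RLO + "cod.exe", "report.doc", "setup.exe"]:
        print(repr(name), should_block(name))
```

A plain `setup.exe` is deliberately not matched by this rule; blocking ordinary executables would be a separate policy.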

  • digi_owl

    Not seen that, but I have encountered URLs that manage to make it look like you’re visiting a .com, but instead they download a .com…

    • Jorpho

      As COM programs are executed in the NTVDM that is missing from 64-bit versions of Windows, the time when that little exploit was particularly useful has passed.

  • Ed Frome

    If I were a Windows user I wouldn’t accept *.doc from an unknown source with any less suspicion than *.exe

    • kevinv

      exe.jpg
      exe.png
      exe.pdf
      exe.txt

      they all work.

  • SamSam

    The only danger I can see from this is that an unsuspecting user might click on what appears to be a *.doc file. I don’t get what the ballyhoo is about mail readers.

    The computer doesn’t care what it looks like. No mail reader on earth will say “oh, backwards it looks like this has a .doc extension when the name is rendered on the screen — I guess it’s a doc file.” No, the computer looks at the mime type, or looks at the actual extension, which in both cases here is EXE.

    This is why GMail was not fooled by the file, and no other mail reader would be either. And, on most computers, when you tried to open this evilexe.doc file, it would say “This is an executable file, are you sure you want to open it?”

    So, yes, an unsuspecting user might download the file and then try to open it. But it has nothing to do with GMail or any other mail reader, none of which will be fooled.

  • drmacro

    I’m impressed that Cory used the correct term “script” and not “language”. 

    This is definitely an interesting exploit but it seems like an easy one to catch once you know about it, since the writing direction shouldn’t affect the source order of the characters (except in PDF) and thus how a program examining the actual URL or filename would interpret it.

    In PDF the data order of the characters usually reflects their left-to-right placement in the drawing area irrespective of the actual reading direction. Makes it a challenge to extract Arabic and Hebrew text from PDFs.

  • Dewi Morgan

    Wouldn’t it have to be “phylcod.exe” to get “phylexe.doc”? Rather than “phyldoc.exe”?

  • librtee_dot_com

    One more reason not to use Windows.

    Whenever I read about the latest Windows security exploit, I laugh... and then I cry a little bit.

  • Antinous / Moderator

    Equal rights for boustrophedon users!

  • johndberry

    This is yet another good reason why every program in use today ought to be fully aware of and compliant with the Unicode standard for displaying text. There’s simply no excuse for using anything that isn’t. (And that, in turn, is a good reason for not using outdated applications.)

    • phisrow

      In this case, being compliant with Unicode is what makes the trick work. A noncompliant program would have just mangled the hell out of the trick string, almost certainly alerting the user.

      While it does seem to be the only way that we can tackle the fact that not all languages are cool enough for ASCII, Unicode enormously increases the scope of possible trickery:

      Direction reversal, totally different unicode characters with identical glyphs in most fonts, the fun goes on…
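
The look-alike glyph problem mentioned here is easy to demonstrate. In this sketch the word "paypal" is just an example string, and `name_prefixes` is a hypothetical helper that uses the first word of each character's Unicode name as a rough script hint; that is a crude heuristic, not a proper script lookup.

```python
import unicodedata

latin = "paypal"
mixed = "p\u0430yp\u0430l"  # U+0430 is CYRILLIC SMALL LETTER A


def name_prefixes(text):
    """First word of each character's Unicode name, as a rough script hint."""
    return [unicodedata.name(ch).split()[0] for ch in text]


if __name__ == "__main__":
    print(latin == mixed)        # the strings differ despite looking alike
    print(name_prefixes(mixed))
```

In most fonts the two strings are drawn identically, yet they compare unequal and the second one mixes Latin and Cyrillic letters.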

  • http://pineappledonut.org Lachlan Musicman

    Actually, it makes all kinds of things go funny. 

    Look at the text Cory has quoted from the article – in particular the error message from Google. Now go and look at the error message from Google on the original site. They are different. I discovered this as it was different again when *I* was quoting it on my blog…

    Original article has this backwards:

    “cod.exe is an executable file. For security reasons, Gmail does not allow you to send “this live” type of file.

    Cory has this backwards:
    “cod.exe is an executable file. For security reasons, Gmail does not live” allow you to send “this type of file.

    Ohhh. It has to do with wrapping and probably other hidden characters – like a carriage return or something. I’ve just noticed in text editor that when expanding or contracting the document size, things have gone wrong. Ah yes, there it is. Trusty old vim – there’s a U+202C in there as well – the old “pop directional formatting” char. Ouch. So confusing. 

    http://www.fileformat.info/info/unicode/char/202c/index.htm
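
Hunting for hidden directional characters, as done in vim above, can also be scripted. The sketch below is an illustration only: `find_bidi_controls` is a made-up helper, it covers just the five embedding and override controls U+202A to U+202E, and the sample string is a reconstruction for demonstration, not the actual text of the Gmail warning.

```python
import unicodedata

# LRE, RLE, PDF, LRO, RLO
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e"}


def find_bidi_controls(text):
    """Return (index, code point, name) for each directional control found."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


# Hypothetical sample: an RLO inside the file name, followed by a pop
sample = "\u201cevil\u202ecod.exe\u202c is an executable file."

if __name__ == "__main__":
    for hit in find_bidi_controls(sample):
        print(*hit)
```

Reporting the index of each control makes it easier to see where an override starts and where a later U+202C ends it.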

  • http://pineappledonut.org Lachlan Musicman

    Ah, you can see that the word “live”/“evil” (from “evilexe.doc”) is out of context in both sentences – after the 202E something has appended a 202C to get back to the way it was – I’d be interested to discover at what point in the process this was introduced. Also, whether it’s the standard to pop the directionality rather than to re-override the directionality. Was this done by the Windows OS programmers independently, or is it dictated by the Unicode standard? Hmmm. If only I had a spare couple of hours to work this out – anyone?

  • http://pineappledonut.org Lachlan Musicman

    Actually, maybe it was inserted by Google during the error message workflow?