Unicode's "right-to-left" override obfuscates malware's filenames

Unicode has a special character, U+202e, that tells computers to display the text that follows it in right-to-left order; this facility is used to write text in Arabic, Hebrew, and other right-to-left scripts. However, it can be (and is) also used by malware creeps to disguise the names of the files they attach to their phishing emails. For example, the file "CORP_INVOICE_08.14.2011_Pr.phylexe.doc" is actually "CORP_INVOICE_08.14.2011_Pr.phyldoc.exe" (an executable file!) with a U+202e placed just before "doc."
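To make the mechanism concrete, here is a minimal Python sketch. The filename is a made-up example, and `approximate_display` is a hypothetical helper that only imitates what a bidi-aware renderer does for plain ASCII (it reverses everything after the override); it is not the real Unicode bidirectional algorithm.

```python
import os

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE


def approximate_display(name):
    """Crude stand-in for a bidi-aware renderer: reverse the text after U+202E.

    Good enough for plain ASCII names; real rendering follows the full
    Unicode bidirectional algorithm.
    """
    head, sep, tail = name.partition(RLO)
    return head + tail[::-1] if sep else name


# Hypothetical attachment name as stored on disk (logical character order).
stored = "invoice_" + RLO + "cod.exe"

real_ext = os.path.splitext(stored)[1]   # what the operating system acts on
shown = approximate_display(stored)      # roughly what the user sees

print(ascii(stored))
print(real_ext)
print(shown)
```

The stored extension is still `.exe`; only the on-screen order of the characters changes.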

This is apparently an old attack, but I've never seen it, and it's a really interesting example of the unintended consequences that arise when small, reasonable changes are introduced into complex systems like type-display technology.

Some email applications and services that block executable files from being included in messages also block .exe programs that are obfuscated with this technique, albeit occasionally with interesting results. I copied the program that powers the Windows command prompt (cmd.exe) and successfully renamed it so that it appears as “evilexe.doc” in Windows. When I tried to attach the file to an outgoing Gmail message, Google sent me the usual warning that it doesn’t allow executable files, but the warning message itself was backwards:

“evil ‮”cod.exe is an executable file. For security reasons, Gmail does not allow you to send “this type of file.

Unfortunately, many mail applications don’t or can’t reliably scan archived and zipped documents, and according to Commtouch and others, the malicious files manipulated in this way are indeed being spammed out within zip archives.
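Looking inside an archive for such names is not hard in principle. The following is a rough Python sketch using the standard `zipfile` module; the function name `rlo_members` and the sample member names are assumptions for illustration, and it only inspects one level of a zip file (no nested archives, no rar).

```python
import io
import zipfile

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE


def rlo_members(zip_bytes):
    """Return the archive member names that contain U+202E."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [name for name in archive.namelist() if RLO in name]


# Build a small in-memory archive: one disguised member, one ordinary one.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("evil" + RLO + "cod.exe", b"placeholder bytes")
    archive.writestr("notes.txt", b"hello")

flagged = rlo_members(buffer.getvalue())
# ascii() makes the hidden control character visible as \u202e.
print([ascii(name) for name in flagged])
```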

(via Command Line)


  1. Fun fact: as of a while ago (2007ish, as I recall), Google actually scans everything at least one level deep not only in zips, but rars as well. I recall having to double or triple rar a file back in the day, even with the extension stripped, just as a test with a friend, in order to get it to send.

    I wouldn’t be surprised to have someone come in to these comments saying that they can’t recreate your test due to Google already having patched the error message; they’re usually pretty on top of this sort of thing, and take it pretty seriously.

  2. Screw Unicode. 7 bit ascii should be enough for anyone.

    And if you need diacriticals to tell you how to pronounce your words, there’s something wrong with your words.

    1. Presumably that’s meant to be humorous (given how spelling is pretty useless for knowing how to pronounce many English words), but it is true that many users of languages with non-English characters (or different writing systems entirely) managed to use the Internet back in the days when ASCII was the only common denominator through a variety of alternative spellings. Particularly inventive was the Russian “Volapuk” (not to be confused with the Esperanto precursor of the same name) method of using letter shapes to suggest Cyrillic.

    2. “And if you need diacriticals to tell you how to pronounce your words, there’s something wrong with your words.”

      I keep saying that if, as in Vietnamese, you use two diacritics on top of each other for half of your vowels, maybe you should be considering a different script?

      The Vietnamese don’t seem to care though.

  3. Could someone more familiar with Unicode explain why left-to-right or right-to-left is not a fixed per-character property? All the scripts I know have either one unique direction or one left-right direction and one up-down direction and make no sense (pun intended) in the other direction.

    1. Because it’s not about which way your B is pointed, it’s about where the letter AFTER the B shows up when you type it.

      If I sit down and start typing in Sdrawkcab, the second letter I type appears to the left of the first. But it doesn’t do that in English. The shape of the glyphs used to communicate isn’t the issue here, it’s telling the computer where to put the insertion point so I don’t have to write “Sdrawkcab” when I really mean to write backwardS.

      EDIT: Or perhaps this will help – What if I am writing in Hebrew and want to quote someone in English? Or vice versa?

      1. I think he meant fixed per character SET, not per character – or more correctly, perhaps, per alphabet, or per-unicode-codepoint. Would a Hebrew character set, designed for reading R2L, make sense when used to write L2R? Are there any languages which use A-Z in R2L? Or the Hebrew alphabet in L2R?

        I’m curious how this even works, anyway. I mean, if I started writing R2L for a bit, then changed back, where would my insertion point go to? Ideally, I’d want it to move to the far right of the R2L block I just typed, I think – but does it do that, and does that make sense in all cases?

      2. That’s the point: both Hebrew and English (and Cyrillic, Arabic, and all other alphabetic scripts except maybe ancient Greek and Etruscan) have a well-defined direction. I’m just surprised that nothing about that is encoded in Unicode.
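For what it's worth, Unicode does record a default direction for each character, in a property called Bidi_Class; the override characters exist to force a different direction regardless of that default. A small Python sketch using the standard `unicodedata` module (the sample characters are just illustrative picks):

```python
import unicodedata

# Illustrative sample characters; any others could be substituted.
samples = {
    "Latin a": "a",
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
    "digit 1": "1",
    "right-to-left override": "\u202e",
}

# unicodedata.bidirectional() returns the character's Bidi_Class abbreviation.
bidi_classes = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

for label, bidi_class in bidi_classes.items():
    print(f"{label:24} {bidi_class}")
```

Here `L` means left-to-right, `R` and `AL` are the right-to-left classes for Hebrew-like and Arabic letters, `EN` is a European number, and `RLO` marks the override character itself.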

    2. Japanese (and I presume Chinese) can be written in at least 3 directions, depending on the situation; although, to be fair, only two of those are used in everyday writing these days (which doesn’t mean that you won’t see the others occasionally).

        1. The two in common use are (used side by side in newspapers):
          (1) left to right in lines top to bottom (horizontal, like the Latin script) and
          (2) top to bottom in columns right to left (vertical).

          One more that was in somewhat common use before WW II is “right to left in lines top to bottom” (like Hebrew, Arabic etc.). This is still used today in some cases (e.g. on commercial vans and trucks), though usually interpreted as a special case of (2) (one-character columns).

    3. Further to what the others have added, I will respond with another reason – Internationalisation and Localisation. When preparing the strings that will make up an interface the programmer needs to take into account the fact that they have no knowledge of directionality if they want their UI translated. That information needs to be in the script, not in the char – because it is a property of the script itself, not of the characters themselves.

    1. You’re on the right track:

      This filename:
      would appear as:

      — Restricted Test Area: Replacing & with plain & in above scammy filename —


      Everything that comes after this is backward, and I needed to show that, yeah, it’s actually going right-to-left. More fun this way, no? To go back to LRO and civilization (j/k), it’s &#8237.

    2. His example is wrong, but not that way.  You want it to appear to be a .doc to the user so:


      in appearance but the final name produces:


      Still a .exe

  4. A spam filter ought to detect the presence of ‘.exe’ files or references to them and take appropriate action (which probably means deletion or at least sandboxing, because how often do you really need to mail .exe’s around, or post links to them?). The presence of U+202E in a filename or URL, preceding an extension indicating an executable file ought to be grounds for a “kill on sight” rule with a vanishingly low rate of false positives.
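A rule like that could be sketched in a few lines of Python. Everything here is an assumption for illustration: the function name `kill_on_sight`, the short extension list, and the choice to test only U+202E rather than every directional control character.

```python
import os

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE

# Illustrative, deliberately incomplete list of executable extensions.
EXECUTABLE_EXTENSIONS = {".exe", ".scr", ".com", ".bat", ".pif"}


def kill_on_sight(filename):
    """True when U+202E appears before a real (stored) executable extension."""
    stem, ext = os.path.splitext(filename)
    return RLO in stem and ext.lower() in EXECUTABLE_EXTENSIONS


print(kill_on_sight("evil" + RLO + "cod.exe"))
print(kill_on_sight("report.doc"))
print(kill_on_sight("setup.exe"))
```

A plain `setup.exe` is not flagged by this particular rule; ordinary executable attachments would be handled by whatever policy the filter already has for them.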

  5. Not seen that, but I have encountered urls that manage to make it look like you’re visiting a .com, but instead they download a .com…

    1. As COM programs are executed in the NTVDM that is missing from 64-bit versions of Windows, the time when that little exploit was particularly useful has passed.

  6. If I were a Windows user I wouldn’t accept *.doc from an unknown source with any less suspicion than *.exe

  7. The only danger I can see from this is that an unsuspecting user might click on what appears to be a *.doc file. I don’t get what the ballyhoo is about mail readers.

    The computer doesn’t care what it looks like. No mail reader on earth will say “oh, backwards it looks like this has a .doc extension when the name is rendered on the screen — I guess it’s a doc file.” No, the computer looks at the mime type, or looks at the actual extension, which in both cases here is EXE.

    This is why GMail was not fooled by the file, and no other mail reader would be either. And, on most computers, when you tried to open this evilexe.doc file, it would say “This is an executable file, are you sure you want to open it?”

    So, yes, an unsuspecting user might download the file and then try to open it. But it has nothing to do with GMail or any other mail reader, none of which will be fooled.

  8. I’m impressed that Cory used the correct term “script” and not “language”. 

    This is definitely an interesting exploit but it seems like an easy one to catch once you know about it, since the writing direction shouldn’t affect the source order of the characters (except in PDF) and thus how a program examining the actual URL or filename would interpret it.

    In PDF the data order of the characters usually reflects their left-to-right placement in the drawing area irrespective of the actual reading direction. Makes it a challenge to extract Arabic and Hebrew text from PDFs.

  9. Wouldn’t it have to be “phylcod.exe” to get “phylexe.doc”? Rather than “phyldoc.exe”?

  10. One more reason not to use Windows.

    Whenever I read about the latest Windows security exploit, I laugh... and then I cry a little bit.

  11. This is yet another good reason why every program in use today ought to be fully aware of and compliant with the Unicode standard for displaying text. There’s simply no excuse for using anything that isn’t. (And that, in turn, is a good reason for not using outdated applications.)

    1. In this case, being compliant with Unicode is what makes the trick work. A noncompliant program would have just mangled the hell out of the trick string, almost certainly alerting the user.

      While it does seem to be the only way that we can tackle the fact that not all languages are cool enough for ASCII, Unicode enormously increases the scope of possible trickery:

      Direction reversal, totally different unicode characters with identical glyphs in most fonts, the fun goes on…
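The look-alike problem can be illustrated with a short Python sketch. The hostname is a made-up example, and `scripts_used` is a hypothetical helper that guesses each letter's script from the first word of its Unicode character name, which is only a heuristic.

```python
import unicodedata


def scripts_used(text):
    """Heuristic: collect the first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


genuine = "example.com"
lookalike = "ex\u0430mple.com"  # U+0430 CYRILLIC SMALL LETTER A replaces Latin "a"

print(genuine == lookalike)
print(sorted(scripts_used(genuine)))
print(sorted(scripts_used(lookalike)))
```

In most fonts the two strings render identically, yet they compare unequal and the second one mixes two scripts.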

  12. Actually, it makes all kinds of things go funny. 

    Look at the text Cory has quoted from the article – in particular the error message from Google. Now go and look at the error message from Google on the original site. They are different. I discovered this as it was different again when *I* was quoting it on my blog…

    Original article has this backwards:

    “cod.exe is an executable file. For security reasons, Gmail does not allow you to send “this live” type of file.

    Cory has this backwards:
    “cod.exe is an executable file. For security reasons, Gmail does not live” allow you to send “this type of file.

    Ohhh. It has to do with wrapping and probably other hidden characters – like a carriage return or something. I’ve just noticed in text editor that when expanding or contracting the document size, things have gone wrong. Ah yes, there it is. Trusty old vim – there’s a U+202C in there as well – the old “pop directional formatting” char. Ouch. So confusing. 


  13. Ah, you can see that the word “live”/“evil” (from “evilexe.doc”) is out of context in both sentences – after the 202E something has appended a 202C to get back to the way it was – I’d be interested to discover at what point in the process this was introduced. Also, whether it’s the standard to pop the directionality rather than to re-override the directionality. Was this done by the Windows OS programmers independently, or is it dictated by the Unicode standard? Hmmm. If only I had a spare couple of hours to work this out – anyone?
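For anyone without vim to hand, invisible formatting characters such as U+202E and U+202C can be made visible with a few lines of Python. This is a sketch only: `reveal` is a hypothetical helper, and the sample string is an assumption about what the quoted filename might have contained, not a capture of the real message.

```python
import unicodedata


def reveal(text):
    """Replace invisible format characters (category Cf) with a readable tag."""
    parts = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            parts.append(f"<U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}>")
        else:
            parts.append(ch)
    return "".join(parts)


# Assumed sample: the disguised name followed by a pop-directional-formatting mark.
sample = "evil\u202ecod.exe\u202c"
revealed = reveal(sample)
print(revealed)
```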
