Good-Turing method finally improved upon

Sixty-or-so years after Alan Turing and I.J. Good invented the Good-Turing method for modeling the probability distributions behind data streams as part of the Allied code-breaking effort, researchers have discovered the limits of its usefulness and produced a replacement method that transcends them:

The German Enigma encryption machine used a huge number of decryption keys, making it almost impossible to crack the code. British intelligence had gained possession of Enigma machines, had determined how they worked and had even obtained a copy of the full book of keys. Some messages had been decrypted and the keys used recorded, so that the code breakers had a small sample from a very large set of keys. But it was unlikely the Germans would continue to use the same keys, so some method of assigning a probability distribution to the keys not yet used was needed…

Orlitsky was able to discover this limit by quantifying the problem in terms of the positive integers. The nature of the sample set is actually irrelevant to the probabilistic algorithm. What matters is the order in which outcomes appear and how often they appear. So a sample sequence such as giraffe, giraffe, elephant, giraffe, zebra would be encoded in numbers as 1,1,2,1,3. Every time a new item appears, it is assigned the next highest number, so that this mathematical model, according to its creators, can capture the worst possible problem: one in which there is an infinite number of hidden data items.
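The encoding described above is simple to sketch in code. Here's a minimal illustration (the function name `pattern` is my own label, not terminology from the article): each distinct item gets the next unused integer the first time it appears, and repeats reuse the item's existing number.

```python
def pattern(seq):
    """Encode a sequence as integers by order of first appearance.

    Hypothetical helper illustrating the encoding in the article:
    the first new item becomes 1, the next new item 2, and so on;
    repeated items keep the number they were first assigned.
    """
    ids = {}      # maps each item to its assigned integer
    encoded = []
    for item in seq:
        if item not in ids:
            ids[item] = len(ids) + 1  # next unused positive integer
        encoded.append(ids[item])
    return encoded

print(pattern(["giraffe", "giraffe", "elephant", "giraffe", "zebra"]))
# → [1, 1, 2, 1, 3]
```

Because the encoding discards what the items actually are, only their order and frequency of appearance remain, which is exactly why the analysis can be carried out over the positive integers.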

Link

(via Smart Patrol)