![]() |
A fast and automated cryptogram solver by Edwin Olson. A stand-alone version of the program is available here. Online Help is also available. |
How does it work?
A fairly thorough and technical description of the method can be found in my paper written for CRYPTOLOGIA. My full list of publications (including bibtex) is also available.
This program implements a dictionary attack. Words in a list of known English words are compared against the words in the puzzle. Those words that have the same "letter patterns" as a scrambled word are considered to be candidates. For example, ADAPTABLE, EMERGENCY, and NONLINEAR all have the same letter pattern ABACDAEFG and would be candidates for the scrambled word RGRXSRKHM.
The classic algorithm is:
unscramble(cipherwords)
unscramble(cipherwords, 0, empty-map);
unscramble(cipherwords, depth, map)
if (depth > number of cipherwords)
displaysolution(cipherwords, map);
return;
let C be the set of unencrypted words which could map to cipherwords[depth]
for each word W in C
if map is consistent with cipherwords[depth]==W
unscramble(cipherwords, depth+1, map + cipherwords[depth]==W);
Cipherwords is an array of encrypted words. A map is a collection of letter assignments, e.g., "A=R, B=M". When we say that a map is consistent with a word mapping, we mean two things:
- No encoded letter maps to two different letters
- No decoded letter is mapped to twice.
This is the basic brute-force algorithm, common to several cryptogram solvers.
Note that for this basic attack, there's absolutely no statistic gathering being done. For example, we would never guess that R=E because there are a lot of Rs in the puzzle. Many other solvers do this, but for very short puzzles, you just can't rely on the distribution of letters to be close to the "average" distribution. There is, however, a LOT of information in letter patterns; words which have rare letter patterns make the puzzle easy to solve. For example, the scrambled word RNAB has 2084 possible words according to my standard dictionary file. It's a very common letter pattern. Conversely, the scrambled word TNNAASSVSM has just one candidate: BOOKKEEPER. If you saw TNNAASSVQM in a puzzle, you would know immediately that T=>B, N=>O, A=>K, S=>E, V=>P, and M=>R. This is pretty close to how a human solves puzzles-- looking at letter patterns-- and the computer is extremely proficient at it.
Basic Optimizations
Special-purpose dictionary
The dictionary is presorted according to letter pattern. In other words, EMERGENCY is before ALPHABET because ABACDAEFGH is before ABCDAEFG. Likewise, ADAPTABLE, ENERGETIC, and NONLINEAR are all found right next to EMERGENCY since they have the same letter pattern. Just as you can quickly find a word in a normal dictionary by taking advantage of the alphabetization, the computer can quickly find the list of candidate words due to its sorting scheme. We do not need to load the whole dictionary into memory either; we can do a binary-search for a letter pattern.
Set intersection (wordlist pruning)
It's obvious that more candidate words for each scrambled word will increase the amount of work needed to find all possible solutions. Therefore, if you can find a way to reduce the number of candidates you can speed up the search. Suppose you see the ciphertext word XRXJPLXWO whose letter pattern is ABACDAEFGH. Now, suppose that the dictionary told you that the list of candidate words was ADAPTABLE, ENERGETIC, EMERGENCY, and NONLINEAR. We therefore know that X must be either A, E, or N. Now, suppose that we look at another ciphertext word in the same puzzle. Suppose we perform a similar analysis on that word and find that X must be either E, N, or Y. We then know that X must be either E or N, the intersection of (A,E,N) and (E,N,Y). We can then eliminate every candidate word which needs X to map to something other than E or N. We perform this analysis over and over until we can no longer reduce the number of candidate words. In practice, this process is extremely fast, taking only a few milliseconds, and can prune out most of the candidate words.
This "candidate reduction" step is performed every time we assign a word in the ciphertext to be a particular word in the dictionary. The number of candidates for all of the remaining unknown ciphertext words goes down very rapidly, since we rapidly accumulate mappings (X=E, Y=R, etc.). The "secret sauce" is the data structure the algorithm uses (roughly a "stack"), which allows the candidate lists to be efficiently updated as we move both up and down in the search tree.
Search reordering
Another basic optimization is to change the order in which we consider words. We rank the words in terms of their "value": words which give us a lot of letter assignments that are used in other words have a high value, since figuring them out will help us solve the rest of the puzzle. Searching for high-value words first can greatly decrease the search time because it rules out many possibilities much earlier. For example, "BOOKKEEPER" is the only word with its pattern; if it is in your puzzle, you immediately win 6 letter mappings in zero time. It would be wise, then, in a puzzle like "BOOK BORE ROOK, SPOKE BOOKKEEPER", to look at "BOOKKEEPER" first, since it implies the solution to "BOOK BORE ROOK", which would other wise be comparatively difficult.
Partial solutions
On hard puzzles that contain many non-dictionary words, we can optionally return "partial solutions". A partial solution occurs when the solver "gets stuck" (e.g., both A=Z and R=Z, or there is no legal mapping for A). Prior to getting stuck, the solver probably had a number of letter assignments-- these comprise a partial solution. Often, the partial solutions are enough for a human to discern the true solution. Partial solutions lend robustness to non-dictionary words, but modestly slow down the solver.
Solve for a single letter instead of a word
In the common case, it is better to search for whole words at a time. An N letter word might have C candidates to choose from. In other words, after C iterations, we win N mappings. In contrast, if we searched letter by letter, it requires about 26^N iterations to achieve N mappings. 26^N is almost always much larger than C, meaning that it is better to substitute whole words at at time.
However, we have the option of guessing only one letter at a time. This will require 26^1=26 iterations, yielding 1 mapping. In terms of the amount of work per letter mapping, this is much worse than mapping a whole word. But critically, after guessing a single letter, we get to eliminate a lot of word candidates elsewhere in the puzzle, and this can more than offset the greater cost of solving one letter at a time.
Once the puzzle has a few letters mapped, the number of candidates words starts getting very small (10-100, for example). At that point, it is no longer advantageous to solve only one letter at a time, since we can now map several letters in just a few iterations.
Improving the user experience
It is often the case that a given cryptogram has multiple valid solutions. By "valid", we mean that each word in the puzzle has been mapped to a word in the dictionary.
We improve the user experience by "ranking" these solutions, so that ones that "seem more likely" will be listed first. This is particularly valuable with "partial solutions", described above.
Currently, we score solutions using a table of trigram probabilities, i.e., we consider all three-letter sequences in the recovered text. This will tend to penalize solutions that contain bizzare words ("pyx"), and promote those that contain common words ("the"). Trigrams are especially useful since they can span multiple words (if only barely). Taking into account whole word probabilities is possible, but has not been done yet.
Some puzzles have an enormous search space, even after applying all of our tricks. We employ a simple heuristic: when matching words, we pick the most probable matches first (RWN=THE before RWN=VEX). While this doesn't reduce the search space, it does decrease the amount of time (on average) before a useful solution pops up.
