Cryptography: Solution to Ciphers 2 and 3

Recall that our first original text is:

Cryptogram 2

ΝΖΓΟΨΓΘΕΥΖΦΟΞΟΓΤΝΘΡΤΟΣΘΕΑ ΝΡΧΝΦΖΞΧΔΨΡΝΤΧΝΖΧΝΡΤΟΦΔΔ ΧΘΩΟΨΘΟΖΧΝΡΞΑΨΝΤΡΦΝΡΝΧΡΩ ΥΨΥΞΡΞΟΩΝΦΘΨΝΖΞΧΥΡΞΨΟΔΕΓ ΘΓΝΤΘΤΟΒΨΡΞΝΖΧΝΩΕΔΨΥΦΟΧΥ ΤΘΓΤΥΤΘΡΞΘΕΓΔΜΟΡΝΖΧΥΞΩΟΞΡ ΞΨΟΔΕΞΥΓΔΞΘΧΟΡ

The key to this one is as follows:

ΑΒΧΔ ΕΦΓΗ ΙΚΛΜ ΝΟΠΘ ΡΣΤΥ ΩΞΨΖ
ΟΠΒΣ ΘΑΓΔ ΡΜΕΦ ΤΝΩΓ ΨΞΧΖ ΥΛΚΙ

And the solution is from Romans 11:25-26:

ΟΥΓΑΡΘΕΛΩΥΜΑΣΑΓΝΟΕΙΝΑΔΕΛΦΟΙΤΟΜΥΣΤΗΡΙΟΝΤΟΥΤΟΙΝΑΜΗΗΤ ΕΠΑΡΕΑΥΤΟΙΣΦΡΟΝΙΜΟΙΟΤΙΠΩΡΩΣΙΣΑΠΟΜΕΡΟΥΣΤΩΙΣΡΑΗΛΓΕΓΟΝΕ ΝΑΧΡΙΣΟΥΤΟΠΛΗΡΩΜΑΤΩΝΕΘΝΩΝΕΙΣΕΛΘΗΚΑΙΟΥΤΩΣΠΑΣΙΣΡΑΗΛΣ ΩΘΗΣΕΤΑΙ

Which if we add spaces becomes:

ου γαρ θελω υμας αγνοειν αδελφοι το μυστηριον τουτο ινα μη ητε παρ εαυτοις φρονιμοι οτι πωρωσις απο μερους τω ισραηλ γεγονεν αχρις ου το πληρωμα των εθνων εισελθη και ουτως πας ισραηλ σωθησεται.

Now to give you the real challenge....

Cryptogram 3

Recall that this one included no fewer than eight samples to help you:

Σ=$ΑΥΒΛΩΑΙ*ΔΛΜΥΑΨΜΥΩΞΑΖΥ~ΨΑΥ%ΜΗ+ΥΠΝΞΑΖ**ΜΗΩΠΓΖΛ~ΝΩΗ ΩΔΨΞΑΖΥΨ~ΑΥ%Μ\ΓΔΜΛΨΝΞ*ΑΖΜΛ~ΚΛΜΛ~

ΒΛΤΑΥΜΥ~%Ε%ΧΑΥ+ΠΟΑ*ΙΔΩ\ΚΑΝΓΖΛΩΑΥΩ

ΧΑΟΔΒΠ*ΥΒΖΥ~Ε%Σ=ΟΟ**ΛΩΗ+ΛΨΚΑ=ΠΚΔΩΑΥΩ%Υ

^ΚΜΗΨΗΚΠΚΘ=ΔΩ%ΩΒΖΔΩΦ=ΛΩΑΥ*ΒΗ^ΒΑΦΗΟΔ~ΗΨΜ%Ψ\Β\ΨΠΝΜΔΩ

ΞΖΛΜΑΖ%ΞΠΩΜΔΩΑΚΜΥ~ΜΠΥ+Υ%$ΨΝΩΑ~ΥΨΓΖΛΩΗΨΑΔ~ΑΤΠΥΔΩΛΨ

ΑΥΒΑΜΥΨΝΕΔΩΟΑΥΞΑΜ%Υ+ΥΠ~ΠΥΜΑΥΜΔΞΠΖΠΜ\ΒΥΒΛΩΜΛΨΧΑ\Ξ%~Υ Ω%ΞΟΔΨ$^ΛΩΑΥΒΥΦΛΩΜΣ=Λ~$ΒΛΧ**ΗΨΑΜΠΥ%ΝΜΔΠΥΜΑΥΜΔΒΑΑΩΞΥ ~ΜΑΥ^ΒΑΩΒΥ%ΚΖΥΩΛΕ*ΑΩΛΨΛΙΠΖΒΥΓ=%ΚΖΥΩΛΕ*ΑΩΛ~ΑΛΥΚΑΩΚΟΝ ΒΔΩΥΧ%ΟΠΨ~ΗΨΠΩΑΕΥΦΛΕΑΩΔ$ΖΥΞΥΦΛΕΑΩΔ

ΛΙ%ΖΜΛΥΞΖΛ*^ΧΑΝΨΗΞΠΩΜ%ΑΞΥΜΖΛΞΑΝ\~%ΜΠΧΗΩΜΜ%ΞΖΛΩΡ=ΛΥ%Ξ ΩΑΝΣ=ΕΠΑΩΧ*ΑΖΕΛΩΔΨΞΑΖΛΖΙ%ΩΛΩΝΞΛΡ%ΟΟ\~ΠΜΗΓΝΨΑΥΠΞΠ~ΥΕ ΑΜΓ=ΑΒΔΚΑΩΠΨΔΕΜ\ΟΛΙ\ΕΑΜΑ~ΘΑΒΑΑΚ**%ΨΜΛΩ\ΞΑΖΗΒΝΩΠΜΛ

ΗΒΑ~ΞΓΥ%ΞΛ**ΧΑΩΑΝΖΑΧΗΞΛΥΛΨΒΑΜΛΞΛΨΑ~ΜΥ**ΩΜΗΨΑΞΥΨΜΗ^~ \ΚΛΥΒΑΩΡΖΛΜΛΨΛΒΑΩΠ*ΝΜΗ~\ΒΑ^ΑΝΖΑΧΗΑΩ%ΩΧΖΔΞΛΥ~Λ=

A nomenclator consists of two parts: A cipher key and the additional encoding elements. The cipher key to the nomenclator is as follows (sorry about the disordered letters -- blame a spreadsheet that doesn't sort in Greek...).

ΑΒΧΔ ΕΦΓΗ ΙΚΛΜ ΝΟΠΘ ΡΣΤΥ ΩΞΨΖ
ΠΡΘΒ ΑΓΙΗ ΥΚΟΕ ΩΛΞΧ ΖΨΜΝ ΔΤΣΦ

But we have our eight other symbols, which are as follows:

*Null (i.e. simply ignore)
=delete preceding character
$ΚΑΙ (whole word only)
\ΟΥ (as letters as well as a whole word)
^ΜΗ (as letters as well as a whole word)
%Α (second version)
~Σ (second version)

So the plaintext of our several messages is as follows:

Cryptogram 3 Solution

Additional Hints and Techniques

All of the above assumes, at least in outline, that you know what encryption method is used. What if you don't?

For the elementary ciphers of ancient times, it's surprisingly easy to get a clue to the messages. The starting point is always to determine the number of symbols. If the message is in Greek, and has only 24 (or fewer) symbols, then it is an alphabetic system of some sort: Either a substitution or a transposition. If it has more than 24, then it is probably a nomenclator (either that, or it uses numerical symbols or symbols for punctuation). If it is a nomenclator, of course, you're going to have to start collecting additional samples.

For the case with only 24 or fewer symbols, then there are three basic possibilities: transposition, monalphabetic substitution (where one letter in the ciphertext always represents the same letter in the plaintext), or polyalphabetic (where one letter in the ciphertext represents different letters in the plaintext). A simple frequency count will tell you which one is used.
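A frequency count of this kind is easy to automate. Here is a minimal Python sketch; the function name letter_percentages is made up for this illustration and is not from any library. It ignores anything that is not a letter and reports each letter's share of the text as a percentage, most common first.

```python
from collections import Counter

def letter_percentages(text):
    """Return {letter: percent of all letters}, most common letter first."""
    letters = [ch for ch in text.upper() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {ch: 100.0 * n / total for ch, n in counts.most_common()}

# Print the table in the same "letter -- percent" style used in this article.
for letter, pct in letter_percentages("THIS IS A CIPHER").items():
    print(f"{letter} -- {pct:.1f}%")
```

Since it relies only on Python's own notion of a letter, the same sketch should work on Greek text as well as English.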

Recall our Greek frequency table. We listed it in terms of the most common letters. But let's list it in alphabetical order. That gives us

Α -- 11.0%
Β -- 0.6%
Γ -- 1.6%
Δ -- 2.0%
Ε -- 10.1%
Ζ -- 0.2%
Η -- 3.9%
Θ -- 1.7%
Ι -- 9.5%
Κ -- 3.3%
Λ -- 2.8%
Μ -- 2.6%
Ν -- 8.2%
Ξ -- 0.3%
Ο -- 10.1%
Π -- 3.1%
Ρ -- 3.3%
Σ -- 7.6%
Τ -- 7.4%
Υ -- 6.0%
Φ -- 0.6%
Χ -- 0.6%
Ψ -- 0.1%
Ω -- 3.2%

This is actually something we can graph. Here is the "shape" of the above frequency distribution (each * represents half a percent).

Α -- 11.0%**********************
Β -- 0.6%*
Γ -- 1.6%***
Δ -- 2.0%****
Ε -- 10.1%********************
Ζ -- 0.2%
Η -- 3.9%********
Θ -- 1.7%***
Ι -- 9.5%*******************
Κ -- 3.3%*******
Λ -- 2.8%******
Μ -- 2.6%*****
Ν -- 8.2%****************
Ξ -- 0.3%*
Ο -- 10.1%********************
Π -- 3.1%******
Ρ -- 3.3%*******
Σ -- 7.6%***************
Τ -- 7.4%***************
Υ -- 6.0%************
Φ -- 0.6%*
Χ -- 0.6%*
Ψ -- 0.1%
Ω -- 3.2%******
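If you want to draw the same sort of chart for your own counts, something like the following Python sketch will do. It assumes each bar is rounded to the nearest whole star, which is a guess about how the chart above was made; the helper name bar is invented for this example.

```python
def bar(percent, per_star=0.5):
    """One * for every half percent, rounded to the nearest whole star."""
    return "*" * round(percent / per_star)

# Three sample rows taken from the Greek table above.
for letter, pct in [("Α", 11.0), ("Β", 0.6), ("Ζ", 0.2)]:
    print(f"{letter} -- {pct}% {bar(pct)}")
```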

If an enciphered message matches this frequency distribution (i.e. Α Ε Ι Ν Ο Σ Τ are among the most common letters), then it is enciphered in a transposition cipher.

If the message fails the test, then sort the letters from most to least common. If we again graph our letters, sorting that way, we get this distribution:

Α -- 11.0%**********************
Ε -- 10.1%********************
Ο -- 10.1%********************
Ι -- 9.5%*******************
Ν -- 8.2%****************
Σ -- 7.6%***************
Τ -- 7.4%***************
Υ -- 6.0%************
Η -- 3.9%********
Κ -- 3.3%*******
Ρ -- 3.3%*******
Ω -- 3.2%******
Π -- 3.1%******
Λ -- 2.8%******
Μ -- 2.6%*****
Δ -- 2.0%****
Θ -- 1.7%***
Γ -- 1.6%***
Β -- 0.6%*
Φ -- 0.6%*
Χ -- 0.6%*
Ξ -- 0.3%*
Ζ -- 0.2%
Ψ -- 0.1%

If the message, upon sorting, reveals this sort of distribution, then it is a monalphabetic substitution.

If the frequency distribution is noticeably flatter than the above (that is, if the most common letters aren't as common as in the above table, and if the least common letters are more common than in the above), then chances are it's polyalphabetic. A typical early polyalphabetic technique is to use a keyword to shift between "Caesar" alphabets. (This is known as a Vigenère cipher; there are more sophisticated forms of this thing, where the alphabets aren't mere Caesar shifts, but we're trying to keep this short.) Let's take our old pal THIS IS A CIPHER as our plaintext, and use as our keyword the short phrase "MANY." What we do is encode the first letter of THIS IS A CIPHER with the M of Many, then the second letter using A, then the third with N, then Y, then use M again. So, for example, the Caesar alphabet corresponding to M is:

Plain:ABCDEFGHIJKLMNOPQRSTUVWXYZ
Cipher:MNOPQRSTUVWXYZABCDEFGHIJKL

The alphabet corresponding to A is the regular alphabet. That corresponding to N is:

Plain:ABCDEFGHIJKLMNOPQRSTUVWXYZ
Cipher:NOPQRSTUVWXYZABCDEFGHIJKLM

That corresponding to Y is:

Plain:ABCDEFGHIJKLMNOPQRSTUVWXYZ
Cipher:YZABCDEFGHIJKLMNOPQRSTUVWX

So we encode THIS IS A CIPHER as:

Plaintext:THIS IS A CIPHER
Keyword:MANY MA N YMANYM
Ciphertext:FHVQ US N AUPUCD

So THIS IS A CIPHER becomes FHVQ US N AUPUCD.
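The keyword procedure just described can be sketched in a few lines of Python. This is only an illustration under the same assumptions as the example (Roman alphabet, Caesar shifts, spaces passed through without using up a keyword letter); the name vigenere_encrypt is my own.

```python
from itertools import cycle

A = ord("A")

def vigenere_encrypt(plaintext, keyword):
    """Shift each letter by the matching keyword letter (A=0, B=1, ...)."""
    key = cycle(keyword.upper())
    out = []
    for ch in plaintext.upper():
        if "A" <= ch <= "Z":
            shift = ord(next(key)) - A
            out.append(chr((ord(ch) - A + shift) % 26 + A))
        else:
            out.append(ch)  # spaces do not consume a keyword letter
    return "".join(out)

print(vigenere_encrypt("THIS IS A CIPHER", "MANY"))  # FHVQ US N AUPUCD
```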

Now let's apply this knowledge. I've taken two messages (both in English, because it takes a spreadsheet to do this without error, and I don't have a spreadsheet that can operate using the Greek alphabet). Each message is enciphered three ways: Once with a transposition, once with a monalphabetic substitution, and once with a polyalphabetic substitution. We of course omit spaces and punctuation; line breaks in the items below are arbitrary.

Version 1:
XBYOPMKBCTYDXMJAWDWTXBYOCOMJBYEUDJMKKXBYOUBJPWTZZBWYOTJBNO
YMZNDANBRTZQZYDZTTYOTJBNOYKTYQZZYJBRTDWYDUBWBZOYOTXDJLXTMJ
TBWYDSBWAQHYOTWMYBDWZXDQWAZYDCMJTUDJOBPXODZOMKKOMRTSDJWTYO
TSMYYKTMWAUDJOBZXBADXMWAOBZDJHOMWYDADMKKXOBCOPMEMCOBTRTMWA
COTJBZOMVQZYMWAKMZYBWNHTMCTMPDWNDQJZTKRTZMWAXBYOMKKWMYBDWZ

Version 2:
XSAEUXSNDEUNHIOTLTIAECMWHNOILTILTAAWDEHTNRAOIWNWSEDNVRULASEO
RFLAYAHTROICEAGNEPGCOMANTRSEIIWFNMHLTLITDSUNSAAJTIATRNIEHGHS
DRAHENAHSICEDETSVOGIUSGSMCVEAHCYIHAIERTHEESHGITOTAHWONADLLOH
UROETTESVISLAIPRHWODOSNDFSEHIOTNTHINDHIWRNAOSIFEKRTNAROEIEWW
EBLTENRHTATONTANPIBUEHDOHHBELSOLVAAHNUTSOOIWDNSTRRWMOACFIHEO

Version 3:
MCBJWBCWYRYCTJLPCGBZAZEHBFDFOJSNQBBCZSVYVCRLYCWGNMEEHDPLUN
JUAIYEXWRRXIPCIETWHCIITGGROSZKMAVBJMSKAYCCRHUHZHCINZRJUHOX
UCVVYCZBZHUHENHMIACIWNZUMBVHUSUZGPPIVEZBVLBBMADVVZVMOQLHHN
UVIVDMVOJQKCOQCELARJARYDGGVCXFBIPDPUCWYQKERWTBSMVGYTEUCDBJ
SBMTSTYOFHXHXWXXPKHDRXAEZAHOSEHOQESJSHIJGXWXIXLVVPCYASGRBY

Looking at version 1, we find this distribution of letters:

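Letter:     A    B    C    D    E    F    G    H    I    J    K    L    M
Occurs:    11   26    7   23    2    0    0    3    0   16   13    1   26
Percent:  3.8  9.0  2.4  7.9  0.7  0.0  0.0  1.0  0.0  5.5  4.5  0.3  9.0

Letter:     N    O    P    Q    R    S    T    U    V    W    X    Y    Z
Occurs:     6   25    5    6    5    3   26    5    1   22   12   26   20
Percent:  2.1  8.6  1.7  2.1  1.7  1.0  9.0  1.7  0.3  7.6  4.1  9.0  6.9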

In version 2, the distribution is:

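Letter:     A    B    C    D    E    F    G    H    I    J    K    L    M
Occurs:    27    3    7   12   28    5    6   25   26    1    1   13    5
Percent:  9.0  1.0  2.3  4.0  9.3  1.7  2.0  8.3  8.7  0.3  0.3  4.3  1.7

Letter:     N    O    P    Q    R    S    T    U    V    W    X    Y    Z
Occurs:    23   23    3    0   16   21   26    8    5   12    2    2    0
Percent:  7.7  7.7  1.0  0.0  5.3  7.0  8.7  2.7  1.7  4.0  0.7  0.7  0.0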

In version 3, the distribution is:

Letter:     A    B    C    D    E    F    G    H    I    J    K    L    M
Occurs:    11   17   21    8   13    4   10   19   13   11    5    7   11
Percent:  3.8  5.9  7.2  2.8  4.5  1.4  3.4  6.6  4.5  3.8  1.7  2.4  3.8

Letter:     N    O    P    Q    R    S    T    U    V    W    X    Y    Z
Occurs:     7    9    9    6   12   12    7   12   18   10   12   13   13
Percent:  2.4  3.1  3.1  2.1  4.1  4.1  2.4  4.1  6.2  3.4  4.1  4.5  4.5

Can you tell, just by these numbers, which cipher is which? It's easy enough once you know what to look for. Observe version 2 first. Look at the figures for A, E, I, and T -- all in the vicinity of 9%. Compare the figures for Q and Z, which do not occur at all. This is the standard English distribution, with E T A the most common plus the "tail" of low-frequency letters at the end. Clearly this is the transposition cipher.

Version 1 is equally clearly the monalphabetic substitution; it doesn't follow the standard English frequency table, but it has much the same numbers: B, M, T, and Y all around 9%, and F, G, and I non-existent.

With two versions of the message, you should probably be able to solve the substitution cipher by inspection.

But what about version #3? Based on the first two encryptions, we know, in this "laboratory" setting, what it says. But what about in the real world?

Observe that the distribution is much more level than either #1 or #2. The rarest letter is F, which occurs four times; only one other letter, K, occurs fewer than six times. The most common letter, C, occurs only 7.2% of the time, and only two other letters, H and V, occur over 6% of the time. There can be no doubt, in this case, that the substitution is polyalphabetic. It's not an ideal polyalphabetic, in which all letters occur with roughly equal frequency, but with such a small sample, it will be hard to solve.

Fortunately, we aren't confined to such a small sample. I promised one other cipher using the same system. So let's try it.

Version 1:
KHNQBULBWGJZVOIDIZSBSFOSDLVSUVGIPUJERPUJTXLNAULVZMIQAKJLPOBCBAST
WONSWCJLUDGQWIKKSEVCLUNJCVVRFZFLEFQQRWMABIYKVXEMRZVOSBQUKMNOUFFZ
IXQQSLCDXZYTGCRFEVIZYRJCSAIJFVXHQLWZGOEZWRFLAYUFNVYCVTWYQWYUYRT
OHCMEVISLHQKIMITIUFHWXJOKHJDTUOPXZZNRYJOODMBVRFZFKJSTXUFUQAZDXP

Version 2:
XEUNSXYNUDUTONLEUFNTTARUFIILLNWLIBAORGSDEOFHOOTYITRITAGSEHNAEANIMS
NFKNOEETSNURPAOICYETHTCHTAFOFDYANATRAOEKTSFIILLRILIBAENNSETIHICEKT
SFCMRIYAORFTLLWEBAFIFELEYTEWGRRNEHIAILBOEHSSTFMWTHPRTHGNIEIINLONEO
YWKLERTLGDACACUJIRRLEVIOFHEEWYNISIHEMWHCTEITRONOTIEHTDALEWAWLSOLYA

Version 3:
QWUDJYQWMYTKEUDJYOTNDDAZTWZTDUPMWLBWAYOTUMCYDUYOTBJUMKKBSBKBYEBZ
UMJUJDPCMJJEBWNYOTXTBNOYBWYOTBJHJMCYBCMKVQANTPTWYXOBCOBZMKXMEZMK
KDXTAYDBYBWYOTDJEUDJXOBKTTRTJEDWTXTKKLWDXZOBPZTKUYDSTUMKKBSKTUT
XYOBWLBYWTCTZZMJEYDYMLTMWEHJTCMQYBDWZMNMBWZYYOTBJDXWUMKKBSBKBYE

Again, the frequency distribution:

Version 1:

Letter:     A    B    C    D    E    F    G    H    I    J    K    L    M
Occurs:     7    8    9    7    7   14    5    6   13   12    9   12    7
Percent:  2.8  3.1  3.5  2.8  2.8  5.5  2.0  2.4  5.1  4.7  3.5  4.7  2.8

Letter:     N    O    P    Q    R    S    T    U    V    W    X    Y    Z
Occurs:     7   12    5   12   10   12    8   15   15   10    9    9   14
Percent:  2.8  4.7  2.0  4.7  3.9  4.7  3.1  5.9  5.9  3.9  3.5  3.5  5.5

Version 2:

Letter:     A    B    C    D    E    F    G    H    I    J    K    L    M
Occurs:    19    4    7    5   28   13    5   13   26    1    4   18    4
Percent:  7.2  1.5  2.7  1.9 10.6  4.9  1.9  4.9  9.8  0.4  1.5  6.8  1.5

Letter:     N    O    P    Q    R    S    T    U    V    W    X    Y    Z
Occurs:    19   17    2    0   15   12   24    7    1    9    2    9    0
Percent:  7.2  6.4  0.8  0.0  5.7  4.5  9.1  2.7  0.4  3.4  0.8  3.4  0.0

Version 3:

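Letter:     A    B    C    D    E    F    G    H    I    J    K    L    M
Occurs:     4   26    7   17    9    0    0    2    0   15   18    4   19
Percent:  1.6 10.2  2.8  6.7  3.5  0.0  0.0  0.8  0.0  5.9  7.1  1.6  7.5

Letter:     N    O    P    Q    R    S    T    U    V    W    X    Y    Z
Occurs:     5   13    4    4    1    4   27   13    1   17    9   24   11
Percent:  2.0  5.1  1.6  1.6  0.4  1.6 10.6  5.1  0.4  6.7  3.5  9.4  4.3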

Once again, let's analyze these. The one that sticks out like a sore thumb is version 2: 10.6% E, 9.1% T, 0% Q, 0% Z, 0.4% V. Pretty definitely a transposition.

Version 3 also has a strongly characteristic profile: 10.6% T, 10.2% B, 9.4% Y, 0% F, G, and I. Clearly it's a substitution.

And then there is version 1. No letter occurs more than 5.9% of the time, and nothing occurs less than 2% of the time. Again, clearly, a polyalphabetic substitution.

So now what? The monalphabetic substitution is trivial. What about the others?

We aren't going to go into details; that's the province of a real book on cryptography. Still, the transposition cipher shouldn't be too tough. There are some potentially useful hints here. The two samples we have of the transposition cipher have lengths 300 and 264, respectively (the totals of the two frequency tables above) -- interesting numbers, because both are multiples of 12, and 12 is the largest number that divides both. That's a strong hint that, in addition to being a transposition cipher, it operates on blocks of 12 units or some exact fraction of that figure (6, 4, 3, 2), with the most likely being toward the high end of that range. Presumably messages which don't have the right number of items are padded out somehow.

The real problem, though, is the polyalphabetic substitution. Long after experts had solved monalphabetic problems, polyalphabetic substitutions were thought impossible; to crack them, you needed the key phrase, and there was no place to attack it.

In fact, there was a point of attack. It requires, however, a much larger sample of text than ordinary monalphabetics, with the factor increasing at about the same rate as the (unknown) length of the keyword. And different messages can (and should) use different keywords, so even if you crack one message, you can't assume that you have cracked the rest.

Is it possible to determine if two messages have the same keyword? The answer, surprisingly, is yes. This follows from the work of W. F. Friedman around the time of the First World War. Friedman's work, in fact, is absolutely general: It can discover if any two messages were encoded using identical methods.

This follows from the fact that some letters are more common than others. What does this mean? Consider what happens if you take two random sets of letters. Simple probability says that, if we line them up letter by letter, you will get the same letter at any given position only one time in 26 (about 4%) for the Roman alphabet, or one time in 24 (also around 4%) in Greek.

Is that how it works in reality? Not at all. Because some letters are more common than others, two meaningful English messages will correspond not one time in 26, or 3.85% of the time, but almost exactly one time in 15, or 6.67% of the time.

We can demonstrate this. I took a King James Bible and randomly opened it and pointed to a verse to get two test passages. The first proved to be Ezekiel 39:2f. --
And I will turn thee back, and leave but the sixth part of thee, and will cause thee to come up from the north parts, and will bring thee upon the mountains of Israel: (3) and I will smite thy bow out of thy left hand, and will cause thine arrows to fall out of thy right hand.

Passage 2 is John 5:28ff. --
Marvel not at this: for the hour is coming, in the which all that are in graves shall hear his voice, (29) and shall come forth; they that have done good, unto the resurrection of life; and they that have done evil, unto the resurrection of damnation. (30) I can of mine own self do nothing....

The table below lines up these two passages one above the other; where the same letter occurs in the same position in each passage, it's marked *. We use the first 208 letters of each passage.

ANDIWILLTURNTHEEBACKANDLEAVEBUTTHESIXTHPARTOFTHEEAND
MARVELNOTATTHISFORTHEHOURISCOMINGINTHEWHICHALLTHATAR
        *
WILLCAUSETHEETOCOMEUPFROMTHENORTHPARTSANDWILLBRINGTH
EINGRAVESSHALLHEARHISVOICEANDSHALLCOMEFORTHTHEYTHATH
 *   *    *                                       **
EEUPONTHEMOUNTAINSOFISRAELANDIWILLSMITETHYBOWOUTOFTH
AVEDONEGOODUNTOTHERESURRECTIONOFLIFEANDTHEYTHATHAVED
    **     ***        * *       *      **
YLEFTHANDANDWILLCAUSETHINEARROWSTOFALLOUTOFTHYRIGHTH
ONEEVILUNTOTHERESURRECTIONOFDAMNATIONICANOFMINEOWNSE
  *                 *  *                 **

In all, there are 21 letters marked * out of 208 letters total -- a shade over 10%.

Compare what happens when we use random letters -- in this case, for the second message, I just repeated the alphabet eight times:

ANDIWILLTURNTHEEBACKANDLEAVEBUTTHESIXTHPARTOFTHEEAND
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
*                                            *
WILLCAUSETHEETOCOMEUPFROMTHENORTHPARTSANDWILLBRINGTH
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
              *                        *
EEUPONTHEMOUNTAINSOFISRAELANDIWILLSMITETHYBOWOUTOFTH
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
       *                  *                   *
YLEFTHANDANDWILLCAUSETHINEARROWSTOFALLOUTOFTHYRIGHTH
ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
                          *          *

This gives us a mere nine hits in 208 letters -- not quite 4.5%, or about what we would expect of unrelated encryptions.

This is only an example, but the rule is general: messages where we have more than the expected level of correspondence of about 4% have a meaningful relationship (such as being in the same language). There is, of course, a lot of associated math to deal with the amount of error from the norm, but we can let that slide.
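Counting these matches by hand is tedious and error-prone, so here is a small Python sketch of the comparison. The function name coincidence_rate is invented for this illustration; it strips everything but letters, lines the two texts up, and counts the positions where they agree.

```python
def coincidence_rate(a, b):
    """Return (hits, positions compared, hits / positions) for two texts."""
    a = [ch for ch in a.upper() if ch.isalpha()]
    b = [ch for ch in b.upper() if ch.isalpha()]
    n = min(len(a), len(b))          # compare only the overlapping part
    hits = sum(1 for x, y in zip(a, b) if x == y)
    return hits, n, hits / n

# First row of the Ezekiel/John table above: one match in 52 positions.
ezekiel = "ANDIWILLTURNTHEEBACKANDLEAVEBUTTHESIXTHPARTOFTHEEAND"
john = "MARVELNOTATTHISFORTHEHOURISCOMINGINTHEWHICHALLTHATAR"
hits, n, rate = coincidence_rate(ezekiel, john)
print(hits, n)  # 1 52
```

Fed the full 208 letters of each passage, the same function reproduces the 21 hits counted above.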

But, I can hear you objecting, the message is ciphered! That will destroy the correspondence between the letters.

But that's the whole point. As long as the two messages are enciphered using the same cipher method, so that the same letter at the same position in the plaintext will produce the same cipher letter at the same cipher position in the encrypted version, the frequency argument applies. If a plaintext A in a given position becomes a Q (say), then you'll have a Q at that position in both messages, and you'll still have a correspondence.

In fact, taking our two samples of a polyalphabetic above, we find that in 254 letters, there are 13 correspondences, or 5.1%. That's a close call -- about halfway between the expected values -- but it appears that there is correspondence. (If we really cared, there is a test, the phi test, to determine if the correspondence is statistically significant, but we aren't really that concerned.)

The trick, then, is to find the keyword or keyphrase used to encrypt the passage. The easiest way to do this, if we have enough text, is to determine the length of the phrase, and then do frequency analysis on the individual letters of the phrase. That is, if we know the key phrase is (say) 10 letters long, then letters #1, #11, #21, #31, etc. are all enciphered using the same alphabet; #2, #12, #22, #32 are enciphered with another alphabet, #3, #13, #23, #33 with a third, and so forth.

Can we determine the key length? Quite possibly.

The trick for this is to look for strings of similar letters. It's possible that one could have, e.g., the string FYSI occur twice in a message -- but the chances are much higher that it is the same four-letter plaintext enciphered with the same four letters of keytext. So the trick is to search messages for strings of repeated letters (preferably blocks of three or more).

In the case of the first cipher, we find HCI at positions 77-79 and 106-108 and XWX at 245-247 and 274-276.

In the second cipher, we have the extremely significant 5-letter sequence VRFZF at positions 91-95 and 236-240. The longer the sequence, the higher the odds that it represents an actual correspondence. A two-letter repeat may be significant but will often occur by coincidence. It is rare, though not unknown, for a 3-letter repeat to be coincidence. A 5-letter repeat, however, is almost certainly the result of an actual alignment -- though there is a famous instance of such a long coincidence puzzling a famous cryptographer for a very long time.

Now we look for the repeat length. That's found by measuring the distances between our repeat blocks and taking prime factors:

106-77 = 29 = 29x1
274-245 = 29 = 29x1
236-91 = 145 = 29x5

Since 29 is a factor of all our repeats -- indeed the only common factor other than 1, which can't be the answer since a repeat length of 1 represents a monalphabetic substitution -- we know with near-certainty that we have a polyalphabetic substitution with key length 29.
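The search for repeats and the factoring step can both be sketched in Python. The names repeat_distances and common_factor are made up for this example. One caution: on a real message, a purely accidental repeat will drag the common factor down (often to 1), so in practice you would weed out the short, doubtful repeats before taking the divisor.

```python
from functools import reduce
from math import gcd

def repeat_distances(text, size=3):
    """Map each repeated block of `size` letters to the gaps between its occurrences."""
    seen = {}
    for i in range(len(text) - size + 1):
        seen.setdefault(text[i:i + size], []).append(i)
    return {block: [b - a for a, b in zip(pos, pos[1:])]
            for block, pos in seen.items() if len(pos) > 1}

def common_factor(distances):
    """Greatest common divisor of all the gaps."""
    return reduce(gcd, distances)

# The three gaps measured above: HCI, XWX, and VRFZF.
print(common_factor([106 - 77, 274 - 245, 236 - 91]))  # 29
```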

So we just line up our two messages in blocks of 29 characters:

MCBJWBCWYRYCTJLPCGBZAZEHBFDFO
JSNQBBCZSVYVCRLYCWGNMEEHDPLUN
JUAIYEXWRRXIPCIETWHCIITGGROSZ
KMAVBJMSKAYCCRHUHZHCINZRJUHOX
UCVVYCZBZHUHENHMIACIWNZUMBVHU   <-- Message 1
SUZGPPIVEZBVLBBMADVVZVMOQLHHN
UVIVDMVOJQKCOQCELARJARYDGGVCX
FBIPDPUCWYQKERWTBSMVGYTEUCDBJ
SBMTSTYOFHXHXWXXPKHDRXAEZAHOS
EHOQESJSHIJGXWXIXLVVPCYASGRBY

KHNQBULBWGJZVOIDIZSBSFOSDLVSU
VGIPUJERPUJTXLNAULVZMIQAKJLPO
BCBASTWONSWCJLUDGQWIKKSEVCLUN
JCVVRFZFLEFQQRWMABIYKVXEMRZVO   <-- Message 2
SBQUKMNOUFFZIXQQSLCDXZYTGCRFE
VIZYRJCSAIJFVXHQLWZGOEZWRFLAY
UFNVYCVTWYQWYUYRTOHCMEVISLHQK
IMITIUFHWXJOKHJDTUOPXZZNRYJOO
DMBVRFZFKJSTXUFUQAZDXP

Again, at this point the standard approach is to read down each column and take frequency analysis of that. Sadly, our sample is awfully short -- 19 letters for most columns, and in a few cases, only 18. This is not necessarily as bad as it sounds. We assume (since this is a primitive cipher) that there is a meaningful keyphrase, capable of being remembered -- and hence reasoned back to.
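Reading down the columns is a one-line slice in Python. The sketch below (function names are my own) splits a ciphertext into one string per key letter and counts the letters in each; it assumes the text has already been stripped of spaces and that the messages being combined each start at the beginning of the keyword.

```python
from collections import Counter

def columns(ciphertext, key_length):
    """Column i holds every letter enciphered with the i-th key letter."""
    return [ciphertext[i::key_length] for i in range(key_length)]

def column_counts(ciphertext, key_length):
    """Letter counts for each column, most common first."""
    return [Counter(col).most_common() for col in columns(ciphertext, key_length)]

print(columns("ABCDEFG", 3))  # ['ADG', 'BE', 'CF']
```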

There are several possible lines of attack. One is to operate based on our found repeats. In the first cipher, we had two repeats of three letters. It's not too bad a bet to assume these, or at least one of them, represents the word "the." If we assume a Caesar cipher based on the keyword (the standard form of early Vigenère), then knowing the plaintext, we can reason back to the keyword.
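Reasoning back from a guessed word to the keyword is just subtraction. Here is a minimal Python sketch, assuming the Caesar-shift form of the cipher described above; key_fragment is a name invented for the illustration.

```python
def key_fragment(cipher_block, guessed_plain):
    """Keyword letters that would turn guessed_plain into cipher_block."""
    A = ord("A")
    return "".join(chr((ord(c) - ord(p)) % 26 + A)
                   for c, p in zip(cipher_block, guessed_plain))

print(key_fragment("HCI", "THE"))  # OVE
```

A fragment like OVE looks like part of a real word, which is encouraging. Trying the same guess on XWX gives EPT, which is less promising; the guess AND gives XJU instead. Both OVE and XJU turn out to fit the keyword revealed at the end of this article.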

Another approach is to take the frequency analysis for each column and guess that the most common letter is E and see what that yields. It's almost certain that that won't work in at least some of those instances, given the small text of the sample, but it might give us a clue. For example, the distribution in column 1 is
J, S, U - 3; K, V - 2; B, D, E, F, I, M - 1
That's not much help, since, of course, we have three letters which are most common. Odds are, though, that those three include three of the eight A, E, I, N, O, R, S, T. That's a lot of possibilities (336, to be exact, if we count every way of assigning three of the eight letters to our three), but we're assuming a Vigenère cipher, which means that the letters should be in order. So we can seek an alignment. Our three letters are:
.........J........S.U.....
We expect them to align with
A...E...I....NO..RST......

That is, since we're assuming that every letter in the set {J,S,U} corresponds to a letter in the set {A,E,I,N,O,R,S,T}, and we assume that the letters in the set {J,S,U} are in order, it must be that, if we list them in order and compare against the letters of the alphabet, then every letter in the set {J,S,U} will match a letter in the other set. So we start trying solutions. For example, taking the above case (which assumes no cipher), we have
.........J........S.U.....
A...E...I....NO..RST......

This gives us a hit at S=S, but no correspondence at J or U; this is no good. So we shift the top alphabet over one letter, giving us:
..........J........S.U....
A...E...I....NO..RST......

This time, we have a hit at S=T, but no match for J or U. So we keep looking. Sparing you the details, for column one there is only one solution which gives us hits on all three letters:
PLAIN: A...E...I....NO..RST......
CIPHER:........J........S.U.....

It turns out, however, that this alignment is wrong. In fact, the alignment will turn out to be (note that we still have two hits):
PLAIN: A...E...I....NO..RST......
CIPHER:..S.U..............J......
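Sliding the two rows past each other is exactly the sort of job a computer does well. The Python sketch below tries all 26 shifts for the three letters J, S, U and counts how many land on one of the eight common letters. The function name hits and the numbering of shifts (0 for key letter A, 1 for B, and so on) are conventions chosen for this example.

```python
COMMON = set("AEINORST")

def hits(cipher_letters, shift):
    """How many cipher letters land on a common plaintext letter for this shift."""
    A = ord("A")
    plain = (chr((ord(c) - A - shift) % 26 + A) for c in cipher_letters)
    return sum(1 for p in plain if p in COMMON)

# Shifts for which all three letters hit: only shift 1 (key letter B).
print([s for s in range(26) if hits("JSU", s) == 3])  # [1]

# The alignment that turns out to be right is shift 16 (key letter Q).
print(hits("JSU", 16))  # 2
```

This bears out the warning above: with so few letters in a column, the best-scoring alignment is not necessarily the true one.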

Still, if we attack on enough such points, we can probably deduce the keyword. Since this is beyond our scope, we'll stop there and just give the solutions:

The first passage is from Abraham Lincoln's second inaugural address, given March 4, 1865, but still words worth remembering.

With malice toward none; with charity for all; with firmness in the right, as God gives us to see the right, let us strive on to finish the work we are in; to bind up the nation's wounds; to care for him who shall have borne the battle, and for his widow, and his orphan -- to do all which may achieve and cherish a just and lasting peace among ourselves, and with all nations.

The second is from John Stuart Mill's On Liberty, in the section "Of the Liberty of Thought and Discussion."

Unfortunately for the good sense of mankind, the fact of their fallibility is far from carrying the weight in their practical judgement, which is always allowed to it in theory; for while every one well knows himself to be fallible, few think it necessary to take any precautions against their own fallibility.

The transposition cipher operates in blocks of 12 characters, taking them in the order 4,8,12,11,7,3,2,6,10,9,5,1. These blocks are then further scrambled, with the last block, then the first block, then the next-to-last block, then the second block, then the second from the last, then the third, etc. This transposition, of course, requires a message with a length that is a multiple of 12 letters. So we pad the end: After the last character, we include a letter X (two, if there is room), and then random text.
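For those who would like to check the transposition against the ciphertext, here is a Python sketch of the two scrambling steps as I have just described them. The function names are invented for the illustration, and the padding is left to the caller since part of it is random.

```python
ORDER = [4, 8, 12, 11, 7, 3, 2, 6, 10, 9, 5, 1]  # 1-based positions within a block

def transpose_block(block):
    """Rearrange one 12-letter block in the order given above."""
    return "".join(block[i - 1] for i in ORDER)

def block_order(n):
    """Last, first, next-to-last, second, ... as 0-based block numbers."""
    order, lo, hi = [], 0, n - 1
    while lo <= hi:
        order.append(hi)
        if lo < hi:
            order.append(lo)
        lo, hi = lo + 1, hi - 1
    return order

def transposition_encrypt(padded):
    """Encrypt text whose length is already a multiple of 12."""
    if len(padded) % 12:
        raise ValueError("pad the message to a multiple of 12 letters first")
    blocks = [padded[i:i + 12] for i in range(0, len(padded), 12)]
    return "".join(transpose_block(blocks[i]) for i in block_order(len(blocks)))

print(transpose_block("WITHMALICETO"))  # HIOTLTIAECMW
```

The output HIOTLTIAECMW is the second group of twelve letters in the first transposition sample, as it should be: the first group is the scrambled last block, padding and all.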

The monalphabetic substitution uses this pattern:

Plaintext:  ABCDEFGHIJKLMNOPQRSTUVWXYZ
Ciphertext: MSCATUNOBVLKPWDHIJZYQRXFEG
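In Python, a substitution table like this one maps directly onto the built-in str.maketrans and str.translate. The sketch below enciphers and deciphers the opening words of the first message.

```python
PLAIN = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
CIPHER = "MSCATUNOBVLKPWDHIJZYQRXFEG"

ENCODE = str.maketrans(PLAIN, CIPHER)   # plaintext letter -> cipher letter
DECODE = str.maketrans(CIPHER, PLAIN)   # and back again

print("WITHMALICE".translate(ENCODE))   # XBYOPMKBCT
print("XBYOPMKBCT".translate(DECODE))   # WITHMALICE
```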

The keyword for the polyalphabetic substitution is QUICK BROWN FOX JUMPS OVER LAZY DOG. This is chosen deliberately because it contains so many different letters, which will tend to flatten the frequency distribution. Though, in fact, it has a disadvantage also: It's 29 letters long, which is prime; this means that repeats every 29 letters will stick out like sore thumbs; it's better to use keywords with lengths that have a large number of factors.
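With the keyword in hand, deciphering is the encryption run backwards. A minimal Python sketch, assuming a ciphertext of capital letters only and the Caesar-shift scheme described earlier (the function name is my own):

```python
def vigenere_decrypt(ciphertext, keyword):
    """Subtract each keyword letter's shift from the matching cipher letter."""
    A = ord("A")
    key = [ord(k) - A for k in keyword.upper() if "A" <= k <= "Z"]
    return "".join(chr((ord(ch) - A - key[i % len(key)]) % 26 + A)
                   for i, ch in enumerate(ciphertext))

KEYWORD = "QUICK BROWN FOX JUMPS OVER LAZY DOG"
print(vigenere_decrypt("MCBJWBCWYR", KEYWORD))  # WITHMALICE
```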

The above line of argument, of course, all depends on recognizing when two messages are encrypted with the same system. The test we described above is technically called the "kappa test." This is after the two parameters involved, κr (kappa sub r) and κp (kappa sub p). The former is the expected rate of correspondence between two random collections of letters (the subscript r stands for "random"); the latter is the rate of correspondence for two messages encrypted the same way (the subscript p is for "plaintext."). Calculating κr is trivial: It's one divided by the number of letters in the alphabet or symbol set. So, as noted above, κr for English is 1/26 or 0.038. For Greek, the figure is 1/24 or 0.042.

Calculating κp is only slightly harder if you know the frequency distribution for your selected language. For any alphabet with n letters, where f(i) represents the frequency of the ith letter,

κp = f(1)^2 + f(2)^2 + ... + f(n)^2

So in English, this would be the frequency of the letter A (about .08) squared, plus the frequency of B (about .01) squared, plus the frequency of C (about .03) squared, etc., through the frequency of Z. This rule applies for any language (even syllabic and ideographic languages, though of course those will have higher kappa figures). David Kahn's The Codebreakers reports that Russian yields a value of .053 for κp, while French is .078, German .072, Italian .074, and Spanish .078.

Looking back at our table for Biblical Greek, we find that κp works out to .071. Tables for Hebrew, Latin, Syriac, etc. are left as exercises for the reader, since I don't have frequency data for any of those languages.
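The Greek figure is easy to check. The Python sketch below squares and sums the percentages from the frequency table given earlier; the dictionary simply copies that table, and kappa_p is a name chosen for the example.

```python
GREEK_PERCENT = {
    "Α": 11.0, "Β": 0.6, "Γ": 1.6, "Δ": 2.0, "Ε": 10.1, "Ζ": 0.2,
    "Η": 3.9, "Θ": 1.7, "Ι": 9.5, "Κ": 3.3, "Λ": 2.8, "Μ": 2.6,
    "Ν": 8.2, "Ξ": 0.3, "Ο": 10.1, "Π": 3.1, "Ρ": 3.3, "Σ": 7.6,
    "Τ": 7.4, "Υ": 6.0, "Φ": 0.6, "Χ": 0.6, "Ψ": 0.1, "Ω": 3.2,
}

def kappa_p(percentages):
    """Sum of the squared letter frequencies (given as percentages)."""
    return sum((p / 100) ** 2 for p in percentages.values())

kappa_r = 1 / len(GREEK_PERCENT)   # one over the size of the alphabet

print(round(kappa_p(GREEK_PERCENT), 3))  # 0.071
print(round(kappa_r, 3))                 # 0.042
```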

There is another interesting footnote to this study, in that there is an actual test for determining if one has successfully decrypted a message -- the Shannon unicity test. When applied to a text which seems to be partially solved (e.g. a message shortened by leaving out unneeded vowels, as we might write THS IS A CPHR for THIS IS A CIPHER), it determines if the level of sense is more likely to arise from an actual decryption or from random factors. This test too is beyond the scope of this article. But readers dealing with real cryptographic systems should be aware that these tests exist.