CS472 Evolutionary Computation
Assignment 3
(A Permutation Problem)
250 points
DUE: Sat Oct 25 at 5pm PST


The problem

Allow plenty of time for this assignment.

The goal is to write a GA that cracks simple substitution ciphers and returns both the key and the clear text. The encoded message that is sent is called the cipher text. The unencoded message is called the clear text. A simple substitution cipher is one in which each letter of the clear text alphabet has a unique translation into a letter of the cipher text alphabet. To encrypt a message, one simply makes a one-to-one substitution of the letters in the clear text with the corresponding letters of the cipher text alphabet. The key is the mapping, expressed as the clear text alphabet translated into the cipher text alphabet. In our case this is the encoding of the 26 lowercase letters: "abc...z".

For example: the clear text:

"Programming is like sex: one mistake and you have to support it for
the rest of your life."

the key is the second line below

abcdefghijklmnopqrstuvwxyz
cpmtzkrhlsquniebdfaygwovjx     <---   the key

the cipher text:

 
bferf cnnli rlaul qzazv eiznl aycqz citje ghcwz yeagb befyl 
ykefy hzfza yekje gfulk z

In this problem the message is encoded as follows. All uppercase letters in the clear text are changed to lowercase and all whitespace and punctuation are removed. The result is then enciphered with a substitution cipher determined by the key. Finally, the cipher text is broken up into blocks of 5 letters each (the last block may be shorter, as in the example above).
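The encoding steps above can be sketched in a few lines. This is only an illustration: the function names `normalize`, `encipher`, and `decipher` are made up for this sketch, and Python is used for brevity rather than because the assignment requires it.

```python
import string

ALPHABET = string.ascii_lowercase

def normalize(clear_text):
    """Lowercase the text and keep only the letters a-z."""
    return "".join(ch for ch in clear_text.lower() if ch in ALPHABET)

def encipher(clear_text, key, block=5):
    """Substitute each letter via the key, then split into blocks of `block` letters."""
    table = str.maketrans(ALPHABET, key)
    cipher = normalize(clear_text).translate(table)
    return " ".join(cipher[i:i + block] for i in range(0, len(cipher), block))

def decipher(cipher_text, key):
    """Invert the key to recover the clear text (spaces and punctuation are gone)."""
    table = str.maketrans(key, ALPHABET)
    return "".join(cipher_text.split()).translate(table)

key = "cpmtzkrhlsquniebdfaygwovjx"
clear = ("Programming is like sex: one mistake and you have to support it for "
         "the rest of your life.")
cipher = encipher(clear, key)
print(cipher)
print(decipher(cipher, key))
```

Run on the example key and clear text, the first line printed should reproduce the cipher text shown above, block for block.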

You will be given several enciphered texts, each enciphered with a different key. Your program must determine the key and the clear text by evolving a key that optimizes a fitness function. Longer texts will be easier to solve than shorter ones. (Why?) Your algorithm should be able to crack the codes completely for the longer texts and be off by only a few infrequently used letters for the shortest texts. You are not expected to get the perfect key in all cases. (Why?)

Part 1: Use the fitness function described in class and the contact frequency matrix supplied in the resources below to formulate a fitness. You should write a GA of your choice (steady state or generational) with a permutation-based crossover (xover) of your choice. You may optionally perform a simple hill climbing local search as part of your GA. You should not perform a more complex tabu or simulated annealing search, since I would like to keep things simple and the problem can be solved without them. The goal here is not to solve these perfectly but to see how well they can be solved using these algorithms.
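As an illustration of what "permutation based" means here, the sketch below shows one possible pair of operators: a simplified variant of order crossover and a swap mutation. Both are assumptions chosen for this example, not the operators required by the assignment; the point is only that every child remains a valid key, i.e. a permutation of the 26 letters.

```python
import random
import string

ALPHABET = string.ascii_lowercase

def order_crossover(parent_a, parent_b, rng=random):
    """Copy a random slice from parent_a, then fill the gaps left to right
    with the remaining letters in the order they appear in parent_b."""
    n = len(parent_a)
    lo, hi = sorted(rng.sample(range(n + 1), 2))
    child = [None] * n
    child[lo:hi] = parent_a[lo:hi]
    kept = set(parent_a[lo:hi])
    fill = iter(ch for ch in parent_b if ch not in kept)
    for i in range(n):
        if child[i] is None:
            child[i] = next(fill)
    return "".join(child)

def swap_mutation(key, rng=random):
    """Exchange the letters at two distinct random positions of the key."""
    i, j = rng.sample(range(len(key)), 2)
    letters = list(key)
    letters[i], letters[j] = letters[j], letters[i]
    return "".join(letters)

rng = random.Random(472)  # fixed seed so the demo is repeatable
parent_a = "cpmtzkrhlsquniebdfaygwovjx"
parent_b = "".join(rng.sample(ALPHABET, 26))
child = order_crossover(parent_a, parent_b, rng)
print(child, sorted(child) == list(ALPHABET))
```

A naive one-point crossover on two keys would usually duplicate some letters and drop others, which is why a permutation-preserving operator is needed.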

Part 2: Augment the fitness function with a punishment function for bad choices as described in class and see if you can get improved performance in speed or quality. Be prepared to talk about your results in class.

The Fitness Function

The fitness of a key (a permutation of the 26 letters) is computed from two tables: a contact table (or digraph table) for English called D, and a contact table for your message called C. A contact table is a 26x26 matrix C where each element C_ij is the number of times the ordered pair of letters ij occurs, i.e., it is a count of occurrences. A contact table (or digraph table) for English is supplied below. It is derived from a large sample of English text.

Let e(i) be a function that encodes the letter i via the key. For the example key above e(a) = c.

Once the cipher text is read in, you can create the contact table C for the cipher text. Let SUM_C be the sum of the counts in the C matrix and let SUM_D be the sum of the counts in the D matrix. Then fitness(key) is the sum over all ordered letter pairs ij of (D_ij/SUM_D - C_e(i)e(j)/SUM_C)^2, where e(i) and e(j) are the encodings of the letters i and j under the key. NOTE: precompute what you can so you don't do hundreds of divides for every fitness evaluation. The smaller the fitness, the better the match. As the key changes, the encoding changes and so does the fitness of the key.
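The fitness computation described above might be sketched as follows. The names `contact_table`, `normalize`, and `fitness` are hypothetical, and the sketch assumes that letter pairs are counted across block boundaries, since the 5-letter blocks are only a presentation device. Both tables are divided by their totals once, up front, so each fitness evaluation does no divides. Because the real English table is not reproduced here, the demo builds a stand-in D from the example clear text; that stand-in is purely an assumption for illustration.

```python
import string

ALPHABET = string.ascii_lowercase
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def contact_table(text):
    """Return a 26x26 matrix of ordered letter-pair counts; non-letters are skipped."""
    letters = [INDEX[ch] for ch in text.lower() if ch in INDEX]
    table = [[0] * 26 for _ in range(26)]
    for first, second in zip(letters, letters[1:]):
        table[first][second] += 1
    return table

def normalize(table):
    """Divide every count by the table total once, so fitness() needs no divides."""
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("contact table is empty")
    return [[count / total for count in row] for row in table]

def fitness(key, d_norm, c_norm):
    """Sum over ordered pairs ij of (D_ij/SUM_D - C_e(i)e(j)/SUM_C)^2; smaller is better."""
    e = [INDEX[ch] for ch in key]  # e[i] = index of the cipher letter for clear letter i
    total = 0.0
    for i in range(26):
        d_row = d_norm[i]
        c_row = c_norm[e[i]]
        for j in range(26):
            diff = d_row[j] - c_row[e[j]]
            total += diff * diff
    return total

# Stand-in data (an assumption for this demo only): D is built from the example
# clear text rather than from the supplied English table.
true_key = "cpmtzkrhlsquniebdfaygwovjx"
clear = "programmingislikesexonemistakeandyouhavetosupportitfortherestofyourlife"
cipher = clear.translate(str.maketrans(ALPHABET, true_key))
d_norm = normalize(contact_table(clear))
c_norm = normalize(contact_table(cipher))
print(fitness(true_key, d_norm, c_norm))  # the true key
print(fitness(ALPHABET, d_norm, c_norm))  # the identity key, a wrong guess
```

With this stand-in D the true key matches exactly and the identity key does not. With the real English table no key will score exactly zero, which is part of why short texts are harder.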

What to Turn In

You should turn in your code in a tar file along with a report. Your code and makefile should create a program named decipher. I will grade your program by compiling it, sending it files similar to the test files via standard input, and checking whether it generates the key. The output format for the key is: ** <your submit name> <the key>. For example, for the key above and the user meerkat:

** meerkat cpmtzkrhlsquniebdfaygwovjx

On the lines that follow the key line you can output whatever else you want, as long as no line begins with **. There will be a time limit on execution of about 10 to 20 seconds. Be sure you write your fitness function efficiently.
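The input and output conventions above could be handled along these lines. This is a sketch under assumptions: `read_cipher` and `key_line` are hypothetical helper names, and the input is taken to contain only lowercase letters plus the spaces and newlines that separate the blocks. In an actual program the stream passed to `read_cipher` would be `sys.stdin`.

```python
import io
import string

ALPHABET = string.ascii_lowercase

def read_cipher(stream):
    """Read the whole stream and keep only lowercase letters, dropping block spacing."""
    return "".join(ch for ch in stream.read() if ch in ALPHABET)

def key_line(submit_name, key):
    """Format the graded output line, refusing anything that is not a permutation of a-z."""
    if sorted(key) != list(ALPHABET):
        raise ValueError("key must be a permutation of the 26 lowercase letters")
    return f"** {submit_name} {key}"

sample = io.StringIO("bferf cnnli rlaul\nqzazv eiznl\n")
print(read_cipher(sample))
print(key_line("meerkat", "cpmtzkrhlsquniebdfaygwovjx"))
```

Checking the key before printing it guards against a crossover or mutation bug quietly producing an invalid key.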

You should report on your observations in a 1 page report called report.pdf. Your report should describe your GA, including parameters such as the xover used, the xover probability, the local search used (if any), when it was used, etc. There should be enough information that a reader could easily reconstruct your GA. Please put this in table form similar to what you see in the book. Then report on your successes in deciphering the eight sample texts. Explain what you tried that didn't work, if anything, and why. Explain why you chose what you did. You can use additional pages for tables, diagrams, decipherments, keys, or plots, but not to present further explanations or observations.

Resources

Generic Two Letter Frequency Tables

Here is a letter contact table. The first column is the letter pair. The second is the number of times it occurs in a large sample text. By summing up column two and dividing each entry in column two by that sum, you can get a normalized frequency. Not all character pairs are found in the table. Pairs that are not present, e.g. "bf", have a count of zero.
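Loading the table as described might look like the sketch below. It assumes each line holds a lowercase letter pair and a count separated by whitespace; `parse_digraph_table` is a hypothetical name, and the three counts in the demo are toy numbers made up for illustration, not values from the real table.

```python
def parse_digraph_table(text):
    """Build a normalized 26x26 frequency matrix from 'pair count' lines.
    Pairs missing from the input keep a frequency of zero."""
    counts = [[0] * 26 for _ in range(26)]
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pair, count = parts
        valid_pair = len(pair) == 2 and all("a" <= ch <= "z" for ch in pair)
        if valid_pair and count.isdigit():
            row = ord(pair[0]) - ord("a")
            col = ord(pair[1]) - ord("a")
            counts[row][col] += int(count)
    total = sum(sum(r) for r in counts)
    if total == 0:
        raise ValueError("no letter pairs found in the table")
    return [[c / total for c in r] for r in counts]

# Toy counts for illustration only; these are NOT taken from the real table.
toy_table = "th 30\nhe 10\nin 10\n"
d_norm = parse_digraph_table(toy_table)
print(d_norm[ord("t") - 97][ord("h") - 97])  # frequency of "th"
print(d_norm[ord("b") - 97][ord("f") - 97])  # "bf" is absent, so zero
```

Normalizing once at load time means the divide by the column sum never has to be repeated during the search.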

The Codes

Here are 8 messages each enciphered with a different key. They can be found in the files in codes.tar along with a solved code called test.code for testing purposes. The files are named with a letter and the length of the message:
 
b115.code 
d158.code
e237.code
n250.code
m369.code
i603.code
o715.code
p1209.code

Submission

Homework will be submitted as an uncompressed tar file to the homework submission page. You can submit as many times as you like. The LAST file you submit BEFORE the deadline will be the one graded. For all submissions you will receive email giving you some automated feedback on the unpacking and compiling and running of code and other things that can be autotested. I will read the results of the runs and the reports you submit.


Robert Heckendorn Last updated: Oct 7, 2008 13:58