Assignment 3 (A Permutation Problem)
250 points
DUE: Sat Oct 25 at 5pm PST
The goal is to write a genetic algorithm (GA) that cracks simple substitution ciphers and returns the key and the clear text. The encoded message that is sent is called the cipher text. The unencoded message is called the clear text. A simple substitution cipher is one in which each possible letter in the clear text has a unique translation into a letter in the cipher text. To encrypt the message, one simply makes a one-to-one substitution of the letters in the clear text with the ones in the cipher text. The key is the mapping, expressed as the alphabet of the clear text translated into the cipher text alphabet. In our case this is the encoding for the 26 lowercase letters: "abc...z".
For example, the clear text is:
"Programming is like sex: one mistake and you have to support it for the rest of your life."
The key is the second line below:
abcdefghijklmnopqrstuvwxyz
cpmtzkrhlsquniebdfaygwovjx <--- the key
The cipher text is:
bferf cnnli rlaul qzazv eiznl aycqz citje ghcwz yeagb befyl ykefy hzfza yekje gfulk z
In this problem we encode the message as follows. The clear text has all uppercase letters changed to lowercase and all whitespace and punctuation removed. It is then enciphered with a substitution cipher determined by the key. Finally, the message is broken up into blocks of 5 letters each.
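The encoding steps above can be sketched in a few lines of Python. This is only an illustration: the function names `encipher` and `decipher` and the `block` parameter are hypothetical, and a key is assumed to be a 26-character string listing the cipher letter for a, b, ..., z in order.

```python
import string

ALPHABET = string.ascii_lowercase


def encipher(clear_text, key, block=5):
    """Lowercase, drop non-letters, substitute via key, split into blocks."""
    table = str.maketrans(ALPHABET, key)  # clear letter -> cipher letter
    letters = "".join(ch for ch in clear_text.lower() if ch in ALPHABET)
    cipher = letters.translate(table)
    return " ".join(cipher[i:i + block] for i in range(0, len(cipher), block))


def decipher(cipher_text, key):
    """Invert the key; block spaces and other non-letters are dropped."""
    table = str.maketrans(key, ALPHABET)  # cipher letter -> clear letter
    letters = "".join(ch for ch in cipher_text if ch in ALPHABET)
    return letters.translate(table)
```

With the example key, `encipher` applied to the example sentence reproduces the cipher text shown above, and `decipher` recovers the clear text with the spaces and punctuation removed.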
You will be given several enciphered texts, each enciphered with a different key. Your program must determine the key and the clear text by evolving a key that optimizes a fitness function. Longer texts will be easier to solve than shorter ones. (Why?) Your algorithm should be able to crack the codes completely for the longer texts and be off by only a few infrequently used letters for the shortest texts. You are not expected to get the perfect key in all cases. (Why?)
Part 1: Use the fitness function described in class and the contact frequency matrix supplied in the resources below to formulate a fitness. You should write a GA of your choice (steady state or generational) with a permutation-based xover of your choice. You may optionally perform a simple hill-climbing local search as part of your GA. You should not perform a more complex tabu or simulated annealing search, since I would like to keep things simple and the problem can be solved without them. The goal here is not to solve these perfectly but to see how well they can be solved using these algorithms.
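Part 1 leaves the choice of permutation operator open. As one possible illustration (not necessarily an operator discussed in class), here is a sketch of order crossover (OX) and a swap mutation on keys stored as 26-character strings. The names `order_crossover` and `swap_mutation`, and the convention of passing in a `random.Random` instance, are assumptions made for this example.

```python
import random
import string

ALPHABET = string.ascii_lowercase


def order_crossover(p1, p2, rng):
    """Order crossover: copy a slice of p1, fill the rest in p2's order."""
    n = len(p1)
    a, b = sorted(rng.sample(range(n), 2))  # two distinct cut points
    child = [None] * n
    child[a:b + 1] = p1[a:b + 1]  # inherited slice from parent 1
    used = set(p1[a:b + 1])
    # Walk positions starting just after the slice, wrapping around.
    order = [(b + 1 + k) % n for k in range(n)]
    donors = [p2[q] for q in order if p2[q] not in used]
    slots = [q for q in order if child[q] is None]
    for q, ch in zip(slots, donors):
        child[q] = ch
    return "".join(child)


def swap_mutation(key, rng):
    """Exchange the cipher letters assigned to two clear letters."""
    i, j = rng.sample(range(len(key)), 2)
    letters = list(key)
    letters[i], letters[j] = letters[j], letters[i]
    return "".join(letters)
```

Both operators always return a valid permutation, so no repair step is needed. The same swap move could also serve as the neighborhood for an optional hill-climbing step.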
Part 2: Augment the fitness function with a punishment function for bad choices as described in class and see if you can get improved performance in speed or quality. Be prepared to talk about your results in class.
The fitness function for a key (a permutation of the 26 letters) is computed from two tables: the contact table (or digraph table) for English, called D, and the contact table for your message, called C. A contact table is a 26x26 matrix C in which each element C_ij is the number of times the ordered pair of letters ij occurs, i.e. it is a count of occurrences. A contact table (or digraph table) for English is supplied below. It is derived from a large sample of English text.
Let e(i) be a function that encodes the letter i via the key. For the example key above, e(a) = c.
Once the cipher text is read in, you can create the contact table for the cipher text. Let SUM_C be the sum of the counts in the C matrix and let SUM_D be the sum of the counts in the D matrix. The fitness of a key is then
fitness(key) = sum over all ordered character pairs ij of (D_ij/SUM_D - C_e(i)e(j)/SUM_C)^2
where e(i) and e(j) are the encodings of the characters i and j using the key. NOTE: precompute what you can so you don't do hundreds of divides for every fitness evaluation. The smaller the fitness, the better the match. As the key changes, the encoding changes and so does the fitness of the key.
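As a rough sketch of this computation, the Python below builds a contact table, normalizes it once so that no divides happen during evaluation, and sums the squared differences. The function names are hypothetical. It assumes that letter pairs are counted across the 5-letter block boundaries (the spaces are only formatting), that the input text is non-empty, and that the supplied English table D has already been loaded as a 26x26 list of counts.

```python
import string

ALPHABET = string.ascii_lowercase
N = len(ALPHABET)


def contact_table(text):
    """Count ordered letter pairs; spaces and other non-letters are dropped."""
    idx = [ord(ch) - ord("a") for ch in text.lower() if ch in ALPHABET]
    table = [[0] * N for _ in range(N)]
    for a, b in zip(idx, idx[1:]):
        table[a][b] += 1
    return table


def normalize(table):
    """Divide every count by the table total (done once, up front)."""
    total = sum(map(sum, table))
    return [[count / total for count in row] for row in table]


def fitness(key, d_freq, c_freq):
    """Sum over i, j of (D_ij/SUM_D - C_e(i)e(j)/SUM_C)^2; smaller is better."""
    e = [ord(ch) - ord("a") for ch in key]  # e[i] = index of cipher letter
    total = 0.0
    for i in range(N):
        row_d = d_freq[i]
        row_c = c_freq[e[i]]
        for j in range(N):
            diff = row_d[j] - row_c[e[j]]
            total += diff * diff
    return total
```

Here `d_freq` and `c_freq` are the outputs of `normalize` for D and C. If D were built from the very clear text that was enciphered, the true key would score exactly 0; with the real English table the true key should merely score low.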
You should turn in your code in a tar file along with a report. Your code and makefile should create a program named decipher. I will grade your program by compiling it, sending it files similar to the test files via standard input, and checking whether it generates the key. The output format for the key is: ** <your submit name> <the key>. For example, for the key above for user meerkat:
** meerkat cpmtzkrhlsquniebdfaygwovjx
On the lines that follow, you can output whatever else you want as long as it does not begin with **. There will be a time limit on execution of about 10 to 20 seconds. Be sure you write your fitness function efficiently.
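The input and output conventions might be handled along these lines. This is a minimal sketch: the helper names `read_cipher`, `key_line`, and `report` are hypothetical, and the key passed in stands in for whatever your GA evolves.

```python
import string

ALPHABET = string.ascii_lowercase


def read_cipher(stream):
    """Read all input and keep only the lowercase cipher letters."""
    return "".join(ch for ch in stream.read().lower() if ch in ALPHABET)


def key_line(submit_name, key):
    """Format the graded line: ** <your submit name> <the key>."""
    return f"** {submit_name} {key}"


def report(stream_in, stream_out, submit_name, key):
    """Write the key line first, then extra lines that do not start with **."""
    cipher = read_cipher(stream_in)
    stream_out.write(key_line(submit_name, key) + "\n")
    stream_out.write(f"cipher letters read: {len(cipher)}\n")
```

In a real program, `report` would be called with `sys.stdin` and `sys.stdout` after the search finishes.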
You should report on your observations in a 1-page report called report.pdf. Your report should describe your GA, including parameters such as the xover used, the xover probability, the local search used (if any), when it was used, etc. There should be enough information that a reader could easily reconstruct your GA. Please put this in table form similar to what you see in the book. Then report on your successes in deciphering the eight sample texts. Explain what you tried that didn't work, if anything, and why. Explain why you chose what you did. You can use additional pages for tables, diagrams, decipherments, keys, or plots, but not to present further explanations or observations.
b115.code d158.code e237.code n250.code m369.code i603.code o715.code p1209.code
Robert Heckendorn