CS445 Introduction to Compilers
Assignment 1
(The Scanner)
120 points
DUE: Wed Sep 9 at 5PM PST

Because the output of your program will first be processed by an automatic comparison program before being handed to a human being, please follow the formatting instructions and examples carefully. The results your program produces must look exactly like the target. Please do not embellish with extra titles or other text such as "run complete" or "CS445 output". The testing facility of the submit script will help you get this annoying detail right. Thanks for your patience.

The Problem

Use both Flex and Bison to build a scanner for the C- language. The scanner will be named c- (note the lowercase: c- will be the compiler for the language C-, which is written in uppercase). It will read and process a stream of tokens from the file named by the first argument to the c- command, or from standard input if the filename argument is not present. This means the call to c- on the C- code in the file filename is defined as:

 
c- {filename}
So
c- filename
works and so would
cat filename | c-
work and so would
c- < filename
work.
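
The argument handling above can be sketched in C as follows. This is only a sketch under assumptions: the helper name open_input is hypothetical and not part of the assignment. In a flex-generated scanner the returned stream would typically be assigned to yyin before the parser is started.

```c
#include <stdio.h>

/* Hypothetical helper: choose the input stream for c-.
 * Returns the file named by argv[1] if an argument was given,
 * stdin if there is no argument, or NULL if the named file
 * cannot be opened. */
FILE *open_input(int argc, char *argv[])
{
    if (argc > 1) {
        return fopen(argv[1], "r");
    }
    return stdin;
}
```

In main, something like yyin = open_input(argc, argv); followed by a check for NULL would cover all three invocations shown above.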

It will produce a stream of tokens as its output as described below. (Pretesting your answer will help ensure compliance.) It will be constructed using flex and bison to run on the machine wormulon or one of its clones, where the grading will occur.

The Flex Part

Build a Flex scanner that returns a token class for each token in the C- grammar. For numbers it should also "return" a numerical value. For ids it should also return a string. For boolean values it should return a boolean value (which you may implement, if you choose, as an int). The scanner should ignore comments and return nothing for them. If the scanner finds a character that it doesn't recognize, it should print an "Invalid token" error message in the format seen in the example. We'll make this pretty later. The scanner should keep track of the line number of each token and return it with the token by using either a struct or a class Token.

Note that in C-, as in C and C++, newline is not an element of the grammar and is merely whitespace. This was not true for the calculator program we did (will do) in class.
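
Even though newline is only whitespace, the scanner still has to notice it to keep the line number right. One possible approach, sketched below in plain C, is to count the newline characters in whatever text a whitespace rule matched and add that to a running line counter. The function name count_newlines is an assumption made for illustration; in a flex action the argument would be yytext.

```c
#include <stddef.h>

/* Hypothetical helper: number of '\n' characters in a matched lexeme. */
int count_newlines(const char *text)
{
    int count = 0;
    for (size_t i = 0; text[i] != '\0'; i++) {
        if (text[i] == '\n') {
            count++;
        }
    }
    return count;
}
```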

The Bison Part

Build a Bison parser that accepts any stream of tokens from the scanner and prints out the line number, the token type, and any extra information returned by the scanner as specified above. This first program will not recognize C-. It will only recognize a stream of C- tokens. One of the goals of this assignment is to get a connection between flex and bison up and running, even if it is trivial.

An example of the output for these C- statements:

if (v==0)
return u;
pi = 3.14159;
x=true and y; 
fred(x++, y[3]); 
is:
Line 1 Token: IF
Line 1 Token: (
Line 1 Token: ID Value: v
Line 1 Token: EQ
Line 1 Token: NUM Value: 0
Line 1 Token: )
Line 2 Token: RETURN
Line 2 Token: ID Value: u
Line 2 Token: ;
Line 3 Token: ID Value: pi
Line 3 Token: =
Line 3 Token: NUM Value: 3
ERROR(3):Invalid token '.'
Line 3 Token: NUM Value: 14159
Line 3 Token: ;
Line 4 Token: ID Value: x
Line 4 Token: =
Line 4 Token: LOGIC Value: T
Line 4 Token: AND
Line 4 Token: ID Value: y
Line 4 Token: ;
Line 5 Token: ID Value: fred
Line 5 Token: (
Line 5 Token: ID Value: x
Line 5 Token: INC
Line 5 Token: ,
Line 5 Token: ID Value: y
Line 5 Token: [
Line 5 Token: NUM Value: 3
Line 5 Token: ]
Line 5 Token: )
Line 5 Token: ;
Numbers are printed as numbers (using %d or %i format), IDs as strings, boolean true as T and boolean false as F. Again, the pretest will let you know where your format is off.
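
As an illustration of the line formats in the example, here is a sketch of formatting helpers in C. The function names are hypothetical and each one only builds the text of a single output line; how they are called from the bison actions is left open.

```c
#include <stdio.h>

/* Hypothetical formatters. Each writes one output line (without the
 * trailing newline) into buf and returns the snprintf result. */

/* Tokens with no extra information, e.g. "Line 1 Token: IF". */
int format_plain(char *buf, size_t n, int line, const char *type)
{
    return snprintf(buf, n, "Line %d Token: %s", line, type);
}

/* Numbers are printed with %d. */
int format_num(char *buf, size_t n, int line, int value)
{
    return snprintf(buf, n, "Line %d Token: NUM Value: %d", line, value);
}

/* IDs are printed as strings. */
int format_id(char *buf, size_t n, int line, const char *name)
{
    return snprintf(buf, n, "Line %d Token: ID Value: %s", line, name);
}

/* Boolean true prints as T, false as F. */
int format_logic(char *buf, size_t n, int line, int truth)
{
    return snprintf(buf, n, "Line %d Token: LOGIC Value: %c",
                    line, truth ? 'T' : 'F');
}

/* Unrecognized character, e.g. "ERROR(3):Invalid token '.'". */
int format_error(char *buf, size_t n, int line, char bad)
{
    return snprintf(buf, n, "ERROR(%d):Invalid token '%c'", line, bad);
}
```

Calling printf directly with the same format strings would work just as well; writing into a buffer simply makes the formats easy to check against the example.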

(Hint: also note that IDs are not what you think they are. They are not exactly like in C++.)

The type of any single character token is printed as the character itself. The type of any multicharacter token is printed as follows:

!=        NEQ
+=        PASSIGN 
++        INC
-=        MASSIGN
--        DEC 
<=        LEQ 
==        EQ
>=        GEQ
and       AND
bool      BOOLEAN
break     BREAK
else      ELSE
false     LOGIC
if        IF
int       INT
not       NOT 
or        OR 
return    RETURN
true      LOGIC
void      VOID
while     WHILE
an id     ID
a number  NUM
Note that you can use whatever internal symbols you want, but the output must print token types exactly as above for comparison.
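
The table above can be kept in one place as data. The sketch below is one possible arrangement in C, not a required design: the names NameEntry, multi_char_names, and printed_name are hypothetical. In a real flex scanner each rule would more likely return its own token code, with the printed name chosen from that code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: a fixed lexeme and the type name to print. */
struct NameEntry {
    const char *lexeme;
    const char *name;
};

static const struct NameEntry multi_char_names[] = {
    {"!=", "NEQ"},       {"+=", "PASSIGN"},   {"++", "INC"},
    {"-=", "MASSIGN"},   {"--", "DEC"},       {"<=", "LEQ"},
    {"==", "EQ"},        {">=", "GEQ"},       {"and", "AND"},
    {"bool", "BOOLEAN"}, {"break", "BREAK"},  {"else", "ELSE"},
    {"false", "LOGIC"},  {"if", "IF"},        {"int", "INT"},
    {"not", "NOT"},      {"or", "OR"},        {"return", "RETURN"},
    {"true", "LOGIC"},   {"void", "VOID"},    {"while", "WHILE"},
};

/* Returns the printed type name for a fixed multicharacter lexeme,
 * or NULL if the lexeme is not in the table (ids, numbers, and
 * single character tokens are handled elsewhere). */
const char *printed_name(const char *lexeme)
{
    size_t count = sizeof multi_char_names / sizeof multi_char_names[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(multi_char_names[i].lexeme, lexeme) == 0) {
            return multi_char_names[i].name;
        }
    }
    return NULL;
}
```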

Suggestion

One way to get all the information you need back from flex is to have yylex pass back a struct or class instance with all the data you will ever need about the token. This can be done by using yylval. The token class can then be returned by yylex itself. We talk about these two ways of passing information from yylex in class.
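
A minimal sketch of such a token record in C is shown below. The struct name TokenData, its fields, and the helper functions are all assumptions made for illustration, not the required design. In a bison grammar, a %union with a pointer to this struct as a member would let a flex action store the record in yylval and then return the token class.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record carried back from the scanner through yylval. */
struct TokenData {
    int   token_class;  /* token code that yylex would return */
    int   line;         /* line on which the token was found */
    char *text;         /* private copy of the matched characters */
    int   num_value;    /* value for NUM tokens */
};

/* Build a record for a number lexeme such as "3".
 * Returns NULL if memory cannot be allocated.
 * The caller releases the record with free_token. */
struct TokenData *make_num_token(int token_class, int line, const char *lexeme)
{
    struct TokenData *tok = malloc(sizeof *tok);
    if (tok == NULL) {
        return NULL;
    }
    size_t len = strlen(lexeme) + 1;
    tok->text = malloc(len);
    if (tok->text == NULL) {
        free(tok);
        return NULL;
    }
    memcpy(tok->text, lexeme, len);  /* copy includes the terminator */
    tok->token_class = token_class;
    tok->line = line;
    tok->num_value = atoi(lexeme);
    return tok;
}

/* Release a record created by make_num_token. */
void free_token(struct TokenData *tok)
{
    if (tok != NULL) {
        free(tok->text);
        free(tok);
    }
}
```

The lexeme is copied because the scanner's match buffer is reused for the next token, so a pointer into it would not stay valid.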

Build and Test

You will include a makefile (note the all lowercase) that I will execute to build c-. I will then run several files containing tokens through c- and compare the results.

Submission

Homework will be submitted as an uncompressed tar file to the homework submission page. You can submit as many times as you like. The LAST file you submit BEFORE the deadline will be the one graded. (Remember that the deadline is in Pacific time, not your local time.) For every submission you will receive an email showing how your file performed on the pre-grade tests. The grading program will use more extensive tests, so thoroughly test your program with inputs of your own.

If you have tests you really think are important or just cool, please send them to me and I will consider adding them to the test suite.


Robert Heckendorn Last updated: Aug 28, 2009 12:44