CS445 Introduction to Compilers
Assignment 2
(Construct the Syntax Tree)
250 points
DUE: Sat Oct 10 at 5pm PDT (see warning below)

The Problem

This homework consists of these tasks:

Do not be fooled. This is a nontrivial homework. Do not put off this assignment. It is complicated and detailed.

These tasks are described further below.

Improving the interface

When done with this assignment you will have created code that will recognize legal C- programs and generate the first pass at the tree.

The parser will be named c- just like last time. It will read and process a stream of tokens from a filename given as the first argument to the c- command OR from standard input if the filename argument is not present.

It will now also take the -d option as a first argument. I recommend using the getopt routine since this will handle UNIX arguments in a uniform and standard way. The -d option turns on the yydebug flag by setting it to 1.

For example: c- -d sort.c- should run the c- compiler on the program sort.c- and give details of the parsing that is going on, while c- sort.c- should simply run the c- compiler.

For this assignment your compiler should record the line number and string representation of the last token scanned in global variables. These global variables are for adding arguments to yyerror. Rewrite your yyerror routine to print a message as in this error message:

printf("ERROR lineno(%d):%s.  I got: %s\n", lineno, msg, lastScannedToken); 
The msg is passed into the yyerror routine as we will discuss in class. You can write out the error message using any method you like but the content of the error message must be exactly like the above. To get this all to work nicely, turn on verbose error messaging with this macro definition:
#define YYERROR_VERBOSE 
We will continue to improve on the invocation line and error reporting as our compiler gets more sophisticated. HINT: As we will discuss in class, it is important that you allocate a new string for each token as it is scanned, to avoid the problem of referring to a reusable buffer.
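
To make the hint concrete, here is a minimal sketch of recording the last token in freshly allocated storage and formatting the required message. The globals lineno and lastScannedToken come from the printf above; copyToken, recordToken and formatError are hypothetical helper names, and how your scanner actually calls them is up to you.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int lineno = 1;                   /* line of the last token scanned */
char *lastScannedToken = NULL;    /* private copy of its text */

/* Hypothetical helper: duplicate the scanner's text so that later scanning,
 * which reuses the scanner's buffer, cannot change what we saved. */
char *copyToken(const char *text)
{
    size_t n = strlen(text) + 1;
    char *copy = malloc(n);
    assert(copy != NULL);         /* a real compiler would report this */
    memcpy(copy, text, n);
    return copy;
}

/* Hypothetical helper: call once per token from the scanner. The old copy
 * is not freed here because token data may still be referenced elsewhere. */
void recordToken(const char *text, int line)
{
    lineno = line;
    lastScannedToken = copyToken(text);
}

/* Build the message in the exact format the assignment requires. */
int formatError(char *buf, size_t size, const char *msg)
{
    return snprintf(buf, size, "ERROR lineno(%d):%s.  I got: %s\n",
                    lineno, msg, lastScannedToken ? lastScannedToken : "");
}

void yyerror(const char *msg)
{
    char buf[512];
    formatError(buf, sizeof buf, msg);
    fputs(buf, stdout);
}
```

Splitting the formatting from the printing is only a convenience for testing; a single printf inside yyerror produces the same output.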

The Parser

For the parsing part of the assignment modify your Bison grammar to parse C- code. A good approach is to initially forget about the syntax tree part of the assignment. If you get the right grammar into your compiler it will successfully parse any C- program. A program that simply recognizes whether a program is legal or not is called a recognizer. When you build your bison grammar directly from the one supplied you will find that you have the dangling else problem. There are several ways to fix this problem. I will discuss one in class.

Coding restriction: Do not attempt to fix dangling else with associativity declarations such as %left. Do not fix any other problem with your grammar by using the %expect feature of Bison. This causes Bison to ignore some number of parsing conflicts and me to deduct points from your assignment. Really, you can do this without this "feature". I expect your parser to compile without any conflict warnings or errors.

Now that your recognizer is working, let's look at the syntax tree I want you to produce. As we will discuss in class, the tree is an abbreviated portion of the parse tree containing the parts we are interested in. Here is a sample TreeNode that I used:

typedef struct treeNode
{
    struct treeNode *child[MAXCHILDREN];   // children of the node
    struct treeNode *sibling;              // siblings for the node
    int lineno;                            // line number for errors
    NodeKind nodekind;                     // type of node
    union                                  // subtype of type
    {
        DeclKind decl;                     // used when DeclK
        StmtKind stmt;                     // used when StmtK
        ExpKind exp;                       // used when ExpK
    } kind;
    union                                  // relevant data in type -> attr
    {
        OpKind op;                         // type of token (same as in bison)
        int val;                           // used when ConstantK
        char *name;                        // used when IdK
    } attr;
    ExpType expType;                       // used when ExpK for type checking
    int size;                              // used for size of array
    bool isArray;                          // is this an array 
} TreeNode; 
This design is stolen straight from the book. This way you can use the one in the book as an example to work from. Ours has to have extra features and node types. We will discuss this in detail in class.

To encode the program as a tree you need to make the right nodes at the right steps in the parsing. When you need to make a node you will use routines you write similar to the newStmtNode function in util.c for the Tiny language in the book. The nodes will be passed up the tree and assembled as in the Tiny example in the book. Coding restriction: Do not use YYSTYPE as used in the book. This subverts features that are there to help you. I will discuss how to use this to your advantage. I will also talk about how to use this:

%union { 
    ExpType type;
    int number;
    TokenData tokenData; 
    TreeNode *tree;
}
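
As one possible shape for the node-creation routines mentioned above, here is a minimal sketch modeled loosely on the book's newStmtNode. The TreeNode here is cut down to the fields the sketch touches, MAXCHILDREN is assumed to be 3, and addSibling is a hypothetical helper for chaining lists of declarations, parameters or statements.

```c
#include <assert.h>
#include <stdlib.h>

#define MAXCHILDREN 3   /* assumption: an if node needs test, then, else */

typedef enum { DeclK, StmtK, ExpK } NodeKind;
typedef enum { IfK, WhileK, CompoundK, ReturnK, BreakK } StmtKind;

/* Cut-down TreeNode: only the fields this sketch touches. */
typedef struct treeNode {
    struct treeNode *child[MAXCHILDREN];
    struct treeNode *sibling;
    int lineno;
    NodeKind nodekind;
    union { StmtKind stmt; } kind;
} TreeNode;

/* Allocate a statement node with every child and the sibling set to NULL. */
TreeNode *newStmtNode(StmtKind kind, int lineno)
{
    TreeNode *t = malloc(sizeof *t);
    assert(t != NULL);              /* a real compiler would report this */
    for (int i = 0; i < MAXCHILDREN; i++)
        t->child[i] = NULL;
    t->sibling = NULL;
    t->lineno = lineno;
    t->nodekind = StmtK;
    t->kind.stmt = kind;
    return t;
}

/* Hypothetical helper: append node to the end of a sibling list and
 * return the head of the list (which is node itself if list is empty). */
TreeNode *addSibling(TreeNode *list, TreeNode *node)
{
    if (list == NULL)
        return node;
    TreeNode *last = list;
    while (last->sibling != NULL)
        last = last->sibling;
    last->sibling = node;
    return list;
}
```

In a Bison action for a list rule, something along the lines of $$ = addSibling($1, $2); would keep the list in source order. Similar constructors for declaration and expression nodes would set nodekind to DeclK or ExpK.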

The Parser and printTree

So what should the tree look like for a given program? This is essentially described by the Bison code. I think the best way to describe this is in class and by example. To understand the examples you must understand the output format of the printTree function. The printTree function prints the important information contained in the node pointed to by the second argument. It then applies the printTree function to all the nonnull children and prints them out numbered by their index in the child array. It then follows the sibling pointer if it is nonnull and applies the printTree function to that. The first sibling found is numbered 1. Reading the syntax tree printed for sample input programs shows you what to do in each case. For example, given this program:
 
int max(int x, y) {
    int z;
    if (x>y) z=x;
    else z=y;
    return z;
}
you should get the following output from your c-. The // comments are not part of the output; they explain what you are seeing.
Function max returns type int [line: 1]          // this is the declaration node for a function
Child: 1                                         // it has 2 children.  Child 1 is the parameter list.
    Param x of type int [line: 1]                // the first parameter is x of type int
    Sibling: 1                                   // the parameters are tied together as a linked list of siblings
    Param y of type int [line: 1]                // the second parameter is y of type int
Child: 2                                         // the second child of the function declaration is the statements
    Compound [line: 1]                           // the body of a function is treated as a compound statement
    Child: 1                                     // the first child of a compound statement is a list of declarations
        Var z of type int [line: 2]              // z is declared of type int
    Child: 2                                     // the second child of a compound statement is a list of statements
        If [line: 3]                             // the if node has two or three children 
        Child: 1                                 // the first child is the test 
            Op: > [line: 3]                      // a relational operator > applied
            Child: 1                             // to the two children
                Id: x [line: 3]                  // the first of which is x
            Child: 2
                Id: y [line: 3]                  // the second is y
        Child: 2                                 // the second child of the if is the then clause
            Assign: = [line: 3]                  // z = x
            Child: 1 
                Id: z [line: 3]
            Child: 2
                Id: x [line: 3]
        Child: 3                                 // the third child of the if is the else clause
            Assign: = [line: 4]                  // z = y
            Child: 1
                Id: z [line: 4]
            Child: 2
                Id: y [line: 4]
        Sibling: 1                               // the second statement in the body of the compound statement
        Return [line: 5]                         // return which takes as its only child 
        Child: 1
            Id: z [line: 5]                      // the variable z
 
In the cases where there is an optional expression or statement, the corresponding child pointer is set to NULL (i.e. 0). For example, compound statements might not have any declarations, so child[0] would be set to NULL. Return optionally takes an expression. If there isn't an expression then child[0] is NULL. The while statement might not have a body, for example while (searching()); in which case child[1] is NULL. The default for unneeded children and siblings is always NULL.
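
The traversal behind the sample output can be sketched roughly as follows. This is an illustration under assumptions, not the reference printTree: the node type is a simplified stand-in that carries a ready-made label, the depth parameter is added to handle indentation, and children are numbered from 1 because that is what the sample output shows.

```c
#include <assert.h>
#include <stdio.h>

#define MAXCHILDREN 3

/* Simplified stand-in for TreeNode: the label replaces the per-kind text
 * ("If", "Id: z", ...) that a real printTree would build from the node. */
typedef struct node {
    const char *label;
    int lineno;
    struct node *child[MAXCHILDREN];
    struct node *sibling;
} Node;

static void indent(FILE *out, int depth)
{
    for (int i = 0; i < depth; i++)
        fputs("    ", out);
}

/* Print a node, then its nonnull children, then walk its siblings. */
void printTree(FILE *out, const Node *tree, int depth)
{
    int siblingCount = 0;

    assert(out != NULL);
    while (tree != NULL) {
        indent(out, depth);
        fprintf(out, "%s [line: %d]\n", tree->label, tree->lineno);

        for (int i = 0; i < MAXCHILDREN; i++) {
            if (tree->child[i] != NULL) {   /* NULL children are skipped */
                indent(out, depth);
                fprintf(out, "Child: %d\n", i + 1);
                printTree(out, tree->child[i], depth + 1);
            }
        }

        tree = tree->sibling;
        if (tree != NULL) {                 /* first sibling is numbered 1 */
            siblingCount++;
            indent(out, depth);
            fprintf(out, "Sibling: %d\n", siblingCount);
        }
    }
}
```

Because NULL children are skipped but still counted in the index, a compound statement with no declarations would print only its second child.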

The question is which node's line number do you use to issue the error? For example, if there is a problem with a big long hairy while statement, we will tag the error at the line where the while token is. One could have used the line number from the test but that could become tricky if the test goes over multiple lines. A clear decision on the major tokens is given below.

Here are node types and where the line is said to be:

//declarations
VarK    at the ID 
FuncK   at the ID
ParamK  at the ID
//statements 
IfK       at the IF
WhileK    at the WHILE
CompoundK at the {
ReturnK   at the RETURN 
BreakK    at the BREAK
//expressions
OpK        at the operator
ConstantK  at the constant
IdK        at the ID
AssignK    at the =
CallK      at the ID
So, in a declaration of a variable the declaration node of type VarK is said to be on the line where the ID token was found. "If" statements are on the line where the IF token was found, and so on.
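
One way to follow these rules is to stamp each token with its line when it is scanned and to take a node's line from its anchoring token, not from the scanner's current line. By the time a rule for a long statement is reduced, the scanner may be several lines further on. The sketch below is only an illustration: the TokenData fields, scanToken, StmtNode and makeStmt are hypothetical names, not the layout used in class.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token record: the line is captured when the token is scanned. */
typedef struct {
    int linenum;
    const char *tokenStr;   /* a real scanner would store a fresh copy */
} TokenData;

typedef enum { WhileK, ReturnK } StmtKind;   /* subset, for illustration */

/* Stripped-down stand-in for TreeNode. */
typedef struct {
    StmtKind kind;
    int lineno;
} StmtNode;

int lineno = 1;   /* the scanner's current line; keeps advancing */

/* Hypothetical scanner hook: remember where this token was seen. */
TokenData scanToken(const char *text)
{
    TokenData t;
    assert(text != NULL);
    t.linenum = lineno;
    t.tokenStr = text;
    return t;
}

/* The node takes its line from the anchoring token (WHILE, IF, ID, ...). */
StmtNode makeStmt(StmtKind kind, TokenData anchor)
{
    StmtNode n;
    n.kind = kind;
    n.lineno = anchor.linenum;
    return n;
}
```

For a while statement whose test runs from line 1 to line 3, the node built from the WHILE token still reports line 1 even though the scanner's lineno has reached 3.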

HINT: The yacc code in the book is a good example of how to connect the nodes you create. The node creation code is a good model for how to create nodes and print a tree. Use your notes from class on how to put the rest of it together.

Examples

A great example file is the everything.c- file which produces this tree file.

Submission

Homework will be submitted as an uncompressed tar file to the homework submission page. You can submit as many times as you like. The LAST file you submit BEFORE the deadline will be the one graded. For all submissions you will receive email showing how your file performed on the pre-grade tests. The grading program will use more extensive tests and those results will be mailed to you when they are run.

If you have tests you really think are important or just cool please send them to me and I will consider adding them to the test suite.


Robert Heckendorn Last updated: Sep 22, 2009 20:38