The Problem

In this assignment we will type our abstract syntax tree (AST) and start to do semantic analysis and semantic error generation. We will also add some command line options and a second option to print out the tree.

New Compiler Options

As last time have the two options. The -d option from which sets the yydebug variable to 1. The second option is -p (lower case p) which prints the abstract syntax tree. That is, it prints the syntax tree we did for assignment 2.

The third option is -P (capital p) which prints the syntax tree as above plus prints type of all nodes in your AST. Many of these, like if will be undefined. But many will have values. IMPORTANT: you will be graded on the output of both tree prints. If you didn't get your tree perfect in assignment 2, now you can!

c- should still accept a single input file either from a filename given on the command line or redirected as standard input.

We will continue adding options throughout the semester. (HINT: Ourgetopt code that is in this ourgetopt.cpp file and is free for your use. Check out: a ourgetopt example that is a silly example of how to do a variety of things with ourgetopt. This will make your life easier and let you focus on the compiler stuff. I also have the getopt man page available.)

Reorganize your Code

To help me with grading please put your tree printing code in a file called printtree.cpp and printtree.h. Put your semantic analysis code in semantic.cpp and semantic.h. If you use my ourgetopt routine then put that in ourgetopt.cpp and ourgetopt.h. Update your makefile to build these by putting their .o's in the dependency list for building c- and the g++ line.

Semantic Errors

We want to generate errors that are tagged with useful line numbers. So we will need to be sure each node is tagged with a useful line number. Remember, to do this effectively we need to grab the line number as soon as possible (in flex) and associate it with the token. This can be done nicely (portably) by passing back a struct/class for each token (as you have probably already done) in the yylval which has all the information about the token such as its line number, lexeme (what the user typed), constant value, even token class. (A struct/class allows you to return more than a single value in yylval.) You should avoid using global variables for token information when possible.

Once the information is passed back in the tokenData, then things like the line number can be squirreled away in the TreeNodes in the tree. This information is then used when the tree is traversed for semantic analysis.

Scope and Type Checking

After checking if you should print the abstract syntax tree, you will now traverse the tree looking for typing and program structure errors. So your main() might look something like this:

	numErrors=0; 
        numWarnings=0;

	yyparse();

	if (numErrors==0) {
            // -p
            if (printSyntaxTree) printTree(syntaxTree, NOTYPES);

            scopeAndType(syntaxTree);   // semantic analysis (may have errors)

            // -P
            if (printAnnotatedSyntaxTree) printTree(syntaxTree, TYPES);

            // code generation will go here

            }

        // report the number of errors and warnings
        printf("Number of errors: %d\n", numErrors);
        printf("Number of warnings: %d\n", numWarnings);

Your main may look quite different. The routine scopeAndType will process the tree by calling a treeTraverse routine that starts at the root node for the tree and recursively calls itself for children and siblings until it gets to the leaves. Declarations will make entries in a symbol table (see the symbol table section below).

Your job in writing the treeTraverse routine is to catch a variety of warnings and errors and duplicate my output for any input given. You should keep count of the number of warnings and errors and report that at the end of a run. Here is the list of errors right out of my version in printf format. To get an easy match to the expected output it helps you immensely if you use exactly these formats. Note that the type string that you put into the message is often something like "type int" or "unknown type". These are exactly the errors you must catch for this assignment There are another 15 or so error messages for the next assignment.

Here are some details by node type but this list is not exhaustive. You are in control of the design as long as it duplicates my output.

  • For declaration nodes check for duplicate definitions in the symbol table. A special case happens in the case of the first compound statement in a function which will NOT open a new scope:

    fred(int x) { int x; }
    

    is a duplicate definition of x while

    fred(int x) { { int x; } }
    

    is not. This is what C++ does. Try it.

  • For compound statements a newScope needs to be handled. The symbol table object I supply will let you set up and destroy scopes and enter variables in the symbol table. You can put a label of your choice on a new scope in the enter method for the SymbolTable object. For instance: enter("Compound Statement"); to label the scope. If you turn on debugging using a good label for each scope might help make the debug information more readable.
  • Assignments and operators should check that they have the proper type. Types of expressions will have to be passed up. Beware of Cascading Errors as discussed in class. Hint: it might be useful to have an undefined type that is used when variables are undefined or the type is undefined.
  • Consider using an array or clever function rather than a switch or cascading if to know what types operators require for the operand and use the same strategy for remembering what type is returned. Some examples are:
    • > takes Integers and returns a Boolean.
    • + takes Integers and returns an Integer.
    • or takes Booleans and returns a Boolean.
    • The operators == and !=, take arguments that are of the same type (e.g. both Boolean or both Integer) and return a Boolean.
    • = take arguments that are of the same type and returns the type of the lhs. This means if there is an undefined operand, the lhs operand even if undefined is the type of the assignment. This is because assignment is an expression and can be used in cascaded assignment like: a = b = c = 314159265
    • ++ and -- takes in Integer and immediately returns and Integer. This is not like in C or C++ in which these operators have special semantics.
    See the error messages above to find an appropriate error message. Note that in the error messages above lhs means left hand side and rhs means right hand side.

    All operators work only on integers and return integers except the following. The table shows the operator, LHS, and RHS of the OP (if applicable) and the result of the operator:

    OP LHS RHS result
    AND b b b
    OR b b b
    '=' abci abci aL
    EQ abci abci b
    NOTEQ abci abci b
    '<' ci ci b
    '>' ci ci b
    GRTEQ ci ci b
    LESSEQ ci ci b
    '[' Abci i L
    '*' Abci (unary) i

    a is array allowed, b is Boolean, c is character, i is integer, L type is the lhs operand, and A is requires an Array.

  • For Ids you have to see if the variable has been previously defined or not. This can be done using convenient functions in the symbolTable. If not previously defined set the type of the Id node to the type of the declaration and put a pointer to the declaration node in the symbolTable. This way you can always find your way back to the declaration node in the tree and store all the information about a variable at its declaration. If the Id is undefined then set the type of the id to UndefinedType (or some other indicator that the type information is missing). Semantic errors can create undefined types as when you add 'a' to true. This is a type that when it does not match the required type of parent nodes it does not generate an error!!! Again to prevent cascading errors. Note: You may have to guard against UndefinedType in many places. It is helpful in this assignment to encapsulate redundant code as functions.
  • IMPORTANT: Note that for this assignment each undefined reference must generate an error message. We'll fix this later.
  • For Ids you can have arrays that are indexed. Once they are indexed, their type becomes nonarray. That is the type of the '[' is the type of the lhs. Check for indexing of nonarrays and using unindexed arrays where they can't be used.
  • void is the type of a function that intentionally returns no value. It is possible for a type to be of type void in an error message.
  • Ids that are arrays can also be prefixed with '*' operator. That lets you get at the size of the array. Every array stores not only the values in the array but its size. This means that an array of size 10 (e.g. frog[10]) needs 11 spaces allocated to it. More about this in the memory allocation assignment. For this assignment you only need to know that '*' works on arrays and returns an int.
  • The return statement you must make sure that the user does not try to return a whole array.
  • Finally, after processing the whole tree, main should be in the global symbol table. If the procedure main is not defined then you must print out an error.

Symbol Table

Here is a useful C++ symbol table object you can use:

tar of C++ symbol table stuff

This version uses C++ standard library and is cheap to do. Feel free to augment it or build your own. Though it is written with std::string type as the argument type in many places, you can cast a char * to and from std::string.

It provides a symbol table object with insert and lookup methods for symbols and a pointer (you can use the pointer to point to a TreeNode. It also has enter and leave methods of managing the scope stack. Read the symbolTable.h for more information on how to use it. You might want to just play with it to see how it works before you put it into your compiler (see test routines commented out in the supplied code.). Inserting a symbol that is already defined returns false, success is true. Looking up a symbol that is not there will return a NULL pointer.

One feature of the symbol table is the debug method and the two DEBUG flags. At construction time the SymbolTable object is in nondebugging mode. But by setting the flags with the debug method you can get the object to spew out info. You might consider starting out by printing the symbol table on exit from a scope using the debug flag.

Finally the symbol table print routine takes print function that will print your treeNode. So if you define something to print a node given a TreeNode * then you can supply that name to the print function to print out your symbol table stack. That way the code doesn't have to know what you TreeNode looks like internally. For instance in my code:

 
    symtab = new SymbolTable();
creates the symbol table. To print the symbol table:
 
    symtab->print(nodePrint);
will print each void * in the symbol table using your supplied function:
    void nodePrint(void *p)

Submission

Homework will be submitted as an uncompressed tar file that contains no subdirectories. The tar file is submitted to the class submission page. You can submit as many times as you like. The LAST file you submit BEFORE the deadline will be the one graded. Absolutely, no late papers. For all submissions you will receive email at your uidaho address showing how your file performed on the pre-grade tests. The grading program will use more extensive tests, so thoroughly test your program with inputs of your own.

Your code should compile and run with runtime errors such as seg faults. If it doesn't, it is considered nearly ungradable.