CS445 - Assignment 3

The Problem

Do not put this off. This assignment is much harder than previous assignments.

In this assignment we will type our expressions in the abstract syntax tree (AST) and start to do semantic analysis and semantic error generation. We will also be checking to be sure variables are declared before use. And we'll even be able to warn when a variable might not be initialized before use.

New Compiler Options

This time you have five options on the command line.

The -d option (lowercase d) sets the yydebug variable to 1.

The -D option (uppercase D) turns on symbol table debugging. This will cause a line of information to be printed out for every action you perform with the symbol table. To do this, just set debugging in the SymbolTable object.

The -p option (lowercase p) prints the exact same tree as in assignment 2.

The -P option (uppercase P) prints the abstract syntax tree with type information. That is, it prints the syntax tree we did for assignment 2, but with types added. This is true for symbols both at the declaration and at point of use. The symbol table below will help with that.

The -h option prints out this usage message:

Usage: c- [options] [sourceFile]
options:
-d      - turn on parser debugging
-D      - turn on symbol table debugging
-h      - this usage message
-p      - print the abstract syntax tree
-P      - print the abstract syntax tree plus type information

c- should still accept a single input file either from a filename given on the command line or redirected as standard input.
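
As a rough illustration of the five options, here is a minimal hand-rolled parser. It does not use ourgetopt, since that routine's interface is not shown here. The `Options` struct, its field names, and `parseOptions` are all hypothetical; in a real main() you would set yydebug from `parserDebug` and turn on SymbolTable debugging from `symtabDebug`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical container for the five flags; field names are illustrative only.
struct Options {
    bool parserDebug = false;     // -d
    bool symtabDebug = false;     // -D
    bool showUsage = false;       // -h
    bool printTree = false;       // -p
    bool printTypedTree = false;  // -P
    bool bad = false;             // an unknown option letter was seen
    std::string sourceFile;       // empty means "read standard input"
};

// Parse the arguments that follow the program name.
// Option letters may be bundled in one argument, e.g. "-dP".
Options parseOptions(const std::vector<std::string> &args) {
    Options o;
    for (const std::string &a : args) {
        if (a.size() > 1 && a[0] == '-') {
            for (std::size_t i = 1; i < a.size(); ++i) {
                switch (a[i]) {
                    case 'd': o.parserDebug = true; break;
                    case 'D': o.symtabDebug = true; break;
                    case 'h': o.showUsage = true; break;
                    case 'p': o.printTree = true; break;
                    case 'P': o.printTypedTree = true; break;
                    default:  o.bad = true; break;
                }
            }
        } else {
            o.sourceFile = a;  // the last non-option argument wins
        }
    }
    return o;
}
```

Leaving `sourceFile` empty when no filename is given keeps the "redirected as standard input" case simple: main() only reopens the input when the string is non-empty.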

Reorganize your Code

Put your semantic analysis code in semantic.cpp and semantic.h. If you use my ourgetopt routine then put that in ourgetopt.cpp and ourgetopt.h. Update your makefile to build these by putting their .o's in the dependency list for building c- and in the g++ line. You might find it useful and educational to put main() and associated routines into their own main.cpp file if you haven't already. This is not required.

Semantic Errors

We want to generate errors that are tagged with useful line numbers. So we will need to be sure each node is tagged with a useful line number. Remember, to do this effectively we need to grab the line number as soon as possible (in flex) and associate it with the token. This can be done nicely (portably) by passing back a struct/class for each token (as you have probably already done) in the yylval which has all the information about the token such as its line number, lexeme (what the user typed), constant value, even token class. (A struct/class allows you to return more than a single value in yylval.) You should avoid using pointers to global yy variables for token information when possible because the parser looks ahead and may already be onto the next token.

Once the information is passed back in the tokenData, then things like the line number, size, type, etc may be squirreled away in the TreeNodes in the tree! This information is then used when the tree is traversed for semantic analysis.
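
A minimal sketch of that hand-off, assuming a token record and a heavily reduced tree node. The names `TokenData`, `TreeNode`, `newIdNode`, and all of their fields are placeholders for whatever you already built in assignment 2.

```cpp
#include <cassert>
#include <string>

// Hypothetical token record passed back through yylval.
struct TokenData {
    int tokenClass;      // token class returned to the parser
    int lineNum;         // line number captured in the scanner
    std::string lexeme;  // what the user typed
    int numValue;        // constant value, when the token has one
};

// Hypothetical, heavily reduced tree node.
struct TreeNode {
    int lineNum = 0;
    std::string name;
};

// Copy what the node needs out of the token at the moment the node is built,
// so later error messages do not depend on any global scanner state.
TreeNode *newIdNode(const TokenData &tok) {
    TreeNode *n = new TreeNode;
    n->lineNum = tok.lineNum;
    n->name = tok.lexeme;
    return n;
}
```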

Scope and Type Checking

After checking if you should print the abstract syntax tree, you will now traverse the tree looking for typing and program structure errors. So your main() might look something like this:

    numErrors = 0;
    numWarnings = 0;

    yyparse();

    if (numErrors == 0) {
        // -p
        if (printSyntaxTree) printTree(syntaxTree, NOTYPES); // only types in declarations

        symbolTable = new SymbolTable();
        semanticAnalysis(syntaxTree, symbolTable);   // semantic analysis (may have errors)

        // -P
        if (printAnnotatedSyntaxTree) printTree(syntaxTree, TYPES);  // all types

        // code generation will go here
    }

    // report the number of errors and warnings
    printf("Number of errors: %d\n", numErrors);
    printf("Number of warnings: %d\n", numWarnings);

Your main may look quite different. I do a couple of other setup things for instance. The routine semanticAnalysis will process the tree by calling a treeTraverse routine that starts at the root node for the AST and recursively calls itself for children and siblings until it gets to the leaves. Declarations will make entries in a symbol table (see the symbol table section below). References to symbols will be looked up in the symbol table.
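
The traversal described above might be shaped roughly like this. It is only a skeleton under assumed names: `TreeNode`, `MAXCHILDREN`, and the fields `child` and `sibling` stand in for your own node layout, and siblings are walked with a loop rather than by recursion. The return value (a node count) exists only to make the sketch checkable.

```cpp
#include <cassert>

const int MAXCHILDREN = 3;  // assumed fan-out; use whatever your tree uses

// Hypothetical, reduced tree node.
struct TreeNode {
    TreeNode *child[MAXCHILDREN] = {nullptr, nullptr, nullptr};
    TreeNode *sibling = nullptr;
    int lineNum = 0;
};

// Visit a node, then its children, then move on to its siblings.
// Returns the number of nodes visited.
int treeTraverse(TreeNode *node) {
    int visited = 0;
    for (TreeNode *n = node; n != nullptr; n = n->sibling) {
        ++visited;
        // work done before the children goes here
        // (e.g. entering a scope, inserting a declaration)
        for (int i = 0; i < MAXCHILDREN; ++i) {
            visited += treeTraverse(n->child[i]);
        }
        // work done after the children goes here
        // (e.g. type checking an operator, leaving a scope)
    }
    return visited;
}
```

Splitting the work into a "before children" and an "after children" step is one possible design: scopes are naturally opened before and closed after, while operator type checks need the children's types and so belong after.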

Your job in writing the treeTraverse routine is to catch a variety of warnings and errors and duplicate my output exactly for any input given. For this assignment, all input will be syntactically legal, but there may be many semantic errors. (In a couple of assignments we'll come back and look at normalizing all errors and making our program keep running through syntax errors.)

You should keep count of the number of warnings and errors and report that at the end of a run. Here is the list of errors right out of my version in printf format. To get an easy match to the expected output it helps you immensely if you just use exactly these formats. Note that the type string that you put into the message is often something like "type int" or "unknown type". These are exactly the errors you must catch for this assignment. There are 16 error messages and 2 warnings for this assignment. The other half of the errors will be in the next assignment.

Here are some details by node type but this list is NOT EXHAUSTIVE. You are in control of the design as long as it duplicates my output.

For declaration nodes check for duplicate declarations using the symbol table. A special case happens in the case of the first compound statement in a function which will NOT open a new scope:
```
fred(int x) { int x; }
```
is a duplicate definition of x while
```
fred(int x) { { int x; } }
```
is not. This is what C++ does. Try it.
For compound statements a newScope needs to be handled. The symbol table object I supply will let you set up and destroy scopes, enter variables in the symbol table and check for a variable definition. For documentation purposes, you can put a label of your choice on a new scope in the enter method for the SymbolTable object. For instance: enter("Compound Statement"); to label the scope. If you turn on debugging using a good label for each scope might help make the debug information more readable.
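
The special case for a function's first compound statement can be sketched as follows. `MiniTable` is a stand-in written for this example, not the supplied SymbolTable (read symbolTable.h for the real signatures), and `enterCompound` with its `isFunctionBody` flag is a hypothetical helper.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stand-in for the supplied SymbolTable, reduced to what this sketch needs.
class MiniTable {
    std::vector<std::map<std::string, void *>> scopes;
public:
    MiniTable() { enter("Global"); }
    void enter(const std::string &) { scopes.emplace_back(); }  // label unused here
    void leave() { if (scopes.size() > 1) scopes.pop_back(); }
    // Returns false if the name is already defined in the innermost scope.
    bool insert(const std::string &name, void *ptr) {
        return scopes.back().emplace(name, ptr).second;
    }
    int depth() const { return static_cast<int>(scopes.size()); }
};

// Open a scope for a compound statement unless it is the body of a function,
// whose scope was already opened to hold the parameters.
// Returns true if a scope was opened (so the caller knows to leave() later).
bool enterCompound(MiniTable &st, bool isFunctionBody) {
    if (isFunctionBody) return false;
    st.enter("Compound Statement");
    return true;
}
```

With this shape, `fred(int x) { int x; }` fails on the second insert of x because both land in the function's scope, while `fred(int x) { { int x; } }` succeeds because the inner compound opens its own scope.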
Assignments and operators should check that they have the proper type. Types of expressions will have to be passed up the tree so they can be checked by the operators that use them. Beware of Cascading Errors as discussed in class. Hint: there are two things you can do. First, many operators have an assumed return type. For example '+' returns an int. Be sure to set that. Second, it might be useful to have an undefined type that is used when variables have not been declared or the type is undefined.
Consider using an array or clever function rather than a switch or "cascading if" to know what types operators require for the operand and use the same strategy for remembering what type is returned. Hint: I have intentionally limited the number of cases for type checking. It is easy to code up those as functions you can invoke based on what the operator expects. Some examples are:
- > takes Integers and returns a Boolean.
- + takes Integers and returns an Integer.
- | takes Booleans and returns a Boolean.
- The operators == and != take arguments that are of the same type (e.g. both Boolean or both Integer) and return a Boolean. They can be arrays.
- = takes arguments that are of the same type and returns the type of the lhs. This means if there is an undefined operand, the lhs operand even if undefined is the type of the assignment. This is because assignment is an expression and can be used in cascaded assignment like: a = b = c = 314159265 (An intentional and interesting side note is that an assignment is NOT a simpleExp! How does that affect what you can put in say... an if statement test? Python does this for example.)
- ++ and -- take an Integer and immediately return an Integer. This is not like in C or C++ in which these operators have special semantics.
See the error messages above to find an appropriate error message. Note that in the error messages above lhs means left hand side and rhs means right hand side.
See the c-Grammar for tables of operators and what types of arguments they take and what type they will return.
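
One way to act on the "array or clever function" hint is a small table keyed by operator. This sketch covers only the binary operators listed above and ignores arrays. `ExpType`, `Operand`, `OpRule`, `opRules`, and `checkBinary` are assumed names, and the c-Grammar tables remain the authority on what each operator really takes.

```cpp
#include <cassert>
#include <map>
#include <string>

enum class ExpType { Undefined, Integer, Boolean };

// What an operator expects of its operands; Same means "must match each other".
enum class Operand { Integer, Boolean, Same };

struct OpRule {
    Operand operand;
    bool resultIsLhs;  // true only for '=': the result takes the lhs type
    ExpType result;    // used when resultIsLhs is false
};

const std::map<std::string, OpRule> &opRules() {
    static const std::map<std::string, OpRule> rules = {
        {">",  {Operand::Integer, false, ExpType::Boolean}},
        {"+",  {Operand::Integer, false, ExpType::Integer}},
        {"|",  {Operand::Boolean, false, ExpType::Boolean}},
        {"==", {Operand::Same,    false, ExpType::Boolean}},
        {"!=", {Operand::Same,    false, ExpType::Boolean}},
        {"=",  {Operand::Same,    true,  ExpType::Undefined}},
    };
    return rules;
}

// Returns true when the operand types are acceptable for op.
// 'result' is always set, so one bad operand does not poison the parent node.
// Undefined operands never report a mismatch, to avoid cascading errors.
bool checkBinary(const std::string &op, ExpType lhs, ExpType rhs, ExpType &result) {
    const OpRule &r = opRules().at(op);  // throws for operators not in the table
    result = r.resultIsLhs ? lhs : r.result;
    if (lhs == ExpType::Undefined || rhs == ExpType::Undefined) return true;
    switch (r.operand) {
        case Operand::Integer: return lhs == ExpType::Integer && rhs == ExpType::Integer;
        case Operand::Boolean: return lhs == ExpType::Boolean && rhs == ExpType::Boolean;
        case Operand::Same:    return lhs == rhs;
    }
    return false;
}
```

A real version would report lhs and rhs problems separately so the right error message can be chosen; this sketch collapses them into one boolean.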
For Ids you have to see if the variable has been previously declared or not. Usage can be tracked by creating a special flag isUsed in the treeNode. You can then look up the declaration node in the symbolTable and set and query the flag in that node. If the Id was previously declared, set the type of the Id node to the type of the declaration. For this to work, put a pointer to the declaration node in the symbolTable when the declaration is processed. VERY IMPORTANT: This way you can always find your way back to the declaration node in the tree and store all the information about a variable at its declaration! If the Id is undeclared, then set the type of the id to UndefinedType (or some other indicator that the type information is missing). To prevent cascading errors, undefined types do not create an error when compared to an expected type. It is assumed an error was generated when the type was marked undefined.
IMPORTANT: Note that for this assignment each undeclared reference must generate an error message. We may fix this later.
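
The Id handling above might look roughly like this. A `std::map` stands in for the symbol table lookup, and `TreeNode`, `ExpType`, the `decl` field, and `resolveId` are assumed names. The actual error text is deliberately left out; use the format from the assignment's error list.

```cpp
#include <cassert>
#include <map>
#include <string>

enum class ExpType { Undefined, Integer, Boolean };

// Hypothetical, reduced tree node.
struct TreeNode {
    std::string name;
    int lineNum = 0;
    ExpType type = ExpType::Undefined;
    bool isUsed = false;       // meaningful on declaration nodes
    TreeNode *decl = nullptr;  // on Id nodes: link back to the declaration
};

// Resolve one Id reference. 'table' maps a name to a void * that points at
// the declaration node, in the way the symbol table is described above.
// Returns true if the Id was declared.
bool resolveId(TreeNode &id, const std::map<std::string, void *> &table, int &numErrors) {
    auto it = table.find(id.name);
    if (it == table.end()) {
        // Print the assignment's "not declared" message here, using id.lineNum.
        id.type = ExpType::Undefined;  // later type checks stay quiet about this Id
        ++numErrors;                   // every undeclared reference counts
        return false;
    }
    TreeNode *decl = static_cast<TreeNode *>(it->second);
    decl->isUsed = true;
    id.decl = decl;
    id.type = decl->type;
    return true;
}
```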
We issue a warning for the possibility that a variable is being used that was uninitialized. This only applies to strictly locally declared variables, not globals, statics, or parameters. If a variable's rhs value is used before (appears in the tree traversal before) it has been initialized or appeared on the lhs a warning is issued. So being on the lhs in a binary assignment or being initialized on declaration will cause the variable to be marked as initialized. This is very similar to the used warning code.
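
The rule for the uninitialized warning can be captured in two small helpers. `VarKind`, `VarDecl`, `markAssigned`, and `shouldWarnOnUse` are hypothetical names. Whether a variable warns once or on every use is not decided here; match the expected output.

```cpp
#include <cassert>
#include <string>

enum class VarKind { Global, Local, LocalStatic, Parameter };

// Hypothetical per-declaration record; in practice these would be fields
// on the declaration's tree node.
struct VarDecl {
    std::string name;
    VarKind kind;
    bool isInitialized;  // set by an initializer or by appearing on the lhs
};

// Call when the variable is initialized at declaration or assigned to.
void markAssigned(VarDecl &d) { d.isInitialized = true; }

// Call when the variable's value is read. Only strictly local variables
// qualify; globals, statics, and parameters never warn.
bool shouldWarnOnUse(const VarDecl &d) {
    return d.kind == VarKind::Local && !d.isInitialized;
}
```

Because the check depends on traversal order, the lhs of an assignment has to be marked at the right moment relative to visiting the rhs; where exactly that happens is part of your design.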
For Ids you can have arrays that are indexed. Once they are indexed, their type becomes nonarray. That is, the type of the '[' node is the type of the lhs with the array property removed. Check for indexing of nonarrays and using unindexed arrays where they can't be used.
void is the type of a function that intentionally returns no value. It is possible for a type to be of type void in an error message.
Ids that are arrays can also be prefixed with '*' operator. That lets you get at the size of the array. Every array stores not only the values in the array but its size. This means that an array of size 10 (e.g. frog[10]) needs 11 spaces allocated to it. More about this in the memory allocation assignment. For this assignment you only need to know that '*' works on arrays and returns an int.
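
The typing of '[' and the unary '*' described above can be sketched as two small functions. `TypeInfo`, `indexResult`, and `sizeofResult` are assumed names, the check on the index expression itself is omitted, and how undefined types interact with these checks is left to your design.

```cpp
#include <cassert>

enum class ExpType { Undefined, Void, Integer, Boolean, Char };

// Hypothetical pairing of a base type with an "is this an array" flag.
struct TypeInfo {
    ExpType type;
    bool isArray;
};

// '[': the lhs must be an array. The result has the lhs base type and is
// no longer an array. 'ok' is false when a nonarray is indexed.
TypeInfo indexResult(const TypeInfo &lhs, bool &ok) {
    ok = lhs.isArray;
    return TypeInfo{lhs.type, false};
}

// Unary '*': the operand must be an array. The result is a nonarray int
// (the size of the array).
TypeInfo sizeofResult(const TypeInfo &operand, bool &ok) {
    ok = operand.isArray;
    return TypeInfo{ExpType::Integer, false};
}
```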
For the return statement you must make sure that the user does not try to return a whole array.
Finally, after processing the whole tree, main should be in the global symbol table. If the procedure main is not defined then you must print out an error.

Symbol Table

Here is a useful C++ symbol table object you can use:

tar of C++ symbol table stuff

Please use the version from Feb 23, 2021 or later. This version uses the C++ standard library. Feel free to augment it or build your own. Though it is written with std::string as the argument type in many places, you can pass a char * where a std::string is expected and get a char * back from a std::string with c_str().

The symbol table object has insert and lookup methods for symbols; each symbol is stored with a pointer (you can use the pointer to point to a TreeNode). It also has enter and leave methods for managing the scope stack. You should always access symbols through the symbol table object and you should never have to access a scope object. The scopes are managed by the enter and leave methods. Read symbolTable.h for more information on how to use it. You might want to just play with it to see how it works before you put it into your compiler (see the test routines commented out in the supplied code). Inserting a symbol that is already defined returns false; success returns true. Looking up a symbol that is not there will return a NULL pointer.

One feature of the symbol table is the debug flag. At construction time the SymbolTable object is in nondebugging mode. But by setting the flag with the debug method you can get the object to spew out info. You can also just print the symbol table if using the provided print routine. You might consider starting out by printing the symbol table on exit from a scope using the debug flag.

Finally, the symbol table print routine takes a print function that will print your treeNode. So if you define something to print a node given a TreeNode * then you can supply that name to the print function to print out your symbol table stack. That way the code doesn't have to know what your TreeNode looks like internally. For instance in my code:

 
    symtab = new SymbolTable();

creates the symbol table. To print the symbol table:

 
    symtab->print(nodePrint);

will print each void * in the symbol table using your supplied function:

    void nodePrint(void *p)
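
A minimal sketch of such a print function, assuming a reduced `TreeNode` with `name` and `lineNum` fields. The output format and the helper `nodeToString` are made up for illustration; print whatever helps you debug.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical, reduced tree node.
struct TreeNode {
    std::string name;
    int lineNum;
};

// Build the text separately so it can be checked without capturing stdout.
std::string nodeToString(const TreeNode *n) {
    if (n == nullptr) return "(null)";
    return n->name + " declared on line " + std::to_string(n->lineNum);
}

// Matches the void nodePrint(void *p) shape shown above: the symbol table
// hands back the stored void *, and we cast it to the node type we put in.
void nodePrint(void *p) {
    TreeNode *n = static_cast<TreeNode *>(p);
    std::printf("%s", nodeToString(n).c_str());
}
```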

A note about testing

I will be sorting your error messages before comparison to the expected output so that the order in which the messages come out is not as important. The order is a clue about how testing might be implemented however.

Submission

Homework will be submitted as an uncompressed tar file that contains no subdirectories. The tar file is submitted to the class submission page. You can submit as many times as you like. The LAST file you submit BEFORE the deadline will be the one graded. Absolutely no late papers. For all submissions you will receive email at your uidaho address showing how your file performed on the pre-grade tests. The grading program will use more extensive tests, so thoroughly test your program with inputs of your own.

Your code should compile and run without runtime errors such as seg faults. If it doesn't, it is considered nearly ungradable.