The Task

Write a C++ program called id6.cpp that builds a decision tree and then queries the tree to find expected categories. Use the id6StarterCode.cpp id6StarterCode.h to get a strong start on the problem. I have given you all the entropy and gain calculations. You need to invoke them correctly. This algorithm is based on the ID3 algorithm in the book but is slightly different in that it has default values and makes some other choices, see the starter code. Learn how things are called and data structures arranged by looking at the code. You still need to create some tricky code so plan your time. I put in diagnostic print statements to show what it is doing as works so you can create those diagnotistics. You must match the diagnotistics as well as the tree and computed categories!

I will invoke id6 to read data from standard in as done by the algorithm demonstrated in class. It takes no arguments on the command line. The algorithm is simple and recursive, but it takes some thought. IMPORTANT: your algorithm must compute information gain in the order the features were given so you always generate the same tree given the same data. This may have been done for you in the starter code. The order makes a difference when you maximize the information gain by saving the first occurrence of maximum gain. I will test this in the tests.

Input format

The program takes a data file from standard input in the form of three matrices that match our matrix library input. The matrix library and a simple tree object can be found in these archives: Matrix library V3.9 tar Matrix library V3.9 zip Here is the layout:

NumFeatures MaxNumFeatureValues
feature1 Numvalues value1 value2 value3 ...
feature2 Numvalues value1 value2 ...
feature3 Numvalues value1 value2 value3 value4 ...
 ...
Ans ans1 ans2 ans3
NumTrainingExamples NumFeatures
f1 f2 f3... ans
f1 f2 f3... ans
f1 f2 f3... ans
f1 f2 f3... ans
f1 f2 f3... ans
f1 f2 f3... ans
f1 f2 f3... ans
NumTrainingExamples NumFeatures
q1 q2 q3... ans
q1 q2 q3... ans
q1 q2 q3... ans
q1 q2 q3... ans

Be sure to examine the test examples to be sure you see what is going on. NumFeatures is the number of features including the answer which must be labeled "Ans". Then comes NumFeatures lines of features value descriptions the last of which is Ans. Each line is a feature name followed by the number of values of the feature and then the strings that represent the values for the feature. For example: Temp 3 Cool Warm Hot. The number of feature values for a given feature may be variable. The matrix is filled out the the maximum number of feature values across all the rows with missing feature values replaced with the string "-". See examples. This matrix can be read in with the readStrings method in Matrix.

The next matrix is data used to construct the tree and can be thought of as training data.

Finally, the third and last matrix is the query data. These are rows that are to be classified using the tree you built.

For turning in your program you must generate the exact tree and indenting character by character. I have supplied a helpful tree object in with the matrix download. As always if you get the exact right answer you will get a congratulations message. If not you will get the sdiff with the expected output.

Submission

Homework will be submitted as an uncompressed tar file to the homework submission page linked from the class web page. Be sure to include a make file to build your code and do NOT include the picture files. I will supply some. You can submit as many times as you like. The LAST file you submit BEFORE the deadline will be the one graded. For all submissions you will receive email at your uidaho.edu mail address giving you some automated feedback on the unpacking and compiling and running of code and possibly some other things that can be autotested.

Have fun.