The Task

It is usually the case that you can consider the training data to be exemplars of correct answers and so points that are similar to a training data point might reasonably be assumed to have the same output value as that training data point. We'll call the following the Rule of Similarity:

if f(x) -> y then f(z) -> y for z near x.

As practicianers we have several choices to make. 1) what points to we pick? and 2) what do we mean by distance? In the first case, we have some tools we have discussed already for picking points. SVMs, for instance, try to pick points along the boundaries know to delineate different values of y. If the dimensions are somewhat large and we don't have a lot of training points or the boundary is very non-linear then perhaps we just use all the training points. If we have too many dimensions then we can try to reduce the dimensions with PCA first and then use K nearest neighbor (KNN).

For distance we want a meaningful distance where the the Rule of Similarity holds. For the KD-Tree algorithm we will want a true distance measure in which the Triangle Inequality holds. The "go to" distance measure is Euclidean distance which is in the family of Lebesgue norms or Lp norms or just p-norms. These norms range from taxi cab distance to the max of the magnitude of all distances. The similarity measure BC or Bhattacharyya coefficient can be used to create a distance measure: sqrt(1 - BC(x, y)) where x and y are vectors. Angular Distance function may have an application for some problem you are trying to solve in that is computes the angle between two points measured at the origin you chose. This may be particularly useful in higher dimensional space where Lp Norms tragically lose useful meaning.

Implement a C++ program called kdtree using the matrix library trick shown in class where you create a tree of row vectors in a matrix by sorting subsets of rows in a specific columns. The result is a kdtree in a matrix. The algorithm is the same as but the tree is built by just reordering the rows of the matrix. Each parent node is discovered as a middle (median) between the two halves of the data when the data is sorted on a feature column. The tree begins by being sorted in column 1 (for a labeled matrix). Then the middle point is chosen and the data above and below that middle point is then sorted on column 2. Those halves are then split and the data above and below that midpoint are sorted on column 3 etc. This builds the equivalent of the KD-Tree. Now the kdtree algorithm is simply the C++ version of but using this tree-in-a-matrix data structure instead of lists and dictionaries. Turns out to be fairly efficient and easy to implement.

Things to watch out for:

  • In labeled matrices like what we have for the color data uses the first column to label the rows. See the matrix library for details. Be sure you don't use column 0 as the key in distance or sorting when creating the KDTree. But be sure you use the default that maintains the label with each row.
  • Use only the latest matrix library (version 2.5 or later). Earlier ones have a bug in the column sort routine or may not even have the column sort routine.
  • I will put in some output lines in the official answer to help you see what you are expected to do. Those lines in the output count. Be sure to leave them in for testing.

Input is first a labeled matrix of rows where each row begins with a word (no whitespace in the word) followed by d columns of data. The matrix can be read by the readLabeledRow() function which returns a char ** of labels. The first column in the read matrix is the index of the corresponding label. Not ideal, but workable. The second half of the input is a matrix of unlabeled data of d columns each. For each row in the matrix use the kdtree algorithm to look up the closest answer and print out that label. Sample output will be provided.


Homework will be submitted as an uncompressed tar file to the homework submission page linked from the class web page. Be sure to include a make file to build your code and do NOT include the picture files. I will supply some. You can submit as many times as you like. The LAST file you submit BEFORE the deadline will be the one graded. For all submissions you will receive email at your mail address giving you some automated feedback on the unpacking and compiling and running of code and possibly some other things that can be autotested.

Have fun.