The Task

Consider data coming in as R samples of C-dimensional data, represented as an R x C matrix. Let's compress this using PCA so that it uses fewer than C dimensions. We want the retained dimensions to be the ones that carry the most variance (information). The axes of this new space will be the eigenvectors we compute.

The task is to write a C/C++ program called pca that uses principal component analysis (PCA) to pick the K eigenvalues largest in magnitude and encode the data in K-dimensional space instead of C dimensions. This turns out to be easy if you have something that will give you the eigenvalues/eigenvectors of a matrix. Essentially you will be compressing the data by projecting it onto the K most significant eigenvectors. Eigenvalues and eigenvectors can be computed in the matrix library. Here is a help sheet on PCA; select from it whatever helps your implementation.
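
If you want to see the shape of the linear algebra before you start, here is a minimal sketch of the eigenvector step. It uses the Eigen library purely for illustration (you will use the class matrix library instead), and the function name topKEigenvectors is made up for this sketch:

    #include <Eigen/Dense>

    // Return the K eigenvectors of the covariance of X (rows = samples)
    // with the largest eigenvalues; columns of the result are eigenvectors.
    Eigen::MatrixXd topKEigenvectors(const Eigen::MatrixXd &X, int k) {
        Eigen::MatrixXd centered = X.rowwise() - X.colwise().mean();
        Eigen::MatrixXd cov =
            centered.transpose() * centered / double(X.rows() - 1);
        // Covariance matrices are symmetric, so a self-adjoint solver
        // applies; Eigen sorts its eigenvalues in increasing order.
        Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(cov);
        return es.eigenvectors().rightCols(k);  // the K largest are rightmost
    }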

To make this more entertaining, our data will be pictures. A grayscale picture that is R x C in size can be thought of as R samples of C-dimensional vectors of data. The vectors tend to show patterns and to cluster because, in fact, neighboring rows of pixels tend to be highly correlated. A color picture that is R x C can be thought of as an R x (3*C) matrix of points, since each pixel has an R, G, and B value.

Your program will take one integer argument on the command line. This is the number K, that is, the number of eigenvectors you want to keep. It should then read a picture in ppm or pgm format from stdin. No worries: the latest matrix library has a readImagePixmap procedure, and it also has methods to read and write simple pgm and ppm files in and out of matrices!
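
The argument handling is plain C/C++; a minimal sketch follows. The call to readImagePixmap is left as a comment because its exact signature in the matrix library is not shown here:

    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char *argv[]) {
        if (argc < 2) {
            std::fprintf(stderr, "usage: pca K < picture.ppm\n");
            return 1;
        }
        int k = std::atoi(argv[1]);   // may be negative; see the note below
        bool transposed = (k < 0);
        if (transposed) k = -k;
        // ... read the picture from stdin with the matrix library's
        // readImagePixmap, then center, encode, decode, and report ...
        return 0;
    }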

Next, center the columns of the data about the mean of each column.
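
In Eigen notation, again just as an illustrative sketch (the class matrix library will have its own way of saying this), centering looks like the following. Hang on to the mean row, since you need it back when decoding:

    #include <Eigen/Dense>

    // Subtract each column's mean from that column; the means are kept
    // so they can be added back after decoding.
    void center(Eigen::MatrixXd &X, Eigen::RowVectorXd &mean) {
        mean = X.colwise().mean();   // 1 x C row of column means
        X.rowwise() -= mean;
    }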

Your program will then compress the information using PCA; this will give you an R x K matrix. This is the objective in a machine learning pipeline: it is the compressed or abstracted version of the input data, and you could train your ML algorithms on it.
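
The encoding step itself is a single matrix multiply. Here is a sketch, again in illustrative Eigen notation, where vecs holds the top-K eigenvectors as columns (a C x K matrix, as in the earlier sketch):

    #include <Eigen/Dense>

    // Project the centered R x C data onto the K eigenvectors,
    // giving the R x K encoded matrix.
    Eigen::MatrixXd encode(const Eigen::MatrixXd &centered,
                           const Eigen::MatrixXd &vecs) {
        return centered * vecs;
    }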

For us, we will simply turn around, decode the data (a picture), and look at the fidelity of the encoding. So the next step is to decode the matrix using the eigenvector matrix to recover the picture. Because there will be some loss, though perhaps minute, compute the sum-of-squares distance between the original picture and the recovered picture and print it. In particular I want the average per-pixel distance, so divide the matrix distance by the total number of elements in the matrix. You should see this number drop dramatically as you increase K. Why?
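
Here is a sketch of the decode-and-score step in the same illustrative Eigen notation; vecs, mean, and the function name carry over from the sketches above, not from the matrix library:

    #include <Eigen/Dense>

    // Reconstruct the picture from its K-dimensional encoding and
    // return the average squared distance per matrix element.
    double decodeAndScore(const Eigen::MatrixXd &encoded,   // R x K
                          const Eigen::MatrixXd &vecs,      // C x K
                          const Eigen::RowVectorXd &mean,   // 1 x C
                          const Eigen::MatrixXd &original) {
        Eigen::MatrixXd decoded =
            (encoded * vecs.transpose()).rowwise() + mean;
        return (original - decoded).squaredNorm()
               / double(original.size());
    }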

Next, your program should write either a ppm or a pgm file, called "z.ppm" or "z.pgm" depending on whether the original file read in was color. Use the matrix library's write routines to save the recovered picture in the file. The test scripts may look at your picture file to see if you got the right picture back.

Example output, with the names of the matrices I used. Please use these names and labels for ease of grading.

 pca 160 < girlWithPearlEarringJohansson.ppm
 (size of Pic: 540 X 1464)
 (size of Mean: 1 X 1464)
 (size of EigenVectors: 1464 X 1464)
 (size of EigenValues: 1 X 1464)
 (size of Encoded: 540 X 160)
 (size of Decoded: 540 X 1464)
 Per Pixel Dist^2: 1.05186

One last thing: if the command-line argument is a negative integer, then transpose the matrix before doing the PCA. That is, the dimension of the data is now the rows and not the columns. Don't forget to untranspose when done. The number of eigenvectors kept is, of course, -K in this case.
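
One way the sign of K might be handled, as a sketch; pcaRoundTrip is a hypothetical stand-in for the whole center/encode/decode pipeline sketched above:

    #include <Eigen/Dense>

    // Hypothetical: runs the full pipeline and returns the decoded picture.
    Eigen::MatrixXd pcaRoundTrip(const Eigen::MatrixXd &pic, int k);

    // Negative K means the data dimension is the rows: flip the matrix,
    // run the same pipeline with -K components, and flip back at the end.
    Eigen::MatrixXd pcaWithSign(const Eigen::MatrixXd &pic, int k) {
        if (k >= 0)
            return pcaRoundTrip(pic, k);
        return pcaRoundTrip(pic.transpose(), -k).transpose();
    }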

Submission

Homework will be submitted as an uncompressed tar file to the homework submission page linked from the class web page. Be sure to include a makefile to build your code, and do NOT include the picture files; I will supply some. You can submit as many times as you like; the LAST file you submit BEFORE the deadline will be the one graded. For all submissions you will receive email at your uidaho.edu address giving you automated feedback on the unpacking, compiling, and running of your code, and possibly some other things that can be autotested.

Have fun.