Objective: In this assignment, you will demonstrate your skills in data clustering and dimensionality reduction. The assignment has two parts.
Part-1 (Clustering):
Download the digit dataset from the unit site. This dataset contains 8×8 pixel images of the digits 0-9. There are five data files, each containing a different number and selection of digit images. Each file name ends with a digit from 0 to 4. Compute fID = SID % 5, where SID is your own student ID number, and select the data file whose name ends with that fID value. For example, if your student ID is 218201419, then fID = 218201419 % 5 = 4, so you should download the file named “digitData4.csv”.
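The file selection rule can be checked with a short calculation. The following is a minimal sketch: the function name `data_file_for` is hypothetical, and the `digitData<fID>.csv` naming pattern is assumed from the single example given above.

```python
def data_file_for(student_id: int) -> str:
    """Return the data file name selected by fID = SID % 5.

    Assumption: every file follows the pattern of the example "digitData4.csv".
    """
    f_id = student_id % 5  # remainder after dividing by 5, always in 0..4
    return f"digitData{f_id}.csv"


print(data_file_for(218201419))  # digitData4.csv
```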
1- Read the downloaded file into a matrix M of size m×n. Create an empty numpy array X with m rows and n-1 columns, and assign all m rows and the first n-1 columns of M to X. Create a numpy vector trueLabels and assign the n-th column of M to it. Print the dimensions of M, X and trueLabels. (1+1+1+1+1=5 marks)
2- Next, perform K-means clustering with 5 clusters using Euclidean distance as the similarity measure. Evaluate the clustering performance using the adjusted Rand index (ARI) and adjusted mutual information (AMI). Report the clustering performance averaged over 50 random initializations of K-means. (1+1+3=5 marks)
3- If a single run of K-means clustering with ‘Kmeans++’ initialization gives an ARI value of 0.7 on some data set, what will be the value of the ARI averaged over 20 repetitions? Explain why. (1+1=2 marks)
4- Repeat K-means clustering with 5 clusters using a similarity measure other than Euclidean distance (you are free to use other libraries). Evaluate the clustering performance over 50 random initializations of K-means using the adjusted Rand index and adjusted mutual information. Report the clustering performance and compare it with the results obtained in step 2. (2+1+2=5 marks)
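The overall workflow of steps 1, 2 and 4 might be sketched as follows. This is an illustration under stated assumptions, not a model answer: it uses synthetic data from `make_blobs` in place of the downloaded CSV, assumes the label is stored in the last column, and uses scikit-learn as one possible library. The helper `average_scores` is a hypothetical name. For step 4, scikit-learn's `KMeans` only supports Euclidean distance, so the sketch rescales each row to unit length first; on unit vectors the squared Euclidean distance equals 2 - 2·cos, which makes this an approximation of cosine-based clustering rather than a true non-Euclidean K-means.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score
from sklearn.preprocessing import normalize

# Synthetic stand-in for the downloaded file (assumption: label in the last column).
features, labels = make_blobs(n_samples=300, centers=5, n_features=64, random_state=0)
M = np.column_stack([features, labels])

# Step 1: split M (m x n) into X (m x (n-1)) and trueLabels (length m).
m, n = M.shape
X = np.empty((m, n - 1))
X[:, :] = M[:, : n - 1]
trueLabels = M[:, n - 1].astype(int)
print(M.shape, X.shape, trueLabels.shape)


def average_scores(data, true_labels, n_runs=50):
    """Mean ARI and AMI over n_runs K-means fits, each from one random initialization."""
    ari, ami = [], []
    for seed in range(n_runs):
        # init="random" with n_init=1 gives exactly one random start per run.
        km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed)
        predicted = km.fit_predict(data)
        ari.append(adjusted_rand_score(true_labels, predicted))
        ami.append(adjusted_mutual_info_score(true_labels, predicted))
    return float(np.mean(ari)), float(np.mean(ami))


# Step 2: Euclidean K-means on the raw features.
euclid_ari, euclid_ami = average_scores(X, trueLabels)

# Step 4 (approximation): Euclidean K-means on L2-normalized rows,
# which ranks neighbours the same way cosine similarity does.
cosine_ari, cosine_ami = average_scores(normalize(X), trueLabels)

print(f"Euclidean:   ARI={euclid_ari:.3f}  AMI={euclid_ami:.3f}")
print(f"Cosine-like: ARI={cosine_ari:.3f}  AMI={cosine_ami:.3f}")
```

With the real data, `M` would instead come from reading the selected CSV file, for example with `np.loadtxt(path, delimiter=",")` if the file is comma-separated with no header row.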
Part-2 (Dimensionality Reduction using PCA/SVD):
For the provided digits dataset:
1- Perform PCA. Plot the captured variance with respect to increasing latent dimensionality. What is the minimum dimension that captures at least 95% variance? (1+2+2=5 marks)
2- Create a scatter plot of all rows of X projected onto the first two principal components. In other words, the horizontal axis should be v1, the vertical axis v2, and each row (digit image) should be projected onto the subspace spanned by v1 and v2. Your plot must use a different color for each digit and include a legend. (2+1=3 marks)
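Both steps of Part-2 can be illustrated with PCA computed from the SVD of the centered data matrix. The sketch below is one possible approach under stated assumptions: `X` and `trueLabels` are synthetic stand-ins generated with numpy rather than the provided digit data, and the variable names `cumulative`, `min_dim` and `Z` are chosen here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for X and trueLabels (assumption: 64 features, 5 digit classes).
rng = np.random.default_rng(0)
trueLabels = rng.integers(0, 5, size=200)
X = rng.normal(size=(200, 64)) * np.linspace(3.0, 0.2, 64) + trueLabels[:, None]

# PCA via SVD: center the data, then the rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values are proportional to the variance captured per component.
cumulative = np.cumsum(s**2) / np.sum(s**2)
min_dim = int(np.argmax(cumulative >= 0.95)) + 1  # first dimension reaching 95%
print("minimum dimension for 95% variance:", min_dim)

# Project every row onto the first two principal directions v1 and v2.
Z = Xc @ Vt[:2].T

fig, (ax_var, ax_proj) = plt.subplots(1, 2, figsize=(11, 4))

ax_var.plot(np.arange(1, len(cumulative) + 1), cumulative, marker=".")
ax_var.axhline(0.95, linestyle="--", color="gray")
ax_var.set_xlabel("latent dimensionality")
ax_var.set_ylabel("cumulative captured variance")

# One scatter call per digit, so each digit gets its own color and legend entry.
for digit in np.unique(trueLabels):
    mask = trueLabels == digit
    ax_proj.scatter(Z[mask, 0], Z[mask, 1], s=12, label=str(digit))
ax_proj.set_xlabel("v1")
ax_proj.set_ylabel("v2")
ax_proj.legend(title="Digit")

fig.tight_layout()
```

To keep the figure, `fig.savefig("pca_plots.png")` could be added at the end (the file name is arbitrary). scikit-learn's `PCA` class exposes the same per-component quantities through its `explained_variance_ratio_` attribute, if a library implementation is preferred.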