Principal Component Analysis (PCA) Application

Overview

Application example

PCA via Scikit-learn

Use the scikit-learn library to perform PCA on the digits dataset, then compare the original and reconstructed data to get an overall understanding of the reconstruction quality.

Input Data

Import the libraries and visualize the source input data.

# Import needed libs
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# Original data size
original_size = X.nbytes / (1024 * 1024)  # in megabytes
print("original data size is: %.2f MB" % original_size)
original data size is: 0.88 MB
# Plot the first 10 samples as images
fig, axes = plt.subplots(1, 10, figsize=(12, 4))
for i in range(10):
    axes[i].imshow(X[i].reshape(8, 8), cmap='gray')
    axes[i].set_title(f"Label: {y[i]}")
    axes[i].axis('off')
plt.tight_layout()
plt.show()

[figure: the first 10 digit images with their labels]

Reconstruction Result

Define a series of functions to perform the reconstruction and compute the reconstruction metric.

# Function to calculate reconstruction error
def reconstruction_error(original, reconstructed):
    return mean_squared_error(original, reconstructed)

# Function to perform PCA and reconstruct data with n_components
def perform_pca(n_components):
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)
    X_reconstructed = pca.inverse_transform(X_pca)
    return X_reconstructed, pca

# Function to perform PCA and visualize the result. Input is the number of principal components
def analyze_pca(n_components):
    X_reconstructed, pca = perform_pca(n_components)
    reconstruction_error_val = reconstruction_error(X, X_reconstructed)
    print(f"Number of Components: {n_components}, Reconstruction Error: {reconstruction_error_val}")

    # Size of the PCA model plus the full reconstructed array. Note: because this
    # includes X_reconstructed (same shape as X), it can never be smaller than the
    # original; the truly compressed representation to store would be X_pca.
    compressed_size = (pca.components_.nbytes + pca.mean_.nbytes + X_reconstructed.nbytes) / (1024 * 1024)  # in megabytes
    print(f"Size of Compressed File: {compressed_size} MB")

    # Difference in size (negative here, for the reason above)
    size_difference = original_size - compressed_size
    print(f"Difference in Size: {size_difference} MB")

    # Plot original and reconstructed images for each digit
    fig, axes = plt.subplots(2, 10, figsize=(10, 2))
    for digit in range(10):
        digit_indices = np.where(y == digit)[0]  # Indices of samples with the current digit
        original_matrix = X[digit_indices[0]].reshape(8, 8)  # Take the first sample for each digit
        reconstructed_matrix = np.round(X_reconstructed[digit_indices[0]].reshape(8, 8), 1)  # Round to one decimal place
        axes[0, digit].imshow(original_matrix, cmap='gray')
        axes[0, digit].axis('off')
        axes[1, digit].imshow(reconstructed_matrix, cmap='gray')
        axes[1, digit].axis('off')

    plt.suptitle(f'Reconstruction with {n_components} Components')
    plt.show()

    # Print the matrices of the last digit plotted (digit 9)
    print("Original matrix (digit 9):")
    print(original_matrix)

    print("\nReconstructed matrix (digit 9):")
    print(reconstructed_matrix)
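Under the hood, with the default whiten=False, inverse_transform is just a linear map back to the original space: X_pca @ components_ + mean_. A minimal check of this equivalence (a sketch, reusing the perform_pca helper above):

# Sanity check: inverse_transform is X_pca @ components_ + mean_ (default whiten=False)
X_rec, pca = perform_pca(5)
X_pca = pca.transform(X)
print(np.allclose(X_rec, X_pca @ pca.components_ + pca.mean_))  # expected: True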

Analyze the result when we use one principal component.

analyze_pca(1)
Number of Components: 1, Reconstruction Error: 15.977678462238496
Size of Compressed File: 0.87841796875 MB
Difference in Size: -0.0009765625 MB

[figure: original vs. reconstructed digits, 1 component]

Original matrix (digit 9):
[[ 0.  0. 11. 12.  0.  0.  0.  0.]
 [ 0.  2. 16. 16. 16. 13.  0.  0.]
 [ 0.  3. 16. 12. 10. 14.  0.  0.]
 [ 0.  1. 16.  1. 12. 15.  0.  0.]
 [ 0.  0. 13. 16.  9. 15.  2.  0.]
 [ 0.  0.  0.  3.  0.  9. 11.  0.]
 [ 0.  0.  0.  0.  9. 15.  4.  0.]
 [ 0.  0.  9. 12. 13.  3.  0.  0.]]

Reconstructed matrix (digit 9):
[[-0.   0.4  6.4 12.6 12.   6.3  1.4  0.1]
 [ 0.   2.6 11.7 11.2 10.5  9.4  1.9  0.1]
 [ 0.   3.   9.4  5.8  8.   8.7  1.6  0. ]
 [ 0.   2.1  7.7  9.  11.1  7.8  2.   0. ]
 [ 0.   1.5  5.6  8.2  9.8  8.5  2.8  0. ]
 [ 0.   1.   5.2  5.9  6.5  8.2  3.7  0. ]
 [ 0.   0.8  7.8  9.   8.8  9.5  4.1  0.2]
 [ 0.   0.4  6.8 12.9 11.9  7.3  2.3  0.4]]

Analyze the result when we choose 5 principal components.

analyze_pca(5)
Number of Components: 5, Reconstruction Error: 8.542447616249266
Size of Compressed File: 0.88037109375 MB
Difference in Size: -0.0029296875 MB

[figure: original vs. reconstructed digits, 5 components]

Original matrix (digit 9):
[[ 0.  0. 11. 12.  0.  0.  0.  0.]
 [ 0.  2. 16. 16. 16. 13.  0.  0.]
 [ 0.  3. 16. 12. 10. 14.  0.  0.]
 [ 0.  1. 16.  1. 12. 15.  0.  0.]
 [ 0.  0. 13. 16.  9. 15.  2.  0.]
 [ 0.  0.  0.  3.  0.  9. 11.  0.]
 [ 0.  0.  0.  0.  9. 15.  4.  0.]
 [ 0.  0.  9. 12. 13.  3.  0.  0.]]

Reconstructed matrix (digit 9):
[[ 0.   0.2  5.2 11.1 12.1  7.   1.6  0.1]
 [ 0.   2.1 11.2 10.7  9.7  9.6  2.3  0.2]
 [ 0.   3.1 11.2  6.2  6.   9.2  2.5  0.1]
 [ 0.   3.1 10.3  9.   9.6  9.6  2.9  0. ]
 [ 0.   2.2  6.   5.3  8.  11.6  3.9  0. ]
 [ 0.   1.2  4.2  1.9  4.9 11.7  5.1  0. ]
 [ 0.   0.6  6.7  6.2  8.8 12.1  4.4  0.2]
 [ 0.   0.2  5.4 12.1 13.4  8.2  1.8  0.3]]

The more principal components we use, the better the reconstruction result. Next we will manually compute the PCA step by step.
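To pick the number of components more systematically, we can look at the cumulative explained variance ratio. A minimal sketch (the 0.95 threshold here is an arbitrary illustrative choice):

# Fit a full PCA and accumulate the variance explained per component
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components that retains at least 95% of the variance
n_95 = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed for 95% variance:", n_95)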

Manual PCA

A manual, step-by-step way to perform the PCA analysis.

Input Data

Show the input data.

# Then use a step-by-step way to go through the PCA calculation;
# Take the first data point for analysis
first_data = X[0]
print("Raw input data: \n", X[0])
print("Raw data shape: ", first_data.shape)

# Reshape the data point into a 2D array (image)
input_matrix = first_data.reshape(8, 8)

print("Input matrix: ")
for row in input_matrix:
    print(" ".join(f"{val:4.0f}" for val in row))

# Print the original matrix (image)
plt.imshow(input_matrix, cmap='gray')
plt.title("Input matrix (Image)")
plt.axis('off')
plt.show()
Raw input data: 
 [ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.  0.  0.  3.
 15.  2.  0. 11.  8.  0.  0.  4. 12.  0.  0.  8.  8.  0.  0.  5.  8.  0.
  0.  9.  8.  0.  0.  4. 11.  0.  1. 12.  7.  0.  0.  2. 14.  5. 10. 12.
  0.  0.  0.  0.  6. 13. 10.  0.  0.  0.]
Raw data shape:  (64,)
Input matrix: 
   0    0    5   13    9    1    0    0
   0    0   13   15   10   15    5    0
   0    3   15    2    0   11    8    0
   0    4   12    0    0    8    8    0
   0    5    8    0    0    9    8    0
   0    4   11    0    1   12    7    0
   0    2   14    5   10   12    0    0
   0    0    6   13   10    0    0    0

[figure: the 8x8 input matrix rendered as an image]

Centering the Data

Calculating the mean of each feature tells us the average value of each column, which we need both to center the data and to compute the covariance matrix in the subsequent steps. Centering the data is a crucial preprocessing step in PCA: it removes bias, enhances the interpretability of the results, and ensures numerical stability in the computations.

# Step 1: Calculate the mean of each feature (column)
mean_vec = np.mean(input_matrix, axis=0)
print(mean_vec)
[ 0.    2.25 10.5   6.    5.    8.5   4.5   0.  ]

# Step 2: Subtract the mean from each feature
centered_matrix = input_matrix - mean_vec
print(centered_matrix)
[[ 0.   -2.25 -5.5   7.    4.   -7.5  -4.5   0.  ]
 [ 0.   -2.25  2.5   9.    5.    6.5   0.5   0.  ]
 [ 0.    0.75  4.5  -4.   -5.    2.5   3.5   0.  ]
 [ 0.    1.75  1.5  -6.   -5.   -0.5   3.5   0.  ]
 [ 0.    2.75 -2.5  -6.   -5.    0.5   3.5   0.  ]
 [ 0.    1.75  0.5  -6.   -4.    3.5   2.5   0.  ]
 [ 0.   -0.25  3.5  -1.    5.    3.5  -4.5   0.  ]
 [ 0.   -2.25 -4.5   7.    5.   -8.5  -4.5   0.  ]]
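As a quick sanity check (a one-line sketch), every column of the centered matrix should now have zero mean:

# Each feature (column) of the centered matrix should average to zero
print(np.allclose(centered_matrix.mean(axis=0), 0))  # expected: True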

Covariance Calculation

Calculate the covariance matrix of the centered data.

# Step 3: Calculate covariance using np.dot. Bessel's correction: divide by (n - 1).
# https://www.uio.no/studier/emner/matnat/math/MAT4010/data/forelesningsnotater/bessel-s-correction---wikipedia.pdf
cov_matrix = np.dot(centered_matrix.T, centered_matrix) / (centered_matrix.shape[0] - 1)

# Or calculate covariance using np.cov
# cov_matrix = np.cov(centered_matrix, rowvar=False)

print(cov_matrix)
[[  0.           0.           0.           0.           0.
    0.           0.           0.        ]
 [  0.           4.21428571   2.28571429 -13.14285714  -9.42857143
    4.14285714   6.14285714   0.        ]
 [  0.           2.28571429  14.          -9.42857143  -4.85714286
   17.           6.28571429   0.        ]
 [  0.         -13.14285714  -9.42857143  43.42857143  29.57142857
  -12.57142857 -17.85714286   0.        ]
 [  0.          -9.42857143  -4.85714286  29.57142857  26.
   -7.         -17.57142857   0.        ]
 [  0.           4.14285714  17.         -12.57142857  -7.
   28.85714286  11.           0.        ]
 [  0.           6.14285714   6.28571429 -17.85714286 -17.57142857
   11.          14.85714286   0.        ]
 [  0.           0.           0.           0.           0.
    0.           0.           0.        ]]
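We can verify the manual result against NumPy's built-in routine (a sketch; np.cov with rowvar=False treats rows as observations and applies the same Bessel correction):

# np.cov uses the same (n - 1) denominator, so the two results should match
print(np.allclose(cov_matrix, np.cov(centered_matrix, rowvar=False)))  # expected: True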

Eigendecomposition

# Step 4: Compute the eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

print(eigenvalues)
print(eigenvectors)
[8.92158455e+01 3.14545089e+01 7.61850164e+00 2.85144338e+00
 2.01453633e-01 1.53898738e-02 0.00000000e+00 0.00000000e+00]
[[ 0.          0.          0.          0.          0.          0.
   1.          0.        ]
 [-0.20365153  0.09344175  0.07506402 -0.23052329 -0.41043409 -0.85003703
   0.          0.        ]
 [-0.22550077 -0.48188982  0.20855091  0.79993174 -0.1168451  -0.14104805
   0.          0.        ]
 [ 0.65318552 -0.28875672 -0.59464342  0.12374602  0.11324705 -0.32898247
   0.          0.        ]
 [ 0.48997693 -0.31860576  0.39448425 -0.20610464 -0.63307453  0.24399318
   0.          0.        ]
 [-0.33563583 -0.75773097 -0.0607778  -0.49775699  0.24837474  0.00681139
   0.          0.        ]
 [-0.35818338 -0.00212894 -0.66178497  0.03760326 -0.58531429  0.29955628
   0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.
   0.          1.        ]]
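Since a covariance matrix is symmetric, np.linalg.eigh is usually preferable to np.linalg.eig: it guarantees real eigenvalues and returns them sorted in ascending order. A minimal alternative sketch:

# eigh is specialized for symmetric matrices: real eigenvalues, ascending order
eigvals_h, eigvecs_h = np.linalg.eigh(cov_matrix)
print(eigvals_h[::-1])  # reversed to descending order for comparison with eig's output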

Select the principal component corresponding to the maximum eigenvalue

# Step 5: Choose the principal component corresponding to the maximum eigenvalue
max_eigenvalue_index = np.argmax(eigenvalues)
principal_component = eigenvectors[:, max_eigenvalue_index]
print(principal_component)
[ 0.         -0.20365153 -0.22550077  0.65318552  0.48997693 -0.33563583
 -0.35818338  0.        ]

# Step 6: Project the data onto the principal component
reduced_data = np.dot(centered_matrix, principal_component)

# Print the reduced data
print("Reduced data (1 principal component):\n", reduced_data)
Reduced data (1 principal component):
 [12.35977044  5.86229378 -8.32285024 -8.14946302 -7.7867473  -8.41834525
  1.49545914 12.95988243]

So far, the data has been compressed from an 8x8 matrix (64 values) to an 8x1 score vector. Together with the 8-element principal component and the 8-element mean vector, only 24 values are needed to approximate the original 64.
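We can cross-check the manual projection against scikit-learn on the same 8x8 matrix (a sketch; the sign of a principal component is arbitrary, so we compare up to sign):

# scikit-learn PCA on the same matrix, treating the 8 rows as samples
pca_check = PCA(n_components=1)
scores = pca_check.fit_transform(input_matrix).ravel()

# Principal components are only defined up to sign
print(np.allclose(scores, reduced_data) or np.allclose(scores, -reduced_data))  # expected: True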

Reconstruction

Now reconstruct the data from the reduced representation.

# Step 7: Reconstruct the data
reconstructed_data = np.dot(reduced_data.reshape(-1, 1), principal_component.reshape(1, -1))

# Step 8: Add back the mean to the reconstructed data
reconstructed_data += mean_vec

# Step 9: Visualize the original and reconstructed data
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Original data
axes[0].imshow(input_matrix, cmap='gray')
axes[0].set_title('Original Data')
axes[0].axis('off')

# Reconstructed data
axes[1].imshow(reconstructed_data.real, cmap='gray')
axes[1].set_title('Reconstructed Data')
axes[1].axis('off')

plt.show()

[figure: original vs. reconstructed 8x8 image, 1 component]
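Finally, we can quantify the manual one-component reconstruction with the same MSE metric used earlier (a minimal sketch):

# MSE between the original 8x8 image and its one-component reconstruction
# (.real mirrors the defensive cast used in the plotting code above)
manual_error = mean_squared_error(input_matrix, reconstructed_data.real)
print(f"Manual 1-component reconstruction error: {manual_error:.4f}")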
