Detailed Derivation of Principal Component Analysis (PCA)


Based on the book Deep Learning (MIT Press, 2017).

1 Overview

Modern datasets, such as web indexes, high-resolution images, meteorological records, and experimental measurements, often contain high-dimensional features that can be unclear, redundant, or even deceptive. It is challenging to visualize and interpret the relationships between variables, and neural network models trained on such high-dimensional data tend to overfit (the curse of dimensionality). Principal Component Analysis (PCA) is a simple but powerful unsupervised machine learning technique for dimensionality reduction. It aims to extract a smaller set of variables from a large variable set while preserving as much of the original information and features as possible (lossy compression). PCA helps identify the most significant and meaningful features in a dataset, making the data easier to visualize. Its applications include statistical analysis, noise removal, and preprocessing data for machine learning algorithms.

  • What is a principal component?
    Principal components are new variables constructed as linear combinations of the original variables. These new variables are uncorrelated and contain most of the information from the original data.

2 Background Mathematical Knowledge

The following knowledge is required for the next section.

  • Orthogonal vectors and matrices:
    • Two vectors are orthogonal if they are perpendicular to each other, i.e. the dot product of the two vectors is zero.
    • An orthogonal matrix is a square matrix whose rows and columns are mutually orthogonal unit vectors: every pair of rows and every pair of columns has zero dot product, and every row and every column has a magnitude of one.
    • $A$ is an orthogonal matrix if $A^{T} = A^{-1}$, or equivalently $AA^{T}= A^{T}A = I$.
    • In robotics, a rotation matrix is typically a $3\times3$ orthogonal matrix; in spatial transformations it rotates a vector's direction but keeps the vector's magnitude.
  • Matrix and vector multiplication rules:
    • $(AB)^T=B^TA^T$: the transpose of the product of two matrices.
    • $\vec{a}^T\vec{b}=\vec{b}^T\vec{a}$: both sides are inner products of vectors, and the results are scalars; a scalar equals its own transpose.
    • $(A + B)C = AC + BC$: matrix multiplication is distributive.
    • $AB \not ={} BA$: matrix multiplication is not commutative.
    • $A(BC)=(AB)C$: matrix multiplication is associative.
  • Symmetric matrix:
    • If $A=A^T$, then $A$ is a symmetric matrix.
    • $X^TX$ is a symmetric matrix, since $(X^TX)^T=X^TX$.
  • Vector derivative rules ($B$ is a constant matrix):
    • $d(x^TB)/dx=B$
    • $d(x^Tx)/dx=2x$
    • $d(x^TBx)/dx=2Bx$ when $B$ is symmetric (in general it is $(B+B^T)x$)
  • Matrix trace rules:
    • $Tr(A)=Tr(A^T)$
    • $Tr(AB)=Tr(BA)$
    • $Tr(A)=\sum_i{\lambda_i}$, where the $\lambda_i$ are the eigenvalues of $A$.
    • The trace is invariant under cyclic shifts: $Tr(ABCD)=Tr(BCDA)=Tr(CDAB)=Tr(DABC)$
  • Vector and matrix norms:
    • A vector's $L^2$ norm, also known as the Euclidean norm: $||x||_2=\sqrt{\sum_i|x_i|^2}$.
    • It is also common to measure the size of a vector using the squared $L^2$ norm, which can be calculated as $x^Tx$.
    • The Frobenius norm is used to measure the size of a matrix: $||A||_F=\sqrt{\sum_{i,j}A^2_{i,j}}$
    • The Frobenius norm is the square root of the sum of the absolute squares of all of the matrix's elements.
    • The Frobenius norm is the matrix analogue of the Euclidean norm.
  • Eigendecomposition and eigenvalues:
    • An eigenvector of a square matrix $A$ is a non-zero vector $v$ such that multiplication by $A$ alters only the scale of $v$: $Av=\lambda v$. Here $\lambda$ is the eigenvalue and $v$ is the eigenvector.
    • Suppose matrix $A$ has $n$ linearly independent eigenvectors $v^{(i)}$. We can concatenate all of the eigenvectors to form a matrix $V=[v^{(1)},\ldots,v^{(n)}]$ and form a vector by concatenating all eigenvalues $\lambda=[\lambda_1,\ldots,\lambda_n]^T$; then the eigendecomposition of $A$ is $A=V\,diag(\lambda)V^{-1}$.
    • Every real symmetric matrix can be decomposed into $A=Q\Lambda Q^T$, where $Q$ is an orthogonal matrix composed of eigenvectors of $A$, and $\Lambda$ (capital lambda) is a diagonal matrix of the corresponding eigenvalues (a small numerical check of this fact appears after this list).
  • Lagrange multipliers:
    • The method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints.
    • General form: $\mathcal{L}(x,\lambda)=f(x)+\lambda\cdot g(x)$, where $\lambda$ is called the Lagrange multiplier.
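
As a quick sanity check of the eigendecomposition fact above, here is a minimal NumPy sketch (variable names and the random matrix are my own illustration, not from the book). It builds a random real symmetric matrix, decomposes it with `np.linalg.eigh`, and confirms that $Q$ is orthogonal, that $A=Q\Lambda Q^T$, and that the trace equals the sum of the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B + B.T                      # a random real symmetric matrix

eigvals, Q = np.linalg.eigh(A)   # eigh is specialized for symmetric matrices
Lam = np.diag(eigvals)

print(np.allclose(Q @ Q.T, np.eye(4)))          # Q is orthogonal: Q Q^T = I
print(np.allclose(A, Q @ Lam @ Q.T))            # A = Q Lambda Q^T
print(np.allclose(np.trace(A), eigvals.sum()))  # Tr(A) = sum of eigenvalues
```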

3 Detailed PCA Derivation

Requirement description

We have input data consisting of $m$ points ${x^{(1)},...,x^{(m)}}$ in $\mathbb{R}^{n}$. Each point $x^{(i)}$ is a column vector with $n$-dimensional features.

We need to apply lossy compression to the input data, encoding these points to represent a lower-dimensional version of them. In other words, we want to find a code vector $c^{(i)}\in \mathbb{R}^{l}$, $(l<n)$, to represent each input point $x^{(i)}$. Our goal is to find the encoding function $f(x)=c$ producing the code vector for the input, and the corresponding reconstruction (decoding) function $x\approx g(f(x))$, computing the original input based on the code vector $c$.

The decoded $g(f(x))$ is a new set of points (variables), so it is only "$\approx$" equal to the original $x$. Storing $c^{(i)}$ and the decoding function is cheaper in memory than storing $x^{(i)}$, since $c^{(i)}$ has lower dimensionality.

Decoding matrix

We choose a matrix $D$ as the decoding matrix, mapping the code vector $c^{(i)}$ back to $\mathbb{R}^{n}$, so $g(c)=Dc$, where $D\in \mathbb{R}^{n\times l}$. To keep the encoding problem easy, PCA constrains the columns of $D$ to be orthogonal to each other and to have unit norm.

Measure reconstruction performance

Before going on, we need to figure out how to generate the optimal code point $c^*$. We can measure the distance between the input point $x$ and its reconstruction $g(c^*)$ using the $L^2$ norm (or Euclidean norm): $c^*=\arg\min_c||x-g(c)||_2$. Since the $L^2$ norm is non-negative and the squaring operation is monotonically increasing, we can switch to the squared $L^2$ norm: $$c^*=\argmin_c||x-g(c)||_2^2$$

The squared $L^2$ norm of a vector is the sum of the squares of its components, which equals the dot product of the vector with itself, e.g. $||x||_2=\sqrt{\sum_i|x_i|^2}=\sqrt{x^Tx}$, so the squared $L^2$ norm can be written in the form: $$||x-g(c)||_2^2 = (x-g(c))^T(x-g(c))$$ By the distributive property: $$=(x^T-g(c)^T)(x-g(c))=x^Tx-x^Tg(c)-g(c)^Tx+g(c)^Tg(c)$$ Since $x^Tg(c)$ and $g(c)^Tx$ are scalars, and a scalar is equal to its transpose, $(g(c)^Tx)^T=x^Tg(c)$, so: $$=x^Tx-2x^Tg(c)+g(c)^Tg(c)$$

To find the $c$ minimizing the above function, the first term can be omitted since it does not depend on $c$, so: $$c^*=\argmin_c-2x^Tg(c)+g(c)^Tg(c)$$

Then substitute the definition $g(c)=Dc$: $$=\argmin_c-2x^TDc+c^TD^TDc$$ By the orthogonality and unit norm constraints on $D$ ($D^TD=I_l$): $$c^*=\argmin_c-2x^TDc+c^TI_lc$$ $$= \argmin_c-2x^TDc+c^Tc$$

Objective function

Now the objective function is $-2x^TDc+c^Tc$, and we need to find the $c^*$ that minimizes it. Using vector calculus, set its derivative equal to 0: $$\nabla_c(-2x^TDc+c^Tc)=0$$ According to the vector derivative rules: $$-2D^Tx+2c=0 \Rightarrow c=D^Tx$$
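
The closed form $c=D^Tx$ is easy to sanity-check numerically. The sketch below is my own illustration (the sizes `n`, `l` and the QR-based construction of an orthonormal $D$ are assumptions, not part of the derivation): it compares $D^Tx$ against the generic least-squares minimizer of $||x-Dc||_2^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, l = 6, 2
# Build a D with orthonormal columns via QR decomposition (illustration only).
D, _ = np.linalg.qr(rng.normal(size=(n, l)))

x = rng.normal(size=n)
c_closed = D.T @ x                               # c = D^T x from the derivation
c_lstsq, *_ = np.linalg.lstsq(D, x, rcond=None)  # generic minimizer of ||x - Dc||_2

print(np.allclose(c_closed, c_lstsq))            # True: both give the same code
```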

Find the encoding matrix $D$

So the encoder function is $f(x)=D^Tx$. Therefore we can define the PCA reconstruction operation as $r(x)=g(f(x))=D(D^Tx)=DD^Tx$.

The encoding matrix $D$ is therefore also used by the reconstruction process. We need to find the optimal $D$ that minimizes the reconstruction error, i.e. the distance, over all dimensions and all inputs, between the input points and their reconstructions. Here we use the Frobenius norm (matrix norm) to define the objective function:

$$D^*=\argmin_D\sqrt{\sum_{i,j}(x_j^{(i)}-r(x^{(i)})_j)^2},\quad D^TD=I_l$$

Start by considering the case where $l=1$ (this gives the first principal component). Then $D$ is a single vector $d$, and using the squared $L^2$ norm form:

$$d^*=\argmin_d{\sum_{i}||x^{(i)}-r(x^{(i)})||_2^2}, \quad ||d||_2=1$$

$$d^*= \argmin_d{\sum_{i}||x^{(i)}-dd^Tx^{(i)}||_2^2}, \quad ||d||_2=1$$

$d^Tx^{(i)}$ is a scalar:

$$= \argmin_d{\sum_{i}||x^{(i)}-d^Tx^{(i)}d||_2^2}, \quad ||d||_2=1$$

A scalar is its own transpose: $$d^*= \argmin_d{\sum_{i}||x^{(i)}-x^{(i)T}dd||_2^2}, \quad ||d||_2=1$$

Matrix form representation

Let $X\in \mathbb{R}^{m\times n}$ be defined by stacking all of the row vectors describing the points ${x^{(1)^T}, x^{(2)^T}, \ldots, x^{(i)^T}, \ldots, x^{(m)^T}}$, such that $X_{i,:}=x^{(i)^T}$. $$ X = \begin{bmatrix} x^{(1)^T}\\ x^{(2)^T}\\ \vdots\\ x^{(m)^T} \end{bmatrix} \Rightarrow Xd = \begin{bmatrix} x^{(1)^T}d\\ x^{(2)^T}d\\ \vdots\\ x^{(m)^T}d \end{bmatrix} $$ $$ \Rightarrow Xdd^T = \begin{bmatrix} x^{(1)^T}dd^T\\ x^{(2)^T}dd^T\\ \vdots\\ x^{(m)^T}dd^T \end{bmatrix} $$

$$ \Rightarrow X-Xdd^T = \begin{bmatrix} x^{(1)^T}-x^{(1)^T}dd^T\\ x^{(2)^T}-x^{(2)^T}dd^T\\ \vdots\\ x^{(m)^T}-x^{(m)^T}dd^T \end{bmatrix} $$

Taking the transpose of one row of this matrix: $$(x^{(i)^T}-x^{(i)^T}dd^T)^T=x^{(i)}-dd^Tx^{(i)}$$ Since $d^Tx^{(i)}$ is a scalar: $$=x^{(i)}-d^Tx^{(i)}d=x^{(i)}-x^{(i)^T}dd$$

So the $L^2$ norm of the $i$-th row of $X-Xdd^T$ is the same as in the original form; therefore we can rewrite the problem in matrix form and omit the sum sign: $$d^*=\argmin_{d}||X-Xdd^T||_F^2, \quad d^Td=1 $$
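
This equivalence between the per-point sum and the Frobenius norm can be checked numerically. Here is a minimal sketch (random data and shapes are my own assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 5
X = rng.normal(size=(m, n))           # rows are the points x^(i)T
d = rng.normal(size=n)
d /= np.linalg.norm(d)                # enforce the unit-norm constraint d^T d = 1

# Sum of per-point squared reconstruction errors, as in the original objective.
per_point = sum(np.sum((x - d @ x * d) ** 2) for x in X)
# Matrix form: squared Frobenius norm of X - X d d^T.
frobenius = np.linalg.norm(X - X @ np.outer(d, d), ord='fro') ** 2

print(np.allclose(per_point, frobenius))   # True: the two objectives coincide
```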

Simplify the Frobenius norm portion with matrix trace rules as follows:

$$\argmin_{d}||X-Xdd^T||_F^2$$

$$=\argmin_{d}Tr((X-Xdd^T)^T(X-Xdd^T))$$ Expanding the product gives four trace terms; the term $Tr(X^TX)$ does not depend on $d$, so it can be dropped: $$=\argmin_{d}-Tr(X^TXdd^T)-Tr(dd^TX^TX)+Tr(dd^TX^TXdd^T)$$ Using $Tr(AB)=Tr(BA)$ and the cyclic property on the last term: $$=\argmin_{d}-2Tr(X^TXdd^T)+Tr(X^TXdd^Tdd^T)$$

Due to $d^Td=1$: $$=\argmin_{d}-2Tr(X^TXdd^T)+Tr(X^TXdd^T)$$ $$=\argmin_{d}-Tr(X^TXdd^T)$$ $$={\arg\max}_{d}Tr(X^TXdd^T)$$

Since the trace is cyclic permutation invariant, rewrite the equation as: $$d^*={\arg\max}_{d}Tr(d^TX^TXd), \quad d^Td=1$$

Since $d^TX^TXd$ is a real number (a $1\times1$ matrix), the trace can be omitted: $$d^*={\arg\max}_{d}d^TX^TXd,\quad d^Td=1$$
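
The chain of trace manipulations amounts to the identity $||X-Xdd^T||_F^2 = Tr(X^TX) - d^TX^TXd$ for a unit vector $d$, which is why minimizing the reconstruction error is the same as maximizing $d^TX^TXd$. A small numerical check (random data assumed, my own sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
d = rng.normal(size=4)
d /= np.linalg.norm(d)                          # unit-norm constraint d^T d = 1

lhs = np.linalg.norm(X - X @ np.outer(d, d), ord='fro') ** 2
rhs = np.trace(X.T @ X) - d @ (X.T @ X) @ d     # Tr(X^T X) - d^T X^T X d

print(np.allclose(lhs, rhs))                    # True
```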

Find the optimal $d$

Now the problem is to find the optimal $d$ to maximize $d^TX^TXd$ with constraint $d^Td=1$.

Represent the problem in Lagrange multiplier form with respect to $d$: $$\mathcal{L}(d,\lambda)=d^TX^TXd+\lambda(d^Td-1)$$

Differentiate with respect to $d$ (using the vector derivative rules; note that $X^TX$ is symmetric): $$\nabla_d\mathcal{L}(d,\lambda)=2X^TXd+2\lambda d$$

Setting the derivative equal to 0 gives the condition for the optimal $d$: $$2X^TXd+2\lambda d=0$$ $$X^TXd=-\lambda d$$ $$X^TXd=\lambda' d,\quad(\lambda'=-\lambda)$$

This equation is the standard eigendecomposition form: $d$ is an eigenvector of $X^TX$ and $\lambda'$ is the corresponding eigenvalue.

With the above, let's revisit the original equation:

$$d^*={\arg\max}_{d}d^TX^TXd, \quad d^Td=1$$

$$={\arg\max}_{d}d^T\lambda' d$$

$$={\arg\max}_{d}\lambda'd^T d$$

$$={\arg\max}_{d}\lambda'$$

Now things are clear: the largest eigenvalue of $X^TX$ maximizes the result, so the optimal $d$ is given by the eigenvector of $X^TX$ corresponding to the largest eigenvalue.
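
This conclusion can also be verified numerically. The following sketch (my own, with an assumed random dataset) compares $d^TX^TXd$ for the leading eigenvector against many random unit vectors; none of the random vectors should beat the eigenvector.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
G = X.T @ X                                  # the matrix X^T X (symmetric)

eigvals, eigvecs = np.linalg.eigh(G)         # eigenvalues in ascending order
d_best = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
best_value = d_best @ G @ d_best             # equals the largest eigenvalue

# Compare against random unit vectors: none should exceed the eigenvector's value.
trials = rng.normal(size=(1000, 5))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
random_values = np.einsum('ij,jk,ik->i', trials, G, trials)

print(np.isclose(best_value, eigvals[-1]))          # True
print(np.all(random_values <= best_value + 1e-9))   # True
```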

This derivation is specific to the case of $l=1$ and recovers only the first principal component. When $l>1$, $D=[d_1, d_2, \ldots]$: the first principal component $d_1$ is the eigenvector of $X^TX$ corresponding to the largest eigenvalue, the second principal component $d_2$ is the eigenvector corresponding to the second largest eigenvalue, and so on.

4 Summary

We have a dataset consisting of $m$ points denoted as ${x^{(1)},...,x^{(m)}}$. Let $X\in \mathbb{R}^{m\times n}$ be defined by stacking all of these points as rows: ${x^{(1)^T}, x^{(2)^T}, \ldots, x^{(i)^T}, \ldots, x^{(m)^T}}$.

The Principal Component Analysis (PCA) encoding function is $f(x)=D^Tx$, and the reconstruction function is $x\approx g(c)=Dc$, where the columns of $D=[d_1, d_2, \ldots]$ are the eigenvectors of $X^TX$ corresponding to the eigenvalues in descending order.
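
Putting the derivation together, here is a minimal end-to-end sketch (my own code, not from the book; the function names are mine, and centering the data first is a common convention rather than a requirement of the derivation above): build $D$ from the top-$l$ eigenvectors of $X^TX$, encode, and reconstruct.

```python
import numpy as np

def pca_fit(X, l):
    """Return the n x l matrix D whose columns are the top-l eigenvectors of X^T X."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # indices sorted into descending order
    return eigvecs[:, order[:l]]                 # D in R^{n x l}, with D^T D = I_l

def pca_encode(D, x):
    return D.T @ x        # f(x) = D^T x, the code vector c in R^l

def pca_decode(D, c):
    return D @ c          # g(c) = D c, the reconstruction in R^n

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
X = X - X.mean(axis=0)    # centering (conventional preprocessing)

D = pca_fit(X, l=2)
x = X[0]
c = pca_encode(D, x)
x_hat = pca_decode(D, c)
print(c.shape, np.linalg.norm(x - x_hat))   # (2,) and the reconstruction error
```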

