Chapter 7. Dimension Reduction Algorithm
Recommended post: 【Algorithm】 Algorithm Index
1. Overview
2. Type 1. Principal Component Analysis (PCA)
3. Type 2. SNE, symmetric-SNE, tSNE
4. Type 3. UMAP
5. Type 4. SVD
6. Type 5. ICA
7. Type 6. PPCA
8. Others
b. UMAP
1. Overview
⑴ Definition of Dimension Reduction
① When given high-dimensional data X(1), ···, X(n) ∈ ℝp,
② Find an appropriate subspace V ⊂ ℝp (where dim(V) = k),
③ Calculate the projection of X onto V, denoted as X̃.
④ It is an unsupervised algorithm.
⑵ Characteristics
① Information preservation
② Easier model training: In machine learning, it requires fewer parameters, so it is faster with lower computation.
③ Easier result interpretation: Visualization of high-dimensional data.
④ Noise reduction
⑤ It can reveal the true dimensionality of the given data.
⑶ Methods
① Feature selection
○ Selecting a few important variables from the existing ones and discarding the rest.
○ Choose one variable with a high correlation or high variance inflation factor (VIF).
② Feature extraction
○ Combine all variables to derive new variables that best represent the data.
○ Techniques that combine existing variables to create new ones.
2. Type 1. Principal Component Analysis (PCA)
⑴ Overview
① Definition: A dimension reduction technique that transforms correlated variables into orthogonal variables.
② In other words, it is a dimension reduction method that leaves only a few meaningful coordinate systems.
③ The orthogonal variables are called principal components, which can be used when they contain as much information as the original data.
④ It originates from the awareness that the given dimensionality and the true dimensionality of the data may differ.
⑤ Introduced by Pearson in 1901.
⑥ PCA can be defined not only in subspace (passing through the origin) but also in affine space.
⑦ ** Normalize by the mean and then perform PCA on the subspace.
⑧ PCA does not require multivariate normality of the given data.
⑨ It can be used when the data is linearly distributed.
⑩ Whitening: It is better to standardize features along the principal component directions when performing standardization.
⑵ Premises
① Represent the basis of the space where the data belongs as e1, ···, ep (the set of bases does not need to be orthogonal).
○ e1, ···, ep can be considered as bases representing the given dimensions (no need for orthogonality).
② Goal: To find the orthogonal basis Z1, ···, Zk of such a subspace (k < n) (typical unsupervised problem setting).
○ Z1, ···, Zk can be considered as bases representing the true dimensionality of the data.
③ Random vector
○ X represents a specific data point: x1, ···, xn are each component.
④ Mean vector: Also called population mean.
⑤ Variance-covariance matrix
⑥ Unbiased sample variance-covariance matrix: With the design matrix X, and center matrix Xc,
○ Used when the population variance-covariance matrix is unknown.
○ Since X ∈ ℝn×p, XTX ∈ ℝp×n × ℝn×p = ℝp×p.
⑦ Expression of the goal orthogonal basis
○ Zk: = k-th PC ∈ ℝp×1.
⑧ Characteristics of the goal orthogonal basis: Easily derived from linear algebra concepts.
○ Var, Cov, etc., are specific values related to variance, belonging to ℝ1x1.
○ Do not confuse with the variance-covariance matrix.
⑶ Definition of the PCA problem
⑷ Solution to the PCA problem
① Method 1. Eigen-decomposition of ∑.
○ Step 1: Data Construction and Centering
○ X ∈ ℝn×d: A data matrix with n samples and d features.
○ Center the data matrix X: Xc = X - μ with μ ∈ ℝd is the vector of column-wise means)
○ Step 2: Covariance Matrix Calculation
○ Compute the sample covariance matrix C ∈ ℝd×d from the centered data:
○ Since the given sample represents the entire population, the population mean is known, so the sample covariance is calculated by dividing by n instead of n−1.
○ Step 3: Eigen-decomposition
○ Perform eigen-decomposition of the covariance matrix C to compute eigenvalues and eigenvectors: C · pi = λi · pi
○ Here, λi represents the eigenvalue and pi ∈ Rd is the eigenvector.
○ Note: The symmetric matrix C can always be orthogonally diagonalized (by the spectral theorem).
○ Step 4: Data Transformation
○ Define P = [p1, p2, …, pd] ∈ Rd×d, a matrix where the eigenvectors are stored as columns.
○ Transform the data into a new coordinate system: Y = XcP where Y ∈ Rn×d is the transformed data matrix.
○ Step 5: Diagonalization of the Covariance Matrix
○ The covariance matrix of the transformed data Y becomes diagonal:
○ Step 6: Principal Component Selection and Dimensionality Reduction
○ Select k principal components (eigenvectors) corresponding to the largest eigenvalues λi and reduce the dimensionality: Yk = XcPk with Pk ∈ ℝd×k, Yk ∈ ℝn×k
○ Alternatively, instead of fixing the number of principal components, eigenvalues smaller than a specific threshold η can be excluded.
○ Conclusion 1. The d-th principal component PCd is the eigenvector corresponding to the d-th largest eigenvalue.
○ Conclusion 2. The meaning of eigenvectors with the same eigenvalue: The magnitude of the corresponding principal component is the same.
② Method 2. Singular value decomposition (SVD) of the center matrix Xc: The standard method currently used in PCA algorithms.
○ Step 1: Center the data to obtain the centered data Xc.
○ Step 2: Define C = (1/n) XcTXc = ∑ (covariance matrix) ○ Step 3: Since C is symmetric, it can be orthogonally diagonalized as C = VDVT (where V is the same as the V in Xc = U∑VT)
○ Step 4. The 1st, 2nd, ··· th column vectors of V are the eigenvectors corresponding to the 1st, 2nd, ··· th largest eigenvalues and are the principal components.
③ Method 3. Lagrange multiplier under equality constraint.
⑸ Determining the number of axes
① Method 1. Scree plot for the fraction of total variance.
○ Definition of the fraction of total variance.
○ Graphical representation.
Figure 1. Example of axis setting.
Figure 2. Scree plot: Shows the amount of variance explained by each principal component.
② Method 2. Scree plot for eigenvalues.
③ Method 3. Use in-sample 10-fold cross-validation to determine the number of axes p that minimizes MSPE.
Figure 3. Determination of the number of axes.
④ Since information about the remaining axes is lost, PCA functions as a lossy compression algorithm.
⑹ Python code example
import numpy as np
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from ipywidgets import interact
%matplotlib inline
from sklearn.datasets import fetch_openml
def normalize(X):
mu = np.mean(X, axis=0)
std = np.std(X, axis=0)
std_filled = std.copy()
std_filled[std==0] = 1.
Xbar = ((X-mu)/std_filled)
return Xbar, mu, std
def eig(S):
eigvals, eigvecs = np.linalg.eig(S)
k = np.argsort(eigvals)[::-1]
return eigvals[k], eigvecs[:,k]
def projection_matrix(B):
return (B @ np.linalg.inv(B.T @ B) @ B.T)
def PCA(X, num_components=2):
# reconstruction
S = 1.0/len(X) * np.dot(X.T, X)
eig_vals, eig_vecs = eig(S)
eig_vals, eig_vecs = eig_vals[:num_components], eig_vecs[:, :num_components]
B = np.real(eig_vecs)
reconst = (projection_matrix(B) @ X.T)
return reconst.T
def PC_coords(X, num_components=2):
# low-dimensional projection
S = 1.0/len(X) * np.dot(X.T, X)
eig_vals, eig_vecs = eig(S)
eig_vals, eig_vecs = eig_vals[:num_components], eig_vecs[:, :num_components]
B = np.real(eig_vecs)
Xpca = X@B
return Xpca
images, labels = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
labels=np.stack(labels).astype(None)
X = (images.reshape(-1, 28 * 28)) / 255.
subset_X = X[labels == 3]
Xbar, mu, std = normalize(subset_X)
num_components=2
X = Xbar
S = 1.0/len(X) * np.dot(X.T, X)
eig_vals, eig_vecs = eig(S)
eig_vals, eig_vecs = eig_vals[:num_components], eig_vecs[:, :num_components]
print(eig_vecs)
num_component = 2
reconst = PCA(Xbar, num_component)
reconstructions = reconst * std + mu # "unnormalize" the reconstructed image
⑺ Development of PCA
① Kernel PCA: Uses kernels. Can either extend the given data to high dimensions or reduce it to lower dimensions through projection.
② MDS (Multidimensional Scaling)
③ SNE, t-SNE
④ UMAP
3. Type 2. SNE, symmetric-SNE, tSNE
⑴ A dimension reduction algorithm that attempts to preserve local neighborhoods in the data.
⑵ Nonlinear and non-deterministic.
⑶ Requires knowledge of cost function, updating algorithm, etc., related to deep learning.
4. Type 3. UMAP
5. Type 4. Singular Value Decomposition (SVD)
⑴ A method that extracts singular values from an M × N dimensional matrix and reduces the dimensionality of the given data.
⑵ Formula
① Summary: If A is an m × n matrix with rank r, then there exist ℝm×m matrix U and ℝn×n matrix V that satisfy the following conditions:
○ U and V are orthonormal matrices.
○ Since changing the sign of each column of U and the corresponding column of V still satisfies the above equation, the SVD is not uniquely determined.
○ Matrix ∑: An m × n matrix with r nonzero singular values (> 0) on the diagonal.
② Whole process
③ Singular values are defined as the positive square roots of the eigenvalues of AAT or ATA.
○ AAT and ATA may differ in the number of zero eigenvalues, but they yield the same non-zero eigenvalues.
○ The number of non-zero singular values is equal to rank(A).
○ The number of zero singular values is max(dim(A)[0], dim(A)[1])−rank(A).
④ Python code
# np.linalg.svd computes the SVD of a matrix
A = np.array([[4,11,14],
[8,7,-2]])
U, s, Vt=np.linalg.svd(A)
print(U)
print(s) # s is the vector of nonzero singular values
print(Vt)
⑶ Application 1. ATA = V(∑T∑)VT : The singular values of A are the square roots of the eigenvalues of ATA. However, some eigenvalues are set to 0 according to the rank.
⑷ Application 2. reduced SVD: Note that for the reduced version, both U and V have orthonormal columns. This means that UTU = I and VTV = I. However, U and V are not square in this version, so they are not orthogonal matrices.
A = np.array([[4,11,14],
[8,7,-2]])
U, s, Vt=np.linalg.svd(A, full_matrices=False)
print(U)
print(s)
print(Vt)
⑸ Application 3. rank-k approximation
① For k < rank(A), the m × n matrix A can be split into m × k + k × n matrices to achieve compression.
② Given k, how to find the optimal rank-k approximation of A:
○ A = UΣVT = σ1u1v1T + ··· + σrurvrT, where σ1 ≥ ⋯ ≥ σr, remove the smaller singular values σr, σr−1, ….
○ By doing this, the Frobenius distance between A(k) and A is defined as follows:
⑹ Application 4. PCA(principal component analysis)
① Step 1. Center the data
Figure. 4. Data centering of PCA
② Step 2. Xc is the centered data and C = (1/n) XcTXc
③ Step 3. C is symmetric, so it is orthogonally diagonalizable as C = VDVT. V is the same as in Xc = U∑VT in SVD.
④ Step 4. The 1st, 2nd, ··· th column vectors of V are the eigenvectors corresponding to the 1st, 2nd, ··· th largest eigenvalues and are the principal components.
⑤ Step 5. If you want to get the new coordinates of the vectors in X based on just first 2 PCs, you simply project xi (a vector of length p) onto the first two PCs:
6. Type 5. Independent Component Analysis (ICA)
⑴ Unlike principal component analysis, it is a dimensionality reduction technique that statistically separates multivariate signals into independent subcomponents.
⑵ For example, a scenario where $N$ microphones distinguish $M$ conversations.
⑶ The distribution of independent components follows a non-Gaussian distribution.
⑷ It utilizes matrix factorization.
7. PPCA
⑴ Formulation
① Step 1. Assume the observed data x is generated by the latent variable z: The following transformation itself becomes a dimensionality reduction technique.
○ W: Transformation matrix of dimension D × d (for mapping from latent space to observed space)
○ b: Mean vector of dimension D
○ ϵ: Noise, following a normal distribution N(0,σ2I)
② Step 2. Assumption on the latent variable z
③ Step 3. Conditional distribution of x given z
④ Step 4. Marginal distribution p(x)
⑤ Step 5. MLE Estimation
○ Maximum Likelihood Estimation (MLE) is used to estimate b, W, and σ from the given data.
○ The MLE solution is obtained in closed form.
○ When σ → 0, the method converges to standard PCA.
⑵ Code
import numpy as np
from numpy import shape, isnan, nanmean, average, zeros, log, cov
from numpy import matmul as mm
from numpy.matlib import repmat
from numpy.random import normal
from numpy.linalg import inv, det, eig
from numpy import identity as eye
from numpy import trace as tr
from scipy.linalg import orth
def ppca(Y,d,dia):
"""
Implements probabilistic PCA for data with missing values,
using a factorizing distribution over hidden states and hidden observations.
Args:
Y: (N by D ) input numpy ndarray of data vectors
d: ( int ) dimension of latent space
dia: (boolean) if True: print objective each step
Returns:
ss: ( float ) isotropic variance outside subspace
C: (D by d ) C*C' + I*ss is covariance model, C has scaled principal directions as cols
M: (D by 1 ) data mean
X: (N by d ) expected states
Ye: (N by D ) expected complete observations (differs from Y if data is missing)
Based on MATLAB code from J.J. VerBeek, 2006. http://lear.inrialpes.fr/~verbeek
"""
N, D = shape(Y) #N observations in D dimensions (i.e. D is number of features, N is samples)
threshold = 1E-4 #minimal relative change in objective function to continue
hidden = isnan(Y)
missing = hidden.sum()
if(missing > 0):
M = nanmean(Y, axis=0)
else:
M = average(Y, axis=0)
Ye = Y - repmat(M,N,1)
if(missing > 0):
Ye[hidden] = 0
#initialize
C = normal(loc=0.0, scale=1.0, size=(D,d))
CtC = mm(C.T, C)
X = mm(mm(Ye, C), inv(CtC))
recon = mm(X, C.T)
recon[hidden] = 0
ss = np.sum((recon-Ye)**2) / (N*D - missing)
count = 1
old = np.inf
#EM Iterations
while(count):
Sx = inv(eye(d) + CtC/ss) #E-step, covariances
ss_old = ss
if(missing > 0):
proj = mm(X,C.T)
Ye[hidden] = proj[hidden]
X = mm(mm(Ye, C), Sx/ss) #E-step: expected values
SumXtX = mm(X.T, X) #M-step
C = mm(mm(mm(Ye.T, X),(SumXtX + N*Sx).T), inv(mm((SumXtX + N*Sx), (SumXtX + N*Sx).T)))
CtC = mm(C.T, C)
ss = (np.sum((mm(X, C.T) - Ye)**2) + N*np.sum(CtC*Sx) + missing*ss_old) / (N*D)
#transform Sx determinant into numpy float128 in order to deal with high dimensionality
Sx_det = np.min(Sx).astype(np.float64)**shape(Sx)[0] * det(Sx / np.min(Sx))
objective = N*D + N*(D*log(ss)+tr(Sx)-log(Sx_det)) + tr(SumXtX) - missing*log(ss_old)
rel_ch = np.abs( 1 - objective / old )
old = objective
count = count + 1
if( rel_ch < threshold and count > 5 ):
count = 0
if(dia == True):
print('Objective: %.2f, Relative Change %.5f' %(objective, rel_ch))
C = orth(C)
covM = cov(mm(Ye, C).T)
vals, vecs = eig( covM )
ordr = np.argsort(vals)[::-1]
vals = vals[ordr]
vecs = vecs[:,ordr]
C = mm(C, vecs)
X = mm(Ye, C)
#add data mean to expected complete data
Ye = Ye + repmat(M,N,1)
return C, ss, M, X, Ye
8. Others
⑴ Factor analysis
① Definition: A method of interpreting the structure within the data by grouping similar variables based on their correlations.
② Assumes the existence of latent variables that cannot be observed in the data.
③ Often used in social sciences or surveys.
⑵ Multi-dimensional scaling (MDS)
① Definition: A statistical technique for visualizing proximity and grouping of objects in a lower-dimensional space by identifying patterns and structures latent in the data.
② Measures the similarities and dissimilarities between objects and represents them as points in a 2D or 3D space.
⑶ Linear discriminant analysis (LDA)
① Definition: A method that reduces dimensionality by projecting data onto a lower-dimensional space, similar to PCA.
⑷ CCA (Canonical-correlation analysis): Used in the Seurat pipeline.
⑸ RPCA (Reciprocal principal component analysis): Used in the Seurat pipeline.
⑹ PHATE
⑺ ODA
⑻ LLE
⑼ LSA
⑽ PLSA (probabilistic latent semantic analysis)
① Used to decompose mixed components from the latent class model.
② Application : Used for dimension reduction in mass spectrometry data.
Input: 2021.12.3 16:22