Chapter 7. Dimension Reduction Algorithm

Recommended post: 【Algorithm】 Algorithm Index

1. Overview

2. Type 1. Principal Component Analysis (PCA)

3. Type 2. SNE, symmetric-SNE, tSNE

4. Type 3. UMAP

5. Type 4. SVD

6. Type 5. ICA

7. Type 6. PPCA

8. Type 7. CCA

9. Others

a. SNE, symmetric-SNE, tSNE

b. UMAP

1. Overview

⑴ Definition of Dimension Reduction

① When given high-dimensional data X₍₁₎, ···, X_(n) ∈ ℝ^p,

② Find an appropriate subspace V ⊂ ℝ^p (where dim(V) = k),

③ Calculate the projection of X onto V, denoted as X̃.

④ It is an unsupervised algorithm.

⑵ Characteristics

① Information preservation

② Easier model training: In machine learning, it requires fewer parameters, so it is faster with lower computation.

③ Easier result interpretation: Visualization of high-dimensional data.

④ Noise reduction

⑤ It can reveal the true dimensionality of the given data.

⑶ Methods

① Feature selection

○ Selecting a few important variables from the existing ones and discarding the rest.

○ Choose one variable with a high correlation or high variance inflation factor (VIF).

② Feature extraction

○ Combine all variables to derive new variables that best represent the data.

○ Techniques that combine existing variables to create new ones.

2. Type 1. Principal Component Analysis (PCA)

⑴ Overview

① Definition: A dimension reduction technique that transforms correlated variables into orthogonal variables.

② In other words, it is a dimension reduction method that leaves only a few meaningful coordinate systems.

③ The orthogonal variables are called principal components, which can be used when they contain as much information as the original data.

④ It originates from the awareness that the given dimensionality and the true dimensionality of the data may differ.

⑤ Introduced by Pearson in 1901.

⑥ PCA can be defined not only in subspace (passing through the origin) but also in affine space.

⑦ ** Normalize by the mean and then perform PCA on the subspace.

⑧ PCA does not require multivariate normality of the given data.

⑨ It can be used when the data is linearly distributed.

⑩ Whitening: It is better to standardize features along the principal component directions when performing standardization.

⑵ Premises

① Represent the basis of the space where the data belongs as e₁, ···, e_p (the set of bases does not need to be orthogonal).

○ e₁, ···, e_p can be considered as bases representing the given dimensions (no need for orthogonality).

② Goal: To find the orthogonal basis Z₁, ···, Z_k of such a subspace (k < n) (typical unsupervised problem setting).

○ Z₁, ···, Z_k can be considered as bases representing the true dimensionality of the data.

③ Random vector

○ X represents a specific data point: x₁, ···, x_n are each component.

④ Mean vector: Also called population mean.

⑤ Variance-covariance matrix

⑥ Unbiased sample variance-covariance matrix: With the design matrix X, and center matrix X_c,

○ Used when the population variance-covariance matrix is unknown.

○ Since X ∈ ℝ^n×p, X^TX ∈ ℝ^p×n × ℝ^n×p = ℝ^p×p.

⑦ Expression of the goal orthogonal basis

○ Z_k: = k-th PC ∈ ℝ^p×1.

⑧ Characteristics of the goal orthogonal basis: Easily derived from linear algebra concepts.

○ Var, Cov, etc., are specific values related to variance, belonging to ℝ^1x1.

○ Do not confuse with the variance-covariance matrix.

⑶ Definition of the PCA problem

⑷ Solution to the PCA problem

① Method 1. Eigen-decomposition of ∑.

○ Step 1: Data Construction and Centering

○ X ∈ ℝ^n×d: A data matrix with n samples and d features.

○ Center the data matrix X: X_c = X - μ with μ ∈ ℝ^d is the vector of column-wise means)

○ Step 2: Covariance Matrix Calculation

○ Compute the sample covariance matrix C ∈ ℝ^d×d from the centered data:

○ Since the given sample represents the entire population, the population mean is known, so the sample covariance is calculated by dividing by n instead of n−1.

○ Step 3: Eigen-decomposition

○ Perform eigen-decomposition of the covariance matrix C to compute eigenvalues and eigenvectors: C · p_i = λ_i · p_i

○ Here, λ_i represents the eigenvalue and p_i ∈ R^d is the eigenvector.

○ Note: The symmetric matrix C can always be orthogonally diagonalized (by the spectral theorem).

○ Step 4: Data Transformation

○ Define P = [p₁, p₂, …, p_d] ∈ R^d×d, a matrix where the eigenvectors are stored as columns.

○ Transform the data into a new coordinate system: Y = X_cP where Y ∈ R^n×d is the transformed data matrix.

○ Step 5: Diagonalization of the Covariance Matrix

○ The covariance matrix of the transformed data Y becomes diagonal:

○ Step 6: Principal Component Selection and Dimensionality Reduction

○ Select k principal components (eigenvectors) corresponding to the largest eigenvalues λ_i and reduce the dimensionality: Y_k = X_cP_k with P_k ∈ ℝ^d×k, Y_k ∈ ℝ^n×k

○ Alternatively, instead of fixing the number of principal components, eigenvalues smaller than a specific threshold η can be excluded.

○ Conclusion 1. The d-th principal component PC_d is the eigenvector corresponding to the d-th largest eigenvalue.

○ Conclusion 2. The meaning of eigenvectors with the same eigenvalue: The magnitude of the corresponding principal component is the same.

② Method 2. Singular value decomposition (SVD) of the center matrix X_c: The standard method currently used in PCA algorithms.

○ Step 1: Center the data to obtain the centered data X_c.

○ Step 2: Define C = (1/n) X_c^TX_c = ∑ (covariance matrix) ○ Step 3: Since C is symmetric, it can be orthogonally diagonalized as C = VDV^T (where V is the same as the V in X_c = U∑V^T)

○ Step 4. The 1^st, 2^nd, ··· th column vectors of V are the eigenvectors corresponding to the 1^st, 2^nd, ··· th largest eigenvalues and are the principal components.

③ Method 3. Lagrange multiplier under equality constraint.

⑸ Determining the number of axes

① Method 1. Scree plot for the fraction of total variance.

○ Definition of the fraction of total variance.

○ Graphical representation.

Figure 1. Example of axis setting.

Figure 2. Scree plot: Shows the amount of variance explained by each principal component.

② Method 2. Scree plot for eigenvalues.

③ Method 3. Use in-sample 10-fold cross-validation to determine the number of axes p that minimizes MSPE.

Figure 3. Determination of the number of axes.

④ Since information about the remaining axes is lost, PCA functions as a lossy compression algorithm.

⑹ Python code example

import numpy as np
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from ipywidgets import interact
%matplotlib inline
from sklearn.datasets import fetch_openml

def normalize(X):
    mu = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    std_filled = std.copy()
    std_filled[std==0] = 1.
    Xbar = ((X-mu)/std_filled)
    return Xbar, mu, std

def eig(S):
    eigvals, eigvecs = np.linalg.eig(S)
    k = np.argsort(eigvals)[::-1]
    return eigvals[k], eigvecs[:,k]

def projection_matrix(B):
    return (B @ np.linalg.inv(B.T @ B) @ B.T)

def PCA(X, num_components=2):
    # reconstruction
    S = 1.0/len(X) * np.dot(X.T, X)
    eig_vals, eig_vecs = eig(S)
    eig_vals, eig_vecs = eig_vals[:num_components], eig_vecs[:, :num_components]
    B = np.real(eig_vecs)
    reconst = (projection_matrix(B) @ X.T)
    return reconst.T

def PC_coords(X, num_components=2):   
    # low-dimensional projection
    S = 1.0/len(X) * np.dot(X.T, X)
    eig_vals, eig_vecs = eig(S)
    eig_vals, eig_vecs = eig_vals[:num_components], eig_vecs[:, :num_components]
    B = np.real(eig_vecs)
    Xpca = X@B    
    return Xpca

images, labels = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
labels=np.stack(labels).astype(None)
X = (images.reshape(-1, 28 * 28)) / 255.
subset_X = X[labels == 3]
Xbar, mu, std = normalize(subset_X)

num_components=2
X = Xbar
S = 1.0/len(X) * np.dot(X.T, X)
eig_vals, eig_vecs = eig(S)
eig_vals, eig_vecs = eig_vals[:num_components], eig_vecs[:, :num_components]
print(eig_vecs)

num_component = 2
reconst = PCA(Xbar, num_component)
reconstructions = reconst * std + mu # "unnormalize" the reconstructed image

⑺ Development of PCA

① Kernel PCA: Uses kernels. Can either extend the given data to high dimensions or reduce it to lower dimensions through projection.

② MDS (Multidimensional Scaling)

③ SNE, t-SNE

④ UMAP

3. Type 2. SNE, symmetric-SNE, tSNE

⑴ A dimension reduction algorithm that attempts to preserve local neighborhoods in the data.

⑵ Nonlinear and non-deterministic.

⑶ Requires knowledge of cost function, updating algorithm, etc., related to deep learning.

4. Type 3. UMAP

5. Type 4. Singular Value Decomposition (SVD)

⑴ A method that extracts singular values from an M × N dimensional matrix and reduces the dimensionality of the given data.

⑵ Formula

① Summary: If A is an m × n matrix with rank r, then there exist ℝm×m matrix U and ℝn×n matrix V that satisfy the following conditions:

○ U and V are orthonormal matrices.

○ Since changing the sign of each column of U and the corresponding column of V still satisfies the above equation, the SVD is not uniquely determined.

○ Matrix ∑: An m × n matrix with r nonzero singular values (> 0) on the diagonal.

② Whole process

③ Singular values are defined as the positive square roots of the eigenvalues of AA^T or A^TA.

○ AA^T and A^TA may differ in the number of zero eigenvalues, but they yield the same non-zero eigenvalues.

○ The number of non-zero singular values is equal to rank(A).

○ The number of zero singular values is max(dim(A)[0], dim(A)[1])−rank(A).

④ Python code

# np.linalg.svd computes the SVD of a matrix
A = np.array([[4,11,14],
            [8,7,-2]])
U, s, Vt=np.linalg.svd(A)
print(U)
print(s) # s is the vector of nonzero singular values
print(Vt)

⑶ Application 1. A^TA = V(∑^T∑)V^T : The singular values of A are the square roots of the eigenvalues of A^TA. However, some eigenvalues are set to 0 according to the rank.

⑷ Application 2. reduced SVD: Note that for the reduced version, both U and V have orthonormal columns. This means that U^TU = I and V^TV = I. However, U and V are not square in this version, so they are not orthogonal matrices.

A = np.array([[4,11,14],
            [8,7,-2]])
U, s, Vt=np.linalg.svd(A, full_matrices=False)
print(U)
print(s)
print(Vt)

⑸ Application 3. rank-k approximation

① For k < rank(A), the m × n matrix A can be split into m × k + k × n matrices to achieve compression.

② Given k, how to find the optimal rank-k approximation of A:

○ A = UΣV^T = σ₁u₁v₁^T + ··· + σ_ru_rv_r^T, where σ₁ ≥ ⋯ ≥ σ_r, remove the smaller singular values σ_r, σ_r−1, ….

○ By doing this, the Frobenius distance between A^(k) and A is defined as follows:

⑹ Application 4. PCA(principal component analysis)

① Step 1. Center the data

Figure. 4. Data centering of PCA

② Step 2. X_c is the centered data and C = (1/n) X_c^TX_c

③ Step 3. C is symmetric, so it is orthogonally diagonalizable as C = VDV^T. V is the same as in X_c = U∑V^T in SVD.

④ Step 4. The 1^st, 2^nd, ··· th column vectors of V are the eigenvectors corresponding to the 1^st, 2^nd, ··· th largest eigenvalues and are the principal components.

⑤ Step 5. If you want to get the new coordinates of the vectors in X based on just first 2 PCs, you simply project x_i (a vector of length p) onto the first two PCs:

6. Type 5. Independent Component Analysis (ICA)

⑴ Unlike principal component analysis, it is a dimensionality reduction technique that statistically separates multivariate signals into independent subcomponents.

⑵ For example, a scenario where $N$ microphones distinguish $M$ conversations.

⑶ The distribution of independent components follows a non-Gaussian distribution.

⑷ It utilizes matrix factorization.

7. Type 6. PPCA

⑴ Formulation

① Step 1. Assume the observed data x is generated by the latent variable z: The following transformation itself becomes a dimensionality reduction technique.

○ W: Transformation matrix of dimension D × d (for mapping from latent space to observed space)

○ b: Mean vector of dimension D

○ ϵ: Noise, following a normal distribution N(0,σ²I)

② Step 2. Assumption on the latent variable z

③ Step 3. Conditional distribution of x given z

④ Step 4. Marginal distribution p(x)

⑤ Step 5. MLE Estimation

○ Maximum Likelihood Estimation (MLE) is used to estimate b, W, and σ from the given data.

○ The MLE solution is obtained in closed form.

○ When σ → 0, the method converges to standard PCA.

⑵ Code

import numpy as np
from numpy import shape, isnan, nanmean, average, zeros, log, cov
from numpy import matmul as mm
from numpy.matlib import repmat
from numpy.random import normal
from numpy.linalg import inv, det, eig
from numpy import identity as eye
from numpy import trace as tr
from scipy.linalg import orth

def ppca(Y,d,dia):
    """
    Implements probabilistic PCA for data with missing values,
    using a factorizing distribution over hidden states and hidden observations.

    Args:
        Y:   (N by D ) input numpy ndarray of data vectors
        d:   (  int  ) dimension of latent space
        dia: (boolean) if True: print objective each step

    Returns:
        ss: ( float ) isotropic variance outside subspace
        C:  (D by d ) C*C' + I*ss is covariance model, C has scaled principal directions as cols
        M:  (D by 1 ) data mean
        X:  (N by d ) expected states
        Ye: (N by D ) expected complete observations (differs from Y if data is missing)

        Based on MATLAB code from J.J. VerBeek, 2006. http://lear.inrialpes.fr/~verbeek
    """
    N, D      = shape(Y) #N observations in D dimensions (i.e. D is number of features, N is samples)
    threshold = 1E-4 #minimal relative change in objective function to continue
    hidden    = isnan(Y)
    missing   = hidden.sum()

    if(missing > 0):
        M = nanmean(Y, axis=0)
    else:
        M = average(Y, axis=0)

    Ye = Y - repmat(M,N,1)

    if(missing > 0):
        Ye[hidden] = 0

    #initialize
    C = normal(loc=0.0, scale=1.0, size=(D,d))
    CtC = mm(C.T, C)
    X   = mm(mm(Ye, C), inv(CtC))
    recon = mm(X, C.T)
    recon[hidden] = 0
    ss = np.sum((recon-Ye)**2) / (N*D - missing)

    count = 1
    old = np.inf

    #EM Iterations
    while(count):
        Sx = inv(eye(d) + CtC/ss) #E-step, covariances
        ss_old = ss
        if(missing > 0):
            proj = mm(X,C.T)
            Ye[hidden] = proj[hidden]
        
        X = mm(mm(Ye, C), Sx/ss) #E-step: expected values

        SumXtX = mm(X.T, X) #M-step
        C = mm(mm(mm(Ye.T, X),(SumXtX + N*Sx).T), inv(mm((SumXtX + N*Sx), (SumXtX + N*Sx).T)))
        CtC = mm(C.T, C)
        ss = (np.sum((mm(X, C.T) - Ye)**2) + N*np.sum(CtC*Sx) + missing*ss_old) / (N*D)
        #transform Sx determinant into numpy float128 in order to deal with high dimensionality
        Sx_det = np.min(Sx).astype(np.float64)**shape(Sx)[0] * det(Sx / np.min(Sx))
        objective = N*D + N*(D*log(ss)+tr(Sx)-log(Sx_det)) + tr(SumXtX) - missing*log(ss_old)
        
        rel_ch = np.abs( 1 - objective / old )
        old = objective
        
        count = count + 1
        if( rel_ch < threshold and count > 5 ):
            count = 0
        if(dia == True):
            print('Objective: %.2f, Relative Change %.5f' %(objective, rel_ch))
    
    C = orth(C)
    covM = cov(mm(Ye, C).T)
    vals, vecs = eig( covM )
    ordr = np.argsort(vals)[::-1]
    vals = vals[ordr]
    vecs = vecs[:,ordr]
    
    C = mm(C, vecs)
    X = mm(Ye, C)
    
    #add data mean to expected complete data
    Ye = Ye + repmat(M,N,1)

    return C, ss, M, X, Ye

8. Type 7. CCA

⑴ CCA (Canonical Correlation Analysis)

① Problem Statement: Given data matrices X ∈ ℝ^n×p and Y ∈ ℝ^n×q, the goal is to maximize the correlation coefficient of the transformed variables.

○ ∑_xx = X^TX: Covariance matrix of X ∈ ℝ^p×p

○ ∑_yy = Y^TY: Covariance matrix of Y ∈ ℝ^q×q

○ ∑_xy = X^TY: Cross-covariance matrix between X and Y ∈ ℝ^p×q

○ ∑_yx = (∑_xy)^T

② Solution Approach: Reformulate the problem as an optimization problem under the constraints: w_x^T∑_xxw_x = 1 and w_y^T∑_yyw_y = 1, and apply the Lagrange multiplier method.

③ Conclusion: The canonical correlation is given by the positive square root of the generalized eigenvalues of the matrix: ∑_xx^-1∑_xy∑_yy^-1∑_yx.

⑵ DCCA (Diagonal CCA)

① Problem Statement: Given data matrices X ∈ ℝ^n×p and Y ∈ ℝ^n×q

○ D_x = diag(∑_xx): Using only the diagonal elements of the covariance matrix

○ D_y = diag(∑_yy): Using only the diagonal elements of the covariance matrix

② Process: The singular values of SVD determine the canonical correlation values.

⑶ Comparison of CCA and DCCA

Method	CCA	DCCA
Covariance Matrix	Full matrix used (( \Sigma_{xx}, \Sigma_{yy} ))	Diagonal elements only (( D_x, D_y ))
Solution	Generalized eigenvalue problem	SVD
Computational Cost	Requires matrix inversion (high cost)	Diagonal matrix operations (low cost)
Result Interpretation	Eigenvectors, eigenvalues	Singular vectors, singular values

Table 1. Comparison of CCA and DCCA

9. Others

⑴ Factor analysis

① Definition: A method of interpreting the structure within the data by grouping similar variables based on their correlations.

② Assumes the existence of latent variables that cannot be observed in the data.

③ Often used in social sciences or surveys.

⑵ Multi-dimensional scaling (MDS)

① Definition: A statistical technique for visualizing proximity and grouping of objects in a lower-dimensional space by identifying patterns and structures latent in the data.

② Measures the similarities and dissimilarities between objects and represents them as points in a 2D or 3D space.

⑶ Linear discriminant analysis (LDA)

① Definition: A method that reduces dimensionality by projecting data onto a lower-dimensional space, similar to PCA.

⑷ RPCA (Reciprocal principal component analysis)

⑸ Autoencoder

① Variational Autoencoder (VAE)

② Graph Autoencoder (GAE)

③ Matrix Factorization and NMF

④ Latent Topic Model: Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), etc.

⑹ Signal Pattern Analysis (SPA) (ref1, ref2)

① Instead of the traditional Laplacian matrix used in spectral clustering, SPA employs the power of the Markov transition matrix P^t, derived from the graph, to capture information at various diffusion times t.

② Shorter diffusion times t are sensitive to local patterns, while longer times reflect more global structures, leveraging this property for analysis.

⑺ PHATE

⑻ ODA

⑼ LLE

⑽ LSA

⑾ PLSA (probabilistic latent semantic analysis)

① Used to decompose mixed components from the latent class model.

② Application : Used for dimension reduction in mass spectrometry data.

Input: 2021.12.3 16:22

2158

Chapter 7. Dimension Reduction Algorithm

1. Overview

2. Type 1. Principal Component Analysis (PCA)

3. Type 2. SNE, symmetric-SNE, tSNE

4. Type 3. UMAP

5. Type 4. Singular Value Decomposition (SVD)

6. Type 5. Independent Component Analysis (ICA)

7. Type 6. PPCA

8. Type 7. CCA

9. Others

results matching ""

No results matching ""