
Chapter 7. Dimension Reduction Algorithm

Recommended post: 【Algorithm】 Algorithm Index


1. Overview

2. Type 1. Principal Component Analysis (PCA)

3. Type 2. SNE, symmetric-SNE, tSNE

4. Type 3. UMAP

5. Type 4. SVD

6. Type 5. ICA

7. Type 6. PPCA

8. Others





1. Overview

⑴ Definition of Dimension Reduction

① When given high-dimensional data X(1), ···, X(n) ∈ ℝp,

② Find an appropriate subspace V ⊂ ℝp (where dim(V) = k),

③ Calculate the projection of X onto V, denoted as X̃.

④ It is an unsupervised algorithm.
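For illustration only, a minimal NumPy sketch of step ③, assuming an orthonormal basis of V is stored as the columns of a hypothetical matrix B ∈ ℝp×k:


import numpy as np

# toy data: n = 5 points in p = 3 dimensions (illustrative values only)
X = np.random.randn(5, 3)

# assume an orthonormal basis of a k = 2 dimensional subspace V,
# stored as the columns of B (here simply the first two coordinate axes)
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

X_tilde = X @ B @ B.T   # projection of each row of X onto V (in the original coordinates)
X_low = X @ B           # the same points in the k-dimensional coordinates of V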

⑵ Characteristics

① Information preservation

② Easier model training: models need fewer parameters, so training is faster and computationally cheaper.

③ Easier result interpretation: Visualization of high-dimensional data.

④ Noise reduction

⑤ It can reveal the true dimensionality of the given data.

⑶ Methods

① Feature selection

○ Selecting a few important variables from the existing ones and discarding the rest.

○ When variables are highly correlated or have a high variance inflation factor (VIF), keep only one of them and drop the rest.

② Feature extraction

○ Combine all variables to derive new variables that best represent the data.

○ Techniques that combine existing variables to create new ones.
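A small sketch contrasting the two approaches, assuming scikit-learn is available; the variables x1, x2, x3 are made up for illustration:


import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)   # nearly a duplicate of x1 (high correlation / VIF)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

# feature selection: keep only one of the highly correlated variables
corr = np.corrcoef(X, rowvar=False)
print(corr[0, 1])                        # close to 1 -> x1 and x2 are nearly redundant
X_selected = X[:, [0, 2]]                # drop x2, keep x1 and x3

# feature extraction: combine all variables into new ones (here via PCA)
X_extracted = PCA(n_components=2).fit_transform(X)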



2. Type 1. Principal Component Analysis (PCA)

⑴ Overview

① Definition: A dimension reduction technique that transforms correlated variables into orthogonal variables.

② In other words, it is a dimension reduction method that leaves only a few meaningful coordinate systems.

③ The orthogonal variables are called principal components, which can be used when they contain as much information as the original data.

④ It originates from the awareness that the given dimensionality and the true dimensionality of the data may differ.

⑤ Introduced by Pearson in 1901.

⑥ PCA can be defined not only on a subspace (passing through the origin) but also on an affine space.

⑦ In practice, center the data by subtracting the mean and then perform PCA on the resulting subspace.

⑧ PCA does not require multivariate normality of the given data.

⑨ It can be used when the data is linearly distributed.

⑩ Whitening: when standardizing, it is preferable to standardize the features along the principal component directions rather than along the original axes.
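A minimal sketch of PCA whitening as in ⑩ (the data and variable names are illustrative):


import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 2.0]], size=500)

Xc = X - X.mean(axis=0)                  # center
C = Xc.T @ Xc / len(Xc)                  # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigen-decomposition (ascending eigenvalues)

# whitening: rotate onto the principal component directions,
# then divide each direction by its standard deviation
X_white = (Xc @ eigvecs) / np.sqrt(eigvals)

# the whitened data has (approximately) identity covariance
print(np.round(np.cov(X_white, rowvar=False), 2))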

⑵ Premises

① Denote a basis of the space containing the data by e1, ···, ep (these basis vectors need not be orthogonal).

○ e1, ···, ep can be considered as bases representing the given dimensions (no need for orthogonality).

② Goal: To find the orthogonal basis Z1, ···, Zk of such a subspace (k < n) (typical unsupervised problem setting).

○ Z1, ···, Zk can be considered as bases representing the true dimensionality of the data.


image


③ Random vector

○ X denotes a single data point (random vector); x1, ···, xp are its components.


image


④ Mean vector: Also called the population mean.


image


⑤ Variance-covariance matrix


image


⑥ Unbiased sample variance-covariance matrix: With the design matrix X and the centered matrix Xc,


image


○ Used when the population variance-covariance matrix is unknown.

○ Since X ∈ ℝn×p, XTX is the product of a p × n matrix and an n × p matrix, so XTX ∈ ℝp×p.

⑦ Expression of the goal orthogonal basis


image


○ Zk := the k-th principal component (PC), Zk ∈ ℝp×1.

⑧ Characteristics of the goal orthogonal basis: Easily derived from linear algebra concepts.


image


○ Here, Var, Cov, etc. are scalar quantities related to variance, i.e., elements of ℝ1×1.

○ Do not confuse with the variance-covariance matrix.

⑶ Definition of the PCA problem


image


⑷ Solution to the PCA problem

Method 1. Eigen-decomposition of ∑.

Step 1: Data Construction and Centering

X ∈ ℝn×d: A data matrix with n samples and d features.

○ Center the data matrix X: Xc = X − μ, where μ ∈ ℝd is the vector of column-wise means (subtracted from each row).

Step 2: Covariance Matrix Calculation

○ Compute the sample covariance matrix C ∈ ℝd×d from the centered data:


C = (1/n) XcTXc


○ If the given sample is treated as the entire population (so the population mean is regarded as known), the covariance is computed by dividing by n instead of n − 1.

Step 3: Eigen-decomposition

○ Perform eigen-decomposition of the covariance matrix C to compute eigenvalues and eigenvectors: C · pi = λi · pi

○ Here, λi is an eigenvalue and pi ∈ ℝd is the corresponding eigenvector.

○ Note: The symmetric matrix C can always be orthogonally diagonalized (by the spectral theorem).

Step 4: Data Transformation

○ Define P = [p1, p2, …, pd] ∈ ℝd×d, the matrix whose columns are the eigenvectors.

○ Transform the data into the new coordinate system: Y = XcP, where Y ∈ ℝn×d is the transformed data matrix.

Step 5: Diagonalization of the Covariance Matrix

○ The covariance matrix of the transformed data Y becomes diagonal:


CY = (1/n) YTY = PTCP = D = diag(λ1, ···, λd)


Step 6: Principal Component Selection and Dimensionality Reduction

○ Select k principal components (eigenvectors) corresponding to the largest eigenvalues λi and reduce the dimensionality: Yk = XcPk with Pk ∈ ℝd×k, Yk ∈ ℝn×k

○ Alternatively, instead of fixing the number of principal components, eigenvalues smaller than a specific threshold η can be excluded.

Conclusion 1. The i-th principal component PCi is the eigenvector corresponding to the i-th largest eigenvalue.

Conclusion 2. Eigenvectors sharing the same eigenvalue correspond to principal components that explain the same amount of variance.
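A compact NumPy sketch of Steps 1–6 above, using random data for illustration:


import numpy as np

X = np.random.randn(100, 5)                      # Step 1: n = 100 samples, d = 5 features
Xc = X - X.mean(axis=0)                          # Step 1: centering

C = Xc.T @ Xc / len(Xc)                          # Step 2: covariance matrix (divide by n)

eigvals, eigvecs = np.linalg.eigh(C)             # Step 3: eigen-decomposition (C is symmetric)
order = np.argsort(eigvals)[::-1]                # sort eigenvalues in decreasing order
eigvals, P = eigvals[order], eigvecs[:, order]   # Step 4: P holds eigenvectors as columns

Y = Xc @ P                                       # Step 4: transformed data
# Step 5: the covariance of Y is (approximately) diagonal, with the eigenvalues on the diagonal

k = 2
Yk = Xc @ P[:, :k]                               # Step 6: keep the top-k principal components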

Method 2. Singular value decomposition (SVD) of the centered matrix Xc: the standard approach used in modern PCA implementations.

Step 1: Center the data to obtain the centered data Xc.

Step 2: Define C = (1/n) XcTXc = ∑ (the covariance matrix).

Step 3: Since C is symmetric, it can be orthogonally diagonalized as C = VDVT (where V is the same as the V in the SVD Xc = UΣVT).

Step 4. The 1st, 2nd, ··· columns of V are the eigenvectors corresponding to the 1st, 2nd, ··· largest eigenvalues, and they are the principal components.
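A minimal sketch of Method 2 with np.linalg.svd (illustrative data; the rows of Vt are the principal components):


import numpy as np

X = np.random.randn(100, 5)
Xc = X - X.mean(axis=0)                              # Step 1: centering

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)    # Steps 2-3: SVD of the centered data

# Step 4: rows of Vt (columns of V) are the principal components,
# already ordered by decreasing singular value
eigvals = s**2 / len(Xc)                             # eigenvalues of C = (1/n) XcTXc
k = 2
scores = Xc @ Vt[:k].T                               # coordinates on the first k PCs (equivalently U[:, :k] * s[:k])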

Method 3. Lagrange multiplier under equality constraint.

⑸ Determining the number of axes

Method 1. Scree plot for the fraction of total variance (a code sketch follows at the end of this subsection).

○ Definition of the fraction of total variance.


image


○ Graphical representation.


image

Figure 1. Example of axis setting.


image

Figure 2. Scree plot: Shows the amount of variance explained by each principal component.


Method 2. Scree plot for eigenvalues.

Method 3. Use in-sample 10-fold cross-validation to determine the number of axes p that minimizes MSPE.


image

Figure 3. Determination of the number of axes.


④ Since information about the remaining axes is lost, PCA functions as a lossy compression algorithm.
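A small sketch of Method 1 above, computing the fraction of total variance and drawing a scree plot (matplotlib is assumed; the 90% threshold is just an example):


import numpy as np
import matplotlib.pyplot as plt

X = np.random.randn(200, 10)
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(Xc))[::-1]   # eigenvalues, largest first

frac = eigvals / eigvals.sum()       # fraction of total variance per component
cumfrac = np.cumsum(frac)            # cumulative fraction of total variance

plt.plot(range(1, len(frac) + 1), frac, 'o-', label='per component')
plt.plot(range(1, len(frac) + 1), cumfrac, 's--', label='cumulative')
plt.xlabel('principal component')
plt.ylabel('fraction of total variance')
plt.legend()
plt.show()

k = int(np.searchsorted(cumfrac, 0.9)) + 1   # e.g., smallest k explaining 90% of the variance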

⑹ Python code example


import numpy as np
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from ipywidgets import interact
%matplotlib inline
from sklearn.datasets import fetch_openml

def normalize(X):
    mu = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    std_filled = std.copy()
    std_filled[std==0] = 1.
    Xbar = ((X-mu)/std_filled)
    return Xbar, mu, std

def eig(S):
    eigvals, eigvecs = np.linalg.eig(S)
    k = np.argsort(eigvals)[::-1]
    return eigvals[k], eigvecs[:,k]

def projection_matrix(B):
    return (B @ np.linalg.inv(B.T @ B) @ B.T)

def PCA(X, num_components=2):
    # reconstruction
    S = 1.0/len(X) * np.dot(X.T, X)
    eig_vals, eig_vecs = eig(S)
    eig_vals, eig_vecs = eig_vals[:num_components], eig_vecs[:, :num_components]
    B = np.real(eig_vecs)
    reconst = (projection_matrix(B) @ X.T)
    return reconst.T

def PC_coords(X, num_components=2):   
    # low-dimensional projection
    S = 1.0/len(X) * np.dot(X.T, X)
    eig_vals, eig_vecs = eig(S)
    eig_vals, eig_vecs = eig_vals[:num_components], eig_vecs[:, :num_components]
    B = np.real(eig_vecs)
    Xpca = X@B    
    return Xpca

images, labels = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
labels = labels.astype(float)  # convert string labels to numbers
X = (images.reshape(-1, 28 * 28)) / 255.
subset_X = X[labels == 3]
Xbar, mu, std = normalize(subset_X)

num_components=2
X = Xbar
S = 1.0/len(X) * np.dot(X.T, X)
eig_vals, eig_vecs = eig(S)
eig_vals, eig_vecs = eig_vals[:num_components], eig_vecs[:, :num_components]
print(eig_vecs)

num_component = 2
reconst = PCA(Xbar, num_component)
reconstructions = reconst * std + mu # "unnormalize" the reconstructed image


⑺ Development of PCA

① Kernel PCA: Uses kernels. It can either lift the given data into a higher-dimensional feature space or reduce it to lower dimensions through projection (see the sketch after this list).

② MDS (Multidimensional Scaling)

③ SNE, t-SNE

④ UMAP
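As referenced in ①, a brief kernel PCA sketch using scikit-learn (the dataset and kernel parameters are illustrative):


from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# concentric circles: a nonlinear structure that plain (linear) PCA cannot unfold
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)                                  # linear PCA
X_kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10).fit_transform(X)   # kernel PCA
print(X_kpca.shape)   # (400, 2)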



3. Type 2. SNE, symmetric-SNE, tSNE

⑴ A dimension reduction algorithm that attempts to preserve local neighborhoods in the data.

⑵ Nonlinear and non-deterministic.

⑶ Requires knowledge of cost functions, gradient-based update rules, and other concepts also used in deep learning.
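A minimal t-SNE usage sketch with scikit-learn (the dataset and perplexity value are illustrative):


from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)      # 1797 samples, 64 features

# t-SNE embeds the data into 2D while trying to preserve local neighborhoods;
# the result is non-deterministic unless random_state is fixed
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_tsne.shape)                      # (1797, 2)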



4. Type 3. UMAP
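A minimal usage sketch, assuming the third-party umap-learn package is installed (pip install umap-learn); the parameters shown are illustrative:


import umap                                # from the umap-learn package
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# n_neighbors controls the local/global trade-off, min_dist the tightness of clusters
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=0)
X_umap = reducer.fit_transform(X)
print(X_umap.shape)                        # (1797, 2)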



5. Type 4. Singular Value Decomposition (SVD)

⑴ A method that extracts singular values from an M × N dimensional matrix and reduces the dimensionality of the given data.

⑵ Formula

① Summary: If A is an m × n matrix with rank r, then there exist an orthogonal matrix U ∈ ℝm×m and an orthogonal matrix V ∈ ℝn×n that satisfy the following:


A = U∑VT


○ U and V are orthogonal matrices (their columns are orthonormal).

○ Since changing the sign of each column of U and the corresponding column of V still satisfies the above equation, the SVD is not uniquely determined.

○ Matrix ∑: An m × n matrix with r nonzero singular values (> 0) on the diagonal.


∑ has the block form [D 0; 0 0], where D = diag(σ1, ···, σr) and σ1 ≥ σ2 ≥ ··· ≥ σr > 0.


② Whole process


image


③ Singular values are defined as the positive square roots of the eigenvalues of AAT or ATA.

○ AAT and ATA may differ in the number of zero eigenvalues, but they yield the same non-zero eigenvalues.

○ The number of non-zero singular values is equal to rank(A).

○ The number of zero singular values (zero diagonal entries of ∑) is min(dim(A)[0], dim(A)[1]) − rank(A); the larger of AAT and ATA has max(dim(A)[0], dim(A)[1]) − rank(A) zero eigenvalues.


image


④ Python code


import numpy as np

# np.linalg.svd computes the SVD of a matrix
A = np.array([[4, 11, 14],
              [8, 7, -2]])
U, s, Vt = np.linalg.svd(A)
print(U)
print(s)   # s is the vector of singular values (length min(m, n)), in decreasing order
print(Vt)


Application 1. ATA = V(∑T∑)VT: the singular values of A are the square roots of the eigenvalues of ATA; eigenvalues beyond rank(A) are zero.

Application 2. reduced SVD: Note that for the reduced version, both U and V have orthonormal columns. This means that UTU = I and VTV = I. However, U and V are not square in this version, so they are not orthogonal matrices.


image


A = np.array([[4,11,14],
            [8,7,-2]])
U, s, Vt=np.linalg.svd(A, full_matrices=False)
print(U)
print(s)
print(Vt)


Application 3. rank-k approximation

① For k < rank(A), the m × n matrix A can be approximated by the product of an m × k and a k × n matrix, so only m·k + k·n numbers need to be stored, which achieves compression.


image


② Given k, how to find the optimal rank-k approximation of A:

○ Write A = UΣVT = σ1u1v1T + ··· + σrurvrT with σ1 ≥ ⋯ ≥ σr; the rank-k approximation A(k) keeps the k largest terms and drops the smaller singular values σr, σr−1, ···, σk+1.

○ The resulting Frobenius distance between A(k) and A is as follows:


image
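A small NumPy sketch of the rank-k approximation described above (illustrative data):


import numpy as np

A = np.random.randn(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # keep only the k largest singular values

# the Frobenius error equals the square root of the sum of the squared discarded singular values
err = np.linalg.norm(A - A_k, 'fro')
print(err, np.sqrt(np.sum(s[k:]**2)))        # the two values agree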


Application 4. PCA (principal component analysis)

Step 1. Center the data


image

Figure 4. Data centering in PCA.


image


Step 2. Xc is the centered data and C = (1/n) XcTXc

Step 3. C is symmetric, so it is orthogonally diagonalizable as C = VDVT. V is the same as the V in the SVD Xc = UΣVT.

Step 4. The 1st, 2nd, ··· columns of V are the eigenvectors corresponding to the 1st, 2nd, ··· largest eigenvalues, and they are the principal components.

Step 5. To obtain the new coordinates of the vectors in X based on just the first two PCs, simply project each xi (a vector of length p) onto the first two PCs:


x̃i = (xiTv1, xiTv2), where v1 and v2 are the first two columns of V.



6. Type 5. Independent Component Analysis (ICA)

⑴ Unlike principal component analysis, it is a dimensionality reduction technique that statistically separates multivariate signals into independent subcomponents.

⑵ For example, the cocktail-party scenario in which N microphones are used to separate M simultaneous conversations.

⑶ The distribution of independent components follows a non-Gaussian distribution.

⑷ It utilizes matrix factorization.
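A brief cocktail-party style sketch using scikit-learn's FastICA (the source signals and mixing matrix are illustrative):


import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# two independent source signals ("conversations")
S = np.column_stack([np.sin(2 * t), np.sign(np.cos(3 * t))])
S += 0.05 * rng.normal(size=S.shape)

# two microphones record different mixtures of the sources
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T

# ICA recovers statistically independent components (up to order and scale)
S_est = FastICA(n_components=2, random_state=0).fit_transform(X)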



7. Type 6. PPCA

⑴ Formulation

Step 1. Assume the observed data x is generated by the latent variable z: The following transformation itself becomes a dimensionality reduction technique.


x = Wz + b + ϵ


○ W: Transformation matrix of dimension D × d (for mapping from latent space to observed space)

○ b: Mean vector of dimension D

○ ϵ: Noise, following a normal distribution N(0,σ2I)

Step 2. Assumption on the latent variable z


z ~ N(0, I)


Step 3. Conditional distribution of x given z


x | z ~ N(Wz + b, σ2I)


Step 4. Marginal distribution p(x)


x ~ N(b, WWT + σ2I)


Step 5. MLE Estimation

○ Maximum Likelihood Estimation (MLE) is used to estimate b, W, and σ from the given data.

○ The MLE solution is obtained in closed form.

○ When σ → 0, the method converges to standard PCA.
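Following Tipping & Bishop (1999), the closed-form MLE can be written as follows, where λ1 ≥ ··· ≥ λD are the eigenvalues of the sample covariance of x, Ud and Λd collect the top d eigenvectors and eigenvalues, and R is an arbitrary d × d orthogonal matrix:


b_{ML} = \bar{x}, \qquad
\sigma^2_{ML} = \frac{1}{D - d} \sum_{j = d + 1}^{D} \lambda_j, \qquad
W_{ML} = U_d \left( \Lambda_d - \sigma^2_{ML} I \right)^{1/2} R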

⑵ Code


import numpy as np
from numpy import shape, isnan, nanmean, average, zeros, log, cov
from numpy import matmul as mm
from numpy.matlib import repmat
from numpy.random import normal
from numpy.linalg import inv, det, eig
from numpy import identity as eye
from numpy import trace as tr
from scipy.linalg import orth

def ppca(Y,d,dia):
    """
    Implements probabilistic PCA for data with missing values,
    using a factorizing distribution over hidden states and hidden observations.

    Args:
        Y:   (N by D ) input numpy ndarray of data vectors
        d:   (  int  ) dimension of latent space
        dia: (boolean) if True: print objective each step

    Returns:
        C:  (D by d ) C*C' + I*ss is covariance model, C has scaled principal directions as cols
        ss: ( float ) isotropic variance outside subspace
        M:  (D by 1 ) data mean
        X:  (N by d ) expected states
        Ye: (N by D ) expected complete observations (differs from Y if data is missing)

        Based on MATLAB code from J.J. VerBeek, 2006. http://lear.inrialpes.fr/~verbeek
    """
    N, D      = shape(Y) #N observations in D dimensions (i.e. D is number of features, N is samples)
    threshold = 1E-4 #minimal relative change in objective function to continue
    hidden    = isnan(Y)
    missing   = hidden.sum()

    if(missing > 0):
        M = nanmean(Y, axis=0)
    else:
        M = average(Y, axis=0)

    Ye = Y - repmat(M,N,1)

    if(missing > 0):
        Ye[hidden] = 0

    #initialize
    C = normal(loc=0.0, scale=1.0, size=(D,d))
    CtC = mm(C.T, C)
    X   = mm(mm(Ye, C), inv(CtC))
    recon = mm(X, C.T)
    recon[hidden] = 0
    ss = np.sum((recon-Ye)**2) / (N*D - missing)

    count = 1
    old = np.inf

    #EM Iterations
    while(count):
        Sx = inv(eye(d) + CtC/ss) #E-step, covariances
        ss_old = ss
        if(missing > 0):
            proj = mm(X,C.T)
            Ye[hidden] = proj[hidden]
        
        X = mm(mm(Ye, C), Sx/ss) #E-step: expected values

        SumXtX = mm(X.T, X) #M-step
        C = mm(mm(mm(Ye.T, X),(SumXtX + N*Sx).T), inv(mm((SumXtX + N*Sx), (SumXtX + N*Sx).T)))
        CtC = mm(C.T, C)
        ss = (np.sum((mm(X, C.T) - Ye)**2) + N*np.sum(CtC*Sx) + missing*ss_old) / (N*D)
        #compute det(Sx) in a numerically stable way (factor out the smallest entry) to cope with high dimensionality
        Sx_det = np.min(Sx).astype(np.float64)**shape(Sx)[0] * det(Sx / np.min(Sx))
        objective = N*D + N*(D*log(ss)+tr(Sx)-log(Sx_det)) + tr(SumXtX) - missing*log(ss_old)
        
        rel_ch = np.abs( 1 - objective / old )
        old = objective
        
        count = count + 1
        if( rel_ch < threshold and count > 5 ):
            count = 0
        if(dia == True):
            print('Objective: %.2f, Relative Change %.5f' %(objective, rel_ch))
    
    C = orth(C)
    covM = cov(mm(Ye, C).T)
    vals, vecs = eig( covM )
    ordr = np.argsort(vals)[::-1]
    vals = vals[ordr]
    vecs = vecs[:,ordr]
    
    C = mm(C, vecs)
    X = mm(Ye, C)
    
    #add data mean to expected complete data
    Ye = Ye + repmat(M,N,1)

    return C, ss, M, X, Ye
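A hypothetical usage example for the ppca function above (the data, the single missing entry, and d = 2 are chosen only for illustration):


Y = normal(loc=0.0, scale=1.0, size=(200, 10))   # 200 samples, 10 features
Y[0, 3] = np.nan                                  # PPCA handles missing entries

C, ss, M, X, Ye = ppca(Y, d=2, dia=False)
print(X.shape)   # (200, 2): expected latent states
print(ss)        # isotropic noise variance outside the subspace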



8. Others

⑴ Factor analysis

① Definition: A method of interpreting the structure within the data by grouping similar variables based on their correlations.

② Assumes the existence of latent variables that cannot be observed in the data.

③ Often used in social sciences or surveys.

⑵ Multi-dimensional scaling (MDS)

① Definition: A statistical technique for visualizing proximity and grouping of objects in a lower-dimensional space by identifying patterns and structures latent in the data.

② Measures the similarities and dissimilarities between objects and represents them as points in a 2D or 3D space.
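A minimal MDS sketch with scikit-learn (the dataset and parameters are illustrative):


from sklearn.datasets import load_iris
from sklearn.manifold import MDS

X, y = load_iris(return_X_y=True)

# metric MDS: place the points in 2D so that pairwise distances are preserved as well as possible
X_mds = MDS(n_components=2, random_state=0).fit_transform(X)
print(X_mds.shape)   # (150, 2)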

⑶ Linear discriminant analysis (LDA)

① Definition: A method that reduces dimensionality by projecting data onto a lower-dimensional space, similar to PCA.
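Unlike PCA, LDA is supervised (it uses class labels); a minimal sketch with scikit-learn, using an illustrative dataset:


from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# project onto at most (number of classes - 1) discriminant directions
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(X_lda.shape)   # (150, 2)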

⑷ CCA (Canonical-correlation analysis): Used in the Seurat pipeline.

⑸ RPCA (Reciprocal principal component analysis): Used in the Seurat pipeline.

⑹ PHATE

⑺ ODA

⑻ LLE

⑼ LSA

⑽ PLSA (probabilistic latent semantic analysis)

① Used to decompose mixed components from the latent class model.

② Application: Used for dimension reduction of mass spectrometry data.



Input: 2021.12.3 16:22
