
Chapter 8. Clustering Algorithms

Recommended Article: 【Algorithm】 Algorithm Index


1. Introduction

2. Type 1. unsupervised hierarchical clustering

3. Type 2. K-means clustering

4. Type 3. Density-based clustering

5. Type 4. Mixture distribution clustering

6. Type 5. Graph-based clustering

7. Other Clustering Algorithms



1. Introduction

⑴ Clustering is an optimization problem

① Variability

○ The variability of a cluster can be defined as the sum of the distances, or of the squared distances, from each element of the cluster to the cluster mean: variability(c) = Σe∈c distance(mean(c), e)²

Types of distance functions

○ Reason for using the squared distance: "big and bad" is worse than "small and bad", so elements far from the cluster mean are penalized more heavily

② Dissimilarity

○ The dissimilarity of a set of clusters can be defined as the sum of the variabilities of the individual clusters: dissimilarity(C) = Σc∈C variability(c)


○ Clustering is not about finding a global minimum for dissimilarity

Reason 1. A trivial solution exists: with as many clusters as samples (each sample in its own cluster), every cluster's variability is 0

Reason 2. It’s unfair to penalize large clusters per se: big is not necessarily worse than small

○ Constraints are needed: minimum distance between clusters, number of clusters, etc.

⑵ Clustering is an unsupervised algorithm

① Because of this (there are no ground-truth labels), evaluating the performance of clustering algorithms is difficult, but it can be assessed with metrics such as the following

Method 1. NMI(normalized mutual information)

Method 2. ARI(adjusted rand index)

Method 3. ASW(average silhouette width)

Method 4. AMI (adjusted mutual information)

Method 5. Homogeneity

Method 6. Relative distance between clusters
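As a minimal sketch (assuming scikit-learn is available), the label-based metrics above can be computed when a reference labeling exists, while the average silhouette width needs only the data and the predicted clusters:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             homogeneity_score, normalized_mutual_info_score,
                             silhouette_score)

# Toy data with known labels, used only to illustrate the metrics
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

print(normalized_mutual_info_score(y_true, y_pred))  # NMI
print(adjusted_rand_score(y_true, y_pred))           # ARI
print(adjusted_mutual_info_score(y_true, y_pred))    # AMI
print(homogeneity_score(y_true, y_pred))             # homogeneity
print(silhouette_score(X, y_pred))                   # ASW (no reference labels needed)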

⑶ Types of Clustering

Type 1. agglomerative: bottom-up approach that starts from individual entities and progressively merges them into larger clusters

Type 2. divisive: top-down approach that starts from the whole dataset and progressively divides it into smaller clusters

③ In practice, agglomerative is more commonly used



2. Type 1. Unsupervised hierarchical clustering

⑴ Overview

① Often used to draw a heatmap from a data matrix (Example: the heatmap function in R)

② A type of graph-based method

③ Unlike non-hierarchical clustering, hierarchical clustering does not pre-determine the number of clusters

⑵ Type 1. Divisive Analysis

① A technique that starts from the entire group as one cluster and successively splits off objects with low similarity

⑶ Type 2. Agglomerative Analysis (agglomerative hierarchical clustering)

① Definition: A method that treats each entity as its own subgroup and progressively merges the most similar subgroups into new, larger subgroups

Stage 1: Initially assign the N elements to N singleton clusters

Stage 2: Merge the two closest clusters into one

Stage 3: Repeat Stage 2 until a constraint such as the target number of clusters is met

○ A dendrogram is used to visualize this merging process

⑷ Linkage metric: Defining distance between clusters

Type 1. Single-linkage: The shortest distance between elements of one cluster and another

Type 2. Complete-linkage: The farthest distance between elements of one cluster and another

Type 3. Average-linkage: The average of all pairwise distances between elements of one cluster and elements of the other

○ Not to be confused with centroid-linkage (Type 4), which uses the distance between the cluster centers

Type 4. Centroid-linkage: Measuring the distance between the centers of two clusters

Type 5. Ward-linkage: A method of measuring distance between clusters based on within-cluster sum of squares


Figure 1. Performance of clustering according to linkage metric
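As a brief sketch of trying different linkage criteria (assuming SciPy and matplotlib are available; the toy data are arbitrary):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Toy data: two well-separated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# Agglomerative clustering with several linkage criteria
for method in ['single', 'complete', 'average', 'centroid', 'ward']:
    Z = linkage(X, method=method)                     # hierarchical merge tree
    labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters

# Dendrogram of the last linkage result
dendrogram(Z)
plt.show()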


⑸ Features

① One of the most frequently used clustering algorithms along with K-means clustering

Advantage 1. Deterministic: produces the same result on every run, since no random initialization is involved

Advantage 2. Can try various linkage criteria

Disadvantage 1. Slow: the naïve algorithm generally requires O(n³) time

○ Reason: O(n²) work is needed to find the closest pair of clusters (Stage 2), and this is repeated O(n) times

○ For some linkage criteria, O(n²) algorithms exist




3. Type 2. K-means clustering

⑴ Overview

① Often used for clustering image data (e.g., grouping pixels by color)

⑵ Algorithm



Figure 2. How K-means clustering works


Stage 1: Randomly select k initial centroids

Stage 2: Assign clusters based on which centroid each element is closest to

Stage 3: Calculate the center of each cluster to define new centroids

Stage 4: Repeat Stages 2 and 3; clustering ends when the centroids no longer change
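A from-scratch sketch of these stages (NumPy only; a toy example in which empty clusters are not handled):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Stage 1: randomly select k initial centroids from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Stage 2: assign each element to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Stage 3: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stage 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2, 3], [1, 2, 3.1], [2, 2, 4], [2, 2, 4.1]])
labels, centroids = kmeans(X, k=2)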

⑶ Features

① The most widely used clustering algorithm

② K-means clustering is one of the greedy algorithms for clustering

③ Determining K

Background knowledge

○ Example: Differentiating between cancerous and non-cancerous data in bio multi-omics data

Elbow Method

○ A method to determine the number of clusters in cluster analysis by plotting the total within-cluster sum of squares against the number of clusters and selecting the elbow point as the appropriate number of clusters

○ With the number of clusters on the x-axis and the SSE on the y-axis, choose the number of clusters at the elbow point where the slope flattens (a code sketch of the elbow and silhouette methods appears after the Python code below)

Silhouette Method

○ A method that quantitatively calculates the quality of clustering using silhouette coefficients

○ Shows how well clusters are separated from each other

○ A silhouette coefficient close to 1 indicates well-optimized, well-separated clusters

○ A silhouette coefficient close to 0 indicates poorly optimized, closely spaced clusters

Dendrogram: Use dendrogram visualization in hierarchical cluster analysis to determine the number of clusters

Advantage 1. Fast: O(Kn) time is required to perform Stage 2 (K is the number of clusters, n is the number of data points)

Advantage 2. Mathematically converges after a finite number of iterations as the objective function decreases with each iteration

Disadvantage 1. Choosing the wrong K can lead to incorrect results

Disadvantage 2. Results can vary depending on initial centroid position, so it’s not deterministic

⑷ Python code

from sklearn.cluster import KMeans

X = [[1, 2, 3],[1, 2, 3.1],[2, 2, 4],[2, 2, 4.1]]
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)
cluster_labels = kmeans.labels_

print(cluster_labels)
# [1 1 0 0]
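A sketch of the elbow and silhouette methods for choosing K described above (assuming scikit-learn and matplotlib; the sample data are arbitrary):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(2, 9)
sse, sil = [], []
for k in ks:
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    sse.append(km.inertia_)                      # total within-cluster sum of squares
    sil.append(silhouette_score(X, km.labels_))  # average silhouette width

# Elbow method: look for the k where the SSE curve bends
plt.plot(ks, sse, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('SSE')
plt.show()

# Silhouette method: pick the k with the largest average silhouette
best_k = list(ks)[sil.index(max(sil))]
print(best_k)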

⑸ Unsupervised hierarchical clustering vs. K-means clustering


Item              Unsupervised Hierarchical Clustering    K-means Clustering
Time Complexity   O(n³)                                   O(kn) × number of iterations
Randomness        Deterministic                           Random (depends on initial centroids)

Table 1. Comparison of unsupervised hierarchical clustering and K-means clustering


⑹ K-NN vs. K-means clustering


Item                      K-NN                                  K-means Clustering
Type                      Supervised learning                   Unsupervised learning
Meaning of k              Number of nearest neighbors           Number of clusters
Optimization techniques   Cross validation, confusion matrix    Elbow method, silhouette method
Application               Classification                        Clustering

Table 2. K-NN vs. K-means clustering


Type 2-1. K-medoid clustering

Type 2-2. Lloyd’s K-means clustering algorithm



4. Type 3. Density-Based Clustering (DBSCAN, density-based spatial clustering of applications with noise)

⑴ Overview: One of the non-hierarchical clustering analyses like K-means clustering

① An algorithm that groups closely distributed entities based on the calculation of their densities

② No need to pre-specify the number of clusters

③ Clusters are grown by connecting dense regions, allowing clusters of arbitrary geometric shape to be found

⑵ Components

① Core point

○ A data point that has a minimum number (min_samples) of other data points within a certain radius (epsilon)

○ The minimum number of data points required within the radius is set as a hyperparameter

② Neighbor point

○ Other data existing within a certain radius around a specific data point

③ Border point

○ Not a core point but exists within the radius of a core point

○ Included in the cluster centered around the core point and typically forms the outskirts of the cluster

④ Noise point

○ Neither a core point nor satisfies the border point condition

○ Also considered as an outlier

⑶ Procedure

Stage 1. Identify core points with a minimum number (min_samples) of points within a certain radius (epsilon).

Stage 2. Explore connected components in the adjacency graph using only core points.

Stage 3. Assign each remaining non-core point to the cluster of a core point within its radius (border point); points not reachable from any core point are labeled as noise.

⑷ Features

Advantage 1. No need to predefine the number of clusters, unlike k-means clustering

Advantage 2. Can identify clusters of arbitrary geometric shape, since clustering is based on density

Disadvantage 1. Difficult to determine hyperparameters, sensitive to parameter choices

Disadvantage 2. Challenges in computation when clusters have varying densities or in higher dimensions

⑸ Python example




Figure 3. DBSCAN Python example


import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Step 1: Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Step 2: Convert to DataFrame
df = pd.DataFrame(X, columns=['x', 'y'])

# DBSCAN parameters
epsilon = 0.5  
min_samples = 5 

# Step 3: Perform DBSCAN clustering
db = DBSCAN(eps=epsilon, min_samples=min_samples)
df['cluster_label'] = db.fit_predict(df[['x', 'y']])

# Step 4: Visualize the clustering results
plt.figure(figsize=(12, 8))
plt.scatter(df['x'], df['y'], c=df['cluster_label'], cmap='viridis', marker='o')
plt.title('DBSCAN Clustering')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()


⑹ Application

① OPTICS


(skip)

from sklearn.cluster import OPTICS

# OPTICS parameters
min_samples = 5
xi = 0.05
min_cluster_size = 0.05

# Perform OPTICS clustering
optics = OPTICS(min_samples=min_samples, xi=xi, min_cluster_size=min_cluster_size)
df['cluster_label'] = optics.fit_predict(df[['x', 'y']])

(skip)


② HDBSCAN


(skip)

import hdbscan

# HDBSCAN parameters
min_cluster_size = 5
min_samples = 5

# Perform HDBSCAN clustering
hdb = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples)
df['cluster_label'] = hdb.fit_predict(df[['x', 'y']])

(skip)


③ DBSCAN++



5. Type 4. Mixture Distribution Clustering

⑴ Mixture Distribution Clustering: One of the non-hierarchical clustering analyses, like K-means clustering

① Assumes the data come from a population model represented as a weighted sum of k parametric models, and estimates the parameters and weights from the data

② Each of the k models represents a cluster, and each data point is classified into a cluster based on which model it most likely came from

③ Definition of Mixture Model: A model whose distribution is represented as a weighted sum of M distributions (components)

p(x | θ) = Σi p(x | Ci, θi) p(Ci)

○ p(x | Ci, θi): The probability density function of the i-th component of the mixture model

○ θi: The parameter vector of the i-th distribution

○ Ci: The i-th cluster (class)

○ p(Ci): The importance or weight (αi) of the i-th cluster in the mixture model

④ Estimating parameters of the mixture model is complex compared to single models, often using the EM algorithm instead of differentiation

⑤ Accuracy of estimation may decrease or become challenging if the size of clusters is too small

⑥ Sensitive to outliers, hence preprocessing such as outlier removal is necessary

⑵ Gaussian Mixture Model (GMM)



Figure 4. Gaussian Mixture Model


① Overview

○ Assumes the overall data probability distribution is a linear combination of k Gaussian distributions.

○ Clusters are formed based on the probability of belonging to each distribution.

② Formula

○ GMM estimates the weights, means, and covariances of k Gaussian distributions that best fit the given data X = {x1, x2, …, xN}

○ Simplification: When x is one-dimensional data


p(x | θ) = Σk=1..K αk N(x | μk, σk²), where N(x | μk, σk²) = (1 / √(2πσk²)) exp(-(x - μk)² / (2σk²))


○ Matrix Representation: When x is multi-dimensional data, for the mean μk and covariance matrix ∑k of the k-th cluster


p(x | θ) = Σk=1..K αk N(x | μk, ∑k), where N(x | μk, ∑k) = (2π)^(-d/2) |∑k|^(-1/2) exp(-(1/2)(x - μk)ᵀ ∑k⁻¹ (x - μk))


③ EM algorithm (MLE estimation)

○ The EM algorithm can be used to estimate which Gaussian distribution each data point most likely belongs to

○ Objective function: Expressed as log likelihood


ln p(X | θ) = Σi=1..N ln ( Σk=1..K αk N(xi | μk, ∑k) )


Step 1: Assign random values to γi,k

Step 2: E-step: Estimate γi,k


γi,k = αk N(xi | μk, ∑k) / Σj=1..K αj N(xi | μj, ∑j)


Step 3: M-step: Calculate αk, μk, ∑k from γi,k


Nk = Σi=1..N γi,k,  αk = Nk / N,  μk = (1/Nk) Σi γi,k xi,  ∑k = (1/Nk) Σi γi,k (xi - μk)(xi - μk)ᵀ


Step 4: If γi,k has converged, stop; otherwise return to Step 2. Convergence is guaranteed because each EM iteration never decreases the log-likelihood.
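As a brief sketch, scikit-learn's GaussianMixture fits such a mixture with the EM procedure described above (the sample data and k = 3 are illustrative choices):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Sample data drawn from 3 groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# Fit a mixture of k = 3 Gaussians with EM
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
labels = gmm.fit_predict(X)      # hard assignment: most probable component
probs = gmm.predict_proba(X)     # soft assignment: responsibilities (gamma_{i,k})

print(gmm.weights_)              # estimated weights (alpha_k)
print(gmm.means_)                # estimated means (mu_k)
print(gmm.covariances_.shape)    # estimated covariances (Sigma_k)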

④ Comparison between K means clustering and Gaussian mixture model



Figure 5. Comparison of K means clustering and Gaussian mixture model



6. Type 5. Graph-Based Clustering (SOM, Self-Organizing Maps)

⑴ Overview: One of the non-hierarchical clustering analyses, like K-means clustering

① SOM was developed by Kohonen and is also known as the Kohonen map

② SOM is an artificial neural network modeled on the learning processes of the cerebral cortex and visual cortex

③ Performs clustering through self-organized (autonomous) learning

④ A type of unsupervised neural network that maps high-dimensional data into an easy-to-understand low-dimensional arrangement of neurons

⑤ The mapping preserves the positional relationships of input variables

⑥ In the actual space, if input variables are close, they are mapped close to each other on the map

⑦ Very fast, since mapping a data point requires only a single forward pass

Type 1. SOM

Component 1. Input Layer

○ Receives the input vector, contains neurons equal to the number of input variables

○ The input layer data is aligned in the competitive layer through learning, referred to as the map

○ Each neuron in the input layer is fully connected to every neuron in the competitive layer

Component 2. Competitive Layer

○ A layer arranged in a 2D grid, where input vectors cluster into a single point based on their characteristics

○ SOM uses competitive learning, where each neuron calculates its closeness to the input vector, repeatedly adjusting connection strengths

○ Through the learning process, the neuron whose connection strengths are most similar to the input pattern becomes the winner

○ Due to the winner-take-all structure, only the winning neuron appears in the competitive layer, with similar input patterns arranged in the same competitive neuron

③ Learning Algorithm

Step 1. Initialization: Initialize connection strengths for nodes in the SOM map

Step 2. Input Vector: Present the input vector

Step 3. Similarity Calculation: Calculate similarity between the input vector and prototype vectors using Euclidean distance

Step 4. Prototype Vector Search: Find the prototype vector (BMU) closest to the input vector

Step 5. Strength Readjustment: Readjust the connection strengths of the BMU and its neighbors

Step 6. Repeat: Return to step 2 and repeat

Type 2. Spectral Clustering

Step 1: Calculate the distance between the given n data points to create an n × n adjacency matrix A.

○ The Gaussian kernel is often used.

Step 2: Based on the adjacency matrix, calculate the Laplacian matrix L: Ln or Lr can also be used.

L = D - A, where Dii = ∑j Aij

Ln = D^(-1/2) L D^(-1/2) = I - D^(-1/2) A D^(-1/2)

Lt = D^(-1) A

Lr = D^(-1) L = I - Lt = D^(-1/2) Ln D^(1/2)

Step 3: Calculate the matrix of eigenvectors W = [ω2, ···, ωk+1] ∈ ℝ^(n×k) corresponding to the k smallest non-trivial eigenvalues of the Laplacian matrix.

○ The closer the eigenvalue is to 0, the more connected the graph is.

○ Exclude ω1 = 1, which corresponds to λ1 = 0, as it is trivial.

Step 4: The n data points are naturally embedded into k dimensions: Y = Wᵀ = [y1, ⋯, yn] ∈ ℝ^(k×n), with each column vector yi representing an embedded data point.

Step 5: Perform clustering, such as K-means clustering, on the column vectors of Y.
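A from-scratch sketch of these steps (NumPy + scikit-learn; the Gaussian kernel width sigma and the two-moons data are illustrative assumptions, and the trivial eigenvector is kept for simplicity):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Step 1: adjacency matrix A from pairwise distances with a Gaussian kernel
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
sigma = 0.3
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # squared pairwise distances
A = np.exp(-d2 / (2 * sigma ** 2))
np.fill_diagonal(A, 0)

# Step 2: unnormalized Laplacian L = D - A
D = np.diag(A.sum(axis=1))
L = D - A

# Step 3: eigenvectors for the k smallest eigenvalues
k = 2
eigvals, eigvecs = np.linalg.eigh(L)
W = eigvecs[:, :k]            # n x k spectral embedding

# Steps 4-5: K-means clustering on the embedded points
labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(W)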



7. Other Clustering Algorithms

⑴ SNN (Shared Nearest Neighbor) Modularity Optimization Based Clustering Algorithm

Step 1. Search for K-nearest neighbors

Step 2. Construct the SNN graph

Step 3. Optimize the modularity function to define clusters

④ Reference Paper: Waltman and van Eck (2013) The European Physical Journal B
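A rough sketch of these three steps (scikit-learn + NetworkX, using greedy modularity optimization rather than the algorithm of Waltman and van Eck; k = 20 and the blob data are illustrative choices):

import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
k = 20

# Step 1: K-nearest neighbors of each point
nbrs = NearestNeighbors(n_neighbors=k).fit(X)
_, knn = nbrs.kneighbors(X)            # knn[i] = indices of the k nearest neighbors of i
knn_sets = [set(row) for row in knn]

# Step 2: SNN graph, edge weight = number of shared nearest neighbors
G = nx.Graph()
G.add_nodes_from(range(len(X)))
for i in range(len(X)):
    for j in knn[i]:
        if j <= i:
            continue
        shared = len(knn_sets[i] & knn_sets[int(j)])
        if shared > 0:
            G.add_edge(i, int(j), weight=shared)

# Step 3: modularity optimization to define clusters
communities = greedy_modularity_communities(G, weight='weight')
labels = np.empty(len(X), dtype=int)
for c, nodes in enumerate(communities):
    labels[list(nodes)] = c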

⑵ Leiden Clustering

① The Leiden algorithm used in the scanpy pipeline


import scanpy as sc
import numpy as np

# Load your AnnData object
adata = sc.read_h5ad('my_h5ad.h5ad')

# Run PCA to reduce dimensionality
sc.pp.pca(adata, n_comps=20) 

# Compute the neighborhood graph
sc.pp.neighbors(adata)

# Run clustering using the Leiden algorithm (HERE!)
sc.tl.leiden(adata, resolution = 0.5)

# Optionally, compute UMAP for visualization
sc.tl.umap(adata)

# Plotting the results
sc.pl.umap(adata, color='leiden', title='Leiden Clustering')
sc.pl.spatial(adata, color='leiden')  # only applicable when adata contains spatial coordinates


⑶ Louvain Clustering

① Louvain Modularity Optimization: A type of graph-based method

⑷ Mean-Shift Clustering

⑸ NMF (Non-Negative Matrix Factorization)

① Algorithm for decomposing known matrix A into the product of matrices W and H: A ~ W × H

○ A Matrix: Sample-feature matrix, known from the samples

○ W Matrix: Sample-metagene weight matrix (how strongly each metagene contributes to each sample)

○ H Matrix: Metagene (latent variable)-feature matrix

○ This helps in extracting metagenes

○ Conceptually related to K-means clustering and PCA, which can also be viewed as matrix factorizations (a minimal sketch follows below)
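A minimal sketch with scikit-learn's NMF (the matrix A here is random non-negative data and only illustrates the shapes of W and H):

import numpy as np
from sklearn.decomposition import NMF

# A: non-negative sample x feature matrix (e.g., samples x genes)
rng = np.random.default_rng(0)
A = rng.random((20, 50))

# Factorize A ~ W x H with a small number of "metagenes"
model = NMF(n_components=3, init='nndsvda', random_state=0, max_iter=500)
W = model.fit_transform(A)   # samples x components (weights per sample)
H = model.components_        # components x features (metagene definitions)

print(W.shape, H.shape)           # (20, 3) (3, 50)
print(np.linalg.norm(A - W @ H))  # reconstruction error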

Application 1. Cell Type Classification

○ For determining cell type proportion from scRNA-seq obtained from tissue

○ Important to reduce the confounding effect of cell type heterogeneity

3-1. Constrained linear regression

3-2. Reference-based approach

3-2-1. CIBERSORT (cell-type identification by estimating relative subsets of RNA transcripts): Estimates cell type proportions and a p-value for each sample

Application 2. Joint NMF: Extends to multi-omics

⑹ Watershed Algorithm

① An algorithm to identify regions of interest (ROI) in images based on threshold values such as brightness levels

⑺ Thresholding Method

① Definition: A technique to create a binary image based on a threshold

② Example: Otsu thresholding method
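A small sketch of Otsu thresholding with OpenCV (the input file name is hypothetical):

import cv2

# Load a grayscale image (hypothetical file name)
img = cv2.imread('cells.png', cv2.IMREAD_GRAYSCALE)

# Otsu's method picks the threshold automatically from the image histogram
thresh_value, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(thresh_value)   # the threshold chosen by Otsu's method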

⑻ MST (Minimum Spanning Tree)

⑼ Curve Evolution

⑽ Sparse Neighboring Graph: A type of graph-based method

⑾ SC3

⑿ SIMLR

⒀ FICT

⒁ Fuzzy Clustering

⒂ C-means clustering

⒃ Faiss

① A similarity-search and dense-vector clustering library developed by Meta (Facebook AI Research).

② Fast search algorithm: capable of obtaining the nearest neighbor and the k nearest neighbors.

③ Allows for the search of multiple vectors at once (batch processing).

④ There is a trade-off between accuracy and speed.

⑤ In addition to minimizing Euclidean distance, it can also search by maximizing the inner product (maximum inner product search).

⑥ Returns all elements contained within a given radius.
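A brief usage sketch of Faiss for nearest-neighbor search and clustering (assuming the faiss package is installed; the dimensions and data are arbitrary):

import numpy as np
import faiss

d = 64                                                 # vector dimension
xb = np.random.random((10000, d)).astype('float32')   # database vectors
xq = np.random.random((5, d)).astype('float32')       # query vectors

# Exact L2 nearest-neighbor search
index = faiss.IndexFlatL2(d)
index.add(xb)                    # add the database vectors
D, I = index.search(xq, 4)       # distances and indices of the 4 nearest neighbors

# K-means clustering of the database vectors with Faiss
kmeans = faiss.Kmeans(d, 10, niter=20, seed=0)
kmeans.train(xb)
_, labels = kmeans.index.search(xb, 1)   # nearest centroid index for each vector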

⒄ mclust

⒅ PAM clustering



Entered: 2021.11.16 13:17

Edited: 2023.09.22 00:20
