Chapter 7-1. SNE, symmetric-SNE, tSNE
Recommended post: 【Algorithm】 Chapter 7. Dimensionality Reduction Algorithms
1. Overview
2. Type 1. SNE
3. Type 2. Symmetric-SNE
4. Type 3. t-SNE
1. Overview
⑴ Terminology
① Manifold: A shape in a high-dimensional space that can, in practice, be represented in a lower dimension
② Data point: Also called a high-dimensional data point
③ Map point: Also called a low-dimensional data point
⑵ Characteristics
① Introduces the concept of probability to solve the drawbacks of MDS (multidimensional scaling)
② Characteristic 1. A dimensionality reduction algorithm that tries to preserve local neighborhoods in the data: it uses a local metric that effectively ignores very distant data points
③ Characteristic 2. Nonlinear and non-deterministic
2. Type 1. SNE
⑴ Definition: A method that converts Euclidean distances in high-dimensional space into conditional probabilities that represent similarity between data points
① Euclidean distance: Distance using the Pythagorean theorem
② Mapping: A mapping that corresponds high-dimensional data points to low-dimensional data points
⑵ Conditional probabilities expressing similarity of high-dimensional data
① $x_i$: High-dimensional data point
② $x_j$: High-dimensional data point
③ $\sigma_i$: Standard deviation of the Gaussian distribution centered at $x_i$
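In the standard SNE formulation, this conditional probability is

$$ p_{j \mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{i \mid i} = 0 $$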
⑶ Conditional probabilities expressing similarity of low-dimensional data
① This means fixing the standard deviation of the low-dimensional Gaussian at $\sigma_y = 1/\sqrt{2} = 0.707106781$
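Since $2 \times 0.707106781^2 = 1$, the factor $2\sigma_y^2$ drops out and the low-dimensional similarity reduces to

$$ q_{j \mid i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)}, \qquad q_{i \mid i} = 0 $$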
⑷ Main assumption
① If the mapping is good, then $p_{j \mid i} \approx q_{j \mid i}$ for all pairs of points
② Cost function: Sum of KL divergences over all data points
③ Strategy: Update Q in the direction that minimizes the cost function
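In the standard formulation this cost is the sum of per-point KL divergences:

$$ C = \sum_i KL(P_i \parallel Q_i) = \sum_i \sum_j p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}} $$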
⑸ Perplexity
① Perp($p_i$): A measure of the effective number of neighbors of data point $x_i$
[Figure 1. SNE shape according to Perp] (note that the example shown is t-SNE)
② In general, perplexity is set between 5 and 50
③ Experimentally observed to have a range of 1 ≤ perplexity ≤ 99
④ Set $\sigma_i$ differently for each $i$ so as to have a constant perplexity
○ As σ increases, perplexity increases (perplexity is an increasing function of σ)
○ Define σi using the bisection method
○ 1st. Set LEFT and RIGHT values
○ 2nd. MID ← (LEFT + RIGHT) / 2
○ 3rd. $\sigma_i$ ← MID
○ 4th. If Perp($\sigma_i$) < Perp, then LEFT ← MID; move to 2nd
○ 5th. If Perp($\sigma_i$) > Perp, then RIGHT ← MID; move to 2nd
○ 6th. After repeating N times, define the MID value as $\sigma_i$
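The steps above can be sketched in R as follows. This is a minimal illustration, not the Rtsne internals; the helper names `perp_of_sigma` and `find_sigma` are my own, and `d2` is assumed to be the vector of squared distances from $x_i$ to every other point.

```r
# Perplexity 2^H of the conditional distribution p_{j|i} for a given sigma
perp_of_sigma <- function(d2, sigma) {
  p <- exp(-d2 / (2 * sigma^2))       # unnormalized Gaussian weights
  p <- p / sum(p)                     # normalize to a probability distribution
  H <- -sum(p[p > 0] * log2(p[p > 0]))  # Shannon entropy in bits
  2^H                                 # perplexity
}

# Bisection search for sigma_i that matches a target perplexity
find_sigma <- function(d2, target_perp, left = 1e-10, right = 1e10, n_iter = 50) {
  for (t in 1:n_iter) {
    mid <- (left + right) / 2
    if (perp_of_sigma(d2, mid) < target_perp) {
      left <- mid     # perplexity too small -> increase sigma
    } else {
      right <- mid    # perplexity too large -> decrease sigma
    }
  }
  (left + right) / 2
}
```

For example, `find_sigma(d2, 30)` returns a σ whose perplexity is approximately 30; running this per point gives a different $\sigma_i$ for each $i$ with constant perplexity.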
⑹ Rate of change of the cost function
① $p_{j \mid i} - q_{j \mid i} + p_{i \mid j} - q_{i \mid j}$: Mismatch between similarity of data points and similarity of map points
② $y_i - y_j$: Distance on the map
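Combining the two factors above, the gradient of the SNE cost function is

$$ \frac{\partial C}{\partial y_i} = 2 \sum_j \left( p_{j \mid i} - q_{j \mid i} + p_{i \mid j} - q_{i \mid j} \right) \left( y_i - y_j \right) $$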
⑺ Updating algorithm
① Learning rate η: Generally around 100
② Momentum α: Introduce momentum to avoid falling into a local minimum. Around 0.5 in the early stage, around 0.8 in the later stage
③ Iteration number t: Generally around 1000
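With these hyperparameters, the update rule as printed in the original paper is

$$ \mathcal{Y}^{(t)} = \mathcal{Y}^{(t-1)} + \eta \frac{\partial C}{\partial \mathcal{Y}} + \alpha(t) \left( \mathcal{Y}^{(t-1)} - \mathcal{Y}^{(t-2)} \right) $$

where $\mathcal{Y}^{(t)}$ denotes the map points at iteration $t$ and $\alpha(t)$ is the momentum at iteration $t$.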
⑻ Application: Setting $\sigma_i$ so that Perp($p_i$) is constant in SNE
① Definition
② Updating algorithm
③ Conclusion: Since $\sigma_i$ differs on a log scale, a constant learning rate should not be applied
3. Type 2. Symmetric-SNE
⑴ Overview
① Problems of SNE: Optimization is difficult; the crowding problem occurs
② symmetric-SNE: An improved version of SNE that uses symmetric joint probabilities instead of conditional probabilities
③ t-SNE: Builds on symmetric-SNE; when computing similarity between two low-dimensional points, it uses the Student t-distribution rather than a normal distribution
⑵ Joint probability expressing similarity of high-dimensional data
⑶ Joint probability expressing similarity of low-dimensional data
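In symmetric-SNE these joint probabilities are defined as

$$ p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n}, \qquad q_{ij} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq l} \exp\left(-\lVert y_k - y_l \rVert^2\right)} $$

where $n$ is the number of data points; averaging the two conditional probabilities guarantees $p_{ij} = p_{ji}$.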
⑷ Cost function
① SNE strategy: Minimize the KL divergence between the conditional probability distributions $p_{j \mid i}$ and $q_{j \mid i}$
② Symmetric-SNE strategy: Minimize the KL divergence between the joint probability distributions $p_{ij}$ and $q_{ij}$
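Written out, the symmetric-SNE cost is a single KL divergence over the joint distributions:

$$ C = KL(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} $$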
⑸ Rate of change of the cost function
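Because of the symmetry $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$, the gradient takes a simpler form than in SNE:

$$ \frac{\partial C}{\partial y_i} = 4 \sum_j \left( p_{ij} - q_{ij} \right) \left( y_i - y_j \right) $$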
4. Type 3. t-SNE
⑴ Overview
① t-SNE has better expressive power than SNE (ref)
⑵ Joint probability expressing similarity of high-dimensional data: Convert to Gaussian-distribution similarity
⑶ Joint probability expressing similarity of low-dimensional data: Convert to t-distribution similarity with 1 degree of freedom
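With one degree of freedom, the Student t kernel gives

$$ q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} $$

whose heavy tails relieve the crowding problem: moderately distant pairs can be placed farther apart on the map.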
⑷ Cost function
⑸ Rate of change of the cost function
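For the t-distributed $q_{ij}$, the gradient of the KL cost becomes

$$ \frac{\partial C}{\partial y_i} = 4 \sum_j \left( p_{ij} - q_{ij} \right) \left( y_i - y_j \right) \left( 1 + \lVert y_i - y_j \rVert^2 \right)^{-1} $$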
⑹ Updating algorithm
① Learning rate η: Generally around 100
② Momentum α: Introduce momentum to avoid falling into a local minimum. Around 0.5 in the early stage, around 0.8 in the later stage
③ Iteration number t: Generally around 1000
④ Note: In the paper, there is a (+) in front of η, but I think it should be (-) when η > 0
⑺ R programming
install.packages("Rtsne")
library(Rtsne)
# data consists of only numbers; no rownames, no colnames
data <- read.csv("C:/Users/data.csv")
tsne <- Rtsne(data, dims = 2, perplexity = 30, verbose = TRUE, max_iter = 500)
# Performing PCA
# Read the 14893 x 50 data matrix successfully!
# OpenMP is working. 1 threads.
# Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
# Computing input similarities...
# Building tree...
# - point 10000 of 14893
# Done in 5.06 seconds (sparsity = 0.007533)!
# Learning embedding...
# Iteration 50: error is 103.272821 (50 iterations in 4.11 seconds)
# Iteration 100: error is 88.450651 (50 iterations in 4.85 seconds)
# Iteration 150: error is 79.372908 (50 iterations in 3.18 seconds)
# Iteration 200: error is 76.927413 (50 iterations in 3.10 seconds)
# Iteration 250: error is 75.783877 (50 iterations in 3.02 seconds)
# Iteration 300: error is 2.946416 (50 iterations in 2.96 seconds)
# Iteration 350: error is 2.515109 (50 iterations in 2.84 seconds)
# Iteration 400: error is 2.244128 (50 iterations in 2.88 seconds)
# Iteration 450: error is 2.053382 (50 iterations in 2.86 seconds)
# Iteration 500: error is 1.912981 (50 iterations in 2.90 seconds)
# Fitting performed in 32.67 seconds.
exeTimeTsne <- system.time(Rtsne(data, dims = 2, perplexity = 30, verbose = TRUE, max_iter = 500))
plot(tsne$Y, type = 'n', xlab = "Y1", ylab = "Y2", main = "tsne")
text(tsne$Y, labels = "○", col = "light blue")
write.csv(tsne$Y, "C:/Users/map.csv")
① Rtsne cannot allocate vectors larger than 2.2 GB (e.g., a 10000 × 10000 matrix)
② Processing a 1000 × 10000 matrix takes about 1 minute 30 seconds
③ ERROR message: “Remove duplicates before running TSNE”
○ Solution:
Rtsne(data, dims = 2, perplexity = 30, verbose = TRUE, max_iter = 500, check_duplicates = FALSE)
Input: 2019.10.05 17:32