Chapter 7-1. SNE, symmetric-SNE, tSNE
Recommended post: 【Algorithm】 Chapter 7. Dimensionality Reduction Algorithms
1. Overview
2. Type 1. SNE
3. Type 2. Symmetric-SNE
4. Type 3. t-SNE
1. Overview
⑴ Terminology
① Manifold: A shape in a high-dimensional space that can, in practice, be represented in a lower dimension
② Data point: Also called a high-dimensional data point
③ Map point: Also called a low-dimensional data point
⑵ Characteristics
① Introduces the concept of probability to solve the drawbacks of MDS (multidimensional scaling)
② Characteristic 1. A dimensionality reduction algorithm that tries to preserve local neighborhoods in the data: it uses a local metric that effectively ignores very distant data points
③ Characteristic 2. Nonlinear and non-deterministic
2. Type 1. SNE
⑴ Definition: A method that converts Euclidean distances in high-dimensional space into conditional probabilities that represent similarity between data points
① Euclidean distance: Distance using the Pythagorean theorem
② Mapping: A mapping that corresponds high-dimensional data points to low-dimensional data points
⑵ Conditional probabilities expressing similarity of high-dimensional data
① $x_i$: High-dimensional data point
② $x_j$: High-dimensional data point
③ $\sigma_i$: Standard deviation of the Gaussian distribution centered at $x_i$
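In the standard SNE formulation, this conditional probability is

$$ p_{j \mid i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{i \mid i} = 0 $$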
⑶ Conditional probabilities expressing similarity of low-dimensional data
① This means fixing the standard deviation of the low-dimensional Gaussian at $\sigma_y = 1/\sqrt{2} = 0.707106781$
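Since $2 \times 0.707106781^2 = 1$, the factor $2\sigma_y^2$ drops out and the low-dimensional similarity reduces to

$$ q_{j \mid i} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert y_i - y_k \rVert^2\right)}, \qquad q_{i \mid i} = 0 $$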
⑷ Main assumption
① If the mapping is good, then $p_{j \mid i} \approx q_{j \mid i}$ for all pairs of points
② Cost function: Sum of KL divergences over all data points
③ Strategy: Update Q in the direction that minimizes the cost function
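In the standard formulation this cost is the sum of per-point KL divergences:

$$ C = \sum_i KL(P_i \parallel Q_i) = \sum_i \sum_j p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}} $$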
⑸ Perplexity
① Perp($p_i$): A measure of the effective number of neighbors of data point $x_i$
[Figure 1. SNE shape according to Perp] (note that the example shown is t-SNE)
② In general, perplexity is set between 5 and 50
③ Experimentally observed to have a range of 1 ≤ perplexity ≤ 99
④ Set $\sigma_i$ differently for each $i$ so as to have a constant perplexity
○ As σ increases, perplexity increases (perplexity is an increasing function of σ)
○ Define σi using the bisection method
○ 1st. Set LEFT and RIGHT values
○ 2nd. MID ← (LEFT + RIGHT) / 2
○ 3rd. $\sigma_i$ ← MID
○ 4th. If Perp($\sigma_i$) < Perp, then LEFT ← MID; move to 2nd
○ 5th. If Perp($\sigma_i$) > Perp, then RIGHT ← MID; move to 2nd
○ 6th. After repeating N times, define the MID value as $\sigma_i$
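The steps above can be sketched in R as follows. This is a minimal illustration, not the Rtsne internals; the helper names `perp_of_sigma` and `find_sigma` are my own, and `d2` is assumed to be the vector of squared distances from $x_i$ to every other point.

```r
# Perplexity 2^H of the conditional distribution p_{j|i} for a given sigma
perp_of_sigma <- function(d2, sigma) {
  p <- exp(-d2 / (2 * sigma^2))       # unnormalized Gaussian weights
  p <- p / sum(p)                     # normalize to a probability distribution
  H <- -sum(p[p > 0] * log2(p[p > 0]))  # Shannon entropy in bits
  2^H                                 # perplexity
}

# Bisection search for sigma_i that matches a target perplexity
find_sigma <- function(d2, target_perp, left = 1e-10, right = 1e10, n_iter = 50) {
  for (t in 1:n_iter) {
    mid <- (left + right) / 2
    if (perp_of_sigma(d2, mid) < target_perp) {
      left <- mid     # perplexity too small -> increase sigma
    } else {
      right <- mid    # perplexity too large -> decrease sigma
    }
  }
  (left + right) / 2
}
```

For example, `find_sigma(d2, 30)` returns a σ whose perplexity is approximately 30; running this per point gives a different $\sigma_i$ for each $i$ with constant perplexity.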
⑹ Rate of change of the cost function
① $p_{j \mid i} - q_{j \mid i} + p_{i \mid j} - q_{i \mid j}$: Mismatch between similarity of data points and similarity of map points
② $y_i - y_j$: Distance on the map
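Combining the two factors above, the gradient of the SNE cost function is

$$ \frac{\partial C}{\partial y_i} = 2 \sum_j \left( p_{j \mid i} - q_{j \mid i} + p_{i \mid j} - q_{i \mid j} \right) \left( y_i - y_j \right) $$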
⑺ Updating algorithm
① Learning rate η: Generally around 100
② Momentum α: Introduce momentum to avoid falling into a local minimum. Around 0.5 in the early stage, around 0.8 in the later stage
③ Iteration number t: Generally around 1000
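With these hyperparameters, the update rule as printed in the original paper is

$$ \mathcal{Y}^{(t)} = \mathcal{Y}^{(t-1)} + \eta \frac{\partial C}{\partial \mathcal{Y}} + \alpha(t) \left( \mathcal{Y}^{(t-1)} - \mathcal{Y}^{(t-2)} \right) $$

where $\mathcal{Y}^{(t)}$ denotes the map points at iteration $t$ and $\alpha(t)$ is the momentum at iteration $t$.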
⑻ Application: Setting $\sigma_i$ so that Perp($p_i$) is constant in SNE
① Definition
② Updating algorithm
③ Conclusion: Since $\sigma_i$ differs on a log scale, a constant learning rate should not be applied
3. Type 2. Symmetric-SNE
⑴ Overview
① Problems of SNE: Optimization is difficult; the crowding problem occurs
② symmetric-SNE: An improved version of SNE that uses symmetric joint probabilities instead of conditional probabilities
③ t-SNE: Builds on symmetric-SNE; when computing similarity between two low-dimensional points, it uses the Student t-distribution rather than a normal distribution
⑵ Joint probability expressing similarity of high-dimensional data
⑶ Joint probability expressing similarity of low-dimensional data
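In symmetric-SNE these joint probabilities are defined as

$$ p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2n}, \qquad q_{ij} = \frac{\exp\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq l} \exp\left(-\lVert y_k - y_l \rVert^2\right)} $$

where $n$ is the number of data points; averaging the two conditional probabilities guarantees $p_{ij} = p_{ji}$.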
⑷ Cost function
① SNE strategy: Minimize the KL divergence between the conditional probability distributions $p_{j \mid i}$ and $q_{j \mid i}$
② Symmetric-SNE strategy: Minimize the KL divergence between the joint probability distributions $p_{ij}$ and $q_{ij}$
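Written out, the symmetric-SNE cost is a single KL divergence over the joint distributions:

$$ C = KL(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} $$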
⑸ Rate of change of the cost function
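Because of the symmetry $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$, the gradient takes a simpler form than in SNE:

$$ \frac{\partial C}{\partial y_i} = 4 \sum_j \left( p_{ij} - q_{ij} \right) \left( y_i - y_j \right) $$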
4. Type 3. t-SNE
⑴ Overview
① t-SNE has better expressive power than SNE (ref)
⑵ Joint probability expressing similarity of high-dimensional data: Convert to Gaussian-distribution similarity
⑶ Joint probability expressing similarity of low-dimensional data: Convert to t-distribution similarity with 1 degree of freedom
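With one degree of freedom, the Student t kernel gives

$$ q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} $$

whose heavy tails relieve the crowding problem: moderately distant pairs can be placed farther apart on the map.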
⑷ Cost function
⑸ Rate of change of the cost function
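For the t-distributed $q_{ij}$, the gradient of the KL cost becomes

$$ \frac{\partial C}{\partial y_i} = 4 \sum_j \left( p_{ij} - q_{ij} \right) \left( y_i - y_j \right) \left( 1 + \lVert y_i - y_j \rVert^2 \right)^{-1} $$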
⑹ Updating algorithm
① Learning rate η: Generally around 100
② Momentum α: Introduce momentum to avoid falling into a local minimum. Around 0.5 in the early stage, around 0.8 in the later stage
③ Iteration number t: Generally around 1000
④ Note: In the paper, there is a (+) in front of η, but I think it should be (-) when η > 0
⑺ R programming
install.packages("Rtsne")
library(Rtsne)
# data consists of only numbers; no rownames, no colnames
data <- read.csv("C:/Users/data.csv")
tsne <- Rtsne(data, dims = 2, perplexity = 30, verbose = TRUE, max_iter = 500)
# Performing PCA
# Read the 14893 x 50 data matrix successfully!
# OpenMP is working. 1 threads.
# Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
# Computing input similarities...
# Building tree...
# - point 10000 of 14893
# Done in 5.06 seconds (sparsity = 0.007533)!
# Learning embedding...
# Iteration 50: error is 103.272821 (50 iterations in 4.11 seconds)
# Iteration 100: error is 88.450651 (50 iterations in 4.85 seconds)
# Iteration 150: error is 79.372908 (50 iterations in 3.18 seconds)
# Iteration 200: error is 76.927413 (50 iterations in 3.10 seconds)
# Iteration 250: error is 75.783877 (50 iterations in 3.02 seconds)
# Iteration 300: error is 2.946416 (50 iterations in 2.96 seconds)
# Iteration 350: error is 2.515109 (50 iterations in 2.84 seconds)
# Iteration 400: error is 2.244128 (50 iterations in 2.88 seconds)
# Iteration 450: error is 2.053382 (50 iterations in 2.86 seconds)
# Iteration 500: error is 1.912981 (50 iterations in 2.90 seconds)
# Fitting performed in 32.67 seconds.
exeTimeTsne <- system.time(Rtsne(data, dims = 2, perplexity = 30, verbose = TRUE, max_iter = 500))
plot(tsne$Y, type = 'n', xlab = "Y1", ylab = "Y2", main = "tsne")
text(tsne$Y, labels = "○", col = "light blue")
write.csv(tsne$Y, "C:/Users/map.csv")
① Rtsne cannot allocate vectors larger than 2.2 GB (e.g., a 10000 × 10000 matrix)
② Processing a 1000 × 10000 matrix takes about 1 minute 30 seconds
③ ERROR message: “Remove duplicates before running TSNE”
○ Solution:
Rtsne(data, dims = 2, perplexity = 30, verbose = TRUE, max_iter = 500, check_duplicates = FALSE)
Input: 2019.10.05 17:32