Chapter 5-2. Types of Distances and Similarities
Recommended Article: 【Statistics】 Chapter 5. Statistics
1. Overview
2. Types of Norm Concepts
3. Types of Distance Concepts
4. Types of Similarity Concepts
1. Overview
⑴ The difference between norm and distance: often used interchangeably.
① norm
② distance function (metric)
③ If a norm is defined, a distance 𝑑 can be defined from it, e.g., d(x, y) = ||x − y|| (a numerical check is sketched below).
④ However, the converse does not hold: a distance may be defined without any corresponding norm existing (a metric need not be homogeneous or translation-invariant).
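As a minimal illustration of ③ (assuming NumPy's np.linalg.norm as the norm, with hypothetical random vectors), the induced distance d(x, y) = ||x − y|| can be checked numerically against the triangle inequality:

import numpy as np

def induced_distance(x, y, ord=2):
    # distance induced by a norm: d(x, y) = ||x - y||
    return np.linalg.norm(np.asarray(x) - np.asarray(y), ord=ord)

# numerically check the triangle inequality d(x, z) <= d(x, y) + d(y, z)
rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 5))
assert induced_distance(x, z) <= induced_distance(x, y) + induced_distance(y, z)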
⑵ The difference between distance and similarity: often used interchangeably.
① Commonality: Saying two data points are close (short distance) is equivalent to saying they are similar. In other words, distance ∝ 1 / similarity.
② Commonality: In machine learning, terms like loss function, error function, and cost function also refer to the difference between the true value and the predicted value (∝ 1 / similarity).
③ Difference: while a distance function (metric) has a rigorous mathematical definition, loss functions and similarity measures need not satisfy the metric axioms.
⑶ Various types of distances and concepts of similarity.
Figure 1. Various types of distances and concepts of similarity
2. Types of Norm Concepts
⑴ Type 1. L1-norm: ||x||₁ = Σᵢ |xᵢ|
⑵ Type 2. L2-norm: ||x||₂ = √(Σᵢ xᵢ²)
⑶ Type 3. p-norm: ||x||ₚ = (Σᵢ |xᵢ|ᵖ)^(1/p)
Figure 2. Shape of the unit ball ||x||ₚ = 1 for different values of p
⑷ Type 4. Frobenius norm: ||A||F = √(Σᵢ,ⱼ aᵢⱼ²), i.e., the L2-norm applied to all entries of a matrix (see the sketch below)
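A short sketch (using NumPy, with hypothetical example values) of how the four norms above can be computed for a vector x and a matrix A:

import numpy as np

x = np.array([3.0, -4.0, 1.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

l1  = np.sum(np.abs(x))                    # L1-norm: sum of absolute values
l2  = np.sqrt(np.sum(x ** 2))              # L2-norm: Euclidean length
p   = 3
lp  = np.sum(np.abs(x) ** p) ** (1 / p)    # p-norm (here p = 3)
fro = np.sqrt(np.sum(A ** 2))              # Frobenius norm of a matrix

# the same values via np.linalg.norm
assert np.isclose(l1,  np.linalg.norm(x, 1))
assert np.isclose(l2,  np.linalg.norm(x, 2))
assert np.isclose(fro, np.linalg.norm(A, 'fro'))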
3. Types of Distance Concepts
⑴ Type 1. L1 Loss Function (L1-distance, MAE, city-block distance, taxicab distance, rectilinear distance, Manhattan distance; widely used in sparse learning and compressed sensing)
① In other words, distance is measured along axis-aligned, right-angled paths (shapes like ‘ㄱ’ and ‘ㄴ’, as a taxi drives on a city grid) rather than along the straight line between two points.
② MAPE (mean absolute percentage error) is defined similarly: MAPE = (100/n) · Σᵢ |yᵢ − ŷᵢ| / |yᵢ| (%), i.e., the absolute error expressed as a percentage of the true value (see the sketch below).
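A small sketch of the L1 loss (MAE), the Manhattan distance, and MAPE, assuming hypothetical y_true and y_pred arrays with no zeros in y_true:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae       = np.mean(np.abs(y_true - y_pred))                   # L1 loss / MAE
manhattan = np.sum(np.abs(y_true - y_pred))                    # city-block (L1) distance
mape      = 100 * np.mean(np.abs((y_true - y_pred) / y_true))  # MAPE in percent
print(mae, manhattan, mape)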
⑵ Type 2. L2 Loss Function (L2-distance, MSE): Euclidean distance using the Pythagorean theorem (standard)
① PSNR (Peak Signal-to-Noise Ratio): the ratio of the maximum possible pixel value (peak signal) to the MSE (Mean Squared Error), expressed in decibels: PSNR = 10 · log₁₀(MAX² / MSE) (see the sketch below).
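A corresponding sketch of the L2 loss (MSE) and PSNR, assuming hypothetical 8-bit image patches so that the peak value MAX = 255:

import numpy as np

img_true = np.array([[52, 55], [61, 59]], dtype=np.float64)
img_pred = np.array([[50, 60], [60, 58]], dtype=np.float64)

mse  = np.mean((img_true - img_pred) ** 2)   # L2 loss / MSE
psnr = 10 * np.log10(255.0 ** 2 / mse)       # PSNR in decibels (MAX = 255)
print(mse, psnr)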
⑶ Type 3. Maximum Metric (Chebyshev Distance, Supremum Distance, Uniform Distance, Chessboard Distance, 𝑑∞ Distance)
① Defined as ||x − y||∞, the p-norm distance with p = ∞; the name recalls the chess king, which needs exactly max(|Δx|, |Δy|) moves to travel between two squares (hence “chessboard distance”).
② Only the dimension (feature) with the largest absolute difference determines the distance; all other dimensions are ignored (see the sketch below).
③ Applications: Sample entropy in signal processing theory and robust control in control theory.
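A sketch of the maximum metric on hypothetical vectors: only the coordinate with the largest absolute difference determines the distance (scipy.spatial.distance.chebyshev gives the same result):

import numpy as np
from scipy.spatial.distance import chebyshev

x = np.array([1.0, 4.0, 7.0])
y = np.array([2.0, 9.0, 6.0])

d_inf = np.max(np.abs(x - y))   # dominated by the 2nd coordinate (|4 - 9| = 5)
assert d_inf == chebyshev(x, y)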
⑷ Type 4. Cross-Entropy: the most common special case is binary cross-entropy (BCE), BCE = −[y·log(p) + (1 − y)·log(1 − p)] (see the sketch below).
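A minimal sketch of binary cross-entropy for labels y ∈ {0, 1} and predicted probabilities p, with clipping added to avoid log(0); the example values are hypothetical:

import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # BCE = -mean[ y*log(p) + (1 - y)*log(1 - p) ]
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))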
⑸ Type 5. Distance in Information Theory
⑹ Type 6. Delaunay Triangulation
⑺ Type 7. Vector arithmetic
① 7-1. Dot Product: Vector inner product
② 7-2. Hadamard product: Element-wise multiplication
⑻ Type 8. Linkage Metric: Defines distance between clusters
⑼ Type 9. Hamming Distance
① Each data point is represented as a string of symbols (often bits), and the distance is the number of positions at which the two strings differ. It is widely used in information and coding theory.
② Example: (0, 1, 1, 0, 0, 1) and (1, 1, 1, 1, 0, 0) have different values at the 1st, 4th, and 6th positions, so the Hamming distance is 3.
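The example in ② can be reproduced directly as a count of mismatching positions:

import numpy as np

a = np.array([0, 1, 1, 0, 0, 1])
b = np.array([1, 1, 1, 1, 0, 0])
print(np.sum(a != b))  # 3 (positions 1, 4, and 6 differ)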
⑽ Type 10. Standardized Distance:
① Distance standardized by the measurement unit of the variable.
② Formula:
d(i, j)² = (Xi − Xj)ᵀ D⁻¹ (Xi − Xj)
○ Xi: vector of the starting data point
○ Xj: vector of the ending data point
○ D: diagonal matrix of the sample variances (see the sketch below)
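A sketch of the standardized distance, assuming each row of a hypothetical data matrix is one observation so that D is the diagonal matrix of sample variances:

import numpy as np

def standardized_distance(xi, xj, data):
    # D: diagonal matrix of the sample variances of each variable
    D_inv = np.diag(1.0 / np.var(data, axis=0, ddof=1))
    diff = np.asarray(xi) - np.asarray(xj)
    return np.sqrt(diff @ D_inv @ diff)

data = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 50.0], [4.0, 20.0]])
print(standardized_distance(data[0], data[2], data))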
⑾ Type 11. Mahalanobis Distance
Figure 3. Mahalanobis Distance
① A statistical distance that considers both the standardization of variables and the correlation between variables (shape of the data distribution).
② Formula: When trying to determine the distance d between two data points Xi and Xj, the following formula is used:
d(i, j)² = (Xi − Xj)ᵀ S⁻¹ (Xi − Xj)
○ Xi: vector of the starting data point
○ Xj: vector of the ending data point
○ S: sample covariance matrix
○ Since S (and hence S⁻¹) is positive semi-definite, (Xi − Xj)ᵀ S⁻¹ (Xi − Xj) ≥ 0 holds.
○ S can be singular (non-invertible) if the data contain linearly dependent variables or if the number of samples is small relative to the dimensionality.
③ Advantages: unlike Euclidean distance, it is scale-invariant, accounts for the shape of the data distribution and the correlation between variables, and is useful for outlier detection.
○ If Euclidean distance is a spherical distance, Mahalanobis distance is an elliptical distance.
④ Limitations: Assumes the normality of the data. The process of calculating the sample covariance matrix is computationally intensive.
⑤ Python code
import numpy as np
from scipy.linalg import inv

def mahalanobis_distance(x, y, covariance_matrix, regularization=1e-6):
    # Add a small regularization term to the diagonal so the covariance matrix is invertible
    regularized_cov = covariance_matrix + np.eye(covariance_matrix.shape[0]) * regularization
    x_minus_y = np.array(x) - np.array(y)
    covariance_matrix_inv = inv(regularized_cov)
    distance = np.dot(np.dot(x_minus_y, covariance_matrix_inv), x_minus_y.T)
    return np.sqrt(distance)

# Example usage:
x = [1, 2, 3]
y = [4, 5, 6]
data = np.array([x, y])
cov_matrix = np.cov(data, rowvar=False)  # Here, we're just using the covariance of x and y for simplicity
print(mahalanobis_distance(x, y, cov_matrix))
⑿ Type 12. Levenshtein Distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn string A into string B; a measure of how similar the two strings are (see the sketch below).
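A minimal dynamic-programming sketch of the Levenshtein (edit) distance, counting insertions, deletions, and substitutions; the example strings are hypothetical:

def levenshtein(a, b):
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(a)][len(b)]

print(levenshtein("kitten", "sitting"))  # 3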
⒀ Type 13. Minkowski Distance
① A generalization of the L1 and L2 distances: d(x, y) = (Σᵢ |xᵢ − yᵢ|^m)^(1/m) for order m ≥ 1 (see the sketch after this list).
② When m = 1, it is equivalent to Manhattan distance.
③ When m = 2, it is equivalent to Euclidean distance.
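A sketch of the Minkowski distance showing that m = 1 and m = 2 recover the Manhattan and Euclidean distances; the example vectors are hypothetical:

import numpy as np

def minkowski(x, y, m):
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** m) ** (1.0 / m)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(x, y, 1))  # Manhattan distance: 7.0
print(minkowski(x, y, 2))  # Euclidean distance: 5.0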
⒁ Type 14. Hausdorff Distance
① Formalization: For two sets A = {a1, …, ap} and B = {b1, …, bq},
H(A, B) = max(h(A, B), h(B, A))
② Directed Hausdorff distance: the largest distance from a point in A to its nearest point in B (see the sketch below),
h(A, B) = max_{a ∈ A} min_{b ∈ B} ||a − b||
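A brute-force sketch of the directed and symmetric Hausdorff distances for two small hypothetical point sets (scipy.spatial.distance.directed_hausdorff provides an optimized version):

import numpy as np

def directed_hausdorff_bf(A, B):
    # h(A, B) = max over a in A of the distance to a's nearest point in B
    return max(min(np.linalg.norm(a - b) for b in B) for a in A)

def hausdorff(A, B):
    return max(directed_hausdorff_bf(A, B), directed_hausdorff_bf(B, A))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 1.0], [3.0, 0.0]])
print(hausdorff(A, B))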
⒂ Type 15. Focal Loss
① Formalization
FL = −(1 − pₜ)^γ · log(pₜ)
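A sketch of the focal loss for binary classification, where pₜ is the predicted probability of the true class and γ down-weights easy examples (γ = 0 reduces to cross-entropy); the example values are hypothetical:

import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)   # probability assigned to the true class
    return -np.mean((1 - p_t) ** gamma * np.log(p_t))

print(focal_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))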
⒃ Type 16. Sørensen–Dice Coefficient (Dice Distance)
① Formalization
2 |A ∩ B| / (|A| + |B|) (see the sketch below)
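A sketch of the Dice coefficient for two hypothetical sets; 1 − Dice is the corresponding Dice distance:

def dice_coefficient(A, B):
    A, B = set(A), set(B)
    return 2 * len(A & B) / (len(A) + len(B))

print(dice_coefficient({1, 2, 3, 4}, {3, 4, 5}))  # 2*2 / (4+3) ≈ 0.571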
⒄ Type 17. Wasserstein Distance (Kantorovich–Rubinstein metric, Earth Mover’s Distance, EMD)
⒅ Type 18. Jensen-Shannon distance
⒆ Type 19. total variation (TV) distance
⒇ Type 20. Kolmogorov–Smirnov distance
⒇ Type 21. Hellinger distance: defined between probability distributions; when only samples are available, the densities are typically estimated first (e.g., by kernel density estimation).
⒇ Type 22. Huber loss function
⒇ Type 23. Bhattacharyya loss
⒇ Type 24. ELBO (evidence lower bound)
⒇ Type 25. Aitchison distance: a distance defined on the simplex, used for compositional data.
⒇ Type 26. Bray–Curtis distance: BCij = 1 − 2Cij / (Si + Sj) (see the sketch after this list)
① i = one site, j = another site
② Si = total count of specimens at site i, Sj = total count of specimens at site j
③ Cij = sum of the lesser counts of the species found at both sites
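A sketch of the Bray–Curtis dissimilarity from hypothetical species-count vectors at two sites (scipy.spatial.distance.braycurtis gives an equivalent result for non-negative counts):

import numpy as np

def bray_curtis(counts_i, counts_j):
    counts_i, counts_j = np.asarray(counts_i), np.asarray(counts_j)
    C_ij = np.sum(np.minimum(counts_i, counts_j))   # sum of the lesser counts of shared species
    S_i, S_j = counts_i.sum(), counts_j.sum()       # total counts at each site
    return 1 - 2 * C_ij / (S_i + S_j)

print(bray_curtis([6, 7, 4], [10, 0, 6]))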
⒇ Type 27. Fourier loss
⒇ Type 28. Cook’s distance: measures how much a single observation influences a regression fit; used, for example, to flag outlier samples in bioinformatics.
⒇ Type 29. Levenshtein edit distance
4. Types of Similarity Concepts
⑴ Type 1. Pearson Correlation Coefficient
① Given the standard deviations σx, σy of X and Y, ρ(X, Y) = Cov(X, Y) / (σx σy).
② Characteristics
○ Correlation between two variables measured on an interval or ratio scale
○ Applicable to continuous variables
○ Assumes normality
○ Widely used in practice
⑵ Type 2. Spearman Correlation Coefficient
① Define x’ = rank(x) and y’ = rank(y); the Spearman coefficient is the Pearson correlation coefficient computed on x’ and y’.
② Characteristics
○ A method for measuring the correlation between two ordinal-scale variables
○ A non-parametric approach targeting ordinal variables
○ Suitable for data with many zeros
○ Sensitive to deviations or errors in the data
○ Produces higher values compared to Kendall’s correlation coefficient
⑶ Type 3. Kendall Correlation Coefficient
① Counting concordant and discordant pairs among all n(n − 1)/2 pairs of observations, τ = (number of concordant pairs − number of discordant pairs) / (n(n − 1)/2) (see the code sketch after this list).
② Characteristics
○ A method for measuring the correlation between two ordinal-scale variables
○ A non-parametric approach designed for ordinal variables
○ Suitable for data with many zeros
○ Effective for small sample sizes or when there are many ties in the data
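The three correlation coefficients above are available in scipy.stats; a minimal sketch with hypothetical data:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

r, _   = stats.pearsonr(x, y)    # linear correlation of the raw values
rho, _ = stats.spearmanr(x, y)   # Pearson correlation of the ranks
tau, _ = stats.kendalltau(x, y)  # based on concordant/discordant pairs
print(r, rho, tau)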
⑷ Type 4. Matthews correlation coefficient (MCC)
⑸ Type 5. χ²
① For measurement data (xm, ym) and an approximating function f(x), χ² = Σm (ym − f(xm))² / σm², where σm is the measurement uncertainty (often taken as 1 when unknown).
⑹ Type 6. SSIM (Structural Similarity Index Measure)
① Image similarity comparison algorithm
import numpy as np

def SSIM(x, y):
    # assumption: x and y are grayscale images with the same dimensions
    def mean(img):
        return np.mean(img)

    def sigma(img):
        return np.std(img)

    def cov(img1, img2):
        img1_ = np.array(img1[:, :], dtype=np.float64)
        img2_ = np.array(img2[:, :], dtype=np.float64)
        return np.mean(img1_ * img2_) - mean(img1) * mean(img2)

    K1 = 0.01
    K2 = 0.03
    L = 255  # dynamic range of pixel values (0 to 255)
    C1 = K1 * K1 * L * L
    C2 = K2 * K2 * L * L
    C3 = C2 / 2

    l = (2 * mean(x) * mean(y) + C1) / (mean(x)**2 + mean(y)**2 + C1)      # luminance term
    c = (2 * sigma(x) * sigma(y) + C2) / (sigma(x)**2 + sigma(y)**2 + C2)  # contrast term
    s = (cov(x, y) + C3) / (sigma(x) * sigma(y) + C3)                      # structure term
    return l * c * s
⑺ Type 7. Mutual Information
① Principle: Can the second image be predicted given the first image?
② Useful for analyzing the relationship between two images obtained from different modalities
○ Example: In MRI, T1-weighted and T2-weighted images have many inverted points; mutual information considers this.
③ Code
import numpy as np

def mutual_information(img1, img2):
    # img1 and img2 are 3-channel color images; use the first channel of each
    a = img1[:, :, 0]
    b = img2[:, :, 0]

    # joint histogram of the two images
    hgram, x_edges, y_edges = np.histogram2d(a.ravel(), b.ravel(), bins=20)

    pxy = hgram / float(np.sum(hgram))  # joint probability distribution
    px = np.sum(pxy, axis=1)            # marginal for x over y
    py = np.sum(pxy, axis=0)            # marginal for y over x
    px_py = px[:, None] * py[None, :]   # broadcast to multiply marginals
    nzs = pxy > 0                       # only non-zero pxy values contribute to the sum
    return np.sum(pxy[nzs] * np.log(pxy[nzs] / px_py[nzs]))
⑻ Type 8. Relative Entropy (Kullback–Leibler divergence, KL divergence, KLD): DKL(P || Q) = Σᵢ P(i) log(P(i) / Q(i)) (see the sketch below)
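A sketch of the KL divergence between two hypothetical discrete distributions p and q (scipy.stats.entropy(p, q) computes the same quantity in nats):

import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl = np.sum(p * np.log(p / q))   # D_KL(p || q), assuming q > 0 wherever p > 0
assert np.isclose(kl, entropy(p, q))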
⑼ Type 9. Sinkhorn divergence (entropic Wasserstein distance)
⑽ Type 10. Cressie-Read power divergence
⑾ Type 11. Mr (Manders’ Colocalization Coefficient)
① Ratio of overlapping pixels between two different monochrome images
② tMr (thresholded Mr): Mr calculated after treating pixel values below a chosen threshold as background (set to zero)
③ Background: the Pearson correlation coefficient is not well suited to comparing monochrome images because it can take negative values
④ Feature 1. Ranges from 0 to 1
⑤ Feature 2. Sensitive to background pixel values, but not heavily influenced by values of overlapping pixels
⑥ Feature 3. Dependent on Pearson correlation
⑦ Step 1. First, use Pearson correlation to obtain p-value and test for colocalization
⑧ Step 2. If colocalization is present, calculate tM1 and tM2 values
⑨ Usage: ImageJ
⑿ Type 12. Jaccard Similarity (IoU, intersection over union)
① Jaccard score: for two sets A and B, J(A, B) = |A ∩ B| / |A ∪ B| (see the sketch below).
② Takes values between 0 and 1: A value of 1 indicates that the two sets are identical, while a value of 0 indicates that they have no common elements.
③ Commonly used as a distance measure (1 − J) for nominal or binary variables.
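A sketch of the Jaccard similarity for two hypothetical sets; as IoU, the same ratio is computed over the intersection and union areas of bounding boxes or masks:

def jaccard(A, B):
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 / 4 = 0.5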
⒀ Type 13. Cosine Similarity
① Cosine value: for two vectors A and B, cos θ = (A · B) / (||A|| ||B||) (see the sketch below).
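A sketch of the cosine similarity for two hypothetical vectors:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0 (parallel vectors)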
⒁ Type 14. Euclidean Similarity
① Euclidean distance: for two vectors A and B, d(A, B) = ||A − B||₂ = √(Σᵢ (Aᵢ − Bᵢ)²); the similarity is often taken as 1 / (1 + d(A, B)).
⒂ Type 15. Coverage score
① For two sets A and B,
⒃ Type 16. Fisher Exact Test
① For two sets A and B,
⒄ Type 17. Faiss: a library for efficient similarity search and clustering of dense vectors, developed by Meta.
⒅ Type 18. Smith–Waterman similarity: Used for evaluating the similarity between nucleic acid or amino acid sequences.
⒆ Type 19. Maximal information coefficient (MIC)
⒇ Type 20. Spectral Similarity: compares the spectra (eigenvalues/eigenvectors) of matrices A and B; related to the Laplacian matrix and graph theory.
⒇ Type 21. SCC (Stratum-Adjusted Correlation Coefficient): A Pearson correlation coefficient that considers weights based on strata.
Input: 2022.08.02 16:03
Modified: 2023.08.23 14:28