Korean, Edit

Chapter 5-2. Types of Distances and Similarities

Recommended Article: 【Statistics】 Chapter 5. Statistics


1. Overview

2. Types of Norm Concepts

3. Types of Distance Concepts

4. Types of Similarity Concepts



1. Overview

⑴ The difference between norm and distance: often used interchangeably.

① norm


스크린샷 2024-10-07 오전 11 21 11


② distance function (metric)


스크린샷 2024-10-07 오전 11 21 24


③ If a norm is defined, a distance 𝑑 can be defined.


스크린샷 2024-10-07 오전 11 21 36


④ However, the existence of a corresponding norm is not guaranteed just because a distance is defined.

⑵ The difference between distance and similarity: often used interchangeably.

① Commonality: Saying two data points are close (short distance) is equivalent to saying they are similar. In other words, distance ∝ 1 / similarity.

② Commonality: In machine learning, terms like loss function, error function, and cost function also refer to the difference between the true value and the predicted value (∝ 1 / similarity).

③ Difference: While a distance function is rigorously defined in linear algebra, it’s not necessary for loss functions or similarity measures to satisfy that definition.

⑶ Various types of distances and concepts of similarity.


image

Figure 1. Various types of distances and concepts of similarity



2. Types of Norm Concepts

Type 1. L1-norm


스크린샷 2024-10-07 오전 11 22 04


Type 2. L2-norm


스크린샷 2024-10-07 오전 11 22 25


Type 3. p-norm


image


image

Figure 2. Trend of the p-norm = 1 interval based on p-value


Type 4. Frobenius norm



3. Types of Distance Concepts

Type 1. L1 Loss Function (L1-distance, MAE, city-block distance, taxicab distance, rectilinear distance, Manhattan distance, sparse learning, compressed sensing)


image


① In other words, it is a method of calculating distance by setting paths in shapes like ‘ㄱ’ and ‘ㄴ’.

② MAPE is defined similarly as follows:


스크린샷 2025-02-14 오전 10 24 34


Type 2. L2 Loss Function (L2-distance, MSE): Euclidean distance using the Pythagorean theorem (standard)


image


① PSNR (Peak Signal to Noise Ratio): the representation of MSE (Mean Squared Error) in decibels.


스크린샷 2024-11-18 오전 10 59 00


Type 3. Maximum Metric (Chebyshev Distance, Supremum Distance, Uniform Distance, Chessboard Distance, 𝑑 Distance)

① Defined as ∣∣x−y∣∣, the p-norm where p = ∞; reminiscent of a chess piece, the King.

② In distance calculations, only the dominant dimension (feature) is considered, while the rest are ignored.

③ Applications: Sample entropy in signal processing theory and robust control in control theory.

Type 4. Cross Entropy: Typically has a binary cross-entropy (BCE).


image


Type 5. Distance in Information Theory


image


Type 6. Delaunay Triangulation(Delaunay triangulation)

Type 7. Vector arithmetic

7-1. Dot Product: Vector inner product

7-2. Hadamard product: Element-wise multiplication

Type 8. Linkage Metric: Defines distance between clusters

Type 9. Hamming Distance

① Assign binary values to each data point and measure the distance between data points based on the difference in values. This is often used in information theory.

② Example: (0, 1, 1, 0, 0, 1) and (1, 1, 1, 1, 0, 0) have different values at the 1st, 4th, and 6th positions, so the Hamming distance is 3.

Type 10. Standardized Distance:

① Distance standardized by the measurement unit of the variable.

② Formula:

d(i, j)2 = (Xi - Xj)T D-1 (Xi - Xj)

○ Xi: Starting point matrix

○ Xj: Endpoint matrix

○ D: Sample variance (diagonal) matrix

Type 11. Mahalanobis Distance**


image

Figure 3. Mahalanobis Distance


① A statistical distance that considers both the standardization of variables and the correlation between variables (shape of the data distribution).

② Formula: When trying to determine the distance d between two data points Xi and Xj, the following formula is used:

d(i, j)2 = (Xi - Xj)T S-1 (Xi - Xj)

○ Xi : Starting point matrix

○ Xj : Endpoint matrix

○ S : Sample covariance matrix

○ S is a positive semi-definite matrix because (Xi - Xj)tS-1(Xi - Xj) ≥ 0 holds.

○ S can be zero if the data contains dependent variables or if the number of samples is small compared to the dimensionality of the data.

③ Advantages: Unlike Euclidean distance, it is scale-free, considers data distribution and correlation, and offers benefits such as outlier detection.

○ If Euclidean distance is a spherical distance, Mahalanobis distance is an elliptical distance.

④ Limitations: Assumes the normality of the data. The process of calculating the sample covariance matrix is computationally intensive.

⑤ Python code


import numpy as np
from scipy.linalg import inv

def mahalanobis_distance(x, y, covariance_matrix, regularization=1e-6):
    # Add regularization to the covariance matrix's diagonal
    regularized_cov = covariance_matrix + np.eye(covariance_matrix.shape[0]) * regularization
    
    x_minus_y = np.array(x) - np.array(y)
    covariance_matrix_inv = inv(regularized_cov)
    distance = np.dot(np.dot(x_minus_y, covariance_matrix_inv), x_minus_y.T)
    return np.sqrt(distance)

# Example usage:
x = [1, 2, 3]
y = [4, 5, 6]
data = np.array([x, y])
cov_matrix = np.cov(data, rowvar=False)  # Here, we're just using the covariance of x and y for simplicity

print(mahalanobis_distance(x, y, cov_matrix))


Type 12. Levenshtein Distance: An algorithm that determines how similar two strings, A and B, are to each other.

Type 13. Minkowski Distance

① Distance in m-dimensional Minkowski space.

② When m = 1, it is equivalent to Manhattan distance.

③ When m = 2, it is equivalent to Euclidean distance.

Type 14. Hausdorff Distance

① Formalization: For two sets A = {a1, …, ap} and B = {b1, …, bq},

H(A, B) = max(h(A, B), h(B, A))

② directed Hausdorff distance: The distance between the two points in A and B that are furthest apart,

h(A, B) = maxa ∈ A minb ∈ B || a - b ||

Type 15. Focal Loss

① Formalization

FL = -(1 - Pt)γ log (Pt)

Type 16. Sørensen–Dice Coefficient (Dice Distance)

① Formalization

2 | A ∩ B| / | A ∪ B |

Type 17. Gromov-Wasserstein distance (Kantorovich–Rubinstein metric, Earth Mover’s Distance, EMD)

Type 18. Jensen-Shannon distance

Type 19. total variation (TV) distance

Type 20. Kolmogorov-Sminrov distance

Type 21. Hellinger distance: Requires kernel density estimation of a probability density function.

Type 22. Huber loss function

Type 23. Bhattacharyya loss

Type 24. ELBO (evidence lower bound)

Type 25. Aitchison distance: A distance concept of simplex

Type 26. Bray Curtis distance


스크린샷 2025-01-20 오전 10 49 45


① i = one site, j = another site

② Si = numbers of species in i, Sj = numbers of species in j

③ Cij = the less number of the overlapping sites in species

Type 27. Fourier loss

Type 28. Cook’s distance: Used to identify outliers in bioinformatics.

Type 29. Levenshtein edit distance



4. Types of Similarity Concepts

Type 1. Pearson Correlation Coefficient(Pearson correlation coefficient)

① Given the standard deviations σx, σy of X and Y,


image


② Characteristics

○ Correlation between two variables measured on an interval or ratio scale

○ Applicable to continuous variables

○ Assumes normality

○ Widely used in practice

Type 2. Spearman Correlation Coefficient(Spearman correlation coefficient)

① Define x’ = rank(x) and y’ = rank(x),


image


② Characteristics

○ A method for measuring the correlation between two ordinal-scale variables

○ A non-parametric approach targeting ordinal variables

○ Suitable for data with many zeros

○ Sensitive to deviations or errors in the data

○ Produces higher values compared to Kendall’s correlation coefficient

Type 3. Kendall Correlation Coefficient(Kendall correlation coefficient)

① Define correlation for concordant and discordant pairs,


image


② Characteristics

○ A method for measuring the correlation between two ordinal-scale variables

○ A non-parametric approach designed for ordinal variables

○ Suitable for data with many zeros

○ Effective for small sample sizes or when there are many ties in the data

Type 4. Matthew correlation coefficient (MCC)


image


Type 5. χ2

① For measurement data xm, ym, and the approximating function f(x),


image


Type 6.SSIM

① Image similarity comparison algorithm


image


Python Code


def SSIM(x, y):
    # assumption : x and y are grayscale images with the same dimension

    import numpy as np
    
    def mean(img):
        return np.mean(img)
        
    def sigma(img):
        return np.std(img)
    
    def cov(img1, img2):
        img1_ = np.array(img1[:,:], dtype=np.float64)
        img2_ = np.array(img2[:,:], dtype=np.float64)
                        
        return np.mean(img1_ * img2_) - mean(img1) * mean(img2)
    
    K1 = 0.01
    K2 = 0.03
    L = 256 # when each pixel spans 0 to 255
   
    C1 = K1 * K1 * L * L
    C2 = K2 * K2 * L * L
    C3 = C2 / 2
        
    l = (2 * mean(x) * mean(y) + C1) / (mean(x)**2 + mean(y)**2 + C1)
    c = (2 * sigma(x) * sigma(y) + C2) / (sigma(x)**2 + sigma(y)**2 + C2)
    s = (cov(x, y) + C3) / (sigma(x) * sigma(y) + C3)
        
    return l * c * s


Type 7. Mutual Information

① Principle: Can the second image be predicted given the first image?

② Useful for analyzing the relationship between two images obtained from different modalities

○ Example: In MRI, T1-weighted and T2-weighted images have many inverted points; mutual information considers this.

③ Code


def mutual_information(img1, img2):
    import numpy as np
    import cv2
    import matplotlib.pyplot as plt
    
    # img1 and img2 are 3-channel color images
    
    a = img1[:,:,0:1].reshape(img1.shape[0], img1.shape[1])
    b = img2[:,:,0:1].reshape(img2.shape[0], img2.shape[1])
    
    hgram, x_edges, y_edges = np.histogram2d(
     a.ravel(),
     b.ravel(),
     bins=20
    )

    pxy = hgram / float(np.sum(hgram))
    px = np.sum(pxy, axis=1) # marginal for x over y
    py = np.sum(pxy, axis=0) # marginal for y over x
    px_py = px[:, None] * py[None, :] # Broadcast to multiply marginals

    nzs = pxy > 0 # Only non-zero pxy values contribute to the sum
    
    return np.sum(pxy[nzs] * np.log(pxy[nzs] / px_py[nzs]))


Reference

Type 8. Relative entropy(Kullback-Leibler divergence, KL divergence, KLD)


image


Type 9. Sinkhorn divergence (entropic Wasserstein distance)

Type 10. Cressie-Read power divergence

Type 11. Mr (Thresholded Mander’s Colocalization Coefficient)

① Ratio of overlapping pixels between two different monochrome images

② tMr (Thresholded Mr): Mr calculated considering values below a specific threshold as background with zero values

③ Background: Pearson correlation coefficient is not suitable for comparing monochrome images due to its negative values

Feature 1. Ranges from 0 to 1

Feature 2. Sensitive to background pixel values, but not heavily influenced by values of overlapping pixels

Feature 3. Dependent on Pearson correlation

Step 1. First, use Pearson correlation to obtain p-value and test for colocalization

Step 2. If colocalization is present, calculate tM1 and tM2 values

⑨ Usage: ImageJ

Type 12. Jaccard Similarity (IoU, intersection over union)

① Jaccard score: For two sets A and B,


image


② Takes values between 0 and 1: A value of 1 indicates that the two sets are identical, while a value of 0 indicates that they have no common elements.

③ A method used for the distance of nominal variables.

Type 13. Cosine Similarity

① Cosine value: For two vectors A and B,


image


Type 14. Euclidean Similarity

① Euclidean distance: For two vectors A and B,


image


Type 13. Coverage Score

① For two sets A and B,


image


Type 15. Coverage score

① For two sets A and B,


image


Type 16. Fisher Exact Test

① For two sets A and B,


image

image


Type 17. Faiss: Faiss is a library for efficient similarity search and clustering of dense vectors. Developed by Meta.

Type 18. Smith–Waterman similarity: Used for evaluating the similarity between nucleic acid or amino acid sequences.

Type 19. Maximal information coefficient (MIC)

Type 20. Spectral Similarity: Compares eigenvectors when comparing matrices A and B. Related to the Laplacian matrix and graph theory.

Type 21. SCC (Stratum-Adjusted Correlation Coefficient): A Pearson correlation coefficient that considers weights based on strata.



Input: 2022.08.02 16:03

Modified: 2023.08.23 14:28

results matching ""

    No results matching ""