Chapter 5-2. Types of Distances and Similarities

Recommended Article: 【Statistics】 Chapter 5. Statistics


1. Overview

2. Types of Distance Concepts

3. Types of Similarity Concepts



1. Overview

⑴ The concepts of distance and similarity are often used interchangeably.

① Commonality: Saying two data points are close (short distance) is equivalent to saying they are similar. In other words, distance ∝ 1 / similarity.

② Commonality: In machine learning, terms like loss function, error function, and cost function also refer to the difference between the true value and the predicted value (∝ 1 / similarity).

③ Difference: While a distance function (metric) has a rigorous mathematical definition, loss functions and similarity measures need not satisfy it; the KL divergence, for example, is asymmetric and therefore not a true metric.


image


⑵ The distinction between the following distance concepts and similarity concepts is not rigorous.

⑶ Various types of distances and concepts of similarity.


image

Figure 1. Various types of distances and concepts of similarity



2. Types of Distance Concepts

Type 1. L1 Loss Function (L1-norm, MAE)


image


1-1. Manhattan Distance: Calculates distance along axis-aligned, right-angled paths (shaped like the Korean letters ‘ㄱ’ and ‘ㄴ’) rather than along a straight line.

Type 2. L2 Loss Function (L2-norm, MSE): Euclidean distance computed via the Pythagorean theorem (the standard choice)


image
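

To make these concrete, here is a minimal NumPy sketch computing the L1 (Manhattan) and L2 (Euclidean) distances between two arbitrary example points:


import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 3.0])

l1 = np.sum(np.abs(p - q))         # Manhattan distance (L1-norm): 5.0
l2 = np.sqrt(np.sum((p - q)**2))   # Euclidean distance (L2-norm): ≈ 3.606
print(l1, l2)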


Type 3. Cross Entropy: The binary cross-entropy (BCE) is the most common special case.


image
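

A minimal NumPy sketch of binary cross-entropy, assuming y holds 0/1 labels and p holds predicted probabilities (both arrays are illustrative):


import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.6])
print(binary_cross_entropy(y, p))  # ≈ 0.236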


Type 4. KL Distance (Kullback-Leibler divergence, KLD)


image
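

scipy.stats.entropy computes the KL divergence when a second distribution is passed; the sketch below also evaluates the formula directly and shows that KLD is not symmetric (the two distributions are arbitrary):


import numpy as np
from scipy.stats import entropy

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.3, 0.3, 0.4])

print(np.sum(p * np.log(p / q)))  # direct formula: sum of p * log(p / q)
print(entropy(p, q))              # same value
print(entropy(q, p))              # differs: KLD is asymmetric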


Type 5. Delaunay Triangulation

Type 6. p-norm


image
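

np.linalg.norm exposes the order parameter directly, so the p-norm family (with Manhattan and Euclidean as special cases) can be sketched as:


import numpy as np

v = np.array([3.0, -4.0])

print(np.linalg.norm(v, ord=1))       # 7.0 (p = 1: Manhattan)
print(np.linalg.norm(v, ord=2))       # 5.0 (p = 2: Euclidean)
print(np.linalg.norm(v, ord=np.inf))  # 4.0 (p → ∞: maximum absolute component)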


Type 7. Dot Product: Vector inner product

Type 8. Linkage Metric: Defines the distance between clusters (e.g., single, complete, or average linkage in hierarchical clustering)

Type 9. Hamming Distance: The number of positions at which two equal-length sequences (e.g., binary strings) differ, as sketched below.
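
A minimal sketch with two arbitrary bit strings:


def hamming(a, b):
    # count the positions at which the two equal-length sequences differ
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming("10110", "11100"))  # 2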

Type 10. Standardized Distance:

① Distance in which each variable is scaled by its own variance, removing the effect of measurement units.

② Formula:

d(i, j)² = (Xᵢ − Xⱼ)ᵀ D⁻¹ (Xᵢ − Xⱼ)

○ Xᵢ: starting point (vector)

○ Xⱼ: endpoint (vector)

○ D: diagonal matrix of sample variances
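
A NumPy sketch of the formula, using a small made-up sample matrix (rows = observations, columns = variables):


import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [3.0, 20.0],
              [4.0, 40.0]])

D_inv = np.diag(1.0 / np.var(X, axis=0, ddof=1))  # inverse of the diagonal variance matrix

diff = X[0] - X[1]
d2 = diff @ D_inv @ diff  # (Xi - Xj)^T D^{-1} (Xi - Xj)
print(np.sqrt(d2))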

Type 11. Mahalanobis Distance


image

Figure 2. Mahalanobis Distance


① A statistical distance that considers both the standardization of variables and the correlation between variables (shape of the data distribution).

② Formula: the distance d between two data points Xᵢ and Xⱼ is given by:

d(i, j)² = (Xᵢ − Xⱼ)ᵀ S⁻¹ (Xᵢ − Xⱼ)

○ Xᵢ: starting point (vector)

○ Xⱼ: endpoint (vector)

○ S: sample covariance matrix

③ Advantages: Unlike Euclidean distance, it is scale-invariant, accounts for correlations between variables, and is useful for tasks such as outlier detection.

④ Limitations: Assumes the normality of the data. The process of calculating the sample covariance matrix is computationally intensive.

⑤ Python code


import numpy as np
from scipy.linalg import inv

def mahalanobis_distance(x, y, covariance_matrix, regularization=1e-6):
    # Add regularization to the covariance matrix's diagonal
    regularized_cov = covariance_matrix + np.eye(covariance_matrix.shape[0]) * regularization
    
    x_minus_y = np.array(x) - np.array(y)
    covariance_matrix_inv = inv(regularized_cov)
    distance = np.dot(np.dot(x_minus_y, covariance_matrix_inv), x_minus_y.T)
    return np.sqrt(distance)

# Example usage:
x = [1, 2, 3]
y = [4, 5, 6]
data = np.array([x, y])
cov_matrix = np.cov(data, rowvar=False)  # with only two samples this matrix is singular; the regularization above keeps it invertible

print(mahalanobis_distance(x, y, cov_matrix))


Type 12. Levenshtein Distance (edit distance): The minimum number of single-character insertions, deletions, and substitutions needed to turn string A into string B.
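
One standard dynamic-programming sketch (several equivalent formulations exist):


def levenshtein(a, b):
    # prev[j] holds the edit distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3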

Type 13. Minkowski Distance

① A family of distances with order parameter m: d(x, y) = (Σᵢ |xᵢ − yᵢ|ᵐ)^(1/m).

② When m = 1, it is equivalent to Manhattan distance.

③ When m = 2, it is equivalent to Euclidean distance.
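
SciPy provides this directly; the sketch below checks the m = 1 and m = 2 special cases against the Manhattan and Euclidean results:


import numpy as np
from scipy.spatial.distance import minkowski

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

print(minkowski(u, v, p=1))  # 7.0 (Manhattan)
print(minkowski(u, v, p=2))  # 5.0 (Euclidean)
print(minkowski(u, v, p=3))  # (3**3 + 4**3) ** (1/3) ≈ 4.498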

Type 14. Hausdorff Distance

① Formalization: For two point sets A = {a_1, …, a_p} and B = {b_1, …, b_q},

H(A, B) = max(h(A, B), h(B, A))

② Directed Hausdorff distance: the greatest distance from a point in A to its nearest point in B,

h(A, B) = max_{a ∈ A} min_{b ∈ B} ‖a − b‖
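
scipy.spatial.distance.directed_hausdorff computes h(A, B); taking the maximum of both directions gives H(A, B). The point sets below are arbitrary:


import numpy as np
from scipy.spatial.distance import directed_hausdorff

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 1.0], [2.0, 0.0]])

h_ab = directed_hausdorff(A, B)[0]  # h(A, B)
h_ba = directed_hausdorff(B, A)[0]  # h(B, A)
print(max(h_ab, h_ba))              # H(A, B)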

Type 15. Focal Loss

① Formalization

FL = −(1 − pₜ)^γ log(pₜ)

○ pₜ: predicted probability of the true class; γ ≥ 0: focusing parameter
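
A minimal NumPy sketch for the binary case (labels and probabilities are illustrative):


import numpy as np

def focal_loss(y, p, gamma=2.0, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)  # probability assigned to the true class
    return np.mean(-(1 - p_t) ** gamma * np.log(p_t))

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.3])
print(focal_loss(y, p))  # hard examples (low p_t) dominate the average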

Type 16. Sørensen–Dice Coefficient (Dice Distance)

① Formalization

2 |A ∩ B| / (|A| + |B|)
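
A one-line set sketch (the example sets are arbitrary):


A = {1, 2, 3, 4}
B = {3, 4, 5}
print(2 * len(A & B) / (len(A) + len(B)))  # 2*2 / (4+3) ≈ 0.571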

Type 17. Wasserstein Distance (Kantorovich–Rubinstein metric, Earth Mover’s Distance, EMD)

Type 18. Sinkhorn divergence

Type 19. Cressie-Read power divergence

Type 20. Jensen-Shannon distance

Type 21. Total variation (TV) distance

Type 22. Kolmogorov-Smirnov distance

Type 23. Hellinger distance: requires estimating the probability density functions (e.g., by kernel density estimation) when only samples are available.

Type 24. Huber loss function

Type 25. Bhattacharyya loss

Type 26. evidence lower bound (ELBO)

Type 27. Aitchison distance: Concept of distance in a simplex.

Type 28. Bray-Curtis distance

BCᵢⱼ = 1 − 2Cᵢⱼ / (Sᵢ + Sⱼ)

① i = one site, j = another site

② Sᵢ = total number of specimens counted at site i; Sⱼ = total number counted at site j

③ Cᵢⱼ = sum of the lesser counts of each species found at both sites (see the sketch after this list)
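
A worked sketch with made-up species counts at two sites (for non-negative count data this agrees with scipy.spatial.distance.braycurtis):


import numpy as np

site_i = np.array([6, 7, 4])   # counts per species at site i
site_j = np.array([10, 0, 6])  # counts per species at site j

S_i, S_j = site_i.sum(), site_j.sum()    # total specimens at each site
C_ij = np.minimum(site_i, site_j).sum()  # sum of lesser counts of shared species
print(1 - 2 * C_ij / (S_i + S_j))        # 1 - 2*10/(17+16) ≈ 0.394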

Type 29. Fourier loss



3. Types of Similarity Concepts

Type 1. Pearson Correlation Coefficient

① Given the standard deviations σx, σy of X and Y,


image


Type 2. Spearman Correlation Coefficient

① Define x’ = rank(x) and y’ = rank(y),


image


Type 3. Kendall Correlation Coefficient

① Define correlation for concordant and discordant pairs,


image
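

SciPy implements all three coefficients, so they can be compared on the same (arbitrary) data:


import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

print(pearsonr(x, y)[0])    # linear correlation
print(spearmanr(x, y)[0])   # rank correlation
print(kendalltau(x, y)[0])  # concordant/discordant pair correlation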


Type 4. Matthews Correlation Coefficient (MCC)


image
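

If scikit-learn is available, it provides an implementation; a sketch with made-up binary predictions:


from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
print(matthews_corrcoef(y_true, y_pred))  # ranges from -1 to 1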


Type 5. χ2

① For measurement data xₘ, yₘ and the approximating function f(x),


image


Type 6. SSIM

① Image similarity comparison algorithm


image


② Python code


import numpy as np

def SSIM(x, y):
    # assumption: x and y are grayscale images with the same dimensions

    def mean(img):
        return np.mean(img)

    def sigma(img):
        return np.std(img)

    def cov(img1, img2):
        # covariance via E[XY] - E[X]E[Y]
        img1_ = np.array(img1[:, :], dtype=np.float64)
        img2_ = np.array(img2[:, :], dtype=np.float64)
        return np.mean(img1_ * img2_) - mean(img1) * mean(img2)

    K1 = 0.01
    K2 = 0.03
    L = 255  # dynamic range of 8-bit pixels (0 to 255)

    C1 = (K1 * L) ** 2
    C2 = (K2 * L) ** 2
    C3 = C2 / 2

    # luminance, contrast, and structure terms
    l = (2 * mean(x) * mean(y) + C1) / (mean(x)**2 + mean(y)**2 + C1)
    c = (2 * sigma(x) * sigma(y) + C2) / (sigma(x)**2 + sigma(y)**2 + C2)
    s = (cov(x, y) + C3) / (sigma(x) * sigma(y) + C3)

    return l * c * s


Type 7. Mutual Information

① Principle: Can the second image be predicted given the first image?

② Useful for analyzing the relationship between two images obtained from different modalities

○ Example: In MRI, T1-weighted and T2-weighted images have many inverted points; mutual information considers this.

③ Code


import numpy as np

def mutual_information(img1, img2):
    # img1 and img2 are 3-channel color images; compare their first channels
    a = img1[:, :, 0]
    b = img2[:, :, 0]

    # joint histogram of pixel intensities
    hgram, x_edges, y_edges = np.histogram2d(a.ravel(), b.ravel(), bins=20)

    pxy = hgram / float(np.sum(hgram))  # joint probability
    px = np.sum(pxy, axis=1)            # marginal for x over y
    py = np.sum(pxy, axis=0)            # marginal for y over x
    px_py = px[:, None] * py[None, :]   # broadcast to multiply marginals

    nzs = pxy > 0  # only non-zero pxy values contribute to the sum
    return np.sum(pxy[nzs] * np.log(pxy[nzs] / px_py[nzs]))



Type 8. Mr (Thresholded Manders’ Colocalization Coefficient)

① Ratio of overlapping pixels between two different monochrome images

② tMr (thresholded Mr): Mr calculated after treating pixels below a specific threshold as zero-valued background

③ Background: the Pearson correlation coefficient is not well suited to comparing monochrome images because it can take negative values

Feature 1. Ranges from 0 to 1

Feature 2. Sensitive to background pixel values, but not heavily influenced by values of overlapping pixels

Feature 3. Dependent on Pearson correlation

Step 1. First, use Pearson correlation to obtain p-value and test for colocalization

Step 2. If colocalization is present, calculate tM1 and tM2 values

⑨ Usage: ImageJ

Type 9. Jaccard Similarity (IoU, intersection over union)

① Jaccard score: For two sets A and B,


image
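

A one-line set sketch mirroring the definition; note the contrast with the Dice coefficient above, which doubles the intersection and uses |A| + |B| in the denominator:


A = {1, 2, 3, 4}
B = {3, 4, 5}
print(len(A & B) / len(A | B))  # 2/5 = 0.4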


Type 10. Cosine Similarity

① Cosine value: For two vectors A and B,


image
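

A minimal NumPy sketch on two arbitrary vectors:


import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # ≈ 0.975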


Type 11. Euclidean Similarity

① Euclidean distance: For two vectors A and B,


image


Type 12. Coverage Score

① For two sets A and B,


image


Type 13. Fisher Exact Test

① For two sets A and B,


image

image
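
scipy.stats.fisher_exact runs the test on a 2×2 contingency table; the counts below are made up for illustration:


from scipy.stats import fisher_exact

# e.g., overlap of two sets against a background population
table = [[8, 2],
         [1, 5]]
odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
print(odds_ratio, p_value)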

Type 14. Faiss: A library for efficient similarity search and clustering of dense vectors, developed by Meta.

Type 15. Smith–Waterman similarity: Used for evaluating the similarity between nucleic acid or amino acid sequences.

Type 16. Maximal information coefficient (MIC)



Input: 2022.08.02 16:03

Modified: 2023.08.23 14:28
