Clustering

Overview

KMeans is an iterative process and aims to partition N observations into K where observation belongs to the cluster with the closest mean.
By assigning N observations to K clusters such that within each the average dissimilarity of the observations from the cluster mean (i.e. centroid) is minimized.
Specify the number of clusters k
Randomly pick k centroids from the data points as initial cluster centers
Assign each sample to the nearest centroid (i.e. Euclidean distance)
Move the centroids to the center of the samples that were assigned to it
Repeat the third and fourth steps until the cluster assignment converges

Elbow method:

plot number of clusters versus interia
Inertia: measure of how internally coherent clusters are
Cohesion: measures how closely related are objects in a cluster
Separation: measures how distinctly separated a cluster is from other clusters.

Calinski-Harabasz:

Also known as the Variance Ratio Criterion
Indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are.

Silhouette Coefficient:

Is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample.
- The Silhouette Coefficient for a sample is (b - a) / max(a, b)
- The best value is 1 and the worst value is -1.
- Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

K-NN is a Supervised machine learning while K-means is an unsupervised machine learning.
K-NN is a classification or regression machine learning algorithm while
K-NN performs much better if all of the data have the same scale but this is not true for K-means.
K-means is a clustering machine learning algorithm and used for unsupervised ML tasks.

Agglomerative Clustering is a clustering algorithm that builds nested clusters by merging or splitting them successively.
This hierarchy of clusters is represented as a tree (or dendrogram).
Hierarchical clustering using a bottom up approach:
- Each observation starts in its own cluster, and clusters are successively merged together.
Hierarchical clustering algorithms aims at optimizing different objective functions:
- Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.
- Maximum or complete linkage: minimizes the maximum distance between observations of pairs of clusters.
- Average linkage minimizes the average of the distances between all observations of pairs of clusters.
- Single linkage minimizes the distance between the closest observations of pairs of clusters.

Clustering algorithm views clusters as areas of high density separated by areas of low density.
Clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped.

Affinity Propagation

Affinity Propagation creates clusters by sending messages between pairs of samples until convergence.
The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs.
This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given.