# Clustering
__________

## Overview

![image](../assets/cluster1.png)

## Selecting Cluster Algorithm 
![image](../assets/cluster2.png)


## K-Means Clustering
- KMeans is an iterative process and aims to partition `N` observations into
  `K` where observation belongs to the cluster with the closest mean. 
- By assigning `N` observations to `K` clusters such that within each the
  average dissimilarity of the  observations from the cluster mean (i.e.
  centroid) is minimized. 
- Specify the number of clusters `k`
- Randomly pick k centroids from the data points as initial cluster centers
- Assign each sample to the nearest centroid (i.e. Euclidean distance)
- Move the centroids to the center of the samples that were assigned to it
- Repeat the `third` and `fourth` steps until the cluster assignment converges
 
![image](../assets/kmeans.png)

## Evaluation of Clusters

**Elbow method:** 
- plot number of clusters versus interia
- **Inertia:** measure of how internally coherent clusters are
- **Cohesion:** measures how closely related are objects in a cluster
- **Separation:** measures how distinctly separated a cluster is from other
  clusters. 

**Calinski-Harabasz:**
- Also known as the `Variance Ratio Criterion`
- Indicates how well a clustering model defines its clusters, such that the
  higher the score, the more  dense and well separated the clusters are.

**Silhouette Coefficient:** 
- Is calculated using the mean intra-cluster distance
(a) and the mean nearest-cluster distance (b) for each sample. 
	- The Silhouette Coefficient for a sample is (b - a) / max(a, b)
	- The best value is 1 and the worst value is -1. 
    - Values near 0 indicate overlapping clusters. Negative values generally
      indicate that a sample  has been assigned to the wrong cluster, as a
      different cluster is more similar.

### Difference between KNN and KMeans
- K-NN is a Supervised machine learning while K-means is an unsupervised
  machine learning. 
- K-NN is a classification or regression machine learning algorithm while 
- K-NN performs much better if all of the data have the same scale but this is
  not true for K-means. 
- K-means is a clustering machine learning algorithm and used for unsupervised
  ML tasks. 

## Hierarchical Clustering 
- Agglomerative Clustering is a clustering algorithm that builds nested
  clusters by merging or splitting them successively. 
- This hierarchy of clusters is represented as a tree (or dendrogram). 
- Hierarchical clustering using a bottom up approach: 
  - Each observation starts in its own cluster, and clusters are successively
    merged together.  
- Hierarchical clustering algorithms aims at optimizing different objective
  functions: 
  - **Ward** minimizes the sum of squared differences within all clusters. It is
    a variance-minimizing  approach and in this sense is similar to the k-means
    objective function but tackled with an agglomerative hierarchical approach.
  - **Maximum or complete linkage:** minimizes the maximum distance between
    observations of pairs of clusters. 
  - **Average linkage minimizes** the average of the distances between all
    observations of pairs of clusters. 
  - **Single linkage** minimizes the distance between the closest observations of
    pairs of clusters. 

## DBSCAN
- Clustering algorithm views clusters as areas of high density separated by
  areas of low density. 
- Clusters found by DBSCAN can be any shape, as opposed to k-means which
  assumes that clusters are  convex shaped. 

**Affinity Propagation**
- Affinity Propagation creates clusters by sending messages between pairs of
  samples until convergence. 
- The messages sent between pairs represent the suitability for one sample to
  be the exemplar of the  other, which is updated in response to the values
  from other pairs. 
- This updating happens iteratively until convergence, at which point the final
  exemplars are chosen,  and hence the final clustering is given.