K-Means and DBSCAN

K-Means Clustering

K-Means is a popular clustering algorithm used to partition data into groups or clusters based on similarity. It's an unsupervised learning algorithm that seeks to find natural groupings within a dataset.

Mathematics Behind K-Means

K-Means works by iteratively assigning data points to clusters and then updating the cluster centroids. Here's a simplified overview of the process:

  1. Initialization: Choose 'k' initial cluster centroids randomly from the data points.
  2. Assignment Step: For each data point, calculate its distance to each cluster centroid and assign it to the nearest centroid. This step creates 'k' clusters.
  3. Update Step: Recalculate the centroids of the clusters based on the data points assigned to them.
  4. Repeat: Repeat the assignment and update steps until convergence, i.e., until the assignments no longer change (or change only minimally).
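The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration for clarity, not a production implementation; the function and parameter names are my own, and edge cases such as empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: label each point with the index of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # 4. Repeat until convergence: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated blobs, for example `X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])`, `kmeans(X, 2)` assigns the first two points to one cluster and the last two to the other.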


The algorithm minimizes the sum of squared distances between data points and their assigned cluster centroids. This is the within-cluster variance, which is a measure of how compact the clusters are.

Cost Function

The cost function for K-Means, often referred to as the "inertia" or "within-cluster sum of squares," is defined as follows:
J(K) = \sum_{i=1}^{K} \sum_{j=1}^{n} \| x_j^{(i)} - \mu_i \|^2

Here:
  • J(K) is the cost function to be minimized
  • K is the number of clusters
  • n is the number of data points assigned to cluster i
  • x_j^(i) is the jth data point in cluster i
  • μ_i is the centroid of cluster i
The objective is to find the cluster assignments (which data points belong to which clusters) and centroids that minimize this cost function. K-Means aims to minimize the within-cluster variance, resulting in compact clusters.
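The cost function can be computed directly from a set of labels and centroids. The snippet below is a small sketch of that calculation (the function name `inertia` is my own, echoing the term used above):

```python
import numpy as np

def inertia(X, labels, centroids):
    """Within-cluster sum of squares: sum of ||x_j - mu_i||^2 over all points."""
    return sum(np.sum((X[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))

# Two points at distance 1 from their shared centroid: J = 1 + 1 = 2
X = np.array([[0., 0.], [0., 2.]])
labels = np.array([0, 0])
centroids = np.array([[0., 1.]])
print(inertia(X, labels, centroids))  # → 2.0
```

Each K-Means iteration can only decrease (or leave unchanged) this quantity, which is why the algorithm is guaranteed to converge, though possibly to a local minimum.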

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is another clustering algorithm used to find clusters in data. Unlike K-Means, DBSCAN can discover clusters of varying shapes and is robust to noise.


Mathematics Behind DBSCAN

DBSCAN defines clusters as dense regions of data points separated by areas of lower density. It works by examining the density of data points within a neighborhood of each point.
The algorithm classifies points using two parameters, a neighborhood radius ε (eps) and a density threshold minPts, and defines three types of points:

  1. Core Points: points that have at least minPts data points (including themselves) within distance ε.
  2. Border Points: points that are not core points but lie within the ε-neighborhood of a core point.
  3. Noise Points: points that are neither core nor border points; they are considered noise and do not belong to any cluster.

DBSCAN then identifies clusters by connecting core points to their density-reachable neighbors.
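The point classification and cluster-growing process can be sketched from scratch as follows. This is an illustrative sketch assuming Euclidean distance; the function and parameter names (`dbscan`, `eps`, `min_pts`) are my own, and a real implementation would use a spatial index rather than the full pairwise distance matrix:

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=3):
    """Minimal DBSCAN sketch. Returns labels: -1 for noise, 0..k-1 for clusters."""
    n = len(X)
    # Pairwise distances; neighbors[i] = indices of points within eps of point i
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)  # -1 = noise until assigned to a cluster
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from this unvisited core point
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border or core point joins the cluster
                if core[j]:
                    # Only core points expand the cluster further
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels
```

Running this on two tight blobs plus a distant outlier yields two clusters and one noise point, with no need to specify the number of clusters in advance.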

Cost Function

DBSCAN does not have a traditional cost function like K-Means because it does not attempt to minimize a specific mathematical criterion. Instead, it seeks to group data points based on their density.

Conclusion

In summary, K-Means is a partitioning clustering algorithm that aims to minimize the within-cluster variance. It's suitable for finding spherical or isotropic clusters.
On the other hand, DBSCAN identifies clusters based on density and can discover clusters of varying shapes. It's robust to noise and doesn't require specifying the number of clusters in advance, making it a valuable choice for certain clustering tasks.


Note: Parts of the article are developed by using ChatGPT.
