K-Means and DBSCAN
K-Means Clustering
K-Means is a popular clustering algorithm used to partition data into groups or clusters based on similarity. It's an unsupervised learning algorithm that seeks to find natural groupings within a dataset.
Mathematics Behind K-Means
K-Means works by iteratively assigning data points to clusters and then updating the cluster centroids.
Here's a simplified overview of the process:
- Initialization: Choose 'k' initial cluster centroids randomly from the data points.
- Assignment Step: For each data point, calculate its distance to each cluster centroid and assign it to the nearest centroid. This step creates 'k' clusters.
- Update Step: Recalculate the centroids of the clusters based on the data points assigned to them.
- Repeat: Alternate the assignment and update steps until convergence, i.e., until the assignments no longer change (or change only minimally).
The algorithm minimizes the sum of squared distances between data points and their assigned cluster centroids. This is the within-cluster variance, which is a measure of how compact the clusters are.
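To make the loop concrete, here is a minimal NumPy sketch of the assignment and update steps described above; the function name, iteration cap, and seed handling are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with the index of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Convergence: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Running kmeans(X, k=3) on a two-dimensional dataset returns one label per point plus the final centroids, from which the cost function below can be evaluated.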
Cost Function
The cost function for K-Means, often referred to as the "inertia" or "within-cluster sum of squares," is defined as:

$$J(K) = \sum_{i=1}^{K} \sum_{j=1}^{n} \left\lVert x_j^{(i)} - \mu_i \right\rVert^2$$

Here:
- $J(K)$ is the cost function to be minimized
- $K$ is the number of clusters
- $n$ is the number of data points
- $x_j^{(i)}$ is the $j$-th data point in cluster $i$
- $\mu_i$ is the centroid of cluster $i$
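Given the labels and centroids from a K-Means run, this cost can be computed directly; a small sketch (array names assumed to match the earlier example):

```python
import numpy as np

def inertia(X, labels, centroids):
    # Sum of squared distances between each point and its assigned centroid
    return float(np.sum((X - centroids[labels]) ** 2))
```

Libraries such as scikit-learn expose the same quantity as the inertia_ attribute of a fitted KMeans model.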
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is another clustering algorithm used to find clusters in data. Unlike K-Means, DBSCAN can discover clusters of varying shapes and is robust to noise.
Mathematics Behind DBSCAN
DBSCAN defines clusters as dense regions of data points separated by areas of lower density.
It works by examining the density of data points within a neighborhood of each point.
The algorithm defines three types of points:
- Core Points: Points that have at least a minimum number of data points (the minPts threshold) within a neighborhood of a given radius (ε).
- Border Points: Points that are not core points themselves but fall within the ε-neighborhood of a core point.
- Noise Points: Points that are neither core nor border points are considered noise points and do not belong to any cluster.
DBSCAN then identifies clusters by connecting core points to their density-reachable neighbors.
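As an illustration, scikit-learn's DBSCAN exposes these ideas through the eps (neighborhood radius) and min_samples (core-point threshold) parameters; the synthetic data below is purely illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered outliers (illustrative synthetic data)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
outliers = rng.uniform(low=-2, high=7, size=(5, 2))
X = np.vstack([blob_a, blob_b, outliers])

# eps: neighborhood radius; min_samples: threshold for a core point
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Cluster labels per point; noise points are labeled -1
print(db.labels_)
print("clusters found:", len(set(db.labels_)) - (1 if -1 in db.labels_ else 0))
```

Note that the number of clusters is an output of the algorithm, not an input, in contrast to K-Means.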
Cost Function
DBSCAN does not have a traditional cost function like K-Means because it does not attempt to minimize a specific mathematical criterion. Instead, it seeks to group data points based on their density.

Conclusion
In summary, K-Means is a partitioning clustering algorithm that aims to minimize the within-cluster variance. It's suitable for finding spherical or isotropic clusters.

On the other hand, DBSCAN identifies clusters based on density and can discover clusters of varying shapes. It's robust to noise and doesn't require specifying the number of clusters in advance, making it a valuable choice for certain clustering tasks.
Note: Parts of this article were developed using ChatGPT.