DBSCAN / HDBSCAN Clustering

What?

The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

"DBSCAN" = Density-Based Spatial Clustering of Applications with Noise.
Separates regions of high density from regions of low density.
Can sort data into clusters of varying shapes.
Input: a set of points & a neighborhood N (e.g. a radius) & minpts (the density threshold)
Output: density-based clusters (+ noise points)
Each point is either:
- Core point: has at least minpts points in its neighborhood.
- Border point: not a core point, but has at least 1 core point in its neighborhood.
- Noise point: neither a core point nor a border point.
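The three definitions above can be sketched in a few lines of plain Python. This is only an illustration: the helper names (`region_query`, `classify_points`) and the sample points are made up for this note, the neighborhood is assumed to be a Euclidean ball of radius `eps`, and a point is assumed to count as a member of its own neighborhood (one common convention; others exclude it).

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    neighbors = [region_query(points, i, eps) for i in range(len(points))]
    # Core: the neighborhood holds at least min_pts points (self included, by assumption).
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    kinds = []
    for i, nb in enumerate(neighbors):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in nb):
            kinds.append("border")  # not core, but within eps of a core point
        else:
            kinds.append("noise")   # neither core nor border
    return kinds

# Hypothetical toy data: a unit square, one point hanging off it, one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
kinds = classify_points(points, eps=1.0, min_pts=3)
print(kinds)  # ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point (2, 1) has only two points in its neighborhood, so it is not core, but it sits within `eps` of the core point (1, 1) and is therefore a border point.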
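Phases:
1. Choose an unvisited point → is it a core point?
  1. If yes → start a new cluster and expand it: add every point in its neighborhood, then keep expanding from each added point that is itself a core point (added points that are not core are border points).
  2. If no → mark it as noise for now (it may later be claimed by a cluster as a border point).
2. Repeat until every point has been visited, forming the other clusters.
3. Points still marked as noise belong to no cluster.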
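The steps above can be sketched as a small, unoptimized implementation. It is a minimal sketch rather than a reference implementation: it assumes Euclidean distance, counts a point as part of its own neighborhood, and uses a brute-force O(n²) neighbor search; the function name `dbscan` and the sample points are chosen only for illustration.

```python
from collections import deque
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors_of(i):
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors_of(i)
        if len(nb) < min_pts:
            labels[i] = NOISE  # tentative: may become a border point later
            continue
        # i is a core point: start a new cluster and expand it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = deque(nb)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # previously "noise", now a border point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            nb_j = neighbors_of(j)
            if len(nb_j) >= min_pts:  # only core points keep expanding the cluster
                queue.extend(nb_j)
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.0, min_pts=3)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that noise is not removed in a separate pass here: a point simply keeps the label -1 if no cluster ever reaches it.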
Pros:
- No need to specify the number of clusters in advance (unlike K-Means & K-Medoids clustering, which take the number of clusters as an input).
- Finds clusters of varying sizes and shapes.
- Detects and ignores outliers.
Cons:
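- Sensitive to the choice of neighborhood parameters (e.g. if minpts is too small → noise points get absorbed into clusters; if the radius is too small → genuine cluster points get labeled as noise).
- Produces noise points: it is unclear how to compute clustering metrics (indexes) when some points are labeled as noise.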
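A small experiment makes the parameter sensitivity concrete. The sketch below only counts how many points end up as noise (neither core nor within reach of a core point) for a few parameter values; the helper `count_noise` and the toy points are hypothetical, and the same assumptions apply as before (Euclidean distance, a point counts toward its own neighborhood).

```python
from math import dist

def count_noise(points, eps, min_pts):
    """Number of points that are neither core nor within eps of a core point."""
    n = len(points)
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    return sum(
        1
        for i, nb in enumerate(neighbors)
        if not is_core[i] and not any(is_core[j] for j in nb)
    )

# Hypothetical toy data: two unit squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (3, 3), (3, 4), (4, 3), (4, 4),
          (9, 9)]

# Vary the radius with min_pts fixed at 3.
noise_by_eps = {eps: count_noise(points, eps, min_pts=3) for eps in (0.5, 1.0, 8.0)}
print(noise_by_eps)      # {0.5: 9, 1.0: 1, 8.0: 0}

# Vary min_pts with the radius fixed at 1.0.
noise_by_min_pts = {m: count_noise(points, 1.0, m) for m in (1, 3, 4)}
print(noise_by_min_pts)  # {1: 0, 3: 1, 4: 9}
```

On this data, a radius that is too small turns every point into noise, a radius that is too large swallows the isolated point, and moving minpts from 3 to 4 flips the result from one noise point to nine.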

HDBSCAN = Hierarchical DBSCAN.
Difference between DBSCAN and HDBSCAN:
- HDBSCAN: focuses on regions of high density.
- DBSCAN: finds the right clusters but also creates clusters with a very low density of examples (Figure 1).
- Check more in this note.
Reduces clustering time in comparison with other methods (Figure 2).
HDBSCAN has a minimum cluster size parameter (min_cluster_size): the smallest number of points a group needs in order to form a cluster.

Figure 1. Difference between DBSCAN (left) and HDBSCAN (right). Source of figure.

Figure 2. Performance comparison of different clustering methods. HDBSCAN is much faster than DBSCAN with more data points. Source of figure.
