What?
The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
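The density condition above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `neighbors_within` and `is_dense`, the toy points, and the values of `eps` and `min_pts` are made up for this note, and the point itself is assumed to count as part of its own neighborhood (conventions differ on this).

```python
from math import dist

def neighbors_within(points, index, eps):
    """Indices of all points within distance eps of points[index] (itself included)."""
    center = points[index]
    return [j for j, p in enumerate(points) if dist(center, p) <= eps]

def is_dense(points, index, eps, min_pts):
    """True if the eps-neighborhood of points[index] holds at least min_pts points."""
    return len(neighbors_within(points, index, eps)) >= min_pts

# Four points packed together and one far away (hypothetical toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
dense_flags = [is_dense(points, i, eps=0.5, min_pts=3) for i in range(len(points))]
print(dense_flags)  # [True, True, True, True, False]
```

The four nearby points satisfy the condition, while the isolated point does not.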
DBSCAN
- "DBSCAN" = Density-Based Spatial Clustering of Applications with Noise.
- Separates regions of high density from regions of low density.
- Can sort data into clusters of varying shapes.
- Input: a set of points, a neighborhood N (defined by its radius), and minpts (the density threshold).
- Output: dense clusters, plus noise points.
- Each point is one of:
  - Core point: has at least minpts points in its neighborhood.
  - Border point: not a core point, but has at least one core point in its neighborhood.
  - Noise point: neither a core point nor a border point.
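- Phases:
  - Choose an unvisited point and check whether it is a core point.
    - If yes → start a cluster and expand it: every neighbor joins the cluster, and neighbors that are themselves core points are expanded in turn (neighbors that are not core points become border points).
    - If no → mark it as noise for now; it may later turn out to be a border point of some cluster.
  - Repeat with the remaining unvisited points to form other clusters.
  - Points still marked as noise at the end are eliminated as outliers.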
- Pros:
  - Discovers any number of clusters (unlike K-Means and K-Medoids clustering, which need the number of clusters as input).
  - Finds clusters of varying sizes and shapes.
  - Detects and ignores outliers.
- Cons:
  - Sensitive to the choice of neighborhood parameters (e.g. if minpts is too small, sparse noise points can be mistaken for clusters).
  - Produces noise points, and it is unclear how to calculate evaluation metrics when some points are labeled as noise.
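The phases listed above can be sketched as a small from-scratch implementation. This is a minimal sketch for intuition, not an optimized or reference implementation: the names `region_query` and `dbscan`, the label `-1` for noise, and the toy data are choices made for this note, and a point is assumed to count as a member of its own neighborhood.

```python
from math import dist

NOISE = -1        # label for noise points
UNVISITED = None  # label for points not processed yet

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point: tentatively noise, may become a border point later.
            labels[i] = NOISE
            continue
        # Core point: start a new cluster and expand it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # previously noise, now a border point
                continue
            if labels[j] is not UNVISITED:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Two tight squares of points and one isolated point (hypothetical toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, -1, 1, 1, 1, 1]
```

Each call to `region_query` scans all points, so this sketch is quadratic in the number of points; real libraries typically use a spatial index instead.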
HDBSCAN
- HDBSCAN = Hierarchical DBSCAN.
- Difference between DBSCAN and HDBSCAN:
  - HDBSCAN: focuses on regions of high density.
  - DBSCAN: creates the right clusters, but also creates clusters with a very low density of examples (Figure 1).
- Check more in this note.
- Reduces clustering time in comparison with other methods (Figure 2).
- HDBSCAN has the parameter minimum cluster size (`min_cluster_size`), which is how big a cluster needs to be in order to form.
![Figure 1. Difference between DBSCAN (left) and HDBSCAN (right). Source of figure.](https://prod-files-secure.s3.us-west-2.amazonaws.com/70a67195-bc38-429a-9695-1ad1b42ccec8/40b7e12b-ee9c-4890-8608-cecafea1d352/Untitled.png)
Figure 1. Difference between DBSCAN (left) and HDBSCAN (right). Source of figure.
![Figure 2. Performance comparison of different clustering methods. HDBSCAN is much faster than DBSCAN with more data points. Source of figure.](https://prod-files-secure.s3.us-west-2.amazonaws.com/70a67195-bc38-429a-9695-1ad1b42ccec8/804bffd5-9beb-4c1b-9103-0c449b4a44b0/Untitled.png)
Figure 2. Performance comparison of different clustering methods. HDBSCAN is much faster than DBSCAN with more data points. Source of figure.
When?