🧠 AI Exploration: DBSCAN Explained

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups together points lying in dense regions and marks points in sparse regions as outliers (noise).

Unlike K-Means, you don't need to specify the number of clusters in advance.


🧠 How DBSCAN Works

DBSCAN relies on two parameters:

  • eps: The maximum distance between two points for them to be considered neighbors
  • min_samples: The minimum number of points within eps (in scikit-learn, counting the point itself) needed to form a dense region

It classifies points as:

  1. Core Point: Has at least min_samples points within its eps radius
  2. Border Point: Lies within eps of a core point but is not a core point itself
  3. Noise Point: Is not a core point and does not lie within eps of any core point

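Clusters are formed by starting at a core point and repeatedly absorbing every point within eps of a core point already in the cluster. Noise points are left unassigned (scikit-learn gives them the label -1).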
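To make the three roles concrete, here is a minimal sketch that reads them off a fitted scikit-learn model. The toy dataset is made up for illustration, and the mask names are hypothetical; `core_sample_indices_` and `labels_` are attributes of the fitted `DBSCAN` estimator.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (made up for illustration): two tight groups, one point on the
# edge of the first group, and one isolated point.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],   # dense group A
    [0.35, 0.1],                                       # edge of group A
    [3.0, 3.0], [3.1, 3.0], [3.0, 3.1], [3.1, 3.1],   # dense group B
    [10.0, 10.0],                                      # isolated point
])

db = DBSCAN(eps=0.3, min_samples=4).fit(X)

# Core points are reported directly by scikit-learn.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise points carry the label -1.
noise_mask = db.labels_ == -1

# Border points belong to a cluster but are not core points.
border_mask = ~core_mask & ~noise_mask

for name, mask in [("core", core_mask), ("border", border_mask), ("noise", noise_mask)]:
    print(f"{name:>6}: indices {np.flatnonzero(mask).tolist()}")
```

The edge point has only three points within eps (itself plus two members of group A), so it is not a core point, but it sits within eps of a core point and therefore joins that cluster as a border point.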


🧮 Mathematical Definition of DBSCAN

Let’s define a few key terms more formally:

1. ε-neighborhood of a point

Given a dataset $D \subset \mathbb{R}^d$, a point $x \in D$, and a radius $\varepsilon > 0$,

$$N_{\varepsilon}(x) = \{ y \in D \mid \|x - y\| \leq \varepsilon \}$$

This is the set of all data points within distance $\varepsilon$ of $x$, including $x$ itself.

2. Core Point

A point $x$ is a core point if:

$$|N_{\varepsilon}(x)| \geq \text{minPts}$$

That is, it has at least minPts neighbors (including itself) in its ε-neighborhood. Here minPts is the parameter that scikit-learn calls min_samples.
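The first two definitions translate almost line by line into NumPy. This is a minimal sketch, assuming Euclidean distance and a finite dataset stored as the rows of an array; the function names `eps_neighborhood` and `is_core` are hypothetical helpers, not part of any library.

```python
import numpy as np

def eps_neighborhood(X, i, eps):
    """Indices of all rows of X within Euclidean distance eps of X[i] (X[i] included)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.flatnonzero(dists <= eps)

def is_core(X, i, eps, min_pts):
    """True if X[i] has at least min_pts points (itself included) in its eps-neighborhood."""
    return len(eps_neighborhood(X, i, eps)) >= min_pts

# Four made-up points: three near the origin and one far away.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])

neighbors_of_0 = eps_neighborhood(X, 0, eps=1.0)
print(neighbors_of_0.tolist())               # indices of the neighbors of point 0
print(is_core(X, 0, eps=1.0, min_pts=3))     # point 0 has three points within eps
print(is_core(X, 3, eps=1.0, min_pts=3))     # point 3 only has itself
```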

3. Direct Density-Reachability

A point $x$ is directly density-reachable from a point $y$ if:

  • $x \in N_{\varepsilon}(y)$
  • $y$ is a core point

4. Density-Reachability

A point $x$ is density-reachable from $y$ if there exists a chain of points:

$$x_1 = y, x_2, \dots, x_n = x$$

such that $x_{i+1}$ is directly density-reachable from $x_i$ for every $i = 1, \dots, n-1$.

5. Density-Connected

Two points $x$ and $y$ are density-connected if there exists a point $z$ such that both $x$ and $y$ are density-reachable from $z$. A cluster is then a maximal set of mutually density-connected points.
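These definitions are enough to write the whole algorithm. The sketch below is a simplified from-scratch version for illustration, not scikit-learn's implementation: it assumes Euclidean distance, builds a full pairwise distance matrix (fine for small data only), and grows each cluster with a breadth-first search over density-reachable points. The name `dbscan_sketch` and the constants are hypothetical.

```python
from collections import deque

import numpy as np

NOISE = -1
UNVISITED = -2

def dbscan_sketch(X, eps, min_pts):
    """Label each row of X with a cluster id (0, 1, ...) or NOISE."""
    n = len(X)
    # Pairwise Euclidean distances: O(n^2) memory, acceptable for a small demo.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighborhoods = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighborhoods])

    labels = np.full(n, UNVISITED)
    cluster_id = 0
    for i in range(n):
        # Only an unassigned core point can start a new cluster.
        if labels[i] != UNVISITED or not is_core[i]:
            continue
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            p = queue.popleft()  # p is always a core point
            for q in neighborhoods[p]:
                if labels[q] == UNVISITED:
                    # q is directly density-reachable from p.
                    labels[q] = cluster_id
                    if is_core[q]:
                        queue.append(q)  # only core points extend the chain
        cluster_id += 1

    # Anything never reached from a core point is noise.
    labels[labels == UNVISITED] = NOISE
    return labels

# Made-up toy data: two dense groups, one edge point, one isolated point.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.35, 0.1],
    [3.0, 3.0], [3.1, 3.0], [3.0, 3.1], [3.1, 3.1],
    [10.0, 10.0],
])
labels = dbscan_sketch(X, eps=0.3, min_pts=4)
print(labels.tolist())
```

Border points are labeled but never added to the queue, which matches the definition: a chain of direct density-reachability can only pass through core points.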


🎯 When to Use DBSCAN

  • When clusters have irregular shapes (not spherical)
  • When data contains outliers
  • When you don’t know how many clusters exist

✅ Advantages and Disadvantages

✅ Pros

  • Does not require you to specify the number of clusters
  • Can detect outliers (labels them as noise)
  • Works well with clusters of arbitrary shape

āŒ Cons

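  • Choosing eps and min_samples can be tricky, and a single eps struggles when clusters have very different densities
  • Results degrade in high-dimensional spaces, where distances between points become less informative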

🧪 Code Example: DBSCAN on Iris Dataset

from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load dataset
iris = load_iris()
X = iris.data
features = iris.feature_names

# Standardize features (important for distance-based models)
X_scaled = StandardScaler().fit_transform(X)

# Run DBSCAN
dbscan = DBSCAN(eps=0.6, min_samples=4)
clusters = dbscan.fit_predict(X_scaled)

# Visualize
df = pd.DataFrame(X, columns=features)
df['Cluster'] = clusters

sns.pairplot(df, hue='Cluster', palette='Set1', corner=True)
plt.suptitle('DBSCAN Clustering on Iris Dataset', y=1.02)
plt.tight_layout()
plt.show()

This example uses DBSCAN to find structure in the Iris dataset without being told how many clusters to look for.

📊 The plot below shows how DBSCAN discovered three dense clusters and labeled several outlier points as noise (in red, cluster -1). Unlike K-Means, DBSCAN can identify non-spherical structures and isolate sparse, scattered points, which is its main strength on messy real-world data.

Figure: DBSCAN Clustering on Iris Dataset

🔍 K-Means vs. DBSCAN Comparison

K-Means (previous post) cleanly splits the Iris dataset into three compact, roughly spherical clusters. It implicitly assumes clusters of similar size and spread, and it assigns every point to a cluster, so outliers are never flagged.

DBSCAN (this post), in contrast, is density-aware: it detects clusters of arbitrary shape and automatically flags outliers (in red, cluster -1). This makes DBSCAN more suitable for datasets with noise or uneven cluster sizes, whereas K-Means may struggle when clusters are non-convex or imbalanced.
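The non-convex case is easy to try on synthetic data. The sketch below is an illustration under assumptions that are not from the Iris example: it uses scikit-learn's `make_moons` generator (two interleaving half-circles), hand-picked values `eps=0.2` and `min_samples=5`, and the adjusted Rand index (ARI) to score each labeling against the known moon membership.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a classic non-convex toy dataset.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means is told there are two clusters; DBSCAN is not.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# ARI is 1.0 for a perfect match and close to 0 for a random labeling.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN  ARI: {ari_dbscan:.2f}")
```

K-Means cuts the plane with a straight boundary, so it has to slice through both moons, while DBSCAN follows each curved band of dense points. The eps value here was chosen for this unscaled toy data and would not carry over to other datasets.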


📊 Notes on Parameter Tuning

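  • Use a k-distance plot to find a good value for eps: sort every point's distance to its k-th nearest neighbor and look for the "elbow" where the curve bends sharply upward
  • Set min_samples roughly equal to the number of features or slightly larger (a common rule of thumb is at least the number of features plus one)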
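Here is a minimal sketch of how the k-distance curve could be computed for the standardized Iris data, using scikit-learn's `NearestNeighbors`. Setting k equal to min_samples is an assumption of this sketch; conventions differ on whether to use min_samples or min_samples - 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
min_samples = 4  # same value as in the Iris example above

# Querying the training data itself returns each point as its own nearest
# neighbor (distance 0), which matches scikit-learn's convention that
# min_samples counts the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)

# Distance from each point to the farthest of its min_samples nearest
# neighbors, sorted in ascending order.
k_distances = np.sort(distances[:, -1])

# A few quantiles give a rough feel for where the curve starts to climb.
for q in (0.50, 0.75, 0.90, 0.95):
    print(f"{int(q * 100)}th percentile: {np.quantile(k_distances, q):.3f}")
```

Plotting `k_distances` against its index, for example with `plt.plot(k_distances)`, gives the k-distance plot. Points to the right of the elbow are the ones DBSCAN would tend to label as noise for an eps chosen at that height.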

🔚 Recap

DBSCAN is ideal for clustering spatial, noisy, or arbitrarily shaped data without predefining the number of clusters. Its ability to handle noise makes it a go-to algorithm for real-world exploratory clustering tasks.


🔜 Coming Next

Next in this subseries of clustering techniques: Hierarchical Clustering, where we build a tree of nested clusters and cut it at the desired level.

Stay curious and keep exploring 👇