- Published on
๐ง AI Exploration #9: Leiden Clustering Explained
- Authors

- Name
- Van-Loc Nguyen
- @vanloc1808
๐ง AI Exploration #9: Leiden Clustering Explained
Leiden Clustering is a powerful graph-based clustering algorithm used to discover communities in complex networks.
Unlike K-Means, Leiden does not assume spherical clusters. Instead, it works on a graph of relationships between samples and groups together nodes that are densely connected.
Today, Leiden is widely used in:
- Semantic embedding clustering
- Single-cell biology
- Recommendation systems
- Social network analysis
- Large-scale AI retrieval systems
๐ง How Leiden Clustering Works
Leiden operates on a graph:
- Each sample becomes a node
- Similar samples are connected by edges
- Edge weights represent similarity strength
The algorithm then searches for communities where:
- connections inside the cluster are strong
- connections between clusters are weak
Typical Pipeline
A common modern AI clustering pipeline looks like this:
Raw Text
โ
Embedding Model
โ
Vector Embeddings
โ
k-Nearest Neighbor Graph
โ
Leiden Clustering
For example:
- Sentence embeddings from BGE or OpenAI
- kNN graph using cosine similarity
- Leiden for community detection
๐งฎ Mathematical Intuition
Leiden optimizes a quantity called modularity (or a related objective).
The idea is:
A good cluster should contain more internal connections than expected by random chance.
Graph Representation
Let:
- = set of nodes
- = weighted edges
Each edge has weight:
representing similarity between node and node .
Modularity Objective
A simplified modularity formulation is:
Where:
- = edge weight between nodes
- = total degree of node
- = total graph weight
- = cluster assignment
- if nodes belong to same cluster
Higher modularity means stronger community structure.
๐ Why Leiden Is Better Than Louvain
Leiden is actually an improvement over the older Louvain algorithm.
Louvain Problem
Louvain may produce:
- disconnected communities
- poorly connected clusters
- unstable partitions
Leiden Improvement
Leiden adds:
- refinement steps
- connectivity guarantees
- better optimization stability
As a result:
- clusters are more coherent
- results are more robust
- convergence is usually faster
๐ฏ When to Use Leiden
Leiden works especially well when:
- Using embedding vectors
- Clustering semantic text
- Discovering graph communities
- Cluster shapes are non-spherical
- Dataset structure is complex
It is often better than K-Means for modern embedding spaces.
โ Advantages and Disadvantages
โ Pros
- Excellent for embedding clustering
- Handles irregular cluster structure
- No need to assume spherical clusters
- Scales well to large graphs
- Produces well-connected communities
โ Cons
- Requires graph construction
- Sensitive to graph quality
- kNN and resolution tuning matter
- Harder to interpret mathematically than K-Means
๐งช Code Example: Leiden Clustering with Sentence Embeddings
import numpy as np
import igraph as ig
import leidenalg
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors
texts = [
"database connection failed",
"unable to connect to database",
"gpu memory overflow",
"cuda out of memory",
"payment transaction failed",
]
# Generate embeddings
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(
texts,
normalize_embeddings=True,
)
# Build kNN graph
k = 2
nn = NearestNeighbors(
n_neighbors=k + 1,
metric="cosine",
)
nn.fit(embeddings)
distances, indices = nn.kneighbors(embeddings)
edges = []
weights = []
for source_idx, (row_distances, row_indices) in enumerate(
zip(distances, indices)
):
for distance, target_idx in zip(row_distances, row_indices):
if source_idx == target_idx:
continue
similarity = 1.0 - distance
edges.append((source_idx, target_idx))
weights.append(similarity)
# Create graph
graph = ig.Graph(
n=len(texts),
edges=edges,
directed=False,
)
graph.es["weight"] = weights
# Run Leiden
partition = leidenalg.find_partition(
graph,
leidenalg.RBConfigurationVertexPartition,
weights=graph.es["weight"],
resolution_parameter=0.5,
)
labels = partition.membership
for text, label in zip(texts, labels):
print(label, text)
This example clusters semantically similar log messages together using:
- BGE embeddings
- cosine similarity
- kNN graph construction
- Leiden community detection
๐ Understanding Important Parameters
k (kNN neighbors)
Controls graph connectivity.
Smaller
kโ fragmented graph โ more clustersLarger
kโ denser graph โ fewer clusters
resolution_parameter
Controls cluster granularity.
Lower resolution โ larger clusters
Higher resolution โ smaller clusters
This is one of the most important tuning parameters in Leiden.
๐ฌ t-SNE Visualization
A common practice is:
- Generate embeddings
- Run Leiden clustering
- Use t-SNE only for visualization
Important:
t-SNE is NOT the clustering algorithm itself.
It only projects high-dimensional embeddings into 2D for visualization.
๐ Recap
Leiden Clustering is one of the most effective modern clustering techniques for embedding-based AI systems.
Instead of assuming geometric cluster shapes, Leiden discovers communities through graph connectivity - making it especially powerful for:
- semantic search
- observability systems
- recommendation engines
- retrieval pipelines
- large-scale representation learning
๐ Coming Next
Next in this graph clustering subseries:
Louvain vs. Leiden โ understanding community detection quality, modularity optimization, and clustering stability.
Stay curious and keep exploring ๐