Document Clustering with LLM Embeddings in Scikit-learn - MachineLearningMastery.com
Key Points
- This article demonstrates how to cluster text documents using large language model embeddings, which capture contextual semantics more effectively than traditional methods like TF-IDF or Word2Vec.
- It details a step-by-step process in Python, including generating 384-dimensional embeddings with a pre-trained SentenceTransformer model and applying scikit-learn's K-Means and DBSCAN algorithms.
- The results, evaluated against ground-truth labels using metrics like the Adjusted Rand Index and visualized with PCA, show K-Means performing strongly on a BBC News dataset, while DBSCAN proves sensitive to hyperparameters and dimensionality.
This article presents a methodology for clustering text documents using large language model (LLM) embeddings and standard scikit-learn clustering algorithms. It addresses the problem of grouping unclassified documents by topic, highlighting the limitations of traditional methods like TF-IDF (which ignores meaning and context) and Word2Vec (which primarily models individual word relationships rather than full document context). The core argument is that modern LLM-based embeddings, specifically sentence transformer models, are superior because they capture contextual semantics and encode overall document-level meaning, leveraging general language knowledge from pre-training on massive text corpora.
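The TF-IDF limitation described above is easy to demonstrate: two sentences that paraphrase each other but share no vocabulary receive zero TF-IDF similarity. A minimal sketch (the example sentences are illustrative, not from the article):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two paraphrases with no shared vocabulary, plus an unrelated sentence.
docs = [
    "The film was excellent",
    "A truly great movie",
    "Interest rates rose sharply",
]

tfidf = TfidfVectorizer().fit_transform(docs)
sims = cosine_similarity(tfidf)

# The two movie sentences share no tokens, so TF-IDF treats them
# as completely unrelated.
print(sims[0, 1])  # 0.0
```

A sentence-embedding model, by contrast, would typically score the two movie sentences as highly similar, since it encodes meaning rather than surface word overlap.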
The detailed methodology involves a step-by-step Python implementation:
- Data Loading: A BBC News dataset, containing 2,225 news articles with pre-assigned topic labels, is loaded. This dataset serves as a benchmark with known ground-truth categories for evaluation.
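A sketch of the loading step, assuming the common CSV distribution of the BBC News dataset with text and category columns (the two in-memory sample rows are illustrative stand-ins so the snippet runs anywhere):

```python
import pandas as pd

# Assumption: the dataset is a CSV with one row per article and columns
# "text" (article body) and "category" (ground-truth topic label).
# In practice you would call pd.read_csv() on the dataset file; a tiny
# in-memory stand-in is used here instead.
df = pd.DataFrame({
    "text": [
        "The central bank raised interest rates again this quarter.",
        "The striker scored twice in Saturday's cup final.",
    ],
    "category": ["business", "sport"],
})

print(df["category"].value_counts())
```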
- Embedding Generation:
  - A pre-trained `SentenceTransformer` model, `all-MiniLM-L6-v2`, is used to convert raw text documents into dense numerical vector representations (embeddings).
  - This model generates 384-dimensional embeddings, meaning each document is represented by 384 numeric values.
  - The `model.encode()` method is applied to the document texts, producing an embedding matrix in which semantically similar documents are positioned close to each other in the vector space.
- Clustering with K-Means:
  - The `sklearn.cluster.KMeans` algorithm is applied to the generated embeddings.
  - The number of clusters, `n_clusters`, is set to 5, leveraging the known number of ground-truth categories in the BBC News dataset.
  - Evaluation is performed against the ground-truth categories (converted to numerical labels using `LabelEncoder`) with two metrics:
    - Silhouette Score: measures how similar a point is to its own cluster compared with other clusters; higher values indicate better-defined clusters.
    - Adjusted Rand Index (ARI): measures agreement between two clusterings, corrected for chance; values range from -1 to 1, with 1 indicating perfect agreement with the ground truth.
  - K-Means achieved a Silhouette Score of approximately 0.066 and an ARI of approximately 0.899, indicating strong agreement with the ground-truth categories. The distribution of documents across the 5 clusters was also reported.
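The K-Means step and both metrics can be sketched end to end. Synthetic 384-dimensional blobs stand in for real document embeddings (an assumption made for runnability), so the silhouette here will be far higher than the ~0.066 reported on real embeddings:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Synthetic stand-in for document embeddings: 5 well-separated 384-D blobs
# mirror the 5 BBC News topics.
X, y_true = make_blobs(n_samples=500, n_features=384, centers=5, random_state=42)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Silhouette needs only the data and the assignments; ARI compares
# the assignments against the ground-truth labels.
print("Silhouette:", silhouette_score(X, labels))
print("ARI:", adjusted_rand_score(y_true, labels))
```

With real data, the string categories would first be converted with `LabelEncoder` as the article describes (though `adjusted_rand_score` also accepts string labels directly).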
- Clustering with DBSCAN:
  - As an alternative, the `sklearn.cluster.DBSCAN` algorithm is applied. DBSCAN is a density-based algorithm that determines the number of clusters automatically.
  - Instead of `n_clusters`, DBSCAN requires `eps` (the maximum distance between two samples for one to be considered in the neighborhood of the other) and `min_samples` (the number of samples in a neighborhood for a point to qualify as a core point).
  - For text embeddings, the `metric` parameter is set to `'cosine'`, which is often more suitable for high-dimensional text data than Euclidean distance.
  - DBSCAN also identifies "noise" points, which are not assigned to any cluster (labeled -1).
  - Evaluation mirrors the K-Means setup, additionally reporting the number of clusters found and the number of noise documents.
  - With default settings, DBSCAN performed significantly worse than K-Means on this dataset (e.g., a much lower ARI), attributed to its sensitivity to hyperparameters and to the "curse of dimensionality" for the 384-dimensional embeddings.
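A runnable sketch of the DBSCAN step with `metric='cosine'`. The `eps` and `min_samples` values are illustrative guesses, not the article's settings, and synthetic blobs again stand in for real embeddings:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Synthetic stand-in embeddings: 3 blobs in 50 dimensions.
X, _ = make_blobs(n_samples=300, n_features=50, centers=3, random_state=0)

# With metric="cosine", eps is a cosine-distance threshold
# (0 = identical direction, 1 = orthogonal); values are untuned.
db = DBSCAN(eps=0.3, min_samples=5, metric="cosine").fit(X)

labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

On well-separated low-dimensional blobs these settings recover the clusters; on real 384-dimensional embeddings, as the article notes, `eps` typically needs careful tuning.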
- Visualization:
- Principal Component Analysis (PCA) is used to reduce the 384-dimensional embeddings to 2 dimensions (`n_components=2`) for visual comparison.
- Scatter plots are generated to visualize the document distribution based on: 1) True Categories, 2) K-Means assignments, and 3) DBSCAN assignments, allowing for a qualitative assessment of clustering performance.
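The visualization step can be sketched as follows; synthetic blobs stand in for the embeddings, and the figure is written to a file rather than shown interactively (the filename is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Stand-in embeddings: 384-D points with 5 ground-truth groups.
X, y = make_blobs(n_samples=300, n_features=384, centers=5, random_state=42)

# Project the 384-D vectors onto their first two principal components.
X_2d = PCA(n_components=2).fit_transform(X)

# Color by true category; the same plot repeated with K-Means and
# DBSCAN labels gives the side-by-side comparison described above.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=10, cmap="tab10")
plt.title("True categories (PCA projection)")
plt.savefig("clusters_pca.png")
print(X_2d.shape)  # (300, 2)
```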
In conclusion, the article effectively demonstrates that LLM-based embeddings are highly suitable for document clustering. Combined with traditional clustering algorithms like K-Means, they can achieve high performance in grouping semantically similar documents. While K-Means performed well due to the dataset's clear topical structure, the study also highlighted the challenges of applying density-based methods like DBSCAN to high-dimensional LLM embeddings without extensive hyperparameter tuning. The overall workflow provides a robust method for unsupervised document organization.