Document Clustering with LLM Embeddings in Scikit-learn - MachineLearningMastery.com

Iván Palomares Carrascosa
2026.02.11
#Document Clustering #Embeddings #K-Means #LLM #Scikit-learn

Key Points

  • This article demonstrates how to cluster text documents using large language model embeddings, which capture contextual semantics more effectively than traditional methods like TF-IDF or Word2Vec.
  • It details a step-by-step process using Python, including generating 384-dimensional embeddings with a pre-trained SentenceTransformer model and applying scikit-learn's K-Means and DBSCAN algorithms.
  • The results, evaluated against ground-truth labels using metrics like the Adjusted Rand Index and visualized with PCA, show K-Means performing strongly on a BBC News dataset, while DBSCAN's sensitivity to hyperparameters and dimensionality is noted.

This article presents a methodology for clustering text documents using large language model (LLM) embeddings and standard scikit-learn clustering algorithms. It addresses the problem of grouping unlabeled documents by topic, highlighting the limitations of traditional methods like TF-IDF (which ignores meaning and context) and Word2Vec (which models relationships between individual words rather than full document context). The core argument is that modern LLM-based embeddings, specifically those from sentence transformer models, are superior because they capture contextual semantics and encode overall document-level meaning, leveraging general language knowledge from pre-training on massive text corpora.
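As a minimal sketch of why embeddings make clustering work (toy hand-made vectors, not the article's code): semantically similar documents map to nearby vectors, and that nearness can be measured with cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings:
# two "sports" documents point in similar directions; a "tech" one does not.
doc_sports_1 = [0.9, 0.1, 0.0]
doc_sports_2 = [0.8, 0.2, 0.1]
doc_tech     = [0.0, 0.2, 0.9]

print(cosine_similarity(doc_sports_1, doc_sports_2))  # high (near 1)
print(cosine_similarity(doc_sports_1, doc_tech))      # low (near 0)
```

Clustering algorithms exploit exactly this geometry: documents whose vectors point in similar directions end up in the same cluster.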

The detailed methodology involves a step-by-step Python implementation:

  1. Data Loading: A BBC News dataset, containing 2,225 news articles with pre-assigned topic labels, is loaded. This dataset serves as a benchmark with known ground-truth categories for evaluation.
  2. Embedding Generation:
    • A pre-trained SentenceTransformer model, all-MiniLM-L6-v2, is utilized to convert raw text documents into dense numerical vector representations (embeddings).
    • This model generates 384-dimensional embeddings, meaning each document is represented by 384 numeric values.
    • The model.encode() method is applied to the document texts with batch_size=32 for efficient processing, resulting in an embedding matrix where semantically similar documents are positioned close to each other in the vector space.
  3. Clustering with K-Means:
    • The sklearn.cluster.KMeans algorithm is applied to the generated embeddings.
    • The number of clusters, n_clusters, is set to 5, leveraging the known number of ground-truth categories in the BBC News dataset. Other parameters include random_state=42 and n_init=10.
    • Evaluation is performed against the ground-truth categories (converted to numerical labels using LabelEncoder) using two metrics:
      • Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters. A higher value indicates better-defined clusters.
      • Adjusted Rand Index (ARI): A measure of similarity between two data clusterings, correcting for chance. Values range from -1 to 1, with 1 indicating perfect agreement with the ground truth.
    • The K-Means results demonstrated a Silhouette Score of approximately 0.066 and an Adjusted Rand Index of approximately 0.899, indicating strong agreement with the ground-truth categories. The distribution of documents across the 5 clusters was also observed.
  4. Clustering with DBSCAN:
    • As an alternative, the sklearn.cluster.DBSCAN algorithm is applied. DBSCAN is a density-based clustering algorithm that automatically determines the number of clusters.
    • Instead of specifying n_clusters, DBSCAN requires eps (maximum distance between two samples for one to be considered as in the neighborhood of the other) and min_samples (number of samples in a neighborhood for a point to be considered as a core point).
    • For text embeddings, the metric parameter is set to 'cosine', which is often more suitable for high-dimensional text data than Euclidean distance. Example parameters used were eps=0.5 and min_samples=5.
    • DBSCAN also identifies "noise" points, which are not assigned to any cluster (labeled as -1).
    • Evaluation metrics are similar to K-Means, also reporting the number of clusters found and noise documents.
    • The article notes that DBSCAN's performance with default settings (e.g., a much lower ARI) was significantly worse than K-Means on this dataset, attributed to DBSCAN's sensitivity to hyperparameters and the "curse of dimensionality" affecting the 384-dimensional embeddings.
  5. Visualization:
    • Principal Component Analysis (PCA) is used to reduce the 384-dimensional embeddings to 2 dimensions (n_components=2) for visual comparison.
    • Scatter plots are generated to visualize the document distribution based on: 1) True Categories, 2) K-Means assignments, and 3) DBSCAN assignments, allowing for a qualitative assessment of clustering performance.
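The steps above can be sketched end to end. Since downloading a SentenceTransformer model is not practical here, synthetic 384-dimensional vectors from make_blobs stand in for the real embeddings; the scikit-learn side (K-Means, DBSCAN with a cosine metric, the two evaluation metrics, and PCA) follows the workflow the article describes. This is an illustrative sketch, not the article's exact code.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for the 384-d embedding matrix: 500 "documents", 5 topics.
X, y_true = make_blobs(n_samples=500, n_features=384, centers=5,
                       cluster_std=5.0, random_state=42)

# K-Means with the article's settings (n_clusters known from ground truth).
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
km_labels = kmeans.fit_predict(X)
print("K-Means ARI:", adjusted_rand_score(y_true, km_labels))
print("K-Means silhouette:", silhouette_score(X, km_labels))

# DBSCAN with cosine distance; eps/min_samples need tuning on real embeddings.
db_labels = DBSCAN(eps=0.5, min_samples=5, metric="cosine").fit_predict(X)
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print("DBSCAN clusters:", n_clusters, "noise points:", int(np.sum(db_labels == -1)))

# PCA to 2 components for the scatter plots (coordinates only; plotting omitted).
X_2d = PCA(n_components=2).fit_transform(X)
print("2-D shape for scatter plots:", X_2d.shape)
```

With real data, the synthetic matrix would simply be replaced by `model.encode(documents, batch_size=32)` from the all-MiniLM-L6-v2 SentenceTransformer; everything downstream stays the same.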

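A small aside on the Adjusted Rand Index used above (a toy example, not from the article): because clustering algorithms assign arbitrary label ids, the ARI compares only the groupings, and it corrects for agreement expected by chance, so degenerate assignments score near zero rather than looking deceptively good.

```python
from sklearn.metrics import adjusted_rand_score

# ARI is invariant to label permutation: only the grouping matters.
truth = [0, 0, 1, 1, 2, 2]
same_grouping_relabeled = [2, 2, 0, 0, 1, 1]  # identical groups, new ids
print(adjusted_rand_score(truth, same_grouping_relabeled))  # 1.0

# Lumping everything into one cluster scores 0.0: the raw agreement
# equals what chance alone would produce, and ARI corrects for that.
all_one_cluster = [0, 0, 0, 0, 0, 0]
print(adjusted_rand_score(truth, all_one_cluster))  # 0.0
```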
In conclusion, the article effectively demonstrates that LLM-based embeddings are highly suitable for document clustering. When combined with traditional clustering algorithms like K-Means, they can achieve high performance in grouping semantically similar documents. While K-Means performed well thanks to the dataset's clear topical structure, the study also highlighted the challenges of applying density-based methods like DBSCAN to high-dimensional LLM embeddings without careful hyperparameter tuning. The overall workflow provides a robust method for unsupervised document organization.
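That DBSCAN sensitivity is easy to reproduce. As a hedged illustration (again using synthetic 384-dimensional vectors as stand-ins for real embeddings, not the article's data), sweeping eps shows how abruptly the number of clusters DBSCAN finds can change:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Synthetic 384-d vectors standing in for real document embeddings.
X, _ = make_blobs(n_samples=500, n_features=384, centers=5,
                  cluster_std=5.0, random_state=42)

# Sweep eps: too small -> everything is noise; too large -> clusters merge.
results = {}
for eps in (0.1, 0.3, 0.5, 0.7, 0.9):
    labels = DBSCAN(eps=eps, min_samples=5, metric="cosine").fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

A common heuristic for picking eps is to plot each point's distance to its k-th nearest neighbor (k = min_samples) and look for the "elbow"; reducing dimensionality with PCA before DBSCAN can also soften the curse of dimensionality noted above.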