Document Clustering with LLM Embeddings in Scikit-learn - MachineLearningMastery.com
Key Points
- This article demonstrates how to cluster text documents using large language model embeddings, which capture contextual semantics more effectively than traditional methods like TF-IDF or Word2Vec.
- It details a step-by-step process in Python, including generating 384-dimensional embeddings with a pre-trained SentenceTransformer model and applying scikit-learn's K-Means and DBSCAN algorithms.
- The results, evaluated against ground-truth labels using metrics like the Adjusted Rand Index and visualized with PCA, show K-Means performing strongly on a BBC News dataset, while DBSCAN proves sensitive to hyperparameters and dimensionality.
This article presents a methodology for clustering text documents using large language model (LLM) embeddings and standard scikit-learn clustering algorithms. It addresses the problem of grouping unclassified documents by topic, highlighting the limitations of traditional methods like TF-IDF (which ignores meaning and context) and Word2Vec (which primarily models individual word relationships rather than full document context). The core argument is that modern LLM-based embeddings, specifically sentence transformer models, are superior because they capture contextual semantics and encode overall document-level meaning, leveraging general language knowledge from pre-training on massive text corpora.
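The TF-IDF limitation described above is easy to demonstrate: two sentences that paraphrase each other but share no vocabulary receive zero TF-IDF similarity. A minimal sketch (the example sentences are illustrative, not from the article):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two paraphrases with no shared vocabulary, plus an unrelated sentence.
docs = [
    "The film was excellent",
    "A truly great movie",
    "Interest rates rose sharply",
]

tfidf = TfidfVectorizer().fit_transform(docs)
sims = cosine_similarity(tfidf)

# The two movie sentences share no tokens, so TF-IDF treats them
# as completely unrelated.
print(sims[0, 1])  # 0.0
```

A sentence-embedding model, by contrast, would typically score the two movie sentences as highly similar, since it encodes meaning rather than surface word overlap.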
The detailed methodology involves a step-by-step Python implementation:
- Data Loading: A BBC News dataset, containing 2,225 news articles with pre-assigned topic labels, is loaded. This dataset serves as a benchmark with known ground-truth categories for evaluation.
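A sketch of the loading step, assuming the common CSV distribution of the BBC News dataset with text and category columns (the two in-memory sample rows are illustrative stand-ins so the snippet runs anywhere):

```python
import pandas as pd

# Assumption: the dataset is a CSV with one row per article and columns
# "text" (article body) and "category" (ground-truth topic label).
# In practice you would call pd.read_csv() on the dataset file; a tiny
# in-memory stand-in is used here instead.
df = pd.DataFrame({
    "text": [
        "The central bank raised interest rates again this quarter.",
        "The striker scored twice in Saturday's cup final.",
    ],
    "category": ["business", "sport"],
})

print(df["category"].value_counts())
```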
- Embedding Generation:
  - A pre-trained `SentenceTransformer` model, `all-MiniLM-L6-v2`, is used to convert raw text documents into dense numerical vector representations (embeddings).
  - This model generates 384-dimensional embeddings, meaning each document is represented by 384 numeric values.
  - The `model.encode()` method is applied to the document texts, producing an embedding matrix in which semantically similar documents are positioned close to each other in the vector space.
- Clustering with K-Means:
  - The `sklearn.cluster.KMeans` algorithm is applied to the generated embeddings.
  - The number of clusters, `n_clusters`, is set to 5, leveraging the known number of ground-truth categories in the BBC News dataset.
  - Evaluation is performed against the ground-truth categories (converted to numerical labels using `LabelEncoder`) with two metrics:
    - Silhouette Score: measures how similar a point is to its own cluster compared with other clusters; higher values indicate better-defined clusters.
    - Adjusted Rand Index (ARI): measures agreement between two clusterings, corrected for chance; values range from -1 to 1, with 1 indicating perfect agreement with the ground truth.
  - K-Means achieved a Silhouette Score of approximately 0.066 and an ARI of approximately 0.899, indicating strong agreement with the ground-truth categories. The distribution of documents across the 5 clusters was also reported.
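The K-Means step and both metrics can be sketched end to end. Synthetic 384-dimensional blobs stand in for real document embeddings (an assumption made for runnability), so the silhouette here will be far higher than the ~0.066 reported on real embeddings:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Synthetic stand-in for document embeddings: 5 well-separated 384-D blobs
# mirror the 5 BBC News topics.
X, y_true = make_blobs(n_samples=500, n_features=384, centers=5, random_state=42)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Silhouette needs only the data and the assignments; ARI compares
# the assignments against the ground-truth labels.
print("Silhouette:", silhouette_score(X, labels))
print("ARI:", adjusted_rand_score(y_true, labels))
```

With real data, the string categories would first be converted with `LabelEncoder` as the article describes (though `adjusted_rand_score` also accepts string labels directly).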
- Clustering with DBSCAN:
  - As an alternative, the `sklearn.cluster.DBSCAN` algorithm is applied. DBSCAN is a density-based algorithm that determines the number of clusters automatically.
  - Instead of `n_clusters`, DBSCAN requires `eps` (the maximum distance between two samples for one to be considered in the neighborhood of the other) and `min_samples` (the number of samples in a neighborhood for a point to qualify as a core point).
  - For text embeddings, the `metric` parameter is set to `'cosine'`, which is often more suitable for high-dimensional text data than Euclidean distance.
  - DBSCAN also identifies "noise" points, which are not assigned to any cluster (labeled -1).
  - Evaluation mirrors the K-Means setup, additionally reporting the number of clusters found and the number of noise documents.
  - With default settings, DBSCAN performed significantly worse than K-Means on this dataset (e.g., a much lower ARI), attributed to its sensitivity to hyperparameters and to the "curse of dimensionality" for the 384-dimensional embeddings.
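A runnable sketch of the DBSCAN step with `metric='cosine'`. The `eps` and `min_samples` values are illustrative guesses, not the article's settings, and synthetic blobs again stand in for real embeddings:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Synthetic stand-in embeddings: 3 blobs in 50 dimensions.
X, _ = make_blobs(n_samples=300, n_features=50, centers=3, random_state=0)

# With metric="cosine", eps is a cosine-distance threshold
# (0 = identical direction, 1 = orthogonal); values are untuned.
db = DBSCAN(eps=0.3, min_samples=5, metric="cosine").fit(X)

labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

On well-separated low-dimensional blobs these settings recover the clusters; on real 384-dimensional embeddings, as the article notes, `eps` typically needs careful tuning.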
- Visualization:
- Principal Component Analysis (PCA) is used to reduce the 384-dimensional embeddings to 2 dimensions (`n_components=2`) for visual comparison.
- Scatter plots are generated to visualize the document distribution based on: 1) True Categories, 2) K-Means assignments, and 3) DBSCAN assignments, allowing for a qualitative assessment of clustering performance.
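The visualization step can be sketched as follows; synthetic blobs stand in for the embeddings, and the figure is written to a file rather than shown interactively (the filename is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Stand-in embeddings: 384-D points with 5 ground-truth groups.
X, y = make_blobs(n_samples=300, n_features=384, centers=5, random_state=42)

# Project the 384-D vectors onto their first two principal components.
X_2d = PCA(n_components=2).fit_transform(X)

# Color by true category; the same plot repeated with K-Means and
# DBSCAN labels gives the side-by-side comparison described above.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=10, cmap="tab10")
plt.title("True categories (PCA projection)")
plt.savefig("clusters_pca.png")
print(X_2d.shape)  # (300, 2)
```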
In conclusion, the article effectively demonstrates that LLM-based embeddings are highly suitable for document clustering. Combined with traditional clustering algorithms like K-Means, they can achieve high performance in grouping semantically similar documents. While K-Means performed well due to the dataset's clear topical structure, the study also highlighted the challenges of applying density-based methods like DBSCAN to high-dimensional LLM embeddings without extensive hyperparameter tuning. The overall workflow provides a robust method for unsupervised document organization.