User Manual for the KMeansTrainer Clustering Service

Introduction

The KMeansTrainer service applies an unsupervised clustering algorithm (KMeans) to a tabular dataset. The service is generic and works with any mixed dataset (numerical and categorical columns); it preprocesses the data automatically and generates explanatory charts.

Clustering allows you to identify groups (clusters) within data without needing to specify a target variable.

Service Features

1. Clustering with Automatic Preprocessing

The service builds an sklearn pipeline consisting of:

  • Preprocessing:
      • Imputation of missing values (mean for numeric features, the constant "missing" for categorical features).
      • Removal of constant or completely empty columns.
      • Standardization of numeric features.
      • One-hot encoding of categorical features.

  • Clustering:
      • KMeans model with a customizable number of clusters and parameters.
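
The steps listed above can be sketched with standard scikit-learn components. This is a minimal illustration, not the service's actual source: the function names build_pipeline and drop_uninformative_columns, the step names, and the choice to drop columns before the pipeline rather than inside it are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def drop_uninformative_columns(df):
    """Drop columns that are constant or completely empty (hypothetical helper)."""
    # With dropna=False an all-NaN column counts as one distinct value,
    # so both constant and empty columns fail the "> 1" check.
    keep = [c for c in df.columns if df[c].nunique(dropna=False) > 1]
    return df[keep]


def build_pipeline(n_clusters, kmeans_kwargs=None):
    """Assemble preprocessing + KMeans (hypothetical helper, assumed layout)."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, make_column_selector(dtype_include=np.number)),
        ("cat", categorical, make_column_selector(dtype_exclude=np.number)),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("kmeans", KMeans(n_clusters=n_clusters, **(kmeans_kwargs or {}))),
    ])
```

Calling build_pipeline(3).fit_predict(drop_uninformative_columns(df)) on a pandas DataFrame df would return one cluster label per row.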

2. Results Production

The service generates:

  • An enriched dataset with a cluster column indicating the assignment of each row.
  • A serialized model (pipeline.pkl) containing the entire pipeline (preprocessing + clustering), ready to be reused.
  • Explanatory charts based on the first three principal components (PCA):
      • PCA Component 1 vs 2
      • PCA Component 1 vs 3
      • PCA Component 2 vs 3

  • Cluster cohesion metrics:
      • Silhouette Score
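
To show how these outputs relate to each other, the sketch below computes a silhouette score and the three pairwise PCA projections from a preprocessed matrix. It assumes X is a dense numeric array (a sparse matrix would first need .toarray()) and that labels holds the cluster assignments; evaluate_clusters is a hypothetical name, and the service's own plotting code is not shown in this manual.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def evaluate_clusters(X, labels):
    """Return the silhouette score and the three 2-D PCA projections."""
    # Silhouette ranges from -1 to 1; higher values mean tighter,
    # better separated clusters.
    score = silhouette_score(X, labels)

    # Project onto the first three principal components.
    components = PCA(n_components=3).fit_transform(X)

    # One 2-column array per chart: components (1, 2), (1, 3), (2, 3).
    pairs = [(0, 1), (0, 2), (1, 2)]
    projections = {
        f"PCA {i + 1} vs {j + 1}": components[:, [i, j]] for i, j in pairs
    }
    return score, projections
```

Each entry of projections can then be drawn as a scatter plot colored by labels, one chart per component pair.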

3. Logging and Visualization

The service uses the genericoutput module to:

  • Log the value of the silhouette score.
  • Send images (Picture) of the PCA projections to facilitate the visual assessment of cluster separability.

Use of the Service

User Interface

The service can be configured through the Data Analytics System graphical interface or compose, by setting:

1. Input Dataset

  • Tabular dataset containing numerical and/or categorical columns.

2. Main parameters

  • n_clusters: desired number of clusters (e.g. 3, 4, ...).
  • kmeans_kwargs: optional parameters in JSON format to customize the behavior of the KMeans algorithm. For example:
{
  "init": "k-means++",
  "max_iter": 300,
  "random_state": 42
}
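
As an illustration of how such a JSON value could reach the estimator, the sketch below parses it and forwards the keys to scikit-learn's KMeans constructor. The helper name make_kmeans and the validation rules are assumptions for this example; the manual does not describe how the service itself parses kmeans_kwargs.

```python
import json

from sklearn.cluster import KMeans


def make_kmeans(n_clusters, kmeans_kwargs_json="{}"):
    """Build a KMeans estimator from n_clusters plus a JSON string of options."""
    kwargs = json.loads(kmeans_kwargs_json)
    if not isinstance(kwargs, dict):
        raise ValueError("kmeans_kwargs must be a JSON object")
    if "n_clusters" in kwargs:
        # Avoid passing the same argument twice to the constructor.
        raise ValueError("set n_clusters through its own parameter")
    return KMeans(n_clusters=n_clusters, **kwargs)


model = make_kmeans(
    3, '{"init": "k-means++", "max_iter": 300, "random_state": 42}'
)
```

Setting random_state makes the centroid initialization reproducible, so repeated runs on the same data produce the same clusters.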

3. Output

  • Output dataset name
  • Name of the saved model (folder containing pipeline.pkl)
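
Because the saved model contains both preprocessing and clustering, it can be reloaded later to assign clusters to new rows. The sketch below assumes pipeline.pkl was written with Python's standard pickle module (if it was written with joblib, joblib.load would be the counterpart); load_pipeline and assign_clusters are hypothetical helper names.

```python
import pickle
from pathlib import Path

import pandas as pd


def load_pipeline(model_dir):
    """Load pipeline.pkl from the saved-model folder (assumes a pickle file)."""
    # Only unpickle files from a trusted source: loading a pickle
    # can execute arbitrary code.
    with open(Path(model_dir) / "pipeline.pkl", "rb") as fh:
        return pickle.load(fh)


def assign_clusters(pipeline, new_rows: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of new_rows with a 'cluster' column from the pipeline."""
    return new_rows.assign(cluster=pipeline.predict(new_rows))
```

The new rows should have the same columns as the training dataset, and the scikit-learn version used for loading should match the one used for training.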

4. Visualization

PCA charts are generated automatically if the preprocessed dataset has at least 3 dimensions.
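
The condition exists because a three-component PCA cannot be computed from fewer than three features. A guard of the following kind would express it; pca_projection_or_none is a hypothetical name, and the extra check on the number of rows reflects scikit-learn's requirement that n_components not exceed min(n_samples, n_features), not something stated by the service.

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_projection_or_none(X, n_components=3):
    """Return the PCA projection, or None when X is too small for it."""
    X = np.asarray(X)
    # PCA needs at least n_components rows and n_components columns.
    if X.ndim != 2 or min(X.shape) < n_components:
        return None
    return PCA(n_components=n_components).fit_transform(X)
```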

Execution

After completing the setup, simply save the BDA Application and start the RUN. The output will be available in the resources section of the pipeline.

Useful References