Sklearn Model Training Service User Manual

Introduction

This service trains a supervised classification or regression model using the scikit-learn library. You can select the desired algorithm (e.g. RandomForestClassifier) and configure its main hyperparameters. The service includes an automatic preprocessing pipeline and tracks all results with MLflow.

Service Features

1. Model Training

The service builds a pipeline composed of:

Automatic Preprocessing:
Imputation of missing values (mean for numeric, mode for categorical).
Standardization of numerical features.
One-hot encoding of categorical features.
A dynamically configurable supervised algorithm (classifier or regressor).
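
The pipeline described above can be approximated with standard scikit-learn components. The following is a minimal sketch, not the service's actual code: the helper name build_pipeline, the step names, the toy data, and the choice of handle_unknown="ignore" are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(X, estimator):
    """Hypothetical helper: preprocessing plus estimator for a pandas DataFrame."""
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

    # Numeric features: mean imputation, then standardization.
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ])
    # Categorical features: mode imputation, then one-hot encoding.
    # handle_unknown="ignore" is an assumption, not documented behaviour.
    categorical_steps = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric_steps, numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("model", estimator)])


# Toy data (invented) with one missing value per column type.
X_demo = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 31.0, 52.0, 38.0],
    "city": ["Rome", "Milan", np.nan, "Rome", "Milan", "Rome"],
})
y_demo = [0, 1, 1, 0, 1, 0]

pipeline_demo = build_pipeline(
    X_demo, RandomForestClassifier(n_estimators=10, random_state=0)
)
pipeline_demo.fit(X_demo, y_demo)
predictions_demo = pipeline_demo.predict(X_demo)
```

Keeping preprocessing and estimator in a single Pipeline means the imputation and scaling statistics are learned only from the training data and reapplied unchanged at prediction time.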

For classification, the train/test split is automatically stratified on the target column; for regression, a standard (non-stratified) split is performed.
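
As an illustration of the two split modes, the sketch below uses scikit-learn's train_test_split. The helper name split_dataset, the regression flag name, and the fixed random_state are assumptions. The 80/20 ratio follows the default stated later in this manual.

```python
from sklearn.model_selection import train_test_split


def split_dataset(X, y, regression=False, test_size=0.2, random_state=42):
    """Hypothetical helper: stratify on y for classification only."""
    stratify = None if regression else y
    return train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=stratify
    )


# 20 toy rows with two balanced classes.
X_rows = [[i] for i in range(20)]
y_labels = [0] * 10 + [1] * 10
X_train, X_test, y_train, y_test = split_dataset(X_rows, y_labels)
```

With stratification, the 4 test rows contain 2 samples of each class, mirroring the class balance of the full dataset.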

2. Tracking Results

All model information, including hyperparameters, metrics, and artifacts, is logged to MLflow. Tracked metrics include:

For classification:
Accuracy
F1 Score
ROC AUC
For regression:
Mean Squared Error (MSE)
Mean Absolute Error (MAE)
R-squared (R²)
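
The metrics listed above correspond to functions in sklearn.metrics. The sketch below shows how they could be computed. The dictionary keys and helper names are assumptions, and the classification example covers only the binary case, since this manual does not specify how the service averages F1 or ROC AUC for multiclass targets. A dictionary like this could then be passed to mlflow.log_metrics.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)


def classification_metrics(y_true, y_pred, y_score):
    """Binary case; y_score is the predicted probability of the positive class."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
    }


def regression_metrics(y_true, y_pred):
    return {
        "mse": mean_squared_error(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
    }


# Toy values (invented) to exercise both helpers.
clf_metrics = classification_metrics([0, 0, 1, 1], [0, 1, 1, 1], [0.1, 0.6, 0.7, 0.9])
reg_metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```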

3. Model Exposure

Once training is complete, the model is automatically saved and can later be exposed via Seldon. A direct link to the model and dataset on the Data Analytics System is available via an MLflow tag.

Use of the Service

User Interface

The interface lets you configure parameters via JSON and view the results in the integrated MLflow dashboard.

1. Dataset Loading

It is possible to load a tabular dataset. The target column (label) must be specified using the labelColumn parameter.
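
Conceptually, the labelColumn parameter separates the target from the feature columns. The sketch below illustrates this with pandas. The helper name, the inline CSV, and the column names are invented for the example and do not reflect how the service reads datasets internally.

```python
import io

import pandas as pd


def split_features_and_label(df, label_column):
    """Hypothetical helper: return (features, target) given the label column name."""
    if label_column not in df.columns:
        raise ValueError(f"labelColumn '{label_column}' not found in dataset columns")
    return df.drop(columns=[label_column]), df[label_column]


# Tiny inline CSV standing in for a real tabular dataset.
csv_text = "age,city,churn\n25,Rome,0\n47,Milan,1\n"
dataset = pd.read_csv(io.StringIO(csv_text))
features, label = split_features_and_label(dataset, "churn")
```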

2. Parameter Configuration

Configurable parameters include:

labelColumn: target column name (required).
class_algorithm: name of the sklearn algorithm to use (e.g. RandomForestClassifier, LinearRegression, etc.).
algorithm_params: JSON dictionary with the hyperparameters to pass to the algorithm. For example:

{
  "n_estimators": 100,
  "max_depth": null,
  "class_weight": "balanced"
}

train_test_split_args: parameters for splitting the dataset (default: 80/20 split, stratified for classification).
Regression: boolean indicating whether the task is regression (true) or classification (false, default).
MLFlowregisteredModelName: optional name under which to register the model on MLflow.

For the complete list of available algorithms, see the scikit-learn Supervised Models documentation.

3. Training Start

After configuration, simply save the BDA Application and execute the RUN.

4. Viewing Results

Using the MLflow dashboard integrated into the BDA application, in the "experiments" section, it is possible to:

View run parameters and metrics.
Analyze graphs (ROC, residuals, prediction distributions).
Download or inspect the saved model.