Sklearn Model Training Service User Manual
Introduction
This service trains a supervised classification or regression model using the scikit-learn library. It is possible to select the desired algorithm (e.g. RandomForestClassifier) and configure its main hyperparameters. The service includes an automatic preprocessing pipeline and tracks all results with MLflow.
Service Features
1. Model Training
The service builds a pipeline composed of:
- Automatic Preprocessing:
- Imputation of missing values (mean for numeric, mode for categorical).
- Standardization of numerical features.
- One-hot encoding of categorical features.
- Supervised algorithm, dynamically configurable (classifier or regressor).
For classification, the split is automatically stratified; for regression, a standard split is performed.
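The pipeline described above could be sketched roughly as follows. This is a minimal illustration, not the service's actual implementation: the toy dataset, the column names, the 25% test size, the `is_classification` variable, and the choice of `handle_unknown="ignore"` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy dataset (invented for illustration) with missing values in both column types.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38, 29, 55, 47],
    "city": ["Rome", "Milan", "Rome", np.nan, "Turin", "Milan", "Rome", "Turin"],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})
label_column = "label"  # plays the role of the labelColumn parameter
X = df.drop(columns=[label_column])
y = df[label_column]

# Split feature columns by type so each group gets its own preprocessing.
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in X.columns if c not in numeric_cols]

preprocessor = ColumnTransformer([
    # Numeric features: mean imputation, then standardization.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical features: mode imputation, then one-hot encoding.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),  # assumption
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("estimator", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Stratify on the target only for classification; plain split for regression.
is_classification = True  # hypothetical switch for this sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0,
    stratify=y if is_classification else None,
)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Keeping the preprocessing inside the same Pipeline object as the estimator means the imputation and scaling statistics are learned from the training split only and are re-applied unchanged at prediction time.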
2. Tracking Results
All model information, including hyperparameters, metrics, and artifacts, is registered on MLflow. Tracked metrics include:
- For classification:
- Accuracy
- F1 Score
- ROC AUC
- For regression:
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
- R-squared (R²)
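As an illustration of how the metrics listed above can be computed, the sketch below uses the corresponding functions from `sklearn.metrics`. The helper names, the dictionary keys, the sample values, and the use of binary F1 averaging are assumptions for this example; the manual does not state how the service names or averages its metrics.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)


def classification_metrics(y_true, y_pred, y_score):
    """Metrics for a binary classifier; y_score is the positive-class score."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_score": f1_score(y_true, y_pred),  # binary averaging (assumption)
        "roc_auc": roc_auc_score(y_true, y_score),
    }


def regression_metrics(y_true, y_pred):
    """Metrics for a regressor."""
    return {
        "mse": mean_squared_error(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
    }


# Invented sample values, for illustration only.
clf_metrics = classification_metrics(
    y_true=[0, 0, 1, 1],
    y_pred=[0, 1, 1, 1],
    y_score=[0.1, 0.6, 0.7, 0.9],
)
reg_metrics = regression_metrics(
    y_true=[1.0, 2.0, 3.0, 4.0],
    y_pred=[1.5, 2.0, 2.5, 4.0],
)
```

Dictionaries of this shape can be passed to `mlflow.log_metrics` inside an active run, which is one plausible way such values end up in the MLflow dashboard.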
3. Model Exposure
Once training is complete, the model is automatically saved and can later be exposed via Seldon. A direct link to the model and dataset on the Data Analytics System is available via an MLflow tag.
Use of the Service
User Interface
The interface allows parameters to be configured via JSON and results to be viewed via the integrated MLflow dashboard.
1. Dataset Loading
It is possible to load a tabular dataset. The target column (label) must be specified using the labelColumn parameter.
2. Parameter Configuration
Configurable parameters include:
- labelColumn: target column name (required).
- class_algorithm: name of the sklearn algorithm to use (e.g. RandomForestClassifier, LinearRegression, etc.).
- algorithm_params: JSON dictionary with the hyperparameters to pass to the algorithm.
- A boolean flag that selects regression (true) or classification (false, default).
- MLFlowregisteredModelName: optional name under which to register the model on MLflow.
For the complete list of available algorithms, see the scikit-learn Supervised Models Documentation.
3. Training Start
After configuration, simply save the BDA Application and execute the RUN.
4. Viewing Results
Using the MLflow dashboard integrated into the BDA application, in the "experiments" section, it is possible to:
- View run parameters and metrics.
- Analyze graphs (ROC, residuals, predicted distributions).
- Download or inspect the saved model.