Balanced Subsampling Service User Manual
Introduction
This service allows you to reduce the number of rows in a dataset through stratified and balanced subsampling, ensuring that all classes are present proportionately. It is useful for:
- handling unbalanced datasets;
- reducing dataset size for rapid prototyping;
- constructing consistent subsets for validation.
Service Features
1. Balanced Subsampling
The service uses RandomUnderSampler from the imblearn (imbalanced-learn) library to undersample the dataset:
- the classes are balanced against each other;
- each class is represented by at least the specified number of elements;
- if requested, the total number of rows is capped at a specified maximum.
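The three rules above can be sketched in plain Python. This is only an illustration, not the service's code: the function name balanced_subsample is hypothetical, the standard library stands in for imblearn, and the way max_samples is split evenly across classes is an assumption about how the limit interacts with balancing.

```python
import random
from collections import defaultdict


def balanced_subsample(rows, label_column, max_samples=None,
                       min_per_class=1, random_state=None):
    """Return a class-balanced random subset of `rows` (a list of dicts)."""
    # Group the rows by their label value.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_column]].append(row)

    # Balanced undersampling: every class is cut down to the size of the rarest one.
    per_class = min(len(group) for group in by_class.values())

    # Assumption: the max_samples budget is shared evenly among the classes.
    if max_samples is not None:
        per_class = min(per_class, max_samples // len(by_class))

    # The minimum representation per class must still be achievable.
    if per_class < min_per_class:
        raise ValueError("cannot keep min_per_class rows for every class")

    rng = random.Random(random_state)
    sample = []
    for label in sorted(by_class, key=str):
        sample.extend(rng.sample(by_class[label], per_class))
    return sample
```

With 90 rows of class "a", 10 rows of class "b" and max_samples=12, this sketch keeps 6 rows of each class.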
2. Main parameters
- labelColumn: name of the column containing the labels (required);
- max_samples: maximum total number of rows to return (optional);
- min_per_class: minimum number of rows to keep for each class (default: 1);
- random_state: seed for reproducibility (optional).
3. Produced Asset
After execution, the service produces a distribuzione_dati asset (visible in the "application media" section of the BDA Application during or after execution) with the number of rows selected for each class in the subsampled dataset.
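The exact format of the distribuzione_dati asset is not described here. As a rough equivalent, the same per-class counts can be computed from the output rows as follows; the helper name class_distribution and the list-of-dicts row format are assumptions made for the example.

```python
from collections import Counter


def class_distribution(rows, label_column):
    """Count how many rows of each class appear in `rows` (a list of dicts)."""
    return dict(Counter(row[label_column] for row in rows))
```

Comparing these counts with the asset is a quick way to confirm that the output dataset is the one the run produced.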
Use of the Service
- This service is suitable for reducing large datasets or rebalancing classes before training.
1. Loading the Dataset
- Select the dataset to be subsampled as input.
- Specify the name of the target column in the labelColumn field.
2. Optional Parameters
- Set max_samples to cap the number of rows in the output;
- Set min_per_class to control the minimum representation of each class;
- Set random_state to obtain repeatable results.
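The effect of a fixed seed can be shown with a small standard-library sketch. It does not call the service; pick_rows is a hypothetical helper that only demonstrates that the same seed selects the same rows on every run.

```python
import random


def pick_rows(row_ids, k, random_state=None):
    """Pick k row ids at random; a fixed random_state makes the pick repeatable."""
    rng = random.Random(random_state)
    return rng.sample(row_ids, k)


ids = list(range(1000))
first = pick_rows(ids, 5, random_state=42)
second = pick_rows(ids, 5, random_state=42)
# Both calls use the same seed, so they return the same rows in the same order.
assert first == second
```

Leaving random_state unset means each run may select a different subset.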
3. Execution
- Save the BDA Application
- Click on RUN
The resulting dataset will contain the selected rows and will be accessible through the resources.
What to Check
- That the labelColumn column exists and has at least two distinct classes;
- That max_samples is large enough to guarantee min_per_class for each class;
- That the distribuzione_dati asset has been produced.
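The first two checks can be run on the input before the application is started. The sketch below is a hypothetical pre-flight helper, not part of the service; it assumes the dataset is available as a list of dicts and that "large enough" means max_samples is at least min_per_class times the number of classes. The third check can only be done in the application after the run.

```python
def check_inputs(rows, label_column, max_samples=None, min_per_class=1):
    """Return a list of problems found; an empty list means the checks passed."""
    if not rows:
        return ["the dataset is empty"]
    if any(label_column not in row for row in rows):
        return [f"column {label_column!r} is missing from the dataset"]

    problems = []
    classes = {row[label_column] for row in rows}
    if len(classes) < 2:
        problems.append("the label column has fewer than two distinct classes")
    # Assumed rule: the row budget must cover min_per_class rows for every class.
    if max_samples is not None and max_samples < min_per_class * len(classes):
        problems.append("max_samples is too small to keep min_per_class rows per class")
    return problems
```

For example, with two classes and min_per_class set to 2, any max_samples below 4 is reported as too small.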