Balanced Subsampling Service User Manual

Introduction

This service reduces the number of rows in a dataset through stratified, balanced subsampling, while keeping every class represented in the result. It is useful for:

  • handling unbalanced datasets;
  • reducing dataset size for rapid prototyping;
  • constructing consistent subsets for validation.

Service Features

1. Balanced Subsampling

The service performs random undersampling with RandomUnderSampler from the imblearn (imbalanced-learn) library, so that:

  • the classes are balanced against one another;
  • each class keeps at least the specified minimum number of rows;
  • if requested, the total number of rows is capped at a specified maximum.

2. Main Parameters

  • labelColumn: name of the column containing the class labels (required);
  • max_samples: maximum total number of rows in the output (optional);
  • min_per_class: minimum number of rows to keep for each class (default: 1);
  • random_state: seed for reproducibility (optional).
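
One way to picture how max_samples and min_per_class might interact is the sketch below. It is an assumption, not the service's documented logic: the helper per_class_target is hypothetical, and it supposes that the balanced size per class is the smallest class count, lowered if needed so that the total stays within max_samples.

```python
def per_class_target(class_counts, max_samples=None, min_per_class=1):
    """Hypothetical helper: rows to keep per class under the stated assumptions."""
    n_classes = len(class_counts)
    # Balanced undersampling cannot keep more rows than the smallest class has.
    target = min(class_counts.values())
    if max_samples is not None:
        # Split the global cap evenly across classes (assumed behaviour).
        target = min(target, max_samples // n_classes)
    if target < min_per_class:
        raise ValueError(
            f"cannot keep {min_per_class} rows per class; at most {target} available"
        )
    return target


counts = {"a": 10, "b": 4, "c": 6}
print(per_class_target(counts))                                  # 4
print(per_class_target(counts, max_samples=9, min_per_class=2))  # 3
```

Under these assumptions, a max_samples of 3 with a min_per_class of 2 would be rejected, because three classes cannot each keep two rows within a total of three.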

3. Output Asset

The service produces an asset named distribuzione_dati (Italian for "data distribution"), which lists the number of rows selected for each class in the subsampled dataset. The asset is visible in the "application media" section of the BDA Application during or after execution.
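
The information carried by this asset, a row count per class, can be reproduced on any list of labels with a few lines. This is only a sketch of the idea; the actual format of the distribuzione_dati asset is not specified here, and the labels below are made up.

```python
from collections import Counter

# Labels of a subsampled dataset (illustrative values only).
labels = ["a", "b", "a", "c", "b", "a"]

# Number of selected rows for each class.
distribution = Counter(labels)
for cls, n_rows in sorted(distribution.items()):
    print(f"{cls}: {n_rows}")
```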

Using the Service

This service is suitable for reducing large datasets or rebalancing classes before training.

1. Loading the Dataset

  • Select the dataset to be subsampled as input.
  • Specify the name of the target column in the labelColumn field.

2. Optional Parameters

  • Set max_samples to cap the number of rows in the output;
  • Set min_per_class to control the minimum representation of each class;
  • Set random_state to obtain repeatable results.
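
The effect of a fixed seed can be shown with a small standard-library sketch of random undersampling. The helper undersample and the row layout are hypothetical and stand in for the service's internals; the point is only that the same seed yields the same selection.

```python
import random
from collections import defaultdict


def undersample(rows, label_key, n_per_class, seed=None):
    """Randomly keep n_per_class rows of each class (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    selected = []
    for label in sorted(by_class):  # fixed class order keeps the output stable
        selected.extend(rng.sample(by_class[label], n_per_class))
    return selected


# 8 rows of class "a" followed by 4 rows of class "b".
rows = [{"id": i, "label": "a" if i < 8 else "b"} for i in range(12)]
first = undersample(rows, "label", n_per_class=3, seed=42)
second = undersample(rows, "label", n_per_class=3, seed=42)
print(first == second)  # True: same seed, same rows
```

Leaving the seed unset would generally give a different selection on each run.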

3. Execution

  • Save the BDA Application;
  • Click RUN.

The resulting dataset contains the selected rows and is accessible through the application's resources.

What to Check

  • That the labelColumn column exists and contains at least two distinct classes;
  • That max_samples is large enough to guarantee min_per_class rows for each class, that is, at least min_per_class multiplied by the number of classes;
  • That the distribuzione_dati asset has been produced.
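
The first two checks can be run on the input before launching the service. The sketch below assumes the dataset is available as a list of dictionaries; the helper preflight_problems is hypothetical and is not part of the service.

```python
def preflight_problems(rows, label_column, max_samples=None, min_per_class=1):
    """Return a list of problems found; an empty list means the checks passed."""
    if not rows:
        return ["dataset is empty"]
    if any(label_column not in row for row in rows):
        return [f"column '{label_column}' is missing from the dataset"]

    problems = []
    classes = {row[label_column] for row in rows}
    if len(classes) < 2:
        problems.append("fewer than two distinct classes")

    # Keeping min_per_class rows in every class needs this many rows in total.
    needed = min_per_class * len(classes)
    if max_samples is not None and max_samples < needed:
        problems.append(
            f"max_samples={max_samples} is below the {needed} rows needed"
        )
    return problems


rows = [{"label": "a"}, {"label": "b"}, {"label": "c"}]
print(preflight_problems(rows, "label", max_samples=6, min_per_class=2))  # []
print(preflight_problems(rows, "label", max_samples=5, min_per_class=2))
```

The third check, the presence of the distribuzione_dati asset, can only be confirmed in the BDA Application after the run.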
