Data validation service with Great Expectations

Service Description

This service performs automatic and customized validation of datasets using the Great Expectations library, generating a detailed report in JSON and HTML format. It allows the use of predefined expectation suites as well as the creation of new ones, automatically integrating expectations generated from dataset analysis.

For more details on Great Expectations, consult the official documentation: Documentazione Great Expectations

Service Features

Automatic expectation generation: The service analyzes the dataset to create automatic expectations (e.g., non-null values, numeric ranges, distinct values).
Union with user-defined expectations: Automatic expectations are combined with manually defined ones and with those contained in any preloaded suites.
Dataset validation: The dataset is validated against the set of final expectations.
Export of results:
expectation_suite.json: File containing all applied expectations, useful for reuse on new datasets.
validation_report.json: JSON file with the validation results for each expectation.
data_site.zip: ZIP archive containing a navigable HTML version of the report.

Use of the Service

1. Dataset loading

The dataset is loaded into the BDA application in the supported format (e.g. CSV). The service operates in Pandas mode.

2. Parameter Settings

Through the user interface, you can configure the following parameters:

expectations_to_define: Optional list of manually defined expectations in JSON format (Great Expectations format).

The parameter must contain an array of objects, each representing a specific expectation. Each object must have:

type: the name of the expectation in format snake_case or CamelCase, es. expect_column_values_to_be_between o ExpectColumnValuesToBeBetween
kwargs: an object with the parameters required by the expectation, for example:

[
  {
    "type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "age",
      "min_value": 18,
      "max_value": 99
    }
  },
  {
    "type": "expect_column_values_to_not_be_null",
    "kwargs": {
      "column": "email"
    }
  }
]

These expectations, if they are not already in the correct format, will be automatically converted to CamelCase and included in the final suite along with the automatic ones. In the event that an automatic expectation overlaps with one defined by the user, the manual one is always given priority.

suite_to_use: Pre-existing expectation suite (in JSON format) to reuse. If you have previously run this service, you can simply copy the content of the expectation_suite.json file generated in the application's media and paste it as the value of this parameter. WARNING: configuring a pre-existing suite will not generate automatic expectations.
suite_name: name for the new suite of expectations (default: my_expectation_suite).
data_source_name: Name of the Pandas data source used internally.
data_asset_name: Name of the asset to which the dataset is associated.
batch_def_name: Name of the batch definition.
definition_name: Name of the validation definition.
site_name: Name of the HTML site generated with the validation results.

3. Service Startup

Once the BDA application is configured and saved, you can run it: the service will analyze the dataset, automatically generate a series of expectations based on the data, combine them with any manual expectations or those from existing suites, perform validation, and produce the reports.

4. Output and Results

The service produces three main outputs:

expectation_suite.json: File with all the expectations actually used for validation.
validation_report.json: JSON-formatted result of the validation performed.
data_site.zip: Archive containing the navigable HTML version of the report. Warning: the folder extracted from the ZIP file must be decompressed with a directory structure that is not too deeply nested, as some browsers may block the opening of HTML pages if the path is too deep for security reasons.

Validation Result

The JSON report produced by the service includes, for each evaluated expectation, a detailed structure with the following elements:

success: Whether the expectation is met or not.
element_count: Number of validated items.
missing_count: Number of missing values.
unexpected_count: Values that do not meet expectations.
unexpected_percent: Error rate compared to the total.

Simplified example:

{
  "expectation_config": {
    "type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "anni_esperienza",
      "min_value": 0,
      "max_value": 40
    }
  },
  "success": false,
  "result": {
    "element_count": 200,
    "missing_count": 0,
    "unexpected_count": 5,
    "unexpected_percent": 2.5,
    "unexpected_percent_nonmissing": 2.5,
    "partial_unexpected_list": [-1, 45, 100],
    "partial_unexpected_counts": [
      {"value": -1, "count": 1},
      {"value": 45, "count": 2}
    ]
  }
}

{
  "expectation_config": {
    "type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "prezzo_unitario",
      "min_value": 10.0,
      "max_value": 100.0
    }
  },
  "success": false,
  "result": {
    "element_count": 120,
    "missing_count": 1,
    "unexpected_count": 12,
    "unexpected_percent": 10.0,
    "unexpected_percent_nonmissing": 10.08,
    "partial_unexpected_list": [5.0, 150.0, 200.0],
    "partial_unexpected_counts": [
      {"value": 5.0, "count": 1},
      {"value": 150.0, "count": 2}
    ]
  }
}

{
  "results": [
        {
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "name"}
            },
            "success": true,
            "result": {
                "element_count": 100,
                "missing_count": 0,
                "unexpected_count": 0,
                "unexpected_percent": 0.0
            }
        }
    ]
}

Conclusion

This service provides a powerful and automated solution for data validation in Data Analytics System, combining automated analysis with the flexibility of custom expectations. The generation of detailed and navigable reports allows precise control over data quality, useful both in development and production phases.