Data Quality (core.quality)¶
Types¶
Data quality check types and models.
- class pyspark_pipeline_framework.core.quality.types.CheckTiming(*values)[source]¶
-
When to run a data quality check.
- BEFORE_PIPELINE = 'before_pipeline'¶
- AFTER_PIPELINE = 'after_pipeline'¶
- AFTER_COMPONENT = 'after_component'¶
- class pyspark_pipeline_framework.core.quality.types.FailureMode(*values)[source]¶
-
How to handle check failures.
- FAIL_ON_ERROR = 'fail_on_error'¶
- WARN_ONLY = 'warn_only'¶
- THRESHOLD = 'threshold'¶
- class pyspark_pipeline_framework.core.quality.types.CheckResult(check_name, passed, message, details=None)[source]¶
Bases:
objectResult of a data quality check.
- class pyspark_pipeline_framework.core.quality.types.DataQualityCheck(name, timing, check_fn, component_name=None, failure_mode=FailureMode.FAIL_ON_ERROR, max_failures=0)[source]¶
Bases:
objectA single data quality check.
The
check_fnreceives aSparkSessionand returns aCheckResult.- Parameters:
name (str) – Unique check name.
timing (CheckTiming) – When the check should run.
check_fn (Callable[[SparkSession], CheckResult]) – Callable that performs the check.
component_name (str | None) – For
AFTER_COMPONENTtiming, the component to attach the check to.failure_mode (FailureMode) – How to handle failures.
max_failures (int) – For
THRESHOLDmode, number of failures allowed before raising.
- timing: CheckTiming¶
- check_fn: Callable[[SparkSession], CheckResult]¶
- failure_mode: FailureMode = 'fail_on_error'¶
Checks¶
Built-in data quality check factories.
- pyspark_pipeline_framework.core.quality.checks.row_count_check(table, min_rows, timing=CheckTiming.AFTER_PIPELINE, component_name=None)[source]¶
Check that a table has at least min_rows rows.
- Parameters:
table (str)
min_rows (int)
timing (CheckTiming)
component_name (str | None)
- Return type:
- pyspark_pipeline_framework.core.quality.checks.null_check(table, column, max_null_pct=0.0, timing=CheckTiming.AFTER_PIPELINE, component_name=None)[source]¶
Check that column has at most max_null_pct % null values.
- Parameters:
table (str)
column (str)
max_null_pct (float)
timing (CheckTiming)
component_name (str | None)
- Return type:
- pyspark_pipeline_framework.core.quality.checks.unique_check(table, columns, timing=CheckTiming.AFTER_PIPELINE, component_name=None)[source]¶
Check that columns form a unique key (no duplicates).
- Parameters:
- Return type:
- pyspark_pipeline_framework.core.quality.checks.range_check(table, column, min_value=None, max_value=None, timing=CheckTiming.AFTER_PIPELINE, component_name=None)[source]¶
Check that all values in column fall within [min_value, max_value].
At least one of min_value or max_value must be provided.
- Parameters:
table (str) – Registered temp view or table name.
column (str) – Numeric column to validate.
min_value (float | int | None) – Minimum allowed value (inclusive), or
Nonefor no lower bound.max_value (float | int | None) – Maximum allowed value (inclusive), or
Nonefor no upper bound.timing (CheckTiming) – When to run the check.
component_name (str | None) – For
AFTER_COMPONENTtiming, which component to attach the check to.
- Raises:
ValueError – If neither min_value nor max_value is provided.
- Return type:
- pyspark_pipeline_framework.core.quality.checks.schema_check(table, schema, check_types=True, timing=CheckTiming.AFTER_PIPELINE, component_name=None)[source]¶
Check that table matches the expected schema.
Verifies that all fields defined in schema exist in the table. When check_types is
True(the default), also verifies that each field’s data type matches.- Parameters:
table (str) – Registered temp view or table name.
schema (SchemaDefinition) – Expected schema definition.
check_types (bool) – Whether to validate data types in addition to column names.
timing (CheckTiming) – When to run the check.
component_name (str | None) – For
AFTER_COMPONENTtiming, which component to attach the check to.
- Return type:
- pyspark_pipeline_framework.core.quality.checks.custom_sql_check(name, sql, timing=CheckTiming.AFTER_PIPELINE, component_name=None)[source]¶
Run an arbitrary SQL expression as a data quality check.
The sql statement must return a single row with at least a
passedcolumn (boolean). An optionalmessagecolumn provides a custom result message.Example:
custom_sql_check( name="no_negative_amounts", sql="SELECT COUNT(*) = 0 AS passed FROM orders WHERE amount < 0", )
- Parameters:
name (str) – Unique check name.
sql (str) – SQL query returning a row with a
passedboolean column.timing (CheckTiming) – When to run the check.
component_name (str | None) – For
AFTER_COMPONENTtiming, which component to attach the check to.
- Return type: