Schema (`core.schema`)¶

Definition¶

Schema definition models for pipeline data contracts.

class pyspark_pipeline_framework.core.schema.definition.DataType(*values)[source]¶

Bases: str, Enum

Platform-independent data types for schema definitions.

The str mixin allows natural string comparison without .value.

STRING = 'string'¶

INTEGER = 'integer'¶

LONG = 'long'¶

FLOAT = 'float'¶

DOUBLE = 'double'¶

BOOLEAN = 'boolean'¶

TIMESTAMP = 'timestamp'¶

DATE = 'date'¶

BINARY = 'binary'¶

ARRAY = 'array'¶

MAP = 'map'¶

STRUCT = 'struct'¶

class pyspark_pipeline_framework.core.schema.definition.SchemaField(name, data_type, nullable=True, metadata=<factory>)[source]¶

Bases: object

A single field within a schema definition.

Parameters:

name (str) – Column/field name.
data_type (DataType | str) – A DataType enum member or a string for complex types (e.g. "array<string>").
nullable (bool) – Whether the field accepts null values.
metadata (dict[str, str]) – Arbitrary key-value metadata for the field.

name: str¶

data_type: DataType | str¶

nullable: bool = True¶

metadata: dict[str, str]¶

class pyspark_pipeline_framework.core.schema.definition.SchemaDefinition(fields, description=None)[source]¶

Bases: object

A collection of fields describing a dataset’s schema.

Parameters:

fields (list[SchemaField]) – Ordered list of schema fields.
description (str | None) – Optional human-readable description.

fields: list[SchemaField]¶

description: str | None = None¶

field_names()[source]¶

Return the set of all field names.

Return type:: set[str]

get_field(name)[source]¶

Look up a field by name.

Returns:: The matching SchemaField, or None if not found.
Parameters:: name (str)
Return type:: SchemaField | None

Validator¶

Schema validation for pipeline data flow contracts.

class pyspark_pipeline_framework.core.schema.validator.ValidationSeverity(*values)[source]¶

Bases: Enum

Severity level for schema validation issues.

ERROR = 'error'¶

WARNING = 'warning'¶

class pyspark_pipeline_framework.core.schema.validator.ValidationIssue(severity, field_name, message, source_component, target_component)[source]¶

Bases: object

A single issue discovered during schema validation.

Parameters:

severity (ValidationSeverity) – Whether this is a blocking error or an informational warning.
field_name (str | None) – The field that caused the issue, or None for schema-level issues.
message (str) – Human-readable description of the issue.
source_component (str) – Name of the upstream (output) component.
target_component (str) – Name of the downstream (input) component.

severity: ValidationSeverity¶

field_name: str | None¶

message: str¶

source_component: str¶

target_component: str¶

class pyspark_pipeline_framework.core.schema.validator.ValidationResult(valid, issues, source_component, target_component)[source]¶

Bases: object

Outcome of validating an output schema against an input schema.

Parameters:

valid (bool) – True when there are no ERROR-level issues.
issues (list[ValidationIssue]) – All discovered issues (errors and warnings).
source_component (str) – Name of the upstream component.
target_component (str) – Name of the downstream component.

valid: bool¶

issues: list[ValidationIssue]¶

source_component: str¶

target_component: str¶

property errors: list[ValidationIssue]¶: Return only ERROR-level issues.

property warnings: list[ValidationIssue]¶: Return only WARNING-level issues.

class pyspark_pipeline_framework.core.schema.validator.SchemaValidator[source]¶

Bases: object

Validates that an output schema is compatible with an input schema.

Rules applied: - Both None + not strict → valid (nothing to check). - Both None + strict → ERROR (schemas must be declared). - One None → valid (partial schemas cannot be validated). - Missing required input field in output → ERROR. - Type mismatch between matching fields → ERROR. - Non-nullable input field backed by nullable output → ERROR. - Extra output fields not in input → WARNING.

validate(output_schema, input_schema, source_component, target_component, *, strict=False)[source]¶

Validate compatibility between an output and input schema.

Parameters:

output_schema (SchemaDefinition | None) – Schema produced by the source component.
input_schema (SchemaDefinition | None) – Schema expected by the target component.
source_component (str) – Name of the source component.
target_component (str) – Name of the target component.
strict (bool) – When True, require both schemas to be declared.

Returns:

A ValidationResult summarising all issues found.

Return type:

ValidationResult

Schema (core.schema)¶

Definition¶

Validator¶

Schema (`core.schema`)¶