Schema (core.schema)¶
Definition¶
Schema definition models for pipeline data contracts.
- class pyspark_pipeline_framework.core.schema.definition.DataType(*values)[source]¶
-
Platform-independent data types for schema definitions.
The
strmixin allows natural string comparison without.value.- STRING = 'string'¶
- INTEGER = 'integer'¶
- LONG = 'long'¶
- FLOAT = 'float'¶
- DOUBLE = 'double'¶
- BOOLEAN = 'boolean'¶
- TIMESTAMP = 'timestamp'¶
- DATE = 'date'¶
- BINARY = 'binary'¶
- ARRAY = 'array'¶
- MAP = 'map'¶
- STRUCT = 'struct'¶
- class pyspark_pipeline_framework.core.schema.definition.SchemaField(name, data_type, nullable=True, metadata=<factory>)[source]¶
Bases:
objectA single field within a schema definition.
- Parameters:
- class pyspark_pipeline_framework.core.schema.definition.SchemaDefinition(fields, description=None)[source]¶
Bases:
objectA collection of fields describing a dataset’s schema.
- Parameters:
fields (list[SchemaField]) – Ordered list of schema fields.
description (str | None) – Optional human-readable description.
- fields: list[SchemaField]¶
- get_field(name)[source]¶
Look up a field by name.
- Returns:
The matching
SchemaField, orNoneif not found.- Parameters:
name (str)
- Return type:
SchemaField | None
Validator¶
Schema validation for pipeline data flow contracts.
- class pyspark_pipeline_framework.core.schema.validator.ValidationSeverity(*values)[source]¶
Bases:
EnumSeverity level for schema validation issues.
- ERROR = 'error'¶
- WARNING = 'warning'¶
- class pyspark_pipeline_framework.core.schema.validator.ValidationIssue(severity, field_name, message, source_component, target_component)[source]¶
Bases:
objectA single issue discovered during schema validation.
- Parameters:
severity (ValidationSeverity) – Whether this is a blocking error or an informational warning.
field_name (str | None) – The field that caused the issue, or
Nonefor schema-level issues.message (str) – Human-readable description of the issue.
source_component (str) – Name of the upstream (output) component.
target_component (str) – Name of the downstream (input) component.
- severity: ValidationSeverity¶
- class pyspark_pipeline_framework.core.schema.validator.ValidationResult(valid, issues, source_component, target_component)[source]¶
Bases:
objectOutcome of validating an output schema against an input schema.
- Parameters:
valid (bool) –
Truewhen there are no ERROR-level issues.issues (list[ValidationIssue]) – All discovered issues (errors and warnings).
source_component (str) – Name of the upstream component.
target_component (str) – Name of the downstream component.
- issues: list[ValidationIssue]¶
- property errors: list[ValidationIssue]¶
Return only ERROR-level issues.
- property warnings: list[ValidationIssue]¶
Return only WARNING-level issues.
- class pyspark_pipeline_framework.core.schema.validator.SchemaValidator[source]¶
Bases:
objectValidates that an output schema is compatible with an input schema.
Rules applied: - Both
None+ not strict → valid (nothing to check). - BothNone+ strict → ERROR (schemas must be declared). - OneNone→ valid (partial schemas cannot be validated). - Missing required input field in output → ERROR. - Type mismatch between matching fields → ERROR. - Non-nullable input field backed by nullable output → ERROR. - Extra output fields not in input → WARNING.- validate(output_schema, input_schema, source_component, target_component, *, strict=False)[source]¶
Validate compatibility between an output and input schema.
- Parameters:
output_schema (SchemaDefinition | None) – Schema produced by the source component.
input_schema (SchemaDefinition | None) – Schema expected by the target component.
source_component (str) – Name of the source component.
target_component (str) – Name of the target component.
strict (bool) – When
True, require both schemas to be declared.
- Returns:
A
ValidationResultsummarising all issues found.- Return type: