Schema (core.schema)

Definition

Schema definition models for pipeline data contracts.

class pyspark_pipeline_framework.core.schema.definition.DataType(*values)[source]

Bases: str, Enum

Platform-independent data types for schema definitions.

The str mixin allows natural string comparison without .value.

STRING = 'string'
INTEGER = 'integer'
LONG = 'long'
FLOAT = 'float'
DOUBLE = 'double'
BOOLEAN = 'boolean'
TIMESTAMP = 'timestamp'
DATE = 'date'
BINARY = 'binary'
ARRAY = 'array'
MAP = 'map'
STRUCT = 'struct'
class pyspark_pipeline_framework.core.schema.definition.SchemaField(name, data_type, nullable=True, metadata=<factory>)[source]

Bases: object

A single field within a schema definition.

Parameters:
  • name (str) – Column/field name.

  • data_type (DataType | str) – A DataType enum member or a string for complex types (e.g. "array<string>").

  • nullable (bool) – Whether the field accepts null values.

  • metadata (dict[str, str]) – Arbitrary key-value metadata for the field.

name: str
data_type: DataType | str
nullable: bool = True
metadata: dict[str, str]
class pyspark_pipeline_framework.core.schema.definition.SchemaDefinition(fields, description=None)[source]

Bases: object

A collection of fields describing a dataset’s schema.

Parameters:
  • fields (list[SchemaField]) – Ordered list of schema fields.

  • description (str | None) – Optional human-readable description.

fields: list[SchemaField]
description: str | None = None
field_names()[source]

Return the set of all field names.

Return type:

set[str]

get_field(name)[source]

Look up a field by name.

Returns:

The matching SchemaField, or None if not found.

Parameters:

name (str)

Return type:

SchemaField | None

Validator

Schema validation for pipeline data flow contracts.

class pyspark_pipeline_framework.core.schema.validator.ValidationSeverity(*values)[source]

Bases: Enum

Severity level for schema validation issues.

ERROR = 'error'
WARNING = 'warning'
class pyspark_pipeline_framework.core.schema.validator.ValidationIssue(severity, field_name, message, source_component, target_component)[source]

Bases: object

A single issue discovered during schema validation.

Parameters:
  • severity (ValidationSeverity) – Whether this is a blocking error or an informational warning.

  • field_name (str | None) – The field that caused the issue, or None for schema-level issues.

  • message (str) – Human-readable description of the issue.

  • source_component (str) – Name of the upstream (output) component.

  • target_component (str) – Name of the downstream (input) component.

severity: ValidationSeverity
field_name: str | None
message: str
source_component: str
target_component: str
class pyspark_pipeline_framework.core.schema.validator.ValidationResult(valid, issues, source_component, target_component)[source]

Bases: object

Outcome of validating an output schema against an input schema.

Parameters:
  • valid (bool) – True when there are no ERROR-level issues.

  • issues (list[ValidationIssue]) – All discovered issues (errors and warnings).

  • source_component (str) – Name of the upstream component.

  • target_component (str) – Name of the downstream component.

valid: bool
issues: list[ValidationIssue]
source_component: str
target_component: str
property errors: list[ValidationIssue]

Return only ERROR-level issues.

property warnings: list[ValidationIssue]

Return only WARNING-level issues.

class pyspark_pipeline_framework.core.schema.validator.SchemaValidator[source]

Bases: object

Validates that an output schema is compatible with an input schema.

Rules applied: - Both None + not strict → valid (nothing to check). - Both None + strict → ERROR (schemas must be declared). - One None → valid (partial schemas cannot be validated). - Missing required input field in output → ERROR. - Type mismatch between matching fields → ERROR. - Non-nullable input field backed by nullable output → ERROR. - Extra output fields not in input → WARNING.

validate(output_schema, input_schema, source_component, target_component, *, strict=False)[source]

Validate compatibility between an output and input schema.

Parameters:
  • output_schema (SchemaDefinition | None) – Schema produced by the source component.

  • input_schema (SchemaDefinition | None) – Schema expected by the target component.

  • source_component (str) – Name of the source component.

  • target_component (str) – Name of the target component.

  • strict (bool) – When True, require both schemas to be declared.

Returns:

A ValidationResult summarising all issues found.

Return type:

ValidationResult