Scope & Design¶

This page describes the scope, design philosophy, and architectural decisions behind pyspark-pipeline-framework. Use it to decide whether this framework is a good fit for your project.

Design Philosophy¶

pyspark-pipeline-framework is built on three principles:

Configuration-driven. Pipelines are declared in HOCON configuration files. Components, ordering, retry policies, and circuit breakers are all specified outside of application code. The same component code runs in development, staging, and production with different configuration files.

Composable. Every pipeline is a sequence of components that conform to the PipelineComponent abstract base class. Components are loaded dynamically from class paths, instantiated from configuration dictionaries, and executed in dependency order. New components can be added without modifying the runner.

Observable. The PipelineHooks protocol provides lifecycle callbacks at every execution point: pipeline start/end, component start/end, retries, and failures. Built-in hooks cover logging, metrics, data quality checks, audit trails, and checkpoint management. Compose them freely via CompositeHooks.

What This Framework IS¶

Simple Sequential Pipelines¶

The framework loads a HOCON configuration file, dynamically instantiates components via importlib, resolves depends_on ordering, and executes components sequentially through SimplePipelineRunner:

HOCON Config --> ConfigLoader --> PipelineConfig
                                      |
                                      v
                                 Dynamic Loader
                                 (importlib)
                                      |
                                      v
                             SimplePipelineRunner
                               (sequential execution)

Each component receives a SparkSession, runs its logic, and passes data downstream via temporary views or shared state. This model is deliberately simple: a pipeline is a list of steps that run in order.

Configuration-Driven Architecture¶

All pipeline definitions use HOCON (Human-Optimized Config Object Notation), parsed by the dataconf library into type-safe Python dataclasses:

{
  name: "customer-etl"
  version: "1.0.0"

  spark {
    app_name: "Customer ETL"
    master: "local[*]"
  }

  components: [
    {
      name: "read_raw"
      component_type: source
      class_path: "my_project.components.ReadCustomers"
      config {
        table_name: "raw.customers"
        output_view: "raw"
      }
    }
  ]
}

HOCON provides features that JSON lacks: comments, includes, variable substitution (${?ENV_VAR}), and multi-file composition. These features make it straightforward to manage environment-specific overrides.

Production-Ready Observability¶

The framework provides five built-in hook implementations that compose via CompositeHooks:

Hook	Purpose
`LoggingHooks`	Structured logging of lifecycle events via `structlog`
`MetricsHooks`	Timing, counters, and retry counts for monitoring systems
`DataQualityHooks`	Run data quality checks before/after components
`AuditHooks`	Emit audit events with config filtering for compliance
`CheckpointHooks`	Persist checkpoint state for pipeline resume

All hooks follow the PipelineHooks protocol, so custom implementations (Slack notifications, PagerDuty alerts, custom dashboards) plug in without any framework changes.

Flexible Error Handling¶

Components can be configured with per-component resilience policies:

Retry with exponential backoff – RetryExecutor with configurable max_attempts, initial_delay_seconds, max_delay_seconds, and backoff_multiplier.
Circuit breaker – CircuitBreaker with failure_threshold and timeout_seconds. Prevents cascading failures by rejecting calls to repeatedly failing components.
Fail-fast vs. continue – The runner supports both strategies. Components can be marked enabled: false to skip them entirely.

{
  name: "flaky_api"
  component_type: source
  class_path: "my_project.components.ApiSource"
  retry {
    max_attempts: 3
    initial_delay_seconds: 1.0
    max_delay_seconds: 30.0
    backoff_multiplier: 2.0
  }
  circuit_breaker {
    failure_threshold: 5
    timeout_seconds: 60.0
  }
}

What This Framework is NOT¶

Not a DAG Executor¶

Pipelines execute components sequentially in dependency order. There is no parallel task scheduling, no DAG visualization, and no fan-out/fan-in execution. If you need complex directed acyclic graph execution with parallel branches, use a workflow orchestrator like Airflow or Dagster and call this framework from individual tasks.

The depends_on field controls ordering, not parallel execution:

components: [
  { name: "a", ... },
  { name: "b", depends_on: ["a"], ... },
  { name: "c", depends_on: ["a"], ... },
  { name: "d", depends_on: ["b", "c"], ... }
]

In this example, b and c both depend on a but they still execute sequentially (a then b then c then d), not in parallel.

Optional Schema Contracts¶

Schema validation is opt-in. Components can implement the SchemaContract protocol and extend SchemaAwareDataFlow to declare input and output schemas, but this is not enforced by default. The framework does not require every component to declare schemas.

Not a Workflow Orchestrator¶

This framework runs a single pipeline. It does not manage multi-pipeline DAGs, scheduling, alerting dashboards, or cross-pipeline dependencies. For multi-pipeline orchestration, use Airflow, Dagster, Prefect, or a similar tool and invoke SimplePipelineRunner from within each task.

When to Use This Framework¶

Good Fit¶

Batch ETL jobs – Read from tables/files, transform with SQL or DataFrame logic, write to tables. The source-transform-sink pattern maps directly to the component model.
Data transformations – Chains of SQL or DataFrame transformations composed as reusable components.
Streaming ingestion – Kafka, file, or Delta Lake sources piped through transforms to Delta, Iceberg, or Kafka sinks using Structured Streaming.
Configuration-driven batch processing – Multiple environments (dev, staging, prod) with shared component code and different HOCON configs.
Audited data pipelines – Compliance requirements met with AuditHooks and ConfigFilter for redacting secrets from audit logs.

Not a Good Fit¶

Complex DAG dependencies – Pipelines with parallel branches, conditional execution, or fan-out/fan-in patterns. Use Airflow or Dagster.
Non-Spark workloads – The framework is purpose-built for PySpark. core/ has no Spark dependency at import time, but runtime/ and runner/ assume a SparkSession.
Real-time serving – For sub-second latency APIs or event-driven microservices, use dedicated serving frameworks (FastAPI, Flink, etc.).
Ad-hoc notebook exploration – The framework adds structure that is unnecessary for one-off interactive analysis. Use plain PySpark in notebooks.

Architecture Decisions¶

Why Sequential Execution?¶

Most PySpark ETL jobs are inherently sequential: read data, transform it, write the result. Parallel execution within a single Spark application adds complexity (thread safety, resource contention, debugging difficulty) with limited benefit because Spark itself parallelizes work across executors.

Sequential execution provides:

Predictable ordering – Components always run in the same order.
Simple debugging – Stack traces point to a single component.
Safe Spark usage – No concurrent SparkSession access concerns.
Checkpoint/resume – Failed pipelines resume from the last completed step.

Why HOCON Configuration?¶

HOCON was chosen over YAML, TOML, or JSON because:

Comments – HOCON supports // and # comments. YAML does too, but JSON and TOML (for complex nested structures) do not.
Includes – include "base.conf" enables environment-specific overrides layered on shared defaults.
Substitution – ${?DB_PASSWORD} reads environment variables with optional fallback. Avoids separate templating tools.
Compatibility – HOCON is a superset of JSON. Existing JSON configs work without changes.
Consistency – The Scala spark-pipeline-framework uses Typesafe Config (HOCON). Using the same format enables teams to share configuration patterns across Scala and Python projects.

The dataconf library parses HOCON into Python dataclasses with type validation, providing the same type safety that PureConfig provides in Scala.

Why Reflection-Based Instantiation?¶

Components are loaded dynamically via importlib.import_module and getattr:

from pyspark_pipeline_framework.runtime.loader import load_component_class

cls = load_component_class("my_project.components.ReadCustomers")
instance = cls.from_config({"table_name": "raw.customers"})

This design provides:

No framework coupling – Components do not need to register themselves. Any class with from_config() and run() can be loaded.
Configuration-driven composition – New components are added to the pipeline by editing a HOCON file, not by modifying Python code.
Validation – validate_pipeline() dry-runs class loading and configuration parsing without executing any Spark operations.

Comparison with Alternatives¶

Feature	This Framework	Raw PySpark	Airflow	Dagster	Custom Framework
Config-driven pipelines	HOCON + dataconf	Manual	Airflow Variables	Dagster config	Varies
Component reuse	`DataFlow` ABC	Copy/paste	Operators	Assets/Ops	Varies
Dynamic loading	`importlib` + class path	N/A	Plugin system	Resource system	Often manual
Execution model	Sequential	Sequential	DAG (parallel)	DAG (parallel)	Varies
Retry / circuit breaker	Built-in, per-component	Manual	Task-level retry	Op-level retry	Often missing
Lifecycle hooks	`PipelineHooks` protocol	None	Callbacks	Sensors/hooks	Varies
Data quality	`DataQualityHooks`	Manual / Great Expectations	Separate DAG	Asset checks	Varies
Audit trail	`AuditHooks` + sinks	Manual	Audit logs	Event log	Varies
Secrets management	`SecretsResolver` + providers	Manual	Connections	Resources	Varies
Checkpoint / resume	`CheckpointStore`	Manual	Built-in	Built-in	Often missing
Streaming support	`StreamingPipeline`	Manual	Limited	Limited	Varies
Learning curve	Low	Lowest	Medium	Medium-High	Varies
Deployment	`spark-submit` / Databricks	Same	Managed scheduler	Managed scheduler	Varies

Summary: Use this framework when you want configuration-driven PySpark pipelines with production features (resilience, observability, audit) but without the operational overhead of a full workflow orchestrator. Use Airflow or Dagster when you need DAG scheduling, cross-pipeline dependencies, or a managed execution environment.