PySpark Pipeline Framework
Configuration-driven PySpark pipeline framework with HOCON configuration, resilience patterns, lifecycle hooks, and streaming support.
Build batch and streaming data pipelines using composable components, HOCON configuration files, and a rich set of operational features including retry policies, circuit breakers, data quality checks, audit trails, secrets management, and checkpoint/resume.
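As a rough illustration of the configuration-driven idea, the sketch below wires named components into a pipeline from a plain dictionary and wraps each step in a simple retry with exponential backoff. It is plain Python and does not use this framework's API: `REGISTRY`, `with_retry`, `run_pipeline`, and every config key are hypothetical names chosen for the example, and the dictionary stands in for a parsed HOCON file.

```python
import time
from typing import Any, Callable, Dict, List

# Hypothetical component registry: name -> function taking (rows, options).
# These names are illustrative only and are not part of the framework.
REGISTRY: Dict[str, Callable[[List[int], Dict[str, Any]], List[int]]] = {
    "filter_positive": lambda rows, opts: [r for r in rows if r > 0],
    "scale": lambda rows, opts: [r * opts.get("factor", 1) for r in rows],
}


def with_retry(fn: Callable[[], Any], max_attempts: int = 3, delay_s: float = 0.0) -> Any:
    """Call fn, retrying on any exception up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_s * 2 ** (attempt - 1))  # exponential backoff


def run_pipeline(config: Dict[str, Any], rows: List[int]) -> List[int]:
    """Apply each configured step in order, looking components up by name."""
    retry = config.get("retry", {})
    for step in config["steps"]:
        component = REGISTRY[step["type"]]
        options = step.get("options", {})
        rows = with_retry(
            lambda: component(rows, options),
            max_attempts=retry.get("max_attempts", 3),
            delay_s=retry.get("delay_s", 0.0),
        )
    return rows


# Assumed config shape; a real pipeline would load this from a HOCON file.
config = {
    "retry": {"max_attempts": 2, "delay_s": 0.0},
    "steps": [
        {"type": "filter_positive"},
        {"type": "scale", "options": {"factor": 10}},
    ],
}
result = run_pipeline(config, [-1, 2, 3])
print(result)  # [20, 30]
```

Keeping the step list and retry settings in configuration rather than code is what lets the same components be recombined without edits to the pipeline logic; the framework's own component and resilience interfaces are covered in the API Reference.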
Note
Scala/JVM users may also be interested in spark-pipeline-framework, the Scala implementation using PureConfig and Typesafe Config.
Supported Versions:
- Python 3.10 – 3.13
- Apache Spark 3.4+ (optional runtime dependency)
Getting Started
User Guide
Production
Reference
- Architecture
- API Reference
  - Configuration (core.config)
  - Components (core.component)
  - Schema (core.schema)
  - Resilience (core.resilience)
  - Data Quality (core.quality)
  - Audit (core.audit)
  - Secrets (core.secrets)
  - Metrics (core.metrics)
  - Session (runtime.session)
  - DataFlow (runtime.dataflow)
  - Streaming (runtime.streaming)
  - Dynamic Loader (runtime.loader)
  - Schema Converter (runtime.schema_converter)
  - Runner (runner)
  - Examples (examples)
- Glossary