Glossary¶
- Checkpoint¶
A snapshot of pipeline execution state recording which components have completed successfully. Used to resume failed pipelines from the last successful step.
- Circuit Breaker¶
A resilience pattern that tracks consecutive failures and temporarily rejects calls when a failure threshold is reached. Prevents cascading failures by giving a failing component time to recover.
- Component¶
A unit of work in a pipeline. Components extend PipelineComponent (or DataFlow) and implement a run() method.
- CompositeHooks¶
A hooks implementation that delegates to multiple child hooks, allowing logging, metrics, audit, and data quality hooks to coexist.
- ConfigurableInstance¶
A runtime-checkable protocol for components that can be instantiated from a configuration dictionary via from_config().
- Data Quality Check¶
A validation rule that runs at a pipeline lifecycle point (before or after a component) to verify data integrity.
- DataFlow¶
An abstract base class for Spark-aware pipeline components. Extends PipelineComponent with SparkSession injection and a logger.
- HOCON¶
Human-Optimized Config Object Notation. A superset of JSON used for pipeline configuration files. Parsed by the dataconf library.
- Hook¶
A callback interface that receives lifecycle events during pipeline execution (start, end, retry, etc.).
- Pipeline Fingerprint¶
A hash of the pipeline configuration used to detect configuration changes between runs. Stale checkpoints are invalidated when the fingerprint changes.
- PipelineComponent¶
The abstract base class for all pipeline components. Defines the name property and run() method.
- RetryExecutor¶
Executes a callable with exponential backoff and optional jitter. Configurable per-component via HOCON retry blocks.
- SchemaAwareDataFlow¶
A DataFlow subclass that declares input and output schemas and validates them automatically before and after run().
- SchemaContract¶
A protocol for components that declare input and output schemas. Not runtime-checkable; use hasattr checks.
- SecretsProvider¶
An abstract base class for secret backends. Implementations include EnvSecretsProvider, AwsSecretsProvider, and VaultSecretsProvider.
- SimplePipelineRunner¶
The main pipeline orchestrator. Loads components from configuration, resolves dependencies, and executes them in order with resilience and hooks.
- SparkSessionWrapper¶
A thread-safe singleton that manages the SparkSession lifecycle, including Spark Connect support.
- StreamingPipeline¶
An abstract base class for Spark Structured Streaming pipelines that combines a StreamingSource, optional transform, and StreamingSink.
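
The Circuit Breaker entry above describes counting consecutive failures and rejecting calls for a while once a threshold is reached. A minimal sketch of that idea follows. The class name, parameter names (`failure_threshold`, `recovery_timeout`, `clock`), and the `CircuitOpenError` exception are hypothetical and chosen for illustration; they are not taken from the framework's API.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Illustrative breaker only; names and defaults are assumptions."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so the timeout can be tested without sleeping
        self._consecutive_failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Recovery window has elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the breaker
            raise
        # Any success resets the failure count and closes the breaker.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

Rejecting calls while the breaker is open is what gives the failing component time to recover; a single successful trial call after the timeout closes the breaker again.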
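
The RetryExecutor entry above mentions exponential backoff with optional jitter. The sketch below shows one common way to compute such delays. The function name and its parameters (`max_attempts`, `base_delay`, `max_delay`, `jitter`, `sleep`) are assumptions made for this example and do not reflect the actual RetryExecutor signature or the keys of a HOCON retry block.

```python
import random
import time


def retry_with_backoff(func, max_attempts=3, base_delay=1.0, max_delay=30.0,
                       jitter=True, sleep=time.sleep):
    """Call func until it succeeds or max_attempts is exhausted (illustrative)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1x, 2x, 4x, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            if jitter:
                # "Full jitter": pick a random delay in [0, delay] so that
                # concurrent retries do not all fire at the same moment.
                delay = random.uniform(0, delay)
            sleep(delay)
```

Passing `sleep` as a parameter keeps the function easy to test, since a test can record the delays instead of actually waiting.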
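
The Checkpoint and Pipeline Fingerprint entries above can be illustrated together. The sketch below hashes a configuration dictionary and compares the result with the value stored in a checkpoint. The choice of SHA-256 over canonical JSON, the function names, and the checkpoint layout (`fingerprint`, `completed` keys) are all assumptions for illustration; the glossary does not specify how the framework computes or stores the fingerprint.

```python
import hashlib
import json


def pipeline_fingerprint(config: dict) -> str:
    """Hash a configuration dict in a key-order-independent way (illustrative)."""
    # sort_keys gives a canonical form, so equal configs hash equally
    # regardless of the order in which their keys were written.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def checkpoint_is_stale(checkpoint: dict, config: dict) -> bool:
    """A checkpoint is stale when it was written under a different config."""
    return checkpoint.get("fingerprint") != pipeline_fingerprint(config)
```

With this shape, a runner would resume from the completed components recorded in the checkpoint only when `checkpoint_is_stale` returns `False`, and start from scratch otherwise.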