Scala Migration Guide¶
This guide helps users of the Scala spark-pipeline-framework migrate to the Python version. The two frameworks share the same architecture and concepts, but configuration field names and some API conventions differ.
HOCON Field Mapping¶
The tables below map Scala HOCON field names to their Python equivalents, grouped by configuration scope.
Pipeline-Level Fields¶
| Scala Field | Python Field | Notes |
|---|---|---|
|  |  | Shortened |
|  | (constructor arg) | Passed to |
|  | (not in config) | Schema validation is automatic when components implement |
|  | (not in config) | Use |
|  | (per-component) | Set |
|  | (per-component) | Set |
|  |  | Shortened |
Component-Level Fields¶
| Scala Field | Python Field | Notes |
|---|---|---|
|  |  | Fully-qualified class path (e.g. |
|  |  | Unique identifier within the pipeline |
|  |  | Dict of component-specific settings |
|  |  | Note: Python counts total attempts, not retries ( |
|  |  | Seconds instead of milliseconds |
|  |  | Seconds instead of milliseconds |
|  |  | Same semantics |
|  | (constructor arg) |  |
|  |  | Same semantics (list of class names) |
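The two retry-related notes above (total attempts instead of retries, seconds instead of milliseconds) are easy to get wrong when translating by hand. The following is a minimal sketch of the arithmetic, not part of either framework: `PythonRetry` and `convert_retry_policy` are hypothetical names, and the field names are taken from the full config example in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PythonRetry:
    """Hypothetical container mirroring the fields of the Python `retry` block."""

    max_attempts: int
    initial_delay_seconds: float
    max_delay_seconds: float
    backoff_multiplier: float


def convert_retry_policy(
    max_retries: int,
    initial_delay_ms: int,
    max_delay_ms: int,
    backoff_multiplier: float,
) -> PythonRetry:
    """Translate Scala `retry-policy` values into Python `retry` values."""
    return PythonRetry(
        # Scala counts retries after the first try; Python counts every attempt.
        max_attempts=max_retries + 1,
        # Milliseconds become seconds.
        initial_delay_seconds=initial_delay_ms / 1000,
        max_delay_seconds=max_delay_ms / 1000,
        # The multiplier keeps the same semantics.
        backoff_multiplier=backoff_multiplier,
    )


# Values from the Scala example: max-retries = 2, 1000 ms, 30000 ms, 2.0
retry = convert_retry_policy(2, 1000, 30000, 2.0)
print(retry)
```

With the Scala example's values this yields `max_attempts=3`, `initial_delay_seconds=1.0`, and `max_delay_seconds=30.0`, matching the Python example config.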
Spark Configuration¶
| Scala Field | Python Field | Notes |
|---|---|---|
|  |  | Same |
|  |  | Underscores instead of hyphens |
|  |  | Additional |
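The hyphen-to-underscore rule can be applied mechanically to an already-parsed config. The sketch below assumes the Scala config has been loaded into plain Python dicts and lists; `hyphens_to_underscores` is a hypothetical helper, not a framework function. It only renames keys, so fields whose names change entirely (such as `config` becoming `spark_conf` in the example in this guide) still need an explicit mapping, and any Spark property key that itself contains a hyphen should be excluded before calling it.

```python
from typing import Any


def hyphens_to_underscores(node: Any) -> Any:
    """Recursively rename dict keys from kebab-case to snake_case.

    Values are left untouched; only keys are rewritten.
    """
    if isinstance(node, dict):
        return {
            key.replace("-", "_"): hyphens_to_underscores(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [hyphens_to_underscores(item) for item in node]
    return node


# Scalar fields from the Scala `spark` block in this guide's example.
scala_spark = {"master": "yarn", "app-name": "Customer ETL"}
python_spark = hyphens_to_underscores(scala_spark)
print(python_spark)
```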
Secrets Configuration¶
| Scala Syntax | Python Syntax | Notes |
|---|---|---|
|  |  | Simpler URI scheme |
|  |  | Same provider |
|  |  | Colon separator instead of hash |
Example: Full Config Migration¶
Scala HOCON:
{
  spark {
    master = "yarn"
    app-name = "Customer ETL"
    config {
      "spark.sql.shuffle.partitions" = "200"
    }
  }
  pipeline {
    pipeline-name = "customer-etl"
    fail-fast = true
    retry-policy {
      max-retries = 2
      initial-delay-ms = 1000
      max-delay-ms = 30000
      backoff-multiplier = 2.0
    }
    pipeline-components = [
      {
        instance-type = "com.example.ReadCustomers"
        instance-name = "read-customers"
        instance-config {
          table = "raw.customers"
          output-view = "raw_customers"
        }
      },
      {
        instance-type = "com.example.TransformCustomers"
        instance-name = "transform-customers"
        instance-config {
          input-view = "raw_customers"
          output-view = "clean_customers"
        }
      }
    ]
  }
}
Python HOCON:
{
  name: "customer-etl"
  version: "1.0.0"
  spark {
    app_name: "Customer ETL"
    master: "yarn"
    spark_conf {
      "spark.sql.shuffle.partitions": "200"
    }
  }
  components: [
    {
      name: "read-customers"
      component_type: source
      class_path: "my_package.ReadCustomers"
      config {
        table: "raw.customers"
        output_view: "raw_customers"
      }
      retry {
        max_attempts: 3
        initial_delay_seconds: 1.0
        max_delay_seconds: 30.0
        backoff_multiplier: 2.0
      }
    },
    {
      name: "transform-customers"
      component_type: transformation
      class_path: "my_package.TransformCustomers"
      depends_on: ["read-customers"]
      config {
        input_view: "raw_customers"
        output_view: "clean_customers"
      }
    }
  ]
}
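The per-component part of this migration can be sketched as a small helper operating on parsed config dicts. This is an illustration under stated assumptions rather than a framework utility: `convert_component` is a hypothetical name, and `component_type`, `class_path`, and `depends_on` are passed in explicitly because the Scala example carries no equivalent information (the class path also changes from a JVM package to a Python module).

```python
from typing import Any, Dict, List, Optional


def convert_component(
    scala: Dict[str, Any],
    component_type: str,
    class_path: str,
    depends_on: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Map one Scala pipeline component onto the Python component layout."""
    python: Dict[str, Any] = {
        # instance-name -> name
        "name": scala["instance-name"],
        # No Scala counterpart in the example; supplied by the caller.
        "component_type": component_type,
        # instance-type -> class_path, pointing at the Python implementation.
        "class_path": class_path,
        # instance-config -> config, with kebab-case keys converted.
        "config": {
            key.replace("-", "_"): value
            for key, value in scala.get("instance-config", {}).items()
        },
    }
    if depends_on:
        python["depends_on"] = list(depends_on)
    return python


scala_component = {
    "instance-type": "com.example.TransformCustomers",
    "instance-name": "transform-customers",
    "instance-config": {
        "input-view": "raw_customers",
        "output-view": "clean_customers",
    },
}

python_component = convert_component(
    scala_component,
    component_type="transformation",
    class_path="my_package.TransformCustomers",
    depends_on=["read-customers"],
)
print(python_component)
```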
API Migration¶
Component Creation¶
Scala:
class MyTransform extends DataFlow {
  override def name: String = "MyTransform"

  override def run(): Unit = {
    val df = spark.sql("SELECT * FROM raw")
    df.createOrReplaceTempView("transformed")
  }
}

object MyTransform extends ConfigurableInstance {
  override def createFromConfig(conf: Config): MyTransform = {
    // parse config
    new MyTransform(...)
  }
}
Python:
class MyTransform(DataFlow):
    @property
    def name(self) -> str:
        return "MyTransform"

    @classmethod
    def from_config(cls, config: dict) -> "MyTransform":
        return cls(**config)

    def run(self) -> None:
        df = self.spark.sql("SELECT * FROM raw")
        df.createOrReplaceTempView("transformed")
Key differences:
name is a property instead of a method override
from_config is a classmethod on the class itself (no companion object)
from_config receives a dict instead of a Typesafe Config object
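To make the `from_config` contract concrete without a Spark session, here is a self-contained sketch. The `DataFlow` class below is a stand-in written for this example, not the framework's real base class, and the `input_view` and `output_view` constructor parameters are illustrative assumptions borrowed from the example config.

```python
from abc import ABC, abstractmethod


class DataFlow(ABC):
    """Stand-in base class for illustration only (not the framework's)."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @classmethod
    @abstractmethod
    def from_config(cls, config: dict) -> "DataFlow": ...

    @abstractmethod
    def run(self) -> None: ...


class MyTransform(DataFlow):
    def __init__(self, input_view: str, output_view: str) -> None:
        self.input_view = input_view
        self.output_view = output_view

    @property
    def name(self) -> str:
        # A property, so callers write `component.name` without parentheses.
        return "MyTransform"

    @classmethod
    def from_config(cls, config: dict) -> "MyTransform":
        # The component's `config` block arrives as a plain dict;
        # its keys are unpacked into constructor keyword arguments.
        return cls(**config)

    def run(self) -> None:
        pass  # Spark work omitted in this sketch


component = MyTransform.from_config(
    {"input_view": "raw_customers", "output_view": "clean_customers"}
)
print(component.name, component.input_view, component.output_view)
```

Because `cls(**config)` unpacks the dict directly, a config key that does not match a constructor parameter raises `TypeError` at construction time, so a misspelled setting surfaces before `run()` is called.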
Pipeline Execution¶
Scala:
SimplePipelineRunner.run(config, hooks = LoggingHooks)
Python:
runner = SimplePipelineRunner.from_file(
    "pipeline.conf",
    hooks=CompositeHooks(LoggingHooks()),
)
result = runner.run()
Key differences:
Python runner is instance-based (Scala uses a singleton object)
Python run() returns a PipelineResult; Scala returns Unit and throws on failure
Python hooks are composed explicitly with CompositeHooks; Scala uses PipelineHooks.compose()
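Code migrated from Scala may rely on a failed run throwing. One way to preserve that behaviour is to wrap the returned result in a check, as in the sketch below. Everything here is hypothetical: `FakePipelineResult` is a stand-in, and the `success` and `error` attribute names are assumptions, so check the real `PipelineResult` for its actual fields before adapting this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakePipelineResult:
    """Stand-in only: `success` and `error` are assumed attribute names."""

    success: bool
    error: Optional[Exception] = None


class PipelineFailed(RuntimeError):
    """Hypothetical exception restoring Scala-style fail-by-throwing."""


def raise_on_failure(result: FakePipelineResult) -> FakePipelineResult:
    """Return the result unchanged on success; raise on failure."""
    if not result.success:
        # Chain the original error (if any) so the traceback keeps the cause.
        raise PipelineFailed("pipeline run failed") from result.error
    return result


ok = raise_on_failure(FakePipelineResult(success=True))
print(ok.success)

try:
    raise_on_failure(FakePipelineResult(success=False, error=ValueError("bad input")))
except PipelineFailed as exc:
    print("failed:", repr(exc.__cause__))
```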
Key Conceptual Differences¶
| Concept | Scala | Python |
|---|---|---|
| Config parsing | PureConfig ( | dataconf ( |
| Type safety | Compile-time (Scala types) | Runtime (mypy strict + dataclass validation) |
| Protocols | Traits (nominal typing) |  |
| Metrics | Micrometer |  |
| Logging | Log4j2 | Python |
| Build | SBT + cross-compilation | Hatchling + pip |
| Dependency injection | Constructor + companion objects | Constructor + |
| Error handling | Exceptions + | Exceptions with |
See Also¶
Configuration – Python configuration reference
Components – Creating Python components
Resilience – Retry and circuit breaker configuration