Migrating ML Pipelines from Spark to Snowflake + Ray: A 360x Improvement
My previous team inherited a set of ML feature engineering pipelines running on Spark via Kubeflow. They worked, when they finished. The problem: a full pipeline run took 4 hours. Nightly batch jobs regularly missed their windows. Engineers took days to iterate on features. On-call incidents often meant someone manually babysitting a stuck Spark job at 2 AM.
We migrated to Snowflake for the data transformation layer and Ray on Anyscale for the ML compute. The same pipeline now runs in 40 seconds.
Why 4 Hours Was Unavoidable on Spark
The pipeline had three phases:
- ·Feature extraction (about 2 hours): SQL-like transformations on large feature tables. Spark's shuffle-heavy operations on our data distribution were the bottleneck.
- ·Model feature computation (about 1 hour): Python UDFs calling into our ML feature library. Spark's Python UDF execution is famously slow because of serialization overhead between the JVM and Python.
- ·Validation (about 1 hour): Data quality checks, schema validation, and statistical tests.
Spark was the wrong tool for phases 2 and 3. It was acceptable for phase 1, but we paid the Spark tax on everything.
Snowflake for Data Transformation
Phase 1 moved to Snowflake almost verbatim. Our SQL transformations ran faster for three reasons:
- ·Auto-scaling compute: Snowflake scales compute warehouses based on query complexity. No manual cluster sizing.
- ·Column store and micro-partitioning: Our queries filtered heavily on a few columns. Snowflake's columnar storage eliminated massive amounts of I/O.
- ·No shuffle: Snowflake's query optimizer handles data distribution transparently.
Phase 1 dropped from 2 hours to under 2 minutes.
Ray for ML Compute
For phases 2 and 3, the Python-heavy ML work, we used Ray.
Ray was designed for Python-native distributed computing. No JVM, no serialization overhead, no UDF performance cliff.
import ray
from ray import data as rd
@ray.remote
def compute_features(batch: dict) -> dict:
return feature_library.compute(batch)
ds = rd.read_parquet("s3://features/raw/")
result = ds.map_batches(compute_features, batch_size=1000)
Ray's map_batches distributes computation across workers transparently. Anyscale handles cluster management, scaling workers up during feature computation and down when idle.
The Migration Journey
Weeks 1-2: Migrate Phase 1 to Snowflake. Validate outputs match exactly. Ship to production.
Week 3: Rewrite Phase 2 as Ray jobs. More work than expected. Code assuming PySpark DataFrames had to be refactored to work with pandas and Ray datasets.
Week 4: Migrate Phase 3 validation. Much easier. Our validators were pure Python functions.
Weeks 5-6: Hardening. Add observability using Ray's built-in dashboard and custom Prometheus metrics. Set up Anyscale cluster autoscaling policies. Integrate with existing alert infrastructure.
Numbers
| Phase | Before | After | Improvement | |-------|--------|-------|-------------| | Feature extraction | 2 hours | 90 seconds | 80x | | ML feature computation | 1 hour | 20 seconds | 180x | | Validation | 1 hour | 10 seconds | 360x | | Total | 4 hours | ~2 minutes | ~120x |
The 360x in the title refers to the validation phase specifically. The end-to-end pipeline improved about 120x overall.
What We Gave Up
Spark has a large ecosystem. A few things we left behind:
- ·MLflow integration: Spark's MLlib has native MLflow support. Wiring up Ray and MLflow required manual work.
- ·Unified compute model: Spark handles SQL and Python in one framework, badly. Now we maintain two systems.
- ·Institutional knowledge: The team knew Spark. Ray had a learning curve.
For our workload, the tradeoffs were clear. If your pipeline is SQL-heavy and Python UDFs are rare, Spark is fine. We were in the wrong place on the spectrum.