ML InfrastructureSnowflakeRayData Engineering AI Generated

Migrating ML Pipelines from Spark to Snowflake + Ray: A 360x Improvement

Nov 20249 min read

Migrating ML Pipelines from Spark to Snowflake + Ray: A 360x Improvement

My previous team inherited a set of ML feature engineering pipelines running on Spark via Kubeflow. They worked, when they finished. The problem: a full pipeline run took 4 hours. Nightly batch jobs regularly missed their windows. Engineers took days to iterate on features. On-call incidents often meant someone manually babysitting a stuck Spark job at 2 AM.

We migrated to Snowflake for the data transformation layer and Ray on Anyscale for the ML compute. The same pipeline now runs in 40 seconds.

Why 4 Hours Was Unavoidable on Spark

The pipeline had three phases:

  1. ·Feature extraction (about 2 hours): SQL-like transformations on large feature tables. Spark's shuffle-heavy operations on our data distribution were the bottleneck.
  2. ·Model feature computation (about 1 hour): Python UDFs calling into our ML feature library. Spark's Python UDF execution is famously slow because of serialization overhead between the JVM and Python.
  3. ·Validation (about 1 hour): Data quality checks, schema validation, and statistical tests.

Spark was the wrong tool for phases 2 and 3. It was acceptable for phase 1, but we paid the Spark tax on everything.

Snowflake for Data Transformation

Phase 1 moved to Snowflake almost verbatim. Our SQL transformations ran faster for three reasons:

  • ·Auto-scaling compute: Snowflake scales compute warehouses based on query complexity. No manual cluster sizing.
  • ·Column store and micro-partitioning: Our queries filtered heavily on a few columns. Snowflake's columnar storage eliminated massive amounts of I/O.
  • ·No shuffle: Snowflake's query optimizer handles data distribution transparently.

Phase 1 dropped from 2 hours to under 2 minutes.

Ray for ML Compute

For phases 2 and 3, the Python-heavy ML work, we used Ray.

Ray was designed for Python-native distributed computing. No JVM, no serialization overhead, no UDF performance cliff.

python
import ray
from ray import data as rd

@ray.remote
def compute_features(batch: dict) -> dict:
    return feature_library.compute(batch)

ds = rd.read_parquet("s3://features/raw/")
result = ds.map_batches(compute_features, batch_size=1000)

Ray's map_batches distributes computation across workers transparently. Anyscale handles cluster management, scaling workers up during feature computation and down when idle.

The Migration Journey

Weeks 1-2: Migrate Phase 1 to Snowflake. Validate outputs match exactly. Ship to production.

Week 3: Rewrite Phase 2 as Ray jobs. More work than expected. Code assuming PySpark DataFrames had to be refactored to work with pandas and Ray datasets.

Week 4: Migrate Phase 3 validation. Much easier. Our validators were pure Python functions.

Weeks 5-6: Hardening. Add observability using Ray's built-in dashboard and custom Prometheus metrics. Set up Anyscale cluster autoscaling policies. Integrate with existing alert infrastructure.

Numbers

| Phase | Before | After | Improvement | |-------|--------|-------|-------------| | Feature extraction | 2 hours | 90 seconds | 80x | | ML feature computation | 1 hour | 20 seconds | 180x | | Validation | 1 hour | 10 seconds | 360x | | Total | 4 hours | ~2 minutes | ~120x |

The 360x in the title refers to the validation phase specifically. The end-to-end pipeline improved about 120x overall.

What We Gave Up

Spark has a large ecosystem. A few things we left behind:

  • ·MLflow integration: Spark's MLlib has native MLflow support. Wiring up Ray and MLflow required manual work.
  • ·Unified compute model: Spark handles SQL and Python in one framework, badly. Now we maintain two systems.
  • ·Institutional knowledge: The team knew Spark. Ray had a learning curve.

For our workload, the tradeoffs were clear. If your pipeline is SQL-heavy and Python UDFs are rare, Spark is fine. We were in the wrong place on the spectrum.