At Lingoda we had a marketing attribution model that lived entirely in PySpark. It worked — until it didn't. Long runtimes, opaque logic buried in Python functions, no test coverage, and every change requiring someone who knew the history of the code. When we decided to migrate it to dbt + SQL running on Redshift, we knew the migration itself was just the start. The real challenge was proving the new version produced identical results.
PySpark made sense when we were on a Hadoop-style architecture and needed distributed compute for raw data processing. Once Redshift became our primary warehouse, running PySpark transformations meant spinning up a cluster, paying for it, and maintaining a separate execution environment — just to produce tables that Redshift could have computed with SQL. The operational overhead was constant and the benefits minimal for this workload.
dbt gave us what PySpark couldn't: version-controlled, documented, testable transformation logic that lives directly in the warehouse. Every model is a SELECT statement. Every column can have a description. Every critical field can have a test. And the lineage is automatic.
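As a sketch of what that looks like in practice, a dbt `schema.yml` for an attribution model might declare descriptions and tests like this (model and column names here are illustrative, not our actual schema):

```yaml
version: 2

models:
  - name: marketing_attribution
    description: "Touchpoint-level attribution, rebuilt as a SQL model."
    columns:
      - name: touchpoint_id
        description: "Unique identifier for the marketing touchpoint."
        tests:
          - unique
          - not_null
      - name: attribution_pct
        description: "Share of conversion credit assigned to this touchpoint (0-1)."
        tests:
          - not_null
```

With this in place, `dbt test` fails the build on duplicate or missing keys, and `dbt docs generate` renders the descriptions and lineage graph automatically.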
The attribution model touched several upstream sources and produced a handful of downstream metrics used in Finance and Marketing dashboards. We couldn't just rewrite and switch over — we needed confidence that nothing changed.
The approach was a reconciliation framework: run both versions in parallel for a period, then compare outputs row by row. We defined metric parity thresholds (zero tolerance for financial figures, small epsilon for attribution percentages due to floating-point handling differences) and built a comparison job that flagged any divergence automatically.
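The core of such a comparison job can be sketched in a few lines of Python. This is a minimal illustration, assuming both outputs are loaded into pandas DataFrames keyed on a shared identifier; the column names, key, and threshold values are hypothetical, not our production configuration:

```python
import pandas as pd

# Parity thresholds per metric column: zero tolerance for financial
# figures, a small epsilon for percentages (illustrative values).
THRESHOLDS = {
    "attributed_revenue": 0.0,
    "attribution_pct": 1e-6,
}

def reconcile(old: pd.DataFrame, new: pd.DataFrame,
              key: str = "touchpoint_id") -> pd.DataFrame:
    """Outer-join both outputs on the key and return rows where any
    metric diverges beyond its tolerance, or a row exists on only
    one side."""
    merged = old.merge(new, on=key, how="outer",
                       suffixes=("_old", "_new"), indicator=True)
    # Rows present in only one of the two outputs are divergent by definition.
    divergent = merged["_merge"] != "both"
    for col, eps in THRESHOLDS.items():
        diff = (merged[f"{col}_old"] - merged[f"{col}_new"]).abs()
        # Missing values (from one-sided rows) compare as divergent.
        divergent |= diff.fillna(float("inf")) > eps
    return merged[divergent]
```

In practice the same join-and-compare logic can also be expressed as a SQL query over both tables, which keeps the check inside the warehouse; the Python form just makes the per-column thresholds easy to read.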
A few specific things needed careful handling, chief among them the floating-point differences behind the epsilon threshold on attribution percentages.
After several weeks of parallel running with zero divergence, we cut over. The transformation runtime dropped by 70% — from something that took the better part of an hour to one that completes in under 20 minutes, with no separate cluster to manage. The model now has column-level documentation, dbt tests on key fields, and lineage that traces back to raw sources automatically.
More importantly, the next person who needs to change the attribution logic can read a SQL SELECT statement instead of untangling Python closures. That might be the biggest win of all.