I am contracted by a major financial services firm to refactor an analytical model used for revenue forecasting into PySpark running on an AWS EMR cluster.
The project's current status is documented[0].
The client's operationalization team previously succeeded in refactoring another analytical model into Python/pandas. The current model's execution time for a 5-year scenario is ~17 hours. Most of that time is
spent executing poorly crafted Oracle SQL queries that draw millions of rows into the analytical runtime for sorting, aggregation, discarding, merging, and splitting tasks.
To constrain this execution time, the final input is a sample of ~1.8M rows from a loan portfolio of ~81M records.
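To make the shape of the workload concrete, here is a miniature sketch of that sort/aggregate/merge/split style of pipeline in pandas. The column names and data are hypothetical, not the client's schema; the point is that each step has a direct PySpark DataFrame counterpart (join, groupBy/agg, orderBy, filter), which is part of why PySpark is a plausible refactoring target.

```python
import pandas as pd

# Hypothetical miniature of the loan-portfolio pipeline; columns and
# values are illustrative only, not the client's actual schema.
loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "segment": ["A", "A", "B", "B"],
    "balance": [100.0, 250.0, 75.0, 300.0],
})
rates = pd.DataFrame({"segment": ["A", "B"], "rate": [0.04, 0.06]})

# merge  -> PySpark join
merged = loans.merge(rates, on="segment", how="left")

# groupby/agg + sort -> PySpark groupBy().agg() + orderBy()
agg = (merged.groupby("segment", as_index=False)["balance"]
             .sum()
             .sort_values("balance", ascending=False))

# split -> PySpark filter()
large = merged[merged["balance"] >= 200]
small = merged[merged["balance"] < 200]
```

Pushing these steps into Spark (or even into better-crafted SQL) would let the engine do the sorting and aggregation instead of hauling millions of rows into the Python runtime first.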
The client is concerned about performance and believes PySpark is the preferred target language.
I have been on this project for just one month, but I previously contracted at the same firm for six months to refactor another model into Python/pandas. That project succeeded largely due to the team leader's rigor in meeting milestones and ability to remove blockers for the team.
I recently discussed these projects with @Travis Oliphant, who had some interesting ideas on Python-based frameworks for processing out-of-core dataframes. We discussed Dask[1], Coiled.io (commercial Dask support)[2], Ray[3], and Modin[4].
Others mentioned were Databricks[5], bodo.ai[6], Voltron Data[7], and AtScale[8]. On Reddit, the
commentary on Snowflake was very positive[9].
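Part of the appeal of several of these frameworks is that they advertise near drop-in compatibility with the pandas API, so a team fluent in pandas might scale out with minimal rewriting. A sketch of the idea (the DataFrame below is hypothetical; how completely a given model's pandas idioms are actually covered has to be verified case by case, since both Modin and the pandas API on Spark implement only a subset of pandas):

```python
# With a drop-in framework, only the import line would change:
#   import modin.pandas as pd      # Modin, backed by Ray or Dask
#   import pyspark.pandas as pd    # pandas API on Spark (Spark >= 3.2)
# The code below runs unchanged against plain pandas.
import pandas as pd

df = pd.DataFrame({
    "loan_id": range(5),
    "balance": [10.0, 20.0, 30.0, 40.0, 50.0],
})
total = df["balance"].sum()
over_25 = df[df["balance"] > 25]
```

The trade-off is that compatibility gaps tend to surface in exactly the long-tail feature-engineering code that makes these models expensive to maintain, which seems relevant to the maintainability requirement below.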
Easing the maintenance burden of keeping the model in production and devising new scenarios (e.g. Covid-19 effects on forbearance requests) are requirements. Its shelf-life is measured in years, making maintainability a major consideration.
What have others experienced when scaling out feature-engineering workloads with teams already familiar with Python/pandas?
Is PySpark a dead-end library in the Python ecosystem?
[0] https://www.pythonforsasusers.com/project_summary/current_project_status.html
[1] https://dask.org/
[2] https://coiled.io/
[3] https://docs.ray.io/en/ray-0.4.0/pandas_on_ray.html
[4] https://modin.readthedocs.io/en/stable/
[5] https://docs.databricks.com/languages/pandas-spark.html (which points to Apache's Pandas API on Spark)
[6] https://bodo.ai/
[7] https://wesmckinney.com/blog/from-ursa-to-voltrondata/
[8] https://www.atscale.com/autonomous-data-engineering/
[9] https://www.reddit.com/r/dataengineering/comments/r893rw/why_is_snowflake_so_popular/