I am contracted by a major financial services firm to refactor an analytical model used for revenue forecasting into PySpark running on an AWS EMR cluster.
The project's current status is documented[0].
The client's operationalization team previously succeeded in refactoring another analytical model into Python/pandas. The current model's execution time for a 5-year scenario is ~17 hours. Most of that time is
spent executing poorly crafted Oracle SQL queries that draw millions of rows into the analytical runtime for sorting, aggregation, discarding, merging, and splitting tasks.
To constrain this execution time, the final input is a sample of ~1.8M rows from a loan portfolio of ~81M records.
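To make the shape of the workload concrete, here is a miniature sketch of that sort/aggregate/merge/split style of pipeline in pandas. The column names and data are hypothetical, not the client's schema; the point is that each step has a direct PySpark DataFrame counterpart (join, groupBy/agg, orderBy, filter), which is part of why PySpark is a plausible refactoring target.

```python
import pandas as pd

# Hypothetical miniature of the loan-portfolio pipeline; columns and
# values are illustrative only, not the client's actual schema.
loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "segment": ["A", "A", "B", "B"],
    "balance": [100.0, 250.0, 75.0, 300.0],
})
rates = pd.DataFrame({"segment": ["A", "B"], "rate": [0.04, 0.06]})

# merge  -> PySpark join
merged = loans.merge(rates, on="segment", how="left")

# groupby/agg + sort -> PySpark groupBy().agg() + orderBy()
agg = (merged.groupby("segment", as_index=False)["balance"]
             .sum()
             .sort_values("balance", ascending=False))

# split -> PySpark filter()
large = merged[merged["balance"] >= 200]
small = merged[merged["balance"] < 200]
```

Pushing these steps into Spark (or even into better-crafted SQL) would let the engine do the sorting and aggregation instead of hauling millions of rows into the Python runtime first.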
The client is concerned about performance and believes PySpark is the preferred target language.
I have been on this project for just one month, but I previously contracted at the same firm for six months to refactor another model into Python/pandas. That project succeeded largely due to the team leader's rigor in meeting milestones and ability to remove blockers for the team.
I recently discussed these projects with @Travis Oliphant, who had some interesting ideas on Python-based frameworks for processing out-of-core dataframes. We discussed Dask[1], Coiled.io (commercial Dask support)[2], Ray[3], and Modin[4].
Others mentioned were Databricks[5], bodo.ai[6], Voltron Data[7], and AtScale[8]. On Reddit, the
commentary on Snowflake was very positive[9].
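Part of the appeal of several of these frameworks is that they advertise near drop-in compatibility with the pandas API, so a team fluent in pandas might scale out with minimal rewriting. A sketch of the idea (the DataFrame below is hypothetical; how completely a given model's pandas idioms are actually covered has to be verified case by case, since both Modin and the pandas API on Spark implement only a subset of pandas):

```python
# With a drop-in framework, only the import line would change:
#   import modin.pandas as pd      # Modin, backed by Ray or Dask
#   import pyspark.pandas as pd    # pandas API on Spark (Spark >= 3.2)
# The code below runs unchanged against plain pandas.
import pandas as pd

df = pd.DataFrame({
    "loan_id": range(5),
    "balance": [10.0, 20.0, 30.0, 40.0, 50.0],
})
total = df["balance"].sum()
over_25 = df[df["balance"] > 25]
```

The trade-off is that compatibility gaps tend to surface in exactly the long-tail feature-engineering code that makes these models expensive to maintain, which seems relevant to the maintainability requirement below.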
Easing the maintenance burden of keeping the model in production and devising new scenarios (e.g. Covid-19 effects on forbearance requests) are requirements. Its shelf-life is measured in years, making maintainability a major consideration.
What have others experienced when scaling out feature-engineering workloads with teams already familiar with Python/pandas?
Is PySpark a dead-end library in the Python ecosystem?
[0] https://www.pythonforsasusers.com/project_summary/current_project_status.html
[1] https://dask.org/
[2] https://coiled.io/
[3] https://docs.ray.io/en/ray-0.4.0/pandas_on_ray.html
[4] https://modin.readthedocs.io/en/stable/
[5] https://docs.databricks.com/languages/pandas-spark.html (which points to Apache's Pandas API on Spark)
[6] https://bodo.ai/
[7] https://wesmckinney.com/blog/from-ursa-to-voltrondata/
[8] https://www.atscale.com/autonomous-data-engineering/
[9] https://www.reddit.com/r/dataengineering/comments/r893rw/why_is_snowflake_so_popular/