Pandas Exercises for Data Analysis (Interactive)

(machinelearningplus.com)

126 points · selva86 · 2mo ago · 33 comments

ronfriedhaber 2mo ago

Pandas is terrific, yet even its original author has noted inherent shortcomings [1], and there exist alternatives.

Polars seems to be the most prominent competitor in the Python DataFrame space, and DuckDB appears to pursue an approach similar to SQLite, but columnar.

I am personally working on a solution to a broader problem, which can also be viewed as an alternative to Pandas [2].

[1] https://wesmckinney.com/blog/apache-arrow-pandas-internals/

[2] https://github.com/ronfriedhaber/autark

arijun 2mo ago

For your link [1], many of those issues have been addressed with pandas 2.0 (which I believe Wes McKinney [pandas' original author] contributed to). So it's a bit disingenuous to point to that post and say "See? Even Wes disowns it!"

That being said, if I were to start a new project requiring that kind of work today, I would probably try Polars first. Their greenfield implementation allowed them to get rid of many of the crusty edges of pandas.

0x696C6961 2mo ago

Would be nice to have a polars version of this.

Vaslo 2mo ago

Came here to say the same. We now use Pandas only when forced; Polars and DuckDB are the future.

selva86 (OP) 2mo ago

Made this as well for polars: https://machinelearningplus.com/python/101-polars-exercises-...

I was thinking about it for quite a while, but wasn't sure if there would be interest. Thanks for your comment!

selva86 (OP) 2mo ago

Built this as an interactive tool for our popular 101 Pandas exercises. The code runs entirely locally in your browser. Would love feedback on the ease of use and the editor UX.

alexpotato 2mo ago

These are great!

Would have made my life a lot easier when I was learning Pandas.

Would also be cool to have a Polars version of this.

One suggestion:

A lot of folks come to Pandas from SQL. It might be handy to have a couple of "the equivalent of this SQL statement, but in Pandas" exercises.
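To make the suggestion concrete, here is a minimal sketch of what one such pairing might look like. The `employees` table, its columns, and its values are made up for illustration; they are not taken from the exercises.

```python
import pandas as pd

# Hypothetical table, invented for this example.
employees = pd.DataFrame(
    {
        "dept": ["eng", "eng", "ops", "ops", "sales"],
        "salary": [120, 100, 70, 40, 60],
    }
)

# SQL being translated:
#   SELECT dept, AVG(salary) AS avg_salary
#   FROM employees
#   WHERE salary >= 50
#   GROUP BY dept
#   ORDER BY avg_salary DESC;
result = (
    employees[employees["salary"] >= 50]         # WHERE salary >= 50
    .groupby("dept", as_index=False)             # GROUP BY dept
    .agg(avg_salary=("salary", "mean"))          # AVG(salary) AS avg_salary
    .sort_values("avg_salary", ascending=False)  # ORDER BY avg_salary DESC
    .reset_index(drop=True)                      # tidy 0..n-1 row labels
)
print(result)
```

Each SQL clause maps to one method call in the chain, which is probably the easiest way for someone coming from SQL to read it.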

Vaslo 2mo ago

Looks great, do for Polars next!!

short_sells_poo 2mo ago

You'll get a lot of responses saying Polars is better than Pandas. I argue those people are missing the point and don't understand Pandas' real strength or why people choose Pandas today.

Pandas was never meant to be a technologist's tool. It was meant to be a researcher's tool and was unfortunately co-opted to be a technical solution as well. It has not fully escaped its roots.

Pandas is fantastic for doing iterative and interactive research on semi-structured data. It has a lot of QoL facilities and utility functions for seamlessly dealing with exploratory time-series analytics on in-core data, that is, data that fits into memory.

For example, I can take two time series and calculate their product:

ts3 = ts1 * ts2

This one line does a huge amount of heavy lifting by automatically aligning the timestamps and columns between the two inputs so that I'm not accidentally multiplying two entries that have the same ordinal but not the same timestamp or column label.
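A minimal sketch of that alignment, assuming two small made-up series (the dates and values here are invented, and `ts2` is deliberately stored in a different order):

```python
import pandas as pd

# Hypothetical daily series; dates and values are invented for illustration.
ts1 = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
)
# Overlapping dates, stored in a different order than ts1.
ts2 = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.to_datetime(["2024-01-03", "2024-01-02", "2024-01-04"]),
)

# Entries are matched by timestamp, not by position:
#   2024-01-02 -> 2.0 * 20.0
#   2024-01-03 -> 3.0 * 10.0
# Dates present in only one input come out as NaN.
ts3 = ts1 * ts2
print(ts3)
```

A positional multiply would have paired 1.0 with 10.0, which is exactly the kind of silent mistake the label alignment avoids.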

Can I do the same with Polars? Yes, but it comes with exponentially more cognitive overhead. And this is just one example.

Pandas is ultimately a flawed product, as its origins go back more than a decade, to a time when R's data frame was cutting edge. A lot of innovation has happened since then, and the API and internals of Pandas mean that certain choices made early on are nontrivial to change.

This doesn't change the fact that Pandas is still immensely useful. Eventually perhaps Polars will come close to it, but so far the focus unfortunately hasn't been on the ergonomics of interactive use.

As it stands, I use pandas for research and polars for production systems.

rithdmc 2mo ago

Dope. I've just started using Pandas in some personal projects, and am quickly hitting my knowledge ceiling. I think this will be useful. I'll check it out properly after work.

derriz 2mo ago

If I were investing effort into acquiring knowledge in this domain, I'd skip straight to Polars. Before I made the switch, I had been using Pandas on and off for more than a decade. I'm not sure how representative this is, but most of the people I know who were Pandas users have also made this switch. I initially did it for the performance improvements but the API (according to my subjective opinion) is much more logical and has far fewer surprises compared to Pandas and it would be my default choice for this reason alone at this stage despite my years of Pandas experience.

benrutter 2mo ago

I'd second this, especially if it's just for personal use!

The data world owes a lot to pandas, but it has plenty of sharp edges and using it can sometimes involve pretty close knowledge of how things like indexing/slicing/etc work under the hood.

If I get stuck in polars, it's almost always just a "what's the name of the function to use?" type problem rather than needing lots of knowledge about how things are working under the hood.
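One small illustration of the indexing/slicing sharp edges mentioned above, as a sketch with a made-up series: the same slice `1:3` selects different rows depending on whether it goes through `.loc` (labels, end included) or `.iloc` (positions, end excluded).

```python
import pandas as pd

# Hypothetical series whose integer labels do not start at 0.
s = pd.Series(["a", "b", "c", "d"], index=[1, 2, 3, 4])

by_label = s.loc[1:3]      # label-based: both endpoints are included
by_position = s.iloc[1:3]  # position-based: end is excluded, as with Python lists

print(by_label.tolist())     # ['a', 'b', 'c']
print(by_position.tolist())  # ['b', 'c']
```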


rithdmc 2mo ago

Thanks, I'll look into this in the future. I don't need the most performant script, but this could change.


jtbaker 2mo ago

DuckDB and SQL FTW.

xpe 2mo ago

> [Polars] is much more logical and has far fewer surprises compared to Pandas

A kind understatement imo. For me, the following experiences are highly coupled in my brain: "I'm using Pandas" + "I'm feeling a weird combination of confusion and pain" + "This is a dumpster fire".

data-ottawa 2mo ago

You should check out the Modern Pandas series by Tom Augspurger; it’s well worth reading to learn a clean, modern style of pandas code.

https://tomaugspurger.net/posts/modern-1-intro/

pixelispoint 2mo ago

I second this blog post. I worked with Tom on a project several years ago and he's brilliant. I started doing Python more frequently after that project, and I found his blog very helpful in finding a good way to conceptualize pandas and Python data structures in general.

rithdmc 2mo ago

Thanks. There's a special place in my heart for any blog that opens with 'Prior Work' :)

sceadu 2mo ago

Also, I would recommend looking at videos from Matt Harrison for polars or pandas, e.g.:

https://www.youtube.com/watch?v=Z9ekw2Ou3s0

driftnode 2mo ago

The author posted a Polars version in the comments and almost nobody noticed. Meanwhile the top comments are still asking for it. Building something useful and having people ignore what you made to request what you already made is a special kind of frustration.

kasperset 2mo ago

I don't hear much about Ibis here. https://ibis-project.org On paper it sounds like a good idea. Any opinions on this option?

wismwasm 2mo ago

Ibis is great! Used it with DuckDB & Snowflake. Worked well for these backends.

Vaslo 2mo ago

I've used it, but definitely ran into issues where Ibis couldn't handle a transformation and I had to move back to Polars or DuckDB to do it. I eventually just stripped it out.

sghaz 2mo ago

The pricing page says, "This page doesn’t seem to exist. It looks like the link pointing here was faulty. Maybe try searching?"

fud101 2mo ago

What is the permission it asks for? It seems suspicious af.

najarvg 2mo ago

I ran into this too. I'm on a work computer, so I did not want to accept this without knowing.

kjkjadksj 2mo ago

If you think pandas is comfortable, wait until you try base R. Such a comfortable language for data wrangling and analysis.
