For big datasets, I stopped using pandas myself a few years back for anything other than printing dataframes, working with datetime-indexed series, doing quick plots, or handling tiny/toy datasets -- in favor of numpy structured/record arrays. It's kind of the same thing, without all the groupby/index machinery, but very fast.
Just last week, I've helped my colleague speed up her code (numerical solver for financial data) by more than 100x, the biggest part of it was ditching pandas entirely and using numpy.
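To make the numpy-structured-array approach concrete, here's a minimal sketch (the column names and values are purely illustrative, not from the colleague's solver):

```python
import numpy as np

# A structured array: named columns with fixed dtypes, but no index
# or groupby machinery -- each column is a plain ndarray.
data = np.array(
    [("2024-01-02", 101.5, 1200),
     ("2024-01-03", 102.0, 900),
     ("2024-01-04", 99.8, 1500)],
    dtype=[("date", "datetime64[D]"), ("price", "f8"), ("volume", "i8")],
)

# Columns are accessed by name; operations are vectorized numpy
# calls with no per-row or per-Series overhead.
notional = data["price"] * data["volume"]
print(notional.sum())  # -> 363300.0
```

The speedups in cases like this usually come from skipping pandas' indexing and alignment logic on every operation, not from numpy doing fundamentally different math.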
I have a dataset of about 4 million rows I routinely analyze. I have 32 GB of memory on my desktop, and the only time I've really run out is when I write incredibly poor code. Yet in the short while I've been trying to use pandas, I've run out of memory and been killed by the OOM killer, or completely frozen my system for half an hour, while processing what I thought were simple operations.
I was honestly beginning to believe I was far worse at programming than I thought, given all of the issues I was having. I wasn't even doing anything particularly complex -- just loading a dataframe from a SQL query and playing around with basic manipulation.
But pandas’ magical simplicity makes things like computed columns immediately intuitive:
> data['% of total'] = data.amount / data.amount.sum()
Is that immediately intuitive? I'm staring at this trying to understand what it's doing. Is the / operator overloaded? Is data.amount one particular amount, and data.amount.sum() the sum of all amounts? Why does the "computed column" go on the same data object as the actual data? Maybe it's immediately intuitive if you've used pandas. When you see amount / sum, you think of how a list can be divided by what appears to be a scalar.
When they see it, they parse it out for what they naturally understand a percentage to mean. And all is well.
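To unpack the quoted line for readers in the first camp, here's a small runnable sketch (the toy amounts are made up; only the column name matches the quote):

```python
import pandas as pd

# Hypothetical toy data; "amount" matches the column in the quoted line.
data = pd.DataFrame({"amount": [10.0, 30.0, 60.0]})

# data.amount is the whole column (a Series), not one particular amount.
# The / operator is overloaded: dividing a Series by the scalar
# data.amount.sum() broadcasts, dividing every element by the total.
data["% of total"] = data.amount / data.amount.sum()

# The new column holds [0.1, 0.3, 0.6] -- each amount over the total 100.
print(data)
```

So yes, the / is overloaded, and the computed column lands on the same object because a DataFrame is just a bag of named columns.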
Given I had ten weeks to cram a lot of material in but did want to show them some amount of programming, this worked pretty nicely.
"% of total" : 0.01
I would not expect that to be 1%.
At the least, this could easily be the source of an inaccurate calculation elsewhere. This is not a major criticism, but it would perhaps be a good point to introduce the idea of testing some of your code, even as a few simple cells that calculate things you expect.
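Such a test cell could be as simple as asserting a property you already know must hold -- a sketch, with made-up numbers:

```python
import pandas as pd

data = pd.DataFrame({"amount": [10.0, 30.0, 60.0]})
data["% of total"] = data.amount / data.amount.sum()

# A fraction-of-total column should sum to 1.0; a true percentage
# column should sum to 100. Asserting this catches the 0.01-vs-1%
# confusion before it propagates into a later calculation.
assert abs(data["% of total"].sum() - 1.0) < 1e-9

# If the column is meant to literally be a percentage, scale by 100:
data["% of total"] = 100 * data.amount / data.amount.sum()
assert abs(data["% of total"].sum() - 100.0) < 1e-9
```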
http://jupyter.readthedocs.io/en/latest/install.html
It does work better for people to install Jupyter with Anaconda rather than use virtual environments, because there isn't the overhead of also having to learn about virtual environments. People tend to think of virtualenvs as just associated with the class and don't use them much for their own work outside of the workshop or course.
I know everyone loves the reproducibility Notebooks supposedly bring to the table, but without a doubt my favorite part is the ability to export super-unattractive matplotlib charts as PDF, clean them up in Illustrator, and suddenly find yourself with publication-quality graphics. Knowing you're producing something more than just some numbers to toss in a story can be a strong sell to a lot of folks.
%run the_script_you_wrote_in_sublime.py
will evaluate the script and expose the script's globals in the interactive namespace. Then you can mess around with the values and do plots. This gives you the interactivity of the notebook as well as the benefits of the editor you already use.
Anything that helps refactor code between files and the notebook is nice.
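%run is an IPython magic, but what it does is roughly what the standard library's runpy.run_path does: execute a script and hand back its globals. A sketch with a throwaway stand-in script (the filename and variables are hypothetical):

```python
import os
import runpy
import tempfile

# Write a stand-in for "the_script_you_wrote_in_sublime.py".
script = os.path.join(tempfile.mkdtemp(), "analysis.py")
with open(script, "w") as f:
    f.write("values = [1, 2, 3]\ntotal = sum(values)\n")

# %run <script> in IPython behaves roughly like runpy.run_path():
# the script executes and its globals become available to poke at
# interactively afterwards.
ns = runpy.run_path(script)
print(ns["total"])  # -> 6
```

In an actual IPython or notebook session, %run puts those globals directly into your interactive namespace instead of a dict, which is what makes the mess-around-then-plot workflow so convenient.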
1. A variable window where I can browse through the values of each variable (like RStudio)
2. The ability to set breakpoints
So basically something in-between PyCharm and Jupyter.
pd.read_clipboard()
pd.read_excel()
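For context on those two: pd.read_clipboard() feeds whatever text is on your clipboard through the same parser as pd.read_csv(), which is handy for pasting a table out of a spreadsheet or web page. Since the clipboard isn't scriptable here, this sketch simulates it with a StringIO:

```python
import io
import pandas as pd

# Tab-separated text, as a spreadsheet copy typically produces.
# pd.read_clipboard() would parse this same text straight off the
# clipboard; pd.read_csv() on a buffer is the scriptable equivalent.
pasted = "name\tamount\nalice\t10\nbob\t30\n"
df = pd.read_csv(io.StringIO(pasted), sep="\t")
print(df["amount"].sum())  # -> 40
```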
If I really need to containerize something, I use Docker.
% mkvirtualenv -p `which python3` notebook
(notebook) % pip install jupyter notebook scipy pandas matplotlib pdbpp ipython
(not sure if all of them are really necessary)

It has the basics of a Jupyter notebook -- filter, sum, average, plot. So far it's attracted a pretty interesting audience including journalists, but also lawyers and consultants.
www.CSVExplorer.com