For big datasets, I stopped using pandas myself a few years back for anything other than printing dataframes, working with datetime-indexed series, doing quick plots, or handling tiny/toy datasets -- in favor of numpy structured/record arrays. It's kind of the same thing, without all the groupby/index machinery, but very fast.
Just last week, I've helped my colleague speed up her code (numerical solver for financial data) by more than 100x, the biggest part of it was ditching pandas entirely and using numpy.
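To make the numpy-structured-array approach concrete, here's a minimal sketch (the column names and values are purely illustrative, not from the colleague's solver):

```python
import numpy as np

# A structured array: named columns with fixed dtypes, but no index
# or groupby machinery -- each column is a plain ndarray.
data = np.array(
    [("2024-01-02", 101.5, 1200),
     ("2024-01-03", 102.0, 900),
     ("2024-01-04", 99.8, 1500)],
    dtype=[("date", "datetime64[D]"), ("price", "f8"), ("volume", "i8")],
)

# Columns are accessed by name; operations are vectorized numpy
# calls with no per-row or per-Series overhead.
notional = data["price"] * data["volume"]
print(notional.sum())  # -> 363300.0
```

The speedups in cases like this usually come from skipping pandas' indexing and alignment logic on every operation, not from numpy doing fundamentally different math.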
I have a dataset of about 4 million rows I routinely analyze. I have 32 GB of memory on my desktop, and the only time I've really run out is when I write incredibly poor code. Yet in the short while I've been trying to use pandas, I've run out of memory and been killed by the OOM killer, or completely frozen my system for half an hour, while processing what I thought were simple operations.
I was honestly beginning to believe I was far worse at programming than I thought, given all of the issues I was having. I wasn't even doing anything particularly complex -- just loading a dataframe from a SQL query and playing around with basic manipulation.
But pandas’ magical simplicity makes things like computed columns immediately intuitive:
> data['% of total'] = data.amount / data.amount.sum()
Is that immediately intuitive? I'm staring at this trying to understand what it's doing. Is the / operator overloaded? Is data.amount one particular amount, and data.amount.sum() the sum of all amounts? Why does the "computed column" go on the same data object as the actual data? Maybe it's immediately intuitive if you've used pandas. When you see amount / sum, you think of how a list can be divided by what appears to be a scalar.
When they see it, they parse it out for what they naturally understand a percentage to mean. And all is well.
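To unpack the quoted line for readers in the first camp, here's a small runnable sketch (the toy amounts are made up; only the column name matches the quote):

```python
import pandas as pd

# Hypothetical toy data; "amount" matches the column in the quoted line.
data = pd.DataFrame({"amount": [10.0, 30.0, 60.0]})

# data.amount is the whole column (a Series), not one particular amount.
# The / operator is overloaded: dividing a Series by the scalar
# data.amount.sum() broadcasts, dividing every element by the total.
data["% of total"] = data.amount / data.amount.sum()

# The new column holds [0.1, 0.3, 0.6] -- each amount over the total 100.
print(data)
```

So yes, the / is overloaded, and the computed column lands on the same object because a DataFrame is just a bag of named columns.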
Given I had ten weeks to cram a lot of material in but did want to show them some amount of programming, this worked pretty nicely.
"% of total" : 0.01
I would not expect that to be 1%.
At the least, this could easily be the source of an inaccurate calculation elsewhere. This is not a major criticism, but it would perhaps be a good point to introduce the idea of testing some of your code, even as a few simple cells that calculate things you expect.
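Such a test cell could be as simple as asserting a property you already know must hold -- a sketch, with made-up numbers:

```python
import pandas as pd

data = pd.DataFrame({"amount": [10.0, 30.0, 60.0]})
data["% of total"] = data.amount / data.amount.sum()

# A fraction-of-total column should sum to 1.0; a true percentage
# column should sum to 100. Asserting this catches the 0.01-vs-1%
# confusion before it propagates into a later calculation.
assert abs(data["% of total"].sum() - 1.0) < 1e-9

# If the column is meant to literally be a percentage, scale by 100:
data["% of total"] = 100 * data.amount / data.amount.sum()
assert abs(data["% of total"].sum() - 100.0) < 1e-9
```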
http://jupyter.readthedocs.io/en/latest/install.html
It does work better for people to install Jupyter with Anaconda rather than use virtual environments, because there isn't the overhead of also having to learn about virtual environments. People tend to think of virtualenvs as just associated with the class and don't use them much for their own work outside of the workshop or course.
I know everyone loves the reproducibility Notebooks supposedly bring to the table, but without a doubt my favorite part is the ability to export super-unattractive matplotlib charts as PDF, clean them up in Illustrator, and suddenly find yourself with publication-quality graphics. Knowing you're producing something more than just some numbers to toss in a story can be a strong sell to a lot of folks.
%run the_script_you_wrote_in_sublime.py
will evaluate the script and expose the script's globals in the interactive namespace. Then you can mess around with the values and do plots. This gives you the interactivity of the notebook as well as the benefits of the editor you already use.
Anything that helps refactor code between files and the notebook is nice.
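%run is an IPython magic, but what it does is roughly what the standard library's runpy.run_path does: execute a script and hand back its globals. A sketch with a throwaway stand-in script (the filename and variables are hypothetical):

```python
import os
import runpy
import tempfile

# Write a stand-in for "the_script_you_wrote_in_sublime.py".
script = os.path.join(tempfile.mkdtemp(), "analysis.py")
with open(script, "w") as f:
    f.write("values = [1, 2, 3]\ntotal = sum(values)\n")

# %run <script> in IPython behaves roughly like runpy.run_path():
# the script executes and its globals become available to poke at
# interactively afterwards.
ns = runpy.run_path(script)
print(ns["total"])  # -> 6
```

In an actual IPython or notebook session, %run puts those globals directly into your interactive namespace instead of a dict, which is what makes the mess-around-then-plot workflow so convenient.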
1. A variable window where I can browse through the values of each variable (like RStudio)
2. The ability to set breakpoints
So basically something in-between PyCharm and Jupyter.
pd.read_clipboard()
pd.read_excel()
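For context on those two: pd.read_clipboard() feeds whatever text is on your clipboard through the same parser as pd.read_csv(), which is handy for pasting a table out of a spreadsheet or web page. Since the clipboard isn't scriptable here, this sketch simulates it with a StringIO:

```python
import io
import pandas as pd

# Tab-separated text, as a spreadsheet copy typically produces.
# pd.read_clipboard() would parse this same text straight off the
# clipboard; pd.read_csv() on a buffer is the scriptable equivalent.
pasted = "name\tamount\nalice\t10\nbob\t30\n"
df = pd.read_csv(io.StringIO(pasted), sep="\t")
print(df["amount"].sum())  # -> 40
```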
If I really need to containerize something, I use Docker.
% mkvirtualenv -p `which python3` notebook
(notebook) % pip install jupyter notebook scipy pandas matplotlib pdbpp ipython
(not sure if all of them are really necessary)

It has the basics of a Jupyter notebook -- filter, sum, average, plot. So far it's attracted a pretty interesting audience including journalists, but also lawyers and consultants.
www.CSVExplorer.com