Being a Data Scientist: My Experience and Toolset

(jeffersonheard.github.io)

168 points · jeffheard · 9y ago · 48 comments

achompas 9y ago

These types of posts validate my concern about the people entering my field right now.

Data science, as a line of work, is distinct from other technical roles in its focus on creating business value using machine learning and statistics. This quality is easily observed in the most successful data scientists I've worked with (whether at unicorn startups, big companies like my current employer, or "mission-driven" companies).

Implicit in this definition is avoiding the destruction of business value by misapplying ML/statistics. In that sense, I am concerned about blog posts like these (which list 50 libraries and zero textbooks or papers) and those who comment arguing the relevance of "real math" in the era of computers.

Speaking bluntly: if you are a "data scientist" that can't derive a posterior distribution or explain the architecture of a neural network in rigorous detail, you're only going to solve easy problems amenable to black-box approaches. This is code for "toss things into pandas and throw sklearn at it". I would look for a separate line of work.

SatvikBeri 9y ago

I think the "Data Scientist" job title is overloaded–I see several clusters of skills being useful, and in my ideal world they would have similar but slightly different job titles:

–Medium Stats/ML, medium Engineering ("Data Scientist" or "Data Engineer")

–High Engineering on very large datasets, low/medium Stats/ML ("Data Engineer" or "Backend Engineer")

–High Analysis, medium Stats/ML, low Engineering ("Analyst")

–High traditional Stats, High Analysis, low ML/Engineering ("Statistician")

–High ML, medium Stats, medium Analysis ("Data Scientist")

–High ML, medium Engineering ("Machine Learning Engineer")

tangue 9y ago

One of the lessons of the web (in the 1990s everyone was a webmaster until the field matured) is that after the coders, specialists emerged in fields like design, management, UX, SEO and content. For data science the most obvious is data visualization, but I guess there are plenty of new jobs ahead in addition to core data science jobs.

1 more reply

jeffheard (OP) 9y ago

You know, I really should add a post soon about algorithms, papers, and textbooks. You make an important point which the first responder highlighted, "avoiding the destruction of business value by misapplying ML/statistics."

I understand the math behind what I do, but it's not a fair assumption to think that everyone reading my post will be motivated to pick up and understand the math before they start applying the tools.

With tools like scikit-learn and Orange, it's especially easy to misapply ML and statistics, or simply to approach a problem without understanding the tools, and come out with something that looks plausible to the untrained eye.

Key to the reason that you should understand your tools, including the math that underlies them, is that you should be able to look at the results of your work and know if there's something "off". And beyond that the underlying understanding of the math involved gives you the tools you need to debug.

nonbel 9y ago

I propose you can basically monte carlo yourself to a decent understanding.

The disadvantage is: You never know you are right for sure, plus there is extra time spent on applying your experience to each new type of problem.

The advantage is: You can more easily relax assumptions once it is set up, and you learn heuristics that deal with new problems more quickly than the perfect way.
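One way to picture "monte carlo yourself to a decent understanding" is to check a textbook result by simulation instead of deriving it. The sketch below is only an illustration: the function name, the sample size, and the value of sigma are assumptions, not anything from the comment above. It draws many samples, computes each sample mean, and compares the spread of those means with the standard-error formula sigma / sqrt(n).

```python
import random
import statistics


def simulated_standard_error(n, sigma=2.0, trials=5000, seed=0):
    """Estimate the standard error of the mean by brute-force simulation.

    Draws `trials` samples of size `n` from a normal distribution with
    standard deviation `sigma` and returns the spread of the sample means.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


n = 25        # illustrative sample size
sigma = 2.0   # illustrative population standard deviation
simulated = simulated_standard_error(n, sigma)
theoretical = sigma / n ** 0.5
print(f"simulated: {simulated:.3f}  theoretical: {theoretical:.3f}")
```

The trade-off described in the comment shows up directly: the simulated value only approximates the formula, but changing `rng.gauss` to a skewed or heavy-tailed generator relaxes the normality assumption with a one-line edit.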

teej 9y ago

Or, just like software engineering or any other profession in the world, there's going to be a need for people to solve hard problems and people to solve easy problems. Data science isn't different.

achompas 9y ago

Yeah, that's fair!

Declanomous 9y ago

> Implicit in this definition is avoiding the destruction of business value by misapplying ML/statistics

This is an incredibly important point.

I'm working as a fundraising and marketing analyst for a non-profit, but my background is in biology. The skill-set needed for analysis is pretty similar between marketing and population ecology. If you ask someone in either field what the biggest barrier to analysis is, getting data would almost certainly be the most common answer for both fields. However, data is treated very differently between the two fields.

On the scientific side, I find that most of the frustration occurs because there isn't enough data to make a conclusion. Peers will criticize conclusions made with insufficient information.

On the business side, I find that I'm often pressured to make claims that are much more confident than the data can support. As a scientist, I am always very aware of the limitations of my data, but in business I feel like I'm pressured to make conclusions, and that people are waiting to make decisions based on any information they can get out of me.

I spend more time on my write-ups than I do planning my experiments, collecting data, and performing my analysis combined. In a business setting time "moves faster" and the stakeholders in a project expect results no matter what. In these cases, communicating what the limitations are in a concrete way is really important. Expressing risk in terms of money, or probability in terms of coin-flips makes a pretty substantial difference, and can really help people relate to the information you are presenting.
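A rough sketch of what "expressing risk in terms of money, or probability in terms of coin-flips" could look like in practice. The helper names and the numbers (a 3% chance, a 200,000 cost) are hypothetical and chosen only to make the arithmetic concrete. A probability p is about as unlikely as log2(1/p) fair coin flips all landing heads, and the expected loss is the probability times the cost.

```python
import math


def as_coin_flips(probability):
    """Number of consecutive fair-coin heads that is as unlikely as `probability`."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.log2(1.0 / probability)


def expected_loss(probability, cost):
    """Expected monetary loss: chance of the bad outcome times what it costs."""
    return probability * cost


p_failure = 0.03         # hypothetical chance the campaign underperforms
cost_if_wrong = 200_000  # hypothetical cost in whatever currency applies

flips = as_coin_flips(p_failure)
loss = expected_loss(p_failure, cost_if_wrong)
print(f"About as likely as {flips:.1f} heads in a row; expected loss {loss:,.0f}")
```

Saying "roughly five heads in a row" tends to land better with a non-technical audience than "p = 0.03", which is the point the comment makes.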

milliondollar 9y ago

Speaking as a business person: often the biggest challenge is to make ANY decision and actually DO something. The perfect is the enemy of the good. So to continue the cliches the business critique of your objections would be "analysis paralysis."

I tell you this just to help you understand what you describe. But in my observations of failure modes in business, it is rarely because one follows the wrong analysis, but more because most are unwilling to make any changes unless confronted with overwhelming evidence. (And that hurdle always gets higher no matter how much evidence you give.)

1 more reply

lacampbell 9y ago

> and those who comment arguing the relevance of "real math" in the era of computers.

Is this related to my comment? I used "age of computers", but close enough. It's really not a fair representation of what I said at all.

I stressed the importance of knowing theorems and deriving proofs - arguably "realer" math than learning an equation by rote. I did some applied maths in undergrad, and in my experience a lot of my time was devoted to solving large and complex equations using fairly mechanical rules, and comparatively little of my time was spent on axioms and proofs. I wonder whether this focus is justified in the age of computers - might we derive the complex formulas just once or twice as an exercise, and not step through them ourselves again and again? Might we focus more on what the computer can't do well for us - rigour and intuition?

achompas 9y ago

> Is this related to my comment?

It was initially related, yeah, but I realized I had uncharitably read your point. I edited my comment, but not enough. Sorry about that.

To be fair, this point is often raised in these threads as "why do math when computers do it for us?" so the criticism wasn't specifically levied against you.

We agree that repeated derivation when working on a new problem can be useless. It would be silly to work out OLS assumptions from first principles upon any import of sklearn.linear_model! I believe understanding those assumptions, though, or (say) how backpropagation works is important, since it can help you (1) debug issues and (2) explain modifications to the core models (GLMs or LSTMs, in the above examples).
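To make the "debug issues" point concrete, here is a minimal sketch, assuming a deliberately mis-specified model: a straight line fitted to data that is actually quadratic. It uses a hand-written simple OLS fit rather than sklearn.linear_model so that it runs with the standard library alone, and all names and numbers are illustrative. Knowing that OLS assumes linearity tells you where to look: the residuals are uncorrelated with x by construction, but they track x squared almost perfectly.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(0)
xs = [i / 10 for i in range(-30, 31)]
ys = [x * x + rng.gauss(0.0, 0.5) for x in xs]  # the true relationship is quadratic

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# OLS forces residuals to be uncorrelated with x, so this check looks clean...
r_linear = correlation(residuals, xs)
# ...but the leftover structure shows up against the missing x^2 term.
r_curved = correlation(residuals, [x * x for x in xs])
print(f"corr(residuals, x) = {r_linear:.3f}, corr(residuals, x^2) = {r_curved:.3f}")
```

The fit itself runs without complaint, which is exactly why the assumption has to be checked by someone who knows it exists.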

jupiter90000 9y ago

Part of the issue as I see it (for me, unrelated to the article) is that companies are willing to use the data scientist title for positions that need none of the rigor you mention. However, the people were hired and are now called data scientists.

The same type of thing seems to happen in other fields, too. Software engineers who don't engineer, data scientists who don't 'science', project managers who don't manage. Are they top in their field? No, they somehow have a job with the title though and so far have managed to not become unemployable. Do they care if they are rigorous in what their title is expected to be by top practitioners? Probably not, they get paid still and have the title, and can probably get hired at the next similar place.

Kind of sad that these positions may 'cheapen' the title, so what can be done about that? Not much I guess, since companies can use position titles as they'd like it seems...

avn2109 9y ago

In my (admittedly short) experience as a data scientist, "solving the wrong problem"/"working on irrelevant things" and "inadequately cleaned/prepped training data" are vastly, overwhelmingly more common failure modes than "building the right thing with good data inputs but misunderstanding the algos." Probably more common by an order of magnitude or two.

Then again, maybe I'm just working at companies with problems that are amenable to easily-understood algos but have plenty of data-and-product-themed problems.

1 more reply

stillsut 9y ago

The roles of statistician and data scientist are not substitutes but more like complements. This guy definitely is a data scientist. Here's some ways to tell:

- Works on non-mission-critical components, e.g. he's not doing statistics for when the wing will fall off your airplane, but he can help you figure out business problems more open to interpretation, e.g. subject line open rates.

- His publishing tools favor flair over convention, e.g. Ctrl+f for "latex" has zero results, but he does have D3, C3, Bokeh, surprisingly no tableau.

- Not sure he even references a single classical statistics package. The vast majority of people publishing in social sciences or "old school" life sciences are using Minitab, JMP, R, or SAS (correct me if I'm wrong, please, it's an outsider's perspective).

This skillset is neither inherently "cutting edge!" nor deceptively "all talk, no walk". These really are completely different roles that use some of the same tools, formulas, and jargon. To cut to the heart of it: When a company builds a plane and says "I wonder how unlikely it would be for the wing to fall off?" that creates the demand for a statistician. When a company is trying to out-compete others, or maximize profit/charitable-effectiveness, often in a service or a field that is heavily influenced by human psychology, that creates the potential for a data scientist to add value.

jeffheard (OP) 9y ago

I knew I was forgetting packages. I do in fact use Tableau. Will add it. Thanks for the catch!

As for LaTeX, it would have never occurred to me to add it. I have no idea why not, but it doesn't. Maybe because it feels more like a chore than a tool. It's like an anti-tool. I mean, I do or did in the recent past use LaTeX, but in more recent years I would farm that out to someone junior to me who hadn't worked with it for long enough to prefer pouring bleach in their ears to being faced with tweaking one more broken LaTeX template.

I probably should include classical stats packages. They really should go in here. But I've been coding since I was a kid and typically eschewed classical stats and math packages because of my perception that they were slow walled-gardens, and that as soon as I had a method figured out in Matlab or SPSS I'd end up rewriting it in C, C++, or Java to make it work with other things or at scale. That was hammered home in the first company I worked with where we did modeling in SAS and then rewrote every model in Java because SAS couldn't keep up.

I'm not suggesting that classical stats packages aren't data scientists' tools. I think they are. They're just not my tools because of the curious niche I found myself in.

bigger_cheese 9y ago

I think my job is similar to yours. My background is in engineering at an industrial manufacturing plant.

I have some of the same issues. The engineers here tend to reach for spreadsheets first (or Access databases; these things are everywhere at my work), and inevitably they run into scaling problems and end up with a huge bloated mess. I step in to re-architect these monstrosities (using "real" databases when necessary).

The other big part of my day-to-day work is modelling and data analysis. Usually regression-based stuff and LP optimization problems (SAS is very good for this), especially around yield and quality control. The venerable Excel "solver" plugin is often abused very heavily by engineers and is not always the ideal solution.

The person I took over from was a stats guy and the original job title was "Process Statistician"; my boss has since retitled my role "Data Management Engineer". I still think of myself as an engineer first and foremost and a "data" person second.

I use SAS heavily. We have kind of gone in the opposite direction to you. I have rewritten some of our models from C++ into SAS, mostly for ease of maintenance, because SAS is better understood by the non-programmers (most of the engineers here do not have a programming/CS background, and those that do tend to know either Fortran or Visual Basic; very few grasp C/C++ well). Speed is not really an issue, but opaqueness and ease of maintenance are.

I'd like to learn R because I have heard it is very similar to SAS but more transferable to outside companies. Julia is the other language I've got my eye on; I have heard it is somewhat similar to MATLAB, which is used for some modelling work here.

autokad 9y ago

sometimes i write python packages to auto populate tex files. like imagine running LDA with 50 topics and showing how each topic (via word cloud) correlates to an outcome variable

then it starts to become a tool :)
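A minimal sketch of the kind of script described above, assuming the analysis has already produced a list of (topic label, correlation) pairs. The function names, the labels, and the numbers are made up for illustration; the comment does not say how its packages are structured. The script escapes LaTeX special characters and emits a `tabular` environment that a report can pull in with `\input`.

```python
def latex_escape(text):
    """Escape the LaTeX special characters most likely to appear in labels."""
    replacements = {"&": r"\&", "%": r"\%", "_": r"\_", "#": r"\#", "$": r"\$"}
    return "".join(replacements.get(ch, ch) for ch in text)


def topics_to_latex(rows):
    """Render (label, correlation) pairs as a two-column LaTeX table."""
    lines = [r"\begin{tabular}{lr}", r"Topic & Correlation \\", r"\hline"]
    for label, corr in rows:
        # "\\\\" in the f-string produces the two backslashes that end a table row
        lines.append(f"{latex_escape(label)} & {corr:+.2f} \\\\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


# Hypothetical results, standing in for real topic-model output.
rows = [("pricing_terms", 0.42), ("support & billing", -0.17)]
table = topics_to_latex(rows)
print(table)

# Writing the table to a file lets the main document include it:
# with open("topics.tex", "w") as fh:
#     fh.write(table)
```

With 50 topics the same loop produces 50 rows, which is where generating the file starts to beat editing it by hand.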

jordz 9y ago

Cassandra is mentioned. I agree it's great for storing metadata and can be used to build efficient graph implementations, but it's cited for Graphs and Relationships? I think that can be misleading, as Cassandra is a distributed column-based key-value store.

wenc 9y ago

I noticed that too. I don't want to gainsay the author's experiences, but it sounds like the author is describing the job of a data analyst who happens to dabble with various software. I don't get the sense the author has in-depth knowledge about the tools he lists.

Also, I don't know about putting Mongo and Cassandra under "Tools for working with unusual datasets".

codr4life 9y ago

Am I the only one who came here looking for someone's experience as a tool set? For a second there I thought I might have stumbled over real honesty, a rare treat these days. Maybe, if we stop putting each other in stupid labeled boxes to please our bullshit peddling masters, we would get somewhere...

mastazi 9y ago

From the article:

> Machine learning and data mining are not well distinguished, but machine learning techniques increasingly favor “unsupervised” learning algorithms.

The statement above puzzles me because it does not align with what I can see in the news. Maybe I'm just uninformed, so please let me know if I'm wrong.

According to what I can read in the news:

1 - Almost all of the recent ML developments that I can think of are in the field of supervised learning / reinforcement learning

2 - the only field that I can think of where unsupervised learning techniques are prevalent is data mining, which is precisely why I see it as a very specific field.

Am I missing something?

cityhall 9y ago

No, you're right. Nothing about this blog post/resume inspires confidence.

DarkLinkXXXX 9y ago

Big Data is when you outgrow Excel.

mordant 9y ago

'Data scientist' is just title inflation by statisticians.

pjmorris 9y ago

Some say [0] it's title deflation for statisticians.

[0] http://bactra.org/weblog/925.html

nonbel 9y ago

"Statisticians" taught everyone NHST, and relegated bayesian probability to the appendix for decades. Once you realize what has happened there, you will view that title with very little respect.

I am glad to see machine learning, ai, "data science", whatever, grow as a separate field. The statistics programs had their chance.

paulgb 9y ago

There are cases where this is true, but did you look at the tools in the blog post? Can statisticians be expected to write MongoDB code, create a web scraper, and make interactive visualizations in D3?

Title inflation exists, but there is a real-world role here that isn't really captured by "statistician" at all.

ianai 9y ago

If you're in a statistics program you're going to learn to code. That's been my experience anyway.

1 more reply

bertil 9y ago

Actually, I’ve noticed a meaningful distinction between people who learned statistics from machine learning (and are more likely to call each other data scientist) and statisticians (the least experimental of whom used to go by the title analyst): what to do when there is either too little, or too noisy data. Interestingly, those two are happy to be called Data scientist, but in my experience, they rarely meet.

A traditionally trained statistician would point to the negative result, decide not to use the model, and argue for keeping the pre-existing approach. A machine learning expert might not care and would apply the coefficients out of the model as is, because they are presumably closer than a guess; that expert is also more likely to be openly skeptical of human expertise.

That has led to some frustrating situations for me: I argued we should censor things like negative speeds, and was told that there was no problem because the results were regularised anyway. Building and picking proper factors to use in regression is something you can partially skip when you have larger databases and back-propagation can take over; before that, insights still do matter.

I have not met many who can articulate that transition effectively.

It seems that you’ve met mostly the second category; they are possibly the larger group, but not necessarily the most influential. There is a core of people who are meaningfully different. The linked article seems to be from someone in between but closer to the second group.

thinkr42 9y ago

More like 'analyst' in how easily it is thrown around. Calling a built in function in python or R is just about equivalent to calling one in Excel. Sure, you can claim that folks need to know more about what is going on, but honestly, how many have actually gone through the work of deriving the functions they're calling to begin with?

lacampbell 9y ago

I'm wondering how useful deriving functions yourself is in the age of computers. I feel like knowing axioms about the mathematical structure you're dealing with and how to do proofs is very important, but it always struck me as odd that we're still stepping through complex applied maths functions manually in pen and paper. As programmers, we don't bother, say, writing our own hashtable implementation more than a handful of times in our lives, do we? Does forgetting how to derive hashtables mean we won't know how to use them effectively?

Genuine question - more than happy to be proven wrong.

2 more replies

searine 9y ago

>how many have actually gone through the work of deriving the functions they're calling to begin with?

Why would you waste your time re-inventing a wheel?

A good data scientist isn't good because he/she can ace shitty trivia, he/she is good because they know the right question to ask.

1 more reply

Volt 9y ago

I'm not sure this is what a data scientist is. It was supposed to be a research scientist (which is where the scientist part came from) that wrangles data and code. This individual should have both domain knowledge and coding chops while knowing how to conduct research.

131012 9y ago

That would make me a data scientist, but I do not think I am and still have to learn a few tricks from this guy (and others).
