Statistical inference generally only works well under very specific conditions:
1 - You know the distribution of the phenomenon under study (or make an explicit assumption and assume the risk of being wrong)
2 - Using (1), you calculate how much data you need so you get an estimation error below x%
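For what it's worth, step 2 only works because of step 1. A minimal sketch of the classic calculation, assuming a normal distribution with known sigma (the formula n = (z·σ/E)² is the standard one; the numbers are hypothetical):

```python
import math

def sample_size(sigma, margin, z=1.96):
    # Classic sample-size formula for estimating a mean to within
    # +/- margin at confidence z (1.96 ~ 95%), assuming a known,
    # roughly normal population standard deviation sigma.
    return math.ceil((z * sigma / margin) ** 2)

# e.g. sigma = 15, want the mean estimated to within +/-1
print(sample_size(15, 1))  # 865
```

Drop the distributional assumption and this calculation has nothing to stand on, which is the point being made above.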
Even though most ML models are essentially statistics and have all the same limitations (issues with convergence, fat tailed distributions, etc...) it seems the industry standard is to pretend none of that exists and hope for the best.
IMO the best moneymaking opportunities of the decade will involve exploiting unsecured IoT devices and naive ML models; we will have plenty of both.
In ML (or more specifically deep learning), we make no distribution-based assumptions, other than the fundamental assumption that our training data is "distributed like" our test data. Thus, there aren't issues with fat-tailed distributions since we make no such normality assumptions. Indeed, with the use of autoencoders, we don't assume a single distribution, but rather a stochastic process.
I suppose you could say statistics is less "empirical" than ML in the sense that it is axiom-based, whether that is a normality assumption on the residuals of a regression line or stock prices following a Wiener process. By contrast, ML is less rationalist, simply reflecting the data.
The two fat tail questions one has to engage are:
- is it possible that a catastrophic input might be lurking in the wild that would not be present in a typical training set? Even with a 1M instance training set, a one-in-a-million situation will only appear (and affect your objective function) on average one time, and could very well not appear at all.
- can I bound how badly I will suffer if my system is allowed to operate in the wild on such an input?
DL gives no additional tools to engage these questions.
However most ML certainly makes distributional assumptions - they are just weaker. When you're learning a huge deep net with an L2 loss on a regression task, you have a parametric conditional gaussian distribution under the hood. It's not because it's overparametrized that there's no distributional assumption. Vanilla autoencoders are also working under a multivariate gaussian setup as well. Most classifiers are trained under a multinomial distribution assumption etc.
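The L2/Gaussian point is easy to verify numerically: with fixed variance, the mean squared error and the Gaussian negative log-likelihood are minimized by the same value. A toy sketch fitting a constant model (all numbers made up):

```python
import math

data = [1.0, 2.0, 4.0, 7.0]

def mse(mu):
    return sum((x - mu) ** 2 for x in data) / len(data)

def gaussian_nll(mu, sigma=1.0):
    # Negative log-likelihood of the data under N(mu, sigma^2)
    return sum(0.5 * math.log(2 * math.pi * sigma ** 2)
               + (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

# Both objectives pick the same value on a grid: the sample mean
grid = [i / 100 for i in range(1001)]
best_mse = min(grid, key=mse)
best_nll = min(grid, key=gaussian_nll)
print(best_mse, best_nll)  # 3.5 3.5
```

Swap the L2 loss for L1 and you get a conditional Laplace assumption instead; either way, a distribution is hiding under the loss function.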
And fat-tailed distributions are definitely a thing. It's just less of a concern for the mainstream CV problems on which people apply DL.
Okay, so that's about the same as classical statistics. You're just waiving the requirement to know what the distribution is. You are still assuming there exists a distribution and that it holds in the future when you apply the model. Sure you may not be trying to estimate parameters of a distribution, but it is still there and all standard statistical caveats still apply.
> Indeed, with the use of autoencoders, we don't assume a single distribution, but rather a stochastic process.
Classical statistics frequently makes use of multiple distributions and stochastic processes.
Current ML techniques just work well for the kinds of problems people are applying them to, which is kind of a tautology. We should definitely seek to understand the theory behind stuff like dropout and not consider our lack of understanding a strength.
I don't think that's true (or maybe I misunderstood?). I guess your comment about "simply reflecting data" means fitting data with a very flexible function (curve)? There are very flexible distributions that can fit almost any kind of data, e.g. https://en.wikipedia.org/wiki/Gamma_distribution, or compositions of them, but as a practitioner you still need to interpret the model and check whether it represents the underlying process well. Both statistical inference and ML are getting there using different methods.
And there is a whole field of non-parametric statistics that doesn't make distribution assumptions.
Anyone can use SOTA deep learning models today, but in my experience it's more important to understand the answers to "what are the shortcomings/consequences of using a particular method to solve this problem?", "what are (or could be) the biases in this dataset?", etc. It requires a non-trivial understanding of the underlying methodology and statistics to reliably answer these questions (or at least to worry about them).
Can you apply deep reinforcement learning to your problem? Maybe. Should you? Well, it depends, and you should understand the pros and cons, which requires more than just the knowledge of how to make API calls. There are consequences to misusing ML/AI, and they may not even be obvious from offline testing and cross validation.
If the outputs you want are well within the bounds of your training data set, ML can do wonders. If they aren't, it'll tell you that in 20 years everyone will be having -0.2 children and all the other species on the planet will start having to birth human babies just so they can be thrown into the smoking pit of bad statistical analysis.
Being bad at extrapolation is a consequence of assuming all training data can describe your phenomena distribution and being wrong.
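The negative-children failure mode is just ordinary extrapolation. A toy sketch with made-up fertility numbers: an OLS line that fits the observed range perfectly still predicts negative values well outside it.

```python
# Hypothetical data: years since first measurement vs. children per family
xs = [0, 5, 10, 15, 20]
ys = [2.5, 2.1, 1.7, 1.3, 0.9]  # made-up, steadily declining

# Ordinary least squares for y = a*x + b, closed form
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Inside the training range the fit looks fine; 20 years past it, nonsense
print(a * 10 + b)  # ~1.7 children, plausible
print(a * 40 + b)  # ~-0.7 children, the smoking pit
```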
Statisticians make assumptions 1 and 2, and think of themselves as trying to find the "correct" parameters of their model.
People doing applied ML typically assume they don't know 1 (although they might implicitly make some weak assumptions, like sub-Gaussian tails to avoid fat tails) and typically don't care about being able to do 2. And they don't care about their parameters; in a sense, to an ML practitioner, every parameter is a nuisance parameter.
Instead, you assume you have some reliable way of evaluating performance on the task you care about -- usually measuring performance on an unseen test set. As long as this is actually reliable, things are fine.
But you are right that in the face of a shifting distribution or an adversary crafting bad inputs, ML models can break down -- though there is actually a lot of research on ways to deal with this, which will hopefully reach industry sooner rather than later.
This is the part that often fails in practice. Think of all the benchmarks that show superhuman performance and compare that to how good those same models really aren't. Constructing a good set of holdouts to evaluate on is really hard and gets back to similar issues. In practice, doing what you're describing reliably (in a way that actually implies you should have confidence in your model once you roll it out) is rarely as simple as holding out some random bit of your dataset out and checking performance on it.
On the other hand, what you often see is people just holding out a random bunch of rows.
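A concrete version of this failure: holding out random rows when the data has groups (patients, users, sessions), so the same group leaks into both splits and test scores overstate performance on genuinely new groups. A pure-Python sketch with a hypothetical `group` field:

```python
import random

random.seed(0)
# Hypothetical dataset: 100 rows drawn from 20 groups (e.g. patients)
rows = [{"group": g, "x": g + random.random()}
        for g in range(20) for _ in range(5)]

# Random row split: the same groups show up on both sides
random.shuffle(rows)
train, test = rows[:80], rows[80:]
leaked = {r["group"] for r in train} & {r["group"] for r in test}
print(len(leaked))  # almost certainly > 0

# Group split: hold out whole groups instead
test_groups = set(range(16, 20))
train_g = [r for r in rows if r["group"] not in test_groups]
test_g = [r for r in rows if r["group"] in test_groups]
assert not ({r["group"] for r in train_g} & {r["group"] for r in test_g})
```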
At minimum you must assume your underlying process is not fat tailed. If it is, then your training/validation/test data might never be enough to make reliable predictions and your model might break constantly in prod.
BTW shifting distributions and fat tailed distributions are sort of equivalent, at least mathematically.
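One way to see what fat tails do to the hold-out strategy: draw several independent "datasets" from a thin-tailed and a fat-tailed distribution and compare how much their sample means disagree. A toy sketch with made-up parameters (Pareto tail index 1.1, so the mean exists but the variance doesn't):

```python
import random
import statistics

random.seed(0)

def pareto_sample(alpha, n):
    # Inverse-CDF sampling for a Pareto(alpha) with minimum 1:
    # x = u^(-1/alpha), u ~ Uniform(0, 1]
    return [(1.0 - random.random()) ** (-1 / alpha) for _ in range(n)]

# Sample means of 10 independent "datasets" of 10k points each
thin = [statistics.mean([random.gauss(3, 1) for _ in range(10_000)])
        for _ in range(10)]
fat = [statistics.mean(pareto_sample(1.1, 10_000)) for _ in range(10)]

print(max(thin) - min(thin))  # small spread: the mean has settled
print(max(fat) - min(fat))    # much larger spread: it hasn't
```

With the thin tail, any held-out split looks like any other; with the fat tail, your test set can look nothing like production, no matter how you split.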
Despite using similar models, the expertise required for 'doing statistics' (statistical inference) is actually very different from machine learning. Machine learning fits into the 'hacker mentality' well - try stuff out see what works. To do statistical inference effectively, you really do need to spend time learning the theory. They both require deep skills - but the skills are surprisingly different considering it's often the same underlying model.
Once you have a model, at minimum understand how to tune for the tradeoffs of different types of error and don't naively optimize for pure accuracy. At the obvious extremes, if you're trying to prevent nuclear attack, false negatives are much more costly than false positives, if you're trying to figure out whether to execute someone for murder, false positives are much more costly than false negatives. Understand the relative costs of different types of error for whatever you're trying to predict and proceed accordingly.
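A minimal sketch of tuning for asymmetric costs rather than raw accuracy, with made-up scores, labels, and cost numbers (the classifier producing these scores is assumed, not given):

```python
# Hypothetical scores and labels from some binary classifier
scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,    0,   1,   1]

COST_FN = 10.0  # missing a positive is assumed much worse...
COST_FP = 1.0   # ...than a false alarm

def expected_cost(threshold):
    cost = 0.0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 0:
            cost += COST_FN  # false negative
        elif y == 0 and pred == 1:
            cost += COST_FP  # false positive
    return cost

# Pick the threshold minimizing expected cost, not maximizing accuracy
best = min([i / 100 for i in range(101)], key=expected_cost)
print(best, expected_cost(best))  # 0.21 2.0
```

With the costs flipped (cheap misses, expensive false alarms) the chosen threshold moves up; the point is that the threshold is a business decision, not a default of 0.5.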
Typical ML methods just have a huge distribution space that can fit almost anything, from which they pick a single option. This has two downsides:
- Since your distribution space is several times too large by design, you lose the ability to say anything useful about the accuracy of your estimate, other than that it is far from the only option.
- Since you must pick a single option from your parameter space, you may miss slightly less likely explanations that could still have huge consequences, which means your models tend to end up overconfident.
With regard to exploitation, IBM research has done some interesting work in the form of an open source "Adversarial Robustness Toolbox" [0]. "The open source Adversarial Robustness Toolbox provides tools that enable developers and researchers to evaluate and defend machine learning models and applications against the adversarial threats of evasion, poisoning, extraction, and inference."
It's fascinating to think through how to design the 2nd- and 3rd-order side effects using targeted data poisoning to achieve a specific outcome. Interestingly, poisoning could force a specific outcome for a one-time gain (e.g. feed data in a way that ultimately triggers an action eliciting some gain/harm) or alter outcomes over a longer time horizon (e.g. teach the bot to behave in a socially unacceptable way).
Nonparametric methods say 'hi'.
If you know the distribution of the phenomenon under study you dont need ML, that is what probability is for.
> or make an explicit assumption and assume the risk of being wrong
No. You have the bias/variance tradeoff here. You can make an explicit assumption about your model or not.
> Using (1), you calculate how much data you need so you get an estimation error below x%
This is extremely complicated for anything except the most trivial toy examples, probably not solvable at all and definitely not the way biological intelligent systems (aka some humans) do it.
Then the New Yorker packages it up with a cartoon and a headline and subheadline like "Big Data: When will it eat our children?" or "Numbers: Do they even have souls?", and serves it up to their technophobic audience in a palatable way.
I do have a problem with her conclusion here. Are numbers really lying if it's actually an incorrect data collection method or conflicting definitions of criteria for generation of certain numbers (like the example used in the second to last paragraph)? She seems to be pointing out a more important fact, which is that people don't question underlying data, how it was collected, and the choices those data collectors made when making a data set. People tend to take data and conclusions drawn from it as objective realities, when in reality data is way more subjective.
Obviously it's a figurative metaphor, but it's pretty clearly a case of "this supposedly objective factual calculation is presenting an untruth."
It’s an anecdote about a government incentive to have doctors see patients within 48 hours causing doctors to refuse scheduling patients later than 48 hours in order to get the incentive bonus.
This is not an example of limits of data, but an example of perverse incentives.
That part is probably just the first 1/8th or so of the article (rough guess). Sounds like it was cut short for you?
This case could be said to be creating misleading data. If the doctor's offices aren't recording appointments more than 48 hours in advance, the System is losing visibility on the total number of people who want appointments. Every office will appear to be 100% efficient even though there is effectively still an invisible waiting list.
I just tried opening it in Firefox now (a couple hours later) and I see the whole article. If I switch to reader mode I do see it's truncated about halfway through, but I think that's a separate issue from what the OP was seeing.
But also without reader mode I can’t see more than the NHS anecdote.
The article focuses on governments and bureaucracies but there's no better example than "data-driven" tech companies, as we A/B test our key engagement metrics all the way to soulless products (with, of course, a little machine learning thrown in to juice the metrics).
I wrote about this before: https://somehowmanage.com/2020/08/23/data-is-not-a-substitut...
But in tech, if your goal is just to make money, soullessly following data will often get you there, to the detriment of everyone else. Clickbait headlines will get you more views. Full-page popup ads will get you more ad clicks/newsletter subscriptions. Microtransactions will get you more sales. Gambling mechanics will get you more microtransactions.
You can say it's a flawed metric, but I think in the end, most people just actually care more about making money than they do about building a good product.
Here's my thesis, curious to hear your thoughts.
At some time around 2005, when efficient persistence and computation became cheap enough that any old Fortune 500 corp could afford to endlessly collect data forever, something happened.
Before 2005, if a company needed to make a big corporate decision, there was some data involved in making the decision, but it was obviously riddled with imperfections and aggregations biases.
Before 2005, executives needed to be seasoned by experience, to develop this thing we call "Good Judgement", that allows them to make productive inferences from a paucity of data. The Corporate Hierarchy was a contest for who could make the best inferences.
Post-2005, data collection is ubiquitous. Individuals and companies realized that you don't need to pay people with experience any more; you can simply collect better data and outsource decision-making to interpretations of this data. The corporate hierarchy now is all about who can gather the "best" data, where "best" means growing the money pile by X% this quarter.
"Good Judgement" used to be expected from the CEO down to at least 1-3 levels of middle management above the front-line people. Now it appears (to me) to be mostly a feature of the C-Suite and Boards, and it's disappeared elsewhere. Long-term, high-performing companies seem to have a more diffused sense of good judgement. But these are rare. Maybe they always have been?
Anyways, as we agree, this has a tendency to lead in problematic directions. Here's my thesis on "why".
Fundamentally, any "data" is reductive of human experience. It's like a photograph that captures a picture by excluding the rest of the world.
Few people seem to understand this analogy, because they think photographs are the ultimate record of an event. Lawyers understand this analogy. With the right framing, angle, lighting (and of course, with photoshop), you can make a photograph tell any story you want.
It's the same issue with data, arguably worse since we don't have a set of standard statistics. We have no GAAP-equivalent for data science (yet?).
Our predecessors understood that data was unreliable, and compensated for this fact by selecting for "Good Judgement". The modern mega-corps demonstrate that we don't have a good understanding of this today, evidenced by religious "data-driven" doctrine, as you describe.
People will say "hey! at least some data is better than no data!", to which I'll say data is useless and even harmful without capable interpreters. In 2021, we have an abundance of data but a paucity of people capable of critically interpreting it.
I don't know if it's a worse situation than we had 20 years ago. But it's definitely a different situation, that requires a new approach. I think people are taking notice of it, so I'm hopeful.
I had a stint writing conferencing software for quite some time, and every once in a while we'd come across a customer requirement with capabilities that were, to us developers, obviously going to be misused. As a result, we did the "Thinking, Fast and Slow" pre-mortem to help surface other ways the system could be attacked (along with what we would do to prevent it and how that impacted the original feature).
If you create something, and open it to the public, and there's any way for someone to misuse it for financial incentive (especially if they can do so without consequence), it will be misused. In fact, depending on the incentive, you may find that the misuse becomes the only way that the service is used.
Say the calendar is initially empty and 1000 people want to see the doc, right now. You can fill them all into the calendar, or you can play games that solve nothing, like only filling tomorrow's schedule with 10 people, asking 990 of them to call back. That doesn't change the fact that it takes 100 days to see 1000 patients. All it does is cause unfair delays; the original 1000 can be pre-empted by newcomers who get earlier appointments since their place in line is not being maintained.
Not measuring that from the first contact that the patient made is simply dishonest.
"Call back in three days to make the appointment, so I can claim you were seen within 48 hours, and therefore collect a bonus" amounts to fraud because the transaction for obtaining that appointment has already been initiated.
I mean, they could as well just give the person the appointment in a secret, private appointment registry, and then copy the appointments from that registry into the public one in such a way that it appears most of the appointments are being made within the 48 hour window. Nothing changes, other than that bonuses are being fraudulently collected, but at least the doctor's office isn't being a dick to the patients.
Although the New Yorker piece has leaned on the bonus angle the way it was discussed publicly was that doctors weren't allowed to offer you appointments outside the 48 hour window [0].
It was a very silly interpretation of the rules, but I think GPs felt it was too rigid and therefore stuck to the letter rather than the spirit.
If you want to reduce the queuing time in a system you need to reduce the processing time (i.e. the duration of an appointment) or increase the number of servers (i.e. doctors). You can't do it by edict.
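The arithmetic behind that, as a toy sketch with hypothetical numbers: a backlog only shrinks when service capacity exceeds the arrival rate, and no booking rule changes that.

```python
# Hypothetical queue: patients arrive at a fixed daily rate
arrivals_per_day = 100
appts_per_doctor_per_day = 25

def days_to_clear(backlog, doctors):
    capacity = doctors * appts_per_doctor_per_day
    if capacity <= arrivals_per_day:
        return None  # backlog never shrinks, whatever the edict says
    return backlog / (capacity - arrivals_per_day)

print(days_to_clear(1000, 4))  # None: four doctors just keep pace
print(days_to_clear(1000, 5))  # 40.0 days once capacity exceeds arrivals
```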
Tony Blair was trying to solve a real problem that needed solving. That he opened up a different problem is something we should think of as normal, and not blame him for trying to solve the original one. The question should be how do we change the metric until the unintended consequences are ones we can live with. That will probably take more than a lifetime to work out.
Note that there will be a lot of debate. There are predicted consequences that don't happen in the real world for whatever reason. There are consequences that some feel we can live with that others will not accept. Politics is messy.
Let AI crunch the numbers, but combine it with a human who can understand the "why" of things and you can really kick butt.
A good example of what I mean can be found on Wikipedia:
His instinctive preference for offensive movement was typified by an answer Patton gave to war correspondents in a 1944 press conference. In response to a question on whether the Third Army's rapid offensive across France should be slowed to reduce the number of U.S. casualties, Patton replied, "Whenever you slow anything down, you waste human lives."[103]
https://en.wikipedia.org/wiki/George_S._Patton
Here, US general Patton is not confounding a performance metric (number of casualties) with the strategic goal (winning the war). His counterfactual statement could be: "if we slow things down, we are simply delaying future battles and increasing the total number of casualties needed to achieve victory."
I'm not surprised at Blair's decision. When we choose leaders, do we favor long-term strategic thinkers, or opportunistic pretty faces?
> Whenever you try to force the real world to do something that can be counted, unintended consequences abound. That’s the subject of two new books about data and statistics: “Counting: How We Use Numbers to Decide What Matters”, by Deborah Stone, which warns of the risks of relying too heavily on numbers, and “The Data Detective”, by Tim Harford, which shows ways of avoiding the pitfalls of a world driven by data.
Data is a powerful feedback mechanism that can enable system gamification; it can also expose it. The evil is extracting unearned value from a system through gamification not the tools employed to do so. I’m looking forward to reading both books.
Data has its limits, but the solution is usually - maybe even always - more data, not less.
Reasoning counter-factually is trivial: What would happen if I dropped this object in this place in which an object, of this kind, has never been dropped before?
Well apply relevant models, etc. and "the object falls, rolls, pivots, etc.".
This is reasoning-forward from models, rather than backwards from data. And it's the heart of anything that makes any sense.
Data is not a model and provides no model. The "statistics of mere measurement" is a dangerously utopian misunderstanding of what data is. The world does not tell you, via measurement, what it is like.
First, the Tony Blair example is not about data. It is a failure of government planning. It's wrong politics and wrong economy.
The G.D.P. example is laughable. G.D.P. is never intended to be used to compare individual cases. What kind of nonsense is this?
And the IQ example. The results are backed by decades of extensive studies. The author thinks picking a few critics can invalidate the whole field. And look! The white supremacist who gave Asians the highest IQ, what a disgrace to his own ideology.
There are many more. I feel it's a kind of tactic for producing this sort of article: just glue a bunch of stuff together, throw in some things that seem related, and bam, you've got an article.
I think the problem goes even deeper, which is a misunderstanding of the scientific method. Good discussion about this topic here: https://news.ycombinator.com/item?id=26122712
There are many wrongs that seem to have something to do with data but in fact do not.
Take socialist economic planning: it will eventually fail, and then you could say they misused data. That seems relevant, but misusing data is not the real cause of the failure at all.
Two other unintended consequences of incentives I learned in economics:
1. Increasing fuel efficiency does not reduce gas consumption. People just use their car more often.
2. Asking people to pay-per-bag for garbage pickup resulted in people dumping trash on the outskirts of town.
Edit: Did more research after a downvote. Definitely double-check things you learn in college:
1. The jury is still out: https://en.wikipedia.org/wiki/Jevons_paradox
2. Seems false https://en.wikipedia.org/wiki/Pay_as_you_throw#Diversion_eff...