If I'm understanding correctly, the new test is based on a single data point from each group, rather than an aggregate statistic (like mean). I'm no statistician, but it seems like this data would have far too much variance and noise for this to be a useful test.
The minimum performer could be someone who had a sudden personal crisis. Or who had 10 competitors suddenly pop up. Or any number of other circumstances outside their control. The minimum performer is, almost by definition, an outlier. It doesn't seem rational to suppose that an outlier is representative of the group.
I can understand that statistically this test may be more rigorous. In practice, though, I would expect it to be less rigorous, because the assumption it makes (that a single outlier is representative of the group) seems even more dubious than the assumptions required for Paul's original idea.
Of course, a real thresholding process would not be perfect, so the lower bound of the distribution of accepted candidates would not be a perfect vertical cutoff as in the examples. Just like any process that adds additional variation to the data, this would reduce the statistical power. You could accept more bias in return for lower variance in your test by taking, say, the 5th percentile instead of the sample minimum as your test statistic. (You can think of the sample minimum as the zeroth percentile.)
[1] https://en.wikipedia.org/wiki/Uniform_distribution_%28contin...
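To make that bias/variance trade-off concrete, here is a minimal sketch (numpy assumed; the distribution, sample sizes, and 1% "crisis" rate are all invented for illustration) comparing the sampling variability of the minimum against the 5th percentile:

    import numpy as np

    rng = np.random.default_rng(0)

    def stat_spread(stat, n=200, n_trials=10_000):
        # Skill scores with a fuzzy lower edge, plus a 1% chance of a
        # catastrophic outlier (the "sudden personal crisis" case).
        samples = rng.normal(0.0, 1.0, size=(n_trials, n))
        crisis = rng.random((n_trials, n)) < 0.01
        samples[crisis] -= 10.0
        return stat(samples, axis=1).std()

    # The minimum gets dragged around by the outliers; the 5th
    # percentile (a slightly biased stand-in for the lower edge)
    # should be far more stable across trials.
    print("minimum       :", stat_spread(np.min))
    print("5th percentile:", stat_spread(
        lambda x, axis: np.percentile(x, 5, axis=axis)))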
(Having said that, the mean is not difficult to manipulate)
There's an important game-theory issue with KPIs:
- if everyone is behaving perfectly, then why bother checking the stat?
- if someone is behaving maliciously, and wants to game the stat, will they be forced to actually improve the situation?
A malicious actor barely has to break stride here.
Both the mean and the minimum are scalar-valued functions of the whole sample.
Their basic premise is wrong, if bias continues to exist after the selection event in question.
For example, if YC had (hypothetically) a real bias against black or women entrepreneurs, it is almost certain that future funding rounds, as well as all possible exit scenarios, would exhibit much the same bias.
In which case, the future "performance" of those candidates would be poor, and by PG's definition the selection would look unbiased, even though the only meaningful conclusion is that YC is no more biased than the subsequent performance evaluations.
The test may not be sufficient to prove that you have no bias, but it may be good enough to prove that you do. When it does indicate bias, it seems likely to be correct.
To put it another way, if it is 1948 and the only three black people in Major League Baseball are all superstars, then the distribution of baseball skill among black players is extremely unbalanced or there is a lot of bias keeping the average and moderately-better-than-average black players out.
Of course favourable treatment can't make people into superstar startup founders or baseball players (and I'm sure any special treatment afforded to black baseball players in the 1940s was the complete opposite of favourable). But more generally it can make an organisation with fair selection processes look like it sets a higher bar for $MINORITY, because it addresses low numbers by being very keen to promote and very reluctant to fire/deselect members of said minority, so these kinds of studies still have to be considered with care.
(Of course, even if an organisation proactively treats a minority group favourably after selection, that doesn't mean conscious or unconscious biases don't exist in the selection process.)
It seems entirely possible that if your program is "narrow" enough you could exhaust the pool of more successful candidates from a minority group. Of course this would be a lot more plausible if you're biased.
It's also worth noting that an unbiased prediction is not necessarily "fair" in the colloquial sense. For example, I've seen data suggesting that an unbiased prediction of college outcomes would actually penalize black applicants, since black applicants underperform relative to their SATs and college grades. (The person who had this data was very careful not to draw this conclusion in the publication - career limiting move, as they say.)
So a fair selection process which looks only at high school grades/SAT/etc might actually be biased as a statistical decision procedure.
It does mean that maybe monetary earnings or anything else sensitive to later-round bias are not the thing to use to measure candidate performance, at least if you're doing this for the social utility.
Of course, if you're only in it to make money, and you're only in charge of the first round... then you really do want just an unbiased evaluation of the (biased) future earnings prospects. So in that case using raw earnings would be correct...
However, the test may still be useful to help confirm bias. If outperformance is observed, you can infer that one of three things is true:
1) there is bias at initial selection but not after (or at least reduced bias)
2) members of the outperforming group are simply stronger performers (different but still interesting)
3) there is no bias at selection but there are affirmative action effects after the initial selection (not obvious why this would be the case)
I would argue that social responsibility requires YC to take the hit, but bias is the wrong word if they don't.
The importance of relationships to the funding round also plays a role. If you get as far as an IPO, it seems unlikely that the stock-buying public is going to stay away because the founders are female/black/etc.
"But Uber skews the results!" So what? You don't get to just throw out data points you don't like without good reason.
If your "test" is that sensitive to individual outliers, then perhaps it isn't really a good test after all.
The right way to deal with outliers is to use a method that acknowledges their existence, not to ignore them. For example, if outliers destroy your OLS linear regression, it's because your error is not normal. That means you need to do Bayesian linear regression with a non-normal error term, not just throw them away.
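As a concrete (if simplified) illustration, here is one way such a robust fit could look, swapping the full Bayesian treatment for a Student-t maximum-likelihood fit with scipy; the data are synthetic and the df=3 choice is arbitrary:

    import numpy as np
    from scipy import optimize, stats

    rng = np.random.default_rng(1)

    # Synthetic data: a clean linear trend plus a few gross outliers.
    x = np.linspace(0, 10, 50)
    y = 2.0 * x + 1.0 + rng.normal(0, 1, 50)
    y[::10] += 30  # outliers that would wreck an OLS fit

    def neg_log_lik(params):
        # Line with Student-t errors; df=3 is an arbitrary
        # heavy-tailed choice for the illustration.
        slope, intercept, log_scale = params
        resid = y - (slope * x + intercept)
        return -stats.t.logpdf(resid, df=3, scale=np.exp(log_scale)).sum()

    fit = optimize.minimize(neg_log_lik, x0=[1.0, 0.0, 0.0])
    print(fit.x[:2])  # slope and intercept should stay near (2, 1)

The heavy-tailed error term downweights the outliers automatically instead of requiring you to delete them by hand.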
In many labs, your data is looked at very suspiciously if you don't have any outliers.
An outlier may not be thrown out without good reason. Preferably an a priori reason before you do the analysis.
This doesn't even consider the problem of data dredging, which First Round Capital engaged in.
I wonder if all YC companies would be enough data points to learn something useful. Or maybe grab a large swath of VC-funded startups including First Round's investments and many other top firms.
Most of the raw stats in Chris's post were above my head, but I'd love to see this applied to a larger data set of fundings.
More fundamentally even if you could get enough data, the data is just too messy to analyse and draw any valid conclusions.
A test in this general direction (but which handles noise) is much better suited for answering questions like "are colleges biased against Asians". In that case you have a pretty clear output (college GPA) which very rarely reaches zero.
I find the flaw in the idea more fundamental: it talks only about the 'selection process', but in fact bias that affects success or failure can enter at other points.
Actually, choosing an identifiable group at random would be both socially and statistically unwise. Following a Pareto distribution, there are vastly more minority/extreme-minority distinguishable groups of people than there are majority/significant-minority ones. This means, firstly, that any group randomly selected with equal weighting across all groups has a high probability of already being subject to actual discrimination, mooting any social benefit of choosing a group at random; and secondly, that the generalizable qualities of the chosen group would therefore have a distribution with very little deviation (if I'm using my terms correctly) and would be highly predictable, obviating any possible statistical benefit of doing so.
Paul Graham wrote an article about an idea. The idea is generally correct - bias in a decision process will be visible in post-decision distributions, due to the existence of marginal candidates in one group but not the other. But the math was wrong.

That's ok! Very few ideas are perfect when they are first developed.
I'm not good enough at statistics to check that OP's math is sound, but this is the mindset of a scientist. OP reasons rigorously, finds a way to salvage the core insight and improves on it. As readers can see, it took quite a lot of work and prior knowledge to do.
If I were pg I would consider putting a link to this post on both the disagree.html and bias.html as a note for posterity.
Only examining the sample without looking at the population of applicants has its limits. Especially as multiple interviews become the norm, filters that don't affect the distribution of outcomes will be missed. For example, the person screening resumes might weed out anyone with an ethnic-sounding name. A different person, who is not biased, interviews the candidates. The quality of the candidates accepted will be the same, but the number of minority applicants will be smaller than it should be.
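A quick simulation shows why outcome-only tests miss this; all the numbers here (quality scores, the 50% name screen, the hiring bar) are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(2)

    # Identical quality distributions for both applicant pools.
    majority = rng.normal(0, 1, 10_000)
    minority = rng.normal(0, 1, 2_000)

    # Biased screener: drops half the minority resumes at random,
    # independently of quality, before the unbiased interview stage.
    minority = minority[rng.random(minority.size) < 0.5]

    bar = 1.0  # the unbiased interviewer applies one bar to everyone
    hired_maj = majority[majority > bar]
    hired_min = minority[minority > bar]

    # Outcome-based tests see identical quality distributions...
    print(hired_maj.mean(), hired_min.mean())
    # ...but the headcount gives the bias away: roughly 1/10 instead
    # of the 1/5 the application pools would imply.
    print(hired_maj.size, hired_min.size)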
Measuring outcomes allows for external biases to distort the results. Start with a company that is biased against women, so that the average female founder is better than the average male. However, that same level of sexism exists in the market, such that the company's performance is hampered due to prejudice against the founder. The VC's bias would be hidden by the counter-bias in the market.
Hi Chris ---
In the earlier thread, it seemed like some people were reaching different conclusions because they were using different definitions of "bias". I think my working definition would be something like "there existed in the actual applicant pool a subset of unfunded female founders who should have been statistically expected (given the information available to the VC's at the time of decision) to outperform an equal-sized subset of male founders who did in fact receive funding".
Alternatively (and I don't think equivalently?) one could reasonably take bias to mean "Given their prejudices, if the same VC's had been blinded to the sex of the applicants, they would have made funding choices resulting in higher total returns than the sex-aware choices they actually made." I'm sure there are many other ways of defining "bias". Could you define what would need to be true for your test to show that "the VC process is biased against female founders"?
This particular test is terrible for VC since the min return in VC will always be zero. But if you build a noise-sensitive version for something like college admissions, what needs to be true is a) bias manifests as raising/lowering the bar for one group relative to another, and b) both groups have a significant number of members near the cutoff.
As an example of the type of bias this test would detect, consider U-Michigan's point system [1]. An extra +1.0 GPA was added for black applicants; i.e., an Asian applicant with a 3.9 GPA and a black applicant with a 2.9 GPA were scored as equivalent. This would result in Asian applicants having a higher minimum GPA than black applicants.
[1] They replaced the point system with vague human heuristics when the supreme court said point systems can't be racist, but vague heuristics can.
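To sketch how the test would flag such a scheme (hypothetical GPA data, numpy assumed, and a 3.5 admission bar made up for the example):

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical applicant pools with identical GPA distributions.
    gpa_asian = rng.uniform(2.0, 4.0, 5_000)
    gpa_black = rng.uniform(2.0, 4.0, 5_000)

    # Point system: admit anyone whose GPA plus group boost clears 3.5.
    admitted_asian = gpa_asian[gpa_asian + 0.0 >= 3.5]
    admitted_black = gpa_black[gpa_black + 1.0 >= 3.5]

    # The boosted group's minimum admitted GPA sits a full point lower,
    # which is exactly what the minimum-based test would pick up.
    print(admitted_asian.min(), admitted_black.min())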
1. Pool and randomly label the data from A and B
2. Sample with replacement and form two partitions of the same cardinality as the original A and B groups
3. Compute the differences in mean
4. Rinse and repeat millions of times to form a distribution of mean differences
5. Check if the observed difference in means (from the true A/B labels) is statistically significant relative to the distribution found in (4)
This has some problems with fat tailed distributions but tends to work great otherwise. It's so simple that it avoids a host of pitfalls that can arise with other resampling schemes (what's being proposed is a type of resampling), and I love that it makes basically zero assumptions on the underlying data.
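In code, the recipe reads something like this minimal sketch (numpy assumed; a and b are arrays holding the two groups' observed values):

    import numpy as np

    def resampled_mean_diff_test(a, b, n_iter=10_000, seed=0):
        # Steps 1-5 above: pool the data, resample with replacement
        # into two groups of the original sizes, and compare the
        # observed mean difference against the resampling
        # distribution (two-sided).
        rng = np.random.default_rng(seed)
        pooled = np.concatenate([a, b])
        observed = a.mean() - b.mean()
        fake_a = rng.choice(pooled, size=(n_iter, len(a)))
        fake_b = rng.choice(pooled, size=(n_iter, len(b)))
        diffs = fake_a.mean(axis=1) - fake_b.mean(axis=1)
        return (np.abs(diffs) >= abs(observed)).mean()  # approx p-value

With millions of iterations you'd want to batch the resampling to keep memory bounded, but the logic is the same.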
https://news.ycombinator.com/item?id=10484602
That post shows that what PG is doing is a first-cut effort at a statistical hypothesis test, but one that is vague about its assumptions and gives no information on the false alarm rate.
In particular, in my post, I get to compare sample averages without making a distribution assumption. Indeed, I make no distribution assumptions at all.
Yes, distributions exist, but that does not mean that we have to consider their details in all applications!
Come on guys, this is distribution-free statistical hypothesis testing, and we should be able to use that.
Is this really such an unreasonable assumption, given that pg restricts the applicability of his bias test to groups of equal ability distribution and that we can assume that both groups have roughly the same amount of capital at their disposal?
The question is if the "equal ability" qualification is sufficient to make sure the distributions are roughly similar. But that is not a mathematical issue.
The main problem I was having is that you are assuming our observation variable is the latent skill or potential value variable (which you're calling x here). However, the article by PG was talking solely about the average of returns (let's call it y).
So the reason I was confused is that, assuming that the outcome of a startup depends only on x, we are really observing y with density p(y) = \int_0^1 g(y|x) h(x) dx, where h(x) encodes your cut-off criterion on x, g(y|x) is some unknown payoff distribution for a given skill level x, and I'm assuming x is in [0,1] without loss of generality. So in essence, the real problem here, even if you could see all of the individual returns for a given portfolio, is that you have to solve a very, very difficult deconvolution problem. And I'm pretty sure it's non-identifiable without some other information or additional parametric assumptions.
Thinking out loud a bit, let's assume that y is actually log(return), where a return of 1 is breaking even and 0 is losing everything. Since log(0) is undefined, most startups return 0, and very few exit for less than 1, I would think we could model this as a point-inflated normal distribution: p(y) = c * \delta_0 + (1-c) * N(\mu, \sigma^2). Given this, we could then model our latent parameters (c, \mu, \sigma) as being functions of x. Since the model is separable, we can even just look at the zeros and non-zeros in isolation. Then we can come up with a test from there, but I'm not really sure what that test would be at this point. Anyway, that's a completely different line of thinking, but it seems much more tractable in practice.
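A minimal sketch of that separable fit (zero codes a total loss, per the convention above; the portfolio numbers are invented):

    import numpy as np

    def fit_point_inflated_normal(y):
        # MLE for p(y) = c*delta_0 + (1-c)*N(mu, sigma^2).
        # The model is separable, so the zeros (total losses) and
        # the non-zeros can be handled independently.
        y = np.asarray(y, dtype=float)
        zeros = (y == 0)
        c = zeros.mean()            # point mass at zero
        survivors = y[~zeros]       # y = log(return) for these
        return c, survivors.mean(), survivors.std()

    # Hypothetical portfolio: 70% total losses, survivors ~ N(0.5, 1).
    rng = np.random.default_rng(4)
    y = np.where(rng.random(1000) < 0.7, 0.0, rng.normal(0.5, 1.0, 1000))
    print(fit_point_inflated_normal(y))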
To be concrete, assuming "performance" is measured as return on investment, min(performance) will always go to -100% (i.e., bankruptcy) with a large enough sample size.
Worth doing, if only to discover a few facts about ourselves, even if they are uncomfortable.
The sample population was chosen by a specific type of group of partners. There are no female technical partners in the group. As a female technical founder, I am not interested in building a 'tea-making bot', a sandwich-making bot, or selling organic condoms. IMHO we have different views on problems and how to solve them. Without a female technical founder as a partner, YC will be perceived as biased.
The algorithm for selecting promising candidates will change once there is more variety among the partners.
Any credible statistical test for bias should be framed in the language of causal inference, e.g., as described by Judea Pearl: http://ftp.cs.ucla.edu/pub/stat_ser/r350.pdf
Surely the distribution of minimums would then be the same between all skin colours, but you end up employing half the number of black applicants that you should be.
TIL about https://www.mathjax.org/.
> So rather than comparing mean performance, we'll compare minimum performance.
1. This is a useless metric for startup investors to use, since (almost surely) the minimum performance in every group of reasonable size will be 0 (the startup went out of business)... and this will be true even if the investor is biased.
2. The maximum statistic was rightly avoided here, because for power-law distributed values (which startup returns are), you'd need to know the population sizes to estimate whether the distribution of {A} was different from the distribution of {B}.
If you're willing to take on faith that both A and B have the same distribution, then the test is easy: is the acceptance rate for As significantly different than the acceptance rate for Bs? If you've invested in more than, say, 100 startups, you have a big enough sample to check this... this requires knowing the size of application pools, and who was accepted though.
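For example, under that shared-distribution assumption the check reduces to comparing two proportions; a sketch with Fisher's exact test from scipy (the counts are invented):

    from scipy.stats import fisher_exact

    # Hypothetical pools: (accepted, rejected) counts per group.
    table = [[40, 960],    # group A: 4% acceptance rate
             [15, 985]]    # group B: 1.5% acceptance rate

    odds_ratio, p_value = fisher_exact(table)
    print(p_value)  # small p => the acceptance rates likely differ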
3. I believe that in general it's not possible to determine a bias from the kind of aggregate statistic pg is discussing without at least some knowledge of the sample space.
For example, using OP's method, you will find that almost every selection process in the world is biased in your favor if you divide the world as {you} vs {non-yous} (you're doing significantly worse than the best non-you). And you will find that almost every selection process in the world is heavily biased against you if you use the minimum statistic (you're doing significantly better than the worst non-you). This is also true for smallish groups (e.g. {your friends} vs {not your friends}).
The same is true for PG's method -- it's highly unlikely that {you} fall exactly at the average value of {non-yous}, or that {your friends} fall exactly at the average of {not your friends}.
4. I believe that the math here is distracting from the core question.
Core question 1: Do men and women on average make the same choices?
If you believe that, then determining bias is easy: we already know who the investor funded. Is the number of men the investor funded different from the number of women? Yes? Then the investor is biased. This is much more direct than the kind of forensic accounting pg is proposing.
I suspect that pg didn't propose this test because pg doesn't believe that men and women on average make the same choices. He knows, for example, that the number of female applicants to YC is different than the number of male applicants (a gendered difference in behavior). Google "men and women career choices" or similar if you're interested in learning more, or better yet, read some first person accounts from FTM men about the cognitive effects of taking testosterone.
Since it's clear that there's a gendered difference before applying to YC, it seems very difficult to justify an assumption there would be no gendered difference in behavior after applying to YC (or any other investment firm, FirstRound in this case). Given that, the question we were asking becomes much more confusing... a simple bias towards ideas and plans you understand/agree with/are excited by is a gender bias in as much as your gender caused you to like the idea or plan. Removing that bias (supporting plans you understand less, agree with less, or are less excited by) seems like an obviously bad idea.
Returning to the problem: if we accept that this sort of "makes sense to me bias" can be observed when looking for gender biases, we are left in a really hard place. That bias seems to be both a good thing, and confounds the entire analysis. Unless you've controlled for the "makes sense" bias, such analysis will apply pressure for investors to waste money from their perspective. This seems obviously bad.
Core question 2: which biases do we want investors to have?
Investors who knowingly pass up good opportunities on the basis of the founder's gender are punishing themselves worse than any company they pass over -- their competitors who aren't gender biased will get higher returns, and so will have more money to invest in the future. This is to say that gender biases for startup investing are self-correcting. The investors already have their self-interest maximally aligned with not being sexist.
I don't pretend to know which companies are worth investing in more than any other smart technologist. I also don't pretend to know to what extent gender differences cause differences in returns, so my answer is: investors should be as biased (selective about investing) as they see fit. Startups are positive sum for society, and anyone who can find a way to fund more of them profitably is making the world better.
In large part, this is because I find it very unlikely that any modern investor is knowingly sexist -- I think it's much more likely that the sort of "makes sense" bias I discuss above is at play.
Of course, this is an early thought that came from first principles, so counter-arguments are solicited. Perhaps there is something deeply evil about passing over startups you don't feel comfortable investing in (assuming that comfort has any correlation with founder gender), or perhaps there's some easy fix which makes previously dicey-looking ideas from {other-gender} founders look like obviously good investments. (If you know what that idea is, I'd love to know it too).
5. Thanks to both pg and Chris for the fun math/philosophy problem. :)
>Group A comes from a population where the chance of graduation is distributed uniformly from 0% to 100% and group B is from one where the chance is distributed uniformly from 10% to 90%
>The mean of group B is not lower because of bias (which would be reflected near x=80), but because the very best members of group B are simply not as good as the very best members of group A.
Yes, if we can assume some a-priori knowledge about certain "groups" of people, then we can make a more "informed" decision. That's pretty much the definition of bias, isn't it? Paul Graham's point, as I understood it, was that those assumptions are often invalid. Therefore, bias could cause the market to under value someone or some company. Your counterpoint seems to be, "let's suppose those biases are legitimate."
>Unfortunately, using the mean as a test statistic is flawed - it only works when the pre-selection distribution of A and B is identical, at least beyond C
His argument is based on the proposition that different sexes/races have different market-value profiles. He needs to demonstrate why that is the case before proceeding to the heavy math.