I work on a lot of algorithms like this (well, not for images). Does anyone know whether such algorithms have a name? I've been calling it "heuristics" and I think it falls under "AI".
Every single photo was of a cat.
I have to say I was humbled by the amount of human and computing power that had gone into developing this system over the years, that could achieve such a complicated, impressive technical feat, without requiring any effort or money on my part, and yet also be 100% wrong.
This really is quite impressive. It's rare for humans to do worse than random guessing on tasks, and they almost never do much worse. There's something almost charming about the ability of AI to put real effort into actively avoiding correct answers.
Calling it "feature engineering" implies it's still being fed into some sort of trained classifier to make the final decision, though.
What you're describing of your own work might better fall under the broad category of an "expert system".
https://en.wikipedia.org/wiki/Bag-of-words_model_in_computer...
Couldn't they have retrained the system with a 50/50 mix of male and female resumes? Or restricted the algorithm to sorting only male resumes? Or maybe resumes don't actually correlate at all with success at Amazon...
1. The AI system accurately predicted employee success across both genders
AND
2. The AI system predicted that women would do worse than men
That's politically embarrassing and something that you can't necessarily 'fix' by improving the system. (see: all the 'will this person commit a crime if let out on parole' systems that end up accurately discriminating based on race)
This isn't to say that women are worse engineers than men, or anything of that sort - only that the applicant pool to Amazon was skewed, or women were treated worse in the workplace and thus performed worse, or a dozen other possible causes. (And only in this hypothetical scenario! I have no inside info from Amazon!)
Assume that the ability curves of male and female applicants are identical; that the majority of applicants are male; and that Amazon wants to hire more women than would be expected given the proportion of applicants who are female.
A natural way of accomplishing this goal is to give extra points to female applicants [0].
Due to selection bias, the ability curve of women within the population of Amazon engineers would skew lower than that of men within the population of Amazon engineers.
This is a special case of a more general phenomenon. If you have a signal S that is positively correlated with a desired trait in the general population, and you over-select for S, you will find that S is negatively correlated with that trait within your selected population.
[0]. All proposals I have seen amount to either a good approximation of this or changing the applicant pool. And, by assumption, the latter is excluded.
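A quick simulation makes the effect concrete (all numbers here are made up; "ability" is the desired trait and S is a noisy signal of it, so the two are positively correlated in the full applicant pool):

```python
import random

random.seed(0)

# Made-up model: ability is the trait we want, S is ability plus noise.
applicants = []
for _ in range(100_000):
    ability = random.gauss(0, 1)
    s = ability + random.gauss(0, 1)
    applicants.append((ability, s))

def corr(pairs):
    # Plain Pearson correlation over a list of (x, y) pairs.
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / n
    vy = sum((y - my) ** 2 for _, y in pairs) / n
    return cov / (vx * vy) ** 0.5

# Hire on a combined bar of ability plus signal, i.e. over-select on S.
hired = [(a, s) for a, s in applicants if a + s > 4.0]

print(corr(applicants))  # clearly positive in the full pool
print(corr(hired))       # far weaker among hires, and can even flip negative
```

Within the hired population, knowing someone cleared the bar means a high S partially "explains away" ability, which is exactly the inversion the parent describes.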
> Gender bias was not the only issue. Problems with the data that underpinned the models’ judgments meant that unqualified candidates were often recommended for all manner of jobs, the people said. With the technology returning results almost at random, Amazon shut down the project, they said.
Apparently the recommendation system really did create gender bias, neither inherited from real differences nor from replicated human biases. (It looks like an issue with mismatched training data and task.) But that initial bias was found and corrected (2015) more than a year before the project was cancelled (2017) for providing "random" results. I think this is the most extreme case of algorithmic bias I've ever seen, but also the least commonly relevant; Amazon appears to have built a model which contained almost no rules except sexism, and scrapped it for not knowing anything worthwhile.
https://www.reuters.com/article/us-amazon-com-jobs-automatio...
If it isn't acceptable to use an AI to create biased outcomes, how is it acceptable to use people to create the same outcomes? AI decision making can be examined and tuned in ways that human decision making cannot.
The parole software was NOT being fed data for "will this person commit another crime". It was being fed data for, "will this person be a suspect for another crime".
The significant difference is that selective enforcement biases the data that it was trained on. Said selective enforcement has multiple causes, including the fact that heavier patrolling in black neighborhoods makes catching crimes more likely.
The size of the selective enforcement bias shows in a number of ways. For example consider drugs. In surveys, the usage of illegal drugs is the same in blacks and whites. And yet 6 times as many blacks are arrested for using illegal drugs as whites.
Humans are pretty happy to create nonsensical results if it fits their goals... especially if it benefits them. I wonder whether, with AI, we do that to the point that it's somewhat irrelevant.
To some extent, you're bringing in your human bias to prefer human biases when you make that statement. We humans have a hierarchy of important attributes, and for various reasons believe race and gender are more important than eye color or height. But the machine learning algorithm just gets a multidimensional point in hyperspace. It doesn't, a priori, "know" that it needs to do a "per capita" adjustment based on FIELD_1 any more than it knows it needs to do a per capita adjustment on FIELD_2. And you can't "adjust" on all the fields because that'll just cancel out.
We are also in the weird position of wanting the machine to do adjustments based on FIELD_1, but without us having to actually admit to ourselves that we're doing it. From a technical perspective, probably the best answer is to do a straight-up training based on the data, then have a cleanly-separated after-the-fact cleanup process to perform whatever social adjustments it is we want on the outcome. But nobody is willing to admit that's what we want, and to put those adjustments down on paper in the form of code, because the instant they're concrete, pretty much everybody is going to decide they're wrong, and no two people are going to agree on the manner in which they are wrong, and an epic, national-front-page-news shitstorm will ensue. So here we are, trying to make adjustments without making adjustments, or, alternatively, trying to make adjustments in a place where we can blame the AI rather than humans.
(The ironic thing is that because we can't admit what we're trying to do, we're going to end up doing a really poor job of it. Tools will be applied haphazardly, the results can't be measured except very grossly at the very end of the process, and the goals won't be obtained and the system is always going to be quirky and weird. If we could clearly declare what it is we actually wanted, it would be fairly easy to get it from the AIs.)
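For what it's worth, that "cleanly-separated after-the-fact cleanup" could be as simple as a second scoring pass. Every name and number below is invented; it's just a sketch of the separation being described:

```python
# Hypothetical sketch of "train straight up, then adjust in a separate,
# documented step".
def raw_score(candidate):
    # Stand-in for the trained model's unadjusted output.
    return candidate["model_score"]

def adjusted_score(candidate, adjustments):
    # The social adjustment lives in one explicit, auditable table
    # instead of being smuggled into training.
    return raw_score(candidate) + adjustments.get(candidate["group"], 0.0)

candidates = [
    {"name": "A", "group": "overrepresented", "model_score": 0.74},
    {"name": "B", "group": "underrepresented", "model_score": 0.70},
]

# This table is the concrete artifact nobody wants to write down,
# because once it's on paper everyone can argue with it.
adjustments = {"underrepresented": 0.05}

ranked = sorted(candidates, key=lambda c: adjusted_score(c, adjustments),
                reverse=True)
print([c["name"] for c in ranked])  # B edges out A after the explicit bump
```

The point isn't the five-line implementation; it's that the adjustment table is measurable, versionable, and arguable-about, which is precisely what makes it politically radioactive.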
Going by the details of the Reuters story and several others, it appears that what actually happened was a training/task mismatch. Amazon wanted an algorithm to do resume discovery, which recruiters would run and get quality predictions as they viewed resumes. But they trained it on resume results, giving it past resumes which had been submitted to Amazon and telling it to seek similar resumes. None of the stories make it clear if there even was negative training data; it looks like the tool was simply told to compute degree-of-similarity to past inputs, and possibly told to prioritize resumes which were ultimately hired.
As a result, the tool was trying to convert a relatively gender-neutral pool (resumes found online) to a skewed one (Amazon applicant resumes), and did so by weighting gendered terms. It also seems to have underweighted technical terms, failing to appreciate them as mandatory or strictly position-specific.
The developers were sufficiently aware of that to catch and correct the known gender biases (e.g. devaluing women's colleges or the literal word "women's"), but were scared there were other uncaught biases. And the results were apparently terrible all around, so the tool was scrapped. Which is pretty much what you'd expect from something trained on exclusively positive, sample-biased examples. The story has been seriously distorted, but the real plan also seems terrible...
The typical AI system doesn't work on the basis of selecting candidates entirely at random, pro rata, in order to meet a quota. It works on the basis of criteria for success. One thing it might learn (unfortunately) is that most posts at the company are filled by men.
Using the blog's skin cancer example, couldn't the labelled images be augmented by altering the skin tones and adding these new examples to the training set?
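Something like this toy sketch, perhaps; the function names and deltas are mine, and a real pipeline would work in a proper color space rather than uniformly shifting RGB values:

```python
# Toy sketch of the augmentation idea for labelled skin-lesion images.
def shift_tone(image, delta):
    # 'image' is a list of (r, g, b) pixels; delta lightens or darkens.
    clamp = lambda v: max(0, min(255, v + delta))
    return [(clamp(r), clamp(g), clamp(b)) for (r, g, b) in image]

def augment(dataset, deltas=(-60, -30, 30, 60)):
    # Each labelled image yields tone-shifted copies with the same label.
    out = []
    for image, label in dataset:
        out.append((image, label))
        out.extend((shift_tone(image, d), label) for d in deltas)
    return out

dataset = [([(200, 160, 140)] * 4, "benign")]
print(len(augment(dataset)))  # 1 original + 4 shifted copies = 5
```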
It seems to me that some of the anomalous results discussed in the article are actually the result of poor model design or poor data pre-processing choices. We can't just throw anything at any ol' machine learning model and expect it to be magic.
As far as I can tell from later stories (e.g. 1, 2), what Amazon actually did was build a tool to show recruiters 'quality' predictions for all resumes, for instance as they scrolled LinkedIn. But they trained it on resumes submitted to Amazon for various positions, possibly also adding weight to resumes which produced hires.
In which case the problem is painfully obvious; the system effectively had no negative training data, and its positive examples (submitted resumes) didn't actually match the desired output (qualified resumes). It was computing degree of similarity between a gender-neutral-ish pool (resumes posted online) and a gender-skewed pool (resumes submitted to Amazon), and tried to make that conversion with whatever data was available - like devaluing resumes that mentioned women's colleges. (This wasn't just a proxy-variable thing, the model essentially learned to weight on gender.) Amazon's team apparently caught this issue and did the usual things like blinding on those words. But they were scared of uncaught factors; reading between the lines, they were unable to "detrain" biases like neural nets do because their dataset and task didn't match.
Ultimately, the tool was apparently scrapped because it made selections "almost at random". Which, again, isn't exactly surprising in light of the absolutely bonkers choice of training examples.
[1] https://www.aclu.org/blog/womens-rights/womens-rights-workpl...
[2] https://www.ml.cmu.edu/news/news-archive/2018/october/amazon...
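To illustrate the failure mode: a positive-only similarity scorer rewards any term common in the training pool, job-relevant or not. This is a toy bag-of-words sketch with invented resumes, not Amazon's actual model:

```python
from collections import Counter
from math import sqrt

# Toy positive-only similarity scorer over a bag-of-words centroid.
def centroid(docs):
    total = Counter()
    for doc in docs:
        total.update(doc.lower().split())
    return total

def score(resume, center):
    # Cosine similarity between a resume and the pool centroid.
    words = Counter(resume.lower().split())
    dot = sum(count * center[w] for w, count in words.items())
    norm = (sqrt(sum(v * v for v in words.values()))
            * sqrt(sum(v * v for v in center.values())))
    return dot / norm if norm else 0.0

# A skewed "past applicants" pool: terms common in it, relevant or not,
# become part of the candidate-ness score.
past_applicants = [
    "java developer captain mens soccer team",
    "java developer mens chess club",
    "python developer mens rugby",
]
center = centroid(past_applicants)

print(score("java developer mens chess club", center))
print(score("java developer womens chess club", center))  # lower, same skills
```

With no negative examples, there is nothing to teach the model that the gendered terms are irrelevant; they are just more "similarity" to exploit.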
I swear, when someone starts building autonomous killer robots, the first set of concerned articles will probably be asking whether robots were properly trained to target all genders and races with equal accuracy. This is not a sensible way to approach AI ethics.
>It was recently reported that Amazon had tried building a machine learning system to screen resumés for recruitment. Since Amazon’s current employee base skews male, the examples of ‘successful hires’ also, mechanistically, skewed male and so, therefore, did this system’s selection of resumés.
There is nothing "mechanistic" about this. It depends on how you select sample resumes and how you split them between "good" and "bad" labels.
I worked on a similar thing as an "encouraged" side project at a certain company. Except I realized from day 1 that using AI on resumes is a bad idea and aimed to show this with data. My model aimed to detect people who would quit or get fired within the first 6 months (with the intent of lowering them in priority for interviews, supposedly). It miraculously achieved 85% accuracy... by figuring out how to detect summer interns.
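To make that failure mode concrete, here's a toy reconstruction (invented records and proportions, not the real data):

```python
# A model asked "will this person leave within 6 months?" can look accurate
# simply by spotting summer interns, whose fixed-term stints all end early.
records = (
    [{"title": "summer intern", "left_within_6_months": True}] * 6
    + [{"title": "software engineer", "left_within_6_months": False}] * 12
    + [{"title": "software engineer", "left_within_6_months": True}] * 2
)

def predicts_leaving(record):
    # The proxy rule the model effectively discovered.
    return "intern" in record["title"]

correct = sum(predicts_leaving(r) == r["left_within_6_months"] for r in records)
print(correct / len(records))  # high "accuracy" with zero insight into attrition
```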
Framing this problem as "bias", and especially hyper-focusing everyone's attention on the diversity aspect of it, is extremely irresponsible. (I'm not saying that's what the author is doing, but that's definitely what's being done at large.) Fundamentally, there are significant higher-level problems with using statistical ML models for things like hiring or crime prediction.
More topically, you're quite right to object to that Amazon reference. As far as I can tell, the real story is even worse than mislabeling. Amazon devs wanted a system to spot candidates in resume banks, so they trained it to recognize resumes similar to the ones submitted to Amazon in the past. The entire dataset was 'positive', and output degrees of similarity instead of classifications. Amazon applicants are mostly male while the pool was presumably 50/50, so that was learned as an element of "Amazon-candidate-ness".
That's also an interesting story, but from the first publication (in Reuters) it's been framed as an uneven base rate 'inevitably/predictably/mechanistically' producing a biased result. Which is not only untrue but downright backwards, since it implies that the rate in the general data is what matters, rather than the relative rate between samples and positive classifications. It's yet another variant of the mammogram base rates question, and I wish people would stop trying to reinforce the incorrect answer to that.
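For anyone who hasn't seen the mammogram question, the arithmetic (with the standard textbook illustrative numbers, not figures from this story) goes:

```python
# Classic base-rate setup: 1% of women screened have cancer, the test
# catches 80% of true cases, and false-alarms on 9.6% of healthy patients.
prevalence = 0.01
sensitivity = 0.80
false_positive_rate = 0.096

# P(positive) over both groups, then Bayes' rule for P(cancer | positive).
p_positive = prevalence * sensitivity + (1 - prevalence) * false_positive_rate
p_cancer_given_positive = prevalence * sensitivity / p_positive

print(round(p_cancer_given_positive, 3))  # 0.078 - under 8%, despite an "80% accurate" test
```

The answer is dominated by the relative rates between the groups, not by the test's headline accuracy, which is exactly the distinction the "inevitably biased" framing gets backwards.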
Post your bank! Let's be like Magnus Carlsen and occasionally ask ourselves, "What would DeepMind do?"
Except that's exactly what it is. Much as your model was biased against interns.
> and especially hyper-focusing everyone's attention on the diversity aspect of it is extremely irresponsible.
Why? Pointing out a specific and concrete harm that badly designed ML models cause is irresponsible? Just because the same kind of methodological flaw can cause other harms, it's irresponsible to use a motivating example?
In my opinion, yes, if it leads most readers to misjudge some fundamental properties of the problem as a whole. Again, I'm not saying this article is guilty, but most are.
Using the term 'bias' carries certain political motivations. It's not so much that the term is technically untrue as that it is non-neutral. For instance, here are some definitions of 'bias' I just grabbed from American Heritage:
"A preference or an inclination, especially one that inhibits impartial judgment."
"An unfair act or policy stemming from prejudice."
"A statistical sampling or testing error caused by systematically favoring some outcomes over others."
The ML model does not have a preference, inclination, or prejudice relating to interns, except insofar as we anthropomorphize it to have them. What does using a word suggesting that add?
A more neutral account of what's going on is along the lines: It's easy to accidentally train ML models so that they will make systematic errors. (Among those errors is the possibility for it to exhibit behavior resembling prejudice.)
Isn't that what the article is trying to say, though? That your model can only be as accurate as your data set… and that even then, you have to be very careful to make sure it's not inferring patterns from entirely unrelated information?
Curiously though, did you compare the non-hire (full time) rates of interns vs fire rates of non-interns?
That's not what happened in the example at all. The example company isn't biased against summer interns, "who stops working after x time" was just a bad question.
The comment you're replying to boils down to "do you want a monkey's paw solving your problem? If so, then AI may be for you"
Or perhaps "stop pretending you're ever going to get ethics or empathy out of a computer"
Not sure I understand the question. IIRC, the way data was setup there was no way to tell why an intern stopped working for the company, because for all interns "reason code" for separation was the same.
Meaning, isn't it prudent to spend time on this issue?
That was the logical next step and we started on it, but it required exporting more historic data out of the HR system and filtering out anyone who had started as an intern as well. Sounds simple, but in practice it's anything but. Just for reference, data extraction, cleaning, and filtering in that project took at least an order of magnitude more time than anything related to machine learning.
The project eventually lost steam and got abandoned.
>Do you still suspect a skewed result?
Absolutely. My personal intuition is that there is very little correlation between resumes and candidate quality. If that is true, any seemingly accurate predictions would be the result of a similar problem. Testing this hypothesis was a large portion of why I agreed to work on the project in the first place.
More importantly, the only way to really show causation is by positing a mechanism.
Given a statistically large enough sample, there are two possibilities: 1) the Siemens sensor actually is at fault, or 2) the Siemens sensor is part of a larger system, one that differs in non-Siemens turbines, and that larger system is failing.
Either way, the model prediction on turbine failures is enhanced with that Siemens feature. But to even get to this granularity, you are diving into model explainability, or what features were important for each prediction. Here, you try to understand the black-box to find reasons for particular input->output.
We aren't just looking for patterns. We are looking for patterns so that we can take action and affect the future. If the patterns, which are real enough in the historical data, don't correctly predict the impact of a choice, then they are anti-helpful bias.
For example, it may be that the company bought Siemens sensors years ago and then switched to another brand later. Unsurprisingly, older turbines fail more than newer ones. So, really, it's age that is the causative factor and the concrete action you want to take is to pay closer attention to older turbines. Even though the correlation to Siemens is real, if the action you take is "replace all the Seimens sensors with another brand", that won't make those old turbines work any better.
In other words, understanding data doesn't just mean "see which bits are correlated with which other bits". In order to be useful, we need to understand which changes to those bits in the future will be correlated with which desired outcomes. Anything less than that and you don't yet have information, just data.
Yes, AI systems presume induction to be true. But so does... uh, science and most other things we do?
The point is that the Siemens sensor is a spurious correlate of turbine failure, because the underlying dataset is biased towards Siemens sensors. The scenario suggested by the author is one in which your turbine-failure dataset does not match reality.
No amount of sample enlargement will correct sample bias. You have a variable which is disproportionately represented in your underlying dataset despite being independent from a collection of variables correlated to failure, and the algorithm is learning that one instead.
Real world ways this is plausible and cannot be corrected by increased sampling:
1. Your telemetry data is accurate, but your logging service providing that data is faulty and only consumes data from a subset of meaningful publishers.
2. Whoever provided this dataset fat fingered a SQL query which joined too few tables including the sensor vendors, but correctly returned only the failing turbines.
3. Your data has (unnormalized) duplicates, because more than one system is providing telemetry data for Siemens sensors without the older systems being retired.
4. You use mostly Siemens sensors, and simply didn't correct for this in your sample.
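A toy simulation of case 1 (all numbers invented) shows why a bigger sample doesn't help:

```python
import random

random.seed(1)

# Failures depend only on turbine age, but the faulty logging pipeline
# under-reports failures from non-Siemens turbines.
def make_turbine():
    age = random.uniform(0, 20)
    vendor = "siemens" if random.random() < 0.5 else "other"
    failed = random.random() < age / 40  # failure depends on age alone
    return vendor, failed

def biased_sample(n):
    out = []
    while len(out) < n:
        vendor, failed = make_turbine()
        # The biased collection step: most non-Siemens failures
        # never reach the dataset.
        if vendor == "other" and failed and random.random() < 0.8:
            continue
        out.append((vendor, failed))
    return out

def failure_rate(sample, vendor):
    rows = [failed for v, failed in sample if v == vendor]
    return sum(rows) / len(rows)

for n in (1_000, 100_000):
    sample = biased_sample(n)
    print(n, failure_rate(sample, "siemens"), failure_rate(sample, "other"))
# The vendor gap persists at any n; only fixing collection removes it.
```

Siemens looks several times worse at every sample size even though the underlying failure process never looks at the vendor; enlarging the sample just estimates the biased collection process more precisely.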
1. Not a spurious correlation - Siemens sensors are in fact associated with increased failure rates in the dataset and if you continue to sample data with the same methodology this correlation will continue. You need to fix your data collection methodology, but it's not a spurious correlation.
2. See #1.
3. See #1.
4. The original problem statement said that a low percentage of unfailed turbines used Siemens sensors, and a high percentage of failed turbines used Siemens sensors. So 'you use mostly Siemens sensors' would imply that most of your turbines have failed, which seems a little unlikely to me.
Given how incredibly hard it is to avoid sample bias, you can't take it for granted that your training data doesn't have any sample bias.
"just as a dog is much better at finding drugs than people, but you wouldn’t convict someone on a dog’s evidence. And dogs are much more intelligent than any machine learning."
Because in my head I followed it with the sentence "but we're all confident that we will have dogs driving our cars in about 5 years." Food for thought for sure.
They didn't say dogs were better than technology at solving problems, in any sort of general sense.