Let's look at the second test. It's advertised as a "logic test". The description is:
> Participants then completed a 20-item logical reasoning test that we created using questions taken from a Law School Admissions Test (LSAT) test preparation guide (Orton, 1993).
That's the entire description of their method. So immediately, we can see the following problems:
1. Just like the joke test, there's no way to replicate this from the description in the paper. Which questions did they take, and why? That in turn throws every claim that the DK study has been replicated into question.
2. The citation is literally a Cliffs Notes exercise book for students. It's about memorizing answers to pass law exams, not a test designed to measure logical reasoning ability. Why do they think this is a good source of questions for testing logic? Law is not a system of logic; there's even a famous saying about that: "the life of the law is not logic but experience". If you wanted to test logical reasoning, a more standard approach would be something like Raven's Matrices.
Putting my two posts together there's a third problem:
3. Putting aside the obvious problems with subjectivity, their joke test is defined in an illogical way. They define a test of expertise (working as a comedian), select some people who pass this test and label them experts, then discover that one of those experts would have been ranked by their own test as "incompetent but doesn't know it". That's a contradiction: this person was included precisely because the researchers' own criterion defined them as competent. Rather than deal with the contradiction by reframing the question, they simply ignore it and discard that comedian from the expert pool.
This is good evidence that DK themselves weren't particularly logical people, yet they claim to have designed a test of logic - a bold claim at the best of times. Ironically, it appears DK may be suffering from their own effect: they believe themselves to be competent at designing tests, yet the evidence in their paper suggests they aren't.
They’re 100% divorced from law and are closer to puzzles of the form “Six people sit at a table, four of whom are wearing hats, three of which are red, …”
[1] https://www.petersons.com/blog/sample-lsat-test-questions/
1. Which of the following is most strongly suggested by the government’s statement above?
(A) A warning that applies to a small population is inappropriate.
(B) Very few people drink as many as six cups of coffee a day.
(C) There are doubts about the conclusive nature of studies on animals.
(D) Studies on rats provide little data about human birth defects.
(E) The seriousness of birth defects involving caffeine is not clear.
Given the structure of this question I assumed there'd be more than one right answer, but apparently the only "logical" answer is (C).
Maybe the word "logic" is used differently in the legal profession, but this doesn't resemble the kind of logic test I'm used to. It's about the unstated/assumed implications of natural language statements, i.e. what a 'reasonable' person might read into something, rather than tight reasoning to which formal logical rules could be applied. I can see why that's relevant for lawyers, but it's not really about logic.
Still, let's roll with it. (A) and (B) are clearly irrelevant given the stated justification, so strike those. But (C) and (D) appear to be just minor re-phrasings of each other. Why is (C) correct but (D) not? An implied assumption of the study is that rat studies provide a lot of data about human birth defects, and the government's position implies it doesn't accept that, so (D) could easily be a reasonable subtext for that position. (E) could also be taken as a reasonable inference: the government believes there's a risk the study's authors are using an exaggerated definition of "birth defect" that voters wouldn't agree with, and that a 'refutation' of the study would take the form of pointing out the definitional mismatch.
So if I were asked to score this question I'd accept (C), (D) or (E). The LSAT authors apparently wouldn't.
That said, the "analytical reasoning" sample question looks more like a logic test, and the logic test looks more like a test of analytical reasoning. But even their bus question is kind of bizarre. It's not really a logical reasoning test; it's more like a test of whether you can ignore irrelevant information. The moment they say rider C always takes bus 3 and then ask which bus {any combination + C} can take, the answer must be (C) "3 only", which is indeed the correct answer.
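To make the deduction concrete, here's a toy sketch (the other riders and their allowed buses are invented; only "rider C always takes bus 3" comes from the question): once one member of the group is pinned to bus 3, the set of buses the whole group could share collapses to {3}, regardless of what the rest of the setup says.

```python
# Toy illustration only: riders A and B and their allowed buses are hypothetical;
# the one real constraint is that rider C always takes bus 3.
allowed = {
    "A": {1, 3},
    "B": {2, 3},
    "C": {3},      # rider C always takes bus 3
}
group = ["A", "B", "C"]

shared = {1, 2, 3}          # buses the whole group could take together
for rider in group:
    shared &= allowed[rider]

print(shared)  # {3} -- every constraint except C's is irrelevant
```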
> I do not think that it really matters which questions you choose as long as they span a wide enough difficulty range so that you are able to separate participants.
The problems here point at a fundamental difficulty: all claims about competence/expertise are relative to whoever picks the definition of "competent". In this case the tasks are all variants on "guess what the prof thinks the right answer is", which is certainly the definition of competence used in universities, but people outside academia often have rather different definitions.
So the questions really do matter. If the DK claim were more tightly scoped to their evidence - "people who think they're really good at guessing what DK believe actually aren't" - then nobody would care about their results at all. Because they generalized from undergrads guessing what jokes Dunning & Kruger find funny to every possible field of competence across the entire human race, they became famous.
> I'm not sure what that has to do with anything? The paper doesn't claim to have anything to do with testing logic. It's about people's self-perception in relation to a task at which they are, or are not, competent. That task could be juggling watermelons or strangling geese for all it matters.
The paper reports the results of a 'logic' test administered to undergrads and uses this to define competence. It's a key part of their evidence that the effect is real.
> It's about people's self-perception in relation to a task at which they are, or are not, competent. That task could be juggling watermelons or strangling geese for all it matters.
The specific tasks matter a great deal.
The whole paper relies very heavily on the following assumption: DK can accurately and precisely tell the difference between competence and lack of competence. In other words, that they know the right answers to the questions they're asking their undergrads.
In theory this isn't a difficult bar to meet. They work at a school and schools do standardized testing on a routine basis. There are lots of difficult tasks for which there are objectively correct and incorrect answers, like a maths test.
But when we read their paper, the first two tasks they chose aren't replicable, meaning we can't verify that DK actually knew the right answers. Plus the first task is literally a joke: there isn't even a right answer to the question to begin with, so their definition of "competence" is meaningless there. The other tasks might or might not have right answers that DK correctly selected, but we can't verify that for ourselves (OK, I didn't check their grammar test, but given that the other two are unverifiable, why bother?).
That's a problem because the DK effect could appear in another situation they didn't consider: what if DK don't actually know the right answers to their own questions, but some of their students do? If that happened, you'd see the following: some students would put down the "wrong" (actually right) answers and rate their own performance highly, because they know their answer is correct and don't realize the professors disagree. Other students might realize the professors expect a different answer and put down the "right" (actually wrong) answer, but they'd know they were playing a dangerous game and so rate their confidence lower. That's all it would take to produce the DK pattern without the underlying effect actually existing. To exclude this possibility we'd have to check that DK's answers to their own test questions are correct, but we can't. Nor should we take it on faith, given their dubious approach to question design.
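To make that mechanism concrete, here's a toy simulation (every number in it is invented, nothing is taken from the paper) of the two response styles described above. It assumes a hypothetical 20-question test where the graders' key is wrong on 8 questions:

```python
import random
import statistics

random.seed(0)

# Toy simulation: a mis-keyed answer sheet plus the two confidence styles
# described above is enough to reproduce the DK pattern, even though nobody
# in this toy world is actually incompetent. All parameters are invented.

N_QUESTIONS = 20
MISKEYED = 8        # questions where (hypothetically) the graders' key is wrong
N_STUDENTS = 200

students = []
for _ in range(N_STUDENTS):
    if random.random() < 0.5:
        # Answers what they believe is correct, loses every mis-keyed question,
        # and is confident because they don't realize the graders disagree.
        graded_score = N_QUESTIONS - MISKEYED
        self_rating = random.uniform(80, 95)    # perceived percentile
    else:
        # Second-guesses the graders and matches the key, but knows it's a
        # gamble and so reports lower confidence.
        graded_score = N_QUESTIONS
        self_rating = random.uniform(50, 70)
    students.append((graded_score, self_rating))

# Compare bottom and top halves by graded score, DK-style.
students.sort(key=lambda s: s[0])
half = N_STUDENTS // 2
for label, bucket in (("bottom half", students[:half]), ("top half", students[half:])):
    score = statistics.mean(s[0] for s in bucket)
    rating = statistics.mean(s[1] for s in bucket)
    print(f"{label}: mean graded score {score:.1f}/20, mean self-rating {rating:.0f}th percentile")
```

The "bottom half" by the graders' key ends up reporting the highest self-ratings, which is exactly the DK plot, even though in this toy setup those are the students who actually knew the material.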