I think that your HIT design highlights several common mistakes requesters make on MTurk:
- You are underpaying for the task (would you write a good review of Berkeley, CA for $1 for a stranger?)
- You provide no aggregation or verification step to signal to turkers that their work should jibe with other turkers' output. You also give no indication that such verification is possible or likely to happen.
- Your task output is poorly defined and open to interpretation. You may have asked a straightforward question, but I assume you placed a blank textbox on the screen and expected well-formed paragraphs in return.
If you want a great example of relatively high-quality text synthesis on MTurk at prices in the range of your budget, see http://borismus.com/crowdforge/
If you want to learn more about how to design HIT workflows, see http://projects.csail.mit.edu/soylent/ (disclosure: I share an office with and work with Michael Bernstein, but not on this work). One of Soylent's contributions was the Find-Fix-Verify design pattern, which helps with some of the problems you raise.
Your task is even harder, of course, since you require subject-matter experts in a fictional location. So perhaps MTurk is the wrong crowd for your task.
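For readers unfamiliar with the pattern, here is a minimal sketch of the Find-Fix-Verify control flow. The stage names come from the Soylent work mentioned above; the span/vote representation is my own simplification and not how Soylent actually encodes tasks:

```python
from collections import Counter

def find_stage(flags_per_worker, min_agreement=2):
    """Keep only spans that at least `min_agreement` Find workers flagged."""
    counts = Counter(span for flags in flags_per_worker for span in set(flags))
    return [span for span, n in counts.items() if n >= min_agreement]

def verify_stage(candidate_fixes, votes, min_votes=2):
    """Keep only the fixes that enough Verify workers approved."""
    return [fix for fix in candidate_fixes if votes.get(fix, 0) >= min_votes]

# Toy run: three Find workers each flag problem spans in a paragraph.
flags = [["span-2", "span-5"], ["span-2"], ["span-2", "span-7"]]
problem_spans = find_stage(flags)  # only "span-2" was flagged by 2+ workers

# Fix workers would propose rewrites for each flagged span (elided here);
# Verify workers then vote on each proposed rewrite.
votes = {"rewrite-A": 3, "rewrite-B": 1}
accepted = verify_stage(["rewrite-A", "rewrite-B"], votes)
```

The point of splitting the work into three stages is that no single turker's lazy or wrong answer makes it through: independent workers must agree at both the Find and Verify steps.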
I think this article doesn't reflect everyone's experience with Mechanical Turk. We get lots of high quality work out of Mechanical Turk and lots of other companies do as well. It does take a fair amount of work to get the quality right - that's how we got started as a business and that's why many people still come to us.
As an aside, if the author of the article is reading this thread and wants data, we would be happy to talk about it.
It should be trivial to create a task, create a second task for evaluating that task, and yet another for evaluating the evaluation. Run all three long enough and you will in fact get good results.
Obviously if you're going to use an unreliable protocol there have to be management protocols in effect to correct errors, or you will end up with errors. This is not a revelation.
So you'd think this tool would do this for you - but instead you need another layer on top, either one you code, or some 3rd party tool like CrowdFlower.
Even if, in theory, feeding the results back into Mechanical Turk for manual evaluation will correct errors, there are still huge tradeoffs in practice.
Suppose you had three people look at a task and say whether it was done correctly. We pick the most popular of the three answers. This works fine for tasks like speech transcription, where it is easy to tell whether the work is correct. But what about tasks like labeling features in biological images? Surprisingly, even if you show people examples of what is correct and what is not, they still have a hard time distinguishing between the two [1]. These are the kinds of difficult tasks that are especially in need of QA.
If the people evaluating correctness are only right 60 percent of the time, you'd better have more than 3 people vote on whether it's correct just to get a good estimate. (We're also assuming people are biased toward the correct answer, rather than toward the wrong answer or toward a fixed response.) And if you need a lot of people to evaluate each task, you're paying several times more than you did for the original tasks, plus you have to write infrastructure for feeding things back into Mechanical Turk.
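To make the voting math concrete: for n independent evaluators who are each right with probability p, the chance that the majority is right is a binomial tail sum. (The independence assumption is doing a lot of work here; correlated errors make things worse.)

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a majority of n independent evaluators,
    each correct with probability p, gets the right answer.
    Assumes odd n so there are no ties."""
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# With evaluators who are right only 60% of the time:
for n in (3, 5, 11, 21):
    print(n, round(majority_accuracy(0.6, n), 3))
```

With p = 0.6, three voters only get you to about 65% majority accuracy; getting above 80% takes over a dozen voters per task, which is exactly the cost blowup described above.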
Like you said, it will work in principle, but there are some tradeoffs.
Personally, I prefer the gold-data method and being conservative about accepting results in the first place rather than feeding them back to get fixed or labeled incorrect.
[1] www.vision.caltech.edu/publications/WelinderPerona10.pdf
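A minimal sketch of the gold-data idea: seed each batch with items whose answers you already know, and only accept a worker's batch if they clear those. The question IDs and the 80% threshold here are made-up illustrations, not values from any real system:

```python
# Known-answer ("gold") items quietly mixed into the batch.
GOLD = {"q1": "cat", "q7": "dog"}

def passes_gold(worker_answers, gold=GOLD, threshold=0.8):
    """Accept a worker's batch only if they got at least
    `threshold` of the seeded gold items right."""
    graded = [worker_answers.get(q) == a for q, a in gold.items()]
    return sum(graded) / len(graded) >= threshold

# A worker who nails both gold items is accepted...
ok = passes_gold({"q1": "cat", "q7": "dog", "q3": "bird"})
# ...one who misses a gold item is rejected up front.
rejected = passes_gold({"q1": "cat", "q7": "fish", "q3": "bird"})
```

The appeal over the feedback-loop approach is that you pay nothing extra per task for evaluation; the cost is authoring the gold items and keeping them from leaking.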
Or are they?
What if "mechanical turks" keep using their Facebook accounts to do the same?
That makes any rating system almost useless.
And since I will be publishing an Android app soon: wouldn't it be wise to hire people to rate it with 5 stars, say, a few hundred times? It seems like my competition will do it.
There is additional work for the service provider, but it seems to me that it aligns with their self-interest at some level. I don't think Amazon really wants mturk to be associated with providing a spam workforce.
I believe one of the things that CrowdFlower explicitly calls out as an advantage over mturk is quality control (although for this particular solution to work, all crowd-sourcing providers would have to do it; in this case it takes only one bad provider to enable bad behavior).
As to your hopefully hypothetical question: a risk you're running is that Google will pull your app from the store. I haven't heard of a case with Google, but I'm pretty sure apps have been pulled from Apple's App Store for manipulating ratings, so the downside could be big (your hard work could amount to nothing).
Except for the one shady site that doesn't, and ends up raking in profits.
They are tackling tasks like extremely difficult OCR and collaborative editing and proofreading.
I've used mturk at work to automate transcribing short recordings and have found that it works pretty well. The trick is to qualify your workers so that they pass some kind of test. You can also only accept workers that have a rating above some minimum. Then, critically, as suggested by others here, get each task done multiple times for cross-checking. And make sure that your instructions are clear.
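The cross-checking step for transcription can be sketched like this, assuming you collect three transcripts per clip and accept a normalized majority. The normalization rule here is just an example; real pipelines tune it per task:

```python
from collections import Counter
import re

def normalize(text):
    """Lowercase and strip punctuation before comparing transcripts."""
    return tuple(re.sub(r"[^a-z0-9 ]", "", text.lower()).split())

def cross_check(transcripts, min_agree=2):
    """Return the majority transcript if at least `min_agree` workers
    agree after normalization; otherwise return None so the clip
    can be re-posted or reviewed by hand."""
    counts = Counter(normalize(t) for t in transcripts)
    best, n = counts.most_common(1)[0]
    return " ".join(best) if n >= min_agree else None

transcripts = ["Hello world.", "hello world", "Hullo, world"]
print(cross_check(transcripts))  # prints "hello world"
```

Clips where no two workers agree get kicked back for another round rather than accepted, which is where most of the quality gain comes from.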
I was planning on using mturk for a project I'm working on.
Has anyone got any pointers on how to get the best out of Mechanical Turk? Advice much appreciated.
To "weed out" ineligible workers, try this approach:
1. Post a bunch (1000-5000) of cheap multiple-choice HITs.
2. Allow no more than 10 HITs per worker.
3. Have each HIT collect 3 responses from different workers.
4. Review the answers, compile a list of "good" workers, and blacklist the "bad" ones.
5. Post another batch of HITs, available only to the eligible workers found in step 4. This time the HITs can be more demanding; individually review each worker's results, and the best ones go on your "preferred worker" list.
6. Repeat steps 1-5 as necessary.
From then on it's fairly safe to rely on mturk workers from your preferred list.
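The grading in step 4 can be approximated in code by checking each worker against the per-HIT majority. This is a sketch under the 3-responses-per-HIT assumption above; note that with so few responses a correct minority answer gets penalized, which is why the manual review in step 5 still matters:

```python
from collections import Counter, defaultdict

def grade_workers(responses, min_accuracy=0.8):
    """responses: iterable of (hit_id, worker_id, answer) triples.
    A worker is "good" if they agree with the per-HIT majority
    at least `min_accuracy` of the time."""
    by_hit = defaultdict(list)
    for hit, worker, answer in responses:
        by_hit[hit].append((worker, answer))

    right, total = Counter(), Counter()
    for hit, pairs in by_hit.items():
        majority, _ = Counter(a for _, a in pairs).most_common(1)[0]
        for worker, answer in pairs:
            total[worker] += 1
            right[worker] += (answer == majority)

    good = {w for w in total if right[w] / total[w] >= min_accuracy}
    return good, set(total) - good

# Toy batch: w3 disagrees with the majority on h1.
responses = [
    ("h1", "w1", "A"), ("h1", "w2", "A"), ("h1", "w3", "B"),
    ("h2", "w1", "C"), ("h2", "w2", "C"), ("h2", "w3", "C"),
]
good, bad = grade_workers(responses)
```

The blacklist from step 4 is just `bad`; the whitelist fed into step 5 is `good`.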
> We all know that Mechanical Turk challenges the whole “Junk-in, Junk-out” dilemma and makes it more like “Always junk-out, regardless of the input process”
Couldn't be more true, IMO. mturk is basically useless except for this "meta" kind of research, and it's a good example of a community that needed active management and positive incentives going to absolute shit in the absence of both.