Median ratings are "honest" in this sense, as long as ties are broken arbitrarily rather than by averaging. Math challenge: is there a way of combining the desirable properties mentioned in the post with the property of honesty? I suspect there is but I haven't tried it.
I usually don't want ratings, I want the Wirecutter treatment. Sometimes, I know/care enough to really research the topic, in which case star reviews are relatively unhelpful. The rest of the time, I just want someone trustworthy to say "buy this if you want to pay a lot, buy this if you want something cheap, but this third thing is no good at any price".
Foursquare uses it and I've found their scores to be way more useful than Yelp's.
The biggest problem with star ratings is that they're so arbitrary. What is the difference between 3 and 3.5? What is a 1 vs a 2? 3/5 is 60%, which is almost failing if you think of it as a grade; if I scored something 3/5 I would never use that product or service again, yet many of the best restaurants are rated 3/5 on Yelp.
Unless the user has some scoring system in place for different qualities of the product or service, there is no way you can get anything resembling an accurate score.
I would never trust a user to accurately assess a score given 10 different options (.5-5) but I would be way more likely to trust a user to say either "I like this product" or "I do not like this product."
Yes, the Wirecutter approach works great, but it just doesn't scale.
For example, Uber seems to think anything but a 5/5 is a failure. I know this so I skew to accommodate, but in my personal ranking system I've only had a couple 5 star rides (someone really going above and beyond).
Up/down with an optional qualifier afterwards (e.g. "why were you unhappy?" after a thumbs down) seems to remove a lot of confusion.
The ultimate question is: is this going to be useful to me? And the answer to that is... somewhat complicated.
Informative, timely, accurate, significant (which may be none of the above), funny (may be appropriate or inappropriate, based on context and/or volume).
Some information is often (though not always) better than no information. Bad information is almost always worse.
(Aside: troubleshooting a systems issue yesterday I had the problem of someone trying to offer answers to questions where "I don't know" was far more useful than "I think ...". Unfortunately, I was getting the "I think ..." response, though not phrased as such.)
What you describe, the Wirecutter treatment, is the case of an expert opinion. Here there remain issues -- particularly that of the biased expert. But if I could give a hierarchy of opinions from least to most useful:
-2. Biased.
-1. Ignorant.
0. None.
1. Lay.
2. Novice.
3. Experienced.
4. Expert.
5. Authority in field and unbiased.
Note that the problem of judging expertise itself recapitulates much of the same problem.
Qualification and reputation of the raters themselves is a critical element missing from virtually all ratings systems.
The other alternative is for users to actually SORT and RANK all products in that category that they have reviewed. Not a tenable solution.
Side comment: the Yelp histograms ARE useful... but that is more of a side effect/emergent property of a bad rating scheme than anything else. Because people aren't using the stars ideally, the histogram gives you insight into how. So it's not a bad solution, but a better solution would be something other than the stars.
The real win would be empowering the user to choose their own rating style. I don't see this happening because it's much harder to push content at users this way.
But, to your overall point, there are a lot of things that I just want a hopefully mostly unbiased expert to tell me what to buy and I'm just fine with that. When I buy a garden hose nozzle, I'm just fine with whatever one of Wirecutter's sister sites tells me to buy. I don't need or want to do a lot of research into the finer points of garden hose nozzle design.
1. How long did it take for the person to vote in the first place? (Might change weight; if we're talking cars, nobody really knows if they "like" it 4 minutes after purchasing and it means something different if the rating appears 3 months later.)
2. Has the vote changed between up/down? Has this happened twice?
3. Has the person voted for other things in similar categories? Might make sense for phones, over a period of years. Doesn't make sense for a person to buy and rate 20 different chairs in a week. Use it to give credibility.
The problem with dichotomous ratings (binary, thumbs up-down) is that they lose a lot of meaningful information without eliminating the problems you're referencing.
That is, the same problems apply to dichotomous ratings, in that people still have tendencies to use the rating scale differently. Some tend to give thumbs up a lot, others down, and people interpret what's good or bad differently. People who are ambivalent split the difference differently.
On top of that, you lose the valid variance in moderate ranges, and actually amplify a lot of these differences in use of the response scale, by forcing dichotomous decisions, because now you've elevated these response style differences to the same level of the "meaningful part" of the response. E.g., maybe one person tends to rate things more negatively than another person, rating 4 and 5 respectively. But when you dichotomize, now that becomes 1 and 2.
The question is whether or not, on balance, the variance associated with irrelevant response scale use is greater than the meaningful variance, and generally speaking studies show the meaningful variance is bigger. In general, you see a small but significant improvement in rating quality going from 2 to 3, and from 3 to 4, and then you get diminishing returns after 4-6 options.
Also, people really don't like being forced to take ambivalence and choose up or down, so in the very least having a middle option is better (unless you want to lose ratings).
It's fairly straightforward to adjust for rating style differences if you have a bunch of ratings of an individual on a bunch of things whose rating properties are fairly well-known. Amazon could do this if they wanted to, and Rotten Tomatoes I think might do something like this already.
RT, in fact, is kind of a bad example, because their situation is so different from typical product ratings, in that you have a small sample of experts who are rating a lot of things. They also are aggregating things that themselves are not standardized-- their use of the tomatometer in part stems from them having to aggregate a wild variety of things, as if everyone on Amazon used a different rating scale, or no rating scale at all. Note too that there's then a "filtering" process involved by RT. Finally I also feel obliged to note they do have ratings and not just the tomatometer, which I've started paying attention to after realizing that things like Citizen Kane show up as having the same tomatometer score as Get Out--a fine movie but not the same.
The game theory angle is interesting to think about. It's something I don't usually deal with because in the situation I'm used to, the raters don't have access to other raters' ratings. That's one solution, but impractical. A sort of meta-rating is another--a lot like Amazon's "helpfulness" ratings. It's imperfect but probably does well in adjusting for game-theory-type phenomena, like retaliatory rating, etc.
Surely you can't prove this simply by noticing that there are many 1- and 5-star reviews, as there could be many other reasons for that. One obvious one: people who strongly like or strongly dislike a product are more likely to take the time to review. I personally have never felt the need to review something if I felt "meh" about it.
One sample study might be to see how people's ratings change if they have a chance to see the average rating first or not, but that would be a tricky study as you'd need to get people to buy something without seeing the ratings.
In any case, I find the mechanism design angle interesting regardless of the behavioral angle :)
I doubt it's thought about in precisely those terms, but I've certainly thought "that's overrated/underrated" before, and had that affect my rating.
Or for that matter if I got a simple item and it works as expected. What's an Amazon Basics HDMI cable supposed to do? Does 5 stars mean that it turns HD into 4K through magic? But if it does what an HD cable is supposed to do (and seems well-built) I guess I should give it 5 stars?
I'm fairly sure at one point years ago I did see some data showing that rating distribution changes depending on whether the rater sees the average rating first, but it was a long time ago and I don't recall the specific differences now. In practice you're right that this isn't feasible for most use cases.
I do know that the rating distribution is strongly bimodal - on a 5 star scale I think 80+% of ratings will be either 1 or 5 star. Mostly 5 stars - IIRC they were around half of all ratings.
I'm not sure if that would actually work better in practice, but it's at least an interesting idea.
Then I realized that it would just incentivise bots to add 1-star reviews to random products once their creators figure out this mechanism.
Sometimes these problems make me sad, it could all be so nice and easy if it weren't for these bad actors.
_Judgement_ is a slightly different problem. There's an entire issue (#145) of the _Journal of Economic Theory_ on this, but the panorama is still quite bleak, and the reddit approach is far from state-of-the-art.
(Personal experience: I've "returned" to reddit (I swore off facebook but I'm still addicted to having something on my phone), and the only way to get people to interact with you is to browse the "new" queue. Once something is "hot" it's basically dead -- new comments are queued to the end even if they're rising fast, and no one replies to you).
(For those not familiar: single-peaked preferences assumes that the person always prefers the final rating to end up closer to their personal rating. So if I believe the restaurant is 3 stars, I'd most prefer it gets rated actually at 3 stars, and I'd rather see 2 stars than 1 star. If all the raters have single-peaked preferences, then using the median to produce the final rating is truthful: A person can't move the final rating closer to their own belief by lying. The mean is not truthful: If the current average is 4 and my rating is 3, I can pull the average closer to 3 by giving a 1-star review.)
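A toy Python example of this (invented numbers, and using median_low so ties are broken by picking a value rather than averaging, per the comment above): exaggerating moves the mean toward my true belief, but can't move the median.

```python
from statistics import median_low

# Other people's ratings of the restaurant (hypothetical).
others = [4, 4, 5]

honest, exaggerated = 3, 1  # I believe it's a 3; 1 star is my most extreme lie

# With the mean, exaggerating pulls the result toward my belief (3):
print(sum(others + [honest]) / 4)       # 4.0
print(sum(others + [exaggerated]) / 4)  # 3.5 -- lying pays off

# With the median (ties broken by picking a value, not averaging),
# exaggerating below my belief gains me nothing:
print(median_low(others + [honest]))       # 4
print(median_low(others + [exaggerated]))  # 4 -- no gain from lying
```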
For instance, if I see a product on Amazon with a 4.8 average rating but notice a lot of very angry 1 star ratings, I'm likely to infer that there may be quality control problems.
Amazon displays a histogram so the shopper can assess the meaning of the distribution heuristically.
There's also the issue of whether ratings should be absolute or based on value. If I buy some obviously knockoff ear buds for $6 and they are way better than expected, I'd give them 5 stars, but if they had cost $50 I'd have given a three star review.
So for shopping it seems that there are multiple signals being aliased into a single star rating.
> DOTA 2 users then had the brilliant idea to do the dumbest thing any fanbase can do to a game, flood Metacritic with bad user reviews. The slew of zeros since the forgotten Diretide has dropped the game's user score about two points to a 4.5.
https://www.forbes.com/sites/insertcoin/2013/11/02/valve-for...
https://arstechnica.com/gaming/2017/08/steam-reviewers-bomb-...
Also I'd like to know if a 5/10 rating is mostly 5s, or an average of mostly 1s and 10s.
In the case of Amazon, the relevant options are:
1. Likert scale of quality:
a. junk: just don't buy it
b. cheap, and works well enough for occasional use
c. higher quality: willing to spend more and you'll get a much better outcome
d. overpriced
2. bad shipping, bad vendor, poor customer service
I hate seeing a bad review for a product based on the last item; those are normally outlier issues or whiners, and I normally try to filter them out. In the case of Rotten Tomatoes it is, again, a different set of parameters.
The suggested method seems to be asking for binary responses (like/dislike), then aggregating them with the confidence-bound formula. This should be truthful in, e.g., a model where users who like it want to maximize the score and users who dislike it want to minimize the score.
This is really interesting, do you have anything where I could read more about it?
In particular, the conclusion:
"The reviews on Amazon’s Electronics products very frequently rate the product 4 or 5 stars, and such reviews are almost always considered helpful. 1-stars are used to signify disapproval, and 2-star and 3-stars reviews have no significant impact at all. If that’s the case, then what’s the point of having a 5 star ranking system at all if the vast majority of reviewers favor the product? Would Amazon benefit if they made review ratings a binary like/dislike?"
https://www.youtube.com/watch?v=fX9lj0UdB9s
Around 11:40 he shows evidence of this "dishonest" behavior. As far as I remember, the whole talk was very good. He has some publications on the topic.
But, counter to the OP's point, I wouldn't assume this is an attempt to move the average; I would guess it happens for a number of reasons, including that it's too much mental energy to decide whether a product (film) is worth four or five stars. When you rate something, you are often just trying to say "liked it" or "didn't like it."
I used it precisely as you describe.
Ended up in a discussion/argument with one of the users (not aware of my role) over whether or not that constituted "abuse" of the system.
It was pointed out (by others) what my relationship to the system design was. I remain amused by the episode.
Usually we consider "aggregation functions" with a fixed number of graders, N. It has been proven that if you want an aggregation function that is:
- anonymous: all graders treated equally
- unanimous: if all graders give the same grade, then that must be the output grade
- strategy-proof: a grader who submitted a grade higher (lower) than the output grade, if given the chance to change their grade, could do nothing to raise (lower) the output grade
- strictly monotone: if all graders raise (lower) their grade, then the output grade must rise (fall)
then your aggregation function must be an "order statistic": the median (if N is odd) or some other function which always chooses the Mth highest input grade.
If you relax the last criterion to:
- weakly monotone: if all graders raise (lower) their grade, then the output grade must rise (fall) or stay the same
then your aggregation function must be "the median of the input data and N-1 fixed values". As an example of this last type of function, let's take @panic's idea that each grader has an honest evaluation between 0 and 100 but has an agent that submits a fake grade (0 or 100 usually) to pull the average toward their honest evaluation.
As I say in a descendant comment, this system will converge to the unique number, X, such that X percent of the graders want the final grade to be X or above. You noted that this whole system (the average and the agents) is strategy-proof, so each grader should be honest with their agent. We might as well pull the agents into the system and say, "submit your honest evaluation and we'll calculate X, the unique number such that X percent of the graders want the final grade to be X or above."
This is an aggregation function. It is anonymous, unanimous, strategy-proof, and weakly monotone. I call it the "linear median" in my PhD thesis. Rob LeGrand called it "AAR DSV" in his thesis. We've been calling it the "chiastic median" more recently. It has some interesting properties. Considered in the context of "the median of the input data and N-1 fixed values", with 100 graders, the 99 fixed values are 1,2,...,99, and this function always returns the median of the input data with these 99 fixed values. (No matter how many graders there are, the fixed values will equally divide the number line between 0 and 100.)
You can see chapters 5-8 of my PhD dissertation for more info: http://ajennings.net/dissertation.pdf
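Roughly, as a sketch (my own illustration of the construction described above, not code from either thesis): take the median of the N submitted grades together with N-1 fixed values that equally divide the scale.

```python
from statistics import median

def chiastic_median(grades, lo=0.0, hi=100.0):
    """Median of the N input grades plus N-1 fixed values equally dividing
    [lo, hi]; the combined list has 2N-1 entries, so the median is always
    a single element."""
    n = len(grades)
    step = (hi - lo) / n
    fixed = [lo + step * i for i in range(1, n)]  # N-1 equally spaced values
    return median(list(grades) + fixed)

# With 100 graders the fixed values are 1, 2, ..., 99.  If 40 graders say
# 100 and 60 say 0, the result is 40: exactly 40% of graders want the
# final grade to be 40 or above.
print(chiastic_median([100] * 40 + [0] * 60))  # 40.0
```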
Now, you're thinking about how the grade changes when a new vote is added, so we're really talking about a family of aggregation functions, one for each possible number of graders. We want each one to be strategy-proof in itself, but we also need to consider how they relate to each other.
Do you want strict monotonicity or weak? (I find strict monotonicity too restrictive, myself.) If you say "strict", then for each N you need to choose which order statistic you want. If you say "weak", then for each N you need to choose N-1 fixed values and you'll always take the median of the input data and the appropriate array of fixed values.
In my thesis (section 7.2) I talk about how you can create a "grading function" to unify a family of aggregation functions, but I don't think that's a perfect fit since we want to somehow "punish" subjects that don't have very many grades (that's what the OP is about). Do we want to pull them towards 0, or pull them towards some global neutral value (like 3 out of 5)?
Worse than useless.
Even a simple change like adding a "show only items with a minimum of X reviews" would be a godsend.
If Amazon ranked its items the "proper" way, such that all one-rating products were far from the top, I imagine it would see a lot more clustering of purchases on the most popular version of every product type. All those variations that were not as popular would receive many fewer purchases, and some of those vendors might simply fold.
Amazon may have decided that having a larger ecosystem of vendors is worth more than implementing a better rating system. This, presumably, is not in the customer's interest (unless perhaps the "discoverability" of unknown products is on balance worth the risk).
Whatever the reason, Amazon certainly knows about other ranking systems, so it has to have made this choice deliberately.
don't you think if they switched to this algorithm you would see page after page of cheap crap with 500 5-star reviews each from the 500 resident fake reviewers?
...though now that I've written it out like that, it is obvious this would be a higher burden on fakers, and would cut out some of the products doing this.
It helped me push Chinese crap way below the fold and keep well-researched items at the top.
I loved it, but I noticed that not too long after (maybe a year?) they removed it. My sense was that small businesses were complaining that the system was unfairly benefiting larger businesses. E.g., if you have a new product, using the lower bound or something similar is unfair because it penalizes you for being new, relative to established players.
Honestly, I can see that perspective too (which is missing from the linked piece), and am not really sure what to do about it. The linked piece comes at it from the perspective of consumer risk minimization, and not from the perspective of the producer, which Amazon also has to contend with.
The solution is probably to allow sorting by both.
"The following formula is used to calculate the Top Rated 250 titles. This formula provides a true 'Bayesian estimate', which takes into account the number of votes each title has received, minimum votes required to be on the list, and the mean vote for all titles:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
Where:
R = average for the movie (mean) = (Rating)
v = number of votes for the movie = (votes)
m = minimum votes required to be listed in the Top 250
C = the mean vote across the whole report"
http://www.imdb.com/help/show_leaf?votestopfaq&pf_rd_m=A2FGE...
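For what it's worth, that formula translates directly to code; the numbers below are made up, just to show how a small vote count pulls a title toward the site-wide mean:

```python
def imdb_weighted_rating(R, v, m, C):
    """Bayesian-estimate weighted rating, per the IMDB help text quoted above."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical: a title rated 9.0 by only 500 voters, with m = 1000 and a
# site-wide mean of 6.8, gets pulled strongly toward that mean.
print(imdb_weighted_rating(R=9.0, v=500, m=1000, C=6.8))  # ~7.53
```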
When we were building the NextGlass app, I took much of this into consideration for giving wine and beer recommendations.
We recently ran the query on the Untappd database of 500 million checkins and it yielded some interesting results. The "whales" (rare beers) bubbled to the top. I assume this is because users who have to trade and hunt down rare beers are less likely to rate them lower. The movie industry doesn't have to worry about users rating "rare movies", but I would think Amazon might have the same issue with rare products.
That is also a problem with movie ratings (I just noticed that you mentioned movies). Critics (and audiences) at pre-screenings are generally significantly more favorable to a movie than an equivalent group in a normal theater. I would not be surprised if the same thing applied to foreign movies, and other types of "whales".
Works amazingly well and so easy to calculate vs say the way IMDb rates things.
https://stackoverflow.com/questions/10029588/python-implemen...
The accepted answer uses a hard-coded z-value.
In the event that you want a dynamic z-value like the ruby solution offers, I just submitted the following solution:
https://stackoverflow.com/questions/10029588/python-implemen...
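For anyone who just wants the idea without digging through the thread, a Wilson lower bound with a dynamic z looks roughly like this (a sketch, assuming scipy is available for the normal quantile; not the exact code from that answer):

```python
from math import sqrt
from scipy.stats import norm

def wilson_lower_bound(pos, total, confidence=0.95):
    """Lower bound of the Wilson score interval on the positive fraction,
    with z derived from the confidence level instead of hard-coded."""
    if total == 0:
        return 0.0
    z = norm.ppf(1 - (1 - confidence) / 2)  # ~1.96 for 95%
    phat = pos / total
    return ((phat + z * z / (2 * total)
             - z * sqrt((phat * (1 - phat) + z * z / (4 * total)) / total))
            / (1 + z * z / total))

print(wilson_lower_bound(90, 100))  # ~0.83
```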
I suppose one could arbitrarily assign ratings above a certain threshold to "positive" and those below to "negative", and use the same algorithm, but I imagine there's probably a similar algorithm that works directly on numeric ratings. Anyone know? Or if you must convert the numeric ratings to positive/negative, how does one find the best cutoff value?
Inspired by Evan's post, I wrote "How Not to Sort by Popularity" a few weeks ago: https://medium.com/@jbochi/how-not-to-sort-by-popularity-927...
Often you also want to give a configurable advantage or handicap to new entries.
Well, you can't answer that question without making assumptions. And these seem to be missing in the article.
The Bayesian approach would be to assume the true vote distribution is binomial and use a beta prior (possibly with Jeffreys' degenerate bimodal prior). Then as the total number of votes increases the posterior distribution tightens. Ranking score is prob(score>0).
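A sketch of one way to read that (my interpretation: "prob(score > 0)" as the posterior probability that the true positive fraction exceeds 1/2, with a Beta(1/2, 1/2) Jeffreys prior; scipy assumed):

```python
from scipy.stats import beta

def bayes_rank(pos, neg, a=0.5, b=0.5):
    """Posterior over the true positive fraction is Beta(pos + a, neg + b);
    rank by the posterior probability that it exceeds 1/2."""
    return beta.sf(0.5, pos + a, neg + b)

# The posterior tightens with volume: the same 60/40 split is far more
# convincing at 1000 votes than at 10.
print(bayes_rank(6, 4))      # modest confidence
print(bayes_rank(600, 400))  # essentially 1
```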
Obviously it's not entirely analogous but I would not be surprised if it mapped over to this domain.
Edit: on mobile so late on the link to Kenneth Arrow https://en.m.wikipedia.org/wiki/Arrow%27s_impossibility_theo...
In other words - I couldn't care less how Joe Blow rated the product - but it's important to me how like-minded people rated it.
Also - Amazon is not making a mistake with its ratings.
Amazon is less interested in selling you the product most relevant to you.
Amazon is more interested in boosting its bottom line, moving stalled inventory, or moving higher-margin inventory.
I've made a simple plot in Excel here: http://i.imgur.com/adjaLQ9.png
The number of up-votes remains the same, while the number of down-votes increases linearly. The declining grey line is the score.
x: [0, 100] y: [0, 100] z: [0, 1]
https://www.google.com/search?q=graph+((x+%2B+1.9208)+%2F+(x...)
sort_score = (pos / total) + (W * log(total))
Here, W is the weighting (scaling) factor and total = positive + negative. See http://www.imdb.com/help/show_leaf?votes for details.
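As a toy illustration (W and the vote counts here are invented), the rule is the positive fraction plus a bonus that grows slowly with volume:

```python
from math import log

def sort_score(pos, neg, W=0.1):
    """Positive fraction plus a volume bonus scaled by W, per the formula above."""
    total = pos + neg
    if total == 0:
        return 0.0
    return pos / total + W * log(total)

# A 90%-positive item with 1000 votes outranks a 100%-positive item with 3 votes.
print(sort_score(900, 100))  # 0.9 + 0.1*ln(1000) ~ 1.59
print(sort_score(3, 0))      # 1.0 + 0.1*ln(3)    ~ 1.11
```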
score = (pos + a) / (tot + b).
Where a<b, e.g. a=1, b=2.
See this post why that formula follows from Bayesian reasoning: http://julesjacobs.github.io/2015/08/17/bayesian-scoring-of-...
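A one-liner version of that pseudo-count score, to make the behaviour at low vote counts concrete:

```python
def smoothed_score(pos, total, a=1, b=2):
    """(pos + a) / (total + b): with a=1, b=2 this is Laplace's rule of
    succession, so an unrated item starts at 0.5 rather than at 0 or 1."""
    return (pos + a) / (total + b)

print(smoothed_score(0, 0))     # 0.5   -- no information yet
print(smoothed_score(1, 1))     # 0.667 -- one upvote, still far from 1.0
print(smoothed_score(90, 100))  # ~0.89
```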
Hal Varian (UC Berkeley) has some 1990s refs, which remain good. "GroupLens" is the project/product.
Randy Farmer literally wrote the book on the topic. There's a book, blog, and wiki.
Frankly, Farmer's work, good as it is, largely reinforces my view that Varian captured the essence of the problem, which I've summarised in my opening 'graph. You cannot algorithmically correct for crap quality assessment.
If you're interested in the long-form answer, the fields are epistemology (philosophy) and epistemics (science).
Enjoy!
http://people.ischool.berkeley.edu/~hal/Papers/publish.html
http://people.ischool.berkeley.edu/~ngood/