When it comes down to it, the end goal is just to predict whether someone would like something, and/or present them with a list of the things you are most certain they'd like. In the analysis context (as with much of HCI), the scales are being used to draw qualitative conclusions about tasks and preferences, so it makes sense to directly attack erroneous modeling and assumptions, because they can lead to wrong conclusions. But for prediction, erroneous modeling only really matters to the extent that it means we're: 1) optimizing the wrong thing; or 2) doing the optimization suboptimally.
#1 is important to get right, but #2 is more of a "whatever works" sort of thing, and we even have fairly good automatic methods for deciding. If treating ratings as numerical data empirically leads to good predictions, then it's fine to do; if not, then it's best avoided. Many recent systems avoid even having a human make those kinds of decisions, by throwing in a giant bag of possible ways of slicing the data, and then handing off the decision about which of them to use, and how to weight them, to an ensemble method. Iirc, that's what the winning Netflix-prize entry was like.
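A minimal sketch of that kind of blending, with two hypothetical base predictors (all numbers invented for illustration); the ensemble, not a human, decides how much weight each "slicing" of the data deserves:

```python
import numpy as np

# Hypothetical held-out predictions from two base models -- say one that
# treats ratings as numeric and one that treats them as ordinal buckets --
# plus the true ratings. All values here are made up.
pred_numeric = np.array([3.8, 2.1, 4.5, 1.9, 3.2])
pred_ordinal = np.array([4.0, 2.0, 5.0, 1.0, 3.0])
true_ratings = np.array([4.0, 2.0, 5.0, 2.0, 4.0])

# Stack the base predictions and fit blend weights by least squares:
# whichever representation predicts better empirically gets more weight.
X = np.column_stack([pred_numeric, pred_ordinal])
weights, *_ = np.linalg.lstsq(X, true_ratings, rcond=None)
blended = X @ weights
```

Since each base predictor alone is a special case of the blend (weight 1 on it, 0 on the other), the blended predictions can never do worse on the fitting data than the best single base model.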
#3: Suggest items that they are likely to really love.
This is subtly different from predicting what the user is most likely to like. To optimize an RMSE score, you are better off suggesting a sure "4" than a risky "5". For buying an expensive item like a car or a stereo, the safe bet might be a good approach. But for books, music, or movies --- easily sampled, one of a series --- I'd be much more excited by a system that can predict A+ items with even 25% probability than one that offers up straight B items with 80% consistency. And I can see a difference between someone who hasn't voted and someone who voted 3 stars in a 5-star system.
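With made-up numbers, the incentive is easy to check: under squared-error scoring the best single prediction for an item is its expected rating, so a risky item carries an irreducible RMSE equal to the standard deviation of its outcome, while the sure thing scores a clean zero.

```python
import math

# Invented example: a "safe B" item everyone rates 4, versus a "risky A+"
# that 25% of users rate 5 and the rest rate 2.
risky_outcomes = {5.0: 0.25, 2.0: 0.75}

# Under squared-error scoring, the optimal single prediction is the mean...
risky_mean = sum(r * p for r, p in risky_outcomes.items())

# ...and the irreducible RMSE is the outcome's standard deviation.
risky_rmse = math.sqrt(sum(p * (r - risky_mean) ** 2
                           for r, p in risky_outcomes.items()))
safe_rmse = 0.0  # predicting 4 for the sure thing is exactly right
```

So a system graded on RMSE is pushed toward the safe item even when the risky one is the more exciting recommendation.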
Well, I would say it kind of is, but it kind of isn't.
For many things, most people don't bother to think about the difference between varieties of crap unless you force them to, so in a 4-star system 2 stars becomes the "mediocre" rating while 3 and 4 differentiate between the good ones. How many people really care about the difference between a D and an F? Likewise, do people really spend that much time making sure their 1- and 2-star ratings form a consistent philosophy of relative crappiness?
"Forcing" people to make a choice about something that is supposed to be a subjective categorization to begin with is probably not helping anything. If you want to force a like/dislike, you should get binary data and be done with it.
"Would you eat this? Yes/No."
That's easy to answer accurately. Everyone will agree on what you mean. If you haven't answered, that means you don't have an opinion. Beyond that, semantic ambiguity is impossible to avoid, and it gets worse the more numbers you add.
Also, this observation could be interpreted a bit differently:
> The probability that a user changes her rating between 2 and 3 is almost 0.35 while the probability she changes between 4 and 5 goes down to almost 0.1. This is a clear indication that users perceive that the distance between a 2 and a 3 is much lower than between a 4 and a 5.
It seems a bit counterintuitive that the distance between 3 (neutral) and 4 (positive) is smaller than between 4 and 5 (very positive). You could also interpret this differently. When a user changes his mind, the change has to be significant enough for him to also change the review (is the review now a little bit wrong, or very wrong?). This means he might actually see the difference between 3 and 4 as larger than between 4 and 5, large enough for him/her to change the review. This effect is dampened by how often users actually change their minds this way. If you look at it that way, then the number of pairwise inconsistencies is the wrong way to measure the distance between these ordinal categories in this particular case, because there might actually be two mechanisms that cancel each other out.
Additionally, a great body of work in behavioral psych tells us that humans have a tough time measuring preferences on any absolute scale; however, we can consistently compare two items as better or worse (particularly when they're of the same type, instead of apples versus oranges). "Riffle independence" is a recent method for modeling these kinds of preference distributions, and has been used quite successfully for social curation of the blogosphere - i.e., showing the best set of blogs that span the topic space and have little redundancy.
There is a great data set to test this theory... Netflix. This article shouldn't end by just soliciting opinions, but with the author's results on the Netflix data set.
If you give me a few months, I might get there. But this is the reason I wrote a blog post and not a paper ;-)
Fair enough. I'll keep my eyes peeled for a paper in a few months. :-)
I would love to see Amazon go to a binary recommendation system (thumbs up/thumbs down) with a free text review.
EDITED for embarrassing grammar mistakes
When I was building a recommender for a pet supplies company, I used a log-likelihood test.
Given that they bought product A, what are the chances they would buy B?
Also, since we had thousands of products, I sometimes looked at correlation charts or even a simple histogram to easily pinpoint which products and quantities were purchased after the initial purchase of A. It made crunching millions of transactions easier.
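A sketch of that kind of test, assuming the common 2x2 co-occurrence formulation (Dunning's log-likelihood ratio, widely used for market-basket scoring); the basket counts below are invented:

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G-test) on a 2x2 basket table:
    k11 = baskets with both A and B, k12 = A without B,
    k21 = B without A,               k22 = neither."""
    def sum_logs(*counts):
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)
    rows = sum_logs(k11 + k12, k21 + k22)
    cols = sum_logs(k11 + k21, k12 + k22)
    cells = sum_logs(k11, k12, k21, k22)
    return 2.0 * (cells - rows - cols)

# Invented counts: A and B bought together far more often than chance.
score = llr_2x2(100, 10, 10, 1000)
```

A table consistent with independence scores near zero; strong co-occurrence scores high, so ranking B-candidates by this score surfaces the genuinely associated products rather than just the popular ones.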
Then the article is an example of why computer people should be careful on where they learn their statistics!
The article is awash in hand-wringing about "interval scale" and "ordinal scale" data without being at all clear on just why someone should care, and for all the rest of the article they should not care.
So, the article has:
"For ordinal data, one should use non-parametric statistical tests which do not assume a normal distribution of the data."
Mostly nonsense. In statistical testing, the normal distribution arises mostly via the central limit theorem, which has quite meager assumptions that are trivially satisfied by "Likert" scale data.
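That claim is easy to check empirically. A quick simulation (with a deliberately skewed, invented rating distribution): means of modest samples already behave as the CLT says, with spread shrinking like sigma over the square root of n.

```python
import random
import statistics

random.seed(0)

# An invented, heavily skewed "Likert" population -- nothing like a normal.
ratings = [1] * 50 + [2] * 10 + [3] * 10 + [4] * 10 + [5] * 20

# The CLT needs only finite variance, which a bounded 1-5 scale
# trivially has; take 5000 sample means of size n = 30.
n = 30
sample_means = [statistics.mean(random.choices(ratings, k=n))
                for _ in range(5000)]

pop_sd = statistics.pstdev(ratings)           # population sigma
sd_of_means = statistics.stdev(sample_means)  # close to sigma / sqrt(n)
```

Plot a histogram of `sample_means` and it looks bell-shaped despite the lopsided population; that is exactly why normal-theory tests on Likert means usually work fine.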
Then there is:
"Furthermore, because of this it makes no sense to report means of likert scale data--you should report the mode."
Nonsense: the law of large numbers also has especially meager assumptions, likewise trivially satisfied by Likert scale data. If you want to estimate an expectation, then definitely use the mean and not the mode.
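A quick illustration with an invented skewed distribution: the sample mean converges on the expectation, while the mode lands somewhere else entirely.

```python
import random
import statistics

random.seed(1)

# Invented skewed population: expectation 3.75, but 5 is the single
# most common value.
population = [1] * 10 + [3] * 30 + [4] * 25 + [5] * 35
expectation = statistics.mean(population)  # 3.75

sample = random.choices(population, k=10000)
sample_mean = statistics.mean(sample)  # law of large numbers: near 3.75
sample_mode = statistics.mode(sample)  # almost surely 5
```

Reporting the mode (5) here would overstate the typical rating by more than a full star; the mean is the estimator that actually targets the expectation.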
Beyond the law of large numbers, there is also the classic
Paul R. Halmos, "The Theory of Unbiased Estimation", 'Annals of Mathematical Statistics', Volume 17, Number 1, pages 34-43, 1946.
that makes clear that the mean is the most accurate way to estimate expectation.
If you want to use the mode for something, then say what the heck you want to use it for and then justify using the mode as the estimator.
There is:
"In order to defend that ratings can be treated as interval, we should have some validation that the distance between different ratings is approximately equal."
Nonsense. Instead, you get a 'rating', say, an integer in [1, 5]. Now you have it. Use it. For
"validation that the distance between different ratings is approximately equal."
why bother? Besides, "the distance" is undefined here!
For the
"This is a clear indication that users perceive that the distance between a 2 and a 3 is much lower than between a 4 and a 5."
the writer is just fishing in muddy waters.
There is
"All the neighbor based methods in collaborative filtering are based on the use of some sort of distance measure. The most commonly used are Cosine distance and Pearson Correlation. However, both these distances assume a linear interval scale in their computations!"
Nonsense. Just write out the definitions of expectation, variance, covariance, and Pearson correlation and see that all that's needed is for the expectation of the squared random variables to be finite. There is nothing about "interval scale" in the assumptions.
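To make that concrete, here is Pearson correlation written straight from the definitions, applied to two invented users' star ratings; the only thing the formula consumes is numbers with finite second moments:

```python
import math

def pearson(xs, ys):
    """Pearson correlation from the definitions: means, a covariance,
    two variances. Nothing here asks whether the inputs are 'interval'."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / math.sqrt(vx * vy)

# Two hypothetical users' star ratings on the same five items:
user_a = [5, 4, 4, 2, 1]
user_b = [4, 5, 3, 2, 1]
```

Feed it ordinal 1-5 ratings and it computes a perfectly well-defined similarity, which is all a neighbor-based recommender asks of it.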
But why calculate Pearson correlation at all? When you dig into that, you find you basically just want some MSE (mean squared error) convergence, which again makes no assumptions about "interval scale" data.
There is
"This is my favorite one... The most commonly accepted measure of success for recommender systems is the Root Mean Squared Error (RMSE). But wait, this measure is explicitly assuming that ratings are also interval data!"
Nonsense. There is no such assumption about MSE. The main point about MSE is just that any sequence of random variables (e.g., estimates) that converges in MSE will have a subsequence that converges almost surely. In practice, convergence in MSE is convergence almost surely, and that's the best convergence there can be. So, if your estimates are good in MSE, then essentially always in practice they are close in every sense. Nowhere in this argument is an assumption about "interval data".
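For reference, RMSE itself is just the following (with invented numbers); the computation subtracts, squares, averages, and takes a root, and nowhere does it invoke an interval-scale premise:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error: difference, square, average, root."""
    return math.sqrt(sum((p - a) ** 2
                         for p, a in zip(predicted, actual)) / len(actual))

# Invented predictions against invented observed ratings:
actual = [4, 2, 5, 3]
predicted = [3.5, 2.5, 4.5, 3.0]
error = rmse(predicted, actual)  # about 0.43
```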
This article sounds like 'statistics' from some psycho researcher who has an obsession about interval scales and a phobia about using ordinal scale data! In particular he has high anxieties about being charged with heresy by the Statistical Religious Police! The guy needs 'special help'!
Did I mention that the article is nonsense?
Really, you can be positive, constructive and even happy in your life without sounding less smart by doing so.
But there isn't a lot of room to respond with substance because the article is, did I mention, nonsense.
There is a reason: several paths led to some of the more central topics in probability and statistics. Such paths included gambling, astronomical observations, psychological testing, signal processing, control theory, quality control, 'statistical' physics, quantum mechanics, mathematical models in the social sciences, experimental design (especially in agriculture), mathematical finance, and more. In addition, there is now a very solid, polished field of probability, stochastic processes, and their statistics.
Some of these paths got lost in the swamp on their way to some reasonably clear understanding. For the solid material, so far that is rarely taught: The prerequisites need quite a lot of pure math, and then the pure math departments rarely follow through with the probability, stochastic processes, and statistics.
Early in my career, I was dropped into parts of the swamp, but later I got the rest of the pure math prerequisites and good coverage of the solid, polished material.
So, at this point I see both the swamp and the solid, polished material.
Net, the paper is from the swamp, and I responded with just a little of the solid, polished material.
For the swamp, not a lot of discussion is justified. The best response is the one I gave: The stuff from the swamp is nonsense. That may sound harsh, but it's on the center of the target.
Is it just me who starts reading articles like this, only to crash into sentences like this one:
"A Likert scale is a unidimensional scale on which the respondent expresses the level of agreement to a statement - typically in a 1 to 5 scale in which 1 is strongly disagree and 5 is strongly disagree."
And think snarky comments like "I should keep reading this article. Do I 1) strongly disagree or 5) strongly disagree?" (Then flick back to the site that linked to it to post that snarky comment.)