If you pick a point to the left of your sample, then moving your estimate to the left will improve your mean squared error on average. If you pick a point to the right of your sample, then moving your estimate to the right will improve your mean squared error as well.
I'm still trying to come to grips with this, and what follows is conjecture on my part. Imagine sampling many points from a 3-D Gaussian distribution (with identity covariance), making a nice cloud of points. Next choose any point P. P could be close to the cloud or far away, it doesn't matter. No matter which point P you pick, if you adjust all the points from your cloud of samples in accordance with the James-Stein formula, moving them all towards your chosen point P by various amounts, then on average they will move closer to the center of your Gaussian distribution. This happens no matter where P is.
The cloud is, of course, centered around the center of the Gaussian distribution. As the points are pulled towards this arbitrary point P, some will be pulled away from the center of the Gaussian, some will be pulled towards the center, and some will be squeezed so that they are pulled away from the center in the parallel direction but squeezed closer in the perpendicular direction. Anyhow, apparently everything ends up, on average, closer to the center of the Gaussian in the end.
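The cloud-and-P picture can be checked numerically. Here is a rough sketch, not a proof: it assumes unit variance and uses the standard James-Stein formula for shrinking toward a point `p`. The true mean `mu` and the two targets `p` are made-up values, picked only for illustration.

```python
import random

def js_shrink(x, p):
    """Shrink observation x toward point p with the James-Stein factor (unit variance assumed)."""
    n = len(x)
    diff = [xi - pi for xi, pi in zip(x, p)]
    sq_dist = sum(d * d for d in diff)
    factor = 1.0 - (n - 2) / sq_dist
    return [pi + factor * d for pi, d in zip(p, diff)]

def sq_err(est, mu):
    """Squared Euclidean distance between an estimate and the true mean."""
    return sum((e - m) ** 2 for e, m in zip(est, mu))

def compare_mse(mu, p, trials=50_000, seed=0):
    """Return (naive MSE, James-Stein MSE), both estimated on the same simulated samples."""
    rng = random.Random(seed)
    naive_total = js_total = 0.0
    for _ in range(trials):
        x = [rng.gauss(m, 1.0) for m in mu]
        naive_total += sq_err(x, mu)
        js_total += sq_err(js_shrink(x, p), mu)
    return naive_total / trials, js_total / trials

if __name__ == "__main__":
    mu = [1.0, 2.0, 3.0]  # hypothetical true mean
    for p in ([0.0, 0.0, 0.0], [3.0, 0.0, -1.0]):  # arbitrary shrinkage targets
        naive, js = compare_mse(mu, p)
        print(p, round(naive, 3), round(js, 3))
```

The gain gets smaller the further P is from the true mean, so for a very distant P you would need many more samples before the difference shows up above the simulation noise.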
I'm not entirely sure what to make of this result. Perhaps it means that mean squared error is a silly error metric?
https://www.naftaliharris.com/blog/steinviz/
https://www.youtube.com/watch?v=cUqoHQDinCM (this video actually references the original post)
My takeaway is that the points which get worse as they are pulled towards point P lie in some region R. As the number of dimensions increases, region R's volume shrinks as a percentage of the total cloud volume, making it much less likely that a sample is drawn from that region. In other words, you are more likely to sample points which move closer to the center than move away, which is why the estimator is an improvement on average.
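One way to probe this takeaway is to count how often a single sample actually ends up closer to the true mean after shrinking, for a few dimensions. This is only a sketch under assumptions of my own: unit variance, shrinking toward the origin, and a true mean of all ones. Other choices will give different numbers.

```python
import random

def fraction_improved(n, trials=5_000, seed=0):
    """Fraction of samples whose James-Stein estimate is closer to the true mean than the raw sample."""
    rng = random.Random(seed)
    mu = [1.0] * n  # assumed true mean, purely illustrative
    improved = 0
    for _ in range(trials):
        x = [rng.gauss(m, 1.0) for m in mu]
        sq_norm = sum(xi * xi for xi in x)
        factor = 1.0 - (n - 2) / sq_norm  # shrink toward the origin
        err_raw = sum((xi - m) ** 2 for xi, m in zip(x, mu))
        err_js = sum((factor * xi - m) ** 2 for xi, m in zip(x, mu))
        if err_js < err_raw:
            improved += 1
    return improved / trials

if __name__ == "__main__":
    for n in (3, 10, 50):
        print(n, fraction_improved(n))
```

Note that "fraction of samples that improve" is not the same quantity as mean squared error; the estimator's guarantee is about the average error, and this count is just a way to look at the region-R intuition.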
The mean of the n-dimensional gaussian is an element of R^n, an unbounded space. There's no uninformed prior over this space, so there is always a choice of origin implicit in some way...
As you say, you can shrink towards any point and you get a valid James-Stein estimator that is strictly better than the naive estimator. But if you send the point you are shrinking towards to infinity you get the naive estimator again. So it feels like the fact you are implicitly selecting a finite chunk of R^n around an origin plays a role in the paradox...
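The limiting behaviour is easy to see for a single observation: with unit variance the estimate moves a distance of (n - 2) / ||x - P|| away from x, which goes to zero as P recedes but never reaches it for finite P. A small sketch, where the observation `x` and the direction P moves in are arbitrary choices of mine:

```python
import math

def js_shrink(x, p):
    """Shrink observation x toward point p with the James-Stein factor (unit variance assumed)."""
    n = len(x)
    diff = [xi - pi for xi, pi in zip(x, p)]
    sq_dist = sum(d * d for d in diff)
    factor = 1.0 - (n - 2) / sq_dist
    return [pi + factor * d for pi, d in zip(p, diff)]

def shift_size(x, p):
    """Euclidean distance between the raw observation x and its shrunken estimate."""
    est = js_shrink(x, p)
    return math.sqrt(sum((e - xi) ** 2 for e, xi in zip(est, x)))

if __name__ == "__main__":
    x = [0.3, -1.2, 2.0]  # one made-up observation
    for t in (1.0, 10.0, 100.0, 1000.0):
        p = [t, 0.0, 0.0]  # shrinkage target pushed further and further out
        print(t, shift_size(x, p))
```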
You get close to it but strictly speaking wouldn’t it always be better than the naive estimator?
You could use an uninformed improper prior.
In fact, in a real world setting I would probably use my first measurement to define the origin, having no other reference to reach for.
Either way, this is absurd unless we have some additional background information about μ other than our sample x itself. But it's easy to resolve the paradox: since the choice of origin is arbitrary (unless it isn't!), select our coordinate system such that x = 0; then the adjustment is also zero, and the James-Stein estimator agrees that μ̂ = x = 0.
Here's one Wikipedia example:
> Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket. Suppose we have independent Gaussian measurements of each of these quantities. Stein's example now tells us that we can get a better estimate (on average) for the vector of three parameters by simultaneously using the three unrelated measurements.
Here's what's bogus about this: the "better estimate (on average)" is mathematically true ... for a certain definition of "better estimate". But whatever that definition is, it is irrelevant to the real world. If you believe you get a better estimate of the US wheat yield by estimating also the number of Wimbledon spectators and the weight of a candy bar in a shop, then you probably believe in telepathy and astrology too.
So it's not that "you get a better estimate of the US wheat yield by estimating also the number of Wimbledon spectators and the weight of a candy bar in a shop", it's simply that you get a better estimate for the combined vector of the three means. (In this case, the vector of the three means is probably meaningless, since the three data sets are entirely unrelated. But we could also imagine scenarios where that vector is meaningful.)
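The distinction between "each coordinate" and "the combined vector" can be made concrete by tracking the squared error per coordinate. The sketch below assumes unit-variance measurements, shrinkage toward the origin, and a hypothetical true mean of (0, 0, 4) that I picked so that one coordinate sits far from the shrinkage target. In this setup the summed error goes down while the error of the outlying coordinate, taken on its own, goes up.

```python
import random

def per_coordinate_mse(mu, trials=50_000, seed=0):
    """Return (naive, js): lists of per-coordinate MSE when shrinking toward the origin."""
    rng = random.Random(seed)
    n = len(mu)
    naive = [0.0] * n
    js = [0.0] * n
    for _ in range(trials):
        x = [rng.gauss(m, 1.0) for m in mu]
        factor = 1.0 - (n - 2) / sum(xi * xi for xi in x)
        for i in range(n):
            naive[i] += (x[i] - mu[i]) ** 2
            js[i] += (factor * x[i] - mu[i]) ** 2
    return [v / trials for v in naive], [v / trials for v in js]

if __name__ == "__main__":
    naive, js = per_coordinate_mse([0.0, 0.0, 4.0])  # hypothetical true means
    print("per coordinate:", [round(v, 3) for v in naive], [round(v, 3) for v in js])
    print("total:", round(sum(naive), 3), round(sum(js), 3))
```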
Am I misunderstanding something?
I am personally bothered by the way it is presented as a "paradox", with the implication that it would have real world applications.
I have zero doubts that you can't improve the estimate of the US wheat yields by looking at some other unrelated things, like candy bars. Presenting the result as if it were a real "improvement" is false advertising.
On the other hand, if we look at related observations, then the improvement is not a paradox at all. Let's say I want to estimate the average temperature in the US and in Europe. They are related, and combining the estimates will yield a better result, to nobody's surprise.
I remember back in 7th or 8th grade I asked my math teacher why we want to minimize the rms error rather than the sum of the absolute values of the errors. She couldn't give me a good answer, but the book All of Statistics does answer why (and under what circumstances) that is the right thing to do.
But I really don't know, it's just an intuition with no formalism behind it.
\hat{\mu} = ReLU(…)
I think that ship has sailed, but I think it's unfortunate that "ReLU(x)" became a popular notation for "max(0,x)". And using the name "rectified linear unit" for basically "positive part" seems like a parody, like insisting on calling water "dihydrogen monoxide".
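For anyone unfamiliar with the notation, ReLU(t) is just max(0, t). Here is a minimal sketch of the standard positive-part James-Stein estimator shrinking toward the origin, written with a plain `max`. I am assuming this is the kind of formula the elided expression refers to; the function name and the `sigma2` parameter are my own.

```python
def positive_part_js(x, sigma2=1.0):
    """Positive-part James-Stein estimate of the mean, shrinking toward the origin."""
    n = len(x)
    sq_norm = sum(xi * xi for xi in x)
    if sq_norm == 0.0:
        # The shrinkage factor is undefined at the origin; return the origin itself.
        return [0.0] * n
    # max(0, .) is the "positive part": never flip the sign of the observation.
    factor = max(0.0, 1.0 - (n - 2) * sigma2 / sq_norm)
    return [factor * xi for xi in x]

if __name__ == "__main__":
    print(positive_part_js([0.1, 0.1, 0.1]))  # small norm: factor is clipped to zero
    print(positive_part_js([3.0, 4.0, 0.0]))  # larger norm: mild shrinkage
```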
Now the specific formula may be complicated. But otherwise I do not understand the “paradox”? Or am I missing something?
Of the set of samples a fixed distance d from the mean of the distribution, strictly less than half of them will be closer to the origin than the mean is, and strictly more than half of them will be further from the origin than the mean is. This is true for all values of d > 0, so the result holds for all samples.
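The fixed-distance claim is straightforward to check by sampling points uniformly on a sphere of radius d around the mean. The sketch below uses a made-up mean at distance 2 from the origin. In 3-D the fraction can also be worked out by hand as 1/2 - d / (4 ||mu||) when d < 2 ||mu||, and it is zero for larger d, so the simulation should land near 0.375 for d = 1.

```python
import math
import random

def fraction_closer_to_origin(mu, d, trials=20_000, seed=0):
    """Fraction of points at distance d from mu that are closer to the origin than mu is."""
    rng = random.Random(seed)
    mu_sq = sum(m * m for m in mu)
    closer = 0
    for _ in range(trials):
        # A normalized Gaussian vector is uniformly distributed on the unit sphere.
        g = [rng.gauss(0.0, 1.0) for _ in mu]
        norm = math.sqrt(sum(v * v for v in g))
        x = [m + d * v / norm for m, v in zip(mu, g)]
        if sum(xi * xi for xi in x) < mu_sq:
            closer += 1
    return closer / trials

if __name__ == "__main__":
    mu = [2.0, 0.0, 0.0]  # hypothetical mean, distance 2 from the origin
    for d in (0.5, 1.0, 2.0, 5.0):
        print(d, fraction_closer_to_origin(mu, d))
```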
The claim that the best estimator must be smooth seemed surprising to me.
It could be "more likely" in the jars example, where the estimates may convey some relevant information about each other. But consider this example from Wikipedia:
"Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket. Suppose we have independent Gaussian measurements of each of these quantities. Stein's example now tells us that we can get a better estimate (on average) for the vector of three parameters by simultaneously using the three unrelated measurements."