The theoretical tools (and intuitions) we have today for making sense of the distribution of data, developed over the past three centuries, break down in high dimensions. The fact that in high dimensions Gaussian distributions are not "clouds" but actually "soap bubbles" is a perfect example of this breakdown. Can you imagine trying to model a cloud of high-dimensional points lying on or near a lower-dimensional manifold with soap bubbles?
If the data is not only high-dimensional but also non-linearly entangled, we don't yet have "mental tools" for reasoning about it:
* https://medium.com/intuitionmachine/why-probability-theory-s...
* https://news.ycombinator.com/item?id=15620794
[0] See kgwgk's comment below.
More precisely: it is the mass that is “concentrated” at the edge, not the density. In the Gaussian case the distribution “gets more and more dense in the middle” regardless of the number of dimensions. However, in high dimensions the volume in the middle is so low that essentially all the mass is close to the surface of the hypersphere.
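A quick simulation makes the mass-vs-density distinction concrete (a sketch, assuming NumPy; the 10% shell width is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
shell_mass = {}
for d in (2, 10, 100, 1000):
    x = rng.standard_normal((10_000, d))
    r = np.linalg.norm(x, axis=1)
    # Fraction of samples whose radius lies within 10% of sqrt(d),
    # the "soap bubble" shell where the mass concentrates.
    shell_mass[d] = np.mean(np.abs(r - np.sqrt(d)) < 0.1 * np.sqrt(d))
    print(f"d={d:4d}  P(|r - sqrt(d)| < 0.1*sqrt(d)) = {shell_mass[d]:.3f}")
```

The density is still maximal at the origin for every d, but the fraction of mass in the thin shell climbs toward 1 as d grows.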
High dimensions are weird.
(seems to require a Google account, sorry in advance)
My intuition says many of those problems have high dimensionality, but I'm not really confident about my intuition here.
A uniform Gaussian presupposes that the variates are either linearly independent (uncorrelated), or that they all have the same fixed linear interaction with each other (a single positive correlation).
If your actual problem has dimension 20, but you've measured it with N dimensions, then that means there are strong interactions between your measured variates, and moreover the intervariate interactions do not have a single fixed interaction strength (like a single Gaussian correlation), but probably vary like a random matrix.
This might be related to the Tracy-Widom[1] distribution somehow. Perhaps the distribution you use to replace the Gaussian should really be something like: first generate a random positive semi-definite matrix C, then generate random data based on different random choices of C.
[1] https://en.wikipedia.org/wiki/Tracy%E2%80%93Widom_distributi...
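A minimal sketch of that recipe, assuming NumPy (the latent dimension and sample sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_meas, n_latent = 20, 5  # hypothetical: 20 measured dims driven by 5 latent ones

# A @ A.T is positive semi-definite for any A, so it serves as a random C.
A = rng.standard_normal((n_meas, n_latent))
C = A @ A.T

# Draw correlated samples x ~ N(0, C) via a Cholesky factor of C.
L = np.linalg.cholesky(C + 1e-6 * np.eye(n_meas))  # jitter keeps Cholesky stable
x = rng.standard_normal((10_000, n_meas)) @ L.T

# The pairwise interaction strengths vary instead of sharing one fixed value.
corr = np.corrcoef(x, rowvar=False)
off_diag = corr[~np.eye(n_meas, dtype=bool)]
print("spread of pairwise correlations:", off_diag.min(), off_diag.max())
```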
Less importantly, the last paragraph says that the probability that two samples are orthogonal is "very high". Being precisely orthogonal is technically a probability-zero event; the author means "very close to orthogonal."
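For what it's worth, "very close to orthogonal" is easy to check numerically (a NumPy sketch; the dimension and the 5-degree threshold are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 1000
u = rng.standard_normal((5_000, d))
v = rng.standard_normal((5_000, d))
# Cosine similarity of independent Gaussian pairs concentrates around 0
# with spread on the order of 1/sqrt(d).
cos = np.sum(u * v, axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
print("exactly orthogonal:", np.mean(cos == 0.0))
print("within 5 degrees of orthogonal:", np.mean(np.abs(cos) < np.sin(np.radians(5))))
```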
There was a good discussion about this problem in the context of Monte Carlo simulations in (1).
The second is that the squared norm has a chisq distribution. There's no point simulating it. You can just plot the pdf, and have all kinds of facts about its mean, var, entropy etc. Also, iirc Shannon had something to say about this.
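Those facts are indeed available without simulation; a quick sanity check, assuming NumPy (the dimension is arbitrary):

```python
import numpy as np

d = 100
# For x ~ N(0, I_d), ||x||^2 follows a chi-squared distribution with d
# degrees of freedom: mean = d and variance = 2d, exactly.
r2 = np.sum(np.random.default_rng(3).standard_normal((100_000, d)) ** 2, axis=1)
print(f"analytic mean/var: {d}, {2 * d}")
print(f"sampled  mean/var: {r2.mean():.1f}, {r2.var():.1f}")
```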
However, I do think these facts are worth a reminder.
On the second point, I agree that the approximation deserves a mention.
[0] http://www.inference.vc/content/images/2017/11/Screen-Shot-2...
They use a gamma distribution which has more probability density near the origin, which causes samples around the origin and interpolations to be more like real input.
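A hedged sketch of that idea, not the paper's actual method: sample a uniform random direction and give it a gamma-distributed radius, which pulls mass toward the origin compared to a Gaussian (the `shape`/`scale` values below are placeholders):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_gamma_radius(n, d, shape=1.0, scale=1.0):
    # Uniform directions come from normalizing Gaussian draws;
    # radii come from a gamma distribution with mass near the origin.
    z = rng.standard_normal((n, d))
    directions = z / np.linalg.norm(z, axis=1, keepdims=True)
    radii = rng.gamma(shape, scale, size=n)
    return directions * radii[:, None]

d = 100
gamma_samples = sample_gamma_radius(10_000, d)
gauss_samples = rng.standard_normal((10_000, d))
print("median radius (gamma):", np.median(np.linalg.norm(gamma_samples, axis=1)))
print("median radius (gauss):", np.median(np.linalg.norm(gauss_samples, axis=1)))
```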
BTW, matplotlib has a nicer facility than add_subplot() for making grid plots:
fig, axes = plot.subplots(nrows=figdims, ncols=figdims)
for dim, ax1 in zip(range(2, MAX_DIM), axes.flatten()[:(MAX_DIM - 2)]):
As a concrete example, if you have a coin that gets heads 99% of the time and you flip it 1M times, you are overwhelmingly likely to get around 10k tails, even though the individual sequences with many fewer tails are each far likelier than the typical sequences.
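That arithmetic can be checked exactly with log-gamma (standard-library Python; the ±500 window around 10k tails is an arbitrary band):

```python
import math

n, p = 1_000_000, 0.01  # 1M flips, 1% chance of tails

def log_binom_pmf(k):
    # log P(exactly k tails), via log-gamma for numerical stability
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

# The single likeliest sequence (zero tails) is astronomically improbable...
print("log10 P(zero tails) =", log_binom_pmf(0) / math.log(10))
# ...while the mass near 10k tails dominates.
mass = sum(math.exp(log_binom_pmf(k)) for k in range(9_500, 10_501))
print("P(9500 <= tails <= 10500) =", mass)
```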
However, it's not a "bubble" in the intuitive sense. He's looking at the magnitude distribution of dots over the entire space, implicitly using the Cartesian coordinate system (discarding the angle and looking at just the magnitude).
If you look at the distribution of dots per volume (or R^N hyper-volume rather), then you'll still have the highest concentration in the center, with no "bubble".
Comes in handy for plotting a radially smooth 'star cluster' without doing polar coordinates and trig. Just plot a load of (x=a_gauss, y=another_gauss, z=another_gauss) and you have a radially smooth object. I don't think any other distribution can do that; it seems to me there is something mathematically profound about it, which I'm sure some mathemagicians have a proper grasp of.
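(The "profound" bit is real: by Maxwell's theorem, the Gaussian is essentially the only distribution whose independent coordinates produce a rotationally symmetric cloud.) A quick isotropy check, assuming NumPy:

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)

# Three independent Gaussians give a radially symmetric "star cluster":
# the product of the three densities depends only on x^2 + y^2 + z^2.
xyz = rng.standard_normal((80_000, 3))

# Isotropy check: every octant should hold about 1/8 of the points.
fractions = [np.mean((np.sign(xyz) == s).all(axis=1))
             for s in itertools.product([-1.0, 1.0], repeat=3)]
print("octant fractions:", np.round(fractions, 3))
```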
The 'co-linear' distortions of other distributions can be seen here in some plots in the test page for my random distribution lib:
He shows that importance sampling will likely fail in high dimensions precisely because samples from a high dimensional Gaussian can be very different than those from a uniform distribution on the unit sphere.
Consider the ratio between a sample at the same point from a 1000D Gaussian and a 1000D uniform distribution over a sphere. If you sample enough times, then the median ratio and the largest ratio will be different by a factor of 10^19. Basically, most samples from the Gaussian will be fairly similar to the uniform. A few will be wildly different.
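That spread is easy to reproduce (a NumPy sketch; the exact exponent depends on how many samples you draw, so this only shows the order-of-magnitude effect, not the 10^19 figure specifically):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 1000
x = rng.standard_normal((10_000, d))
r2 = np.sum(x ** 2, axis=1)

# Gaussian log-density relative to its value at the typical radius sqrt(d);
# this is the factor by which per-sample density ratios vary.
log_ratio = (d - r2) / 2.0
spread = (np.max(log_ratio) - np.median(log_ratio)) / np.log(10)
print(f"largest / median density ratio: 10^{spread:.0f}")
```

A handful of samples near the origin carry astronomically larger density than the typical ones, which is exactly what breaks importance sampling.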
Perhaps I'm misunderstanding both the post and MacKay's book. I'd be happy to be corrected.
This is what he means when he says "practically indistinguishable from uniform distributions on the [unit] sphere." As tgb remarked in another comment, the "unit" bit is incorrect.
I really like and often come back to this talk by Michael Betancourt where the theme is quite similar: https://youtu.be/pHsuIaPbNbY
There was no such thing as an average pilot. If you’ve designed a cockpit to fit the average pilot, you’ve actually designed it to fit no one.
Good enough source here: http://wmbriggs.com/post/18291/

Humans form a very high-dimensional space. I'm not sure what to make of the point about orthogonality in that regard.
Any recommended foundational texts to begin with?
Recommended learning trajectory to get to where this is understandable?