Gaussian distributions form a monoid (opens in new tab)

(izbicki.me)

45 pointsdroque10y ago14 comments

14 comments

For anyone else who didn't understand what the author meant by a binary operation on Gaussian distributions: he is really talking about combining sufficient statistics. In fact, this property is true for any exponential family (Gamma, Poisson, etc).

graycat10y ago

> "Gaussians are ubiquitous in learning algorithms because they accurately describe most data."

Amazing!

Uh, might want to reconsider and check that.

donovanr10y ago

It's not necessary for the distributions to be normal for this approach to be relevant. A lot of the (very powerful) Gaussian machinery for e.g. graphical models, can still be rigorously applied to a many non-normal distributions by using the Nonparanormal transform: http://jmlr.csail.mit.edu/papers/volume10/liu09a/liu09a.pdf

morpher10y ago

This is especially true if you have multimodal distributions (such as the income datasets in the article). Although its true that there are simple algebraic properties that allow you to calculate the mean and variance of the total population given the same for sub-populations, it often isn't the case that this will be a good fit to the data. That being said, this is a useful property for parallelizing gaussian fits.

graycat10y ago

> That being said, this is a useful property for parallelizing gaussian fits.

Might you be able to clarify this sentence? I have zero idea what might be meant:

> this is a useful property

What is the antecedent of "this", that is, in this phrases, what does "this" refer to?

> gaussian fits

What is a gaussian fit? I have no idea. I'm comfortable with the Lindeberg-Feller version of the central limit theorem, the weak and strong laws of large numbers, martingale theory, the martingale proof of the strong law of large numbers, the Radon-Nikodym theorem, and the fact that sample mean and variance are sufficient statistics for the Gaussian distribution, but, still, I can't even guess what a gaussian fit is.

> parallelizing

I can guess that what is meant by "parallelizing" is the computer software approach of having one program try to get some work done faster by starts several threads or tasks in one or several processor cores, processors, or computers. Okay. But what is it about "gaussian fits" that might commonly call for "parallelizing"?

mrow8410y ago

Fitting a distribution to data is pretty common parlance in my experience, and there is even a wikipedia article with a relevant name [0].

I presume that the parallelisation point was with reference to the point made by the article, that the calculation of means and variances can be parallelised, so large datasets can be dealt with efficiently.

Is there something else you are missing?

[0] http://en.wikipedia.org/wiki/Distribution_fitting

1 more reply

gone3510y ago

Nor are they ubiquitous in learning algorithms either. And thankfully so.

jessaustin10y ago

Yeah NNT will call out the pitchforks for this guy. In fairness, though, in this area, this usually "works" (using the ML meaning of that verb).

graycat10y ago

Maybe we just need to note that mean and variance can be useful in using some data even if the distribution of that data is not Gaussian.

PaulHoule10y ago

Back when I was a grad student in physics we used similar thinking to prove the central limit theorem w/ the renormalization group.

drostie10y ago

So the basic math here looks like this: in classical probability theory, when you've got multiple random variables `X, Y` they have a joint probability `f(x, y)` such that the probability that (X, Y) is in the square (x ± dx/2, y ± dy/2) is `f(x, y) dx dy`. Independent distributions have probabilities which factor: `f(x, y) = g(x) * h(y)`, to fit the general rule that independent discrete events have probabilities which multiply. To calculate probabilities for a function like `Z = X + Y` you have to invent a new probability-distribution function `p(z) = ∫ f(x, z - x) dx`, which for independent events appears as a convolution of the two probability-densities.

All of the useful statistics for a probability distribution can be gleaned from the probability-density function by taking the Fourier transform and then (when it's not too bad) the (complex) natural logarithm -- or, when it's easier, just `f(s) = log(E[exp(s X)])` (having `s X` rather than `i s X` up-top). These are called cumulant-generating functions or CGFs.

If you have a sum of two independent random variables, the CGF of the sum is the sum of their CGFs. This means that the Taylor polynomial of the CGFs have components which simply add together. Take derivatives of the CGF to get the "cumulants", which are pseudo-linear (linear in independent random variables, with some relationship `f(k X) = k^n f(X)`). So these cumulants become really convenient for characterizing the distribution as a whole: in fact if you have all of them you can recover the original distribution.

Now if you describe your distributions in terms of its cumulants, you now have a special monoid:

    data Distr n = Distr [n]
    instance (Num n) => Monoid (Distr n) where
        mempty = Distr (repeat 0)
        mappend = zipWith (+)

Gaussian distributions in particular have a form for their CGF which is preserved by this monoid operation, namely:

    if X ~ Gauss(m, s^2) then ln(E(exp(k X))) = m k + s^2 k^2

(Note that the leading term has to be 0, as if s = 0 we have ln(E(1)) = ln(1) = 0.) Another way to state this is "the Fourier transform of a Gaussian is another Gaussian."

So that is super-simple, and you can already see the hints of the central limit theorem emerging: the mean of N identically-distributed variables X_i will have a CGF related to the coefficients c_i of the original distribution by:

    ln(E(k * sum_i X_i / N)) = N sum_m c_m k^m / N^m

so the Taylor expansion gets attenuated by successive powers of N: you can approximate the CGF with truncation.

So, the set of all distributions form a monoid under convolution if you allow the Dirac delta-function to act as `mempty` -- in a certain representation, this appears as the termwise-summing monoid. Gaussian distributions are a sub-monoid of this larger monoid, and they are not the only one: we could easily go out to 3, 4, or 5 cumulants before setting the rest to 0 and finding a similar sub-family.

SamReidHughes10y ago

That diagram of a Gaussian distribution is... something special.

j / k navigate · click thread line to collapse

14 comments

simonbyrne10y ago

graycat10y ago

> "Gaussians are ubiquitous in learning algorithms because they accurately describe most data."

Amazing!

Uh, might want to reconsider and check that.

donovanr10y ago

morpher10y ago

graycat10y ago

> That being said, this is a useful property for parallelizing gaussian fits.

Might you be able to clarify this sentence? I have zero idea what might be meant:

> this is a useful property

What is the antecedent of "this", that is, in this phrases, what does "this" refer to?

> gaussian fits

> parallelizing

mrow8410y ago

Fitting a distribution to data is pretty common parlance in my experience, and there is even a wikipedia article with a relevant name [0].

Is there something else you are missing?

[0] http://en.wikipedia.org/wiki/Distribution_fitting

1 more reply

gone3510y ago

Nor are they ubiquitous in learning algorithms either. And thankfully so.

jessaustin10y ago

Yeah NNT will call out the pitchforks for this guy. In fairness, though, in this area, this usually "works" (using the ML meaning of that verb).

graycat10y ago

Maybe we just need to note that mean and variance can be useful in using some data even if the distribution of that data is not Gaussian.

PaulHoule10y ago

Back when I was a grad student in physics we used similar thinking to prove the central limit theorem w/ the renormalization group.

drostie10y ago

Now if you describe your distributions in terms of its cumulants, you now have a special monoid:

    data Distr n = Distr [n]
    instance (Num n) => Monoid (Distr n) where
        mempty = Distr (repeat 0)
        mappend = zipWith (+)

Gaussian distributions in particular have a form for their CGF which is preserved by this monoid operation, namely:

    if X ~ Gauss(m, s^2) then ln(E(exp(k X))) = m k + s^2 k^2

(Note that the leading term has to be 0, as if s = 0 we have ln(E(1)) = ln(1) = 0.) Another way to state this is "the Fourier transform of a Gaussian is another Gaussian."

    ln(E(k * sum_i X_i / N)) = N sum_m c_m k^m / N^m

so the Taylor expansion gets attenuated by successive powers of N: you can approximate the CGF with truncation.

SamReidHughes10y ago

That diagram of a Gaussian distribution is... something special.

j / k navigate · click thread line to collapse