my $variance = $count > 1 ? ($sum_square - ($sum**2/$count)) / ($count-1) : undef;
Taking the difference between two similar numbers loses precision, and in extreme cases squaring the raw numbers could cause overflow. For comparison, see the recently posted:
http://www.python.org/dev/peps/pep-0450/
https://en.wikipedia.org/wiki/Algorithms_for_calculating_var...
https://github.com/nferraz/st/commit/d0fb1bf814fc5940c5aae39...
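To make the cancellation concrete, here is a small Python illustration (mine, not from st or the linked commit): the one-pass sum-of-squares formula above, applied to numbers with a large mean and small spread, versus the shifted two-pass computation.

```python
# Data with a large mean relative to its spread; the true sample variance is 30.0.
data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]

n = len(data)
s = sum(data)
s2 = sum(x * x for x in data)

# Naive one-pass textbook formula: (sum(x^2) - sum(x)^2/n) / (n-1).
# The two terms agree to ~18 digits, so the subtraction cancels
# catastrophically and the result is wildly wrong (it can even go negative).
naive = (s2 - s * s / n) / (n - 1)

# Two-pass version: compute the mean first, then sum squared deviations.
mean = s / n
two_pass = sum((x - mean) ** 2 for x in data) / (n - 1)

print(naive)     # garbage
print(two_pass)  # 30.0
```

With only four values the difference is already dramatic, which is the point of the Wikipedia article linked above.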
double M = Sum/N; /* mean */
double var = (s2 - M*Sum)/(N-1); /* variance */
where s2 is the sum of squares. In most reasonable situations, this approach works fine; it just doesn't handle as wide a range of inputs as is easily achievable with ordinary floating-point doubles. In fairness, the |STAT terms and conditions state: "|STAT PROGRAMS HAVE NOT BEEN VALIDATED FOR LARGE DATASETS, HIGHLY VARIABLE DATA, NOR VERY LARGE NUMBERS."
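The usual single-pass fix for this class of formula is Welford's online algorithm, which updates a running mean and a running sum of squared deviations instead of subtracting two huge numbers at the end. A minimal Python sketch (function name mine):

```python
def welford_variance(xs):
    """One-pass, numerically stable sample variance (Welford's algorithm)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1) if n > 1 else None

# The same large-mean data that breaks the naive formula:
print(welford_variance([1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]))  # 30.0
```

It costs one division per value but needs only three scalars of state, so it suits a streaming command-line tool like this one.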
$ octave
octave:1> a=load('numbers.txt');
octave:2> sum(a)
ans = 55
octave:3> mean(a)
ans = 5.5000
octave:4> std(a)
ans = 3.0277
octave:5> quantile(a)
ans =
1.0000
3.0000
5.5000
8.0000
10.0000
etc.

The reason I wrote this script was to get quick results from the command line.
For instance: I could use grep, cut and other unix tools to get the numbers from a file and make quick calculations.
Of course, for complex processing I would use octave or R.
function mean() {
octave -q --eval "mean = mean(load('$1'))"
}
Then just run "mean numbers.txt".

I am sure your approach is much quicker; octave takes a good 0.5s(!) to load on my machine.
octave -q --eval "mean = mean(load('$1'))"
But, again, octave requires more time to warm up...

IMHO, if you want your script's output to be easily usable by other scripts, line-delimited output is easier, since you can grep out the lines you want rather than relying on a column position never changing (cut takes only a field number, not a field name like "average").
I wanted to use "stat", but it was already taken (display file status); "statistics" was too long.
Just as curiosity, I got the idea for this script when I wanted to calculate the sum of some numbers and discovered that the "sum" command was used for another purpose (display file checksums and block counts)!
http://www.freebsd.org/cgi/man.cgi?query=ministat&apropos=0&...
I ported this tool to Linux for the Arch Linux package ages ago:
https://github.com/codemac/ministat
There are a few forks (adding autoconf, an osx branch, etc) as well.
Calculating the median and quartiles requires storing the whole dataset and sorting it, so it is limited by the available memory.
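The contrast with the streaming statistics is easy to see in a few lines of Python (my sketch, not the tool's code): the exact median needs every value in memory, while the mean needs only two scalars.

```python
import statistics

data = [9, 1, 7, 3, 5, 10, 2, 8, 6, 4]

# Exact median: the whole dataset must be held and sorted -- O(n) memory.
print(statistics.median(data))  # 5.5

# The mean, by contrast, needs only two running values:
total = 0.0
count = 0
for x in data:
    total += x
    count += 1
print(total / count)  # 5.5
```

Bounded-memory approximations for quantiles do exist (e.g. streaming quantile sketches), but exact order statistics genuinely need the full set.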
Regarding your suggestion -- I'm considering the idea of dealing with multiple columns and even CSV and other types of tabulated data.
edit: it's perl, not python. brainfart on my part.