1. Spotting odd things in MPs' expenses: http://blog.jgc.org/2009/06/its-probably-worth-testing-mps.h...
2. Spotting odd things in BBC executives' expenses: http://blog.jgc.org/2009/06/running-numbers-on-bbc-executive...
3. The Iranian election: http://blog.jgc.org/2009/06/benfords-law-and-iranian-electio...
4. New Age mumbo jumbo: http://www.jgc.org/blog/2008/02/any-sufficiently-simple-expl...
I was thinking of making it into a little web tool.
<blockquote>The discovery of this fact goes back to 1881, when the American astronomer Simon Newcomb noticed that in logarithm books, the earlier pages (which contained numbers that started with 1) were much more worn than the other pages.</blockquote>
Can you imagine the sense of observation and curiosity that would make someone look at a book of numbers and say, "I wonder why these pages are more worn than those ones"?
I simply chalked it up to most people not being very serious about reading books in general and any given book in particular. It's a rare person who makes it all the way through.
I don't think my own observation was a particularly interesting or original one.
What made Newcomb's observation interesting was that it was about books of logarithm tables in particular, where (unlike a typical book) you'd think the lookups would be uniformly distributed.
The other interesting thing that did require an unusual amount of curiosity and dedication is the systematic testing of such a casual observation to try to figure out what the underlying reasons for it were and how they might apply to things other than books of logarithms. This desire and dedication to observe, test, and figure out the underlying workings of things is the hallmark of many a great scientist.
Imagine that your calculator breaks down every three to four months. After a couple of years, any hacker is bound to think, "I should be able to take a couple of broken ones, pick out the working parts, and build a working one." Then you discover that all of them have perfectly working '9' keys but broken '1' keys.
[EDITED to add: Discussed before on HN: http://news.ycombinator.com/item?id=687241. There have been quite a number of other discussions of Benford's law on HN, too.]
In the United States, evidence based on Benford's law is legally admissible in criminal cases at the federal, state, and local levels.
Wouldn't it be stranger (and actually interesting) if that evidence wasn't admissible?
It's still a bit of a brain f--- when you first encounter it. I found it easier to grasp by plotting the data than by aggregating lists of numbers and measurements.
If it proves itself true, then you could use it to test if a group of things is increasing or decreasing.
Proof that the starting digit in numbers is not base invariant:
In base 10 not all numbers start with 1. In base 2 all numbers start with 1. Hence, the distribution is not base invariant. QED.
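A minimal sketch of the observation above, assuming we only look at positive integers. The helper name `leading_digit` is hypothetical and only for illustration.

```python
def leading_digit(n: int, base: int = 10) -> int:
    """Return the most significant digit of a positive integer in the given base."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if base < 2:
        raise ValueError("base must be at least 2")
    # Strip low-order digits until only the top digit remains.
    while n >= base:
        n //= base
    return n


# Every positive integer written in base 2 starts with the digit 1,
# while base 10 uses all nine possible leading digits.
digits_base_2 = {leading_digit(n, 2) for n in range(1, 1000)}
digits_base_10 = {leading_digit(n, 10) for n in range(1, 1000)}
```

Here `digits_base_2` contains only 1, while `digits_base_10` contains every digit from 1 to 9.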
For an explanation of why you get the right thing if you substitute "units", I refer you to the Wikipedia page on Benford's Law. http://en.wikipedia.org/wiki/Benfords_law
Here are some hacker-newsers testing files in their home directories: http://news.ycombinator.com/item?id=1076534
http://news.ycombinator.com/item?id=100540
http://news.ycombinator.com/item?id=499405
http://news.ycombinator.com/item?id=699202
http://news.ycombinator.com/item?id=731176
http://news.ycombinator.com/item?id=1076405
http://news.ycombinator.com/item?id=1429336
http://news.ycombinator.com/item?id=1569669
http://news.ycombinator.com/item?id=1653808
http://news.ycombinator.com/item?id=1917514
http://news.ycombinator.com/item?id=2089809
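The home-directory experiment mentioned above could be sketched roughly as follows. This is not the code used in the linked thread. It is an illustrative sketch, and the function name `leading_digit_counts` is made up here. It assumes file size in bytes is the quantity being tested, and it skips empty and unreadable files.

```python
import os
from collections import Counter


def leading_digit_counts(root: str) -> Counter:
    """Tally the leading decimal digit of every non-empty file's size under root."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat avoids following symlinks to files outside the tree.
                size = os.lstat(path).st_size
            except OSError:
                continue  # unreadable or vanished file: skip it
            if size > 0:
                counts[int(str(size)[0])] += 1
    return counts
```

Calling `leading_digit_counts(os.path.expanduser("~"))` would tally a home directory. The resulting shares can then be compared against the Benford proportions.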
Now imagine that numbers are built out of stones. To "build" a 1, you only need 1 stone. But to "build" a 2, you need 2 stones. Thus, if you wanted to write a 3, you would have to go in the desert and find 3 stones. It's 3x as hard, and so you'd expect people to "build" 1/3 as many 3's as 1's, 1/5 as many 5's as 1's, and so on. Just as you'd expect there to be a lot more single story buildings than skyscrapers. It's easier to build a single story building.
Thus, the distribution is exactly what you'd expect. While it doesn't actually take stones to build numbers, we don't write the number 3 unless we have 3 of something. Unless you are lying. Which is why this is a great method of detecting fraud.
UPDATE: What do I mean when I say "3 times as hard"?
Imagine the desert is a rectangle of 10 squares. Kind of like a mancala board or a ladder on the ground. You start by stepping in square 1, and to get to square 10 you have to step through each square.
If there is only 1 rock, what are the odds that you'll have to walk all 10 steps to find it? This is the same thing as asking what are the odds that this rock is in square 10. The answer is 1/10 or 10%.
Now, if there are 3 rocks, what are the odds that you'll have to step into all 10 squares? Well, what are the odds that there's a rock in the last square? 26.1%, or approximately 3x as hard. It's interesting that it's not exactly 3x as hard, it's 2.61x as hard. Which makes the data in the OP seem even more logical since you'd expect 30.8% 1's given 11.8% 3's--the 32.62% actual number is not that far off.
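The rock calculation can be checked with a short sketch. It assumes each rock lands in one of the 10 squares independently and uniformly, and that several rocks may share a square. The function name `prob_rock_in_last_square` is hypothetical.

```python
from fractions import Fraction


def prob_rock_in_last_square(rocks: int, squares: int = 10) -> Fraction:
    """Probability that at least one of `rocks` rocks sits in the last square.

    Assumes independent, uniform placement, with rocks allowed to share a square.
    """
    if rocks < 0 or squares < 1:
        raise ValueError("rocks must be >= 0 and squares must be >= 1")
    # Complement rule: 1 - P(no rock lands in the last square).
    return 1 - Fraction(squares - 1, squares) ** rocks


one_rock = prob_rock_in_last_square(1)     # 1/10
three_rocks = prob_rock_in_last_square(3)  # 271/1000
```

Under these assumptions the three-rock case comes out to 271/1000, or 27.1%, slightly different from the 26.1% quoted above, so the exact multiplier depends on the model used.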
Suppose you are the guy looking for the stones. There are two stones in the desert. Everything being random but equal, you are twice as likely to run into a stone when there are two than when there is only one stone in the desert. Once you find the first stone, it is equally difficult to find the second stone as it is to find only one stone at the beginning (if you treat "finding a stone" as independent events where you don't learn about the location of subsequent stones).
So while the idea is interesting, the analogy is poor. I much prefer the Wikipedia explanation, which is similar to yours but much more logically rigorous: http://en.wikipedia.org/wiki/Benfords_law#Outcomes_of_expone...
Response to update: Now I feel that you are convoluting your analogy. Can multiple stones occupy the same square? How is it appropriate to equate/compare "the number of squares you walk through in order to pick up all the stones" to "the number of times a digit should show up"? I apologize, but your illustration has become completely lost to me.
You're right, I should have clarified. If multiple stones could not occupy the same square, the odds would remain as I first explained them (3x, etc.). I think in my stones analogy and in real life, stones should be able to occupy the same square. In fact, there should be a positive correlation (i.e., given that there's a rock in this square, the odds of a second rock being there go up).
> How is it appropriate to equate/compare "the number of squares you walk through in order to pick up all the stones" to "the number of times a digit should show up"?
Coming across 3 units of a quantity is 3x as hard as coming across 1 unit. When we write numbers, we are either:
1) writing a truthful description of how many units we see/own/ate/taste/touch etc. (I ate 2 bagels, I earned $5, I ran 10 miles.)
2) lying.
By "lying", I'm including things like writing a novel. Maybe a better word is "imagining". With numbers, we are either writing down true observations or we are imagining them. It's just as easy to "imagine" $9 million in your bank account as it is to "imagine" $1 million, while truthfully finding $9 million in your bank account is a lot more difficult :). This is why Benford's law doesn't apply for "imagined" numbers. By using Benford's law, you can quickly classify a number set into either "real" or "imagined".
According to Benford's law the odds of a leading 1 are 1.709511291351... times the odds of a leading 2. This isn't the factor of 2 you thought it should be. The odds of a leading 1 are 2.409420839653... times the odds of a leading 3. This isn't the factor of 3 you thought it should be.
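The ratios quoted above follow from the standard statement of Benford's law, P(d) = log10(1 + 1/d) for leading digits d = 1..9. A minimal sketch (the function name `benford_probability` is just an illustrative choice):

```python
import math


def benford_probability(d: int) -> float:
    """Benford's law: probability that the leading decimal digit equals d."""
    if not 1 <= d <= 9:
        raise ValueError("d must be a digit from 1 to 9")
    return math.log10(1 + 1 / d)


p1 = benford_probability(1)
ratio_1_to_2 = p1 / benford_probability(2)  # about 1.7095, not 2
ratio_1_to_3 = p1 / benford_probability(3)  # about 2.4094, not 3
```

The nine probabilities sum to 1, because the logarithms telescope to log10(10).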
Yes, I know that it is fun to try to figure things out for yourself. But it is essential to learn when you're headed down the wrong path. That lets you correct your misconceptions before they cement and lead to severely wrong impressions of how to do things. Your whole desert/rock analogy? That's a wrong path.
It looks to me, though, that my line of reasoning (note that I said 2.6x more precisely; I used 3x initially to simplify it) more closely matches the data than the numbers you provided.
For the numbers 1-19, more than half of them start with 1. For the numbers 1-199, more than half of them start with one.
Change the examples to 1-299, 1-399, etc, and you'll get percentages of all digits matching Benford's law.
Your method also seems to depend heavily on the choice of starting and ending points. If I chose 1-99, then only about 1/9th of the numbers in the interval (11 of 99) would start with 1. So why choose 199 and not 99?
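The dependence on end points is easy to check by brute force. A small sketch, using a hypothetical helper `fraction_with_leading_digit` that counts over the integers 1 through `upper`:

```python
from fractions import Fraction


def fraction_with_leading_digit(upper: int, digit: int = 1) -> Fraction:
    """Share of the integers 1..upper whose decimal form starts with `digit`."""
    if upper < 1:
        raise ValueError("upper must be at least 1")
    target = str(digit)
    hits = sum(1 for n in range(1, upper + 1) if str(n)[0] == target)
    return Fraction(hits, upper)


share_19 = fraction_with_leading_digit(19)    # 11/19
share_99 = fraction_with_leading_digit(99)    # 11/99, i.e. 1/9
share_199 = fraction_with_leading_digit(199)  # 111/199
```

The share of leading 1s swings between roughly 11% and 58% as the upper end point moves, so a single interval cannot settle the argument either way.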
I just selected end points to illustrate the concept. I think this place is getting a little too literal. :)
I think it seems "counter-intuitive" to some because they are not used to thinking of numbers and counting as being related to exponents and bases.
This may seem more intuitive to those of us who work with computers all day, since we are intimately familiar with counting in a handful of different bases (base-2, base-10, base-16, etc.).
http://blog.yafla.com/Demystifying_Benfords_Law
Best page I've seen on it.
This kind of mathematically unsophisticated reasoning is exactly why Benford's law is so surprising to people. If you think of what it means for a value to be "truly random", the result is not surprising at all.
[1] http://ckan.net/
I wonder what influence the 'spatial' properties of a number pad password have on this data. For example, "5" gets a nice little spike... and "5" is the center key on the 10-key iPhone number pad. The "1" is still the winner by far, but I wonder how many of those are the easy-to-remember "1234".
Chart looks like this. https://url.odesk.com/a7och
How is that a large dataset? There aren't that many countries.
Any time you are counting something it seems obvious to me that you'd have 1 more often than 2.