Probably Not: From Chapter 16, Benford’s Law
Table 13.1

| Leading Digit | Fraction of Occurrences (1/x) | Fraction of Occurrences (second function) |
| --- | --- | --- |
| 1 | 0.301 | 0.555 |
| 2 | 0.178 | 0.185 |
| 3 | 0.124 | 0.093 |
| 4 | 0.095 | 0.056 |
| 5 | 0.079 | 0.037 |
| 6 | 0.067 | 0.027 |
| 7 | 0.058 | 0.020 |
| 8 | 0.051 | 0.015 |
| 9 | 0.046 | 0.012 |

Table 13.2

| Leading Digit | Fraction of Occurrences (1/x) | Fraction of Occurrences (second function) |
| --- | --- | --- |
| 1 | 0.302 | 0.111 |
| 2 | 0.176 | 0.371 |
| 3 | 0.125 | 0.184 |
| 4 | 0.098 | 0.110 |
| 5 | 0.080 | 0.074 |
| 6 | 0.066 | 0.055 |
| 7 | 0.058 | 0.038 |
| 8 | 0.052 | 0.031 |
| 9 | 0.044 | 0.025 |

Table 13.3

| Leading Digit | Fraction of Occurrences |
| --- | --- |
| 1 | 0.350 |
| 2 | 0.250 |
| 3 | 0.100 |
| 4 | 0.080 |
| 5 | 0.050 |
| 6 | 0.060 |
| 7 | 0.040 |
| 8 | 0.020 |
| 9 | 0.050 |

Table 13.4

| Area | Population | Country | Capital |
| --- | --- | --- | --- |
| 250,000 | 31,056,997 | Afghanistan | Kabul |
| 11,100 | 3,581,655 | Albania | Tirane |
| 919,590 | 32,930,091 | Algeria | Algiers |
| 181 | 71,201 | Andorra | Andorra la Vella |
| 481,351 | 12,127,071 | Angola | Luanda |
| 171 | 69,108 | Antigua & Barbuda | St. John’s |
| 1,068,296 | 39,921,833 | Argentina | Buenos Aires |
| 11,506 | 2,976,372 | Armenia | Yerevan |
| 2,967,893 | 20,264,082 | Australia | Canberra |
| 32,382 | 8,192,880 | Austria | Vienna |
| 33,436 | 7,961,619 | Azerbaijan | Baku |
| 5,382 | 303,770 | Bahamas | Nassau |
| 257 | 698,585 | Bahrain | Al-Manamah |
| 55,598 | 147,365,352 | Bangladesh | Dhaka |
| 80,154 | 279,912 | Barbados | Bridgetown |
| 11,787 | 10,293,011 | Belarus | Mensk |
| 8,867 | 287,730 | Belgium | Brussels |
| 43,483 | 7,862,944 | Bhutan | Belmopan |

Table 13.5

| Leading Digit | Fraction of Occurrences (Area) | Fraction of Occurrences (Population) | Ideal Terms |
| --- | --- | --- | --- |
| 1 | 0.340 | 0.273 | 0.301 |
| 2 | 0.149 | 0.180 | 0.176 |
| 3 | 0.139 | 0.119 | 0.125 |
| 4 | 0.129 | 0.119 | 0.097 |
| 5 | 0.036 | 0.067 | 0.079 |
| 6 | 0.057 | 0.025 | 0.067 |
| 7 | 0.046 | 0.062 | 0.058 |
| 8 | 0.046 | 0.062 | 0.051 |
| 9 | 0.057 | 0.036 | 0.046 |

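Getting the ideal distribution numbers for the Benford series starting with the 1/x PDF is a small calculus exercise. The resulting formula for the fraction of numbers with leading digit d is
$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \qquad d = 1, 2, \ldots, 9$$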
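The calculus exercise amounts to integrating the normalized 1/x PDF, 1/(x ln 10) on the range 1 to 10, over the interval from d to d + 1. The sketch below evaluates the closed form of that integral for each leading digit. The function name `benford_fraction` is illustrative, not taken from the text, and the range of 1 to 10 is the one the chapter uses.

```python
import math

def benford_fraction(d):
    """Ideal fraction of numbers whose leading digit is d (1 through 9).

    Area under 1 / (x ln 10) between d and d + 1:
    (ln(d + 1) - ln(d)) / ln(10) = log10(1 + 1/d).
    """
    return math.log10(1 + 1 / d)

# One ideal term per leading digit, in order 1..9.
ideal_terms = [benford_fraction(d) for d in range(1, 10)]

for d, p in enumerate(ideal_terms, start=1):
    print(d, f"{p:.3f}")
```

Rounded to three places, these values can be checked against the "Ideal Terms" column of Table 13.5.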
Figure 13.2 compares the ideal Benford terms to the land area and population data. Again, I've drawn continuous lines rather than showing the discrete points just for clarity. As can be seen, the fit isn't bad for such small data sets.
Can we conclude that just about any set of data will obey these statistics? No. For example, data that are normally distributed will not obey these statistics. Uniformly distributed data will not obey these statistics. However, it has been proven that if you build a data set by randomly choosing a distribution from a list and then randomly choosing a number using that distribution, the resulting data set will obey Benford's law.
A simple caveat should be mentioned. Suppose I have a list of measurements made up of numbers between 1.000 and 1.499; 100% of these numbers have a leading digit of 1. If I double these numbers, 100% of the new list will have a leading digit of 2. The entire discussion falls apart if the numbers are not drawn from a list spanning a whole number (at least one) of decades. Legitimate lists are numbers from 1 to 10, 1 to 100, 10 to 100, 2 to 20, 300 to 3,000,000, and so on, whereas 1 to 5, 20 to 30, and so on, are bad choices. If the list spans many decades, then the upper limit loses significance because of the shape of the underlying 1/x PDF: relatively few numbers come from the large values of x, so they don't contribute to the statistics in any meaningful way.
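The 1.000 to 1.499 example can be checked in a few lines. This is only a sketch: the evenly spaced list below stands in for "some measurements" (an assumption, since the text does not say how they are spaced), and `leading_digit` is a hypothetical helper that reads the first nonzero digit from the number's printed form.

```python
def leading_digit(x):
    """Return the first nonzero digit of a positive number."""
    for ch in repr(x):
        if ch in "123456789":
            return int(ch)
    raise ValueError("no nonzero digit found")

# Stand-in measurements: 1.000, 1.001, ..., 1.499 (well under one decade).
measurements = [1 + i / 1000 for i in range(500)]
doubled = [2 * m for m in measurements]

digits_before = {leading_digit(m) for m in measurements}
digits_after = {leading_digit(m) for m in doubled}

print("leading digits before doubling:", digits_before)
print("leading digits after doubling: ", digits_after)
```

Every number starts with 1 before doubling and with 2 afterward, so a list that covers only part of a decade cannot have scale-invariant leading-digit statistics.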
An interesting thing about data sets such as the distances to the nearest stars is that the set will obey the Benford statistics regardless of what units these distances are expressed in. This is because a change of units, say from miles to kilometers or light-years, is done by multiplying every distance by a scale factor. For example, to change from meters to inches, you would multiply every number by 39.37. Since a data set obeying the Benford statistics will obey them just as well when multiplied by an arbitrary scale factor, the leading digit (Benford) statistics will not change.
A large company's collection of (say) a year's worth of business trip expense reports will obey the Benford statistics. I don't know if it's an urban legend or really the truth, but the story I heard is that the IRS has caught companies generating spurious expense statements using random number generators because they didn't realize that neither a uniformly nor a normally distributed list of numbers between, say, $10 and $1000 conforms to the correct statistical distribution.
The frequent occurrence of Benford's law in the world around us is really not as surprising as it might first seem. If we were to generate a sequence of numbers by repeatedly adding a constant to the previous number, for example
1.0, 1.1, 1.2, ..., 9.9, 10.0
we would get a uniformly distributed sequence of numbers. A specific range, for example a range 1 unit wide, would contain the same number of members of the sequence wherever it was placed in the sequence.
This sequence, called an arithmetic sequence, is therefore an example of what we have been calling a uniform distribution from 1 to 10.
Multiplying the previous number by a constant instead, for example
1.00, 1.10, 1.21, ...
generates a sequence where the separation between the numbers increases as the numbers get larger. This sequence, called a geometric sequence, is an example of a 1/x distribution of numbers, which demonstrates Benford's law.
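The contrast between the two sequences can be sketched by counting leading digits in each. The arithmetic sequence below runs from 1.0 to 9.9 in steps of 0.1. The geometric one uses the ratio 1.10 from the example, carried for 1,000 terms, which is an arbitrary length chosen here so that the sequence spans many decades. The helper names are illustrative.

```python
import math
from collections import Counter

def leading_digit(x):
    """Return the first nonzero digit of a positive number."""
    for ch in repr(x):
        if ch in "123456789":
            return int(ch)
    raise ValueError("no nonzero digit found")

def digit_fractions(values):
    """Fraction of values having each leading digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]

# Arithmetic sequence: add 0.1 each step, 1.0 up to 9.9.
arithmetic = [i / 10 for i in range(10, 100)]

# Geometric sequence: multiply by 1.10 each step (1,000 terms, assumed length).
geometric = []
x = 1.0
for _ in range(1000):
    geometric.append(x)
    x *= 1.10

arith_fracs = digit_fractions(arithmetic)
geom_fracs = digit_fractions(geometric)
ideal = [math.log10(1 + 1 / d) for d in range(1, 10)]

for d in range(1, 10):
    print(d, f"{arith_fracs[d-1]:.3f}", f"{geom_fracs[d-1]:.3f}", f"{ideal[d-1]:.3f}")
```

The arithmetic sequence gives every leading digit the same share, while the geometric sequence's shares fall off with the digit in the way the ideal Benford terms do.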

Philosophically, there is no reason to believe that nature should prefer the first of these sequences to the second. It does seem reasonable, however, to assert that there is no reason for nature to have a preferred set of units. This would make the second sequence, whose members obey Benford's law, what we should expect to see rather than what we are surprised to see, when we look at, say, the distances to the nearest stars (whether they're measured in light-years, miles, kilometers, or furlongs).
One last word about Benford series: Figure 13.3 shows the results of adding up 10 random sequences generated using the 1/x PDF. As can be seen, despite the unusual properties of the 1/x PDF, it does not escape the central limit theorem.
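A rough numerical version of this experiment is sketched below. It assumes the 1/x PDF is taken over the range 1 to 10 and draws from it by inverse transform (if u is uniform on 0 to 1, then 10 to the power u follows the 1/x PDF). The seed and the number of sums are arbitrary choices made here, not values from the text.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

def sample_one_over_x():
    """Draw one number from the 1/x PDF on [1, 10) by inverse transform."""
    return 10 ** rng.random()

n_terms = 10      # sequences added together, as in the text
n_sums = 20_000   # how many sums to form (assumed)

sums = [sum(sample_one_over_x() for _ in range(n_terms)) for _ in range(n_sums)]

mean = sum(sums) / n_sums
std = math.sqrt(sum((s - mean) ** 2 for s in sums) / n_sums)

# For a normal distribution, about 68% of values lie within one
# standard deviation of the mean.
within_one_std = sum(abs(s - mean) <= std for s in sums) / n_sums

# Mean of a single draw is 9 / ln(10); the sum of 10 draws has 10 times that.
expected_mean = n_terms * 9 / math.log(10)

print(f"mean of sums: {mean:.2f} (theory {expected_mean:.2f})")
print(f"fraction within one standard deviation: {within_one_std:.3f}")
```

If the sums are close to normally distributed, the printed fraction should land near the 68% expected for a normal distribution.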
Figure 13.1 shows the function 1/x for x between 1 and 10. I'll discuss the reasons for the choice of 1 and 10 a little later. For now just note that this function cannot be defined for x = 0, because 1/0 is not a defined operation (a fancy way of saying that it makes no sense). Also shown in Figure 13.1 is a second function. This function looks somewhat similar to 1/x and will be used to contrast properties.
Table 13.1 shows the results of counting the number of occurrences of the leading (leftmost, or most significant) digits in a large list (100,000 numbers) of random numbers generated from each of the functions above and showing the fraction of the total contributed by each of these counts. Figure 13.1 shows that both of these functions are largest for the lowest values of x and decrease with increasing x. Therefore it is not surprising that there are more numbers with a leading digit of 1 than with a leading digit of 2, more numbers with a leading digit of 2 than with a leading digit of 3, and so on.
I am going to take the two lists of random numbers and double every number in them. Since the original lists were of numbers between 1 and 10, the new lists will be of numbers between 2 and 20. Now I'll repeat what I did above — I’ll count the number of occurrences of each of the leading digits and report these occurrences as fractions of the total.
Table 13.2 shows the results of this exercise. Compare Tables 13.1 and 13.2. For the second function there is no obvious relationship between the two tables. This is what you'd intuitively expect for the results of such a seemingly arbitrary exercise. The results for the function 1/x, on the other hand, are striking: the entries are almost identical.
Without showing it here, let me state that I would have gotten the same results regardless of what number I had multiplied the lists of random numbers by. Also, had I invested the time to generate extremely large lists of numbers, I could have shown that the agreement gets better and better as the lists get larger. The function 1/x has the unique property that the distribution of leading digits does not change (is “invariant”) when the list is multiplied by any number (sometimes called a “scaling factor”).
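The doubling experiment can be reproduced approximately with the sketch below. Two things here are assumptions rather than statements from the text. First, the random numbers are drawn by inverse transform sampling on the range 1 to 10. Second, the contrasting function is not named in this excerpt, so the code uses 1/x² as a stand-in; it was chosen only because it gives leading-digit fractions close to the second columns of Tables 13.1 and 13.2. The helper names and the seed are illustrative.

```python
import random
from collections import Counter

def leading_digit(x):
    """Return the first nonzero digit of a positive number."""
    for ch in repr(x):
        if ch in "123456789":
            return int(ch)
    raise ValueError("no nonzero digit found")

def digit_fractions(values):
    """Fraction of values having each leading digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000             # list size used in the text

# 1/x PDF on [1, 10): CDF is log10(x), so x = 10 ** u.
samples_inv_x = [10 ** rng.random() for _ in range(n)]

# ASSUMED second function, 1/x^2 on [1, 10): CDF is (10/9)(1 - 1/x),
# so x = 1 / (1 - 0.9 u).
samples_inv_x2 = [1 / (1 - 0.9 * rng.random()) for _ in range(n)]

def compare(samples, factor):
    """Leading-digit fractions before and after multiplying by factor."""
    before = digit_fractions(samples)
    after = digit_fractions([factor * s for s in samples])
    return before, after

inv_x_before, inv_x_after = compare(samples_inv_x, 2)
inv_x2_before, inv_x2_after = compare(samples_inv_x2, 2)

for d in range(1, 10):
    print(d,
          f"{inv_x_before[d-1]:.3f} {inv_x_after[d-1]:.3f}",
          f"{inv_x2_before[d-1]:.3f} {inv_x2_after[d-1]:.3f}")
```

Calling `compare(samples_inv_x, 39.37)`, or any other positive factor, is a quick way to try scale factors other than 2.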
At this point I've demonstrated an interesting statistical curiosity and nothing more. Now look at Table 13.3. This table is generated by using the same procedure as above for a list of the distances to the 100 closest stars to the earth. Since this is a relatively small list, we shouldn't expect ideal agreement. It does look, however, as if this list was generated using the 1/x distribution.
Is this a coincidence of just one naturally occurring set of numbers, or is there something else going on here? I downloaded a list of the land area and population of 194 countries from the Internet. Table 13.4 shows the beginning of this list, in alphabetical order by country. As an aside, one has to wonder at the accuracy of the data. For example, can the group that surveyed Bangladesh really be sure that, when they completed their survey, there were exactly 147,365,352 people in Bangladesh? My intuition tells me that this number isn't accurate to better than about 100,000 people. In any case, since I'm interested in the leading, or most significant, digit only, this and other similar inaccuracies won't affect what I'm doing.
Table 13.5 shows the results of the leading digit counts on the area and population numbers in the list (the entire list of 194 countries, not just the piece excerpted in Table 13.4). It's not obvious how well these numbers do or don't fit the Benford series, so I'd like to compare them.
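One simple way to put a number on the comparison is to look at how far each column of Table 13.5 strays from the ideal terms. The sketch below copies the area and population fractions from Table 13.5 and computes the ideal terms as log10(1 + 1/d), which matches the table's "Ideal Terms" column to three places. The largest absolute deviation is just one possible measure of fit, picked here for simplicity, and `worst_deviation` is a hypothetical helper name.

```python
import math

# Fractions copied from Table 13.5, for leading digits 1..9.
area = [0.340, 0.149, 0.139, 0.129, 0.036, 0.057, 0.046, 0.046, 0.057]
population = [0.273, 0.180, 0.119, 0.119, 0.067, 0.025, 0.062, 0.062, 0.036]

# Ideal Benford terms for leading digits 1..9.
ideal = [math.log10(1 + 1 / d) for d in range(1, 10)]

def worst_deviation(observed):
    """Return (digit, deviation) where observed differs most from ideal."""
    deviations = [abs(o - i) for o, i in zip(observed, ideal)]
    worst = max(range(9), key=lambda k: deviations[k])
    return worst + 1, deviations[worst]

for name, column in (("area", area), ("population", population)):
    digit, deviation = worst_deviation(column)
    print(f"{name}: largest deviation {deviation:.3f} at leading digit {digit}")
```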