I was curious about what I could actually do with Benford’s Law. Could I detect anomalies that would warrant questioning? Where could I apply it?
So – first off – what is Benford’s Law, also known as the Newcomb-Benford law? It is a naturally occurring distribution of leading digits that shows up when the data spans several orders of magnitude. It states that 1 is then – by far – the most prevalent most significant digit. The theoretical frequency of each digit is log10(1 + 1/digit).
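As a quick illustration – just a minimal R sketch of that formula, not tied to my data yet – the theoretical frequency of each possible leading digit is:

```r
# Theoretical Benford frequency for each possible leading digit
digits <- 1:9
benford <- log10(1 + 1 / digits)
round(benford, 3)
#> [1] 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046
```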
Let’s say our data set is the number of people living in cities. Then the cities with 10-19 million, 1-1.9 million, 100-199 thousand, 10-19 thousand, 1,000-1,999, 100-199, 10-19, and 1 citizen – that is, every city whose population starts with a 1 – would cover a little more than 30% of the cities, whereas cities with 20-29 million, 2-2.9 million, 200-299 thousand, 20-29 thousand, 2,000-2,999, 200-299, 20-29, and 2 citizens would cover only 17.6%.
I find this counterintuitive, and thus very fascinating.
I had a bank statement with deposits and withdrawals – 212 transactions in total – and I thought: why not apply Benford’s law to that lot and see how well it fits, and if it doesn’t, whether there is an explanation.
To get the most significant digit I chose to use the log10 function, essentially writing the number in scientific notation: take the floor of log10 of the absolute amount to get the exponent, divide the original amount by 10 raised to that exponent so the result lands between 1 and 10, and finally take the absolute integer value:
abs(as.integer(amount/10^floor(log10(abs(amount)))))
Let’s assume the amount is 12345. Then log10 gives us 4.091491; we take the floor to get 4, divide 12345 by 10^4 to get 1.2345, and then take the integer value – truncating, not rounding – which gives us 1.
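The same extraction can be wrapped in a small helper – just a sketch for readability, the one-liner above is what I actually used:

```r
# Most significant (leading) digit of a non-zero amount,
# using the same log10 trick as above
leading_digit <- function(amount) {
  abs(as.integer(amount / 10^floor(log10(abs(amount)))))
}

leading_digit(12345)   # 1
leading_digit(-0.042)  # 4
leading_digit(987.65)  # 9
```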
To get the count for each leading digit, I used dplyr in R:
data_hist <- as.data.frame(data %>% group_by(digit=abs(as.integer(amount/10^floor(log10(abs(amount)))))) %>% summarize(count=n()) )
Which gave me:
digit | count |
---|---|
1 | 54 |
2 | 29 |
3 | 42 |
4 | 18 |
5 | 24 |
6 | 15 |
7 | 11 |
8 | 6 |
9 | 13 |
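As a sanity check, the same counts can be had with plain base R – a small sketch using the same digit expression on the data frame from above:

```r
# Cross-check the per-digit counts without dplyr
digit <- abs(as.integer(data$amount / 10^floor(log10(abs(data$amount)))))
table(digit)
```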
To add the theoretical Benford frequencies, simply add a column:
data_hist$benford = log10(1+1/data_hist$digit)
We would also like the observed frequency as a proportion of the total, not as a raw observation count:
data_hist$freq = data_hist$count / sum(data_hist$count)
Now, we could use these two columns to calculate which digits are the most off, either simply as the ratio freq/benford or as the normalized difference (freq-benford)/(freq+benford) – a quick sketch of that follows the plot. But first, let’s plot the data as a bar chart along with the Benford curve, using ggplot2:
ggplot(data_hist, aes(x = digit, y = freq)) + geom_bar(stat = "identity") + geom_line(aes(x = digit, y = benford)) + labs(title = "Bank payment most significant digit: Benford's law") + scale_x_continuous("digit", breaks = data_hist$digit)
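And for the numeric view mentioned above – a sketch that adds both deviation measures to the data_hist frame we already have:

```r
# How far each digit deviates from the Benford expectation
data_hist$ratio     <- data_hist$freq / data_hist$benford
data_hist$norm_diff <- (data_hist$freq - data_hist$benford) /
                       (data_hist$freq + data_hist$benford)

# Digits with the largest positive deviation first
data_hist[order(-data_hist$norm_diff), ]
```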
It would seem that 3, 5, and 9 are overrepresented, which begs the question: why?
It turns out that I have subscription payments in here, and those are not random, so they skew the distribution.
While this was relatively easy to do, and I did spot a few outliers, it seems I didn’t quite get what I was looking for. And the four subscriptions didn’t expose themselves in this plot – maybe they would have if the other subscriptions had been removed. A study for another day.
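If I do pick this up another day, one rough way to start – purely a sketch, assuming a subscription shows up as the same amount repeating month after month, with an arbitrary cutoff of three occurrences – would be to drop the repeated amounts before recomputing the digits:

```r
library(dplyr)

# Assumption: a subscription posts the same amount every month,
# so amounts occurring 3+ times are treated as recurring and dropped
data_filtered <- data %>%
  group_by(amount) %>%
  filter(n() < 3) %>%
  ungroup()
```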