
{"id":351,"date":"2022-08-28T19:48:38","date_gmt":"2022-08-28T17:48:38","guid":{"rendered":"http:\/\/serverdude.dk\/?p=351"},"modified":"2022-08-28T19:48:38","modified_gmt":"2022-08-28T17:48:38","slug":"benfords-law","status":"publish","type":"post","link":"https:\/\/serverdude.dk\/?p=351","title":{"rendered":"Benford&#8217;s Law"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">I was curious as to what I could actually do with <a href=\"https:\/\/en.wikipedia.org\/wiki\/Benford's_law\">Benford&#8217;s Law<\/a>. Could I detect anomalies that would warrant questioning? Where could I apply it?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So &#8211; first off &#8211; what is Benford&#8217;s Law &#8211; or the Newcomb-Benford law? It is a naturally occurring distribution of leading digits in data sets whose values span several orders of magnitude. It states that 1 is then &#8211; by far &#8211; the most prevalent most significant digit. The theoretical frequency of each digit is log10(1+1\/digit).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s say our data set is based upon the number of people living in cities, then the cities with 10-19 million, 1-1.9 million, 100-199 thousand, 10-19 thousand, 1-1.9 thousand, 100-199, 10-19, and 1 citizen would cover a little more than 30% of the cities (log10(2) is about 0.301), whereas cities with 20-29 million, 2-2.9 million, 200-299 thousand, 20-29 thousand, 2-2.9 thousand, 200-299, 20-29, and 2 citizens would cover 17.6% of the cities (log10(1.5) is about 0.176).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I find this counterintuitive, and thus very fascinating.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I had a bank statement with deposits and withdrawals &#8211; 212 in total &#8211; and I thought, why not try to apply Benford&#8217;s law to that lot and see how well it fits, and if it does not, whether there is an explanation.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To get the most significant digit I chose to use the log10 function, basically writing the number in scientific notation. 
Then I take the floor of the exponent, divide the original amount by 10 raised to that power to get a value between 1 and 10, and finally take the absolute integer value:<br><br>abs(as.integer(amount\/10^floor(log10(abs(amount)))))<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s assume the amount is 12345: log10 gives us 4.091491, whose floor is 4. We then divide 12345 by 10^4 to get 1.2345 and take the integer part (truncating, not rounding), which gives us 1.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Using dplyr in R to get the count per leading digit, I used:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">data_hist &lt;- as.data.frame(data %>% group_by(digit=abs(as.integer(amount\/10^floor(log10(abs(amount)))))) %>% summarize(count=n()) )<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Which gave me:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<tbody><tr><th>digit<\/th><th>count<\/th><\/tr>\n<tr><td>1<\/td><td>54<\/td><\/tr>\n<tr><td>2<\/td><td>29<\/td><\/tr>\n<tr><td>3<\/td><td>42<\/td><\/tr>\n<tr><td>4<\/td><td>18<\/td><\/tr>\n<tr><td>5<\/td><td>24<\/td><\/tr>\n<tr><td>6<\/td><td>15<\/td><\/tr>\n<tr><td>7<\/td><td>11<\/td><\/tr>\n<tr><td>8<\/td><td>6<\/td><\/tr>\n<tr><td>9<\/td><td>13<\/td><\/tr>\n<\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">To add the theoretical Benford frequencies, simply add a column:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">data_hist$benford = log10(1+1\/data_hist$digit)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We would also like to have the observed frequency as a fraction of the total, not as an observation count:
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">data_hist$freq = data_hist$count \/ sum(data_hist$count)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now, we could use these two values to calculate which digits are the most off, either simply using freq\/benford or (freq-benford)\/(freq+benford) &#8211; but let&#8217;s just plot the data as a bar chart along with the Benford curve.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">ggplot(data_hist, aes(x=digit, y=freq)) + geom_bar(stat=\"identity\") + geom_line(aes(x=digit, y=benford)) + labs(title=\"Bank payment most significant digit: Benfords law\") + scale_x_continuous(\"digit\", breaks=data_hist$digit)<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"http:\/\/serverdude.dk\/wp-content\/uploads\/image-1.png\"><img loading=\"lazy\" decoding=\"async\" width=\"861\" height=\"550\" src=\"http:\/\/serverdude.dk\/wp-content\/uploads\/image-1.png\" alt=\"\" class=\"wp-image-354\" srcset=\"https:\/\/serverdude.dk\/wp-content\/uploads\/image-1.png 861w, https:\/\/serverdude.dk\/wp-content\/uploads\/image-1-300x192.png 300w, https:\/\/serverdude.dk\/wp-content\/uploads\/image-1-768x491.png 768w\" sizes=\"auto, (max-width: 861px) 100vw, 861px\" \/><\/a><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">It would seem that 3, 5, and 9 are overrepresented, which raises the question: Why?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It turns out that I have subscriptions in here, and those are not random and will thus skew my distribution.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">While this was relatively easy to do, and I did spot a few outliers, it seems I didn&#8217;t quite get what I was looking for. And the 4 subscriptions didn&#8217;t expose themselves in this plot &#8211; maybe they would have if the other subscriptions had been removed. 
Studies for another day.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I was curious as to what I actually could do with Benford&#8217;s Law. Could I detect anomalies that would warrant questioning? Where could I apply it? So &#8211; first off &#8211; what is Benford&#8217;s Law &#8211; or the Newcomb-Benford law? It is a natural occurring distribution if the domain spans several orders of magnitude. It [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-351","post","type-post","status-publish","format-standard","hentry","category-ikke-kategoriseret"],"_links":{"self":[{"href":"https:\/\/serverdude.dk\/index.php?rest_route=\/wp\/v2\/posts\/351","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/serverdude.dk\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/serverdude.dk\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/serverdude.dk\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/serverdude.dk\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=351"}],"version-history":[{"count":1,"href":"https:\/\/serverdude.dk\/index.php?rest_route=\/wp\/v2\/posts\/351\/revisions"}],"predecessor-version":[{"id":355,"href":"https:\/\/serverdude.dk\/index.php?rest_route=\/wp\/v2\/posts\/351\/revisions\/355"}],"wp:attachment":[{"href":"https:\/\/serverdude.dk\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=351"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/serverdude.dk\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=351"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/serverdude.dk\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=351"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}