After reading How many bugs are left? I was intrigued by the use of the Lincoln Index to estimate the number of bugs residing in a solution. But after reading the blog post I was bit baffled that the conclusion didn’t pick up on what was really reflected in the data.

In the blog post there are 2 examples concerning 2 QAs, A and B, finding 20 and 30 bugs respectively in the each case. The real difference is the overlap.

In the first example there is only 1 bug in the overlap, and the Lincoln Index is then 20*30/1 = 600 - in total 49 bugs found

In the second example there are 18 bugs in the overlap, making the Lincoln Index 20*30/18 = 33.3 - in total 32 bugs found

The probability that a QA finds a bug is then:

QA A | QA B | Total | |
---|---|---|---|

Example 1 | 20/600 = .03 | 30/600 = .05 | 49/600 = .08 |

Example 2 | 20/33.3 = .6 | 30/33.3 = .9 | 32/33.3 = .96 |

While this is an example of the method it tells me something not mentioned in the blog post: The bugs in Example 2 must have been extremely obvious making it questionable whether the trials are independent.

Another thing, while it may seem like overkill to have 2 QAs in the 2nd example, it seems too little to be worth the effort in the first example, but we really should have 3 QAs in both cases.

There is nothing indicating the size of the example solutions - which is part why the example is good, and part why I was a bit skeptical at first. There is no right answer for the examples, but if the Lincoln Index are to be considered sufficient estimates on the number of bugs in the systems, then what should we do?

Starting with Example 2 we have found almost all the bugs, and hopefully the fixes will not introduce new ones. There is a good probability that the remaining bugs will be fixed when the code base is fixed - after all 33.3 bugs in a code base is not a lot (depending on the size of the code base itself naturally).

Examining Example 1 we have a different problem. We have discovered approximately 1/12th of the bugs, and we have an estimated 600 bugs in the system. It would seem that we are in dire need for some sort of assistance. Possibly rework of the system as well.

## Code base size estimates

Yes - I know - “Measuring software productivity by lines of code is like measuring progress on an airplane by how much it weighs” (Bill Gates), but the bugs has to come from somewhere, and being somewhat consistent in our styles the number of lines do pose as a quantifiable metric.

According to Dan Mayer (bugs per line of code ratio) referencing Steve McConnell, then we have different ratios of bugs per 1000 lines of code (bugs/kloc): 3, 10-20, 15-50 bugs/kloc

Apart from the obvious 600/33.3 = 18 factor in number of bugs between the examples, which may be as simple as 18 times as much code, there are alternative explanations for the number.

**Example 1**

600 bugs at 3 bugs/kloc = 200,000 lines

600 bugs at 50 bugs/kloc = 12,000 lines

**Example 2**

33.3 bugs at 3 bugs/kloc = 11,111 lines

33.3 bugs at 50 bugs/kloc = 666 lines

That is, if Example 1 is 200 kloc with 3 bugs/kloc, and Example 2 is 666 lines with 50 bugs/kloc, then Example 1 is 300 times the lines, but only 18 times the bugs - in which case it is a rather small amount of bugs even at 600. Example 2 though should really clean up the mess.

If it is the opposite, that is Example 1 12,000 lines at 50 bug/kloc, and Example 2 11,111 lines at 3 bug/kloc, then the number of lines are almost the same, yet the number of bugs is 18 times higher. In this case Example 1 is truly in dire needs of some help.

## Alternative Analysis

These speculations are really afterthoughts on the blog’s content. My real beef was with the Lincoln Index itself - it degenerates at a 0 overlap, basically saying that if two observers examine the same area, they must find some of the same elements. That is a natural assumption if observers are stringent and actually look at the same things. Seeing some of the Escape Room issues where contestants overlook the obvious it would seem that for a software solution there would be several opportunities for QAs to overlook something the developers already overlooked.

While there are some suggestions on improving the Lincoln Index in case the overlap is less than 10, e.g. Bailey (1952) suggesting N = A*(B+1)/(C+1), which would lead Example 1 to 310 bugs instead of the 600. My idea was to turn to the German Tank Problem and estimate the number of bugs from the Bayesian credibility score.

By applying our own serial number system to the bugs (tracking ID) we aren’t really playing into the correct scenario, but bear with me. The maximum serial number we see is thus the total number of unique bugs found. We only have 2 observations - one from each QA.

Only having 2 observations mean the mean, µ, is infinite. We should have at least 4 observations to come up with a mean and standard deviation.

We can still try to make a credible guess. Given at least 2 observations, the credibility that the number of bugs is equal to *n*, is:

0 | if n < m |

k-1/k * C(m-1,k-1)/C(n,k-1) | if n >= m |

*m* = number of distinct bugs found in the *k* observations

As *k* is 2 in our case, the formula simplifies into:

0 | if n < m |

(m-1)/(n*(n-1)) | if n >= m |

The credibility that we have more than *n* bugs is:

1 | if n < m |

C(m-1, k-1)/C(n, k-1) | if n >= m |

Again with *k* = 2 this simplifies into:

1 | if n < m |

(m-1)/n | if n >= m |

This latter formula means that if we want to be 95% confident in the number of bugs, *n*, then 5% risk that N >* n*: .05 = (m-1)/n <=> n = (m-1)/0.05 = 20*(m-1)

Running the examples under the German Tank Problem setting we get:

Example 1: A = 20, B = 30, C = 1, m = A+B-C = 49

Number of bugs at 95% confidence: 20*(49-1) = 960

pA = 20/960 = 0.02

pB = 30/960 = 0.03

total = 49/960 = 0.05

Example 2: A = 20, B = 30, C = 18, m = A+B-C = 32

Number of bugs at 95% confidence: 20*(32-1) = 620

pA = 20/620 = 0.03

pB = 30/620 = 0.05

total = 32/620 = 0.05

We see that we have a lot more bugs than our previous estimates, but the QAs probability of finding bugs are almost the same (below 5%) for both examples, and we have an estimated 5% of the total amount of bugs.

Looking at the accumulated credibility score, we can see that it grows rapidly, then slows down, perhaps an 80% confidence is sufficient. In this case .2 = (m-1)/n <=> n = 5*(m-1), this is a quarter of the 95% confidence numbers.

Example 1: A = 20, B = 30, C = 1, m = A+B-C = 49

Number of bugs at 80% confidence: 5*(49-1) = 240

pA = 20/240 = 0.08

pB = 30/240 = 0.13

total = 49/240 = 0.20

Example 2: A = 20, B = 30, C = 18, m = A+B-C = 32

Number of bugs at 80% confidence: 5*(32-1) = 155

pA = 20/155 = 0.13

pB = 30/155 = 0.19

total = 32/155 = 0.20

This is certainly better for Example 1 both with regards to the 95% confidence, but also with regards to the Lincoln Index - even the improved estimate.

## Conclusion

I didn’t know about the Lincoln Index, so I learned something new today - that is always good. The original application to estimate the number of bugs in total seems good, at least better than disregarding data from the trenches.

John D. Cook suggests calibrating through experiments. This blog post has been a thought experiment on some of the deliveries presented by the data and an unrealistic application of the German Tank Problem - the odds of getting the “tanks” in sequence diminishes quickly, thus improvements can be applied to the *m* estimate.

Cutting the confidence level from 95% to 80% may seem drastic - and it is - as it cuts 75% off of the number of expected bugs, but for thought experiments it may be good enough.

QAs are valuable, and there is value in having several (at least 2, but 4 is better) to test a product.

## Resources:

http://leankit.com/blog/2015/12/how-many-bugs-are-left-the-software-qa-puzzle/

https://en.wikipedia.org/wiki/German_tank_problem

https://en.wikipedia.org/wiki/Lincoln_index

http://c2.com/cgi/wiki?LinesOfCode

http://www.mayerdan.com/ruby/2012/11/11/bugs-per-line-of-code-ratio/