Simple insights into source code bases

Does the code base scream the domain?

I was wondering whether or not it would be possible to use parts of PageRank ( to gain insights into a code base. If PageRank works on web pages to ascertain what the contents of the page relates to, then likely a similar way could be construed for source code.

The simplest thing that could possibly work?
I chose the n-gram ( approach - unigram to be specific. While bi- and tri-grams are better for text, I’m not so sure for code bases, nevertheless, it could be tested.

The simple process

  • Find all files of a specific language inside the project structure. Likely it would be prudent to examine source and test code independently
  • Remove all for of new-lines
  • Tokenize on non alphanumeric entities
  • Build histogram of these tokens

Removing comments and possibly strings would likely be a good idea, but that would require parsing and not just bash.

find . -name “*.java” -type f | xargs cat | tr -d ‘\n’ | tr -d ‘\r’| tr -cs ‘[:alnum:]‘ ‘\n’ | sort | uniq -c | sort -rn > wordfreq.txt

Looking at gerrit’s word frequency, we get something along these lines:

  • 27560 import
  • 25092 the
  • 21758 com
  • 21385 google
  • 16615 License
  • 14676 public
  • 14544 gerrit
  • 13553 String
  • 12309 final
  • 11823 return
  • 10431 private
  • 10191 new
  • 9809 if
  • 8940 this
  • 8196 0
  • 7665 in
  • 7225 void
  • 7163 under
  • 6809 a
  • 6590 null
  • 6389 server
  • 6234 client
  • 6185 static
  • 6125 for
  • 6024 2
  • 5965 org
  • 5953 to
  • 5384 or
  • 5212 class
  • 4972 may
  • 4963 Override
  • 4934 name
  • 4923 get
  • 4752 distributed
  • 4666 of
  • 4602 java
  • 4492 throws
  • 4392 n
  • 4164 is
  • 3705 e

Reading it “import the com google License public gerrit String final return private new if this 0 in void under a null server client static for 2 org to or class may Override name get distributed of java throws n is e” doesn’t quite make sense. Clearly the “License” and namespace “” influences heavily.

Removing the keywords we get:
“the com google License gerrit 0 in under a server client 2 org to or may name get distributed of n is e”

It is not as if the source code really screams what gerrit is about. From Chinese Whisper reconstruction I get something about a “client server with name distribution” - not quite the “Gerrit provides web based code review and repository management for the Git version control system” tagline.

The frequency count drops rapidly - let’s pull the data into R to see if there are some patterns.

gerrit <- read.table(”wordfreq.txt”, header=F)
f <-$V1))
f$Var1 <- as.numeric(as.character(f$Var1))
plot(log(f), type=”l”, xlab=”log(frequency)”, ylab=”log(count)”, main =”Gerrit source code tokens\nlog-log plot”)

gerrit loglog plot

This seems to be a power law distribution, but with a lot of outliers above 7 (corresponding to around 1100) - and with an anomaly just short of 8 (corresponding to 2374 to be exact). This is quite likely the template License.

gerrit[gerrit$V1 == 2374,]
V1 V2
101 2374 Unless
102 2374 Licensed
103 2374 LICENSE
104 2374 law
105 2374 governing
106 2374 express
108 2374 compliance
109 2374 BASIS
110 2374 agreed

Plotting the more conformant data

k <- f[f$Var1 <1100,]
plot(log(k), type=”l”, xlab=”log(frequency)”, ylab=”log(count)”, main =”Gerrit source code tokens\nfrequency < 1100\nlog-log plot”)
abline(glm(log(k$Freq ) ~ log(k$Var1)), col=”red”)

glm(log(k$Freq ) ~ log(k$Var1))

Call: glm(formula = log(k$Freq) ~ log(k$Var1))

(Intercept) log(k$Var1)
8.554 -1.372

Degrees of Freedom: 495 Total (i.e. Null); 494 Residual
Null Deviance: 1299
Residual Deviance: 152.9 AIC: 829.7

[1] 510.1444

gerrit loglog < 1100

So, we should likely look at values with the frequency in this area to get a better suggestion for what the code base is used for.

gerrit[gerrit$V1 < 600 & gerrit$V1 >= 500,]
V1 V2
240 598 code
241 597 url
242 592 rw
243 590 values
244 589 label
245 581 plugin
246 580 v
247 563 ctx
248 561 Result
249 558 Util
250 550 UUID
251 544 2013
252 541 bind
253 538 cb
254 533 IdentifiedUser
255 532 err
256 531 u
257 530 o
258 528 substring
259 526 master
260 525 Repository
261 522 CurrentUser
262 522 as
263 521 res
264 520 dom
265 517 assertEquals
266 516 token
267 508 start
269 508 interface
270 507 lang
271 506 servlet
272 500 Object

This has a better match with the core of the project, though we still see comment debris, e.g. “2013″

Gerrit can be found at - I was looking at the codebase from 02bafe0f4c51aa24b2b05d4d1309ecfc828762c0 (January 20th, 2016)

Independence check

With the previous information - and the notion of a vector representation - I thought about the possibility to check for independence.

If two vectors are independent, then they should be orthogonal. If two code bases are independent, then they should be orthogonal in their domain vectors. To test this, we can try to plot the words used in the code bases. Naturally, we would need to strip away the language keywords, but as we will see, this is not quite as necessary as expected. We can even gain other insights by looking at the keyword uses.

So, as above, I created word frequence files for two JavaScript projects.

p1 <- read.table(”p1-wordfreq.txt”, header=F)
p2 <- read.table(”p2-wordfreq.txt”, header=F)

We don’t really want the exact count, so we pick the relative frequencies

p1$V1 <- p1$V1/max(p1$V1)
p2$V1 <- p2$V1/max(p2$V1)

Now, we only want to look at the tokens they have in common to see whether or not they are orthogonal - the tokens not common are already orthogonal.

common <- merge(p1, p2, by = “V2″)

plot(common$V1.x, common$V1.y, xlab=”p1″, ylab=”p2″, main=”Comparing p1 and p2″)

comparing JavaScript projects p1 and p2

Next, we want to identify the JavaScript keywords.

js <- read.table(”JavaScriptKeywords.txt”, header=F)
names(js) <- “V2″ # js is a single column, we want to merge on the keywords in the same column names
js2 <- merge(js, common, by=”V2″)
points(js2$V1.x, js2$V1.y, pch=19, col=”red”)

# mark the 20% in both directions, thus we get a Pareto segmentation
abline(h=.2, col=”blue”)
abline(v=.2, col=”blue”)

high <- common[common$V1.x > .2 & common$V1.y > .2,]

The most frequently used non-keywords:

high[-(match(intersect(high$V2, js2$V2), high$V2)),]
V2 V1.x V1.y
34 data 0.4170306 0.2444444
49 err 0.5545852 0.4555556
50 error 0.3013100 0.8000000
115 censored 0.6812227 0.6888889
131 settings 0.2052402 0.2111111

The second to last in this list has been censored, it does provide an indication that the projects aren’t quite independent. The error, err, and data are so common and nondescript that it is somewhat okay to find them in this area, though I’d rather have less callback functions and better names in general.

The most frequently used keywords:

high[(match(intersect(high$V2, js2$V2), high$V2)),]
V2 V1.x V1.y
47 else 0.3449782 0.4444444
65 function 1.0000000 0.8000000
72 if 0.4716157 0.5000000
154 var 1.0000000 0.6444444

Again this can be explained by a lot of callbacks, which are often on the form:

function(err, data) {
} else {

Another explanation could be lots of anonymous functions, though usually callback.


Removing comments and imports should provide for a better picture of the code base. Even so, it seems to not exactly scream the domain or architecture.

Bi-grams could be another improvement.

Independence check of supposedly independent projects may reveal that they aren’t or that the code is skewed towards an unwanted design.

It is far from perfect, but as always it brings a different way of looking at the code base, and it is relatively quick to do.

Comparing large code bases somewhat defeats the purpose as regression to the mean tells nothing much of interest. Taking Gerrit as an example, then the most used token is “import”, which is used 27560 times and as we saw above, the interesting parts reveal themselves around 1100 uses, which is less than 4%.

comparing gerrit to dotCMScomparing gerrit to dotCMS (loglog)

Comparing Gerrit and an old repo I had of dotCMS, we find that the most used keywords including entities in java.lang are:


Which could indicate a lot of String constants and conditional logic (with return statements instead of else clauses), and with a possibility of Primitive Obsession - well, the web does call for a lot of String use.

Comments are closed.