Finding Hotspots In Your Codebase
It can be difficult to identify and prioritize problem areas in a large codebase. No one can hold a hundred thousand lines of code in their mind, especially when multiple teams work on different areas and there is legacy code that no one understands or feels responsible for.
In his book “Your Code as a Crime Scene”, Adam Thornhill draws on his knowledge of forensic psychology to create a number of techniques that help us find, prioritize and address these problem areas.
Finding Hotspots in your code
In forensic psychology, geographical profiling is where the locations of related crimes are mapped in order to predict the area an offender is likely to reside - the Hotspot. Hotspots allow the police to more effectively allocate their resources. They can investigate a smaller area instead of a whole city.
In a programming context, a hotspot is a complex area in our codebase that changes a lot (and so takes a lot of effort to maintain). Both complexity and change matter because neither tells us the full picture on its own. Looking at change alone, we might mistake a config file that changes a lot for a hotspot; looking at size alone, we might flag a stable and well-tested legacy file with hundreds of lines of code. Together, the number of lines of code (complexity) and the number of times a file has been modified (effort) are a good indicator of problem areas in our codebase (Gall and Krajewski, 2003).
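To make the "both measures together" idea concrete, here is a minimal sketch that keeps only the files that are both frequently changed and large. The rows, file names and thresholds are made up for illustration; they are not from the book or from a real codebase.

```shell
#!/usr/bin/env bash
# Hypothetical rows in "modified count,line count,filename" form.
rows=$'42,1800,src/orders.js\n57,40,config/settings.json\n2,2500,src/legacy-report.js\n5,90,src/util.js'

# Illustrative thresholds (assumptions, tune them for your own codebase).
min_changes=20
min_lines=500

# Keep only the files that are changed often AND are large.
hotspots=$(printf '%s\n' "$rows" |
  awk -F, -v c="$min_changes" -v l="$min_lines" '$1 >= c && $2 >= l {print $3}')

echo "$hotspots"
```

With this toy data the frequently edited config file and the large but stable legacy file both drop out, and only src/orders.js remains.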
Hotspot Shell Script
Thornhill’s book mentions a few tools to help us narrow down hotspots, including CodeCity and Code Maat. I decided to write this one-liner shell script instead (with the help of a colleague from work), mostly because I struggled to get the software working on my Mac because of operating system incompatibility.
echo "modified count,line count,filename" > hotspots.csv && join -j 2 -o 2.1,1.1,1.2 <(wc -l **/*.{js,ts}) <(git log --diff-filter=M --name-only | sort | uniq -c| grep src) | sort -r -n -k 1 -k 2| sed 's/ /,/g' >> hotspots.csv
echo "modified count,line count,filename" > hotspots.csv
Writes the column headers ‘modified count’, ‘line count’ and ‘filename’ to a file called “hotspots.csv”, creating the file or overwriting it if it already exists.
join
Joins two files.
-j 2
Match them on their second field, which in our case is the filename. Both files should already be sorted by filename.
-o 2.1,1.1,1.2
Output the first field from file two (number of modifications), followed by the first field from file one (line count), followed by the second field from file one (filename).
<()
Means “run the command between the brackets and treat the output as a file for the join command.” We do this twice, once for the line count and once for the modified count.
wc -l **/*.{js,ts}
wc -l counts the number of lines in all files in the current directory and all subdirectories (**/*) ending in .js (JavaScript) or .ts (TypeScript). This means we ignore config and JSON files, things that we expect to change and so do not count as hotspots. You don’t need this filter, and if you keep it, change the extensions to those that you care about.
git log --diff-filter=M --name-only
git log lists the commit history. --name-only adds the names of the files each commit touched, and --diff-filter=M restricts those to files that were modified. M stands for Modified; you can swap it for A for added, D for deleted, R for renamed and so on if you want to.
|
Means pipe the result of the previous command as input into the command on the right of the pipe.
sort
Sorts from least to most, which for filenames means alphabetically, so that identical names end up next to each other. We can reverse that by adding -r.
uniq -c
Counts the occurrences of each unique entry. If the input is a, a, a, b, b (one per line), the output is 3 a and 2 b.
grep src
Returns all of the lines that have src in them (files inside the src folder).
sort -r -n -k 1 -k 2
Sorts numerically (-n), first by key 1 (the first column of the output, which in this case is the number of times modified) and then by key 2 (the second column of the output, which in this case is the number of lines). Key 3 in this case would be the name of our file. -r reverses the order so that the most-modified files come first. Leave it out if you would rather have the hotspot files at the bottom, where you can see them in a terminal without having to scroll up.
sed 's/ /,/g' >> hotspots.csv
sed 's/ /,/g' replaces spaces with commas (otherwise the values appear wrong in the table). >> then appends the output to the hotspots.csv file, which at the moment just contains our table headers.
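The pieces above are easier to follow on a tiny input. The sketch below runs the same sort, uniq -c, join, sort and sed steps on made-up data instead of a real repository; the file names and numbers are hypothetical stand-ins for what git log and wc -l would print.

```shell
#!/usr/bin/env bash
# Toy stand-in for `git log --name-only`: one path per modification.
changes=$'src/b.js\nsrc/a.js\nsrc/b.js\nsrc/b.js\nsrc/a.js'

# sort puts identical names next to each other so uniq -c can count them.
# awk strips the padding uniq -c adds, which differs between systems.
modified=$(printf '%s\n' "$changes" | sort | uniq -c | awk '{print $1, $2}')

# Toy stand-in for `wc -l`: line count then path, already sorted by path.
lines=$'120 src/a.js\n800 src/b.js'

# Join on field 2 (the path); print modified count, line count, path.
joined=$(join -j 2 -o 2.1,1.1,1.2 \
  <(printf '%s\n' "$lines") \
  <(printf '%s\n' "$modified"))

# Most-modified first, then turn the spaces into commas.
result=$(printf '%s\n' "$joined" | sort -r -n -k 1 -k 2 | sed 's/ /,/g')

echo "$result"
```

On this toy input the script prints 3,800,src/b.js followed by 2,120,src/a.js: the file that is both larger and modified more often comes first.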
The result of running the one-liner was as follows:
The really cool thing about the table above is that the first file is the exact file the CTO of the startup I work at mentioned was the most nightmarish file in the entire codebase. Super cool!
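If you want to reuse the one-liner, it could be wrapped in a function along these lines. This is a sketch rather than the script I actually ran, and it makes a few assumptions: it runs in bash from the root of a git repository, the code lives under src/, and file names contain no spaces. It also swaps the **/* glob for git ls-files, so it does not depend on shell glob settings, and it asks git log for file names only, so commit messages never end up in the counts.

```shell
#!/usr/bin/env bash
# Sketch only: a function form of the one-liner.
# The body runs in a subshell, so the LC_ALL change does not leak out.
hotspots() (
  export LC_ALL=C   # same byte-wise ordering for sort and join

  echo "modified count,line count,filename"
  join -j 2 -o 2.1,1.1,1.2 \
    <(git ls-files -- 'src/*.js' 'src/*.ts' |
        while IFS= read -r f; do echo "$(wc -l < "$f") $f"; done |
        sort -k 2) \
    <(git log --diff-filter=M --name-only --pretty=format: |
        grep '^src/' | sort | uniq -c | sort -k 2) |
    sort -r -n -k 1 -k 2 | sed 's/ /,/g'
)
```

Calling hotspots > hotspots.csv from the repository root should then produce the same kind of table as the one-liner. Files that have never been modified after being added do not appear, because join only prints paths present in both inputs.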