I want to compare the source code of two stages of a project, e.g. a web application's source code before it was made scalable and the scalable version.
I'm interested in showing how many lines had to be changed, removed or added to get from one stage to the other. I'm searching for a good distance metric that rewards less code and small code changes. The one I imagine would output a relative value:
0% = "Both projects are the same"
50% = "Half of the source code has been changed"
100% = "Both projects have nothing in common"
Intuitively, I came up with a few solutions:
- `diff`: Maybe concatenate all files of each stage into a single source file and run a diff between the two. The problem here is that less code is better, but with this solution a removal is counted as a plain change, which punishes code removal.
- Levenshtein distance: Calculates the changes needed to transform source code `a` into source code `b`. The result is a number of character edits. The problem here, again, is that code removal is not rewarded but punished.
- Unified Code Count: Sets up rules for counting lines of code consistently, but it is not a descriptive distance metric between projects.
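To illustrate the problem with the plain diff approach, here is a minimal sketch using Python's standard `difflib`. The function name `diff_counts` and the toy inputs are my own and purely hypothetical; a real run would feed in the concatenated lines of each project stage.

```python
import difflib


def diff_counts(old_lines, new_lines):
    """Return (added, removed) line counts from a line-level diff."""
    added = removed = 0
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" contributes to both sides: old lines go away, new ones appear.
        if tag in ("replace", "delete"):
            removed += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    return added, removed


# Hypothetical example: the new stage only deletes two lines.
old = ["a", "b", "c", "d"]
new = ["a", "d"]
print(diff_counts(old, new))  # (0, 2)
```

A pure cleanup like this still registers as two changed lines if `added + removed` is used as the distance, which is exactly the punishment for code removal that I would like to avoid.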
So I'm searching for a metric that is descriptive, rewards code removal and only counts code changes or additions. It doesn't have to be source-code specific; both projects use the same language. My gut feeling points in the `diff` direction, but I have not come up with a satisfactory descriptive metric.
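To make the requirement more concrete, this is roughly the kind of value I have in mind, sketched with Python's `difflib`. The name `change_ratio` and the choice to divide by the size of the new version are my own assumptions, not an established metric.

```python
import difflib


def change_ratio(old_lines, new_lines):
    """Share of lines in the new version that were added or changed.

    Removed lines are ignored on purpose, so deleting code is never punished.
    Assumption: the result is normalised by the length of the new version.
    """
    if not new_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    touched = sum(
        j2 - j1
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag in ("replace", "insert")
    )
    return touched / len(new_lines)


print(change_ratio(["a", "b"], ["a", "b"]))  # 0.0 -> both are the same
print(change_ratio(["a", "b"], ["a", "x"]))  # 0.5 -> half has been changed
print(change_ratio(["a", "b"], ["x", "y"]))  # 1.0 -> nothing in common
```

The weakness of this sketch is that a pure deletion also scores 0%, so removal is tolerated rather than actively rewarded, which is why I don't find it satisfactory yet.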
What would you propose?