A programming language metric
Most metrics for comparing programming languages (lines of code, number of symbols, compressed lines of code) hover between useless and harmful. Most of them share one fundamental problem: they compare apples and oranges. Here’s a way to get past that hurdle. The resulting metric still seems broken and unjustified, but it’s an improvement.
A data model is a set of data structures plus all the operators on them. I don’t mean the implementation of these, I mean the abstract mathematical definition. The relational model, for example, is a mathematical definition distinct from any given database implementation. It includes both the underlying structure (the relation and a set of projectors on the tuples which constitute the relation) and all the operations to manipulate relations.
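To make "structure plus operators" concrete, here is a minimal sketch of a tiny fragment of the relational model in Python. The class name `Relation` and its methods are my own illustrative choices, not a standard API, and a real implementation of the model would need many more operators (join, difference, rename, and so on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical toy relation: named attributes plus a set of tuples."""
    attributes: tuple   # attribute names, in order
    rows: frozenset     # set of tuples, one value per attribute

    def project(self, *names):
        # Keep only the named attributes; duplicates vanish because rows is a set.
        idx = [self.attributes.index(n) for n in names]
        return Relation(tuple(names),
                        frozenset(tuple(r[i] for i in idx) for r in self.rows))

    def select(self, predicate):
        # Keep the rows for which predicate(row-as-dict) is true.
        return Relation(self.attributes,
                        frozenset(r for r in self.rows
                                  if predicate(dict(zip(self.attributes, r)))))

    def union(self, other):
        # Run-time check: the kind of assurance code the experiment would strip.
        if self.attributes != other.attributes:
            raise ValueError("union requires identical attributes")
        return Relation(self.attributes, self.rows | other.rows)
```

The check in `union` is an example of code that adds nothing to the mathematical definition but moves the implementation closer to it.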
Given a particular data model, how much code is required to ensure that an implementation of that particular data model is sufficiently close to the mathematical ideal that the programmer can treat it as such in all further work? The only difficulty is saying when an implementation is sufficiently iron-clad to be so treated. This can be done experimentally.
We are comparing languages A and B. We take a group of subjects who all know both languages. Each produces an implementation of the same data model in both languages.
Different programmers may have different tolerances for abstraction leakage. To fix this, take each implementation and mark all the testing code (unit tests, run-time checks, etc.). Partition it randomly into equal subsets. Sequentially remove subsets of testing code. This gives a sequence of monotonically less assured mutilations of the original implementation.
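The partition-and-remove step can be sketched as follows. This is only one way to do it, under assumptions of my own: testing code has already been split into discrete units (say, one unit per test function or run-time check), `mutilation_sequence` is a made-up name, and "equal subsets" is relaxed to "sizes differing by at most one" when the count does not divide evenly.

```python
import random


def mutilation_sequence(test_units, num_subsets, seed=0):
    """Return the retained test units at each mutilation level.

    Level 0 keeps everything; the last level keeps nothing.
    """
    rng = random.Random(seed)
    shuffled = list(test_units)
    rng.shuffle(shuffled)
    # Deal the shuffled units round-robin so subset sizes differ by at most one.
    subsets = [shuffled[i::num_subsets] for i in range(num_subsets)]
    levels = []
    for removed in range(num_subsets + 1):
        # Drop the first `removed` subsets, keep the rest.
        kept = [unit for subset in subsets[removed:] for unit in subset]
        levels.append(kept)
    return levels
```

Each level is a subset of the one before it, which is what makes the sequence monotonically less assured.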
Give each programmer who submitted an implementation a randomly chosen mutilation of each implementation he didn’t write. He marks each of them as iron-clad or leaky.
When we have all the marks, we find, for each implementation, the level of mutilation at which some fixed fraction of reviewers, say 95%, still mark it as iron-clad. This gives us a distribution of the amount of code for each language, controlled for how faithfully it implements a data model, and we turn to standard statistical techniques to ask whether the distributions are different, and how different.
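Here is a rough sketch of that analysis, with several assumptions of mine baked in. The threshold is read as "the deepest level reached before approval first drops below the required fraction", and a permutation test on the difference in mean code size stands in for "standard statistical techniques"; other tests would do just as well. The function names are hypothetical.

```python
import random
from statistics import mean


def threshold_level(marks_by_level, required=0.95):
    """marks_by_level[i] holds booleans (True = iron-clad) for mutilation level i.

    Returns the last level, walking from the unmutilated level 0, before the
    iron-clad fraction first falls below `required`; None if level 0 fails.
    """
    last_ok = None
    for level, marks in enumerate(marks_by_level):
        if not marks or sum(marks) / len(marks) < required:
            break
        last_ok = level
    return last_ok


def permutation_p_value(sizes_a, sizes_b, rounds=10000, seed=0):
    """Two-sided permutation test on the difference in mean code size."""
    rng = random.Random(seed)
    observed = abs(mean(sizes_a) - mean(sizes_b))
    pooled = list(sizes_a) + list(sizes_b)
    n_a = len(sizes_a)
    hits = 0
    for _ in range(rounds):
        # Reassign language labels at random and recompute the difference.
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    return hits / rounds
```

The code size fed into `permutation_p_value` would be the size of each implementation at its threshold level, one number per implementation per language.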
This presupposes that a shorter program that truly does the same thing as a longer one is better.