The role of object-oriented metrics

Bertrand Meyer

A shorter variant of this article appeared in Computer (IEEE), as part of the Component and Object Technology department, in the November 1998 issue.

The request I hear most commonly, when talking to project managers using object technology or about to use it, is for more measurement tools. Some of those people would kill for anything that can give them some kind of quantitative grasp on the software development process.

There is in fact an extensive literature on software metrics, including for object-oriented development, but surprisingly few publications are of direct use to actual projects. Those that are often go back quite far in time; an example is Barry Boehm's 1981 Software Engineering Economics (Prentice Hall), with its COCOMO cost prediction model: despite the existence of many more recent works on the subject, it is still among the most practical sources of quantitative information and methodology.

Metrics are not everything. Lord Kelvin's famous observation that

When you cannot measure, when you cannot express [what you are speaking about] in numbers, your knowledge is of a meager and unsatisfactory kind: you have scarcely, in your thoughts, advanced to the stage of a science

is exaggerated. Large parts of mathematics, including most of logic, are not quantitative; but we don't dismiss them as non-scientific. This also puts in perspective some of the comments published recently in this magazine (July 1998) by Walter Tichy and Marvin Zelkowitz on the need for more experimentation, which were largely a plea for more quantitative data. I agree with their central argument -- that we need to submit our hypotheses to the test of experience; but when Tichy writes

Zelkowitz and Wallace also surveyed journals in physics, psychology, and anthropology and again found much smaller percentages of unvalidated papers [i.e. papers not supported by quantitative evaluation] than in computer science

one cannot help thinking: physics, OK -- but do we really want to take psychology as the paragon of how "scientific" computer science should be? I don't think so. In an engineering discipline we cannot tolerate the fuzziness that is probably inevitable in social sciences. If we are looking for rigor, the tools of mathematical logic and formal reasoning are crucial, even though they are not quantitative.

Still, we need better quantitative tools. Numbers help us understand and control the engineering process. In this column I will present a classification of software metrics and five basic rules for their application.

Types of metrics

The first rule of quantitative software evaluation is that if we collect or compute numbers we must have a specific intent related to understanding, controlling or improving software and its production.

This implies that there are two broad kinds of metrics: product metrics that measure properties of the software products; and process metrics that measure properties of the process used to obtain these products.

Product metrics include two categories: external product metrics cover properties visible to the users of a product; internal product metrics cover properties visible only to the development team. External product metrics include:

Product non-reliability metrics, assessing the number of remaining defects.
Functionality metrics, assessing how much useful functionality the product provides.
Performance metrics, assessing a product's use of available resources: computation speed, space occupancy.
Usability metrics, assessing a product's ease of learning and ease of use.
Cost metrics, assessing the cost of purchasing and using a product.

Internal product metrics include:

Size metrics, providing measures of how big a product is internally.
Complexity metrics (closely related to size), assessing how complex a product is.
Style metrics, assessing adherence to writing guidelines for product components (programs and documents).

Process metrics include:

Cost metrics, measuring the cost of a project, or of some project activities (for example original development, maintenance, documentation).
Effort metrics (a subcategory of cost metrics), estimating the human part of the cost and typically measured in person-days or person-months.
Advancement metrics, estimating the degree of completion of a product under construction.
Process non-reliability metrics, assessing the number of defects uncovered so far.
Reuse metrics, assessing how much of a development benefited from earlier developments.

Internal and external metrics

The second rule is that internal and process metrics should be designed to mirror relevant external metrics as closely as possible.

Clearly, the only metrics of interest in the long run are external metrics, which assess the result of our work as perceptible by our market. Internal metrics and process metrics help us improve this product and the process of producing it. They should always be designed so as to be eventually relevant to external metrics.

Object technology is particularly useful here because of its seamlessness, which reduces the gap between problem structure and program structure (the "Direct Mapping" property). In particular, one may argue that in an object-oriented context the notion of function point, a widely accepted measure of functionality, can be replaced by a much more objective measure: the number of exported features (operations) of relevant classes, which requires no human decision and can be measured trivially by a simple parsing tool.
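As a minimal sketch of such a count, assume that a parsing tool has already reduced each class to a list of feature clauses, each carrying its export restriction and the names of the features it declares. The ClassModel and FeatureClause types below are hypothetical, introduced only for this illustration; so is the decision to count a feature as exported only when its clause is unrestricted or names ANY, which leaves out selectively exported features.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatureClause:
    # Classes named in the clause's export restriction, e.g. ["NONE"].
    # An empty list stands for an unrestricted clause (available to all clients).
    exported_to: List[str]
    feature_names: List[str]


@dataclass
class ClassModel:
    name: str
    clauses: List[FeatureClause] = field(default_factory=list)


def is_public(clause: FeatureClause) -> bool:
    """Assumption: a clause is public if unrestricted or exported to ANY."""
    return not clause.exported_to or "ANY" in clause.exported_to


def exported_feature_count(classes: List[ClassModel]) -> int:
    """Count the features that every client class may call."""
    return sum(
        len(clause.feature_names)
        for cls in classes
        for clause in cls.clauses
        if is_public(clause)
    )


# Hypothetical input, as a parser might produce it.
stack = ClassModel("STACK", [
    FeatureClause([], ["put", "remove", "item"]),
    FeatureClause(["NONE"], ["representation"]),
])
print(exported_feature_count([stack]))
```

The point of the sketch is that no human judgment enters the count: given the same class texts, any two people running the tool obtain the same number.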

Designing metrics

The third rule is that any metric applied to a product or project should be justified by a clear theory of what property the metric is intended to help estimate.

The set of things we can measure is infinite, and most of them are not interesting. For example I can write a tool to compute the sum of all ASCII character codes in any program, modulo 53, but this is unlikely to yield anything of interest to product developers, product users, or project managers.
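To show how little effort a meaningless metric takes, here is a sketch of that very tool. The function name is made up for the illustration, and skipping non-ASCII characters is an assumption of the sketch.

```python
def ascii_sum_mod_53(program_text: str) -> int:
    """Sum of the ASCII codes of a program's characters, modulo 53.

    Characters outside ASCII are skipped (an assumption of this sketch).
    """
    ascii_bytes = program_text.encode("ascii", errors="ignore")
    return sum(ascii_bytes) % 53


print(ascii_sum_mod_53("class STACK feature end"))
```

The number is perfectly well defined and reproducible; what it lacks is any theory connecting it to a property anyone cares about.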

A simple example is a set of measurements that we performed some time ago on the public-domain EiffelBase library of fundamental data structures and algorithms, reported in the book Reusable Software (Prentice Hall). One of the things we counted was the number of arguments to a feature (attribute or routine): over 150 classes and 1850 features, we found an average of 0.4 and a maximum of three, with 97% of the features having two or fewer. We were not measuring this particular property blindly: it was connected to a very precise hypothesis that the simplicity of such interfaces is a key component of the ease of use and learning (and hence the potential success) of a reusable component library. These figures show a huge decrease compared to the average number of arguments for typical non-O-O subroutine libraries, often 5 or more, sometimes as many as 10. (Note that a C or Fortran subroutine has one more argument than the corresponding O-O feature, since the target object must be passed explicitly.)
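The three figures quoted above (average, maximum, share of features with at most two arguments) are straightforward to compute once the argument counts have been extracted. The following sketch assumes the counts are already available as a list of integers, one per feature; the sample data is invented and is not the EiffelBase data.

```python
from statistics import mean
from typing import Dict, List


def argument_count_summary(arg_counts: List[int]) -> Dict[str, float]:
    """Summarize the number of arguments per feature."""
    total = len(arg_counts)
    at_most_two = sum(1 for n in arg_counts if n <= 2)
    return {
        "features": total,
        "average": mean(arg_counts),
        "maximum": max(arg_counts),
        "share_with_at_most_two": at_most_two / total,
    }


# Invented counts for ten features of a hypothetical library.
sample_counts = [0, 0, 0, 1, 1, 2, 3, 0, 0, 1]
print(argument_count_summary(sample_counts))
```

For a non-O-O library, a fair comparison would subtract one from each subroutine's count to account for the explicit target argument.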

Sometimes people are skeptical of the reuse claims of object technology; after all, their argument goes, the idea of reuse has been around for a long time, so what's so special about objects? Quantitative arguments such as those provided by the EiffelBase measurements offer some concrete evidence to back the O-O claims.

The third rule requires a theory, and implies that the measurements will only be as good as the theory. Indeed, the correlation between small number of feature arguments and ease of library use is only a hypothesis. Authors such as Zelkowitz and Tichy might argue that the hypothesis must be validated through experimental correlation with measures of ease of use. They would have a point, but the first step is to have a theory and make it explicit. Experimental validation is seldom easy anyway, given the setup of many experiments, which often use students under the sometimes dubious assumption that their reactions can be used to predict the behavior of professional programmers. In addition, it is very hard to control all the variables. For example I recently found out, by going back to the source, that a nineteen-seventies study often used to support the use of semicolons as terminators rather than separators seemed to rely on an unrealistic assumption which casts doubt on the results.

Two PhD theses at Monash University, by Jon Avotins and Glenn Maughan under the supervision of Christine Mingins, have applied these ideas further by producing a "Quality Suite for Reusable Software". Starting from several hundred informal methodological rules in the book "Reusable Software" and others, they identified the elements of these rules that could be subject to quantitative evaluation, defined the corresponding metrics, and produced tools that evaluate these metrics on submitted software. Project managers or developers using these tools can assess the values of these measurements on their products.

In particular, you can compare the resulting values to industry-wide standards or to averages measured over your previous projects. This brings us to the fourth rule, which states that measurements are usually most useful in relative terms.

Calibrating metrics

More precisely, the fourth rule is that most measurements are only meaningful after calibration and comparison to earlier results.

This is particularly true of cost and reliability metrics. A sophisticated cost model such as COCOMO will become more and more useful as you apply it to successive projects and use the results to calibrate the model's parameters to your own context. As you move on to new projects, you can use the model with more and more confidence based on comparisons with other projects.
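To make the idea of calibration concrete, here is a minimal sketch. It uses only the power-law form of Basic COCOMO, effort = a * KLOC^b, and ignores the cost drivers of the fuller models. Fitting a and b by least squares on logarithms of past project data is an assumption of this sketch, not necessarily the calibration procedure Boehm describes, and the project figures are invented.

```python
import math
from typing import List, Tuple


def calibrate(projects: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Fit effort = a * kloc ** b to past projects.

    projects: (size in KLOC, actual effort in person-months) pairs;
    at least two projects of different sizes are needed.
    Returns (a, b) from a straight-line fit of log(effort) on log(kloc).
    """
    xs = [math.log(kloc) for kloc, _ in projects]
    ys = [math.log(effort) for _, effort in projects]
    n = len(projects)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(y_mean - b * x_mean)
    return a, b


def estimate_effort(kloc: float, a: float, b: float) -> float:
    """Predicted effort in person-months for a project of the given size."""
    return a * kloc ** b


# Invented history: (KLOC, person-months) for three completed projects.
history = [(12.0, 40.0), (45.0, 170.0), (110.0, 460.0)]
a, b = calibrate(history)
print(round(estimate_effort(60.0, a, b), 1))
```

Each completed project adds a data point to the history, so the fitted parameters drift toward values that reflect your own team and context rather than published defaults.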

Similarly, many internal product metrics are particularly useful when taken relatively. Presented with an average argument count measure of 4 for your newest library, you will not necessarily know what it means -- good, bad, irrelevant? Assessed against published measures of goodness, or against measures for previous projects in your team, it will become more meaningful. Particularly significant are outlying points: if the average value for a certain property is 5 with a standard deviation of 2, and you measure 10 for a new development, it's probably worth checking further, assuming of course (rule 3) that there is some theory to support the assumption that the measure is relevant. This is where tools such as the Monash suite can be particularly useful.
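The outlier check in this example amounts to a few lines of arithmetic. The sketch below expresses the distance from the average in standard deviations; the threshold of 2 is an arbitrary choice made for the illustration, not a recommended standard.

```python
def deviations_from_average(value: float, average: float, std_dev: float) -> float:
    """How many standard deviations a measurement lies from the average."""
    return (value - average) / std_dev


def worth_checking(value: float, average: float, std_dev: float,
                   threshold: float = 2.0) -> bool:
    """Flag a measurement lying more than `threshold` deviations away.

    The default threshold is an assumption of this sketch.
    """
    return abs(deviations_from_average(value, average, std_dev)) > threshold


# The figures from the text: average 5, standard deviation 2, new value 10.
print(deviations_from_average(10, 5, 2))  # 2.5
print(worth_checking(10, 5, 2))           # True
```

A flag of this kind only says where to look; whether the unusual value signals a problem still depends on the theory behind the metric.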

Metrics and the measuring process

The fifth rule is that the benefits of a metrics program lie in the measuring process as well as in its results.

The software metrics literature often describes complex models purporting to help predict various properties of software products and processes by measuring other properties. It also contains lots of controversy about the value of the models and their predictions. But even if we remain theoretically skeptical of some of the models, we shouldn't throw away the corresponding measurements. The very process of collecting these measurements leads (as long as we confine ourselves to measurements that are meaningful, at least by some informal criteria) to a better organization of the software process and a better understanding of what we are doing. This idea explains the attraction and usefulness of process guidelines such as the Software Engineering Institute's Capability Maturity Model, which encourage organizations to monitor their processes and make them repeatable, in part through measurement. To quote Emmanuel Girard, a software metrics expert, in his advice for software managers: before you take any measures, take measurements.