Design by Contract:
The Lessons of Ariane
by Jean-Marc Jézéquel, IRISA
and Bertrand Meyer, ISE
This article appeared in a slightly
different form in Computer (IEEE), as part of the Object-Oriented department, in January
1997 (vol. 30, no. 1, pages 129-130).
Reader reactions to the article published in IEEE's
Computer magazine appear at the end of the article.
Keywords: Contracts,
Ariane, Eiffel, reliable software, correctness, specification, reuse, reusability, Java,
CORBA, IDL.
How not to test your software
Several earlier columns in IEEE Computer have emphasized the importance of Design by Contract(TM) for constructing
reliable software. A $500-million software error provides a sobering reminder that this
principle is not just a pleasant academic ideal.
On June 4, 1996, the maiden flight of the European Ariane 5 launcher crashed about 40
seconds after takeoff. Media reports indicated that the amount lost was half a billion
dollars -- uninsured.
The CNES (French National Center for Space Studies) and the European Space Agency
immediately appointed an international inquiry board, made up of respected experts from major
European countries, who produced their report in little more than a month. These agencies
are to be commended for the speed and openness with which they handled the disaster. The
committee's report is available on the Web.
It is a remarkably short, simple, clear and forceful document.
Its conclusion: the explosion was the result of a software error -- possibly the
costliest in history (at least in dollar terms, since earlier cases have caused loss of
life).
Particularly vexing is the realization that the error came from a piece of the software
that was not needed during the flight. It has to do with the Inertial Reference
System, for which we will keep the acronym SRI used in the report (it derives from the
system's French name), if only to avoid the unpleasant connotation that the reverse
acronym could evoke for US readers. Before lift-off certain computations are performed to
align the SRI. Normally they should be stopped at -9 seconds; but in the unlikely event of
a hold in the countdown, resetting the SRI could, at least in earlier versions of Ariane,
take several hours, so the computation continues for 50 seconds after the start of flight
mode -- well into the flight period.
After takeoff, of course, this computation is useless; but in the Ariane 5 flight it
caused an exception, which was not caught and -- boom.
The exception was due to a floating-point error: a conversion from a 64-bit floating-point
number to a 16-bit signed integer, which should only have been applied to a number less
than 2^15, was erroneously applied to a greater number, representing the "horizontal bias"
of the flight. There was no explicit exception handler to catch the exception, so it
followed the usual fate of uncaught exceptions and crashed the entire software, hence the
on-board computers, hence the mission.
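To make the numbers concrete: a signed 16-bit integer can only hold values in the interval
from -32768 (that is, -2^15) to 32767 (2^15 - 1). As a minimal sketch -- our own
illustration in Eiffel, not the Ariane code, which was in Ada -- the representability
condition that the conversion silently assumed can be written as a simple query:

   fits_in_16_bits (v: DOUBLE): BOOLEAN is
         -- Can `v', once truncated, be represented as a signed 16-bit integer?
      do
         Result := v >= -32768.0 and v < 32768.0
      end

The Ariane 4 analysis had established that the horizontal bias would always satisfy such a
condition; as discussed below, that guarantee did not carry over to Ariane 5.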
This is the kind of trivial error that we are all familiar with (raise your hand if you
have never done anything of the sort), although fortunately the consequences are usually
less expensive. How in the world could it have remained undetected, and produced such a
horrendous outcome?
Is this incompetence?
No. Everything indicates that the software process was carefully organized and planned.
The ESA's software people knew what they were doing and applied widely accepted industry
practices.
Is it an outrageous software management problem?
No. Obviously something went wrong in the validation and verification process
(otherwise there would be no story to write), and the Inquiry Board makes a number of
recommendations to improve the process; but it is clear from its report that systematic
documentation, validation and management procedures were in place.
The contention often made in the software engineering literature that most software
problems are primarily management problems is not borne out here. The problem is
technical. (Of course one can always argue that good management will spot technical
problems early enough.)
Is it the programming language's fault?
Although one may criticize the Ada exception mechanism, it could have been used here to
catch the exception. In fact, quoting the report:
Not all the conversions were protected because a maximum workload target of 80% had
been set for the SRI computer. To determine the vulnerability of unprotected code, an
analysis was performed on every operation which could give rise to an ... operand error.
This led to protection being added to four of [seven] variables... in the Ada code.
However, three of the variables were left unprotected.
In other words the potential problem of failed arithmetic conversions was recognized.
Unfortunately, the variable that caused the fatal exception was among the three left
unmonitored, not the four that were protected.
Is it a design error?
Why was the exception not monitored? The analysis had shown that overflow (a horizontal
bias not fitting in a 16-bit integer) could not occur. Was the analysis wrong? No! It was
right -- for the Ariane 4 trajectory. For Ariane 5, whose trajectory
parameters are different, it no longer holds.
Is it an implementation error?
Although one may criticize the removal of a protection to achieve more performance (the
80% workload target), it was justified by the theoretical analysis. To engineer is to make
compromises. If you have proved that a condition cannot happen, you are entitled not to
check for it. If every program checked for all possible and impossible events, no useful
instruction would ever get executed!
Is it a testing error?
Not really. Not surprisingly, the Inquiry Board's report recommends better testing
procedures, and testing the whole system rather than parts of it (in the Ariane 5 case the
SRI and the flight software were tested separately). But while one can always test more,
one cannot test everything. Testing, we all know, can show the presence of errors, not their absence. And
the only fully "realistic" test is to launch; this is what happened, although
the launch was not really intended as a $500-million test of the software.
So what is it?
It is a reuse error. The SRI horizontal bias module was reused from
10-year-old software: the software of Ariane 4.
But this is not the full story:
It is a reuse specification error
The truly unacceptable part is the absence of any kind of precise specification
associated with a reusable module.
The requirement that the horizontal bias should fit in 16 bits was in fact stated in an
obscure part of a document. But in the code itself it was nowhere to be found!
From the principle of Design by Contract expounded by earlier columns, we know that any
software element that has such a fundamental constraint should state it explicitly, as
part of a mechanism present in the programming language, as in an Eiffel routine whose
precondition states clearly and precisely what the input must satisfy to be acceptable.
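Here is a minimal sketch of what such a routine could look like, with hypothetical names
and types (the actual Ariane code is not public), reusing the query sketched earlier:

   convert (horizontal_bias: DOUBLE): INTEGER is
         -- Signed 16-bit integer value of `horizontal_bias'
      require
         representable: fits_in_16_bits (horizontal_bias)
      do
         Result := horizontal_bias.truncated_to_integer
      ensure
         in_range: Result >= -32768 and Result <= 32767
      end

The constraint is now part of the routine's interface: every caller must establish the
precondition before the call, and any violation points directly at the guilty caller.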
Does this mean that the crash would automatically have been avoided had the mission
used a language and method supporting built-in assertions and Design by Contract? Although
it is always risky to draw such after-the-fact conclusions, the answer is probably yes:
- Assertions (preconditions and postconditions in particular) can be automatically turned
on during testing, through a simple compiler option. The error might have been caught
then, as sketched at the end of this list.
- Assertions can remain turned on during execution, triggering an exception if violated.
Given the performance constraints on such a mission, however, this would probably not have
been the case.
- But most importantly, assertions are a prime component of the software and its
documentation (the "short form", produced automatically by tools). In an environment
such as that of Ariane where there is so much emphasis on quality control and thorough
validation of everything, they would be the QA team's primary focus of attention. Any team
worth its salt would have checked systematically that every call satisfies the
precondition. That would have immediately revealed that the Ariane 5 calling software did
not meet the expectation of the Ariane 4 routines that it called.
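To see what the first of these points means in practice, consider (with the hypothetical
routine sketched above, and a made-up value) a call from a module that, like the Ariane 5
flight software, supplies an argument outside the specified range:

   bias := convert (50000.0)
         -- With assertion monitoring on, execution stops here with a
         -- precondition violation citing the tag `representable'
         -- and identifying the caller as the culprit.

Instead of an untrapped operand error occurring deep inside the conversion at flight time,
the developer gets, at the point of the call during testing, the precise information that
the caller has failed to meet the routine's published expectation.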
The lesson for every software developer
The Inquiry Board makes a number of recommendations with respect to improving the
software process of the European Space Agency. Many are justified; some may be overkill;
some may be very expensive to put in place. There is a simpler lesson to be learned
from this unfortunate event:
Reuse without a contract is
sheer folly!
From CORBA to C++ to Visual Basic to ActiveX to Java, the hype is on
software components. The Ariane 5 blunder shows clearly that naïve hopes are doomed to
produce results far worse than a traditional, reuse-less software process. To
attempt to reuse software without Eiffel-like assertions is to invite failures with
potentially disastrous consequences. The next time around, will it only be an empty
payload, however expensive, or will it be human lives?
It is regrettable that this lesson has not been heeded by such recent
designs as Java (which added insult to injury by removing the modest assert facility of C!), IDL (the Interface Definition
Language of CORBA, which is intended to foster large-scale reuse across networks, but
fails to provide any semantic specification mechanism), Ada 95 and ActiveX.
For reuse to be effective, Design by Contract is a requirement. Without a
precise specification attached to each reusable component -- precondition, postcondition,
invariant -- no one can trust a supposedly reusable component.
Reader reactions
The February 1997 issue of IEEE Computer contained two letters
from readers commenting on the article. Here are some extracts from these letters and from
the response by one of the authors:
Tom DeMarco, The Atlantic Systems Guild (co-author of Peopleware):
Jean-Marc Jézéquel and Bertrand Meyer are precisely on-target
in their assessment of the Ariane-5 failure. This was the kind of problem that a
reasonable contracting mechanism almost certainly would have caught; the kind of problem
that almost no other defense would have been likely to catch.
I believe that the use of Eiffel-like module contracts is the most important
non-practice in software today.
Roy D. North, Falls Church, Va.:
Our designs must incorporate safety factors, and we must freeze the design
before we produce the product (the software).
Bertrand Meyer's response:
What software managers must understand is that Design by Contract is not a
pie-in-the-sky approach for special, expensive projects. It is a pragmatic set of
techniques available from several commercial and public-domain Eiffel sources and
applicable to any project, large or small.
It is not an exaggeration to say that applying Eiffel's assertion-based O-O development
will completely change your view of software construction ... It puts the whole issue of
errors, the unsung part of the software developer's saga, in a completely different light.
To learn more
An extensive discussion of Design by Contract and its consequences
on the software development process is in the following book:
Object-Oriented Software Construction, 2nd
edition
To learn more about using Eiffel to develop mission-critical systems,
read this book:
Object-Oriented Software
Engineering with Eiffel
Also, ISE's Web pages contain an introduction to the concepts of
Design by Contract.
For a contrarian perspective
For a different view of the issue (written in response to the
IEEE Computer article) see
Ken Garlington's paper. Although we disagree with Mr. Garlington's
analysis, as expressed in Usenet discussions, we feel it is part
of this site's duty to its readers to give them access to contrarian
views, letting them make up their own minds, for the greater
benefit of software quality.