
Why Systems Fail

by Ben Meadowcroft.


1. Introduction:

The aim of this report is to put forward the major causes of systems failure, to analyse the proposed causes and to justify them with examples taken from the recent past (Section 2). The report then proposes a classification scheme into which the various proposed and justified factors may be placed; this scheme provides a generalised view of the areas where system failure may be caused (Section 3). This generalised view then allows us to identify some general techniques and practices which will greatly reduce the number of system failures and thus decrease the impact these failures have (Section 4). The system failures used as examples here are not just those that cause the complete collapse of a system, but also those that, through the erroneous operation of the system, make a large impression in other ways, e.g. loss of money, life, functionality, etc.

2. Analysis of Causes of System Failure:

2.1 Poor development practices.

Poor development practices are one of the most significant causes of system failure, largely because of the complex nature of modern software.

An example of poor development practices causing a system failure can be found in the experience of the Pentagon's National Reconnaissance Office (NRO), where inadequate testing of the delivery system of the Titan IV rocket led to failure. Two Titan rockets were lost, meaning that expensive military equipment essential to the U.S. Government's defence program (namely early warning satellites) could not be deployed. The head of the NRO attributed the error to "a misplaced decimal point" in the software which controlled the rocket. See http://spacer.com/spacecast/news/launchers-99l.html
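
As a rough sketch of how such an error could be caught before flight, the fragment below range-checks a configuration constant; the constant name, its value and the expected range are assumptions made for illustration, not the actual Titan IV figures.

    ROLL_FILTER_GAIN = -0.1992476    # mistyped; the intended value is roughly -1.99

    def check_constant(name, value, low, high):
        # Reject any configuration constant that falls outside its expected range.
        if not (low <= value <= high):
            raise ValueError(f"{name} = {value} is outside the expected range [{low}, {high}]")

    try:
        check_constant("ROLL_FILTER_GAIN", ROLL_FILTER_GAIN, -2.5, -1.5)
    except ValueError as error:
        print("pre-flight check failed:", error)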

Testing of systems already in operation is also important in being prepared for potential system failures; the most obvious example is the case of the "Y2K bug". There can be problems with testing operational systems, however, as the example of the Los Angeles emergency preparedness system shows: due to an error which occurred during testing, around four million gallons of raw sewage were dumped into the Sepulveda Dam area. See http://catless.ncl.ac.uk/Risks/20.46.html#subj2

2.2 Incorrect assumptions with regard to system requirements.

Incorrect assumptions may be made by the software developers or by the entity requiring the software system. So what exactly can faulty assumptions cause to happen? Many problems can result from faulty assumptions made by the development team. An example of this factor causing major problems can be found in the experience of the Nuclear Regulatory Commission.

A program called Shock II, which was developed to test models of nuclear reactors, miscalculated some of the figures needed to ensure that the reactors would survive a massive earthquake. The miscalculation was caused by a module, written by a summer student, which converted a vector into a magnitude by summing its components rather than summing their absolute values. This error in the earthquake testing system, discovered after the nuclear reactors had been built and were providing energy, meant that five nuclear power stations had to be closed down for checks and reinforcement. Analysing the potential problems within the plants and correcting them took months, during which time the utility companies had to provide electricity by other means, relying again on the more expensive oil and coal power stations. To make up for the shortfall produced by the closure of the nuclear plants, the older-style plants had to consume an estimated 100,000 barrels of oil a day or more. This meant a substantial loss of money for the utility companies, and the error would have posed serious potential health risks if it had not been discovered. See http://www2.southwind.net/~rwweeks/bugs.html
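
A minimal sketch of the kind of arithmetic error described above may make the distinction clearer; the function names and load values below are illustrative assumptions, not taken from the actual Shock II program.

    def resultant_conservative(components):
        # Intended behaviour: sum the absolute values of the components,
        # giving a conservative (worst-case) magnitude for the stress.
        return sum(abs(c) for c in components)

    def resultant_buggy(components):
        # The reported bug: summing the signed components lets positive and
        # negative contributions cancel, understating the stress.
        return sum(components)

    loads = [12.0, -9.5, 7.25]              # example signed stress components
    print(resultant_conservative(loads))    # 28.75 -- safe, conservative estimate
    print(resultant_buggy(loads))           # 9.75  -- dangerously optimistic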

Incorrect assumptions can also be made in other ways; for example, system developers may make incorrect assumptions about the requirements of certain modules. One of the probes sent to Mars had the capability to have new mission requirements sent to it, with instructions detailing what it was to do. After some time the memory space on the probe in which new instructions could be stored began to run out. To remedy this, one of the engineers decided to delete the landing module (as the probe was never going to have to land again) and thus free up a large amount of storage space for new instructions. The probe was sent the new program, overwriting the landing module, and as soon as the new program was loaded into the probe's computer all contact with the probe was lost. What had happened was that the landing module needed celestial navigation information to land correctly, so the routine which provided that information was part of the landing module; however, the celestial navigation information was also needed to point the antenna that enabled communication with Earth in the right direction. The loss of the landing module therefore caused the loss of the Mars probe. See http://www2.southwind.net/~rwweeks/bugs.html
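
The sketch below shows, in outline only, the kind of hidden coupling described above; the class and function names are invented for illustration and are not taken from the actual probe software.

    class LandingModule:
        def celestial_navigation(self):
            # Shared routine: work out the probe's orientation from star fixes.
            return "orientation fix"

        def land(self):
            return "descend using " + self.celestial_navigation()

    class CommunicationModule:
        def __init__(self, navigation_provider):
            # Hidden dependency: antenna pointing reuses the navigation routine
            # that happens to live inside the landing module.
            self.navigation_provider = navigation_provider

        def point_antenna_at_earth(self):
            return "aim antenna using " + self.navigation_provider.celestial_navigation()

    comms = CommunicationModule(LandingModule())
    print(comms.point_antenna_at_earth())
    # Overwriting the landing module to reclaim memory also removes the only
    # copy of celestial_navigation(), so antenna pointing -- and with it all
    # contact with Earth -- is lost.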

2.3 Poor user interface.

A poor user interface may cause significant problems for users of the system, and thus greatly increase the likelihood of those users introducing data that causes system failure. In accounting software, for example, a single mistake due to a poor user interface may cause an invoice to be sent to someone in the wrong currency, or may turn what is supposed to be an invoice into a credit note. While these are not usually major problems in themselves, erroneous data may make the system fail. The U.S. General Accounting Office, after reviewing federal computer systems for many years, has said that "data problems are primarily attributable to the user abuse of computer systems." It then categorised the causes of these errors into six areas, the first of which was "Forms designed and used for input preparation are too complex." This information, obtained from a book by William Perry (1989), shows that overly complex user interfaces are one of the biggest factors in data errors in the computer systems of the US government.
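
A minimal sketch of the kind of interface-level validation that can catch such data-entry errors is shown below; the field names, currency codes and document types are assumptions made purely for illustration.

    VALID_CURRENCIES = {"GBP", "USD", "EUR"}
    VALID_DOCUMENT_TYPES = {"invoice", "credit note"}

    def validate_billing_entry(document_type, currency, amount):
        # Collect every problem with the entry so the user can be told exactly
        # what to correct, rather than the system silently storing bad data.
        errors = []
        if document_type not in VALID_DOCUMENT_TYPES:
            errors.append("unknown document type: " + repr(document_type))
        if currency not in VALID_CURRENCIES:
            errors.append("unknown currency: " + repr(currency))
        if amount <= 0:
            errors.append("amount must be positive")
        return errors

    # A mistyped currency is rejected at the point of entry.
    print(validate_billing_entry("invoice", "UKP", 250.00))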

2.4 Faulty hardware.

Faulty hardware is a problem that can cause severe system failure, and it is one that is hard to guard against. It is nevertheless an important factor that should be given due consideration alongside the more common software errors. An example of a hardware error in a system can be found in the experience of the Wide Field Infrared Explorer (WIRE) spacecraft operated by NASA: a failure due to faulty hardware components caused the WIRE to enter an uncontrolled 60-RPM spin within 48 hours of its launch. See http://catless.ncl.ac.uk/Risks/20.47.html#subj1

Although faulty hardware is not the responsibility of the software developers, it should be taken into consideration when designing systems, so as to minimise the impact of any failure. Hardware failure is less likely to occur than software faults, but it can be just as damaging.

2.5 Inadequate user training/ user error.

This factor is an important contributor to system failure. If users are improperly trained, the likelihood of them making serious errors is increased by their lack of knowledge of the system. Failures due to a lack of training should be thought of not as errors by the individual operator, as is likely with a poorly designed user interface, but as a mistake by management: a person who makes a mistake should not be reprimanded for it if they have not been trained to deal with the situation. In the report by the U.S. General Accounting Office quoted by Perry (1989), the conclusion is that improved training of end users will significantly reduce system failures and improve the integrity of data stored on computer systems.

2.6 Poor fit between systems and organisation.

A poor fit between the system and the organisation can lead to various problems. It occurs when either the software developers or the entity requesting the software solution do not grasp the full spectrum of tasks the new system will need to deal with. An example is the UK asylum system, which is currently undergoing a major transformation; there is a large shortfall between what is required of the system and what is actually available. One of the system's purposes, providing ports with access to Home Office files, is not being met for half of all ports, as reported by Travis (1999).

This poor fit means that a system which is supposed to operate on a national scale is available in only around half of the ports in which it needs to operate to be effective, so the scope of the project is significantly diminished. This failure, and the other flaws revealed in the report by FT management consultants from which Travis quotes, could cost the British taxpayer an additional five hundred million pounds.

3. Classification of Failures:

3.1 Poor development practices.

These are failures caused primarily by faults on the part of the software developers.

These are faults such as poor development practices in general (Section 2.1), incorrect assumptions made by the development team (Section 2.2) and a poorly designed user interface (Section 2.3).

3.2 End user or entity problems.

These are failures caused primarily by errors on the part of either the end user or the entity that is using or requesting the system.

These are faults such as incorrect assumptions on the part of the entity requesting the system (Section 2.2) and inadequate user training or user error (Section 2.5).

3.3 Implementation or hardware errors.

These failures are those that are primarily caused by hardware faults or by the entity not having the resources necessary to make the specified system usable.

These are faults such as faulty hardware (Section 2.4) and a poor fit between the system and the organisation (Section 2.6).

4. Conclusion of the Report:

In concluding this report, I feel that I have shown numerous examples of system failure and have related them to their underlying causes (Section 2). These underlying factors have then been classified into three general areas (Section 3). The measures I propose to alleviate and/or eliminate the problems caused by system failure are therefore organised into the same categories as the failure classification.

4.1 Poor development practices:

The responsibility for failures caused by this factor falls squarely on the development team responsible for creating and/or maintaining the software system. As such, measures to reduce poor development practices must centre on the development team itself.

4.1.1 A stringent testing policy for all new/modified software.

A report on the Ballista project undertaken at Carnegie Mellon University states that "more than half of software defects and system failures may be attributed to problems with exception handling." Stringent testing will therefore help to eliminate one of the major causes of system failure. See http://www.darpa.mil/ITO/psum1998/E350-0.html
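
As a rough illustration of testing exception handling in this spirit, the sketch below feeds deliberately invalid inputs to a routine and checks that it fails cleanly; parse_reading() and the test values are assumptions made for the example, not part of the Ballista project itself.

    import unittest

    def parse_reading(text):
        # Convert a sensor reading to a float, rejecting bad input explicitly.
        if text is None or not text.strip():
            raise ValueError("empty reading")
        return float(text)

    class ExceptionHandlingTests(unittest.TestCase):
        def test_valid_input(self):
            self.assertEqual(parse_reading("3.5"), 3.5)

        def test_invalid_inputs_raise_cleanly(self):
            # Each hostile input should raise ValueError, never crash elsewhere.
            for bad in [None, "", "   ", "abc", "1.2.3"]:
                with self.assertRaises(ValueError):
                    parse_reading(bad)

    if __name__ == "__main__":
        unittest.main()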

4.1.2 No management pressure to compromise safety to rush a product out.

Fear of management reprisal is a common excuse cited by those who fail to report possible problems which would delay a project, as stated in a report found on the web site of Rutgers, the State University of New Jersey. See http://www-civeng.rutgers.edu/asce/failure.html

4.1.3 Safety critical program modules to be reviewed by other team members.

Peer review, when implemented, will increase the likelihood of logical errors being spotted before the system is implemented.

4.1.4 Sticking closely to required specification.

Failing to stick to the required specification is a major mistake: if the system does not produce the expected results, then it is useless for the required tasks.

4.1.5 Asking for clarification when faced with obscure specification criteria.

A potential problem occurs in the development of a system when a programmer needs to implement obscure specification criteria. If clarification is not sought, guesswork on the part of the programmer can lead to unwanted results being generated by the system, and thus cause system failure.

4.1.6 A policy requiring documentation to be thoroughly "tested" i.e. is it usable?

This policy of reviewing documentation will ensure that the users are given clear and concise instructions on how to operate the system, and will eliminate failures caused by users following erroneous instructions.

4.1.7 Allow input from those using the systems to ensure interfaces are clear.

This point is made by Charles B. Kreitzberg, Ph.D., President of Cognetics Corporation, in his report, which can be found at http://www.cognetics.com/presentations/charlie/drk_chaos.html

4.2 End user or entity problems:

System failures which come under this classification are those for which the entity that is going to be utilising the system is responsible. Therefore, to minimise the frequency of this type of error, certain actions must be taken by the entity.

The actions which I propose are:

4.2.1 Critical review of the specification provided to the system developers.

In a report by the Centre for Human-Computer Studies at Uppsala University, the point is made that specifications need to be prepared and reviewed by those who will be utilising the systems. See http://mbi.dkfz-heidelberg.de/helios/doc/book/UIRS.html

4.2.2 Full training given to end-users of the system.

Training is one of the factors highlighted by Charles B. Kreitzberg which will reduce chaos in implementing computer systems. See http://www.cognetics.com/presentations/charlie/drk_chaos.html

4.3 Implementation or hardware errors:

This class of failure contains those faults for which responsibility cannot be clearly placed on one particular group. These failures are caused either by random events or by a combination of factors between the relevant parties.

The proposals for this classification are:

4.3.1 Ensure backup hardware is available, especially for safety-critical systems.

In the event of a hardware failure in the primary system, essential functions could be transferred to backup components, thus reducing the impact of the hardware failure. An example of this is found in disk duplexing, as explained in a report on disaster avoidance by the University of Toronto. See http://www.utoronto.ca/security/disavd.htm
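
The outline below sketches software-level failover between a primary and a backup device, in the spirit of the disk duplexing example; the class names and the in-memory "devices" are assumptions made purely for illustration.

    class DeviceFailure(Exception):
        pass

    class InMemoryDevice:
        def __init__(self):
            self.blocks = {}
            self.failed = False

        def write(self, block, data):
            if self.failed:
                raise DeviceFailure("device offline")
            self.blocks[block] = data

        def read(self, block):
            if self.failed:
                raise DeviceFailure("device offline")
            return self.blocks[block]

    class MirroredStore:
        # Duplexing: every write goes to both devices; reads fall back to the
        # backup if the primary has failed.
        def __init__(self, primary, backup):
            self.primary, self.backup = primary, backup

        def write(self, block, data):
            self.primary.write(block, data)
            self.backup.write(block, data)

        def read(self, block):
            try:
                return self.primary.read(block)
            except DeviceFailure:
                return self.backup.read(block)

    primary, backup = InMemoryDevice(), InMemoryDevice()
    store = MirroredStore(primary, backup)
    store.write("block-1", "payroll data")
    primary.failed = True                 # simulate a hardware fault
    print(store.read("block-1"))          # still served, from the backup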

4.3.2 Frequent consultation between relevant parties as to condition of the system.

The interchange of ideas and information between the relevant parties will help the developers to identify any potential problems in the implementation of the proposed system. The open interchange of information encourages people to highlight any problem areas with which they are concerned.

5. References:

Perry, William, "Handbook of Diagnosing and Solving Computer Problems", TAB Professional and Reference Books, 1989, pp. 105-106.
Travis, Alan, "Asylum system hit by IT black hole", The Guardian, 1 November 1999, p. 7.
http://catless.ncl.ac.uk/Risks/20.46.html#subj2
http://catless.ncl.ac.uk/Risks/20.47.html#subj1
http://www2.southwind.net/~rwweeks/bugs.html
http://spacer.com/spacecast/news/launchers-99l.html
http://www.cognetics.com/presentations/charlie/drk_chaos.html
http://mbi.dkfz-heidelberg.de/helios/doc/book/UIRS.html
http://www.darpa.mil/ITO/psum1998/E350-0.html
http://www-civeng.rutgers.edu/asce/failure.html
http://www.utoronto.ca/security/disavd.htm