...

ISO/IEC TR 24772, Section 6.47, "REU Termination strategy" [ISO/IEC TR 24772], says:

Expectations that a system will be dependable are based on the confidence that the system will operate as expected and not fail in normal use. The dependability of a system and its fault tolerance can be measured through the component part's reliability, availability, safety and security. Reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time [IEEE Std 610.12 1990]. Availability is how timely and reliable the system is to its intended users. Both of these factors matter highly in systems used for safety and security. In spite of the best intentions, systems will encounter a failure, either from internally poorly written software or external forces such as power outages/variations, floods, or other natural disasters. The reaction to a fault can affect the performance of a system and in particular, the safety and security of the system and its users.

Effective error handling (which includes error reporting, report aggregation, analysis, response, and recovery) is a central aspect of the design, implementation, maintenance, and operation of systems that exhibit survivability under stress. Survivability is the capability of a system to fulfill its mission, in a timely manner, despite an attack, accident, or other stress that is outside the bounds of normal operation [Lipson 2000]. If full services cannot be maintained under a given stress, survivable systems degrade gracefully, continue to deliver essential services, and recover full services as conditions permit.

Error reporting and error handling play a central role in the engineering and operation of survivable systems. Survivability is an emergent property of a system as a whole [Fisher 1999] and depends on the behavior of all of the system's components and the interactions among them. From the viewpoint of error handling, every system component, down to the smallest routine, can be considered to be a sensor capable of reporting on some aspect of the health of the system. Any error or anomaly that is ignored or improperly handled can threaten delivery of essential system services and, as a result, put at risk the organizational or business mission that the system supports.

...

Recognition of the full nature of adverse events and the determination of appropriate measures for recovery and response are often not possible in the context of the component or routine in which a related error first manifests itself. Aggregation of multiple error reports and the interpretation of those reports in a higher context may be required both to understand what is happening and to decide on the appropriate action to take. Of course, the domain-specific context in which the system operates plays a huge role in determining proper recovery strategies and tactics. For safety-critical systems, simply halting the system (or even just terminating an offending process) in response to an error is rarely the best course of action and may lead to disaster. From a system perspective, error-handling strategies should map directly into survivability strategies, which may include recovery by activating fully redundant backup services or by providing alternative sets of roughly equivalent services that fulfill the mission with sufficient diversity to greatly improve the odds of survival against common mode failures.
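The recovery strategy described above, falling back to an alternative service instead of halting, might be sketched in C as follows. Every name here (`svc_status_t`, `source_fn`, `read_with_fallback`, and the two stand-in sources) is hypothetical and chosen only for illustration; a real system would derive its fallback order and reporting channel from its own survivability requirements.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical outcome codes for a service request. */
typedef enum {
    SVC_OK = 0,      /* primary service answered */
    SVC_DEGRADED,    /* primary failed; alternative service answered */
    SVC_FAILED       /* no service could answer */
} svc_status_t;

/* A reading source returns 0 on success and nonzero on failure. */
typedef int (*source_fn)(double *out);

/* Stand-in sources, for illustration only. */
int primary_sensor(double *out)  { (void)out; return -1; }   /* simulated fault */
int backup_estimate(double *out) { *out = 20.0; return 0; }  /* coarse fallback */

/* Counts reports that could not be written, so that fact is not lost. */
static unsigned long unreported_faults = 0;

static void report(const char *message)
{
    if (fputs(message, stderr) == EOF) {
        unreported_faults++;
    }
}

/* Try the primary source; on failure, report and try the alternative.
 * The caller learns from the status whether service is degraded. */
svc_status_t read_with_fallback(source_fn primary, source_fn backup, double *out)
{
    if (primary == NULL || backup == NULL || out == NULL) {
        return SVC_FAILED;
    }
    if (primary(out) == 0) {
        return SVC_OK;
    }
    report("primary source failed; trying alternative\n");
    if (backup(out) == 0) {
        return SVC_DEGRADED;
    }
    report("alternative source failed\n");
    return SVC_FAILED;
}

/* Usage sketch: the process keeps running on the degraded reading. */
void fallback_demo(void)
{
    double reading = 0.0;
    svc_status_t status =
        read_with_fallback(primary_sensor, backup_estimate, &reading);
    assert(status == SVC_DEGRADED);
    assert(reading == 20.0);
}
```

Returning a distinct degraded status, rather than silently substituting the backup value, lets a higher-level component aggregate such reports and decide whether further action is needed.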

An error-handling policy must specify a comprehensive approach to error reporting and response. Components and routines should always generate status indicators, and all called routines should have their error returns checked. All input should be checked for compliance with the formal requirements for such input rather than blindly trusting input data. Moreover, never assume, on the basis of specific knowledge about the system or its domain, that the success of a called routine is guaranteed. The failure to report or properly respond to errors or other anomalies from a system perspective can threaten the survivability of the system as a whole.
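As a minimal sketch of these rules in C, the routine below validates its input against a formal requirement and always returns a status indicator, and its caller checks that status before using the result. The names `status_t`, `parse_port`, and `configure_port`, as well as the requirement itself (a decimal integer from 1 to 65535), are assumptions made for illustration; only `strtol` and `errno` come from the standard library.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical status codes; a real policy would define these project-wide. */
typedef enum {
    STATUS_OK = 0,
    STATUS_BAD_ARGUMENT,
    STATUS_BAD_INPUT,
    STATUS_OUT_OF_RANGE
} status_t;

/* Parse a port number, checking it against its formal requirement
 * (a decimal integer in 1..65535 with no trailing characters). */
status_t parse_port(const char *text, long *out_port)
{
    char *end = NULL;
    long value;

    if (text == NULL || out_port == NULL) {
        return STATUS_BAD_ARGUMENT;
    }

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0') {
        return STATUS_BAD_INPUT;      /* nothing parsed, or trailing garbage */
    }
    if (errno == ERANGE || value < 1 || value > 65535) {
        return STATUS_OUT_OF_RANGE;
    }

    *out_port = value;
    return STATUS_OK;
}

/* The caller checks the returned status instead of assuming success,
 * and leaves the existing setting untouched on failure. */
status_t configure_port(const char *text, long *port_setting)
{
    long candidate = 0;
    status_t status;

    if (port_setting == NULL) {
        return STATUS_BAD_ARGUMENT;
    }
    status = parse_port(text, &candidate);
    if (status != STATUS_OK) {
        return status;
    }
    *port_setting = candidate;
    return STATUS_OK;
}

/* Usage sketch. */
void configure_port_demo(void)
{
    long setting = 443;
    assert(configure_port("8080", &setting) == STATUS_OK && setting == 8080);
    assert(configure_port("80x", &setting) == STATUS_BAD_INPUT && setting == 8080);
    assert(configure_port("70000", &setting) == STATUS_OUT_OF_RANGE);
}
```

Propagating the status unchanged keeps the original cause visible to whichever higher-level component is in a position to decide on a response.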

ISO/IEC TR 24772, Section 6.47, "REU Termination strategy" [ISO/IEC TR 24772], describes the following mitigation strategies:

...

CERT C++ Secure Coding Standard: ERR00-CPP. Adopt and implement a consistent and comprehensive error-handling policy

ISO/IEC 9899:2011 Section 7.1.4, "Use of library functions"

ISO/IEC PDTR 24772 "REU Termination strategy" and "NZN Returning error status"

...

MITRE CWE: CWE-391, "Unchecked error condition"

MITRE CWE: CWE-544, "Missing standardized error handling mechanism"

Bibliography

[Fisher 1999]
[Horton 1990] Section 11, p. 168, and Section 14, p. 254
[Koenig 1989] Section 5.4, p. 73
[Lipson 2000]
[Lipson 2006]
[Summit 2005] C-FAQ Question 20.4

...
