Secure systems are invariably subject to stresses (for example, those caused by attack, hardware failures, unanticipated user behavior, or unexpected environmental changes) that are outside the bounds of "normal operation," and yet they must continue to deliver essential services in a timely manner, safely and securely. To accomplish this, a system must exhibit qualities such as robustness, reliability, fault tolerance, performance, and security. All of these quality attributes depend upon a consistent and comprehensive error-handling policy that supports the goals of the overall system.
Error handling is critical to the success and security of your application. Adopt and implement an error-handling policy that is applied consistently and that matches the goals and requirements of your application domain.
ISO/IEC PDTR 24772 Section 6.47, "REU Termination strategy" says:
Expectations that a system will be dependable are based on the confidence that the system will operate as expected and not fail in normal use. The dependability of a system and its fault tolerance can be measured through the component part's reliability, availability, safety and security. Reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time [IEEE 1990 glossary]. Availability is how timely and reliable the system is to its intended users. Both of these factors matter highly in systems used for safety and security. In spite of the best intentions, systems will encounter a failure, either from internally poorly written software or external forces such as power outages/variations, floods, or other natural disasters. The reaction to a fault can affect the performance of a system and in particular, the safety and security of the system and its users.
Effective error handling (which includes error reporting, report aggregation, analysis, response, and recovery) is a central aspect of the design, implementation, maintenance, and operation of systems that exhibit survivability under stress. Survivability is the capability of a system to fulfill its mission, in a timely manner, despite an attack, accident, or other stress that is outside the bounds of normal operation [1]. If full services cannot be maintained under a given stress, survivable systems degrade gracefully, continue to deliver essential services, and recover full services as conditions permit.
Error reporting and error handling play a central role in the engineering and operation of survivable systems. Survivability is an emergent property of a system as a whole [2] and depends on the behavior of all of the system's components and the interactions among them. From the viewpoint of error handling, every system component, down to the smallest routine, can be considered to be a sensor capable of reporting on some aspect of the health of the system. Any error (i.e., anomaly) that is ignored or improperly handled could threaten delivery of essential system services and thus put at risk the organizational or business mission that the system supports.
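As a minimal sketch of the "every routine is a sensor" idea, the routine below reports its health through its return value and delivers its result through an output parameter, and its caller acts on the report instead of discarding it. The names `status_t`, `average`, and `average_or_default` are hypothetical and chosen only for illustration; a real policy would define its own status codes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes shared by every routine in the application. */
typedef enum {
    STATUS_OK = 0,
    STATUS_INVALID_ARGUMENT
} status_t;

/* Hypothetical routine: computes the average of n samples into *out.
 * The return value carries only the health report, never the result. */
status_t average(const int *samples, size_t n, int *out) {
    assert(out != NULL);  /* a null output pointer is a programming error */
    if (samples == NULL || n == 0) {
        return STATUS_INVALID_ARGUMENT;  /* report the anomaly to the caller */
    }
    long long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
    }
    *out = (int)(sum / (long long)n);
    return STATUS_OK;
}

/* Caller: checks the report and substitutes a fallback value
 * rather than using an unset result. */
int average_or_default(const int *samples, size_t n, int fallback) {
    int result;
    if (average(samples, n, &result) != STATUS_OK) {
        return fallback;
    }
    return result;
}
```

Separating the status from the result means a caller cannot mistake an error indicator for valid data, which is one common way to keep error reports from being silently ignored.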
The key characteristics of survivability include the 3Rs: resistance, recognition, and recovery. Resistance refers to measures that "harden" a system against particular stresses, recognition refers to situational awareness with respect to instances of stress and their impact on the system, and recovery is the ability of a system to restore services after (and possibly during) an attack, accident, or other event that has disrupted those services. Comprehensive error reporting and handling can
Recognition of the full nature of adverse events and the determination of appropriate measures for recovery and response are often not possible in the context of the component or routine in which a related error first manifests itself. Aggregation of multiple error reports and the interpretation of those reports in a higher context may be required both to understand what is happening and to decide on the appropriate action to take. The domain-specific context in which the system operates plays a major role in determining proper recovery strategies and tactics. For safety-critical systems, simply halting the system (or even just terminating an offending process) in response to an error is rarely the best course of action and may lead to disaster. From a system perspective, error-handling strategies should map directly into survivability strategies, which may include recovery by activating fully redundant backup services, or by providing alternate sets of roughly equivalent services that fulfill the mission with sufficient diversity to greatly improve the odds of survival against common mode failures.
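The aggregation step described above can be sketched as a small monitor that collects error reports from individual components and interprets them together. Everything here is a hypothetical illustration: the names `health_monitor_t`, `monitor_report`, and `monitor_assess`, the component count, and the thresholds (one failing component means degrade, more than one means fail over) are assumptions, not values taken from any standard.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum { COMPONENT_COUNT = 3 };  /* hypothetical number of components */

/* Hypothetical system-level responses chosen from the aggregated view. */
typedef enum { ACTION_CONTINUE, ACTION_DEGRADE, ACTION_FAILOVER } action_t;

typedef struct {
    unsigned reports[COMPONENT_COUNT];  /* error reports seen per component */
} health_monitor_t;

void monitor_init(health_monitor_t *m) {
    assert(m != NULL);
    for (size_t i = 0; i < COMPONENT_COUNT; i++) {
        m->reports[i] = 0;
    }
}

/* Records one error report. Returns 0 on success, -1 for an unknown
 * component, so that even the reporting path reports its own errors. */
int monitor_report(health_monitor_t *m, size_t component) {
    assert(m != NULL);
    if (component >= COMPONENT_COUNT) {
        return -1;
    }
    if (m->reports[component] < UINT_MAX) {  /* saturate instead of wrapping */
        m->reports[component]++;
    }
    return 0;
}

/* Interprets all reports together: the decision depends on how many
 * distinct components are failing, not on any single report. */
action_t monitor_assess(const health_monitor_t *m) {
    assert(m != NULL);
    size_t failing = 0;
    for (size_t i = 0; i < COMPONENT_COUNT; i++) {
        if (m->reports[i] > 0) {
            failing++;
        }
    }
    if (failing == 0) {
        return ACTION_CONTINUE;
    }
    return (failing == 1) ? ACTION_DEGRADE : ACTION_FAILOVER;
}
```

The component that detects an error only reports it; the decision to degrade or fail over is made where the whole picture is visible.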
ISO/IEC PDTR 24772 Section 6.47, "REU Termination strategy" says:
When a fault is detected, there are many ways in which a system can react. The quickest and most noticeable way is to fail hard, also known as fail fast or fail stop. The reaction to a detected fault is to immediately halt the system. Alternatively, the reaction to a detected fault could be to fail soft. The system would keep working with the faults present, but the performance of the system would be degraded. Systems used in a high availability environment such as telephone switching centers, e-commerce, etc. would likely use a fail soft approach. What is actually done in a fail soft approach can vary depending on whether the system is used for safety critical or security critical purposes. For fail safe systems, such as flight controllers, traffic signals, or medical monitoring systems, there would be no effort to meet normal operational requirements, but rather to limit the damage or danger caused by the fault. A system that fails securely, such as cryptologic systems, would maintain maximum security when a fault is detected, possibly through a denial of service.
...
Software developers can avoid the vulnerability or mitigate its ill effects in the following ways:
- A strategy for fault handling should be decided. Consistency in fault handling should be the same with respect to critically similar parts.
- A multi-tiered approach of fault prevention, fault detection and fault reaction should be used.
- System-defined components that assist in uniformity of fault handling should be used when available. For one example, designing a "runtime constraint handler" (as described in ISO/IEC TR 24731-1) permits the application to intercept various erroneous situations and perform one consistent response, such as flushing a previous transaction and re-starting at the next one.
- When there are multiple tasks, a fault-handling policy should be specified whereby a task may
  - halt, and keep its resources available for other tasks (perhaps permitting restarting of the faulting task)
  - halt, and remove its resources (perhaps to allow other tasks to use the resources so freed, or to allow a recreation of the task)
  - halt, and signal the rest of the program to likewise halt.
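The idea of a single installable handler that gives one consistent response to faults can be sketched as follows. This is only loosely modeled on the handler-registration pattern mentioned in the quoted text; it does not use the ISO/IEC TR 24731-1 interface, and every name here (`fault_handler_t`, `set_fault_handler`, `raise_fault`, `process_transactions`) is hypothetical. A negative amount stands in for an arbitrary erroneous situation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type: receives a message and a fault code. */
typedef void (*fault_handler_t)(const char *msg, int code);

static unsigned fault_count;   /* faults seen by counting_fault_handler */
static int last_fault_code;    /* code of the most recent counted fault */

/* Default response in this sketch: record the fault for later analysis. */
void counting_fault_handler(const char *msg, int code) {
    (void)msg;
    fault_count++;
    last_fault_code = code;
}

/* Alternative response: deliberately do nothing. */
void ignoring_fault_handler(const char *msg, int code) {
    (void)msg;
    (void)code;
}

static fault_handler_t current_handler = counting_fault_handler;

/* Installs a handler and returns the previous one; NULL restores the default. */
fault_handler_t set_fault_handler(fault_handler_t handler) {
    fault_handler_t previous = current_handler;
    current_handler = (handler != NULL) ? handler : counting_fault_handler;
    return previous;
}

/* Single entry point through which every fault in the program is routed. */
void raise_fault(const char *msg, int code) {
    current_handler(msg, code);
}

/* Adds each valid amount to *total. A faulty transaction is dropped,
 * the handler is invoked, and processing restarts at the next one.
 * Returns the number of transactions committed. */
size_t process_transactions(const int *amounts, size_t n, long long *total) {
    assert(amounts != NULL && total != NULL);
    size_t committed = 0;
    for (size_t i = 0; i < n; i++) {
        if (amounts[i] < 0) {
            raise_fault("negative amount", (int)i);
            continue;
        }
        *total += amounts[i];
        committed++;
    }
    return committed;
}
```

Because all faults pass through `raise_fault`, changing the response for the whole program is a single call to `set_fault_handler` rather than an edit at every detection site.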
Risk Analysis
Failing to detect error conditions can result in unexpected program behavior and possibly abnormal program termination, resulting in a denial-of-service condition. Failure to adopt and implement a consistent and comprehensive error-handling policy is detrimental to system survivability and can result in a broad range of vulnerabilities, depending on the operational characteristics of the system.
Recommendation | Severity | Likelihood | Remediation Cost | Priority | Level |
---|---|---|---|---|---|
ERR00-A | 2 (medium) | 2 (probable) | 2 (medium) | P8 | L2 |
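To illustrate the risk of an undetected error condition, here is a minimal sketch of checking every failure mode of the standard `strtol` function before using its result. The wrapper name `parse_int` and its 0 / -1 return convention are assumptions made for this example; `strtol`, `errno`, and `ERANGE` are standard C.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: converts a decimal string to an int.
 * Returns 0 and stores the value in *out on success.
 * Returns -1 and leaves *out untouched on any error. */
int parse_int(const char *text, int *out) {
    assert(text != NULL && out != NULL);  /* null pointers are programming errors */
    char *end;
    errno = 0;  /* strtol sets errno on overflow, so clear it first */
    long value = strtol(text, &end, 10);
    if (end == text || *end != '\0') {
        return -1;  /* no digits at all, or trailing characters */
    }
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX) {
        return -1;  /* out of range for long or for int */
    }
    *out = (int)value;
    return 0;
}
```

A caller that skips these checks would treat input such as `"abc"` as the value 0, which is the kind of unexpected program behavior described above.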
Related Vulnerabilities
Search for vulnerabilities resulting from the violation of this rule on the CERT website.
References
- [Horton 90] Section 11 p. 168, Section 14 p. 254
- [ISO/IEC 9899-1999] Sections 7.1.4, 7.9.10.4, and 7.11.6.2
- [ISO/IEC PDTR 24772] "NZN Returning error status"
- [Koenig 89] Section 5.4 p. 73
- [MISRA 04] Rule 16.1
- [Summit 05] C-FAQ Question 20.4
...