Fault Tolerance

In critical situations, software systems must be fault tolerant. Fault tolerance is required where there are high availability requirements or where system failure costs are very high.
Fault tolerance means that the system can continue in operation in spite of software failure. Even if the system has been proved to conform to its specification, it must still be fault tolerant, as there may be errors in the specification or the validation may be incorrect.

Fault tolerance involves four kinds of action: fault detection, damage assessment, fault recovery, and fault repair.

Fault Detection

The first stage of fault tolerance is to detect that a fault (an erroneous system state) has occurred or will occur. Fault detection involves defining constraints that must hold for all legal states and checking the state against these constraints. There are two types of fault detection: preventative and retrospective. Preventative detection runs the check before a state change is committed, so an erroneous change is never made. Retrospective detection runs the check after the state has changed.
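As a rough illustration of checking a state against constraints, here is a minimal Python sketch. The `TankState` class, its fields, and the function names are hypothetical and chosen only for this example. "Preventative" is taken here to mean checking a candidate state before it is committed, and "retrospective" to mean checking a state that is already in place.

```python
from dataclasses import dataclass


class ConstraintViolation(Exception):
    """Raised when a state fails one of its constraints."""


@dataclass(frozen=True)
class TankState:
    # Hypothetical state, for illustration only.
    level_litres: float
    capacity_litres: float


def check_state(state: TankState) -> None:
    """Check the constraints that must hold for every legal state."""
    if state.capacity_litres <= 0:
        raise ConstraintViolation("capacity must be positive")
    if not 0 <= state.level_litres <= state.capacity_litres:
        raise ConstraintViolation("level outside 0..capacity")


def add_preventative(state: TankState, litres: float) -> TankState:
    """Preventative detection: validate the candidate before committing it."""
    candidate = TankState(state.level_litres + litres, state.capacity_litres)
    check_state(candidate)  # raises, so the caller keeps the old state
    return candidate


def audit_retrospective(state: TankState) -> bool:
    """Retrospective detection: inspect a state that is already in place."""
    try:
        check_state(state)
    except ConstraintViolation:
        return False
    return True
```

With this sketch, `add_preventative(TankState(50.0, 100.0), 80.0)` raises `ConstraintViolation` and the original state is left untouched, whereas `audit_retrospective` only reports that a bad state already exists.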
Damage Assessment

Damage assessment analyzes the system state to judge the extent of the corruption caused by a system failure. The assessment must check which parts of the state space have been affected by the failure. It is generally based on ‘validity functions’ that can be applied to the state elements to assess whether their values are within an allowed range. There are various techniques of damage assessment. Checksums are used for damage assessment in data transmission. Redundant pointers can be used to check the integrity of data structures. Watchdog timers can check for non-terminating processes: if there is no response after a certain time, a problem is assumed.
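The techniques above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function and class names are invented for the example, CRC-32 (via the standard `zlib` module) stands in for whatever checksum a real protocol would use, and the watchdog is a simple polled timer rather than a hardware or operating-system facility.

```python
import time
import zlib


def in_range(value: float, low: float, high: float) -> bool:
    """Validity function: is a state element within its allowed range?"""
    return low <= value <= high


def assess_damage(state: dict, limits: dict) -> list:
    """Return the names of state elements that are missing or out of range."""
    return [name for name, (low, high) in limits.items()
            if name not in state or not in_range(state[name], low, high)]


def with_checksum(payload: bytes) -> bytes:
    """Append a CRC-32 checksum (4 bytes, big-endian) to a payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_intact(message: bytes) -> bool:
    """Recompute the checksum and compare it with the transmitted one."""
    if len(message) < 4:
        return False
    payload, received = message[:-4], message[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


class Watchdog:
    """Assume a problem if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the timer can be tested
        self.last_beat = clock()

    def beat(self) -> None:
        """Called by the monitored process to show it is still running."""
        self.last_beat = self.clock()

    def expired(self) -> bool:
        """True if the monitored process has been silent for too long."""
        return self.clock() - self.last_beat > self.timeout
```

A monitored process would call `beat()` periodically, and a supervisor would poll `expired()` to decide whether the process has stopped responding.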

Fault Recovery

The system must restore its state to a known safe state. There are two types of fault recovery: forward and backward. Forward recovery applies repairs to a corrupted system state and is usually application specific. Backward recovery restores the system to a known safe state. It is the simpler of the two: details of a safe state are maintained, and this saved state replaces the corrupted system state.
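The difference between the two approaches can be sketched in Python. The `Checkpointed` class, the `recover_forward` function, and the `items`/`total` state layout are hypothetical, chosen only to keep the example small. The forward repair assumes that `items` is still trustworthy and that `total` can be rebuilt from it.

```python
import copy


class Checkpointed:
    """Keeps a copy of the last known safe state for backward recovery."""

    def __init__(self, state: dict, is_valid):
        if not is_valid(state):
            raise ValueError("initial state must be valid")
        self.state = state
        self.is_valid = is_valid
        self._safe = copy.deepcopy(state)

    def checkpoint(self) -> None:
        """Record the current state as safe, but only if it is valid."""
        if not self.is_valid(self.state):
            raise ValueError("refusing to checkpoint an invalid state")
        self._safe = copy.deepcopy(self.state)

    def recover_backward(self) -> None:
        """Backward recovery: replace the state with the saved safe copy."""
        self.state = copy.deepcopy(self._safe)


def recover_forward(state: dict) -> dict:
    """Forward recovery: repair the corrupted part using redundancy.

    Application specific: here 'total' is recomputed from 'items'.
    """
    repaired = copy.deepcopy(state)
    repaired["total"] = sum(repaired["items"])
    return repaired
```

Backward recovery discards any work done since the last checkpoint, while the forward repair keeps that work but only succeeds when the application knows how to reconstruct the damaged element.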

Fault Repair

The system may be modified to prevent recurrence of the fault. As many software faults are transitory, this is often unnecessary.
