SAID node state machine: the state machine that governs a SAID node as it detects, diagnoses, and rectifies faults. A SAID node has seven states: initial, detecting, diagnosing, invalid-diagnose, recovering, judging, and service exception.
SAID tracing: The SAID collects and stores information generated when a SAID node detects, diagnoses, and rectifies faults. The information can be used to locate the root cause of a fault.
Fault locating in the SAID involves the fault detection, diagnosis, and recovery phases. The SAID has multiple SAID nodes. Each time a valid diagnosis is made (that is, the recovery process is triggered), the SAID records the diagnosis process information for fault tracing. The SAID's main processes are as follows:
Defense startup phase: After the system starts running, it instructs modules to deploy fault defense (for example, periodic logic re-loading and entry synchronization), which starts fault defense for the entire device.
Detection phase: A SAID node detects faults and finds prerequisites for problem occurrence. Fault detection is classified as periodic detection (for example, periodic traffic decrease detection) or triggered detection (for example, IS-IS Down detection).
Diagnosis phase: Once a SAID node detects a fault, it diagnoses the fault and collects various fault entries to locate the fault causes (only causes for which recovery measures can be taken need to be located).
Recovery phase: After recording information, the SAID node starts to rectify the fault by level. After the recovery action is completed at each level, the SAID node determines whether services recover (by determining whether the fault symptom disappears). If the fault persists, the SAID node continues to perform the recovery action at the next level until the fault is rectified. The recovery action is gradually performed from a lightweight level to a heavyweight level.
Tracing phase: If the SAID determines the fault and its cause, this fault diagnosis is a valid diagnosis. The SAID then records the diagnosis process. After entering the recovery phase, the SAID records the recovery process for subsequent analysis.
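The level-by-level recovery described in the recovery phase can be illustrated with a minimal Python sketch. It is not the SAID implementation: the function name recover_by_level, the action names reload_entries and reset_module, and the use of a plain list as the trace are all assumptions made for illustration. The sketch only shows the general idea of running actions from lightweight to heavyweight, checking after each one whether the fault symptom has disappeared, and recording each step for later analysis.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a recovery action is any callable taking no arguments.
RecoveryAction = Callable[[], None]


def recover_by_level(
    actions: List[Tuple[str, RecoveryAction]],
    fault_symptom_present: Callable[[], bool],
    trace: List[str],
) -> bool:
    """Run recovery actions from lightest to heaviest until the symptom clears.

    Returns True if services recovered, False if every level was exhausted.
    """
    for level, (name, action) in enumerate(actions, start=1):
        trace.append(f"level {level}: running {name}")
        action()
        # Judge recovery by checking whether the fault symptom has disappeared.
        if not fault_symptom_present():
            trace.append(f"level {level}: service recovered")
            return True
        trace.append(f"level {level}: fault persists")
    return False


# Simulated fault that clears only after the heavier action runs.
device = {"faulty": True}


def reload_entries() -> None:
    """Lightweight action (hypothetical) that does not fix this fault."""


def reset_module() -> None:
    """Heavier action (hypothetical) that clears the simulated fault."""
    device["faulty"] = False


demo_trace: List[str] = []
recovered = recover_by_level(
    [("reload_entries", reload_entries), ("reset_module", reset_module)],
    lambda: device["faulty"],
    demo_trace,
)
```

In this simulated run the first level leaves the fault in place, so the loop escalates to the second level, which clears it. A return value of False would correspond to the case where no further recovery action exists.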
The fault detection, diagnosis, and recovery processes of a SAID node are implemented through state machines.
All state transition scenarios are as follows:
When detecting a trigger event in the initial state, the SAID node enters the detecting state.
If the detection is not completed in the detecting state, the SAID node keeps working in this state.
If a detection timeout occurs or no fault is detected in the detecting state, the SAID node enters the initial state.
When detecting a fault in the detecting state, the SAID node enters the diagnosing state.
If the diagnosis action is not completed in the diagnosing state, the SAID node keeps working in this state.
If an environmental change occurs in the diagnosing state or another SAID node enters the recovering state, the SAID node enters the invalid-diagnose state.
If the diagnosis action is not completed in the invalid-diagnose state, the SAID node keeps working in this state.
If no device exception is detected after the diagnosis action is completed in the diagnosing state, the SAID node enters the initial state.
If a device exception is detected after the diagnosis action is completed in the diagnosing state, the SAID node enters the recovering state.
If the recovery action is not completed in the recovering state, the SAID node keeps working in this state.
If the recovery action is completed in the recovering state, the SAID node enters the judging state.
If the judgment action is not completed in the judging state, the SAID node keeps working in this state.
If the service does not recover in the judging state and a secondary recovery action exists, the SAID node enters the recovering state.
If the service does not recover in the judging state and no secondary recovery action exists, the SAID node enters the service exception state.
If the service recovers in the judging state, the SAID node enters the initial state.
In the service exception state, the SAID node periodically checks whether the service recovers.
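The transitions listed above can be summarized as a lookup table. The following Python sketch is illustrative only: the event names (such as trigger_event or diagnosis_done_exception) are hypothetical labels chosen for this example, not identifiers used by the SAID. Any state and event pair that is not in the table, including the "not completed" cases, leaves the node in its current state. The text above does not list the transitions out of the invalid-diagnose and service exception states, so the sketch leaves them out rather than guessing.

```python
from enum import Enum, auto


class State(Enum):
    INITIAL = auto()
    DETECTING = auto()
    DIAGNOSING = auto()
    INVALID_DIAGNOSE = auto()
    RECOVERING = auto()
    JUDGING = auto()
    SERVICE_EXCEPTION = auto()


# (current state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    (State.INITIAL, "trigger_event"): State.DETECTING,
    (State.DETECTING, "detection_timeout"): State.INITIAL,
    (State.DETECTING, "no_fault"): State.INITIAL,
    (State.DETECTING, "fault_detected"): State.DIAGNOSING,
    (State.DIAGNOSING, "environment_changed"): State.INVALID_DIAGNOSE,
    (State.DIAGNOSING, "other_node_recovering"): State.INVALID_DIAGNOSE,
    (State.DIAGNOSING, "diagnosis_done_no_exception"): State.INITIAL,
    (State.DIAGNOSING, "diagnosis_done_exception"): State.RECOVERING,
    (State.RECOVERING, "recovery_done"): State.JUDGING,
    (State.JUDGING, "not_recovered_has_next_action"): State.RECOVERING,
    (State.JUDGING, "not_recovered_no_next_action"): State.SERVICE_EXCEPTION,
    (State.JUDGING, "service_recovered"): State.INITIAL,
}


def next_state(state: State, event: str) -> State:
    """Return the next state; unlisted pairs keep the node in its current state."""
    return TRANSITIONS.get((state, event), state)


# Example walk: a fault is detected, the first recovery action fails,
# and the secondary action restores the service.
state = State.INITIAL
for event in [
    "trigger_event",
    "fault_detected",
    "diagnosis_done_exception",
    "recovery_done",
    "not_recovered_has_next_action",
    "recovery_done",
    "service_recovered",
]:
    state = next_state(state, event)
```

Keeping the transitions in a table rather than in nested conditionals makes it easy to compare the sketch line by line against the list of scenarios.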