Failures to ping a directly connected device are common on networks. They can interrupt services for a long time, and the services do not recover automatically. Because the ping process traverses several IP forwarding phases, a ping failure may be caused by a hardware entry error, a board fault, or a subcard fault on the local device, or by a fault on an intermediate device or the peer device. This makes the specific fault difficult to locate or demarcate.
The ping service node is a specific SAID service node. This node performs link-heartbeat loopback detection to detect service faults, diagnoses each ping forwarding phase to locate or demarcate faults, and takes corresponding recovery actions.
For details about the SAID framework and principles, see Basic SAID Functions. SAID uses IP packets in which the protocol number is 1, indicating ICMP. The ping service node undergoes four phases (fault detection, fault diagnosis, fault recovery, and service recovery determination) to implement automatic device diagnosis, fault information collection, and service recovery.
Fault detection
The ping service node performs link-heartbeat loopback detection to detect service faults. The packets used are ICMP detection packets. There are 12 packet templates in total, and each template sends two packets in sequence within a 30-second period, so the 12 templates send a total of 24 packets per period. After five periods, the system starts to collect statistics on lost packets and modified packets.
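The detection figures above can be checked with a short calculation. This is only a sketch: the constant and function names are illustrative, and it assumes that the five periods run back to back with no gap between them.

```python
# Figures taken from the description above; the names are illustrative only.
TEMPLATES = 12             # packet templates in total
PACKETS_PER_TEMPLATE = 2   # packets each template sends per period
PERIOD_SECONDS = 30        # length of one sending period
WARMUP_PERIODS = 5         # periods that elapse before statistics start


def packets_per_period() -> int:
    """Packets sent by all templates in one period."""
    return TEMPLATES * PACKETS_PER_TEMPLATE


def seconds_before_statistics() -> int:
    """Elapsed time before statistics collection starts (assumes contiguous periods)."""
    return WARMUP_PERIODS * PERIOD_SECONDS


def packets_before_statistics() -> int:
    """Packets sent during the initial periods."""
    return WARMUP_PERIODS * packets_per_period()


if __name__ == "__main__":
    print(packets_per_period())         # 24
    print(seconds_before_statistics())  # 150
    print(packets_before_statistics())  # 120
```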
Packet modification detection checks whether the content of received heartbeat packets is the same as the content of sent heartbeat packets. If one of the following conditions is met, a trigger message is sent to instruct the SAID ping node to perform fault diagnosis:
Packet loss detection checks whether the difference between the number of received heartbeat packets and the number of sent heartbeat packets is within the permitted range. If one of the following conditions is met, a trigger message is sent to instruct the SAID ping node to perform fault diagnosis:
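The two checks can be sketched as a comparison of sent and received heartbeat packets. This is a minimal illustration, not the device implementation: the sequence-number keying, the HeartbeatStats type, and the max_lost and max_modified thresholds are all hypothetical and do not represent the actual trigger conditions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class HeartbeatStats:
    lost: int       # sent packets that never came back
    modified: int   # packets that came back with different content


def compare_heartbeats(sent: Dict[int, bytes], received: Dict[int, bytes]) -> HeartbeatStats:
    """Compare heartbeat packets keyed by a hypothetical sequence number."""
    lost = sum(1 for seq in sent if seq not in received)
    modified = sum(
        1 for seq, payload in sent.items()
        if seq in received and received[seq] != payload
    )
    return HeartbeatStats(lost=lost, modified=modified)


def should_trigger_diagnosis(stats: HeartbeatStats, max_lost: int, max_modified: int) -> bool:
    """Return True if either count exceeds its caller-supplied (hypothetical) limit."""
    return stats.lost > max_lost or stats.modified > max_modified


if __name__ == "__main__":
    sent = {1: b"aaaa", 2: b"bbbb", 3: b"cccc"}
    received = {1: b"aaaa", 2: b"bbXb"}  # packet 2 modified, packet 3 lost
    stats = compare_heartbeats(sent, received)
    print(stats, should_trigger_diagnosis(stats, max_lost=0, max_modified=0))
```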
Fault diagnosis
After receiving the trigger message in the fault detection state, the ping service node enters the fault diagnosis state.
Fault recovery
If a counting error occurs on the subcard, the ping service node resets the subcard for service recovery. Then, the node enters the service recovery determination state and performs link-heartbeat loopback detection to determine whether services recover. If services recover, the node returns to the fault detection state. If services do not recover, the node returns to the fault recovery state and takes a secondary recovery action. (For a subcard reset, the secondary recovery action is board reset.)
If no counting error occurs on the subcard, the ping service node resets the involved board for service recovery. After the board starts, the node enters the service recovery determination state and performs link-heartbeat loopback detection to determine whether services recover. If services recover, the node returns to the fault detection state. If services do not recover, the node remains in the service recovery determination state and periodically performs link-heartbeat loopback detection until services recover.
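The choice of recovery action and its escalation can be summarized in a small sketch. The Action type and function names are illustrative assumptions. Only the behavior stated above is modeled: a subcard reset escalates to a board reset, and a board reset has no further action.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    SUBCARD_RESET = "subcard reset"
    BOARD_RESET = "board reset"


def primary_action(subcard_counting_error: bool) -> Action:
    """Pick the first recovery action based on whether the subcard has a counting error."""
    return Action.SUBCARD_RESET if subcard_counting_error else Action.BOARD_RESET


def secondary_action(previous: Action) -> Optional[Action]:
    """Return the follow-up action if services did not recover.

    After a subcard reset the follow-up is a board reset. After a board reset
    there is none: the node keeps running loopback detection until services recover.
    """
    if previous is Action.SUBCARD_RESET:
        return Action.BOARD_RESET
    return None


if __name__ == "__main__":
    first = primary_action(subcard_counting_error=True)
    print(first.value, "->", secondary_action(first))
```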
Service recovery determination
After fault recovery is complete, the ping service node uses the fault packet template to send diagnostic packets. If a fault still exists and a subcard reset has been performed, the node generates an alarm and instructs the subcard to perform a switchover for self-healing. If a fault still exists but no subcard reset has been performed, the node generates an alarm only. If no fault exists, the node instructs the link-heartbeat loopback function to return to the initial state, and the node itself returns to the fault detection state.
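The three outcomes of service recovery determination form a simple decision, sketched below. The Outcome type and the parameter names are hypothetical and serve only to restate the rules above.

```python
from enum import Enum


class Outcome(Enum):
    ALARM_AND_SWITCHOVER = "generate alarm and instruct subcard switchover"
    ALARM_ONLY = "generate alarm only"
    RETURN_TO_DETECTION = "reset loopback function and return to fault detection"


def recovery_determination(fault_persists: bool, subcard_reset_performed: bool) -> Outcome:
    """Map the result of the diagnostic packets to the node's next step."""
    if not fault_persists:
        return Outcome.RETURN_TO_DETECTION
    if subcard_reset_performed:
        return Outcome.ALARM_AND_SWITCHOVER
    return Outcome.ALARM_ONLY


if __name__ == "__main__":
    for persists in (True, False):
        for reset_done in (True, False):
            print(persists, reset_done, recovery_determination(persists, reset_done).name)
```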
Fault alarm
If link-heartbeat loopback detects packet loss, it triggers SAID ping diagnosis, and a recovery operation (subcard or board reset) is performed. If services still fail to recover, the device continues to detect packet loss and reports an alarm.
If services fail to be restored after recovery operations (reset the subcard or board), the device detects packet loss and reports an alarm.