SAID for Ping

Background

Failures to ping a directly connected device are common on networks. They can interrupt services for a long time, and services may not recover automatically. A ping traverses several IP forwarding phases, so a failure may be caused by a hardware entry error, a board fault, or a subcard fault on the local device, or by a fault on an intermediate device or the peer device. This makes it difficult to locate or demarcate the fault.

Definition

The ping service node is a specific SAID service node. This node performs link-heartbeat loopback detection to detect service faults, diagnoses each ping forwarding phase to locate or demarcate faults, and takes corresponding recovery actions.

Principles

For details about the SAID framework and principles, see Basic SAID Functions. SAID uses IP packets whose protocol number is 1, which indicates ICMP. The ping service node goes through four phases (fault detection, fault diagnosis, fault recovery, and service recovery determination) to implement automatic device diagnosis, fault information collection, and service recovery.

  • Fault detection

    The ping service node performs link-heartbeat loopback detection to detect service faults. The packets used are ICMP detection packets. There are 12 packet templates in total. Each template sends two packets in sequence within a period of 30s. Therefore, a total of 24 packets are sent by the 12 templates within a period of 30s. After five periods, the system starts to collect statistics on lost packets and modified packets.

    Link-heartbeat loopback detection is classified as packet modification detection or packet loss detection.
    • Packet modification detection checks whether the content of received heartbeat packets is the same as the content of sent heartbeat packets. If one of the following conditions is met, a trigger message is sent to instruct the SAID ping node to perform fault diagnosis:

      • Modified packets are detected in each of the five periods.
      • Two or more packets are modified in a period.
    • Packet loss detection checks whether the difference between the number of received heartbeat packets and the number of sent heartbeat packets is within the permitted range. If one of the following conditions is met, a trigger message is sent to instruct the SAID ping node to perform fault diagnosis:

      • The total number of lost packets exceeds 3.
      • After each packet sending period ends, the system checks the protocol status and whether ARP entries exist on the interface, and it finds no ARP entry in three consecutive periods.
      • The absolute value of the difference between the number of lost packets whose payload is all 0s and the number of lost packets whose payload is all Fs is greater than 25% of the total number of sent packets in five periods.
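
    The trigger conditions above can be sketched as follows. This is a minimal illustration, not the device implementation. The PeriodStats class, its field names, and both function names are hypothetical. The sketch also assumes that the "total number of lost packets" is counted over the same five-period window as the other statistics, which the text does not state explicitly.

```python
from dataclasses import dataclass

TEMPLATES = 12               # packet templates (from the text)
PACKETS_PER_TEMPLATE = 2     # packets each template sends per 30s period
PERIODS = 5                  # periods collected before statistics are evaluated
SENT_PER_PERIOD = TEMPLATES * PACKETS_PER_TEMPLATE   # 24 packets per period
SENT_PER_WINDOW = SENT_PER_PERIOD * PERIODS          # 120 packets in five periods


@dataclass
class PeriodStats:
    """Hypothetical per-period counters; the field names are illustrative."""
    modified: int = 0          # received packets whose content differs from what was sent
    lost_all_zeros: int = 0    # lost packets whose payload is all 0s
    lost_all_fs: int = 0       # lost packets whose payload is all Fs
    lost_other: int = 0        # lost packets with any other payload
    arp_present: bool = True   # ARP entry found when the period ended

    @property
    def lost(self) -> int:
        return self.lost_all_zeros + self.lost_all_fs + self.lost_other


def modification_trigger(window: list) -> bool:
    """True if packet modification detection should trigger fault diagnosis."""
    in_every_period = bool(window) and all(p.modified > 0 for p in window)
    two_or_more_in_one_period = any(p.modified >= 2 for p in window)
    return in_every_period or two_or_more_in_one_period


def loss_trigger(window: list) -> bool:
    """True if packet loss detection should trigger fault diagnosis."""
    # Condition 1: total lost packets exceeds 3 (window scope is an assumption).
    if sum(p.lost for p in window) > 3:
        return True
    # Condition 2: no ARP entry in three consecutive periods.
    missing_run = 0
    for p in window:
        missing_run = 0 if p.arp_present else missing_run + 1
        if missing_run >= 3:
            return True
    # Condition 3: imbalance between all-0s and all-Fs losses exceeds 25% of sent packets.
    zeros = sum(p.lost_all_zeros for p in window)
    fs = sum(p.lost_all_fs for p in window)
    return abs(zeros - fs) > 0.25 * SENT_PER_WINDOW
```

    Under these assumptions, a window of five PeriodStats objects is evaluated once per statistics cycle, and either function returning True corresponds to sending the trigger message to the SAID ping node.
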
  • Fault diagnosis

    After receiving the trigger message in the fault detection state, the ping service node enters the fault diagnosis state.

    • If a packet loss error is detected on the device, the SAID ping node checks whether a module (subcard, TM, or NP) on the device is faulty. If no module is faulty, the system completes the diagnosis and returns to the fault detection state.
    • If a packet loss error is detected on the device, the SAID ping node checks whether a module (subcard, TM, or NP) on the device is faulty. If a module fault occurs, the system performs loopback diagnosis. If packet loss or modification is detected during loopback, the local device is faulty. The system then enters the fault recovery state. If no packet is lost during loopback diagnosis, the system returns to the fault detection state.
    • If a packet modification error is detected on the device, the SAID ping node checks whether a module (subcard, TM, or NP) on the device is faulty. Loopback diagnosis is performed regardless of whether a module fault occurs. If packet loss or packet modification occurs during loopback, the local device is faulty. The system then enters the fault recovery state. If no packet is lost during the loopback, the system returns to the fault detection state and generates a packet modification alarm.
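
    The three diagnosis branches above can be condensed into one decision function. This is a sketch under stated assumptions: the Outcome type, the state names, and the run_loopback callback are hypothetical, and the callback is assumed to return True when loopback diagnosis observes packet loss or modification.

```python
from typing import Callable, NamedTuple


class Outcome(NamedTuple):
    next_state: str           # "fault_detection" or "fault_recovery"
    modification_alarm: bool  # True if a packet modification alarm is generated


def diagnose(error: str, module_faulty: bool,
             run_loopback: Callable[[], bool]) -> Outcome:
    """Map a detected error to the next state, following the three cases in the text."""
    if error not in ("loss", "modification"):
        raise ValueError("error must be 'loss' or 'modification'")
    # Packet loss with no faulty module: diagnosis ends without loopback.
    if error == "loss" and not module_faulty:
        return Outcome("fault_detection", False)
    # Packet loss with a faulty module, or any packet modification: run loopback.
    if run_loopback():
        return Outcome("fault_recovery", False)   # the local device is faulty
    # Loopback is clean: return to detection; only the modification case raises an alarm.
    return Outcome("fault_detection", error == "modification")
```

    Passing the loopback step as a callback keeps it from running in the one case where the text skips it (packet loss with no faulty module).
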
  • Fault recovery

    If a fault is detected during loopback diagnosis, the ping service node determines whether a counting error occurs on the associated subcard.
    • If a counting error occurs on the subcard, the ping service node resets the subcard for service recovery. Then, the node enters the service recovery determination state and performs link-heartbeat loopback detection to determine whether services recover. If services recover, the node returns to the fault detection state. If services do not recover, the node returns to the fault recovery state and takes a secondary recovery action. (For a subcard reset, the secondary recovery action is board reset.)

    • If no counting error occurs on the subcard, the ping service node resets the involved board for service recovery. After the board starts, the node enters the service recovery determination state and performs link-heartbeat loopback detection to determine whether services recover. If services recover, the node returns to the fault detection state. If services do not recover, the node remains in the service recovery determination state and periodically performs link-heartbeat loopback detection until services recover.
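
    The recovery escalation described in the two cases above can be sketched as a pair of small functions. The function names, action names, and state names are hypothetical labels chosen for illustration; only the transitions themselves come from the text.

```python
from typing import Optional, Tuple


def first_action(counting_error: bool) -> str:
    """Pick the initial recovery action based on the subcard counting check."""
    return "reset_subcard" if counting_error else "reset_board"


def after_determination(action: str, recovered: bool) -> Tuple[str, Optional[str]]:
    """Return (next_state, next_action) once service recovery has been checked."""
    if action not in ("reset_subcard", "reset_board"):
        raise ValueError("unknown recovery action")
    if recovered:
        return ("fault_detection", None)
    if action == "reset_subcard":
        # Secondary recovery action for a failed subcard reset is a board reset.
        return ("fault_recovery", "reset_board")
    # A board reset has no further escalation: keep checking periodically.
    return ("service_recovery_determination", None)
```

    For example, a subcard with a counting error whose reset does not restore services yields ("fault_recovery", "reset_board"), after which the board-reset rule applies.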

  • Service recovery determination

    After fault recovery is complete, the ping service node uses the fault packet template to send diagnostic packets. If a fault still exists and a subcard reset has been performed, the node generates an alarm and instructs the subcard to perform switching for self-healing. If a fault still exists but no subcard reset has been performed, the node generates an alarm only. If no fault exists, the node instructs the link-heartbeat loopback function to return to the initial state, and the node itself returns to the fault detection state.
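
    The three outcomes of this phase can be written as a small lookup. This is an illustrative sketch only; the Determination type and its field names are hypothetical.

```python
from typing import NamedTuple


class Determination(NamedTuple):
    alarm: bool                 # an alarm is generated
    subcard_switching: bool     # the subcard is told to perform switching for self-healing
    back_to_detection: bool     # loopback is reinitialized and the node returns to fault detection


def determine(fault_persists: bool, subcard_reset_done: bool) -> Determination:
    """Map the post-recovery check to the actions listed in the text."""
    if not fault_persists:
        return Determination(alarm=False, subcard_switching=False, back_to_detection=True)
    # Fault still present: always alarm; switching only follows a subcard reset.
    return Determination(alarm=True, subcard_switching=subcard_reset_done,
                         back_to_detection=False)
```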

  • Fault alarm

    If link-heartbeat loopback detects packet loss, it triggers SAID ping diagnosis and recovery operations (subcard or board reset). If services still fail to recover and the device still detects packet loss, the device reports an alarm.

    If link-heartbeat loopback detects packet modification, it triggers SAID ping diagnosis. The device reports an alarm in any of the following cases:
    • Services fail to recover after recovery operations (subcard or board reset), and the device detects packet loss.
    • A software error occurs, the device forcibly cancels link-heartbeat loopback, and no other recovery operation is performed within 8 minutes.
    • No packet loss or packet modification error occurs during link-heartbeat loopback, the device cancels the recovery operation, and no other recovery operation is performed within 8 minutes.
    • The board does not support SAID ping.
Copyright © Huawei Technologies Co., Ltd.