Overview of Reliability

Definition

Reliability refers to the set of technologies that shorten traffic interruption time and help ensure quality of service on a network, improving user experience.

Device reliability can be assessed from the following aspects: system, hardware, and software reliability design; reliability testing and verification; and IP network reliability design.

As networks develop rapidly and applications diversify, value-added services (VASs) are widely used and the demand for network bandwidth increases dramatically. Any network service interruption can cause heavy losses to carriers.

As a result, demands for network infrastructure reliability are increasing.

This chapter describes IP reliability technologies supported by the NetEngine 8000 F.

Reliability Indexes

Reliability indexes include the mean time to repair (MTTR), mean time between failures (MTBF), and availability.

Generally, product or system reliability is assessed based on the MTTR and MTBF.

MTTR: The MTTR indicates the fault rectification capability in terms of maintainability. This index refers to the average time that a component or a device takes to recover from a failure. The MTTR involves spare parts management and customer service and plays an important role in evaluating device maintainability.
The MTTR is calculated using the following formula:

MTTR = Fault detection time + Board replacement time + System initialization time + Link recovery time + Route convergence time + Forwarding recovery time

The smaller each addend is, the shorter the MTTR and the higher the device availability.
MTBF: The MTBF reflects how likely faults are to occur. This index refers to the average time (usually expressed in hours) for which a component or a device works properly between failures.
Availability: Availability indicates the proportion of time for which a system is usable. Availability improves when the MTBF increases or the MTTR decreases.
Availability is calculated using the following formula:

Availability = MTBF/(MTBF + MTTR)

In the telecom industry, 99.999% availability means that service interruptions caused by device failures total about 5 minutes each year (0.001% of the 525,600 minutes in a year, or roughly 5.26 minutes).
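The availability formula and the five-nines figure can be checked numerically. The following is a minimal sketch; the function names and the sample MTBF and MTTR values are illustrative assumptions, not figures for any specific device.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def availability(mtbf: float, mttr: float) -> float:
    """Return MTBF / (MTBF + MTTR); both arguments must use the same time unit."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf / (mtbf + mttr)


def annual_downtime_minutes(avail: float) -> float:
    """Return the expected downtime per 365-day year for a given availability."""
    if not 0.0 <= avail <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - avail) * MINUTES_PER_YEAR


# Hypothetical figures: an MTBF of 100,000 hours and an MTTR of 1 hour.
example = availability(mtbf=100_000, mttr=1)
print(f"availability = {example:.7f}")
print(f"downtime at 99.999% = {annual_downtime_minutes(0.99999):.3f} min/year")
```

With these assumed inputs, an MTBF of 100,000 hours combined with an MTTR of 1 hour lands almost exactly on 99.999% availability, which corresponds to roughly 5.26 minutes of downtime in a 365-day year.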

On live networks, network faults and service interruptions cannot be fully prevented. Rapid recovery therefore matters: decreasing the MTTR is an effective way to improve availability.
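The effect of shortening one MTTR addend can be illustrated with a small calculation. This is a sketch under stated assumptions: the class name, the field names, the MTBF, and every duration below are hypothetical values chosen for illustration, not measurements.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class MttrComponents:
    """Addends of the MTTR formula, all in seconds (values are hypothetical)."""
    fault_detection: float
    board_replacement: float
    system_initialization: float
    link_recovery: float
    route_convergence: float
    forwarding_recovery: float

    def total(self) -> float:
        """MTTR is the sum of all addends."""
        return sum(getattr(self, f.name) for f in fields(self))


def availability(mtbf: float, mttr: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), with both values in the same unit."""
    return mtbf / (mtbf + mttr)


MTBF_SECONDS = 50_000 * 3600  # assumed MTBF of 50,000 hours

baseline = MttrComponents(30.0, 1800.0, 300.0, 10.0, 20.0, 5.0)
# Same device, but with fault detection assumed to complete in 50 ms.
faster_detection = replace(baseline, fault_detection=0.05)

for label, parts in (("baseline", baseline), ("faster detection", faster_detection)):
    mttr = parts.total()
    avail = availability(MTBF_SECONDS, mttr)
    print(f"{label}: MTTR = {mttr:.2f} s, availability = {avail:.8f}")
```

Because the MTTR is a plain sum, the largest addend dominates. In this assumed breakdown, faster fault detection helps, but board replacement time remains the main contributor.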

Reliability Requirement Levels

Reliability requirements at different levels differ in the target and implementation.

Table 1 describes three reliability requirement levels and their targets and implementations.

**Table 1** Reliability requirements
| Level | Target | Implementation |
|---|---|---|
| 1 | Few faults in system software and hardware | Hardware: simplified design, standardized circuits, reliable application of components, reliability control of purchased components, reliable manufacturing, environmental endurance, highly accelerated life testing (HALT), and highly accelerated stress screening (HASS). Software: specifications for software reliability design |
| 2 | No impact on the system if a fault occurs | Redundancy design, switchover policy, and switchover success rate improvement |
| 3 | Rapid recovery if a fault occurs and affects the system | Fault detection, diagnosis, isolation, and rectification |

Networking Principles for Highly Reliable IP Networks

Networking principles for highly reliable IP networks include hierarchical networking, redundancy, and load balancing.

The details are as follows:

Hierarchical networking: A network is divided into three layers: the core layer, convergence layer, and edge layer. Based on current or forecast service requirements, redundancy is configured so that each customer edge device is dual-homed to devices at the convergence layer. Devices at the convergence layer are in turn dual-homed to multiple devices in a single node or in different nodes at the upper layer. Devices at the core and convergence layers can be deployed as required. Devices at the core layer are fully or partially interconnected, so that any two devices can reach each other over a single high-speed route without requiring multiple interconnections.
Multiple interconnections are preferred between devices at the same layer, whereas multiple devices are preferred within a single node.
A lower-layer device is dual-homed or multi-homed to multiple devices in a single node or in different nodes.
Adjustments can be made based on the actual traffic volume.
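
The dual-homing principle can be checked against a planned topology with a few lines of code. The sketch below is illustrative only: the device names, the `under_homed` helper, and the representation of the topology as a mapping from each lower-layer device to its upper-layer neighbors are all assumptions.

```python
from typing import Dict, List, Set


def under_homed(uplinks: Dict[str, Set[str]], minimum: int = 2) -> List[str]:
    """Return devices that have fewer than `minimum` distinct upper-layer neighbors."""
    return sorted(dev for dev, ups in uplinks.items() if len(ups) < minimum)


# Hypothetical topology: each key is a lower-layer device, and each value is
# the set of upper-layer devices it connects to.
topology = {
    "CE1": {"AGG1", "AGG2"},
    "CE2": {"AGG1"},  # single-homed: does not meet the dual-homing principle
    "AGG1": {"CORE1", "CORE2"},
    "AGG2": {"CORE1", "CORE2"},
}

print(under_homed(topology))
```

Here only `CE2` is reported, because it has a single uplink. The check counts distinct upper-layer devices only; it does not verify whether those devices sit in a single node or in different nodes.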