CCAMP Working Group                                 P. Czezowski (FLA)
Internet Draft                                        T. Soumiya (FLL)
draft-czezowski-optical-recovery-reqs-00.txt                 (Editors)
Expires: April 2003                                       October 2002


             Optical Network Failure Recovery Requirements
             draft-czezowski-optical-recovery-reqs-00.txt


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026 [1].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This draft describes requirements for control plane-based recovery
   from data plane failures in pre-OTN networks.  Pre-OTN networks are
   transport networks that have a GMPLS-based control plane and
   various transport plane technologies, such as Optical Cross
   Connects and Optical Add/Drop Multiplexers.  An important feature
   of these networks is recovery from failures, using either a
   protection or a restoration scheme.  Achieving recovery under
   strict time constraints is a difficult problem.  Shared mesh-based
   recovery is especially desirable for reducing spare capacity ratios
   and achieving flexible recovery scenarios.  After a brief overview
   and discussion, the requirements are presented as an itemized list
   in Section 3.4 of this document.
Czezowski & Soumiya      Expires - April 2003                 [Page 1]

draft-czezowski-optical-recovery-reqs-00.txt              October 2002

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119 [2].

Table of Contents

   1. Introduction
   2. Glossary of Terms Used
   3. Failure Recovery Requirements
      3.1 Overview of Recovery Requirements
      3.2 Shared Mesh-based Recovery
      3.3 Failure Notification Mechanisms
      3.4 pre-OTN Network Failure Recovery Requirements
   4. Security Considerations
   5. Conclusions
   References
   Acknowledgments
   Editors' Addresses
   Contributing Authors

1. Introduction

   This draft describes requirements for control plane-based recovery
   from data plane failures in pre-OTN networks.  Pre-OTN networks are
   transport networks that have a GMPLS-based [3] control plane and
   various transport plane technologies, such as Optical
   Cross-Connects (OXC) and Optical Add/Drop Multiplexers (OADM).
   Service recovery from failures, using either a protection or a
   restoration scheme, is an important feature of these networks for
   ensuring high availability and uninterrupted service.  Achieving
   service recovery under strict time constraints is a difficult
   problem.  Several mechanisms for recovery in mesh and ring
   topologies have been devised.
   Protection and restoration algorithms can be used for local repair
   (span-based or node-based) and LSP recovery.  Shared mesh-based
   recovery is especially desirable for reducing spare capacity ratios
   and achieving flexible service recovery scenarios.  Span-based
   recovery can handle faults such as fiber link failures.  In the
   case of a node failure, recovery can be done through either
   node-based or LSP-based techniques.  While LSP-based recovery has
   the potential to keep redundancy requirements low, it also entails
   the potentially lengthy delay incurred in notifying all nodes along
   the recovery path of the failure of a remote resource.  For some
   applications, recovery paths must be chosen carefully to meet
   strict recovery time requirements (e.g., 50 ms).

   There are currently several Internet Drafts in the Sub-IP Area
   related to recovery in GMPLS networks.  They cover the topics of
   terminology [4], functional specification [5] and mechanisms
   analysis [6] for recovery in GMPLS-based networks, and
   survivability requirements and considerations for traffic
   engineered or hierarchical networks [7,8].  As a set, these
   documents provide their readers with detailed descriptions of the
   concepts and mechanisms used in network recovery.  However, the
   list of requirements for control plane-based recovery has not been
   specifically detailed in any one document.

2. Glossary of Terms Used

   The following acronyms are used in this document:

   o GMPLS:   Generalized Multiprotocol Label Switching [3]
   o LMP:     Link Management Protocol [9]
   o LSP:     Label Switched Path
   o LSR:     Label Switching Router
   o OADM:    Optical Add/Drop Multiplexer
   o OTN:     Optical Transport Network
   o OXC:     Optical Cross-Connect
   o RSVP-TE: Resource Reservation Protocol - Traffic Engineering [10]

   The terminology for GMPLS-based recovery is documented in [4].
   These terms are borrowed from a work in progress at the ITU-T [11].
   Here, we use the following terms from that document:

   o Detecting Entity (Failure Detection): An entity that detects a
     failure or group of failures, thus providing a non-correlated
     list of failures.

   o Reporting Entity (Failure Correlation and Notification): An
     entity that can make an intelligent decision on fault correlation
     and report the failure to the deciding entity.  Fault reporting
     can be automatically performed by the deciding entity detecting
     the failure.

   o Deciding Entity (part of the failure recovery decision process):
     An entity that makes the recovery decision or selects the
     recovery resources.  This entity communicates the decision to the
     impacted LSPs/spans with the recovery actions to be performed.

   o Recovering Entity (part of the failure recovery activation
     process): An entity that participates in the recovery of the
     LSPs/spans.

   o Bridge: A bridge is the function that connects the normal traffic
     and extra traffic to the working and recovery LSP/span.  There
     are three types of bridges (Permanent Bridge, Broadcast Bridge
     and Selector Bridge).

   o Selector: A selector is the function that extracts the normal
     traffic either from the working or the recovery LSP/span.  There
     are two types of selectors (Selective Selector and Merging
     Selector).

   o Recovery phases:
        1. Failure Detection
        2. Failure Localization and Isolation
        3. Failure Notification
        4. Recovery (Protection or Restoration)
        5. Reversion (Normalization)

3. Failure Recovery Requirements

   Even though some requirements for fault recovery have been
   discussed in working groups of the Sub-IP area, several additional
   aspects should be examined and mentioned regarding recovery in
   pre-OTN networks.  In this section, we describe the fault recovery
   requirements that we have identified.  For completeness, we do not
   avoid restating requirements listed in other drafts.
3.1 Overview of Recovery Requirements

   This subsection summarizes the survivability requirements for
   pre-OTN networks.  Further detail on the requirements is provided
   in the subsequent subsections.

   The following classes (types) of recovery are required for span,
   LSP segment, and LSP recovery:

   o Protection
      - pre-computed route and pre-selected (i.e., cross-connected)
        resources

   o Restoration
      - pre-computed route and on-demand selection of resources
      - on-demand route and on-demand selection of resources

   A recovery scheme uses either protection or restoration (or both),
   together with failure detection and notification mechanisms and
   protocols.

   Depending on the service specification, the timing bounds for the
   recovery schemes range from 50 ms (for local repair of services
   carrying voice calls) up to 2 seconds (for low-priority path-based
   repair).

   For multi-layered networks, hold-off timers are required to allow
   recovery at lower layers, and escalation must be supported.
   Support for horizontal hierarchy must also be included, because
   large networks are usually segmented.

   In general, recovery schemes are required to operate in a stable
   and cooperative manner to maximize the network's reliability and
   availability.  Such requirements entail that the recovery schemes
   also be resource efficient and as flexible as possible with respect
   to types of failures, service classes, and the network operator's
   policies.

   A temporal model of fault recovery is shown in Figure 1 below.  The
   diagram is adapted from [11].

   +-Network Impairment
   |    +-Fault Detection
   |    |    +-Start of Fault Notification
   |    |    |    +-Start of Traffic Switching
   |    |    |    |    +-Recovery Operation Complete
   |    |    |    |    |    +-Traffic Recovered
   |    |    |    |    |    |
   v    v    v    v    v    v
   ----------------------------------------------->
   | T1 | T2 | T3 | T4 | T5 |                  time

                 Figure 1. Recovery temporal model.
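
   The intervals T1 to T5 in Figure 1 can be read as the differences
   between consecutive event timestamps.  The sketch below is only an
   illustration of that reading, not part of the requirements: the
   function names and the sample timestamps are hypothetical, and it
   assumes that a timing bound such as 50 ms is measured from the
   network impairment to the point where traffic is recovered.

```python
# Event names follow Figure 1; everything else here is illustrative.
EVENTS = (
    "network_impairment",
    "fault_detection",
    "start_of_fault_notification",
    "start_of_traffic_switching",
    "recovery_operation_complete",
    "traffic_recovered",
)


def intervals_ms(timestamps_ms):
    """Return [T1, ..., T5] as differences between consecutive events."""
    if len(timestamps_ms) != len(EVENTS):
        raise ValueError("expected one timestamp per Figure 1 event")
    deltas = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    if any(d < 0 for d in deltas):
        raise ValueError("event timestamps must be non-decreasing")
    return deltas


def meets_bound(timestamps_ms, bound_ms):
    """Assumption: the bound covers impairment through traffic recovered."""
    return sum(intervals_ms(timestamps_ms)) <= bound_ms


# Hypothetical timestamps in milliseconds, one per Figure 1 event.
example = [0, 10, 15, 30, 42, 45]
print(intervals_ms(example))     # T1..T5
print(meets_bound(example, 50))  # checked against a 50 ms bound
```

   With these made-up timestamps the total is 45 ms, so the example
   would fit a 50 ms bound but not a tighter one.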
   The five recovery phases shown in the figure are (using the terms
   from [4]):

   1. Failure Detection - The time between the network impairment and
      the detection at the control plane (via a technology-dependent
      interface at the transport plane node).

   2. Failure Localization and Isolation - The time between when the
      detecting entity has detected a fault, and when the reporting
      entity starts the fault-recovery process.  This time assumes
      that the fault-recovery process at a given layer may wait for
      restoration or recovery to occur at another layer.  The
      reporting entity also performs failure correlation to reduce the
      number of notifications to be sent.

   3. Failure Notification - The time between when the reporting
      entity starts the notifications and when all the necessary
      deciding and recovering entities have received the failure
      notifications.

   4. Recovery (Protection or Restoration) - The time between the
      first and last recovery actions, after which the recovery path
      is carrying traffic.

   5. Reversion (Normalization) - The time (after recovery) until the
      original working path has been repaired and begins to carry the
      traffic again.

   Together, phases 1 and 2 are called Fault Management.  It is
   evident that the critical component in guaranteeing the time
   constraints for the service recovery is the Failure Notification
   phase.

   A recovery scheme should follow these steps.  The scheme should
   also allow the network operator to choose whether or not reversion
   is performed.

3.2 Shared Mesh-based Recovery

   In non-WDM optical networks, such as Synchronous Optical Network /
   Synchronous Digital Hierarchy (SONET/SDH), conventional protection
   techniques are currently the most commonly used.  These techniques
   are based on linear and ring network topologies.  Linear protection
   can be categorized as 1+1 and 1:N protection.
   Ring protection can be categorized as unidirectional path switched
   ring (UPSR) and bidirectional line switched ring (BLSR).

   However, linear 1+1 protection requires 100% redundancy in the
   spare resources for every working path.  Ring-based protection
   restricts the available topology to a ring and likewise requires
   100% redundancy in the spare resources for every working path.
   Even with 1:N link-based protection, it is difficult to select
   different routes flexibly.  These techniques are therefore wasteful
   of resources and inflexible, even though they can assure the level
   and speed of recovery from a failure.

   For reasons of efficiency and flexibility, pre-OTN network recovery
   schemes should support shared mesh-based recovery.  Shared mesh
   recovery can save resources by sharing recovery capacity among
   multiple working paths.  This approach increases the system
   flexibility because working paths and recovery paths can be set up
   between source and destination nodes without restrictions on the
   physical topology.  Furthermore, this flexibility facilitates fast
   recovery because the shared mesh provides more node candidates for
   the routing of the recovery paths.

3.3 Failure Notification Mechanisms

   In general, there are two alternatives for control plane-based
   failure notification:

   o Modified GMPLS signaling of failure notification messages
   o Controlled flooding of failure notification messages

   The GMPLS signaling protocol, RSVP-TE, supports notification using
   a Notify message.  Since RSVP-TE messages follow the path of the
   LSP, a 2- or 3-phase method is used to notify the deciding and
   recovering entities.  Under this scheme, the deciding entity relays
   the notification messages to the recovering entities on the
   recovery LSP.
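
   To make the difference between the two alternatives listed at the
   start of this subsection more concrete, the sketch below counts
   control plane hops until every node on a recovery LSP has been
   notified.  It is a toy model under stated assumptions, not a
   description of RSVP-TE or LMP behavior: the topology, node names,
   and function names are hypothetical, hop count is used as a rough
   proxy for notification delay, the head of the working LSP is
   assumed to be the deciding entity, and flooding is modeled as a
   breadth-first spread from the reporting node.

```python
from collections import deque


def flood_distances(adjacency, source):
    """Hop distance from source to each reachable node (breadth-first)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def signaled_hops(working_lsp, recovery_lsp, reporter):
    """Hops for a relay: reporter -> head of the working LSP (assumed
    deciding entity), then hop by hop along the recovery LSP."""
    to_decider = working_lsp.index(reporter)
    return to_decider + (len(recovery_lsp) - 1)


def flooded_hops(adjacency, recovery_lsp, reporter):
    """Hops until the farthest recovery LSP node has heard the flood."""
    dist = flood_distances(adjacency, reporter)
    return max(dist[node] for node in recovery_lsp)


# Hypothetical control plane topology; the failed C-D link is omitted.
adjacency = {
    "A": ["B", "E"],
    "B": ["A", "C", "E"],
    "C": ["B", "F"],
    "D": ["F"],
    "E": ["A", "B", "F"],
    "F": ["C", "D", "E"],
}
working = ["A", "B", "C", "D"]
recovery = ["A", "E", "F", "D"]

print(signaled_hops(working, recovery, "C"))
print(flooded_hops(adjacency, recovery, "C"))
```

   In this toy case the relay along the LSPs needs more hops than the
   flood, but a flood sends a message over every control link, which
   is one reason to keep it controlled.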
   The controlled flooding of fiber link failure notification messages
   on the control plane, perhaps by extending LMP [9], is another
   alternative.  Flooding the notifications supports recovery schemes
   that require policy- or priority-based decisions at multiple
   deciding entities distributed within the network.

   To meet the time constraints for recovery, the time spent on
   failure correlation and aggregation at the reporting entity must be
   minimized, as must the time that elapses before all recovering
   entities receive a failure notification (or recovery action)
   signal.

3.4 pre-OTN Network Failure Recovery Requirements

   This is our list of recovery requirements:

   o Requirements on efficiency of working and recovery bandwidth

     (1) A recovery scheme SHOULD allow efficient use of working LSP
         bandwidth using such measures as route optimization, taking
         into account route dependencies between a working path and
         its recovery path.

     (2) A recovery scheme SHOULD allow efficient use of recovery LSP
         bandwidth using such measures as route optimization, taking
         into account route dependencies between a working path and
         its recovery path.

     (3) A recovery scheme SHOULD allow sharing of recovery bandwidth
         among multiple recovery paths, when possible, to enable
         efficient use of recovery bandwidth.

   o Requirements on recovery actions

     (4) A recovery scheme SHOULD allow suppression of fault
         notification messages, so that spurious fault notification
         and recovery action messages are not broadcast within the
         network, ensuring scalability of the fault recovery
         mechanism.

     (5) A recovery scheme SHOULD ensure reliable fault recovery
         actions, provided that the control plane remains connected.

     (6) A recovery scheme SHOULD allow fallback operations of its
         recovery actions.
         For example, when the system encounters a fault class (e.g.,
         multiple simultaneous failures) that was not anticipated, the
         system should execute a best-effort recovery, such that as
         many working paths as possible are restored under the
         circumstances.

     (7) A recovery scheme SHOULD allow the network operator to choose
         whether or not the reversion actions are to be performed.

     (8) A recovery scheme SHOULD support recovery within bounded time
         constraints and MAY be compliant with generally used recovery
         times such as 50 ms for SONET/SDH protection.

     (9) A recovery scheme SHOULD allow testing and verification of
         the availability of the recovery path before its actual use.
         This testing may occur when the recovery path is provisioned,
         or after it is provisioned but before actual recovery action
         occurs and the path starts being used.

     (10) A recovery scheme SHOULD guarantee that recovery actions
          correctly deliver traffic from working paths to the
          respective recovery paths, such that the recovery actions do
          not result in any unintended connections or unintended
          diversion of traffic.

   o Requirements on recovery schemes

     (11) A recovery scheme SHOULD support and be compliant with
          generally used protection schemes such as 1+1, 1:1, 1:N, and
          M:N.

     (12) A recovery scheme SHOULD support recovery of failed LSPs
          even if the LSPs have different endpoints.

     (13) A recovery scheme SHOULD support priority-based recovery of
          failed LSPs.

   o Requirements on recovery priority of service classes

     (14) A recovery scheme SHOULD allow recovery of service classes
          based on their recovery priority, which ranges over a
          continuous spectrum from the lowest priority (best effort)
          to the highest priority (guaranteed), according to service
          class usage and agreements with customers.

     (15) A recovery scheme SHOULD allow support of service classes
          with different recovery time guarantees.
          For example, a service class carrying voice calls requires a
          recovery time of less than 50 ms to avoid loss of
          connections, whereas a service class carrying private lines
          requires a recovery time of less than 2 seconds.

   o Requirements on recovery granularity

     (16) A recovery scheme SHOULD allow aggregated recovery actions,
          ensuring scalability.

   o Requirements on failure notification delivery

     (17) A recovery scheme SHOULD be equipped with a failure
          notification mechanism that guarantees prompt and reliable
          delivery of notification of faults in the data plane to a
          deciding entity that is in charge of recovering the fault.

4. Security Considerations

   This draft does not introduce any new security issues.

5. Conclusions

   This draft describes requirements for control plane-based recovery
   from data plane failures in pre-OTN networks.  While there are
   currently several Internet Drafts in the Sub-IP Area related to
   service recovery in GMPLS networks, the list of requirements for
   control plane-based recovery has not been specifically detailed in
   any one document.  We identify the most important requirements as
   meeting potentially strict timing constraints, enabling flexible
   recovery schemes, and using resources efficiently.  Seventeen
   requirements are listed in Section 3.4.

References

   [1]  Bradner, S., "The Internet Standards Process -- Revision 3",
        BCP 9, RFC 2026, October 1996.

   [2]  Bradner, S., "Key words for use in RFCs to Indicate
        Requirement Levels", BCP 14, RFC 2119, March 1997.

   [3]  Mannie, E. (Ed.), "Generalized Multi-Protocol Label Switching
        (GMPLS) Architecture", Internet Draft, work in progress,
        draft-ietf-ccamp-gmpls-architecture-03.txt, August 2002.

   [4]  Mannie, E. and D.
        Papadimitriou (Eds.), "Recovery (Protection and Restoration)
        Terminology for GMPLS", Internet Draft, work in progress,
        draft-ietf-ccamp-gmpls-recovery-terminology-00.txt, June 2002.

   [5]  Lang, J.P. and B. Rajagopalan (Eds.), "Generalized MPLS
        Recovery Functional Specification", Internet Draft, work in
        progress, draft-bala-gmpls-recovery-functional-00.txt, August
        2002.

   [6]  Papadimitriou, D. and E. Mannie (Eds.), "Analysis of
        Generalized MPLS-based Recovery Mechanisms (including
        Protection and Restoration)", Internet Draft, work in
        progress,
        draft-papadimitriou-ccamp-gmpls-recovery-analysis-02.txt,
        August 2002.

   [7]  Lai, W.S., and D. McDysan (Eds.), "Network Hierarchy and
        Multilayer Survivability", Internet Draft, work in progress,
        draft-ietf-tewg-restore-hierarchy-01.txt, July 2002.

   [8]  Owens, K., et al., "Network Survivability Considerations for
        Traffic Engineered IP Networks", Internet Draft, work in
        progress, draft-owens-te-network-survivability-03.txt, May
        2002.

   [9]  Lang, J. (Ed.), "Link Management Protocol (LMP)", Internet
        Draft, draft-ietf-ccamp-lmp-06.txt, September 2002.

   [10] Berger, L. (Ed.), "Generalized MPLS Signaling - RSVP-TE
        Extensions", Internet Draft, work in progress,
        draft-ietf-mpls-generalized-rsvp-te-09.txt, September 2002.

   [11] ITU-T Draft Recommendation G.gps, "Generic Protection
        Switching", work in progress, April 2002.

Acknowledgments

   The following individuals provided valuable input to this draft:
   Richard Rabbat, Ching-Fong Su and Takafumi Chujo of Fujitsu Labs of
   America, Inc., Norihiko Shinomiya and Akira Chugo of Fujitsu
   Laboratories, Ltd.

Editors' Addresses

   Peter Czezowski                  Toshio Soumiya
   Fujitsu Labs of America, Inc.    Fujitsu Laboratories Ltd.
   595 Lawrence Expressway          1-1, Kamikodanaka 4-Chome
   Sunnyvale, CA 94085              Nakahara-ku, Kawasaki
   United States of America         211-8588, Japan
   Phone: +1-408-530-4516           Phone: +81-44-754-2765
   Email: peterc@fla.fujitsu.com    Email: soumiya.toshio@jp.fujitsu.com

Contributing Authors

   Kohei Shiomoto
   NTT Network Innovation Laboratories
   Midori-machi 3-9-11, Musashino-shi
   Tokyo, Japan 180-8585
   Phone: +81-422-59-4402
   Email: Shiomoto.Kohei@lab.ntt.co.jp

   Shoichiro Seno
   Mitsubishi Electric Corporation
   5-1-1 Ofuna, Kamakura
   Kanagawa, Japan 247-8501
   Phone: +81-467-41-2430
   Email: senos@isl.melco.co.jp