CCAMP Working Group                                  P. Czezowski (FLA)
   Internet Draft                                         T. Soumiya (FLL)
   draft-czezowski-optical-recovery-reqs-00.txt                  (Editors)
   Expires: April 2003 
                                                              October 2002
 
 
               Optical Network Failure Recovery Requirements 
 
               draft-czezowski-optical-recovery-reqs-00.txt 
 
 
Status of this Memo 
 
   This document is an Internet-Draft and is in full conformance with 
   all provisions of Section 10 of RFC2026 [1].  
    
   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF), its areas, and its working groups.  Note that      
   other groups may also distribute working documents as Internet-
   Drafts. 
    
   Internet-Drafts are draft documents valid for a maximum of six months 
   and may be updated, replaced, or obsoleted by other documents at any 
   time.  It is inappropriate to use Internet-Drafts as reference 
   material or to cite them other than as "work in progress." 
    
   The list of current Internet-Drafts can be accessed at 
        http://www.ietf.org/ietf/1id-abstracts.txt 
   The list of Internet-Draft Shadow Directories can be accessed at 
        http://www.ietf.org/shadow.html. 
 
 
Abstract 
    
   This draft describes requirements for control plane-based recovery
   from data plane failures in pre-OTN networks.  pre-OTN networks are
   transport networks that have a GMPLS-based control plane and various
   transport plane technologies (such as Optical Cross-Connects and
   Optical Add/Drop Multiplexers).  An important feature of these
   networks is recovery from failures, using either a protection or a
   restoration scheme.  Achieving recovery under strict time constraints
   is a difficult problem.  Shared mesh-based recovery is especially
   desirable for reducing spare capacity ratios and achieving flexible
   recovery scenarios.  Following a brief overview and discussion, the
   requirements are presented as an itemized list in Section 3.4 of
   this document.
    
 
Czezowski & Soumiya      Expires - April 2003                 [Page 1] 
             draft-czezowski-optical-recovery-reqs-00.txt October 2002 
 
 
Conventions used in this document 
    
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this 
   document are to be interpreted as described in RFC-2119 [2]. 
    
Table of Contents 
    
   1. Introduction...................................................2 
   2. Glossary of Terms Used.........................................3 
   3. Failure Recovery Requirements..................................4 
      3.1 Overview of Recovery Requirements..........................4 
      3.2 Shared Mesh-based Recovery.................................6 
      3.3 Failure Notification Mechanisms............................6 
      3.4 pre-OTN Network Failure Recovery Requirements..............7 
   4. Security Considerations........................................9 
   5. Conclusions....................................................9 
   References........................................................9
   Acknowledgments..................................................10 
   Editors' Addresses...............................................11 
   Contributing Authors.............................................11 
    
    
1. Introduction 
    
   This draft describes requirements for control plane-based recovery
   from data plane failures in pre-OTN networks.  pre-OTN networks are
   transport networks that have a GMPLS-based [3] control plane and
   various transport plane technologies, such as Optical Cross-Connects
   (OXC) and Optical Add/Drop Multiplexers (OADM).  Service recovery
   from failures, using either a protection or restoration scheme, is an
   important feature of these networks to ensure high availability and
   uninterrupted service.  Achieving service recovery under strict time
   constraints is a difficult problem.  Several mechanisms for recovery
   in mesh and ring topologies have been devised.  Protection and
   restoration algorithms can be used for local repair (span-based or
   node-based) and LSP recovery.  Shared mesh-based recovery is
   especially desirable for reducing spare capacity ratios and achieving
   flexible service recovery scenarios.
          
   Span-based recovery can handle faults such as fiber link failures.
   In the case of a node failure, recovery can be done through either
   node-based or LSP-based techniques.  While LSP-based recovery can
   reduce the amount of redundant capacity required, it also entails
   the potentially lengthy delay incurred in notifying all nodes along
   the recovery path of the failure of a remote resource.  For some
   applications, recovery paths must be chosen carefully to meet strict
   recovery time requirements (e.g., 50ms).
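
   The distinction between link failures and node failures can be
   illustrated with a small sketch.  It is not part of any GMPLS
   specification; the path representation (a list of node names) and
   the function names are assumptions made for illustration only.  A
   recovery path that shares no span with the working path survives a
   link failure, but it survives a node failure only if it also avoids
   the intermediate nodes of the working path.

```python
from typing import List, Set

Path = List[str]  # ordered node names, ingress first (hypothetical model)


def links(path: Path) -> Set[frozenset]:
    """Return the undirected links (spans) traversed by a path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def is_link_disjoint(working: Path, recovery: Path) -> bool:
    """True if the two paths share no span."""
    return links(working).isdisjoint(links(recovery))


def is_node_disjoint(working: Path, recovery: Path) -> bool:
    """True if the paths share no intermediate node (endpoints excluded)."""
    return set(working[1:-1]).isdisjoint(recovery[1:-1])


working = ["A", "B", "C", "D"]
recovery_1 = ["A", "E", "C", "F", "D"]  # avoids working spans, reuses node C
recovery_2 = ["A", "E", "F", "D"]       # avoids working spans and nodes

print(is_link_disjoint(working, recovery_1), is_node_disjoint(working, recovery_1))
print(is_link_disjoint(working, recovery_2), is_node_disjoint(working, recovery_2))
```

   In this example, recovery_1 would protect against any single link
   failure on the working path but not against a failure of node C,
   whereas recovery_2 would protect against both.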
 
 
   There are currently several Internet Drafts in the Sub-IP Area 
   related to recovery in GMPLS networks. They cover the topics of 
   terminology [4], functional specification [5] and mechanisms analysis 
   [6] for recovery in GMPLS-based networks, and survivability 
   requirements and considerations for traffic engineered or 
   hierarchical networks [7,8].  As a set, these documents provide their 
   readers with detailed descriptions of the concepts and mechanisms 
   used in network recovery.  However, the list of requirements for 
   control plane-based recovery has not been specifically detailed in 
   any one document. 
 
 
2. Glossary of Terms Used 
    
   The following acronyms are used in this document: 
    
      o  GMPLS:   Generalized Multiprotocol Label Switching [3] 
      o  LMP:     Link Management Protocol [9] 
      o  LSP:     Label Switched Path  
      o  LSR:     Label Switched Router 
      o  OADM:    Optical Add/Drop Multiplexer 
      o  OTN:     Optical Transport Network 
      o  OXC:     Optical Cross-Connect 
      o  RSVP-TE: Resource Reservation Protocol-Traffic Engineering [10]
 
   The terminology for GMPLS-based recovery is documented in [4]. These 
   terms are borrowed from a work in progress at the ITU-T [11]. Here, 
   we use the following terms from that document: 
 
      o  Detecting Entity (Failure Detection): An entity that detects a
         failure or group of failures, thus providing a non-correlated
         list of failures.
      o  Reporting Entity (Failure Correlation and Notification): An 
         entity that can make an intelligent decision on fault 
         correlation and report the failure to the deciding entity. 
         Fault reporting can be automatically performed by the deciding 
         entity detecting the failure. 
      o  Deciding Entity (part of the failure recovery decision
         process): An entity that makes the recovery decision or selects
         the recovery resources. This entity communicates the decision
         to the impacted LSPs/spans with the recovery actions to be
         performed.
      o  Recovering Entity (part of the failure recovery activation 
         process): An entity that participates in the recovery of the 
         LSPs/spans. 
      o  Bridge: A bridge is the function that connects the normal
         traffic and extra traffic to the working and recovery LSP/span.
         There are three types of bridges (Permanent Bridge, Broadcast
         Bridge and Selector Bridge).
      o  Selector: A selector is the function that extracts the normal
         traffic either from the working or the recovery LSP/span.
         There are two types of selectors (Selective Selector and
         Merging Selector).
      o  Recovery phases: 1. Failure Detection, 2. Failure Localization 
         and Isolation, 3. Failure Notification, 4. Recovery (Protection 
         or Restoration), 5. Reversion (Normalization) 
 
 
3. Failure Recovery Requirements 
    
   Even though some requirements for fault recovery have been discussed
   in working groups of the Sub-IP area, several additional aspects
   should be examined and mentioned regarding recovery in pre-OTN
   networks.  In this section, we describe the fault recovery
   requirements that we have identified.  For completeness, we also
   restate requirements that are listed in other drafts.
    
3.1 Overview of Recovery Requirements 
    
   This subsection summarizes the survivability requirements for pre-OTN
   networks. Further details on the requirements are provided in the
   subsequent subsections.
    
   The following classes (types) of recovery are required for span, LSP 
   segment, and LSP recovery: 
    
      o  Protection 
         - pre-computed route and pre-selected (i.e., cross- 
           connected) resources 
      o  Restoration 
         - pre-computed route and on-demand selection of resources 
         - on-demand route and on-demand selection of resources 
    
   A recovery scheme uses either protection or restoration (or both), 
   together with failure detection and notification mechanisms and 
   protocols.  Depending on the service specification, the timing bounds 
   for the recovery schemes range from 50ms (for local repair of 
   services carrying voice calls) up to 2 seconds (for low priority 
   path-based repair). 
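
   The recovery classes and timing bounds above can be captured in a
   small data model.  The following is a minimal sketch, not a
   normative encoding: the class and field names are hypothetical, and
   the pairing of each example service with a particular recovery
   class is an assumption made for illustration.  Only the 50ms and 2
   second bounds come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryClass(Enum):
    # value = (route pre-computed?, resources pre-selected?)
    PROTECTION = (True, True)
    RESTORATION_PRECOMPUTED = (True, False)
    RESTORATION_ON_DEMAND = (False, False)

    @property
    def route_precomputed(self) -> bool:
        return self.value[0]

    @property
    def resources_preselected(self) -> bool:
        return self.value[1]


@dataclass(frozen=True)
class ServiceProfile:
    name: str                      # hypothetical service label
    recovery_class: RecoveryClass  # assumed pairing, for illustration
    recovery_bound_ms: int         # upper bound on recovery time


VOICE = ServiceProfile("voice-local-repair", RecoveryClass.PROTECTION, 50)
LOW_PRIORITY = ServiceProfile(
    "low-priority-path-repair", RecoveryClass.RESTORATION_ON_DEMAND, 2000)


def meets_bound(profile: ServiceProfile, measured_ms: float) -> bool:
    """True if a measured recovery time is within the service's bound."""
    return measured_ms <= profile.recovery_bound_ms


print(meets_bound(VOICE, 120), meets_bound(LOW_PRIORITY, 120))
```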
    
   For multi-layered networks, hold-off timers are required to allow 
   recovery at lower layers, and escalation must be supported.  Support 
   for horizontal hierarchy must also be included, because large 
   networks are usually segmented. 
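
   The interaction between a hold-off timer and escalation can be
   sketched as a simple decision function.  This is an illustration
   under stated assumptions, not a protocol definition: the function
   name, the return labels, and the convention that the upper layer
   detects the fault at time zero are all hypothetical.

```python
from typing import Optional


def upper_layer_action(hold_off_ms: float,
                       lower_layer_recovered_at_ms: Optional[float]) -> str:
    """Decide what the upper layer does after detecting a fault at t = 0.

    lower_layer_recovered_at_ms is the time at which the lower layer
    restored the traffic, or None if it never did.
    """
    if (lower_layer_recovered_at_ms is not None
            and lower_layer_recovered_at_ms <= hold_off_ms):
        return "no-action"   # lower layer recovered within the hold-off
    return "escalate"        # hold-off expired: upper layer starts recovery


print(upper_layer_action(100, 40))    # lower layer was fast enough
print(upper_layer_action(100, 250))   # lower layer was too slow
print(upper_layer_action(100, None))  # lower layer did not recover
```

   A longer hold-off gives the lower layer more opportunity to repair
   the fault, at the cost of delaying upper-layer recovery when
   escalation turns out to be necessary.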
    
 
   In general, recovery schemes are required to operate in a stable and 
   cooperative manner to maximize the network's reliability and 
   availability.  Such requirements entail that the recovery schemes 
   also be resource efficient and as flexible as possible with respect 
   to types of failures, service classes, and the network operator's 
   policies. 
    
   A temporal model of fault recovery is shown in Figure 1 below.  The 
   diagram is adapted from [11]. 
 
 
          +-Network Impairment   
          |    +-Fault Detection   
          |    |    +-Start of Fault Notification 
          |    |    |    +-Start of Traffic Switching 
          |    |    |    |    +-Recovery Operation Complete 
          |    |    |    |    |    +-Traffic Recovered 
          |    |    |    |    |    | 
          v    v    v    v    v    v 
         -----------------------------------------------> 
          | T1 | T2 | T3 | T4 | T5 |               time 
 
   Figure 1. Recovery temporal model. 
    
    
   The five recovery phases shown in the figure are (using the terms 
   from [4]): 
 
      1. Failure Detection - The time between the network impairment and
         the detection at the control plane (via a technology-dependent
         interface to the transport plane at the node).
 
      2. Failure Localization and Isolation - The time between when the 
         detecting entity has detected a fault, and when the reporting 
         entity starts the fault-recovery process. This time assumes 
         that the fault-recovery process at a given layer may wait for 
         restoration or recovery to occur at another layer. The 
         reporting entity also performs failure correlation to reduce 
         the number of notifications to be sent. 
    
      3. Failure Notification - The time between when the reporting 
         entity starts the notifications and when all the necessary 
         deciding and recovering entities have received the failure 
         notifications. 
 
      4. Recovery (Protection or Restoration) - The time between the 
         first and last recovery actions, after which the recovery path 
         is carrying traffic. 
 
 
      5. Reversion (Normalization) - The time (after recovery) until the 
         original working path has been repaired and begins to carry the 
         traffic again. 
 
   Together, phases 1 and 2 are called Fault Management.  The critical
   component in guaranteeing the time constraints for service recovery
   is the Failure Notification phase, because every deciding and
   recovering entity must be notified before recovery can complete.  A
   recovery scheme should follow these phases in order.  The scheme
   should also allow the network operator to choose whether or not
   reversion is performed.
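
   The temporal model of Figure 1 lends itself to a simple budget
   calculation.  The sketch below sums the intervals T1 to T5 and
   computes how much time remains for failure notification under a
   given recovery bound.  The class name, field names, and the sample
   interval values are hypothetical and are not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTimeline:
    """Intervals T1..T5 of Figure 1, in milliseconds."""
    t1: int  # network impairment -> fault detection
    t2: int  # fault detection -> start of fault notification
    t3: int  # start of fault notification -> start of traffic switching
    t4: int  # start of traffic switching -> recovery operation complete
    t5: int  # recovery operation complete -> traffic recovered

    def total_ms(self) -> int:
        """Time from network impairment until traffic is recovered."""
        return self.t1 + self.t2 + self.t3 + self.t4 + self.t5

    def notification_budget_ms(self, bound_ms: int) -> int:
        """Time available for T3 once the other intervals are spent."""
        return bound_ms - (self.total_ms() - self.t3)


timeline = RecoveryTimeline(t1=10, t2=5, t3=20, t4=10, t5=3)  # assumed values
print(timeline.total_ms())                  # 48
print(timeline.total_ms() <= 50)            # True
print(timeline.notification_budget_ms(50))  # 22
```

   With these assumed values, notification may take at most 22 ms if
   the overall bound is 50 ms; any growth in T3 beyond that would
   violate the bound regardless of how fast the other phases are.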
 
3.2 Shared Mesh-based Recovery 
    
   In non-WDM optical networks, such as Synchronous Optical Network /  
   Synchronous Digital Hierarchy (SONET/SDH), conventional protection 
   techniques are currently the most commonly used.  These techniques 
   are based on linear and ring network topologies.  Linear protection 
   can be categorized as 1+1 and 1:N protection.  Ring protection can be 
   categorized as uni-directional path switched ring (UPSR) and bi-
   directional line switched ring (BLSR). 
    
   However, linear 1+1 protection requires 100% redundancy in the spare
   resources for every working path.  Ring-based protection restricts
   the available topology to a ring and also requires 100% redundancy
   in the spare resources for every working path.  Even with 1:N
   link-based protection, it is difficult to select different routes
   flexibly.  These techniques therefore use resources inefficiently
   and offer little flexibility, even though the level and speed of
   recovery from a failure can be assured.  For reasons of efficiency
   and flexibility, pre-OTN network recovery schemes should support
   shared mesh-based recovery.
    
   Shared mesh recovery can save resources by sharing recovery capacity 
   among multiple working paths.  This approach increases the system 
   flexibility because working paths and recovery paths can be set up 
   between source and destination nodes without restrictions on the 
   physical topology.  Furthermore, this flexibility facilitates fast 
   recovery because the shared mesh provides more node candidates for 
   the routing of the recovery paths. 
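
   The saving from sharing can be made concrete with a small capacity
   calculation.  The sketch below compares dedicated recovery capacity
   with shared recovery capacity, assuming that each demand needs one
   unit of bandwidth and that at most one span fails at a time.  Both
   assumptions, the function names, and the example topology are
   illustrative and are not taken from any referenced draft.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Link = FrozenSet[str]
Demand = Tuple[List[str], List[str]]  # (working path, recovery path)


def path_links(path: List[str]) -> List[Link]:
    """Return the undirected spans traversed by a path."""
    return [frozenset(pair) for pair in zip(path, path[1:])]


def spare_units(demands: List[Demand]) -> Tuple[int, int]:
    """Return (dedicated, shared) spare units over all recovery links."""
    dedicated = 0
    # needed[r][w]: units required on recovery link r if working span w fails
    needed: Dict[Link, Dict[Link, int]] = defaultdict(lambda: defaultdict(int))
    for working, recovery in demands:
        for r_link in path_links(recovery):
            dedicated += 1
            for w_link in path_links(working):
                needed[r_link][w_link] += 1
    # With a single failure, each link only needs its worst-case demand.
    shared = sum(max(per_failure.values()) for per_failure in needed.values())
    return dedicated, shared


demands = [
    (["A", "B"], ["A", "E", "F", "B"]),
    (["C", "D"], ["C", "E", "F", "D"]),
]
print(spare_units(demands))  # (6, 5)
```

   The two working paths use different spans, so they cannot fail
   together under the single-failure assumption and one unit on span
   E-F can back up both.  When working paths share a span, no such
   saving is possible on their common recovery links.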
 
3.3 Failure Notification Mechanisms 
    
   In general, there are two alternatives for control plane based 
   failure notification: 
    
      o  Modified GMPLS signaling of failure notification messages 
      o  Controlled flooding of failure notification messages 
    
 
   The GMPLS signaling protocol, RSVP-TE, supports notification using a
   Notify message.  Since RSVP-TE messages follow the path of the LSP, a
   2-phase or 3-phase method is used to notify the deciding and
   recovering entities.  Under this scheme, the deciding entity relays
   the notification messages to the recovering entities on the recovery
   LSP.
    
   The controlled flooding of fiber link failure notification messages
   on the control plane, perhaps by extending LMP [9], is another
   alternative.  Flooding the notifications supports recovery schemes
   that require policy- or priority-based decisions at multiple deciding
   entities distributed within the network.
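
   One way to picture controlled flooding is as a breadth-first
   propagation with duplicate suppression and a hop limit.  The sketch
   below is a simulation only; it does not represent LMP or any
   existing protocol extension.  The adjacency-list model, the
   function name, and the hop limit parameter are assumptions.

```python
from collections import deque
from typing import Dict, List, Tuple


def flood(adjacency: Dict[str, List[str]], origin: str,
          max_hops: int) -> Tuple[Dict[str, int], int]:
    """Flood one failure notification from the origin node.

    Returns (hop count at which each node first received it, messages
    sent).  A node forwards the notification only the first time it
    sees it, never back to the neighbor it came from, and never beyond
    max_hops.
    """
    first_seen: Dict[str, int] = {origin: 0}
    messages = 0
    queue = deque([(origin, None, 0)])  # (node, previous hop, hop count)
    while queue:
        node, came_from, hops = queue.popleft()
        if hops == max_hops:
            continue                    # hop limit reached: do not forward
        for neighbor in adjacency[node]:
            if neighbor == came_from:
                continue
            messages += 1
            if neighbor not in first_seen:  # duplicate suppression
                first_seen[neighbor] = hops + 1
                queue.append((neighbor, node, hops + 1))
    return first_seen, messages


ring = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
reached, sent = flood(ring, "A", max_hops=3)
print(sorted(reached.items()), sent)
```

   In the four-node ring, every node is reached within two hops using
   five messages.  The duplicate suppression and the hop limit are
   what keep the flooding "controlled"; without them the message count
   would grow with every cycle in the topology.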
    
   To meet the time constraints for recovery, two delays must be
   minimized: the time the reporting entity spends on failure
   correlation and aggregation, and the time that elapses before all
   recovering entities have received a failure notification (or
   recovery action) signal.
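
   Failure correlation at the reporting entity can be as simple as
   grouping per-LSP alarms by the link on which they were observed, so
   that one notification is sent per failed link rather than one per
   affected LSP.  The sketch below assumes alarms arrive as (link
   identifier, LSP identifier) pairs; that format, the identifiers,
   and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def correlate(alarms: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (link_id, lsp_id) alarms into one notification per link."""
    by_link: Dict[str, Set[str]] = defaultdict(set)
    for link_id, lsp_id in alarms:
        by_link[link_id].add(lsp_id)  # repeated alarms collapse here
    return {link: sorted(lsps) for link, lsps in by_link.items()}


alarms = [
    ("link-7", "lsp-1"),
    ("link-7", "lsp-2"),
    ("link-7", "lsp-1"),  # duplicate alarm for the same LSP
    ("link-9", "lsp-3"),
]
notifications = correlate(alarms)
print(len(alarms), "alarms ->", len(notifications), "notifications")
print(notifications)
```

   A single pass over the alarms keeps the correlation time
   proportional to the number of alarms, which matters because this
   time counts against the recovery bound.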
    
3.4 pre-OTN Network Failure Recovery Requirements 
    
   This is our list of recovery requirements: 
    
   o  Requirements on efficiency of working and recovery bandwidth 
    
      (1) A recovery scheme SHOULD allow efficient use of working LSP 
   bandwidth using such measures as route optimization, taking into 
   account route dependencies between a working path and its recovery 
   path. 
    
      (2) A recovery scheme SHOULD allow efficient use of recovery LSP 
   bandwidth using such measures as route optimization, taking into 
   account route dependencies between a working path and its recovery 
   path. 
    
      (3) A recovery scheme SHOULD allow sharing of recovery bandwidth 
   among multiple recovery paths, when possible, to enable efficient use 
   of recovery bandwidth. 
    
   o  Requirements on recovery actions 
    
      (4) A recovery scheme SHOULD allow suppression of fault 
   notification messages, so that spurious fault notification messages 
   and recovery action messages are suppressed and are not broadcast 
   within the network, ensuring scalability of the fault recovery 
   mechanism. 
    
      (5) A recovery scheme SHOULD ensure reliable fault recovery
   actions, provided that the control plane remains connected.
 
 
      (6) A recovery scheme SHOULD allow fallback operations for its
   recovery actions.  For example, when the system encounters a fault
   class (e.g., multiple simultaneous failures) that was not
   anticipated, the system should execute a best-effort recovery, such
   that as many working paths as possible are restored under the
   circumstances.
    
      (7) A recovery scheme SHOULD allow the network operator to choose 
   whether or not the reversion actions are to be performed. 
 
      (8) A recovery scheme SHOULD support recovery within bounded time 
   constraints and MAY be compliant with generally used recovery times 
   like 50ms for SONET/SDH protection. 
    
      (9) A recovery scheme SHOULD allow testing and verification of the 
   availability of the recovery path before its actual use.  This 
   testing may occur when the recovery path is provisioned, or after it 
   is provisioned but before actual recovery action occurs and the path 
   starts being used. 
    
      (10) A recovery scheme SHOULD guarantee that recovery actions 
   correctly deliver traffic from working paths to the respective 
   recovery paths, such that the recovery actions do not result in any 
   unintended connections or unintended diversion of traffic. 
 
   o  Requirements on recovery schemes 
    
      (11) A recovery scheme SHOULD support and be compliant with 
   generally used protection schemes such as 1+1, 1:1, 1:N, and M:N. 
    
      (12) A recovery scheme SHOULD support recovery of failed LSPs even 
   if the LSPs have different endpoints. 
    
      (13) A recovery scheme SHOULD support priority-based recovery of 
   failed LSPs. 
    
   o  Requirements on recovery priority of service classes 
    
      (14) A recovery scheme SHOULD allow recovery of service classes
   according to their recovery priority, which is a continuous spectrum
   from the lowest priority (best effort) to the highest priority
   (guaranteed), determined by service class usage and agreements with
   the customers.
    
      (15) A recovery scheme SHOULD allow support of service classes
   with different recovery time guarantees.  For example, a service
   class carrying voice calls requires a recovery time of less than
   50ms to avoid loss of connections, whereas a service class carrying
   private lines requires a recovery time of less than 2 sec.
    
   o  Requirements on recovery granularity 
    
      (16) A recovery scheme SHOULD allow aggregated recovery actions, 
   ensuring scalability. 
    
   o  Requirements on failure notification delivery 
    
      (17) A recovery scheme SHOULD be equipped with a failure 
   notification mechanism that guarantees prompt and reliable delivery 
   of notification of faults in the data plane to a deciding entity that 
   is in charge of recovering the fault. 
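
   Requirements (6), (13), and (14) can be illustrated together with a
   sketch of priority-ordered, best-effort recovery.  It is one
   possible policy, not a mandated algorithm.  The simplifications are
   assumptions: recovery capacity is modeled as a single pool of
   units, a lower number means a higher recovery priority, and all
   names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FailedLsp:
    lsp_id: str
    priority: int   # assumption: lower number = higher recovery priority
    bandwidth: int  # units of recovery bandwidth needed


def plan_recovery(failed: List[FailedLsp],
                  spare_units: int) -> Tuple[List[str], List[str]]:
    """Return (recovered, unrecovered) LSP identifiers.

    LSPs are served in priority order.  An LSP that does not fit is
    skipped so that lower-priority LSPs can still use the remaining
    capacity (best-effort fallback).
    """
    recovered: List[str] = []
    unrecovered: List[str] = []
    for lsp in sorted(failed, key=lambda l: (l.priority, l.lsp_id)):
        if lsp.bandwidth <= spare_units:
            spare_units -= lsp.bandwidth
            recovered.append(lsp.lsp_id)
        else:
            unrecovered.append(lsp.lsp_id)
    return recovered, unrecovered


failed = [
    FailedLsp("lsp-a", priority=2, bandwidth=4),
    FailedLsp("lsp-b", priority=0, bandwidth=3),
    FailedLsp("lsp-c", priority=1, bandwidth=6),
    FailedLsp("lsp-d", priority=3, bandwidth=1),
]
print(plan_recovery(failed, spare_units=8))
```

   With eight spare units, lsp-b is recovered first, lsp-c does not
   fit in the remaining five units, and lsp-a and lsp-d are then
   recovered.  An operator policy could instead stop at the first LSP
   that does not fit; the choice between the two is a policy decision
   that this document leaves open.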
 
 
4. Security Considerations 
    
   This draft does not introduce any new security issues. 
    
    
5. Conclusions 
    
   This draft describes requirements for control plane-based recovery
   from data plane failures in pre-OTN networks.  While there are
   currently several Internet Drafts in the Sub-IP Area related to
   service recovery in GMPLS networks, the list of requirements for
   control plane-based recovery has not been specifically detailed in
   any one document.  We identify the most important requirements as
   meeting potentially strict timing bounds, enabling flexible recovery
   schemes, and using resources efficiently.  Seventeen requirements
   are listed in Section 3.4.
 
    
References
                     
   [1]  Bradner, S., "The Internet Standards Process -- Revision 3", BCP 
        9, RFC 2026, October 1996. 
    
   [2]  Bradner, S., "Key words for use in RFCs to Indicate Requirement 
        Levels", BCP 14, RFC 2119, March 1997. 
    
   [3]  Mannie, E. (Ed.), "Generalized Multi-Protocol Label Switching 
        (GMPLS) Architecture", Internet Draft, work in progress, draft-
        ietf-ccamp-gmpls-architecture-03.txt, August 2002. 
    

   [4]  Mannie, E. and D. Papadimitriou (Eds.), "Recovery (Protection 
        and Restoration) Terminology for GMPLS", Internet Draft, work in 
        progress, draft-ietf-ccamp-gmpls-recovery-terminology-00.txt, 
        June 2002. 
    
   [5]  Lang, J.P. and B. Rajagopalan (Eds.), "Generalized MPLS Recovery 
        Functional Specification", Internet Draft, work in progress, 
        draft-bala-gmpls-recovery-functional-00.txt, August 2002. 
    
   [6]  Papadimitriou, D. and E. Mannie (Eds.), "Analysis of Generalized 
        MPLS-based Recovery Mechanisms (including Protection and 
        Restoration)", Internet Draft, work in progress, draft-
        papadimitriou-ccamp-gmpls-recovery-analysis-02.txt, August 2002. 
    
   [7]  Lai, W.S., and D. McDysan (Eds.), "Network Hierarchy and 
        Multilayer Survivability", Internet Draft, work in progress, 
        draft-ietf-tewg-restore-hierarchy-01.txt, July 2002. 
    
   [8]  Owens, K., et al., "Network Survivability Considerations for 
        Traffic Engineered IP Networks", Internet Draft, work in 
        progress, draft-owens-te-network-survivability-03.txt, May 2002. 
    
   [9]  Lang, J. (Ed.), "Link Management Protocol (LMP)", Internet
        Draft, work in progress, draft-ietf-ccamp-lmp-06.txt, September
        2002.

   [10] Berger, L. (Ed.), "Generalized MPLS Signaling - RSVP-TE
        Extensions", Internet Draft, work in progress, draft-ietf-mpls-
        generalized-rsvp-te-09.txt, September 2002.
    
   [11] ITU-T Draft Recommendation G.gps, "Generic Protection 
        Switching", work in progress, April 2002. 
 
    
Acknowledgments 
    
   The following individuals provided valuable input to this draft: 
   Richard Rabbat, Ching-Fong Su and Takafumi Chujo of Fujitsu Labs of 
   America, Inc., Norihiko Shinomiya and Akira Chugo of Fujitsu 
   Laboratories, Ltd. 
 
 
Editors' Addresses 
    
   Peter Czezowski                  Toshio Soumiya 
   Fujitsu Labs of America, Inc.    Fujitsu Laboratories Ltd. 
   595 Lawrence Expressway          1-1, Kamikodanaka 4-Chome 
   Sunnyvale, CA 94085              Nakahara-ku, Kawasaki 
   United States of America         211-8588, Japan 
   Phone: +1-408-530-4516           Phone: +81-44-754-2765 
   Email: peterc@fla.fujitsu.com    Email: soumiya.toshio@jp.fujitsu.com 
 
 
Contributing Authors 
    
   Kohei Shiomoto                          
   NTT Network Innovation Laboratories                 
   Midori-machi 3-9-11, Musashino-shi 
   Tokyo, Japan 180-8585 
   Phone: +81-422-59-4402 
   Email: Shiomoto.Kohei@lab.ntt.co.jp 
    
   Shoichiro Seno 
   Mitsubishi Electric Corporation 
   5-1-1 Ofuna, Kamakura 
   Kanagawa, Japan 247-8501 
   Phone: +81-467-41-2430 
   Email: senos@isl.melco.co.jp 

