DOI: 10.1145/3603269.3604867
research-article
Open Access

Improving Network Availability with Protective ReRoute

Published: 01 September 2023

ABSTRACT

We present PRR (Protective ReRoute), a transport technique for shortening user-visible outages that complements routing repair. It can be added to any transport to provide benefits in multipath networks. PRR responds to flow connectivity failure signals, e.g., retransmission timeouts, by changing the FlowLabel on packets of the flow, which causes switches and hosts to choose a different network path that may avoid the outage. To enable it, we shifted our IPv6 network architecture to use the FlowLabel, so that hosts can change the paths of their flows without application involvement. PRR is deployed fleetwide at Google for TCP and Pony Express, where it has been protecting all production traffic for several years. It is also available to our Cloud customers. We find it highly effective for real outages. In a measurement study on our network backbones, adding PRR reduced the cumulative region-pair outage time for RPC traffic by 63–84%. This is the equivalent of adding 0.4–0.8 "nines" of availability.
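The repathing idea in the abstract can be illustrated with a small sketch. This is a toy model under stated assumptions, not the deployed implementation: the hash function, the `Flow` class, the `send_until_delivered` helper, and the random choice of a new label are all hypothetical, since the abstract does not specify them. The last function shows one way to read the "nines" figure: if outage time is cut by a fraction r, unavailability is multiplied by (1 - r), which adds -log10(1 - r) nines.

```python
import hashlib
import math
import random

FLOW_LABEL_BITS = 20  # width of the IPv6 FlowLabel field (RFC 6437)


def ecmp_path(src, dst, sport, dport, flow_label, num_paths):
    """Toy multipath selection: hash the flow key plus FlowLabel to a path index.

    Assumption: real switches use their own hash functions; SHA-256 is only
    a deterministic stand-in here.
    """
    key = f"{src}|{dst}|{sport}|{dport}|{flow_label}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_paths


class Flow:
    """Hypothetical transport flow that carries a FlowLabel."""

    def __init__(self, src, dst, sport, dport, rng):
        self.addr = (src, dst, sport, dport)
        self.rng = rng
        self.flow_label = rng.getrandbits(FLOW_LABEL_BITS)
        self.repaths = 0

    def path(self, num_paths):
        return ecmp_path(*self.addr, self.flow_label, num_paths)

    def on_retransmission_timeout(self):
        """Connectivity-failure signal: switch to a different FlowLabel."""
        old = self.flow_label
        while self.flow_label == old:
            self.flow_label = self.rng.getrandbits(FLOW_LABEL_BITS)
        self.repaths += 1


def send_until_delivered(flow, num_paths, failed_paths, max_timeouts=64):
    """Repath on each simulated timeout until the flow lands on a working path."""
    for _ in range(max_timeouts):
        if flow.path(num_paths) not in failed_paths:
            return True
        flow.on_retransmission_timeout()
    return False


def added_nines(outage_reduction):
    """Nines gained when outage time shrinks by the given fraction (0..1)."""
    return -math.log10(1.0 - outage_reduction)
```

A new label does not guarantee a new path, because two labels can hash to the same path index, which matches the abstract's wording that the new path "may avoid the outage". With the reductions reported in the abstract, `added_nines(0.63)` and `added_nines(0.84)` come to roughly 0.43 and 0.80, consistent with the stated 0.4–0.8 range.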

Published in

ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference
September 2023
1217 pages
ISBN: 9798400702365
DOI: 10.1145/3603269

Copyright © 2023 Owner/Author(s)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 554 of 3,547 submissions, 16%