ABSTRACT
Existing SDNs rely on a collection of intricate, mutually dependent mechanisms to implement a logically centralized control plane. These cyclical dependencies and the lack of a clean separation of concerns can reduce the availability of SDNs, such that a handful of link failures could render entire portions of an SDN non-functional. This paper shows why and when this could happen, and makes the case for taking a fresh look at architecting SDNs for robustness to faults from the ground up. Our approach carefully synthesizes several key distributed systems ideas, in particular reliable flooding, global snapshots, and replicated controllers. We argue informally that it can offer high availability in the face of a variety of network failures, but much work remains to make our approach scalable and general. Thus, our paper represents a starting point for a broader discussion on approaches for building highly available SDNs.
Index Terms
- A Highly Available Software Defined Fabric