ABSTRACT
The rapid expansion of global cloud wide-area networks (WANs) has made it challenging for commercial optimization engines to solve network traffic engineering (TE) problems efficiently at scale. Existing acceleration strategies decompose TE optimization into concurrent subproblems but realize limited parallelism due to an inherent tradeoff between run time and allocation performance.
We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control. First, Teal designs a flow-centric graph neural network (GNN) to capture WAN connectivity and network flows, learning flow features as inputs to downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand while optimizing a central TE objective. Finally, Teal fine-tunes allocations with ADMM (Alternating Direction Method of Multipliers), a highly parallelizable optimization algorithm for reducing constraint violations such as overutilized links.
We evaluate Teal using traffic matrices from Microsoft's WAN. On a large WAN topology with >1,700 nodes, Teal generates near-optimal flow allocations while running several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies 6–32% more traffic demand and yields 197–625× speedups.
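To make the terms "constraint violation" and "satisfied demand" concrete, the following is a minimal sketch of a path-based TE allocation, assuming each demand is split across a small set of precomputed paths. It is not Teal's implementation: the function names, the toy topology, and the proportional-scaling repair step are all hypothetical, and the repair step is a deliberately naive stand-in for ADMM that only shows what removing overutilized links means.

```python
import numpy as np

def link_loads(demands, split_ratios, path_links, num_links):
    """Sum the traffic that every path places on each link it traverses."""
    loads = np.zeros(num_links)
    for d, (volume, ratios) in enumerate(zip(demands, split_ratios)):
        for ratio, links in zip(ratios, path_links[d]):
            for link in links:
                loads[link] += volume * ratio
    return loads

def scale_to_capacity(demands, split_ratios, path_links, capacities):
    """Naive repair (NOT ADMM): shrink each path by the scaling factor of
    its most overloaded link, which guarantees no link exceeds capacity."""
    capacities = np.asarray(capacities, dtype=float)
    loads = link_loads(demands, split_ratios, path_links, len(capacities))
    # Per-link factor in (0, 1]; 1 means the link is already within capacity.
    factors = np.minimum(1.0, capacities / np.maximum(loads, 1e-12))
    return [
        [ratio * min(factors[link] for link in links)
         for ratio, links in zip(ratios, path_links[d])]
        for d, ratios in enumerate(split_ratios)
    ]

def satisfied_fraction(demands, split_ratios):
    """Fraction of the total demand volume that the allocation routes."""
    routed = sum(v * sum(r) for v, r in zip(demands, split_ratios))
    return routed / sum(demands)

# Hypothetical toy instance: 3 links, 2 demands.
capacities = [10.0, 10.0, 5.0]
demands = [12.0, 8.0]
# path_links[d][p] lists the link indices used by path p of demand d.
path_links = [[[0], [1, 2]], [[2]]]
# An initial allocation, e.g. as a learned policy might emit it.
split_ratios = [[0.5, 0.5], [1.0]]

before = link_loads(demands, split_ratios, path_links, len(capacities))
repaired = scale_to_capacity(demands, split_ratios, path_links, capacities)
after = link_loads(demands, repaired, path_links, len(capacities))
print("loads before:", before)
print("loads after: ", after)
print("satisfied:   ", satisfied_fraction(demands, repaired))
```

Uniform scaling is always feasible but can discard more traffic than necessary, because it never shifts flow onto paths with spare capacity. That gap is what an optimization-based fine-tuning step aims to close.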
Index Terms
- Teal: Learning-Accelerated Optimization of WAN Traffic Engineering