research-article

Teal: Learning-Accelerated Optimization of WAN Traffic Engineering

Authors:
Zhiying Xu, Harvard University, Cambridge, United States of America (https://orcid.org/0000-0002-5326-6908)
Francis Y. Yan, Microsoft Research, Redmond, United States of America (https://orcid.org/0000-0002-2123-4258)
Rachee Singh, Cornell University, Ithaca, United States of America (https://orcid.org/0000-0002-8118-3026)
Justin T. Chiu, Cornell University, NYC, USA (https://orcid.org/0009-0000-9304-216X)
Alexander M. Rush, Cornell University, NYC, USA (https://orcid.org/0000-0002-9900-1606)
Minlan Yu, Harvard University, Cambridge, United States of America (https://orcid.org/0000-0002-2381-0212)

ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference, September 2023, Pages 378–393
https://doi.org/10.1145/3603269.3604857

Published: 01 September 2023

ABSTRACT

The rapid expansion of global cloud wide-area networks (WANs) has posed a challenge for commercial optimization engines to efficiently solve network traffic engineering (TE) problems at scale. Existing acceleration strategies decompose TE optimization into concurrent subproblems but realize limited parallelism due to an inherent tradeoff between run time and allocation performance.

We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control. First, Teal designs a flow-centric graph neural network (GNN) to capture WAN connectivity and network flows, learning flow features as inputs to downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand while optimizing a central TE objective. Finally, Teal fine-tunes allocations with ADMM (Alternating Direction Method of Multipliers), a highly parallelizable optimization algorithm for reducing constraint violations such as overutilized links.

We evaluate Teal using traffic matrices from Microsoft's WAN. On a large WAN topology with >1,700 nodes, Teal generates near-optimal flow allocations while running several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies 6–32% more traffic demand and yields 197–625× speedups.
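
The ADMM fine-tuning step mentioned in the abstract can be illustrated on a toy problem. The sketch below is not Teal's formulation, which this page does not give. It assumes a single shared link, a least-squares objective that stays close to a proposed allocation, and hypothetical names (`admm_fit_to_capacity`, `proposed`, `demands`, `capacity`, `rho`). It shows only the general ADMM pattern: a per-flow update that can run in parallel, a projection that removes the capacity violation, and a dual update.

```python
def admm_fit_to_capacity(proposed, demands, capacity, rho=1.0, iters=100):
    """Toy scaled-form ADMM (illustrative assumption, not Teal's method).

    Finds the allocation x closest to `proposed` in the least-squares sense,
    subject to 0 <= x[i] <= demands[i] and sum(x) <= capacity (one link).
    """
    n = len(proposed)
    z = list(proposed)  # copy of x that must respect the link capacity
    u = [0.0] * n       # scaled dual variables for the constraint x == z
    x = list(proposed)
    for _ in range(iters):
        # x-update: each flow is handled independently, so this step
        # is trivially parallel. Closed-form minimizer, clipped to the box.
        x = [
            min(max((proposed[i] + rho * (z[i] - u[i])) / (1.0 + rho), 0.0),
                demands[i])
            for i in range(n)
        ]
        # z-update: Euclidean projection of x + u onto {sum(z) <= capacity}.
        v = [x[i] + u[i] for i in range(n)]
        excess = sum(v) - capacity
        shift = excess / n if excess > 0 else 0.0
        z = [v[i] - shift for i in range(n)]
        # Dual update: accumulate the remaining disagreement between x and z.
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return x


# Two flows propose 8 + 6 = 14 units on a link with capacity 10.
print(admm_fit_to_capacity([8.0, 6.0], [8.0, 6.0], 10.0))  # [6.0, 4.0]
```

In this toy case the overload of 4 units is split evenly across the two flows, which is the least-squares projection onto the capacity constraint. A real TE instance would couple many links and paths rather than one.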

Index Terms

1. Computing methodologies
  1. Machine learning
2. Networks
  1. Network algorithms
    1. Control path algorithms
      1. Traffic engineering algorithms

Published in
ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference
September 2023
1217 pages
ISBN: 9798400702365
DOI: 10.1145/3603269
Chairs: Henning Schulzrinne, Vishal Misra
Program Chairs: Eddie Kohler, David Maltz
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Artifacts Evaluated & Functional / v1.1
- Artifacts Available / v1.1
Author Tags
traffic engineering
wide-area networks
network optimization
machine learning
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 554 of 3,547 submissions, 16%