DOI: 10.1145/3603269.3604857

Teal: Learning-Accelerated Optimization of WAN Traffic Engineering

Published: 01 September 2023

ABSTRACT

The rapid expansion of global cloud wide-area networks (WANs) has posed a challenge for commercial optimization engines to efficiently solve network traffic engineering (TE) problems at scale. Existing acceleration strategies decompose TE optimization into concurrent subproblems but realize limited parallelism due to an inherent tradeoff between run time and allocation performance.

We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control. First, Teal designs a flow-centric graph neural network (GNN) to capture WAN connectivity and network flows, learning flow features as inputs to downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand while optimizing a central TE objective. Finally, Teal fine-tunes allocations with ADMM (Alternating Direction Method of Multipliers), a highly parallelizable optimization algorithm for reducing constraint violations such as overutilized links.
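To make "constraint violations such as overutilized links" concrete, the following is a minimal sketch, not Teal's implementation, of how a path-based allocation can be checked against link capacities. The toy topology, demands, candidate paths, split ratios, and the helper names `link_loads` and `overutilized` are all hypothetical, chosen only for illustration. The ADMM fine-tuning step itself is not shown.

```python
from collections import defaultdict

def link_loads(demands, paths, splits):
    """Sum the traffic each directed link carries under a path-based allocation.

    demands: {demand_id: volume}
    paths:   {demand_id: [path, ...]}, each path a list of node names
    splits:  {demand_id: [ratio, ...]}, one ratio per candidate path
    """
    loads = defaultdict(float)
    for demand, volume in demands.items():
        for path, ratio in zip(paths[demand], splits[demand]):
            # Consecutive node pairs along the path are the links it uses.
            for link in zip(path, path[1:]):
                loads[link] += volume * ratio
    return dict(loads)

def overutilized(loads, capacity):
    """Return {link: excess} for links whose load exceeds capacity."""
    return {link: load - capacity[link]
            for link, load in loads.items()
            if load > capacity[link]}

# Hypothetical three-node example (all values are assumptions).
capacity = {("A", "B"): 10.0, ("B", "C"): 10.0, ("A", "C"): 10.0}
demands = {("A", "C"): 12.0, ("A", "B"): 4.0}
paths = {
    ("A", "C"): [["A", "C"], ["A", "B", "C"]],
    ("A", "B"): [["A", "B"]],
}
splits = {("A", "C"): [0.25, 0.75], ("A", "B"): [1.0]}

loads = link_loads(demands, paths, splits)
violations = overutilized(loads, capacity)
print(violations)  # {('A', 'B'): 3.0}
```

In this toy case, each demand's split is reasonable when considered alone, yet link A→B ends up 3 units over capacity once both demands are combined. This is the kind of violation that a fine-tuning step applied after independent per-demand allocation would aim to reduce.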

We evaluate Teal using traffic matrices from Microsoft's WAN. On a large WAN topology with >1,700 nodes, Teal generates near-optimal flow allocations while running several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies 6–32% more traffic demand and yields 197–625× speedups.


      • Published in

        ACM SIGCOMM '23: Proceedings of the ACM SIGCOMM 2023 Conference
        September 2023
        1217 pages
        ISBN: 9798400702365
        DOI: 10.1145/3603269

        Copyright © 2023 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate: 554 of 3,547 submissions, 16%
