Manas: multi-agent neural architecture search

Published in: Machine Learning

Abstract

The Neural Architecture Search (NAS) problem is typically formulated as a graph search problem where the goal is to learn the optimal operations over edges in order to maximize a graph-level global objective. Due to the large architecture parameter space, efficiency is a key bottleneck preventing the practical use of NAS. In this work, we address this issue by framing NAS as a multi-agent problem where agents control a subset of the network and coordinate to reach optimal architectures. We provide two distinct lightweight implementations with reduced memory requirements (1/8th of the state-of-the-art) and performance above that of much more computationally expensive methods. Theoretically, we demonstrate vanishing regrets of the form \({\mathcal {O}}(\sqrt{T})\), with T being the total number of rounds. Finally, we perform experiments on CIFAR-10 and ImageNet and, since random search and random sampling are effective (yet often ignored) baselines, we conduct additional experiments on 3 alternative datasets, with complexity constraints and 2 network configurations, achieving competitive results against the baselines and other methods.

Data availability

All data used is publicly available.

Code availability

Code will be publicly available.

Notes

  1. Please note that the observed reward is actually a random variable.

  2. We assume that an architecture is feasible if and only if each agent chooses exactly one action.

References

  • Abbasi-Yadkori, Y., Bartlett, P., Gabillon, V., Malek, A., & Valko, M. (2018). Best of both worlds: Stochastic & adversarial best-arm identification. In Conference on learning theory (COLT).

  • Abdelfattah, M. S., Mehrotra, A., Dudziak, Ł., & Lane, N. D. (2021). Zero-Cost Proxies for Lightweight NAS. In International conference on learning representations (ICLR).

  • Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1), 48–77.

  • Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th annual foundations of computer science, pp. 322–331. IEEE.

  • Bender, G., Liu, H., Chen, B., Chu, G., Cheng, S., Kindermans, P. J., & Le, Q. V. (2020). Can weight sharing outperform random architecture search? an investigation with tunas. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14323–14332.

  • Bouneffouf, D., Laroche, R., Urvoy, T., Féraud, R., & Allesiardo, R. (2014). Contextual bandit for active learning: Active Thompson sampling. In Neural information processing: 21st international conference, ICONIP, pp. 405–412. Springer.

  • Bouneffouf, D., Rish, I., & Aggarwal, C. (2020). Survey on applications of multi-armed and contextual bandits. In 2020 IEEE congress on evolutionary computation (CEC), pp. 1–8. IEEE.

  • Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1), 1–122.

  • Cai, H., Zhu, L., & Han, S. (2019). ProxylessNAS: Direct neural architecture search on target task and hardware. In International conference on learning representations (ICLR).

  • Cesa-Bianchi, N., & Lugosi, G. (2012). Combinatorial bandits. Journal of Computer and System Sciences, 78(5), 1404–1422.

  • Chen, W., Gong, X., Wu, J., Wei, Y., Shi, H., Yan, Z., Yang, Y., & Wang, Z. (2021). Understanding and accelerating neural architecture search with training-free and theory-grounded metrics. arXiv preprint arXiv:2108.11939.

  • Chen, X., & Hsieh, C. (2020). Stabilizing differentiable architecture search via perturbation-based regularization. In Proceedings of the 37th international conference on machine learning, ICML 2020.

  • Chu, X., Wang, X., Zhang, B., Lu, S., Wei, X., & Yan, J. (2021). DARTS-: robustly stepping out of performance collapse without indicators. In 9th international conference on learning representations, ICLR.

  • Colby, M. K., Kharaghani, S., HolmesParker, C., & Tumer, K. (2015). Counterfactual exploration for improving multiagent learning. In Autonomous Agents and Multiagent Systems (AAMAS 2015), pp. 171–179. International Foundation for Autonomous Agents and Multiagent Systems.

  • Conneau, A., Schwenk, H., Barrault, L., & Lecun, Y. (2017). Very deep convolutional networks for text classification. In European Chapter of the Association for Computational Linguistics (EACL), Volume 1, Long Papers, pp. 1107–1116.

  • Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2018). Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501.

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Computer vision and pattern recognition (CVPR), pp. 248–255.

  • Dong, X., & Yang, Y. (2019). Searching for a robust neural architecture in four GPU hours. In IEEE Conference on computer vision and pattern recognition, CVPR. Computer Vision Foundation / IEEE.

  • Fei-Fei, L., Fergus, R., & Perona, P. (2007). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1), 59–70.

  • Freedman, D. A. (1975). On tail probabilities for martingales. The Annals of Probability pp. 100–118.

  • Gao, Y., Zhang, P., Yang, H., Zhou, C., Tian, Z., Hu, Y., Li, Z., & Zhou, J. (2022). Graphnas++: Distributed architecture search for graph neural networks. IEEE Transactions on Knowledge and Data Engineering. https://doi.org/10.1109/TKDE.2022.3178153

  • Garg, A., Saha, A. K., & Dutta, D. (2020). Direct federated neural architecture search. arXiv preprint arXiv:2010.06223.

  • Han, D., Kim, J., & Kim, J. (2017). Deep pyramidal residual networks. In Computer vision and pattern recognition (CVPR), pp. 5927–5935.

  • He, C., Annavaram, M., & Avestimehr, S. (2020). Fednas: Federated deep learning via neural architecture search. In CVPR 2020 workshop on neural architecture search and beyond for representation learning.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Computer vision and pattern recognition (CVPR), pp. 770–778.

  • Hoang, M., & Kingsford, C. (2021). Personalized neural architecture search for federated learning. In 1st NeurIPS workshop on new frontiers in federated learning (NFFL 2021).

  • Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Computer vision and pattern recognition (CVPR), pp. 4700–4708.

  • Ko, B. (2019). Imagenet classification leaderboard. https://kobiso.github.io/Computer-Vision-Leaderboard/imagenet.

  • Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

  • Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670.

  • Li, L., & Talwalkar, A. (2019). Random search and reproducibility for neural architecture search. arXiv preprint arXiv:1902.07638.

  • Li, L. J., & Fei-Fei, L. (2007). What, where and who? classifying events by scene and object recognition. In International conference on computer vision (ICCV), pp. 1–8.

  • Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L. J., Fei-Fei, L., Yuille, A., Huang, J., & Murphy, K. (2018). Progressive neural architecture search. In European conference on computer vision (ECCV), pp. 19–34.

  • Liu, H., Simonyan, K., Vinyals, O., Fernando, C., & Kavukcuoglu, K. (2018). Hierarchical representations for efficient architecture search. In International conference on learning representations (ICLR).

  • Liu, H., Simonyan, K., & Yang, Y. (2019). DARTS: Differentiable architecture search. In International conference on learning representations (ICLR).

  • Liu, X., Zhao, J., Li, J., Cao, B., & Lv, Z. (2022). Federated neural architecture search for medical data security. IEEE Transactions on Industrial Informatics, 18(8), 5628–5636.

  • Lopes, V., & Alexandre, L. A. (2022). Towards less constrained macro-neural architecture search. arXiv preprint arXiv:2203.05508.

  • Lopes, V., Alirezazadeh, S., & Alexandre, L. A. (2021). EPE-NAS: Efficient performance estimation without training for neural architecture search. In International conference on artificial neural networks.

  • Lopes, V., Santos, M., Degardin, B., & Alexandre, L. A. (2022). Efficient guided evolution for neural architecture search. In Proceedings of the genetic and evolutionary computation conference.

  • Mellor, J., Turner, J., Storkey, A., & Crowley, E. J. (2021). Neural architecture search without training. In International conference on machine learning.

  • Merity, S., Keskar, N. S., & Socher, R. (2018). Regularizing and optimizing LSTM language models. In International conference on learning representations (ICLR).

  • Ning, X., Tang, C., Li, W., Zhou, Z., Liang, S., Yang, H., & Wang, Y. (2021). Evaluating efficient performance estimators of neural architectures. Advances in Neural Information Processing Systems, 34, 12265–12277.

  • Pham, H., Guan, M., Zoph, B., Le, Q., & Dean, J. (2018). Efficient neural architecture search via parameter sharing. In International conference on machine learning (ICML), pp. 4092–4101.

  • Quattoni, A., & Torralba, A. (2009). Recognizing indoor scenes. In Computer vision and pattern recognition (CVPR), pp. 413–420.

  • Rashid, T., Samvelyan, M., Witt, C. S., Farquhar, G., Foerster, J., & Whiteson, S. (2018). QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International conference on machine learning (ICML), pp. 4292–4301.

  • Real, E., Aggarwal, A., Huang, Y., & Le, Q. V. (2018). Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548.

  • Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y. L., Tan, J., Le, Q. V., & Kurakin, A. (2017). Large-scale evolution of image classifiers. In International conference on machine learning (ICML), pp. 2902–2911.

  • Ru, R., Esperança, P. M., & Carlucci, F. M. (2020). Neural architecture generator optimization. Advances in Neural Information Processing Systems, 33, 12057–12069.

  • Shalev-Shwartz, S., Shammah, S., & Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving.

  • Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI conference on artificial intelligence.

  • Wan, X., Ru, B., Esperança, P. M., & Li, Z. (2022). On redundancy and diversity in cell-based neural architecture search. In International conference on learning representations (ICLR).

  • Wang, B., Xue, B., & Zhang, M. (2021). Surrogate-assisted particle swarm optimization for evolving variable-length transferable blocks for image classification. IEEE Transactions on Neural Networks and Learning Systems, 33, 3727–3740.

  • Wang, L., Xie, S., Li, T., Fonseca, R., & Tian, Y. (2021). Sample-efficient neural architecture search by learning actions for monte Carlo tree search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9), 5503–5515.

  • Wang, Y., Xu, Y., & Tao, D. (2020). Dc-nas: Divide-and-conquer neural architecture search. arXiv preprint arXiv:2005.14456.

  • Wei, C., Niu, C., Tang, Y., Wang, Y., Hu, H., & Liang, J. (2022). Npenas: Neural predictor guided evolution for neural architecture search. IEEE Transactions on Neural Networks and Learning Systems.

  • White, C., Neiswanger, W., & Savani, Y. (2021). Bananas: Bayesian optimization with neural architectures for neural architecture search. In Proceedings of the AAAI conference on artificial intelligence.

  • White, C., Zela, A., Ru, R., Liu, Y., & Hutter, F. (2021). How powerful are performance predictors in neural architecture search? Advances in Neural Information Processing Systems, 34, 28454–28469.

  • Xie, S., Zheng, H., Liu, C., & Lin, L. (2019). SNAS: Stochastic neural architecture search. In International conference on learning representations (ICLR).

  • Xu, D., Mukherjee, S., Liu, X., Dey, D., Wang, W., Zhang, X., Awadallah, A. H., & Gao, J. (2022). Few-shot task-agnostic neural architecture search for distilling large language models. In Advances in Neural Information Processing Systems.

  • Yang, A., Esperança, P. M., & Carlucci, F. M. (2020). NAS evaluation is frustratingly hard. In International conference on learning representations (ICLR).

  • Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., & Wang, J. (2018). Mean field multi-agent reinforcement learning. In International conference on machine learning (ICML).

  • Yao, Q., Xu, J., Tu, W., & Zhu, Z. (2020). Efficient neural architecture search via proximal iterations. In The Thirty-Fourth AAAI conference on artificial intelligence, AAAI 2020, The Tenth AAAI symposium on educational advances in artificial intelligence, EAAI. AAAI Press.

  • Yu, K., Sciuto, C., Jaggi, M., Musat, C., & Salzmann, M. (2019). Evaluating the search phase of neural architecture search. In International conference on learning representations (ICLR).

  • Yuan, J., Xu, M., Zhao, Y., Bian, K., Huang, G., Liu, X., & Wang, S. (2020). Federated neural architecture search. arXiv preprint arXiv:2002.06352.

  • Zela, A., Elsken, T., Saikia, T., Marrakchi, Y., Brox, T., & Hutter, F. (2018). Understanding and robustifying differentiable architecture search. In International conference on learning representations (ICLR).

  • Zhang, X., Zhou, X., Lin, M., & Sun, J. (2018). ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Conference on computer vision and pattern recognition (CVPR), pp. 6848–6856.

  • Zhao, Y., Wang, L., Yang, K., Zhang, T., Guo, T., & Tian, Y. (2021). Multi-objective optimization by learning space partitions. arXiv preprint arXiv:2110.03173.

  • Zhu, H., & Jin, Y. (2021). Real-time federated evolutionary neural architecture search. IEEE Transactions on Evolutionary Computation, 26(2), 364–378.

  • Zhu, H., Zhang, H., & Jin, Y. (2021). From federated learning to federated neural architecture search: A survey. Complex & Intelligent Systems, 7, 639–657.

  • Zoph, B., & Le, Q. (2017). Neural architecture search with reinforcement learning. In International conference on learning representations (ICLR).

  • Zoph, B., Vasudevan, V., Shlens, J., & Le, Q. V. (2018). Learning transferable architectures for scalable image recognition. In Computer vision and pattern recognition (CVPR), pp. 8697–8710.

Funding

Financial support to the authors was received from “FCT - Fundação para a Ciência e Tecnologia”, through the research grant “2020.04588.BD” [Vasco Lopes]; and from Huawei Technologies R&D (UK) Ltd [all other authors].

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization: VL, FMC, PME, MS, AY, JW; Methodology: VL, FMC, PME, MS, AY, VG, HX, ZC; Formal analysis and investigation: VL, FMC, PME, MS, AY, VG; Writing - original draft preparation: FMC, PME, MS, AY, VG; Writing - review and editing: VL, FMC, PME; Supervision: JW.

Corresponding author

Correspondence to Fabio Maria Carlucci.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not required.

Consent to participate

Not required.

Consent for publication

Not required.

Additional information

Editor: James Cussens.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Datasets

CIFAR-10. The CIFAR-10 dataset (Krizhevsky, 2009) has 10 classes and consists of 50,000 training images and 10,000 test images of size \(32{\times }32\). We use standard data pre-processing and augmentation techniques, i.e. subtracting the channel mean and dividing the channel standard deviation; centrally padding the training images to \(40{\times }40\) and randomly cropping them back to \(32{\times }32\); and randomly flipping them horizontally.

ImageNet. The ImageNet dataset (Deng et al., 2009) has 1000 classes and consists of 1,281,167 training images and 50,000 test images of different sizes. We use standard data pre-processing and augmentation techniques, i.e. subtracting the channel mean and dividing the channel standard deviation, cropping the training images to random size and aspect ratio, resizing them to \(224{\times }224\), and randomly changing their brightness, contrast, and saturation, while resizing test images to \(256{\times }256\) and cropping them at the center.

Sport-8. This is an action recognition dataset containing 8 sport event categories and a total of 1579 images (Li and Fei-Fei, 2007). The tiny size of this dataset stresses the generalization capabilities of any NAS method applied to it.

Caltech-101. This dataset contains 101 categories, each with 40 to 800 images of size roughly \(300{\times }200\) (Fei-Fei et al., 2007).

MIT-67. This is a dataset of 67 classes representing different indoor scenes and consists of 15,620 images of different sizes (Quattoni and Torralba, 2009).

In the experiments on Sport-8, Caltech-101 and MIT-67, we split each dataset into a training set containing \(80\%\) of the data and a test set containing the remaining \(20\%\). For each of them, we use the same data pre-processing techniques as for ImageNet.
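The \(80\%/20\%\) split above can be sketched as follows (a stdlib-only illustration; the actual file handling is omitted, and Sport-8's 1579 images are used only as an example count):

```python
import random

def train_test_split(items, train_frac=0.8, seed=0):
    """Shuffle and split a list of samples into train/test subsets."""
    rng = random.Random(seed)
    shuffled = items[:]            # copy so the input list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Example: Sport-8 has 1579 images in total.
images = list(range(1579))
train, test = train_test_split(images)
```

With a fixed seed the split is reproducible across runs, which matters when comparing search methods on these small datasets.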

B Implementation details

1.1 B.1 Methods

MANAS. Our code is based on a modified variant of Liu et al. (2019). To set the temperature and gamma, we used as starting estimates the values suggested by Bubeck et al. (2012): \(t=\frac{1}{\eta }\) with \(\eta =0.95\frac{\sqrt{\ln (K)}}{nK}\) (K number of actions, n number of architectures seen in the whole training); and \(\gamma = 1.05 \frac{K\ln (K)}{n}\). We then tuned them to increase validation accuracy during the search.
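These starting estimates can be computed directly; a minimal sketch (the values are only initial settings, which the authors then tuned on validation accuracy):

```python
import math

def manas_hyperparams(K, n):
    """Starting estimates for the temperature and gamma suggested by
    Bubeck et al. (2012), as used in Appendix B.1.

    K: number of actions per agent;
    n: number of architectures seen during the whole training."""
    eta = 0.95 * math.sqrt(math.log(K)) / (n * K)
    temperature = 1.0 / eta
    gamma = 1.05 * K * math.log(K) / n
    return temperature, gamma

# Example: 8 candidate operations, 10,000 sampled architectures.
t, g = manas_hyperparams(K=8, n=10_000)
```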

MANAS-LS. For our least-squares solution, we alternate between one epoch of training (in which all \(\beta\) are frozen and the \(\omega\) are updated) and one or more epochs in which we build the Z matrix from Sect. 4 (in which both \(\beta\) and \(\omega\) are frozen). The exact number of iterations we perform in this latter step depends on the size of both the dataset and the searched architecture: our goal is simply to have more rows than columns in Z. We then solve \(\widetilde{\varvec{B}}_{t} = \left( \varvec{Z}\varvec{Z}^{\textsf{T}} \right) ^{\dagger }\varvec{Z}\varvec{L},\) and repeat the whole procedure until the end of training. This method requires no additional meta-parameters.
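The least-squares credit-assignment step can be sketched with NumPy (an illustrative sketch with made-up sizes; here each column of Z encodes one sampled architecture, so the pseudo-inverse solve matches the displayed formula):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): 3 agents, 2 operations each,
# so d = K*N = 6 action coordinates; m = 20 observed architectures.
N_AGENTS, K_OPS, M_ARCHS = 3, 2, 20
d = N_AGENTS * K_OPS

# Each column of Z is the binary encoding of one sampled architecture:
# exactly one operation active per agent block.
cols = []
for _ in range(M_ARCHS):
    col = np.zeros(d)
    for agent in range(N_AGENTS):
        col[K_OPS * agent + rng.integers(K_OPS)] = 1.0
    cols.append(col)
Z = np.stack(cols, axis=1)          # shape (d, m)

beta_true = rng.uniform(size=d)     # hypothetical per-action contributions
L = Z.T @ beta_true                 # observed validation losses, shape (m,)

# Credit assignment via the Moore-Penrose pseudo-inverse:
# B = (Z Z^T)^+ Z L -- the minimum-norm least-squares solution.
B = np.linalg.pinv(Z @ Z.T) @ Z @ L

# B need not equal beta_true (Z Z^T is rank deficient, since each agent
# block of every column sums to one), but it reproduces the observed
# losses exactly: Z^T B == L.
```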

Number of agents. In both MANAS variants, the number of agents is defined by the search space and thus is not tuned. Specifically, for the image datasets, there exists one agent for each pair of nodes, tasked with selecting the optimal operation. As there are 14 pairs in each cell, the total number of agents is \(14 \times C\), with C being the number of cells (8, 14 or 20, depending on the experiment).

1.2 B.2 Computational resources

ImageNet experiments were performed on multi-GPU machines loaded with \(8\times\) Nvidia Tesla V100 16GB GPUs (used in parallel). All other experiments were performed on single-GPU machines loaded with \(1\times\) GeForce GTX 1080 8GB GPU.

C Factorizing the Regret

Let us first formulate the multi-agent combinatorial online learning problem more formally. Recall that at each round, agent \({\mathcal {A}}_i\) samples an action from a fixed discrete collection \(\{\varvec{a}^{({\mathcal {A}}_i)}_{j}\}^{K}_{j=1}\). Therefore, after each agent makes a choice of its action at round t, the resulting network architecture \({\mathcal {Z}}_t\) is described by the joint action profile \(\varvec{\textbf{a}}_t =\left[ \varvec{a}^{({\mathcal {A}}_1),[t]}_{j_1}, \ldots , \varvec{a}^{({\mathcal {A}}_N),[t]}_{j_{N}} \right]\); thus, we will use \({\mathcal {Z}}_t\) and \(\varvec{\textbf{a}}_t\) interchangeably. Due to the discrete nature of the joint action space, the validation loss vector at round t is given by \(\varvec{\mathbf {{\mathcal {L}}}}^{(\textrm{val})}_{t} =\left( {\mathcal {L}}^{(\textrm{val})}_{t}\left( {\mathcal {Z}}^{(1)}_t \right) , \ldots , {\mathcal {L}}^{(\textrm{val})}_{t} \left( {\mathcal {Z}}^{(K^N)}_t\right) \right)\), and for the environment one can write \(\nu = \left( \varvec{\mathbf {{\mathcal {L}}}}^{(\textrm{val})}_{1}, \ldots , \varvec{\mathbf {{\mathcal {L}}}}^{(\textrm{val})}_{T}\right)\). The interconnection between the joint policy \(\varvec{\pi }\) and an environment \(\nu\) proceeds sequentially as follows: at round t, the architecture \({\mathcal {Z}}_t\sim \varvec{\pi }_t(\cdot |{\mathcal {Z}}_1,{\mathcal {L}}^{(\textrm{val})}_{1},\ldots , {\mathcal {Z}}_{t-1}, {\mathcal {L}}^{(\textrm{val})}_{t-1})\) is sampled and the validation loss \({\mathcal {L}}^{(\textrm{val})}_{t} = {\mathcal {L}}^{(\textrm{val})}_{t}({\mathcal {Z}}_t)\) is observed.Footnote 1 As mentioned previously, assuming a linear contribution of each individual action to the validation loss, our goal is to find a policy \(\varvec{\pi }\) that keeps the regret:

$$\begin{aligned} {\mathcal {R}}_{T}(\varvec{\pi }, \nu ) = {\mathbb {E}}\left[ \sum _{t=1}^{T} \varvec{\beta }^{\textsf{T}}_{t}\varvec{Z}_t -\min _{\varvec{Z}\in {\mathcal {F}}} \left[ \sum _{t=1}^{T}\varvec{\beta }^{\textsf{T}}_{t} \varvec{Z}\right] \right] \end{aligned}$$
(8)

small with respect to all possible forms of the environment \(\nu\). We reason here with the cumulative regret; the reasoning applies equally to the simple regret. Here, \(\varvec{\beta }_t\in {\mathbb {R}}^{KN}_{+}\) is the contribution vector of all actions, \(\varvec{Z}_t\) is the binary representation of architecture \({\mathcal {Z}}_t\), and \({\mathcal {F}}\subset [0,1]^{KN}\) is the set of all feasible architectures.Footnote 2 In other words, the quality of the policy is defined with respect to the worst-case regret:

$$\begin{aligned} {\mathcal {R}}^*_{T} = \sup _{\nu }{\mathcal {R}}_{T}(\varvec{\pi }, \nu ) \end{aligned}$$
(9)

Notice that the linear decomposition of the validation loss allows us to rewrite the total regret (8) as a sum of agent-specific regrets \({\mathcal {R}}^{({\mathcal {A}}_i)}_T \left( \varvec{\pi }^{({\mathcal {A}}_i)}, \nu ^{({\mathcal {A}}_i)}\right)\) for \(i=1,\ldots , N\):

$$\begin{aligned} {\mathcal {R}}_{T}(\varvec{\pi }, \nu )&= {\mathbb {E}} \left[ \sum _{t=1}^{T}\left( \sum _{i=1}^{N} \varvec{\beta }^{({\mathcal {A}}_i),\textsf{T}}_{t} \varvec{Z}^{({\mathcal {A}}_i)}_t - \sum _{i=1}^N \min _{\varvec{Z}^{({\mathcal {A}}_{i})} \in {\mathcal {B}}^{(K)}_{||\cdot ||_0, 1}(\varvec{0})} \left[ \sum _{t=1}^{T}\varvec{\beta }^{({\mathcal {A}}_i), \textsf{T}}_{t}\varvec{Z}^{({\mathcal {A}}_i)} \right] \right) \right] \nonumber \\&=\sum _{i=1}^N{\mathbb {E}}\left[ \sum _{t=1}^T \varvec{\beta }^{({\mathcal {A}}_i),\textsf{T}}_{t} \varvec{Z}^{({\mathcal {A}}_i)}_t -\min _{\varvec{Z}^{({\mathcal {A}}_{i})} \in {\mathcal {B}}^{(K)}_{||\cdot ||_0, 1}(\varvec{0})} \left[ \sum _{t=1}^{T}\varvec{\beta }^{({\mathcal {A}}_i), \textsf{T}}_{t}\varvec{Z}^{({\mathcal {A}}_i)}\right] \right] \\&= \sum _{i=1}^N{\mathcal {R}}^{({\mathcal {A}}_i)}_T \left( \varvec{\pi }^{({\mathcal {A}}_i)},\nu ^{({\mathcal {A}}_i)}\right) \end{aligned}$$

where \(\varvec{\beta }_t = \left[ \varvec{\beta }^{{\mathcal {A}}_1, \textsf{T}}_{t},\ldots , \varvec{\beta }^{{\mathcal {A}}_N, \textsf{T}}_{t}\right] ^{\textsf{T}}\) and \(\varvec{Z}_{t} =\left[ \varvec{Z}^{({\mathcal {A}}_1),\textsf{T}}_t,\ldots , \varvec{Z}^{({\mathcal {A}}_N),\textsf{T}}_t\right] ^{\textsf{T}}\), \(\varvec{Z} =\left[ \varvec{Z}^{({\mathcal {A}}_1), \textsf{T}},\ldots , \varvec{Z}^{({\mathcal {A}}_N), \textsf{T}}\right] ^{\textsf{T}}\) are the decompositions of the corresponding vectors into agent-specific parts, the joint policy is \(\varvec{\pi }(\cdot ) =\prod _{i=1}^{N} \varvec{\pi }^{({\mathcal {A}}_i)}(\cdot )\), the joint environment is \(\nu = \prod _{i=1}^N\nu ^{({\mathcal {A}}_i)}\), and \({\mathcal {B}}^{(K)}_{||\cdot ||_0, 1}(\varvec{0})\) is the unit ball with respect to the \(||\cdot ||_0\) norm centered at \(\varvec{0}\) in \([0,1]^K\). Moreover, the worst-case regret (9) can also be decomposed into agent-specific form:

$$\begin{aligned} {\mathcal {R}}^\star _{T} = \sup _{\nu }{\mathcal {R}}_{T}(\varvec{\pi }, \nu ) \ \ \ \iff \ \ \ \sup _{\nu ^{({\mathcal {A}}_i)}} {\mathcal {R}}^{({\mathcal {A}}_i)}_T \left( \varvec{\pi }^{({\mathcal {A}}_i)}, \nu ^{({\mathcal {A}}_i)}\right) , \ \ \ i=1,\ldots ,N. \end{aligned}$$

This decomposition allows us to significantly reduce the search space and apply the two following algorithms to each agent \({\mathcal {A}}_i\) in a completely parallel fashion.
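The practical consequence of this decomposition is that sampling from the joint policy \(\varvec{\pi }=\prod _{i=1}^{N}\varvec{\pi }^{({\mathcal {A}}_i)}\) reduces to N independent K-way draws. A minimal illustrative sketch (the softmax parameterization and the sizes are assumptions for illustration, not the paper's exact settings):

```python
import math
import random

def sample_architecture(logits_per_agent, temperature=1.0, rng=random):
    """Sample a joint action profile a = (a_1, ..., a_N) from the product
    policy pi(a) = prod_i pi_i(a_i), where each pi_i is a softmax over
    that agent's K action scores.  The joint space has K^N architectures,
    but sampling costs only N independent K-way choices."""
    arch = []
    for logits in logits_per_agent:
        weights = [math.exp(l / temperature) for l in logits]
        total = sum(weights)
        probs = [w / total for w in weights]
        arch.append(rng.choices(range(len(logits)), weights=probs)[0])
    return arch

# 3 agents, 4 candidate operations each: 4^3 = 64 joint architectures,
# but only 3 * 4 = 12 policy parameters.
scores = [[0.1, 0.9, 0.0, 0.2],
          [0.5, 0.5, 0.5, 0.5],
          [2.0, 0.0, 0.0, 0.0]]
arch = sample_architecture(scores)
```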

D Theoretical Guarantees

1.1 D.1 MANAS-LS

First, we need to be more specific about how the estimates \(\tilde{\varvec{\beta }}^{({\mathcal {A}}_i)}_t[k]\) are obtained.

To obtain theoretical guarantees, we consider the least-squares estimates, as in Cesa-Bianchi and Lugosi (2012):

$$\begin{aligned} \tilde{\varvec{\beta }}_t={\mathcal {L}}^{(\textrm{val})}_t \varvec{P}^{\dagger }\varvec{Z}_t \text { where } \varvec{P} = {\mathbb {E}} \left[ \varvec{Z} \varvec{Z}^T \right] \text { and } \varvec{Z} \text { has law } \varvec{\pi }_t(\cdot ) = \prod _{i=1}^{N}\varvec{\pi }_t^{({\mathcal {A}}_i)}(\cdot ) \end{aligned}$$
(10)

Our analysis is under the assumption that each \(\varvec{\beta }_t\in {\mathbb {R}}^{KN}\) belongs to the linear space spanned by the space of sparse architectures \(\varvec{\mathcal {Z}}\). This is not a strong assumption, as the only condition on a sparse architecture is that exactly one operation per agent is active.

Theorem 1

Let us consider the neural architecture search problem in a multi-agent combinatorial online learning form with N agents such that each agent has K actions. Then after T rounds, MANAS-LS achieves a joint policy \(\{\varvec{\pi }_t\}^{T}_{t=1}\) with expected simple regret (Eq. 3) bounded by \({\mathcal {O}}\left( e^{-T/H}\right)\) in any adversarial environment with complexity bounded by \(H=N(\min _{j\ne k^\star _i,i\in \{1, \ldots ,N\} }\varvec{B}^{({\mathcal {A}}_i)}_T[j] -\varvec{B}^{({\mathcal {A}}_i)}_T[k^\star _i] )\), where \(k^\star _i= \arg \min _{j\in \{1,\ldots ,K\}} \varvec{B}^{({\mathcal {A}}_i)}_T[j]\).

Proof

In Eq. 10 we use the same construction of the estimates \(\tilde{\varvec{\beta }}_t\) as in ComBand. Using Corollary 14 of Cesa-Bianchi and Lugosi (2012), we then have that \(\widetilde{\varvec{B}}_t\) is an unbiased estimate of \(\varvec{B}_t\).

Given the adversary losses, the random variables \(\tilde{\varvec{\beta }}_{t}\), \(t\in [T]\), can be dependent on each other, as \(\pi _{t}\) depends on observations from previous rounds. Therefore, we use the Azuma inequality for martingale differences (Freedman, 1975).

Without loss of generality, we assume that the losses are bounded such that \({\mathcal {L}}^{(\textrm{val})}_t\in [0,1]\) for all t. Therefore, we can bound the simple regret of each agent by the probability of misidentifying the best operation, \(P(k^\star _i\ne a^{{\mathcal {A}}_i}_{T+1})\).

We consider a fixed adversary of complexity bounded by H. For simplicity, and without loss of generality, we order the operations such that \(\varvec{B}^{({\mathcal {A}}_i)}_T[1] <\varvec{B}^{({\mathcal {A}}_i)}_T[2]\le \ldots \le \varvec{B}^{({\mathcal {A}}_i)}_T[K]\) for all agents.

For \(k>1\), we denote \(\Delta _{k} =\varvec{B}^{({\mathcal {A}}_i)}_T[k] -\varvec{B}^{({\mathcal {A}}_i)}_T[k^\star _i]\), and set \(\Delta _{1} =\Delta _{2}\).

Let \(\lambda _{min}\) be the smallest nonzero eigenvalue of \(\varvec{M}=E[\varvec{Z} \varvec{Z}^T]\), where \(\varvec{Z}\) is a random vector representing a sparse architecture distributed according to the uniform distribution.

$$\begin{aligned} P(k^\star _i\ne a^{{\mathcal {A}}_i}_{T+1})&= P\left( \exists k\in \{1,\ldots ,K\}: \widetilde{\varvec{B}}^{({\mathcal {A}}_i)}_T[1] \ge \widetilde{\varvec{B}}^{({\mathcal {A}}_i)}_T[k]\right) \\&\le P\left( \exists k\in \{1,\ldots ,K\}:\varvec{B}^{({ \mathcal {A}}_i)}_T[k]-\widetilde{\varvec{B}}^{({\mathcal {A}}_i)}_T[k]\right. \\&\left. \ge \frac{T\Delta _{k}}{2} \ \ \textrm{or}\ \ \widetilde{\varvec{B}}^{({\mathcal {A}}_i)}_T[1] -\varvec{B}^{({\mathcal {A}}_i)}_T[1] \ge \frac{T\Delta _{1}}{2}\right) \\&\le P\left( \widetilde{\varvec{B}}^{({\mathcal {A}}_i)}_{ T}[1] -\varvec{B}_T^{({\mathcal {A}}_i)}[1] \ge \frac{T\Delta _{1}}{2} \right) \\&\quad +\sum _{k=2}^{K} P\left( \varvec{B}_T^{({ \mathcal {A}}_i)}[k]-\widetilde{\varvec{B}}^{({ \mathcal {A}}_i)}_{T}[k] \ge \frac{T\Delta _{k}}{2}\right) \\&{\mathop {\le }\limits ^{{\textbf {(a)}}}} \sum _{k=1}^{K} \exp \left( -\frac{(\Delta _{k})^2T}{2N\log (K)/\lambda _{min}}\right) \\&\le K\exp \left( -\frac{(\Delta _{1})^2T}{2N\log (K)/\lambda _{min}}\right) , \end{aligned}$$

where (a) uses Azuma’s inequality for martingales, applied to the sum of the zero-mean random variables \(\tilde{\varvec{\beta }}_{k,t}-\varvec{\beta }_{k,t}\), for which we have the following bound on the range. The range of \(\tilde{\varvec{\beta }}_{k,t}\) is \([0, N\log (K)/\lambda _{min}]\): indeed, our sampling policy is uniform with probability \(1/\log (K)\), so one can bound \(\tilde{\varvec{\beta }}_{k,t}\) as in Cesa-Bianchi and Lugosi (2012, Theorem 1). Therefore, we have \(|\tilde{\varvec{\beta }}_{k,t}-\varvec{\beta }_{k,t} | \le N\log (K)/\lambda _{min}\).

We recover the result with a union bound on all agents. \(\square\)
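To get a feel for the final bound \(K\exp \left( -(\Delta _{1})^2T/(2N\log (K)/\lambda _{min})\right)\), the following sketch evaluates it numerically (the constants K, N, \(\Delta _1\) and \(\lambda _{min}\) below are illustrative values, not taken from the paper's experiments):

```python
import math

def misid_prob_bound(K, N, T, delta1, lam_min):
    """Evaluate K * exp(-Delta_1^2 * T / (2 * N * log(K) / lam_min)),
    the bound on the probability of misidentifying the best operation
    from the last line of the proof of Theorem 1."""
    return K * math.exp(-(delta1 ** 2) * T * lam_min / (2 * N * math.log(K)))

# Illustrative constants: 8 operations, 4 agents, gap 0.05, lambda_min 0.5.
# The bound decays exponentially as the number of rounds T grows.
bounds = [misid_prob_bound(K=8, N=4, T=T, delta1=0.05, lam_min=0.5)
          for T in (10_000, 50_000, 100_000)]
```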

1.2 D.2 MANAS

We consider a simplified notion of regret: a per-agent regret in which each agent treats the rest of the agents as part of the adversarial environment. Our new objective is to minimise

$$\begin{aligned} \sum _{i=1}^N{\mathcal {R}}^{\star ,i}_{T}(\pi ^{({\mathcal {A}}_i)}) = \sum _{i=1}^N \sup _{\varvec{a}_{-i},\nu } {\mathbb {E}} \left[ \sum _{t=1}^{T} {\mathcal {L}}^{(\textrm{val})}_{t} (\varvec{a}^{({\mathcal {A}}_i)}_{t},\varvec{a}_{-i}) -\min _{\varvec{a}\in \{1,\ldots ,K\}} \left[ \sum _{t=1}^{T}{\mathcal {L}}^{(\textrm{val})}_{t} (\varvec{a},\varvec{a}_{-i})\right] \right] , \end{aligned}$$

where \(\varvec{a}_{-i}\) is a fixed set of actions played by all agents except agent \({\mathcal {A}}_i\) for the T rounds of the game, and \(\nu\) contains all the losses: \(\nu =\{{\mathcal {L}}^{(\textrm{val})}_{t}(\varvec{a})\}_{t\in \{1,\ldots ,T\},\varvec{a}\in \{1,\ldots ,K^N\} }\).

We can then prove the following bound for this new notion of regret.

Theorem 2

Let us consider the neural architecture search problem in a multi-agent combinatorial online learning form with N agents such that each agent has K actions. Then after T rounds, MANAS achieves joint policy \(\{\varvec{\pi }_t\}^{T}_{t=1}\) with expected cumulative regret bounded by \({\mathcal {O}}\left( N\sqrt{TK\log K}\right)\).

Proof

First, we fix an agent \({\mathcal {A}}_i\) and define

$$\begin{aligned} {\mathcal {R}}^{\star ,i}_{T}(\pi ^{({\mathcal {A}}_i)}, \varvec{a}_{-i}) = \sup _{\nu } {\mathbb {E}}\left[ \sum _{t=1}^{T} {\mathcal {L}}^{(\textrm{val})}_{t}(\varvec{a}^{({\mathcal {A}}_i)}_{t}, \varvec{a}_{-i}) - \min _{\varvec{a}\in \{1,\ldots ,K\}} \left[ \sum _{t=1}^{T}{\mathcal {L}}^{(\textrm{val})}_{t} (\varvec{a},\varvec{a}_{-i})\right] \right] . \end{aligned}$$

We want to relate the game that agent \({\mathcal {A}}_i\) plays against an adversary, when the actions of all the other agents are fixed to \(\varvec{a}_{-i}\), to the vanilla EXP3 setting. To see why this is the EXP3 setting, first note that \({\mathcal {L}}^{(\textrm{val})}_{t}(\varvec{a}_t)\) is a function of \(\varvec{a}_t\) that can take \(K^N\) arbitrary values. When we fix \(\varvec{a}_{-i}\), \({\mathcal {L}}^{(\textrm{val})}_{t}(\varvec{a}^{({\mathcal {A}}_i)}_{t}, \varvec{a}_{-i})\) is a function of \(\varvec{a}^{({\mathcal {A}}_i)}_{t}\) that can take only K arbitrary values.

One can redefine \({\mathcal {L}}^{@,(\textrm{val})}_{t} (\varvec{a}^{({\mathcal {A}}_i)}_{t})={\mathcal {L}}^{(\textrm{val})}_{t} (\varvec{a}^{({\mathcal {A}}_i)}_{t},\varvec{a}_{-i})\), and then the game reduces to the vanilla adversarial multi-armed bandit, where at each time the learner plays \(\varvec{a}^{({\mathcal {A}}_i)}_{t}\in \{1,\ldots ,K\}\) and observes/incurs the loss \({\mathcal {L}}^{@,(\textrm{val})}_{t}(\varvec{a}^{({\mathcal {A}}_i)}_{t})\). Said differently, this defines a game whose environment \(\nu '\) contains all the losses: \(\nu ' =\{{\mathcal {L}}^{@,(\textrm{val})}_{t} (\varvec{a}^{({\mathcal {A}}_i)}) \}_{t\in \{1,\ldots ,T\},\varvec{a}^{({\mathcal {A}}_i)}\in \{1,\ldots ,K\}}\).

For all \(\varvec{a}_{-i}\), the standard EXP3 regret bound gives

$$\begin{aligned} {\mathcal {R}}^{\star ,i}_{T}(EXP3, \varvec{a}_{-i}) \le 2\sqrt{TK\log (K)}. \end{aligned}$$

Taking the supremum over \(\varvec{a}_{-i}\), we have

$$\begin{aligned} {\mathcal {R}}^{\star ,i}_{T}(EXP3)&\le \sup _{\varvec{a}_{-i}} 2 \sqrt{TK\log (K)}\\&= 2\sqrt{TK\log (K)} \end{aligned}$$

Summing over the N agents,

$$\begin{aligned} \sum _{i=1}^N {\mathcal {R}}^{\star ,i}_{T}(EXP3)\le 2N\sqrt{TK\log (K)} \end{aligned}$$

\(\square\)
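The per-agent scheme analysed above can be sketched in a few lines. The following is a toy illustration, not the MANAS implementation: each of N agents runs an independent EXP3 learner over its K operations, all agents observe the same scalar validation loss for the jointly sampled architecture, and each updates only its own sampled action with an importance-weighted loss estimate. The environment `val_loss` and all function names are illustrative assumptions.

```python
import numpy as np

def sample_architecture(weights, gamma, rng):
    """Each agent draws one operation from its EXP3 distribution."""
    actions, prob_list = [], []
    for w in weights:
        K = len(w)
        p = (1 - gamma) * w / w.sum() + gamma / K  # mix in uniform exploration
        a = rng.choice(K, p=p)
        actions.append(a)
        prob_list.append(p)
    return actions, prob_list

def update_agents(weights, actions, prob_list, loss, gamma):
    """EXP3 update; every agent receives the same global validation loss."""
    for w, a, p in zip(weights, actions, prob_list):
        est = loss / p[a]                 # importance-weighted loss estimate
        w[a] *= np.exp(-gamma * est / len(w))
    return weights

# Toy environment: the joint loss is the mean of per-agent components,
# with operation 0 the best choice for every agent (purely illustrative).
def val_loss(actions):
    return float(np.mean([0.1 if a == 0 else 0.9 for a in actions]))

rng = np.random.default_rng(0)
N, K, T, gamma = 3, 4, 3000, 0.1
weights = [np.ones(K) for _ in range(N)]
for _ in range(T):
    actions, probs = sample_architecture(weights, gamma, rng)
    weights = update_agents(weights, actions, probs, val_loss(actions), gamma)

best = [int(np.argmax(w)) for w in weights]
```

After T rounds every agent's distribution concentrates on its best operation, consistent with each agent treating the others as part of the adversarial environment.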

E Relation between weight sharing and cumulative regret

Ideally, we would like to obtain for any given architecture \(\mathcal {Z}\) the value \({\mathcal {L}}_{val}(\mathcal {Z},\varvec{w}^\star (\mathcal {Z}))\). However, obtaining \(\varvec{w}^\star (\mathcal {Z}) = \arg \min _{\varvec{w}} {\mathcal {L}}_{train}(\varvec{w}, \mathcal {Z})\) for any given fixed \(\mathcal {Z}\) would already require heavy computation. In our approach, the \(\varvec{w}_t\) that we compute and update is common to all \(\mathcal {Z}_t\), as \(\varvec{w}_t\) replaces \(\varvec{w}^\star (\mathcal {Z}_t)\). This simplification leads to learning a weight \(\varvec{w}_t\) that tends to minimise the loss \({\mathbb {E}}_{\mathcal {Z}\sim \pi _t}[{\mathcal {L}}_{val} (\mathcal {Z},\varvec{w}(\mathcal {Z}))]\) instead of minimising \({\mathcal {L}}_{val}(\mathcal {Z}_t,\varvec{w}(\mathcal {Z}_t))\). If \(\pi _t\) is concentrated on a fixed \(\mathcal {Z}\), then these two expressions are close. Moreover, when \(\pi _t\) is concentrated on \(\mathcal {Z}\), then \(\varvec{w}_t\) accurately approximates \(\varvec{w}^\star (\mathcal {Z})\) after a few steps. Note that this is an argument for using sampling algorithms that minimise the cumulative regret, as they naturally tend to play one specific architecture almost all the time. However, there is a potential pitfall of converging to a locally optimal solution, as \(\varvec{w}_t\) might not have been trained well enough to accurately estimate the loss of other, potentially better, architectures.
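As a toy illustration of this interplay (not the paper's implementation), consider a one-edge search space with two candidate operations sharing a single weight vector: the sampling policy \(\pi _t\) concentrates on the better operation, and the shared weight of that operation drifts toward its architecture-specific optimum \(\varvec{w}^\star (\mathcal {Z})\). The full-information policy update and all names below are simplifying assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))
y = x[:, 0] * 2.0                        # ground truth uses feature 0 only

def predict(w, arch, x):
    # arch selects which feature ("operation") the single edge uses
    return w[arch] * x[:, arch]

def loss(w, arch, x, y):
    return float(np.mean((predict(w, arch, x) - y) ** 2))

w = np.zeros(2)                          # shared weights, one per operation
probs = np.array([0.5, 0.5])             # sampling policy pi_t
lr, eta = 0.1, 0.5
for t in range(300):
    arch = rng.choice(2, p=probs)
    # SGD step on the training loss, touching only the sampled op's weight
    grad = np.mean(2 * (predict(w, arch, x) - y) * x[:, arch])
    w[arch] -= lr * grad
    # exponential-weights policy update from the validation losses
    # (full information here for simplicity, unlike the bandit feedback above)
    probs = probs * np.exp(-eta * np.array([loss(w, 0, x, y), loss(w, 1, x, y)]))
    probs /= probs.sum()
```

The policy concentrates on operation 0 and the shared weight `w[0]` approaches the optimum 2.0 for that architecture, illustrating why low-regret samplers, which play one architecture almost all the time, make the shared weight a good stand-in for \(\varvec{w}^\star (\mathcal {Z})\).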


Cite this article

Lopes, V., Carlucci, F.M., Esperança, P.M. et al. Manas: multi-agent neural architecture search. Mach Learn 113, 73–96 (2024). https://doi.org/10.1007/s10994-023-06379-w
