Sherpa: Robust Hyperparameter Optimization for Machine Learning

Sherpa is a hyperparameter optimization library for machine learning models. It is specifically designed for problems with computationally expensive, iterative function evaluations, such as the hyperparameter tuning of deep neural networks. With Sherpa, scientists can quickly optimize hyperparameters using a variety of powerful and interchangeable algorithms. Sherpa can be run on either a single machine or in parallel on a cluster. Finally, an interactive dashboard enables users to view the progress of models as they are trained, cancel trials, and explore which hyperparameter combinations are working best. Sherpa empowers machine learning practitioners by automating the more tedious aspects of model tuning. Its source code and documentation are available at https://github.com/sherpa-ai/sherpa.


Motivation and significance
Hyperparameters are the tuning parameters of machine learning models. Hyperparameter optimization refers to the process of choosing optimal hyperparameters for a machine learning model; this choice is crucial to obtain the best performance from the model. Since hyperparameters cannot be learned directly from the training data, their optimization is often a process of trial and error conducted manually by the researcher. There are two problems with this approach. Firstly, it is time consuming and can take days or even weeks of the researcher's attention. Secondly, it depends on the researcher's ability to interpret results and choose good hyperparameter settings. These limitations create a strong need to automate the process. Sherpa is software that addresses this need.
Existing hyperparameter optimization software can be divided into Bayesian optimization software, bandit and evolutionary algorithm software, framework-specific software, and all-round software. Software that implements Bayesian optimization started with SMAC [1], Spearmint [2], and HyperOpt [3]. More recent software in this category includes GPyOpt [4], RoBO [5], Dragonfly [6], Cornell-MOE [7,8], and mlrMBO [9]. These packages provide high-quality, stand-alone Bayesian optimization implementations, often with unique twists. However, most of them do not provide infrastructure for parallel training.
A number of framework-specific libraries have also been proposed. Auto-WEKA [16] and Auto-Sklearn [17] focus on WEKA [18] and Scikit-learn [19], respectively. Furthermore, a number of packages have been proposed for the machine learning framework Keras [20]: Hyperas, Auto-Keras [21], Talos, Kopt, and HORD each provide hyperparameter optimization specifically for Keras. These libraries make it easy to get started due to their tight integration with the machine learning framework. However, researchers will inevitably run into limitations when a different machine learning framework is needed.
Lastly, a number of implementations aim at being framework agnostic and also support multiple optimization algorithms. Table 1 shows a detailed comparison of these "all-round" packages to Sherpa. Note that we excluded Google Vizier [22] and similar frameworks from other cloud computing providers since these are not free to use.
Sherpa is already being used in a wide variety of applications, including machine learning methods [27], solid state physics [28], particle physics [29], medical image analysis [30], and cyber security [31]. As the number of machine learning applications grows rapidly, we can expect a growing need for hyperparameter optimization software such as Sherpa.

Hyperparameter Optimization
We begin by laying out the components of a hyperparameter optimization. Consider the training of a machine learning model: the user has a model that is trained on data, hyperparameters must be set before training begins, and an objective value is obtained at the end of training.
This workflow can be illustrated via the training of a neural network. The model is a neural network. The data are images that the neural network is trained on. The hyperparameter setting is the number of hidden layers of the neural network. The objective is the prediction accuracy on a hold-out dataset obtained at the end of training.
For automated hyperparameter optimization we also need hyperparameter ranges, a results table, and a hyperparameter optimization algorithm. The hyperparameter ranges define what values each hyperparameter is allowed to take. The results table stores hyperparameter settings and their associated objective values. Finally, the algorithm takes the results and ranges and produces a new suggestion for a hyperparameter setting. We refer to this suggestion as a trial.
For the neural network example the hyperparameter range might be 1, 2, 3, or 4 hidden layers. Previous results might show that 1 hidden layer corresponds to 80% accuracy and 3 hidden layers to 90% accuracy. The algorithm might then produce a new trial with 4 hidden layers. After training the neural network with 4 hidden layers we find it achieves 88% accuracy and add this to the results. Then the next trial is suggested.
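The loop described above can be sketched in plain Python, independent of Sherpa itself; the range, results table, and suggestion rule below are illustrative stand-ins, not Sherpa's implementation:

```python
import random

# Hyperparameter range: allowed numbers of hidden layers.
hidden_layer_range = [1, 2, 3, 4]

# Results table: maps a tried setting to its observed objective (accuracy).
results = {1: 0.80, 3: 0.90}

def suggest_trial(results, value_range):
    """A trivial stand-in 'algorithm': suggest an untried setting at random."""
    untried = [v for v in value_range if v not in results]
    return random.choice(untried) if untried else None

trial = suggest_trial(results, hidden_layer_range)
# ... train a model with `trial` hidden layers, observe its accuracy ...
results[trial] = 0.88  # add the new observation to the results table
```

A real algorithm would use the accumulated results to pick promising settings rather than sampling blindly, but the data flow between ranges, results, and trials is the same.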

Components
We now describe how Sherpa implements the components described in Section 2.1. Sherpa implements hyperparameter ranges as sherpa.Parameter objects. The algorithm is implemented as a sherpa.algorithms.Algorithm object. A list of hyperparameter ranges and an algorithm are combined to create a sherpa.Study (Figure 1). The study stores the results. Trials are implemented as sherpa.Trial objects.  Sherpa implements two user interfaces. We will refer to the two interfaces as API mode and parallel mode.

API Mode
In API mode the user interacts with the Study object. Given a study s:

1. A new trial t is obtained by calling s.get_suggestion() or by iterating over the study (e.g. for t in s).
2. First, t.parameters is used to initialize and train a machine learning model. Then s.add_observation(t, objective=o) is called to add objective o for trial t. Invalid observations are automatically excluded from the results.
3. Finally, s.finalize(t) informs Sherpa that the model training is finished.
Interacting with the Study class is easy and requires minimal setup. The limitation of API mode is that trials cannot be evaluated in parallel.

Parallel Mode
In parallel mode multiple trials can be evaluated in parallel. The user provides two scripts: a server script and a machine learning (ML) script. The server script defines the hyperparameter ranges, the algorithm, the job scheduler, and the command to execute the machine learning script. The optimization starts by calling sherpa.optimize. In the machine learning script the user trains the machine learning model given some hyperparameters and adds the resulting objective value to Sherpa. Using a sherpa.Client called c, a trial t is obtained by calling c.get_trial(). To add observations, c.send_metrics(trial=t, objective=o) is used. Internally, sherpa.optimize runs a loop that uses the Study class.
Figure 2: Architecture diagram for parallel hyperparameter optimization in Sherpa. The user only interacts with Sherpa via the solid red arrows; everything else happens internally.

Available Hyperparameter Types
Sherpa supports four hyperparameter types: sherpa.Continuous, sherpa.Discrete, sherpa.Choice, and sherpa.Ordinal. These correspond to a range of floats, a range of integers, an unordered categorical variable, and an ordered categorical variable, respectively. Each parameter has name and range arguments. For continuous and discrete variables the range expects a list defining the lower and upper bound. For choice and ordinal variables the range expects the list of categories.
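The semantics of the four range types can be illustrated with plain-Python sampling; the representation below is a sketch for illustration and is not Sherpa's actual parameter classes:

```python
import random

def sample(kind, value_range):
    # Continuous: uniform float in [low, high].
    if kind == "continuous":
        return random.uniform(value_range[0], value_range[1])
    # Discrete: integer in [low, high].
    if kind == "discrete":
        return random.randint(value_range[0], value_range[1])
    # Choice (unordered) and ordinal (ordered): one of the listed categories.
    return random.choice(value_range)

# Hypothetical search space in the style of the MNIST example later on.
setting = {
    "learning_rate": sample("continuous", [1e-4, 1e-1]),
    "num_units": sample("discrete", [32, 256]),
    "activation": sample("choice", ["relu", "tanh", "sigmoid"]),
}
```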

Diversity of Algorithms
Sherpa aims to help researchers at various stages in their model development. For this reason, it provides a choice of hyperparameter tuning algorithms. The following optimization algorithms are currently supported.
• sherpa.algorithms.RandomSearch: Random Search [32] samples hyperparameter settings uniformly from the specified ranges. It is a robust algorithm because it explores the space uniformly. Furthermore, the user can draw their own inferences from the results using the dashboard.
• sherpa.algorithms.GridSearch: Grid Search follows a grid over the hyperparameter space and evaluates all combinations. It is useful to systematically explore one or two hyperparameters. It is not recommended for more than two hyperparameters.
• sherpa.algorithms.bayesian_optimization.GPyOpt: Bayesian optimization is a model-based search. For each trial it picks the most promising hyperparameter setting based on prior results. Sherpa's implementation wraps the package GPyOpt [4].
• sherpa.algorithms.successive_halving.SuccessiveHalving: Asynchronous Successive Halving (ASHA) [33] is a hyperparameter optimization algorithm based on multi-armed bandits. It allows the efficient exploration of a large hyperparameter space. This is accomplished by the early stopping of unpromising trials.
• sherpa.algorithms.PopulationBasedTraining: Population-based Training (PBT) [12] is an evolutionary algorithm. The algorithm jointly optimizes a population of models and their hyperparameters. This is achieved by adjusting hyperparameters during training. It is particularly suited for neural network training hyperparameters such as learning rate, weight decay, or batch size.
• sherpa.algorithms.LocalSearch: Local Search is a heuristic algorithm. It starts with a seed hyperparameter setting. During optimization it randomly perturbs one hyperparameter at a time. If a setting improves on the seed then it becomes the new seed. This algorithm is particularly useful if the user already has a well performing hyperparameter setting.
All implemented algorithms allow parallel evaluation and can be used with all available parameter types. An empirical comparison of the algorithms can be found in the documentation.
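The early-stopping idea behind Successive Halving can be sketched in a few lines. This is a synchronous toy version for illustration, not Sherpa's asynchronous (ASHA) implementation:

```python
def successive_halving(configs, evaluate, eta=2, max_resource=8):
    """Evaluate all configs at a small budget; repeatedly keep the best
    1/eta fraction and give the survivors eta times more resource."""
    resource = 1
    while len(configs) > 1 and resource <= max_resource:
        # Score every surviving config at the current budget.
        scores = {c: evaluate(c, resource) for c in configs}
        # Keep the top 1/eta (lower objective is better here).
        keep = max(1, len(configs) // eta)
        configs = sorted(configs, key=scores.get)[:keep]
        resource *= eta
    return configs[0]

# Toy objective: loss shrinks with resource, offset by config quality.
best = successive_halving(
    configs=[0.1, 0.5, 0.9, 1.3],
    evaluate=lambda c, r: c + 1.0 / r,
)
```

Because poor configurations are discarded after only a small training budget, most of the compute is spent on the promising ones.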

Accounting for Random Variation
Sherpa can account for such variation via the Repeat algorithm. The objective value of a model may vary between training runs, for example due to random initialization or stochastic training. The Repeat algorithm runs each hyperparameter setting multiple times so that this variation can be taken into account when analyzing the results.
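The effect of repeated evaluation can be illustrated without Sherpa: run each setting several times and aggregate, so that a mean and spread rather than a single noisy value enter the comparison (a sketch with a stand-in stochastic objective):

```python
import random
import statistics

def noisy_objective(setting, seed):
    """Stand-in for a stochastic training run: the observed score is the
    setting's true score plus random noise."""
    random.seed(seed)
    return setting["true_score"] + random.gauss(0, 0.02)

def repeat_evaluate(setting, num_repeats=5):
    scores = [noisy_objective(setting, seed) for seed in range(num_repeats)]
    return statistics.mean(scores), statistics.stdev(scores)

mean, std = repeat_evaluate({"true_score": 0.90})
```

Comparing settings by their mean (and considering the spread) avoids preferring a setting that merely got a lucky single run.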

Visualization Dashboard
Sherpa provides an interactive web-based dashboard. It allows the user to monitor progress of the hyperparameter optimization in real time. Figure 3 shows a screenshot of the dashboard.
At the top of the dashboard is a parallel coordinates plot [34,35]. It allows exploration of the relationships between hyperparameter settings and objective values (Figure 3 top). Each vertical axis corresponds to a hyperparameter or the objective, and the axes can be brushed over to select subsets of trials. The plot is implemented using the D3.js parallel-coordinates library by Chang [36]. At the bottom right is a line chart showing objective values against training iteration (Figure 3 bottom right). This chart allows the user to monitor the training progress of each trial and to analyze whether a trial's training converged. At the bottom left is a table of all completed trials (Figure 3 bottom left). Hovering over trials in the table highlights the corresponding lines in the plots. Finally, the dashboard has a stopping button (Figure 3 top right corner) that allows the user to cancel the training of unpromising trials.
The dashboard runs automatically during a hyperparameter optimization. It can be accessed in a web browser via a link provided by Sherpa. The dashboard is useful to quickly evaluate questions such as:
• Are the selected hyperparameter ranges appropriate?
• Is training unstable for some hyperparameter settings?
• Does a particular hyperparameter have little impact on the performance of the machine learning algorithm?
• Are the best observed hyperparameter settings consistent?
Based on these observations the user can refine the hyperparameter ranges or choose a different algorithm, if appropriate.

Scaling up with a Cluster
In parallel mode Sherpa can run parallel evaluations. A job scheduler is responsible for running the user's machine learning script. The following job schedulers are implemented.
• The LocalScheduler evaluates parallel trials on the same computation node. This scheduler is useful for running on multiple local CPU cores or GPUs. It has a simple resource handler for GPU allocation (see Figure 5 for an example).
• The SGEScheduler uses Sun Grid Engine (SGE) [37]. Submission arguments and an environment profile can be specified via arguments to the scheduler.
• The SLURMScheduler is based on SLURM [38]. Its interface is similar to the SGEScheduler.
Concurrency between workers is handled via MongoDB, a NoSQL database program. Parallel mode requires MongoDB to be installed on the system.

Handwritten Digits Classification with a Neural Network
The following is an example of a Sherpa hyperparameter optimization. It uses the MNIST handwritten digits dataset [39]. A Keras neural network with one hidden layer and a softmax output is used to classify the digits. The hyperparameters are the learning rate of the Adam [40] optimizer, the number of hidden units, and the hidden layer activation function. The search is first conducted using Sherpa's API mode. After that we show the same example using Sherpa's parallel mode.
Figure 4 shows the hyperparameter optimization in Sherpa's API mode. The script starts with imports and loading of the MNIST dataset. Next, the hyperparameters learning_rate, num_units, and activation are defined. These refer to the Adam learning rate, the number of hidden layer units, and the hidden layer activation function, respectively. As optimization algorithm the GPyOpt algorithm is chosen. Hyperparameter ranges and algorithm are combined via the Study. The lower_is_better flag is set to False because we will be maximizing the classification accuracy, so higher objective values are better. After that a for-loop iterates over the study, yielding a trial at each iteration. A Keras model is instantiated using the trial's hyperparameter settings, then iteratively trained and evaluated via an inner for-loop. We add an observation for each iteration and call finalize after the training is finished. Note that we pass the loss as context to add_observation. The context accepts a dictionary with any additional metrics that the user wants to record.
Code to replicate this example is available as a Jupyter notebook and on Google Colab. A video tutorial is also available on YouTube, as are tutorials using the Successive Halving and Population Based Training algorithms.
import sherpa
import sherpa.algorithms.bayesian_optimization as bayesian_optimization
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.datasets import mnist
from keras.optimizers import Adam

epochs = 15
(x_train, y_train), (x_test, y_test) = mnist.load_data()

Parallel Mode
We now show the same hyperparameter optimization using Sherpa's parallel mode. Figure 5 (top) shows the server script. First, the hyperparameters and search algorithm are defined. This time we also define a LocalScheduler instance. Hyperparameters, algorithm, and scheduler are passed to the sherpa.optimize function. We also pass the command "python trial.py", which indicates how to execute the user's machine learning script. Furthermore, the argument max_concurrent=2 indicates that two evaluations will be running at a time. Figure 5 (bottom) shows the machine learning script. First, we set environment variables for GPU configuration. Next we create a Client. To obtain hyperparameters we call the client's get_trial method. During training we call the client's send_metrics method, which replaces add_observation in parallel mode. Also, in parallel mode no finalize call is needed.
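The division of labor in parallel mode, a server loop proposing trials while several workers evaluate them concurrently, can be mimicked with the standard library alone. This is a conceptual sketch with a stand-in evaluation function, not Sherpa's MongoDB-backed implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(trial):
    """Stand-in for the user's machine learning script: trains a model
    for the given hyperparameters and returns an objective value."""
    num_units = trial["num_units"]
    return 1.0 / num_units  # pretend larger models achieve lower loss

trials = [{"num_units": n} for n in (32, 64, 128, 256)]

# max_workers plays the role of max_concurrent in sherpa.optimize:
# at most two trials are evaluated at any one time.
with ThreadPoolExecutor(max_workers=2) as pool:
    objectives = list(pool.map(evaluate, trials))

best_trial = trials[objectives.index(min(objectives))]
```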

Deep learning for Cloud Resolving Models
Introduction
The following illustrates an example of a Sherpa hyperparameter optimization in the field of climate modeling, specifically cloud resolving models (CRM). We apply Sherpa to optimize the deep neural network (DNN) of Rasp et al. [41].
The input to the model is a 94-dimensional vector. Features include temperature, humidity, meridional wind, surface pressure, incoming solar radiation, sensible heat flux, and latent heat flux. The output of the DNN is a 65-dimensional vector. It is composed of the sum of the CRM and radiative heating rates, the CRM moistening rate, the net radiative fluxes at the top of the atmosphere and surface of the earth, and the observed precipitation.

General Hyperparameter Optimization
Initially a random search was conducted on the following hyperparameters: batch normalization [42], dropout [43,44], Leaky ReLU coefficient [45], learning rate, nodes per hidden layer, number of hidden layers. The parameter ranges were chosen to encompass the parameters specified in [41]. From the dashboard (Figure A.6) we identify that the best performing configurations have low dropout, leaky ReLU coefficients mostly around 0.3 or larger, and learning rates mostly near 0.002. The majority of good models have 8 layers and batch normalization. However, the number of units does not seem to have a large impact. The hyperparameter ranges and best configuration are provided in Tables A.2 and A.3 in the appendix.

Optimization of the Learning Rate Schedule
An additional search was conducted to fine-tune the DNN training hyperparameters. Specifically, the initial learning rate and the learning rate decay were optimized. The range of initial learning rate values was ±10⁻⁴ around the best value from Section 4.2.2. The range of learning rate decay factors was 0.5 to 1. The learning rate is multiplied by this factor after every epoch to produce a new learning rate. In comparison, the model in Rasp et al. [41] uses a decay factor of approximately 0.58. The remaining hyperparameters were set to the best configuration from Section 4.2.2. A total of 50 trials were evaluated via random search. The best initial learning rate was found to be 0.001196 and the best decay factor 0.843784. The overall optimal hyperparameter setting is shown in Table A.3 of the supplementary materials.
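The multiplicative schedule described above is simple to write down: after e epochs the rate is the initial rate times the decay factor raised to the power e. Using the reported values, and the comparison decay of roughly 0.58, a short calculation shows how much more aggressively the faster schedule shrinks the rate:

```python
def learning_rate(lr0, decay, epoch):
    # The rate is multiplied by the decay factor after every epoch,
    # so after `epoch` epochs it equals lr0 * decay**epoch.
    return lr0 * decay ** epoch

lr0, decay = 0.001196, 0.843784
schedule = [learning_rate(lr0, decay, e) for e in range(5)]

# The faster decay of roughly 0.58 (comparable to Rasp et al.) shrinks
# the rate far more over the same number of epochs.
faster = [learning_rate(lr0, 0.58, e) for e in range(5)]
```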

Results
We compare the model found by Sherpa to the model from Rasp et al. [41] via R² plots (Figure A.7). These show the coefficient of determination at different pressures and latitudes. We find that the Sherpa model consistently outperforms the comparison model. In particular, it performs well at latitudes for which the prior model fails. Figure A.7f shows that the Sherpa model's loss continues to decrease after the Rasp et al. [41] model has converged. This is the result of the learning rate fine-tuning from Section 4.2.3.

Impact
Machine learning is used to an ever greater extent in the scientific community. Nearly every machine learning application can benefit from hyperparameter optimization. The issue is that researchers often do not have a practical tool at hand and therefore usually resort to tuning parameters manually. Sherpa aims to be this tool. Its goal is to require minimal learning from the user to get started, while supporting the user as their needs for parallel evaluation or exotic optimization algorithms grow. As shown by the references in Section 1, Sherpa is already being used by researchers to achieve improvements in a variety of domains. In addition, the software has been downloaded more than 6000 times from the PyPI Python package index. It also has over 160 stars on the software hosting website GitHub. A GitHub star means that another user has added the software to a personal list for later reference.

Conclusions
Sherpa is a flexible open-source software package for robust hyperparameter optimization of machine learning models. It provides the user with several interchangeable hyperparameter optimization algorithms, each of which may be useful at a different stage of model development. Its interactive dashboard allows the user to monitor and analyze the results of multiple hyperparameter optimization runs in real time. It also allows the user to see patterns in the performance of hyperparameters and judge the robustness of individual settings. Sherpa can be used on a laptop or in a distributed fashion on a cluster. In summary, rather than a black box that outputs a single hyperparameter setting, Sherpa provides the tools that a researcher needs when doing hyperparameter exploration and optimization for the development of machine learning models.

Conflict of Interest
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Appendix A. Deep learning for Cloud Resolving Models
Initially a random search was conducted on the hyperparameters listed in Table A.2. A screenshot of the Sherpa dashboard at the end of the hyperparameter optimization is shown in Figure A.6. Following the secondary search for an optimal learning rate schedule (Section 4.2.3), the hyperparameters in Table A.3 were found to be overall optimal. The optimized learning rate and schedule found by Sherpa are of considerable importance. Referencing the loss curves in Figure A.7f, one can see that the learning rate schedule used in [41] forces the learning rate to decay rapidly, causing an early plateau of the loss. The learning rate schedule discovered by Sherpa, on the other hand, allows the DNN to keep learning, further reducing the loss.