AutoMLBench: A Comprehensive Experimental Evaluation of Automated Machine Learning Frameworks

With the booming demand for machine learning applications, it has been recognized that the number of knowledgeable data scientists cannot scale with the growing data volumes and application needs in our digital world. In response to this demand, several automated machine learning (AutoML) frameworks have been developed to fill the gap in human expertise by automating the process of building machine learning pipelines. Each framework comes with different heuristics-based design decisions. In this study, we present a comprehensive evaluation and comparison of the performance characteristics of six popular AutoML frameworks, namely, AutoWeka, AutoSKlearn, TPOT, Recipe, ATM, and SmartML, across 100 data sets from established AutoML benchmark suites. Our experimental evaluation considers several aspects, including the performance impact of key design decisions such as the time budget, the size of the search space, meta-learning, and ensemble construction. The results of our study reveal various interesting insights that can significantly guide and impact the design of AutoML frameworks.


Introduction
Nowadays, we are witnessing tremendous interest in artificial intelligence applications across governments, industries, and research communities, with a yearly cost of around 12.5 billion US dollars report (2017). The driver for this interest is the advent and increasing popularity of machine learning (ML) and deep learning (DL) techniques. The rise of data generated from different sources, processing capabilities, and ML algorithms opened the way for adopting ML in a wide range of real-world applications Zomaya and Sakr (2017). This situation is increasingly contributing towards a potential data science crisis, similar to the software crisis Fitzgerald (2012), due to the crucial need for an increasing number of data scientists with solid knowledge and good experience who can keep up with harnessing the power of the massive amounts of data produced daily. Thus, we are witnessing a growing interest in automating the process of building machine learning pipelines so that the presence of a human in the loop can be dramatically reduced. Research in automated machine learning (AutoML) aims to alleviate both the computational cost and the human expertise required for developing machine learning pipelines through automation with efficient algorithms. In particular, AutoML techniques enable the widespread use of machine learning by domain experts and non-technical users.
Applying machine learning to real-world problems is a multi-stage, highly iterative, and exploratory process. Therefore, several frameworks were designed to support automating the Combined Algorithm Selection and Hyperparameter tuning (CASH) problem Shawi et al (2019); Zöller and Huber (2021); He et al (2019). These frameworks commonly formulate CASH as an optimization problem that can be solved by a variety of techniques. Let A = {A^(1), ..., A^(R)} be a set of machine learning algorithms, and let the hyperparameters of each algorithm A^(j) have a domain Λ^(j). Let λ ∈ Λ^(j) denote a vector of hyperparameters, and let A^(j)_λ denote A^(j) with its hyperparameters instantiated to λ. Given a dataset D, the CASH problem aims to find

A*_λ* ∈ argmin_{A^(j) ∈ A, λ ∈ Λ^(j)} L(A^(j)_λ, D_train, D_valid),

where L(A^(j)_λ, D_train, D_valid) measures the loss of a model generated by algorithm A^(j) with hyperparameters λ on training data D_train and evaluated on validation data D_valid.
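To make the formulation concrete, the following minimal Python sketch treats CASH as a search over (algorithm, hyperparameter) pairs and keeps the configuration with the lowest hold-out loss. The algorithm list, hyperparameter ranges, and number of trials are illustrative assumptions rather than the search space of any benchmarked framework.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)

# Illustrative search space: each algorithm A^(j) paired with a sampler for its domain Λ^(j).
search_space = [
    (RandomForestClassifier, lambda: {"n_estimators": int(rng.integers(10, 200)),
                                      "max_depth": int(rng.integers(2, 20))}),
    (SVC, lambda: {"C": float(10 ** rng.uniform(-2, 2)),
                   "gamma": float(10 ** rng.uniform(-4, 0))}),
]

best_loss, best_config = np.inf, None
for _ in range(30):  # the trial count plays the role of the time budget here
    algo, sample = search_space[rng.integers(len(search_space))]
    params = sample()
    model = algo(**params).fit(X_train, y_train)
    # L(A^(j)_λ, D_train, D_valid): validation loss of the instantiated algorithm
    loss = 1.0 - accuracy_score(y_valid, model.predict(X_valid))
    if loss < best_loss:
        best_loss, best_config = loss, (algo.__name__, params)

print(best_config, 1.0 - best_loss)
```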
One constraint of the CASH problem is the time budget. In particular, solving the CASH problem aims to select and tune a machine learning algorithm that achieves (near-)optimal performance in terms of a user-defined evaluation metric (e.g., accuracy, sensitivity, specificity, F1-score) within the user-defined time budget for the search process (Figure 1). Additionally, different AutoML frameworks make different design decisions. For example, SmartML Maher and Sakr (2019) adopts a meta-learning based mechanism to improve the performance of the automated search process by starting with the most promising classifiers that performed well on similar datasets in the past. As another example, AutoSKlearn Feurer et al (2015) offers an option to take a weighted average of the predictions of an ensemble composed of the top models trained during the optimization process. The Auto-Tuned Models (ATM) framework Swearingen et al (2017) restricts its default search space to only three classifiers, namely, decision tree, k-nearest neighbours, and logistic regression. In general, there is no good understanding of the impact of the various design decisions of the different AutoML frameworks on the performance of the output pipeline. In this work, we aim to answer the following four questions: (1) What is the impact of the time budget on the performance of different AutoML frameworks? Given a larger time budget, can AutoML frameworks guarantee consistent performance improvement?
(2) What is the impact of the search space size of the AutoML framework on the performance across different time budgets?
(3) Does meta-learning always yield consistent performance improvement across different time budgets? Is there a relationship between the characteristics of the datasets and the improvement caused by employing the meta-learning version of the AutoML framework?
(4) Does ensemble construction always yield better performance compared to single learners across different time budgets? Is there a relationship between the characteristics of the datasets and the improvement caused by employing the ensembling version of the AutoML framework?
This work is an extension of our initial work Eldeeb et al (2021), which mainly focused on studying the impact of different design decisions on the performance of AutoSKlearn. In particular, in this work, we follow a holistic approach to design and conduct a comparative study of six AutoML frameworks, namely, Auto-Weka Kotthoff et al (2017), AutoSKlearn, TPOT Olson and Moore (2016), Recipe de Sá et al (2017), ATM, and SmartML, focusing on comparing their general performance and their performance under various design decisions, including the time budget, the size of the search space, meta-learning, and ensembling. To ensure reproducibility, one of the main targets of this work, we provide access to the source code and the detailed results of the experiments of our study 1 .
The remainder of this paper is organized as follows. The related work is reviewed in Section 2. Section 3 provides an overview of the frameworks evaluated in our study. Section 4 describes our benchmark design. The evaluation of the general performance of the benchmark frameworks and the evaluation of the impact of different design decisions on their performance are presented in Section 5. We discuss the results and future directions in Section 6 before concluding the paper in Section 7.

Related Work
Recently, a few research efforts have attempted to tackle the challenge of benchmarking different AutoML frameworks Gijsbers et al (2019); He et al (2019); Shawi et al (2019); Truong et al (2019); Zöller and Huber (2021). In general, most experimental evaluation and comparison studies show that there is no clear winner, as there are always trade-offs that need to be considered and optimized according to user-defined objectives. For example, Gijsbers et al. Gijsbers et al (2019) conducted an experimental study comparing the performance of 4 AutoML frameworks, namely, Auto-Weka, AutoSKlearn, TPOT, and H2O LeDell and Poirier (2020), on 39 datasets across two time budgets (60 minutes and 240 minutes). The results showed that no single AutoML framework outperformed the others across the different time budgets, and on some datasets, none of the frameworks outperformed a Random Forest within the 4-hour time budget. Truong et al. Truong et al (2019) compared the performance of 7 AutoML frameworks, namely, H2O, Auto-keras Jin et al (2019), AutoSKlearn, Ludwig 2 , Darwin 3 , TPOT, and Auto-ml 4 , on 300 datasets across different time budgets. The results showed that no single framework outperformed all others on a plurality of tasks. Across the various evaluations and benchmarks, H2O, Auto-keras, and AutoSKlearn performed better than the rest of the frameworks. In particular, H2O slightly outperformed the other frameworks on binary classification and regression tasks while achieving poor performance on multi-class classification tasks. Auto-keras showed stable performance across all tasks and slightly outperformed the other frameworks on multi-class classification tasks while achieving poor performance on binary classification tasks. Zöller and Huber Zöller and Huber (2021) compared the performance of different optimization techniques, namely, Grid Search, Random Search, RObust Bayesian Optimization (ROBO) Klein et al (2017), Bayesian Tuning and Bandits (BTB) Smith et al (2020), hyperopt Bergstra et al (2013), SMAC Hutter et al (2011), BOHB Falkner et al (2018), and Optunity Smith et al (2020). The results showed that all optimization techniques achieved comparable performance and that a simple search algorithm such as random search did not perform worse than the other techniques. Thus, the study suggested that ranking optimization techniques on pure performance measures is not reasonable and that other aspects, such as scalability, should also be considered. The study also compared the performance of 5 AutoML frameworks, namely, TPOT, hpsklearn Komer et al (2014), AutoSKlearn, ATM, and H2O, on 73 real datasets. The study considered AutoSKlearn once with its default optimizer SMAC and once with SMAC replaced by random search.

To the best of our knowledge, our study is the first to investigate the impact of different AutoML design decisions on predictive performance. We benchmark six open-source, centralized and distributed AutoML frameworks, namely, Auto-Weka, AutoSKlearn, TPOT, Recipe, ATM, and SmartML, on 100 datasets from established AutoML benchmark suites. Recent benchmark studies focused only on comparing the performance of different AutoML frameworks, while we take a holistic approach to studying the impact of various design decisions, including the size of the search space, the time budget, meta-learning, and ensemble construction, on the performance of the AutoML frameworks.

AutoML Frameworks
This section introduces the AutoML frameworks evaluated in this study in terms of popularity (measured by the number of stars on GitHub), the machine learning toolbox used, the optimization technique, whether they use meta-learning to learn from previous experience, whether they perform post-processing (e.g., ensemble construction), whether they provide a graphical user interface (GUI), and whether they perform pre-processing. Table 1 briefly summarizes the comparison across the AutoML frameworks considered in this study. More detailed comparisons follow in the rest of this section.
Auto-Weka is the pioneering AutoML framework for classification and regression tasks. It is implemented in Java on top of Weka Hall et al (2009), a popular machine learning library offering a wide range of machine learning algorithms. Auto-Weka employs Bayesian optimization using SMAC Hutter et al (2011) and the tree-structured Parzen estimator (TPE) for algorithm selection and hyperparameter tuning. In particular, SMAC models the relationship between algorithm performance and a given set of hyperparameters by estimating the predictive mean and variance of the performance across the trees of a random forest model. TPE is a robust technique that quickly discards low-performing parameter configurations after evaluating a small number of dataset folds. In the reported experiments, SMAC showed better performance than TPE.
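As a rough illustration of the surrogate idea behind SMAC, the sketch below fits a random forest regressor on previously evaluated (configuration, loss) pairs and derives a predictive mean and variance from the per-tree predictions. The configuration encoding and the numbers are placeholders for illustration, not Auto-Weka's actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Observed history: hyperparameter configurations (encoded numerically) and their
# validation losses. Values are illustrative placeholders.
configs = np.array([[10, 2], [50, 5], [100, 8], [200, 12]], dtype=float)
losses = np.array([0.30, 0.22, 0.18, 0.19])

surrogate = RandomForestRegressor(n_estimators=100, random_state=0).fit(configs, losses)

candidate = np.array([[150, 10]], dtype=float)
# SMAC-style uncertainty estimate: empirical mean and variance of the per-tree predictions,
# which an acquisition function (e.g., expected improvement) would then use.
per_tree = np.array([tree.predict(candidate)[0] for tree in surrogate.estimators_])
mean, var = per_tree.mean(), per_tree.var()
print(f"predicted loss {mean:.3f} +/- {np.sqrt(var):.3f}")
```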
AutoSKlearn is a tool for automating the process of building machine learning pipelines for classification and regression tasks. AutoSKlearn is implemented on top of Scikit-Learn Buitinck et al (2013), a popular Python machine learning package, and uses SMAC for algorithm selection and hyperparameter tuning. AutoSKlearn uses meta-learning to initialize the optimization procedure. Additionally, ensemble selection is implemented by combining the best pipelines to improve the performance of the output model. AutoSKlearn supports different execution options including the vanilla version (AutoSKlearn-v), the meta-learning version (AutoSKlearn-m), the ensemble selection version (AutoSKlearn-e), and the full version (AutoSKlearn), where all options are enabled. AutoSKlearn-v supports only SMAC as a CASH solver, AutoSKlearn-m supports SMAC and meta-learning for warm-starting, AutoSKlearn-e supports SMAC and post-processing ensemble construction, and AutoSKlearn supports SMAC, meta-learning, and post-processing ensemble construction.
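As a usage illustration, the snippet below shows how the execution options studied here could be configured through auto-sklearn's Python API. The parameter names follow the 0.x releases used in this study (e.g., ensemble_size, initial_configurations_via_metalearning) and may differ in later versions; the budget value is illustrative.

```python
from autosklearn.classification import AutoSklearnClassifier

budget = 10 * 60  # 10-minute time budget, in seconds

# Vanilla version (AutoSKlearn-v): SMAC only, no warm start, no ensemble.
vanilla = AutoSklearnClassifier(
    time_left_for_this_task=budget,
    initial_configurations_via_metalearning=0,
    ensemble_size=1,
)

# Meta-learning version (AutoSKlearn-m): warm-start SMAC with configurations
# that worked well on similar datasets.
meta = AutoSklearnClassifier(
    time_left_for_this_task=budget,
    initial_configurations_via_metalearning=25,
    ensemble_size=1,
)

# Full version (AutoSKlearn): meta-learning plus post-hoc ensemble construction.
full = AutoSklearnClassifier(
    time_left_for_this_task=budget,
    initial_configurations_via_metalearning=25,
    ensemble_size=50,
)
# Each object is then used like any scikit-learn estimator:
# full.fit(X_train, y_train); full.predict(X_test)
```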
TPOT is an AutoML framework for building classification and regression pipelines based on genetic algorithms. TPOT supports all Scikit-Learn preprocessing, classification, and regression techniques. TPOT can create arbitrarily complex pipelines, which makes it prone to overfitting. To overcome this limitation, TPOT uses a multi-objective optimization technique to create pipelines that balance performance and complexity.
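A minimal TPOT invocation, with illustrative values for the population and budget parameters, might look as follows; the search stops when the time budget is exhausted and the best pipeline can be exported as plain Python code.

```python
from tpot import TPOTClassifier

# Genetic search over scikit-learn pipelines; max_time_mins bounds the search, and
# TPOT's multi-objective selection trades off accuracy against pipeline complexity.
tpot = TPOTClassifier(
    generations=100,     # upper bound; the run stops when max_time_mins is reached
    population_size=50,
    max_time_mins=10,    # 10-minute time budget
    random_state=0,
)
# tpot.fit(X_train, y_train)
# tpot.export("best_pipeline.py")   # writes the winning pipeline as Python code
```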
Recipe is an AutoML framework for building machine learning pipelines for classification tasks. Recipe follows the same optimization procedure as TPOT, which exploits the advantages of a global search. TPOT suffers from an unconstrained search problem in which resources can be spent on generating and evaluating invalid solutions. Recipe handles this problem by adding a grammar that reduces the generation of invalid pipelines and hence accelerates the optimization process. Recipe considers a larger search space of model configurations than AutoSKlearn and TPOT.
ATM is a collaborative service for optimizing machine learning pipelines for classification tasks. In particular, ATM supports parallel execution across multiple nodes/cores with a shared model hub that stores the results of these executions and improves the selection of pipelines that may outperform the currently chosen ones. ATM is based on a hybrid Bayesian optimization and multi-armed bandit technique to traverse the search space and report the target pipeline.
SmartML is the first AutoML R package for classification tasks. In the algorithm selection phase, SmartML employs a meta-learning approach in which the meta-features of the input dataset are extracted and compared to the meta-features of the datasets stored in the framework's knowledge base, which is populated from the results of previous runs. The similarity search identifies similar datasets in the knowledge base using a nearest-neighbour approach, and the retrieved results are used to identify the best-performing algorithms on those similar datasets in order to nominate candidate algorithms for the dataset at hand. The hyperparameter tuning of SmartML is based on SMAC. SmartML stores the results of new runs to continuously enrich its knowledge base and thereby further improve the accuracy of the similarity search and thus the performance and robustness of future runs. SmartML supports two execution options: the base version SmartML-m, which employs SMAC and meta-learning for algorithm selection and hyperparameter tuning, and the ensemble version SmartML-e, which employs meta-learning for warm-starting and a voting ensemble mechanism that averages the predicted probabilities of the top tuned models found during the optimization process based on their validation performance.

Benchmark Design
Each benchmark task consists of a dataset, a metric to optimize, and design decisions made by the user, including a specific time budget. We briefly explain our choices for each. Datasets We used 100 datasets collected from the popular OpenML repository Vanschoren et al (2013), which allows users to query data for different use cases. Detailed descriptions of the datasets used in this study are given in Table A in Appendix A. To evaluate the AutoML frameworks on a variety of dataset characteristics, we selected datasets according to different criteria, including the number of classes, the size of the dataset, and the feature dimensionality. Figure 2 shows the characteristics of the benchmark datasets based on the number of classes (Figure 2(a)), the number of features (Figure 2(b)), and the number of instances (Figure 2(c)). The datasets represent a mix of binary (50%) and multiclass (50%) classification tasks, and the size of the largest dataset is 643MB.
Performance metrics The benchmark can be run with a wide range of measures per the user's choice. The results reported in this paper are based on accuracy, and the AutoML frameworks are optimized for the same metric they are evaluated on. The measures are estimated with hold-out validation; each dataset is partitioned into two parts, 70% for training and 30% for testing. All AutoML frameworks are applied to the same training and testing splits on all datasets.
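As a sketch of the per-task evaluation protocol, the snippet below retrieves a dataset from OpenML, applies the 70%/30% hold-out split, and scores a fitted model with accuracy. The dataset id, the use of the openml Python client, and the random seed are illustrative assumptions.

```python
import openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Illustrative OpenML dataset id; the ids used in the study are listed in Appendix A.
dataset = openml.datasets.get_dataset(31)
X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)

# 70/30 hold-out split shared by all frameworks on a given dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Every framework is fit on the training split and scored with accuracy on the test split:
# model.fit(X_train, y_train)
# acc = accuracy_score(y_test, model.predict(X_test))
```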
Frameworks and design decisions The frameworks considered in this paper are selected based on ease of use, variety of underlying optimization techniques and machine learning toolboxes, popularity measured by the number of stars on GitHub, and citation count. All frameworks considered in this work are open source.
For AutoSKlearn, we consider three execution options, including the base version AutoSKlearn-m, the ensembling version AutoSKlearn-e, and the full version AutoSKlearn. For SmartML, we consider two execution options: the meta-learning version SmartML-m and the ensembling version SmartML-e. All AutoML frameworks were used with 4 different time budgets: 10, 30, 60, and 240 minutes. If an AutoML framework exceeds the time budget by more than 10%, the run is considered failed. We examined different design decisions, including the size of the search space, meta-learning, and ensemble construction as a post-processing step. We study the impact of these design decisions only for the AutoML frameworks that support configuring them. It is important to highlight that the frameworks do not share the same optimization technique, so no conclusion can be drawn about the optimizers themselves in this benchmark. We consider the following versions of the frameworks: Auto-sklearn 0.11.0, Auto-Weka 2.5, TPOT 0.11.6, Recipe 1.0, ATM 0.2.2, and SmartML 0.2.
Hardware choice and resource specifications Our experiments were conducted on Google Cloud machines; each machine is configured with 2 vCPUs, 7.5 GB RAM and ubuntu-minimal-1804-bionic. To avoid memory leakage, we have rebooted the machines after each run to ensure that each experiment has the same available memory size.

Experimental Evaluation
This section provides empirical evaluations of the different AutoML frameworks. We first compare the general performance of the different AutoML frameworks in Section 5.1. Next, we examine the impact of various design decisions on the performance of the different AutoML frameworks in Section 5.2.

General Performance Evaluation
In this section, we focus on evaluating and comparing the general performance of the benchmark frameworks. Our evaluation considers different aspects, including (a) the number of successful runs, (b) the time budget in which each framework achieves its best performance across the 100 datasets, (c) the significance of the performance difference between the frameworks across different time budgets, and (d) the robustness of the benchmark frameworks. Figure 3(a) shows the number of successful runs of each framework for the different time budgets. If an AutoML framework was unable to generate a model in a particular run, the run is considered failed. Generally, the results show that increasing the time budget increases the number of successful runs. AutoSKlearn achieves the largest number of successful runs across all time budgets, as shown in Figure 3(a). SmartML-e comes in second place in terms of the number of successful runs, followed by Auto-Weka and SmartML. The genetic-based frameworks, TPOT and Recipe, come in last place, as shown in Figure 3(a). For Recipe and TPOT, the number of successful runs improves steadily as the time budget increases, with their highest counts achieved in the longest time budget of 240 minutes.

We assess the impact of the time budget on the ability of each framework to achieve its best performance on the 100 datasets. The best performance is evaluated by comparing the performance of each framework on each dataset across the different time budgets and identifying the time budget in which the framework achieved the highest accuracy. In Figure 3(b), we report, for each time budget, the number of datasets on which each framework achieved its peak performance. The results show that all frameworks achieve their peak performance on the largest number of datasets in the longest time budget of 240 minutes. The results also show that the ability of Recipe, TPOT, Auto-Weka, SmartML-e, and AutoSKlearn to achieve their best performance steadily increases as the time budget increases.
On the other hand, for some frameworks, increasing the time budget leads to a significant drop in the number of datasets on which they achieve their peak performance. For example, the number of datasets on which AutoSKlearn-m reached its peak performance during the 10-minute time budget is higher than during the 30-minute and 60-minute time budgets. Similarly, the number of datasets on which SmartML-m achieved its best performance during the 10-minute time budget is higher than during the 30- and 60-minute time budgets, as shown in Figure 3(b).

We investigate pair-wise "outperformance" by calculating the number of datasets for which one framework outperforms another across different time budgets, shown in the heatmaps of Figure 4. One framework outperforms another on a dataset if it achieves at least 1% higher accuracy, representing a minimal threshold for performance improvement; two frameworks are considered to have the same performance on a task if their accuracies are within 1% of each other. In terms of "outperformance", it is worth mentioning that no single AutoML framework performs best across all 100 datasets on all time budgets. For example, for the 10-minute time budget, there are 2 datasets for which Recipe performs better than AutoSKlearn, despite these being the overall worst- and best-ranked frameworks, respectively, as shown in Figure 4(a). On average, the results show that AutoSKlearn comes in first place, outperforming other frameworks on the largest number of datasets across the different time budgets, followed by ATM, while Recipe comes in last place, as shown in Figure 4.
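The pair-wise counts in Figure 4 can be reproduced with a few lines of pandas, as sketched below; the data-frame layout and the accuracy values are illustrative placeholders.

```python
import pandas as pd

# acc: rows = datasets, columns = frameworks, values = test accuracy (in %)
# for a single time budget; this layout is an assumption for illustration.
acc = pd.DataFrame({
    "AutoSKlearn": [91.0, 84.5, 77.0],
    "TPOT":        [89.5, 85.2, 77.5],
    "ATM":         [90.2, 83.0, 76.8],
}, index=["dataset_a", "dataset_b", "dataset_c"])

wins = pd.DataFrame(0, index=acc.columns, columns=acc.columns)
for f1 in acc.columns:
    for f2 in acc.columns:
        if f1 != f2:
            # f1 "outperforms" f2 on a dataset if its accuracy is at least 1% higher;
            # differences below 1% count as ties.
            wins.loc[f1, f2] = (acc[f1] - acc[f2] >= 1.0).sum()
print(wins)
```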
The Wilcoxon signed-rank test Gehan (1965) was conducted to determine whether a statistically significant difference in accuracy exists between the AutoML frameworks over the different time budgets; the results are summarized in Table 2. The results of the Wilcoxon test confirm that there is no clear winner, and the significance of the performance difference among the frameworks can vary from one time budget to another. The ensembling version and the full version of AutoSKlearn statistically significantly outperform most of the other frameworks across all time budgets, while SmartML-m, SmartML-e, and Auto-Weka are significantly outperformed by the rest of the frameworks, as shown in Table 2.

We test the robustness of the evaluated AutoML frameworks as the ability of each framework to achieve the same results across different runs on the same input dataset. For each dataset, we run each AutoML framework 10 times with a 10-minute time budget. Figure 5 shows the robustness results: Recipe, Auto-Weka, SmartML-m, and all versions of AutoSKlearn obtain very stable performance, while TPOT, ATM, and SmartML-e are less stable.
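The significance tests reported in Table 2 (and in later sections) follow the standard Wilcoxon signed-rank procedure over paired per-dataset accuracies, which can be computed with SciPy as sketched below; the accuracy vectors are placeholders.

```python
from scipy.stats import wilcoxon

# Per-dataset accuracies of two frameworks under the same time budget
# (placeholder values; in the study each vector has one entry per successful dataset).
acc_framework_a = [0.91, 0.85, 0.77, 0.88, 0.93]
acc_framework_b = [0.89, 0.84, 0.75, 0.88, 0.90]

stat, p_value = wilcoxon(acc_framework_a, acc_framework_b)
if p_value < 0.05:
    print(f"significant difference (p={p_value:.3f})")
else:
    print(f"no significant difference (p={p_value:.3f})")
```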

Performance Evaluation of Different Design Decisions
In this section, we study the impact of different user design decisions, including the time budget (Section 5.2.1), the size of the search space (Section 5.2.2), meta-learning (Section 5.2.3), and ensembling (Section 5.2.4), on the performance of the different AutoML frameworks across different time budgets.

Impact of Time Budget
We study the impact of the time budget on the performance of the different AutoML frameworks to investigate how quickly they can output ML pipelines and whether they can guarantee consistent performance improvement given a larger time budget. We evaluate the accuracy of each framework bounded by four different time budgets: 10, 30, 60, and 240 minutes. Since the dataset sizes considered in this work do not exceed 1.5 million records, the maximum time budget (240 minutes) is enough for the different frameworks to converge. For each framework, the mean and standard deviation (SD) of the accuracy over all successful runs (N) for each time budget are reported in Table 3. The results show that for the smallest time budget of 10 minutes, TPOT and ATM achieve a comparable highest mean performance and lowest standard deviation. For the remaining time budgets, ATM achieves the highest mean performance and lowest standard deviation across all successful runs, as shown in Table 3. In contrast, SmartML-m achieves the lowest mean performance and highest standard deviation for all time budgets. Figures C1 to C10 in Appendix C show the impact of increasing the time budget for each AutoML framework on the 100 datasets. Table 4 reports the gain (g) or loss (l) in accuracy when increasing the time budget from 10 to 30 minutes, from 30 to 60 minutes, and from 60 to 240 minutes. The gain is measured by the mean accuracy improvement over all improved datasets and the maximum accuracy improvement achieved per framework. Similarly, the loss is measured by the mean accuracy loss over all declined datasets and the maximum accuracy loss over all declined datasets. When increasing the time budget from 10 to 30 minutes, Recipe achieves the highest mean gain of 13.6 on 2 datasets, followed by SmartML-m, while AutoSKlearn comes in last place with a mean gain of 3.2 on 17 datasets, as shown in Table 4. Auto-Weka and SmartML-m achieve the highest mean and maximum accuracy gains when increasing the time budget from 30 to 60 minutes and from 60 to 240 minutes, respectively. TPOT has the smallest mean and maximum accuracy losses, across 5 datasets, when increasing the time budget from 10 to 30 minutes. AutoSKlearn-e and AutoSKlearn-m have the lowest mean and maximum accuracy losses when increasing the time budget from 30 to 60 minutes and from 60 to 240 minutes, respectively, as shown in Table 4. It is noticeable that Recipe has the smallest number of datasets that witnessed performance improvement or degradation when increasing the time budget. AutoSKlearn-v has the largest number of datasets with performance improvement when increasing the time budget from 30 to 60 minutes and from 60 to 240 minutes, while AutoSKlearn-e has the largest number when increasing the time budget from 10 to 30 minutes. In contrast, ATM has the largest number of datasets with performance degradation when increasing the time budget from 10 to 30 minutes, while SmartML-e has the largest number when increasing the time budget from 30 to 60 minutes and from 60 to 240 minutes.
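The gain and loss statistics of Table 4 can be derived per framework from the per-dataset accuracies at two consecutive budgets, as sketched below with placeholder values; the column names and data are assumptions for illustration.

```python
import pandas as pd

# Accuracy (in %) of one framework per dataset at two budgets; values are placeholders.
df = pd.DataFrame({
    "acc_10min": [80.0, 72.5, 91.0, 66.0],
    "acc_30min": [83.2, 72.0, 91.0, 70.5],
}, index=["d1", "d2", "d3", "d4"])

delta = df["acc_30min"] - df["acc_10min"]
improved, declined = delta[delta > 0], delta[delta < 0]

# Gain: mean and maximum improvement over the improved datasets;
# loss: mean and maximum decline over the declined datasets.
print("gain:  mean %.1f, max %.1f, on %d datasets"
      % (improved.mean(), improved.max(), len(improved)))
print("loss:  mean %.1f, max %.1f, on %d datasets"
      % (declined.abs().mean(), declined.abs().max(), len(declined)))
```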
The Wilcoxon signed-rank test was conducted to determine whether a statistically significant difference in average accuracy exists for each AutoML framework when increasing the time budget; the results are summarized in Table 5. The average accuracy variations across the specified time budgets are not significantly high. Table 5 shows that the significance of the impact of increasing the time budget varies from one framework to another. For example, for AutoSKlearn-m, Recipe, ATM, SmartML-m, and SmartML-e, increasing the time budget does not lead to a significant performance impact, while such a significant impact is observed in most cases for Auto-Weka, TPOT, and all versions of AutoSKlearn except AutoSKlearn-m. These results show that end-users should always carefully consider the trade-off regarding the length of the time budget based on their specific goals.

Fig. 6 The frequency of using different machine learning models by the different AutoML frameworks.

Impact of the Size of Search Space
We study the impact of limiting the search space on the performance across different time budgets. In practice, the search space defines the structural paradigm that the different optimization methods can explore; thus, designing a good search space is a vital but challenging problem.

Fig. 7 The impact of limiting the search space size on each AutoML framework. Green markers represent better performance with the FC search space, blue markers mean that the difference between FC and 3C is < 1, red markers represent better performance with the 3C search space, and yellow markers represent failed runs.
Figure 6 provides an overview of the machine learning models most frequently used by the different AutoML frameworks. Analyzing the returned best-performing models shows that no single machine learning model is frequently used by all AutoML frameworks; however, tree-based models are the most frequent across all frameworks for all time budgets. For example, the pipelines returned by Auto-Weka, AutoSKlearn-v, and SmartML-m show that random forest is the most frequently used classifier, as shown in Figures 6(a), 6(c), and 6(e). The most frequent classifier for TPOT is extra trees, while the most frequent one for ATM is decision tree, as shown in Figures 6(f) and 6(b), respectively. ATM limits its default search space to only three classifiers, namely, k-nearest neighbours, decision tree, and logistic regression, in order to efficiently utilize the time budget. Some AutoML frameworks, such as AutoSKlearn, allow the end-user to configure the search space by specifying the classifiers that should be included. In some scenarios, AutoML users prefer a small search space that includes previously known well-performing or preferred (e.g., easily interpreted) classifiers. For the AutoML frameworks included in this work that allow configuring the search space, namely ATM, AutoSKlearn, and TPOT, we compare the accuracy of using the full search space including all available classifiers (FC) to the accuracy when limiting the search space to 3 classifiers (3C) with a 30-minute time budget; the results are summarized in Figure 7. The 3 classifiers are identified by selecting the best-performing three classifiers supported by all AutoML frameworks over the 100 datasets: support vector machine, random forest, and decision tree. For AutoSKlearn, the results show that FC outperforms 3C on 28 datasets with an average accuracy gain of 3.3%, while 3C outperforms FC on 21 datasets by 5.9%, as shown in Figure 7(a). On 50 datasets, FC and 3C achieve comparable performance with accuracy differences of less than 1%, as shown in Figure 7(a). For TPOT, the results show that 23 datasets failed to run with 3C and 20 datasets failed to run with FC, while 12 datasets failed to run with both FC and 3C, as shown in Figure 7(b). For the successful runs, TPOT achieved better performance on 21 datasets using FC compared to 3C, with an average accuracy improvement of 9.6%, while the performance of the two search spaces is comparable on 18 datasets (see Figure 7(b)). ATM failed to run on 19 datasets and 10 datasets using 3C and FC, respectively, while 19 datasets failed to run on ATM with both search spaces, as shown in Figure 7(c). For the successful runs, ATM achieved better performance using 3C over FC on 17 datasets with an average accuracy improvement of 4%. On the other hand, ATM achieved better performance using FC compared to 3C on 15 datasets with an average accuracy improvement of 9.3%, while the performance of ATM with both search spaces was comparable on 22 datasets, as shown in Figure 7(c). The Wilcoxon signed-rank test was conducted to determine whether a statistically significant difference in accuracy across all datasets exists between using FC and 3C on AutoSKlearn, TPOT, and ATM.
For TPOT, the results showed that the difference in performance between the two search spaces is statistically significant with more than 95% level of confidence (p-value=0.003). In contrast, for AutoSKlearn and ATM, the results showed that there was no statistically significant difference in performance between the two search spaces.
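For reference, restricting the search space to the three classifiers of the 3C setting can be configured as sketched below. The auto-sklearn component identifiers follow the 0.x API and the TPOT hyperparameter grids are illustrative assumptions, not necessarily the exact configurations used in this study.

```python
from autosklearn.classification import AutoSklearnClassifier
from tpot import TPOTClassifier

# auto-sklearn (0.x API): restrict the search to the three 3C classifiers.
askl_3c = AutoSklearnClassifier(
    time_left_for_this_task=30 * 60,
    include_estimators=["random_forest", "libsvm_svc", "decision_tree"],
)

# TPOT: pass a custom configuration dictionary containing only the three classifiers;
# the hyperparameter ranges below are placeholders.
tpot_3c_config = {
    "sklearn.ensemble.RandomForestClassifier": {"n_estimators": [100], "max_depth": range(2, 11)},
    "sklearn.svm.SVC": {"C": [0.1, 1.0, 10.0]},
    "sklearn.tree.DecisionTreeClassifier": {"max_depth": range(2, 11)},
}
tpot_3c = TPOTClassifier(max_time_mins=30, config_dict=tpot_3c_config)
```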

Impact of Meta-learning
Meta-learning can be defined as the process of learning from previous experience gained by applying various learning algorithms to different machine learning tasks, thereby reducing the time needed to learn new tasks Vanschoren (2018). In the following, we study the impact of meta-learning on the performance of the AutoML frameworks that allow enabling and disabling it; the only framework that supports configuring meta-learning is AutoSKlearn. Furthermore, we investigate the relationship between the characteristics of the different datasets and the improvement caused by employing the vanilla version or the meta-learning version of AutoSKlearn.
AutoSKlearn applies a meta-learning mechanism based on a knowledge base storing the meta-features of datasets as well as the best-performing pipelines on these datasets. AutoSKlearn uses 38 meta-features, including statistical, information-theoretic, and simple meta-features. In an offline phase, the meta-features and the empirically best-performing pipelines are stored for each dataset in the repository (140 datasets from the OpenML repository). In an online phase, for any new dataset, the framework extracts the meta-features of the new dataset, searches for the most similar datasets in the knowledge base, and returns the top k best-performing pipelines on these similar datasets. These k pipelines are used as a warm start for the Bayesian optimization algorithm used in the optimization process. To assess the impact of the meta-learning mechanism, we compare the performance of AutoSKlearn-v
and AutoSKlearn-m on 100 datasets across different time budgets, as shown in Figure 8.

Fig. 8 The impact of meta-learning over all time budgets. Green markers represent better performance with AutoSKlearn-m, blue markers mean that the absolute difference between the vanilla and meta-learning versions is < 1, red markers represent better performance using AutoSKlearn-v, and yellow markers represent failed runs.

The results show that using meta-learning is not necessarily associated with an improvement in performance. On average, the performance of the vanilla and the meta-learning versions is very comparable for the 4 time budgets. In particular, both versions have similar performance on 64, 55, 65, and 69 datasets for the 10-, 30-, 60-, and 240-minute time budgets, respectively. Table 6 summarizes the performance of AutoSKlearn-m and AutoSKlearn-v, in addition to the number of datasets that achieved a performance improvement by employing AutoSKlearn-m over AutoSKlearn-v on the different time budgets. The number of datasets that achieved a performance improvement by employing the meta-learning version is substantially higher than the number that achieved an improvement by employing the vanilla version for the 10-minute and 30-minute time budgets. For the 60-minute time budget, the numbers of datasets that achieved an improvement by employing the meta-learning and the vanilla versions are equal, while for the 240-minute time budget, the number of datasets that achieved an improvement by employing the vanilla version is higher than that for the meta-learning version. The results show that employing meta-learning achieved a mean improvement of 2.9% on 25 and 27 datasets over the 10-minute and 30-minute time budgets, respectively. For the 60-minute and 240-minute time budgets, the mean performance improvement achieved by employing meta-learning is higher than that achieved by employing the vanilla version, as shown in Table 6. We use the Wilcoxon statistical test to assess the significance of the performance difference between the vanilla version and the meta-learning version. The results show that the impact of meta-learning is statistically significant only for the smallest time budget of 10 minutes, with more than a 95% level of confidence (p-value=0.004).

In the following, we investigate whether there exists a relationship between the characteristics of the datasets and the improvement in accuracy as a result of utilizing either the meta-learning version or the vanilla version of AutoSKlearn over different time budgets. To achieve this goal, we study different groups of datasets achieving consistent performance improvement given more time budget. Figure 9 tracks the improvement in accuracy induced by employing the meta-learning version over the vanilla version (Figure 9(a)) and the vanilla version over the meta-learning version (Figure 9(b)) across different time budgets. For readability, we partitioned the datasets into 4 groups denoted Group-xy(i), where i ∈ {1, ..., 4} is the index of the time budgets included in this work, sorted in ascending order, and x, y are versions of AutoSKlearn. Group-xy(1) represents the group of datasets achieving a performance improvement using AutoSKlearn version x over AutoSKlearn version y during the 10-minute time budget. Group-xy(i) for i > 1 represents the group of datasets achieving a performance improvement using AutoSKlearn version x over AutoSKlearn version y in time budget i, such that these datasets do not appear in Group-xy(j), where 1 ≤ j < i. As shown in Figure 9(a), out of the 25 datasets in Group-mv(1), 13, 5, and 1 dataset(s) further improved due to meta-learning for the 30-, 60-, and 240-minute time budgets, respectively.
For the 14 datasets in Group-mv(2), the performance of 4 and 2 datasets improved during the 60- and 240-minute time budgets, respectively. Finally, for Group-mv(3), the performance of only 2 datasets further improved over the 240-minute budget. The dataset dataset 9 autos is the only one that witnessed an improvement in accuracy using AutoSKlearn-m over all time budgets. Overall, as the time budget increases, the number of datasets that achieve higher accuracy due to meta-learning decreases significantly from the 30- to the 60-minute and from the 60- to the 240-minute time budgets. The results in Figure 9(b) show that for the 10 datasets in Group-vm(1), the performance of 5 and 2 datasets improved during the 30 and 60 minutes, respectively. For the 12 datasets in Group-vm(2), the performance of 4 datasets improved in the 60-minute time budget, while the performance of only 1 dataset further improved in the 240-minute time budget. For the 11 datasets in Group-vm(3), the performance of 4 datasets further improved in the 240-minute time budget. In general, the results in Figure 9 show that most of the datasets do not continue to achieve performance improvement given more time budget. Four datasets, namely, micro-mass, solar-flare 1, AP Omentum Ovary, and dataset 40 sonar, achieved higher accuracy using AutoSKlearn-v than AutoSKlearn-m across 3 time budgets. Analyzing the meta-features of these datasets reveals that there is no association between the utilized version of AutoSKlearn and the performance. Across all time budgets, no common datasets improved with both AutoSKlearn-v and AutoSKlearn-m given more time budget.
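For reference, the warm-starting mechanism described at the beginning of this section can be summarized by the following sketch: meta-features of the new dataset are matched against a knowledge base with a nearest-neighbour search, and the best pipelines of the most similar datasets seed the optimizer. The meta-feature vectors, pipeline descriptions, and k are illustrative placeholders.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy knowledge base: meta-feature vectors (e.g., #instances, #features, #classes, class entropy)
# and the best-performing configuration recorded for each stored dataset. All values illustrative.
# In practice the meta-features would be normalized before computing distances.
kb_metafeatures = np.array([
    [1000,   20, 2, 0.99],
    [50000,  54, 7, 2.10],
    [300, 10000, 2, 0.68],
])
kb_best_pipelines = ["random_forest(...)", "gradient_boosting(...)", "linear_svc(...)"]

def warm_start_configs(new_metafeatures, k=2):
    """Return the best pipelines of the k datasets most similar to the new one."""
    nn = NearestNeighbors(n_neighbors=k).fit(kb_metafeatures)
    _, idx = nn.kneighbors([new_metafeatures])
    return [kb_best_pipelines[i] for i in idx[0]]

# These configurations would then seed (warm-start) the Bayesian optimizer.
print(warm_start_configs([800, 30, 2, 0.95]))
```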

Impact of Ensembling
Ensembling Dietterich (2000) is the process of combining multiple machine learning base models for the same task to produce a better predictive model. These base models can be combined using different techniques, including simple voting (averaging), weighted voting (averaging), bagging, and boosting Dietterich (2000). In the following, we explore the impact of ensembling on the performance of the AutoML frameworks that allow enabling and disabling post-processing ensemble construction; these frameworks are AutoSKlearn and SmartML. Furthermore, we investigate whether there is a relationship between the characteristics of the different datasets and the improvement caused by employing the vanilla version or the ensembling version of the AutoML framework. During the optimization process of AutoSKlearn and SmartML, the frameworks store the generated models instead of keeping only the best-performing one. These models are used in a post-processing phase to construct an ensemble model. This automatic ensemble construction avoids relying on a single hyperparameter setting, which makes the generated model more robust to overfitting.

Fig. 10
The performance difference between AutoSKlearn-e and AutoSKlearn-v over different time budgets. Green markers represent better performance with AutoSKlearn-e, blue markers mean that the absolute difference from the vanilla version is < 1, and red markers represent better performance using AutoSKlearn-v.

AutoSKlearn uses ensemble selection Caruana et al (2004), while SmartML uses majority voting Lam and Suen (1997). Ensemble selection is a greedy technique that starts with an empty ensemble and iteratively adds base models in a way that maximizes the validation performance. The technique uses uniform weights but allows repetitions. Majority voting is considered the simplest and most effective scheme; it adheres to democratic principles, i.e., the class with the most votes wins. We kept the default settings of AutoSKlearn and SmartML, using 50 and 5 base models in the ensemble, respectively. To assess the impact of ensembling, we compare the performance of the vanilla/base version of each of AutoSKlearn and SmartML to their ensembling versions on 100 datasets across different time budgets.
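A compact sketch of Caruana-style greedy ensemble selection on held-out predicted probabilities is shown below; it illustrates the general technique rather than AutoSKlearn's implementation, and the validation data are randomly generated placeholders.

```python
import numpy as np

def greedy_ensemble_selection(model_probs, y_valid, ensemble_size=50):
    """Greedily pick models (with repetition allowed) whose averaged probabilities
    maximize validation accuracy, in the spirit of Caruana-style ensemble selection."""
    ensemble, summed = [], np.zeros_like(model_probs[0])
    for _ in range(ensemble_size):
        best_idx, best_acc = None, -1.0
        for i, probs in enumerate(model_probs):
            candidate = (summed + probs) / (len(ensemble) + 1)
            acc = np.mean(candidate.argmax(axis=1) == y_valid)
            if acc > best_acc:
                best_idx, best_acc = i, acc
        ensemble.append(best_idx)
        summed += model_probs[best_idx]
    return ensemble  # indices of the selected base models (repetitions allowed)

# model_probs: list of (n_valid_samples, n_classes) probability arrays, one per trained model.
rng = np.random.default_rng(0)
y_valid = rng.integers(0, 2, size=100)
model_probs = [rng.dirichlet([1, 1], size=100) for _ in range(5)]
print(greedy_ensemble_selection(model_probs, y_valid, ensemble_size=5))
```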
AutoSKlearn: Figure 10 shows the performance difference between AutoSKlearn-e and AutoSKlearn-v over the 10-, 30-, 60-, and 240-minute time budgets. The results show that ensembling does not always contribute to better performance compared to the vanilla version. The two versions of AutoSKlearn achieve comparable performance on 65, 62, 66, and 63 datasets for the 10-, 30-, 60-, and 240-minute time budgets, respectively. The results show that employing ensembling in AutoSKlearn achieved a mean improvement of 4.1%, 3.4%, 4.9%, and 7.1% on 22, 27, 24, and 14 datasets over the 10-, 30-, 60-, and 240-minute time budgets, respectively, as shown in Table 7. We use the Wilcoxon statistical test to assess the significance of the performance difference between AutoSKlearn-e and AutoSKlearn-v. The results show that ensembling enhances the performance with a statistically significant gain at more than a 95% level of confidence (p-value < 0.05) on all 4 time budgets; the level of confidence is almost 99% over all the time budgets combined.

Fig. 11 The performance difference between SmartML-m and SmartML-e. Green markers represent better performance with SmartML-e, blue markers mean that the absolute difference from the base version is < 1, and red markers represent better performance using SmartML-m.

SmartML: Figure 11 shows the performance difference between SmartML-e and SmartML-m over different time budgets. The results show that ensembling does not necessarily contribute to an improvement in performance compared to the base version. The two versions of SmartML achieve comparable performance on almost half of the datasets. SmartML-e improved the performance over SmartML-m by an average accuracy gain of 12.6%, 12.5%, 11.2%, and 10.2% on 30, 33, 28, and 28 datasets for the 10-, 30-, 60-, and 240-minute time budgets, respectively, as shown in Table 7. We use the Wilcoxon statistical test to assess the significance of the performance difference between the base (meta-learning) and the ensembling versions of SmartML. The results show that the ensembling version enhances the performance with a statistically significant gain at more than a 95% level of confidence (p-value < 0.05) on all 4 time budgets. In the following, we investigate whether there exists a relationship between the characteristics of the datasets and the improvement in accuracy as a result of utilizing either the ensembling version or the vanilla/base version of AutoSKlearn and SmartML over different time budgets. To achieve this goal, we study the different groups of datasets achieving consistent performance improvement given more time budget.
AutoSKlearn: Figure 12 tracks the improvement in accuracy induced by employing the ensembling version over the vanilla version (Figure 12(a)) or the vanilla version over the ensembling version (Figure 12(b)) across different time budgets. As shown in Figure 12(a), for the 22 datasets in Group-ev(1), the performance of 12, 6, and 3 datasets improved in the 30-, 60-, and 240-minute time budgets, respectively. For the 15 datasets in Group-ev(2), the performance of 8 and 2 datasets improved in the 60- and 240-minute time budgets, respectively. As shown in Figure 12(a), 4 datasets in Group-ev(3) achieved a performance improvement in the 240-minute time budget. Overall, the number of datasets that consistently improve using ensembling declines over time. The datasets AP Omentum Ovary, leukemia, and dbworld-bodies-stemmed achieve better performance with AutoSKlearn-e than with AutoSKlearn-v across all 4 time budgets. These datasets are characterized by a large number of features and a small number of instances. Figure 12(b) shows that out of the 12 datasets in Group-ve(1), 4 datasets continued to improve in the 30-minute time budget, and only 1 dataset continued to improve in both the 60- and 240-minute time budgets. None of the datasets in Group-ve(2) improved in the 60- and 240-minute time budgets, while only 2 datasets in Group-ve(3) improved in the 240-minute time budget. Only a single dataset, dataset 40 sonar, was common to all groups across all time budgets. Analyzing the meta-features of these datasets reveals no association between the meta-features and the performance improvement.
SmartML: In contrast to AutoSKlearn-e, the number of datasets improving using SmartML-e over SmartML-m is almost consistent over time. Figure 13 tracks the accuracy improvement induced by using SmartML-e over SmartML-m. For the 30 datasets in Group-em(1), the performance of 24, 21, and 20 datasets improved in the 30-, 60-, and 240-minute time budgets, respectively, as shown in Figure 13(a). The performance of 20 datasets consistently improved using SmartML-e during all four time budgets. Figure 13(b) shows that out of the 14 datasets in Group-me(1), 10 datasets continued to improve in the 30- and 60-minute time budgets and 9 datasets continued to improve in the 240-minute time budget. For the 6 datasets in Group-me(2), only 1 dataset improved in the 60-minute time budget. Studying the characteristics of the datasets that improved when employing ensembling, we found that 55% of them have roughly double the average number of instances of the rest of the datasets.

Discussion and Future Direction
The global accuracy average shows that ATM and TPOT achieve the highest performance and perform quite similarly across different time budgets, with a maximum performance difference of 1%, while SmartML-m comes in last place. Overall, AutoSKlearn yields results that are not significantly worse than the best frameworks while achieving the highest number of successful runs across different time budgets and witnessing performance improvement on the largest number of datasets when increasing the time budget. Our analysis reveals that the impact of meta-learning declines over longer time budgets (i.e., 60 and 240 minutes). This insight calls for developing novel and more efficient meta-learning techniques that can significantly improve the performance of the optimization process. In contrast, ensembling achieves consistent performance improvement across all time budgets. Generally, all AutoML frameworks considered in this work build pipelines with an average length of 2. TPOT yields the shortest pipelines, with an average length of 1.5. A possible explanation is that TPOT generates pipelines that optimize both performance and complexity.
For some datasets, the performance of the different versions of AutoSKlearn varies significantly across different iterations, as shown in Table 8. These datasets are characterized by having far fewer instances than features. Analyzing the pipelines of the different versions of AutoSKlearn on these datasets across multiple iterations shows that the data preprocessing component is responsible for the large performance variance between the different pipelines. For example, the performance difference between AutoSKlearn-v and AutoSKlearn-m on dbworld-bodies (bodies) varies between 6% and 13% across different iterations, as shown in Table 8. The two generated pipelines for AutoSKlearn-v and AutoSKlearn-m used the same model (lda) with the same set of hyperparameters but different preprocessors. The performance difference between AutoSKlearn-v and AutoSKlearn-m on rsctc2010 3 varies between 4% and 9% across different iterations. Analyzing the pipelines of AutoSKlearn-v and AutoSKlearn-m on rsctc2010 3 reveals that they share the same model (Gaussian naive Bayes) and the same set of hyperparameters but have different preprocessors. For large datasets, meta-learning shows significant performance improvement. For example, AutoSKlearn-m achieves significantly better performance than AutoSKlearn-v on CovPokElec (see Table 8). A possible explanation is that meta-learning warm-starts the optimization process and increases the chances of finding a well-performing configuration within the limited attempts during the defined time budget.
Specifying the time budget needs to be considered carefully, as significantly increasing the time budget for the search process (e.g., from 60 minutes to 240 minutes) may not lead to a significant improvement in accuracy. This decision varies from one scenario/application to another. For some applications, spending a long budget to achieve an additional accuracy of 1% can be crucially important; for other applications, it could be less important, and reducing the allocated time budget may be favoured. Carefully selecting a small search space with a few top-performing classifiers can lead to a very comparable performance with a search space that includes a large number of classifiers, as is the case for the AutoSKlearn and ATM frameworks. Currently, the majority of the AutoML frameworks focus on supervised learning, with very little work on unsupervised learning Tschechlov et al (2021); ElShawi et al (2021); Guyon et al (2016). Clustering differs from existing classification and regression problems and poses new challenges, including cluster evaluation, as no ground truths exist for real-world datasets, leading to obscure objective functions to optimize. Dedicated research in the area of clustering could contribute to advancement in AutoML. Most of the current work on AutoML considers automating the preprocessing, algorithm selection, and hyperparameter tuning while ignoring the feature engineering part. In practice, the feature engineering part consumes most of the time to build ML pipelines and significantly affects the performance. The right feature engineering phase could turn the feature space into a linearly separable space, so even naive classifiers could achieve relatively high accuracy. On the other hand, skipping this phase or using the wrong feature engineering preprocessors makes it harder to achieve relatively high accuracy, even for the most efficient classifiers. Hence, further research in this area can improve the overall performance of the resulting AutoML pipelines.

Conclusion
In this paper, we present a comprehensive evaluation and comparison of the performance characteristics of six AutoML frameworks on 100 datasets from OpenML. Our analysis revealed that there is no single winning framework that outperforms all others over all time budgets. Across the various evaluations, AutoSKlearn, ATM, and TPOT are the top-performing frameworks. The results also show that the genetic-based frameworks (TPOT and Recipe) have high failure rates for short time budgets, while their success rates steadily increase as the time budget increases. Meta-learning has a significant impact on small time budgets, and this impact declines as the time budget increases. In contrast, ensembling consistently and significantly improves performance across all time budgets. Furthermore, carefully selecting a small search space with a few top-performing classifiers can lead to a performance comparable to that of a search space that includes many classifiers. Finally, increasing the time budget does not necessarily improve predictive performance. We believe that the results of our analysis are beneficial for guiding and improving the design process of future AutoML techniques.

Data Availability
The datasets generated during and/or analysed during the current study are available in the AutoMLBench repository, https://datasystemsgrouput.github.io/AutoMLBench/datasets.

Fig. C1
The impact of increasing the time budget on AutoSKlearn-v performance from x to y minutes (x-y). Green markers represent better performance with the y time budget, blue markers mean that the difference between x and y is < 1, and red markers represent better performance with the x time budget.

Fig. C3
The impact of increasing the time budget on AutoSKlearn-e performance from x to y minutes (x-y). Green markers represent better performance with the y time budget, blue markers mean that the difference between x and y is < 1, and red markers represent better performance with the x time budget.

Fig. C4
The impact of increasing the time budget on AutoSKlearn performance from x to y minutes (x-y). Green markers represent better performance with the y time budget, blue markers mean that the difference between x and y is < 1, and red markers represent better performance with the x time budget.

Fig. C6
The impact of increasing the time budget on ATM performance from x to y minutes (x-y). Green markers represent better performance with the y time budget, blue markers mean that the difference between x and y is < 1, and red markers represent better performance with the x time budget.

Fig. C7
The impact of increasing the time budget on SmartML performance from x to y minutes (x-y). Green markers represent better performance with the y time budget, blue markers mean that the difference between x and y is < 1, and red markers represent better performance with the x time budget.

Fig. C9
The impact of increasing the time budget on AutoWeka performance from x to y minutes (x-y). Green markers represent better performance with the y time budget, blue markers mean that the difference between x and y is < 1, and red markers represent better performance with the x time budget.

Fig. C10
The impact of increasing the time budget on Recipe performance from x to y minutes (x-y). Green markers represent better performance with the y time budget, blue markers mean that the difference between x and y is < 1, and red markers represent better performance with the x time budget.