Grammar-based evolutionary approach for automated workflow composition with domain-specific operators and ensemble diversity

The process of extracting valuable and novel insights from raw data involves a series of complex steps. In the realm of Automated Machine Learning (AutoML), a significant research focus is on automating aspects of this process, specifically tasks like selecting algorithms and optimising their hyper-parameters. A particularly challenging task in AutoML is automatic workflow composition (AWC). AWC aims to identify the most effective sequence of data preprocessing and ML algorithms, coupled with their best hyper-parameters, for a specific dataset. However, existing AWC methods are limited in how many and in what ways they can combine algorithms within a workflow. Addressing this gap, this paper introduces EvoFlow, a grammar-based evolutionary approach for AWC. EvoFlow enhances the flexibility in designing workflow structures, empowering practitioners to select algorithms that best fit their specific requirements. EvoFlow stands out by integrating two innovative features. First, it employs a suite of genetic operators, designed specifically for AWC, to optimise both the structure of workflows and their hyper-parameters. Second, it implements a novel updating mechanism that enriches the variety of predictions made by different workflows. Promoting this diversity helps prevent the algorithm from overfitting. With this aim, EvoFlow builds an ensemble whose workflows differ in their misclassified instances. To evaluate EvoFlow's effectiveness, we carried out empirical validation using a set of classification benchmarks. We begin with an ablation study to demonstrate the enhanced performance attributable to EvoFlow's unique components. Then, we compare EvoFlow with other AWC approaches, encompassing both evolutionary and non-evolutionary techniques. Our findings show that EvoFlow's specialised genetic operators and updating mechanism substantially outperform current leading methods.


Introduction
Organisations have accumulated vast amounts of data from diverse sources over the years. However, many of these organisations, particularly small and medium-sized ones, do not analyse these historical data, thus missing valuable insights into their operational activities [1]. Extracting useful and novel knowledge from such data is a complex process involving various phases, including problem domain analysis, data integration, dataset preprocessing, model building and deployment, interpretation of results, and decision-making based on study findings [2].
Although some phases remain inherently human-centric, other phases are prime candidates for (partial) automation [3]. A notable example is the automatic selection of the most appropriate algorithm for model building [4]. Recently, this idea of automating ML tasks has been formalised in the area of Automated Machine Learning (AutoML) [5]. AutoML provides data scientists with a broader range of alternatives, enabling them to focus on phases that require their expertise and intuition, ultimately bridging the gap between knowledge discovery and domain experts [6,7]. Recent studies have even shown that AutoML can outperform humans in specific tasks, such as designing artificial neural network architectures [8,9].
Within AutoML, algorithm selection and hyper-parameter optimisation are two of the most commonly addressed tasks [10]. Algorithm selection is typically used to predict the optimal model building algorithm [11], e.g. a classifier [12]. To a lesser extent, it has also been applied to recommend preprocessing algorithms, e.g. the best feature selection algorithm [13]. Hyper-parameter optimisation is predominantly used to fine-tune specific algorithms. However, it should be noted that selecting the best algorithm(s) and potentially optimising their hyper-parameters individually for each phase could impact overall performance. These algorithms are often part of a more complex ML workflow, where algorithms are sequenced and interact through their outputs. Therefore, relationships, synergies, and constraints among them must be comprehensively examined as a whole. To alleviate these shortcomings, some authors have proposed automating workflow composition, involving two or more phases of the knowledge discovery process [14,15]. Automated workflow composition (AWC) refers to the process of finding the sequence of data processing steps, which typically include data preprocessing, feature selection, and machine learning algorithms, optionally tuning their hyper-parameters, that provides the best performance for a particular machine learning (ML) task (e.g. classification) [5]. AWC represents a challenging task [16] that is commonly approached as an optimisation problem aiming at constructing a workflow that optimally prepares and analyses a dataset. Since a workflow can be specified as an open template in which any combination of preprocessing algorithms is followed by an ML algorithm to build the decision model, the size of the search space can be extremely large. For instance, if we have a high-dimensional classification dataset with missing values, an ideal workflow might start with an imputation algorithm to handle missing data (e.g. nearest neighbour imputation), followed by a dimensionality reduction algorithm (e.g. PCA), and conclude with a classifier to make predictions (e.g. random forest). This workflow is not just a set of algorithms but a sequence where each step is logically and functionally connected to the next, ensuring effective processing of the source dataset. Notice that, considering the broad catalogue of algorithms, ways to combine them, and hyper-parameter options, finding the workflow that best fits the input data is time-consuming, lasting several hours or even days.
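The example workflow above maps naturally onto a scikit-learn Pipeline. The following sketch, whose hyper-parameter values are arbitrary and chosen purely for illustration, mirrors that sequence of imputation, dimensionality reduction, and classification:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

# Workflow from the text: nearest-neighbour imputation, then PCA,
# then a random forest classifier. Hyper-parameter values are illustrative.
workflow = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("reduce", PCA(n_components=5)),
    ("classify", RandomForestClassifier(n_estimators=50, random_state=0)),
])

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X[::10, 0] = np.nan          # inject missing values into one feature
workflow.fit(X, y)
train_accuracy = workflow.score(X, y)
```

In AWC, the number, type, and order of such steps, as well as every hyper-parameter value, are all subject to optimisation.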
Proposals for AWC often utilise Bayesian optimisation (BO) [17] and evolutionary algorithms (EAs) [18]. BO is a sequential, model-based optimisation method aimed at reducing the number of evaluations necessary. It does this by selecting the most promising solution(s) for evaluation in each iteration. However, BO-based approaches usually have a predefined structure for the workflows they generate [6,15], which might restrict their use in certain scenarios. For example, Feurer et al. [6] propose optimising workflows that include a feature preprocessor from thirteen options, up to three data preprocessors selected from four alternatives, and a classifier. Similar approaches are seen in recent BO-based proposals, which tend to create even larger workflows [19,20]. In contrast, an EA, an optimisation technique inspired by natural evolution [18], aims to iteratively enhance a set of solutions (a population) for an optimisation problem. Such an improvement is achieved using variation operators like crossover and mutation, which alter the structure of the individuals, namely their genotype. The objective here is to generate increasingly better adapted individuals, using a fitness function tailored to the problem to evaluate their quality.
EA-based approaches for AWC tend to compose workflows with more flexible structures [21,22], although they still tend to predefine the order and type of preprocessing algorithms applied at each step [23]. Regardless of the technique used, a common practice in the literature is to construct ensembles from the best-performing workflows [6]. The purpose is to mitigate overfitting, generally enhancing the generalisation capability of the final ensemble [24]. Typically, ensembles are built by selecting the top n workflows based on predictive performance. However, constructing ensembles solely from workflows with similar predictions may not provide any advantage over using the best workflow, as they could misclassify the same instances. This is especially a concern in evolutionary approaches to AWC, where the optimisation process can lead to a convergence of the population, resulting in workflows that make very similar predictions. Therefore, it seems important to maintain prediction diversity among the selected workflows, offering an advantage over the option of choosing only the best workflow.
In this paper, we introduce EvoFlow, an approach based on grammar-guided genetic programming (G3P) that enables more flexible and domain-specific workflow structure definitions. G3P is a form of EA in which individuals are represented as tree structures, and their construction is guided by a predefined grammar [25]. This grammar ensures the generation of valid individuals that conform to specific syntactic rules, making G3P particularly effective for problems like AWC where the solution structure is complex. For example, the grammar can prevent the creation of workflows where a model building algorithm incorrectly precedes a preprocessing algorithm. The grammar can also be adapted to only include interpretable algorithms such as logistic regression or associative classifiers. Specifically, EvoFlow enhances flexibility and adaptability by allowing workflows to comprise any number of preprocessing steps of any type and order. As we are not imposing restrictions on the resulting algorithm sequence, the solution space is clearly expanded. Notably, G3P has been used by some previous proposals in this domain [23,22].
EvoFlow introduces two novel features that set it apart in the field of AWC. First, unlike other evolutionary proposals that rely on traditional GP operators [21,23], it implements variation operators like crossover and mutation specifically designed for the AWC problem. These operators consider both the structure and hyper-parameters of workflows, offering a more tailored approach to workflow optimisation. Second, EvoFlow integrates an update mechanism that emphasises diversity in workflow predictions, rather than focusing solely on the best predictive performance. This strategy of promoting prediction diversity is a distinctive innovation compared to other AWC proposals that build ensembles [6], enhancing the robustness and generalisability of the results. We observed that as optimisation converges, the generated workflows tend to produce highly similar predictions, even when not composed of similar algorithms. To mitigate this limitation, our update mechanism constructs an ensemble considering not only the fitness of the workflows (i.e. predictive performance) but also the diversity of their predictions [26], thereby mitigating the risk of overfitting. To guide the empirical validation of our proposal, we formulate three research questions (RQs):
• RQ1. How do the AWC-specific genetic operators and ensembles contribute to EvoFlow's overall performance?
• RQ2. How does EvoFlow's effectiveness compare to other AWC approaches using different techniques?
• RQ3. How does EvoFlow compare in effectiveness to another G3P-based AWC proposal?
RQ1 explores the individual and combined effects of EvoFlow's unique components. RQ2 and RQ3 focus on benchmarking EvoFlow's performance against other AWC techniques. Our experimental analysis employs 22 classification datasets frequently used to evaluate AWC approaches [15,23]. The comparison methods are a representative sample of state-of-the-art techniques like BO [6], EAs [21,23], and automated planning and scheduling (AI planning) [27]. Based on this experimentation, the results indicate that the use of our specific genetic operators and ensemble mechanism significantly enhances workflow quality. Furthermore, EvoFlow significantly outperforms existing approaches in terms of predictive performance. The source code and scripts required to replicate the experiments are publicly available as supplementary material.
The rest of the paper is organised as follows. Section 2 defines the AWC task, introduces key concepts and terminology related to the applied techniques, and presents the background. Section 3 reviews related work. Section 4 provides a detailed description of the proposed method, EvoFlow. The experimental framework and research questions are described in Section 5. Then, the experiments are conducted and discussed across three sections: Section 6 presents an ablation study where the internal components of EvoFlow are analysed; Section 7 compares our proposal against other AutoML approaches; and Section 8 conducts a specific comparative study against RECIPE (a prior approach applying G3P for AWC). Finally, Section 9 presents our conclusions and outlines future directions for research.

Background
This section provides an overview of the most relevant concepts related to the automated workflow composition problem (see Section 2.1) and grammar-guided genetic programming (see Section 2.2).

Automated Workflow Composition
AWC involves optimising three related dimensions: (1) algorithms, (2) their relationships, and, optionally, (3) their hyper-parameter values. It is important to note that the algorithm selection problem [28] specifically refers to recommending the best algorithm(s) for a given dataset and phase of the knowledge discovery process [5]. On the other hand, AWC provides more comprehensive support by covering multiple phases. The relationships between algorithms determine their arrangement within the workflow, allowing them to exploit synergies and handle specific constraints. Finally, the hyper-parameter optimisation problem [29] focuses on selecting the optimal values for hyper-parameters associated with algorithms, such as decision tree depth or SVM (Support Vector Machine) kernel. It is worth mentioning that both algorithm selection and hyper-parameter optimisation have been studied independently for decades. More recently, [15] recognised the need to address both problems jointly, which they referred to as the Combined Algorithm Selection and Hyper-parameter optimisation (CASH) problem. This was approached as a hierarchical optimisation problem, where the selection of one algorithm was considered a hyper-parameter, triggering the optimisation of its respective hyper-parameters while ignoring those of other algorithms. In this paper, we extend the definition of CASH to include the optimisation of relationships between algorithms.
The AWC problem addressed in this paper can be formally defined as follows. Given a set of algorithms A, we define S as the set of all the possible ordered sequences of these algorithms, ranging in size from 1 to |A|. It should be noted that algorithms in A can be applied for data preprocessing (A_P) or model building (A_MB), with A_P ∩ A_MB = ∅. Furthermore, we define S ∈ S as a non-empty tuple subject to the following constraints:
• The size of a sequence S, denoted as |S|, must satisfy 1 ≤ |S| ≤ |A_P| + 1.
• For a sequence S = (a_1, ..., a_i, ..., a_n), it must satisfy a_n ∈ A_MB. Moreover, ∀i such that 1 ≤ i < n, it holds that a_i ∈ A_P.
From the above, we observe that any sequence is composed of a model building algorithm (e.g. SVM) which may be preceded by one or more preprocessing algorithms (e.g. feature selection). Regardless of their type, the performance of the algorithms in A heavily depends on the values of their hyper-parameters. Consequently, the performance of a sequence S is determined by the combination of hyper-parameter values of the algorithms it comprises. Let λ be the set of hyper-parameters associated with the algorithms in A and λ_S be the hyper-parameters of the algorithms in sequence S, such that λ_S ⊂ λ. Each hyper-parameter λ_1, λ_2, ..., λ_n has its respective domain Λ_1, Λ_2, ..., Λ_n, allowing us to define the hyper-parameter space Λ as the cross product of these domains, Λ_1 × Λ_2 × ... × Λ_n. Furthermore, as a sequence S does not include all the algorithms in A, we define Λ_S as the restricted search space that only considers the hyper-parameters of the algorithms in S, such that Λ_S ⊂ Λ. Finally, given a labelled dataset D, which is split into D_train and D_valid, the AWC problem aims to find the sequence of algorithms S* ∈ S and its corresponding hyper-parameter values λ_S* ∈ Λ_S* that maximise the predictive performance on D. Therefore, the AWC problem can be formally formulated as follows:

S*, λ_S* ∈ argmin_{S ∈ S, λ_S ∈ Λ_S} L(S_{λ_S}, D_train, D_valid)    (1)

where L(S_{λ_S}, D_train, D_valid) represents the loss, such as the misclassification rate, obtained by training the sequence S with its corresponding hyper-parameter values λ_S on D_train and testing it on D_valid.

Grammar-guided genetic programming
Evolutionary computation [30] is a field of artificial intelligence that encompasses methods, called evolutionary algorithms, inspired by the evolution of living organisms and designed to solve complex combinatorial optimisation problems. Different paradigms exist within evolutionary computation, primarily differing in the schema used to represent individuals. Genetic algorithms [31], for instance, encode the genotype of individuals as a fixed vector of bits. Another approach is genetic programming (GP), where individuals are encoded as trees without a priori constraints on their shape, size, or structural complexity [32]. These trees consist of a set of terminal nodes representing operands (e.g. 2 or X) and internal nodes representing operator functions (e.g. + or ÷). This representation makes GP well-suited for evolving mathematical expressions and computer programs. Notably, GP requires specialised genetic operators to manipulate these tree-based genotypes.
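As a minimal illustration of this tree-based encoding (a generic GP example, not EvoFlow's own representation), a mathematical expression such as 2*x + 3 can be stored as a nested tuple whose internal nodes are operators and whose leaves are operands:

```python
import operator

# Internal nodes hold operator functions, leaves hold operands.
OPS = {"+": operator.add, "*": operator.mul}

def evaluate(node, env):
    """Recursively evaluate an expression tree like ("+", ("*", "x", 2), 3)."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left, env), evaluate(right, env))
    return env.get(node, node)   # variable lookup, or the constant itself

tree = ("+", ("*", "x", 2), 3)   # encodes the expression 2*x + 3
value = evaluate(tree, {"x": 5}) # 2*5 + 3 = 13
```

Genetic operators for GP then act directly on such trees, e.g. by swapping or regrowing subtrees.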
G3P is an extension of GP in which a context-free grammar (CFG) defines the syntactic constraints that must be satisfied by valid individuals. A CFG is defined by a four-tuple {S, N, T, P}, where S is the root symbol, N is the set of non-terminal symbols, T denotes the set of terminal symbols, and P defines the set of production rules. It is important to note that the terminal symbols correspond to both operands and operators in GP. A production rule specifies how a non-terminal symbol can be rewritten into one of its derivations until the expression consists only of terminal symbols. Formally, a production rule can be expressed as a → B, where a ∈ N and B ∈ {N ∪ T}*. In G3P, each individual is created by deriving a unique sequence of production rules, represented as a derivation tree. The elements of the CFG are taken into account during the application of crossover and mutation operators to ensure the generation of valid individuals.
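A CFG and a random derivation can be sketched in a few lines. The toy grammar below is purely illustrative and does not reproduce EvoFlow's actual production rules; upper-case symbols are non-terminals, everything else is terminal:

```python
import random

# Toy CFG in the {S, N, T, P} sense: S = "WORKFLOW",
# N = {"WORKFLOW", "PREP", "CLASSIFIER"}, T = the lower-case algorithm names.
PRODUCTIONS = {
    "WORKFLOW": [["PREP", "CLASSIFIER"], ["CLASSIFIER"]],
    "PREP": [["pca"], ["minMaxScaler"], ["PREP", "PREP"]],
    "CLASSIFIER": [["kNN"], ["decisionTree"]],
}

def derive(symbol, rng, depth=0, max_depth=8):
    """Randomly rewrite `symbol` until only terminal symbols remain."""
    if symbol not in PRODUCTIONS:          # terminal symbol: nothing to derive
        return [symbol]
    options = PRODUCTIONS[symbol]
    if depth >= max_depth:
        # once the depth budget is spent, favour non-recursive derivations
        finite = [o for o in options if symbol not in o]
        options = finite or options
    rule = rng.choice(options)
    out = []
    for s in rule:
        out.extend(derive(s, rng, depth + 1, max_depth))
    return out

sentence = derive("WORKFLOW", random.Random(7))
```

By construction, every derived sentence is a valid workflow: zero or more preprocessing algorithms followed by exactly one classifier, which is precisely the guarantee G3P exploits.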

Related Work
A pioneering work in the field of AWC was proposed by [14], who used particle swarm optimisation to address the complete model selection problem. This involves selecting the best algorithms, together with their hyper-parameters, to perform feature scaling, feature selection, and classification tasks on a given dataset. Subsequently, [15] formalised the CASH problem, which was tackled by Auto-WEKA. This tool, a BO-based approach, automatically composes a two-step workflow consisting of a feature selection algorithm and a classifier, both taken from the WEKA software. Auto-WEKA has been extended to include regression algorithms [33] and other types of preprocessing algorithms [19]. HyperOpt-Sklearn [34] and Auto-Sklearn [6] are two closely related BO-based approaches that use algorithms taken from scikit-learn to construct their workflows. However, the latter also applies meta-learning to warm-start BO and combines predictions from the most accurate workflows found into an ensemble. This approach has been further enhanced by incorporating a new model selection strategy, a portfolio-building mechanism, and an automated policy selection method [35]. Given that evaluating a single workflow can be time-consuming (it may take hours or even days), several BO-based proposals have introduced mechanisms to accelerate the process, such as surrogate models [36], caching algorithms [37], and parallelism with Apache Spark [38], among others.
As highlighted by Quemy [20], most AWC approaches are based on a fixed sequence of generally few steps with a strict ordering, which limits their applicability to specific scenarios or domains. However, evolutionary algorithms tend to generate more complex workflows. A notable example is TPOT [21], which employs GP to construct multi-branch classification workflows. The optimisation process is guided by a two-objective fitness function aiming at maximising the classification accuracy and minimising the workflow size. On the other hand, evolutionary approaches require the evaluation, i.e. training, of a larger number of workflows, which can be computationally prohibitive for large datasets. To address this, TPOT has been extended to train workflows on a subset of the data and use the entire dataset only for the most promising ones [39,40]. The use of tree structures allows more complex workflows to be generated [41], as they are not limited to fixed-size genotypes like other evolutionary paradigms, e.g. evolution strategies [42]. Moreover, some researchers have employed grammars to define the structure of a valid workflow, enabling them to avoid, for example, the application of a neural network to a dataset with categorical features. Both G3P [23,22] and grammatical evolution [43,44] have been applied to guide the optimisation process. However, their grammars still impose restrictions on the workflow structure. Specifically, RECIPE [23] enforces the order in which the different types of preprocessing algorithms can be executed. AutoML-DSGE [44] adapts the RECIPE grammar by defining a grammar for each dataset. Also, Auto-CVE [22] composes workflows with an optional data preprocessing algorithm, an optional feature selection algorithm, and a classifier. Similarly, HML-Opt [43] constructs workflows with the same three steps, where each step is recursively composed of different algorithms within that category. All these approaches employ genetic operators derived from the GP literature.
The AWC problem has been addressed using various optimisation techniques or combinations thereof. Notably, AI planning techniques, which were proposed before the definition of the CASH problem, have been applied. In this field, [45] and [46] use hierarchical task network (HTN) planning to compose workflows based on an ontology and a CFG, respectively. Similarly, [27] proposed ML-Plan, which, in addition to HTN planning, incorporates a specially designed mechanism to prevent overfitting. Although to a lesser extent, other techniques applied for AWC are reinforcement learning [47,16], multi-armed bandits [48], and particle swarm optimisation [14,49], among others. It is important to note that some approaches separate the selection of algorithms and their relationships from the setting of their hyper-parameters. Therefore, they first apply Monte-Carlo tree search [50], meta-learning and multi-armed bandits [51], or BO [20] to perform algorithm selection, before adjusting hyper-parameters with BO.
As mentioned earlier, EvoFlow is a G3P-based technique for AWC that, unlike most existing approaches in the literature [52], does not enforce a prefixed workflow structure. More precisely, compared to current grammar-based approaches, it does not restrict the type of preprocessing algorithms to be applied or their order within the workflow sequence. This enables EvoFlow to expand the solution space and promote greater diversity among solutions. Moreover, domain-specific genetic operators dedicated to optimising both the structure and hyper-parameters of workflows have been developed for EvoFlow. Lastly, in contrast to current proposals that build ensembles based solely on the predictive performance of the workflows [6,42], EvoFlow also considers workflow prediction diversity. This avoids the construction of ensembles composed of workflows that lead to the same or similar predictions, even when they are composed of different algorithms and/or different hyper-parameters.

EvoFlow
The general procedure of EvoFlow is outlined in Algorithm 1. This algorithm takes the following inputs: the maximum number of generations (maxGen), the size of the population (popSize), the grammar (cfg), the maximum number of derivation steps (maxDer), the probabilities for crossover (cxProb) and structural mutation (stMutProb), the number of individuals to be returned (archSize), the training set (train), the maximum allowed time for the optimisation process in seconds (budget), the allowed time for evaluation (evalBudget), and the weight assigned to diversity (divWeight), which can range from 0 to 1. The output of EvoFlow is an external archive (archive) that includes the most accurate and diverse individuals, i.e. the workflows.

Algorithm 1: EvoFlow
In: maxGen, popSize, cfg, maxDer, cxProb, stMutProb, archSize, train, budget, evalBudget, divWeight
Out: archive
1 % Handling initial population
2 pop ← genWorkflows(popSize, cfg, maxDer)
3 try:

Regarding its operation, the algorithm initially generates a random population, denoted as pop, consisting of popSize individuals. These individuals are created in accordance with the specified grammar cfg (line 2). Notice that the maxDer parameter limits the number of times non-terminal symbols can be derived using production rules, effectively constraining the size of the derivation tree and the number of algorithms in the workflows. The individuals in pop are then evaluated, with their fitness being calculated within the allocated time frame of evalBudget seconds (line 4). In this paper, we use balanced accuracy as the fitness function, given that our experimentation focuses on classification datasets and this metric is commonly adopted by other proposals in the field. Balanced accuracy, which should be maximised, is calculated by determining the recall of each class, followed by computing the average across all classes (see Equation 2). Any individuals exceeding this time budget are assigned a fitness value of 0. To mitigate the risk of overfitting, 5-fold cross-validation is performed on the train dataset. This number of folds was chosen based on preliminary experiments that showed its effectiveness in striking a balance between predictive generalisation and evaluation time. This approach is in line with standard practices in the field, as exemplified by its default use in ML-Plan [27]. The external archive archive is initially populated with the archSize individuals demonstrating the highest fitness (line 5). The algorithm proceeds through its iterations until it reaches either the maxGen-th generation (line 6) or the budget time limit (line 33).
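Since Equation 2 is not reproduced here, the following sketch shows the standard definition of balanced accuracy assumed in the text, i.e. the unweighted mean of per-class recall:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    recalls = []
    for c in np.unique(y_true):
        mask = y_true == c
        recalls.append(np.mean(y_pred[mask] == c))   # recall for class c
    return float(np.mean(recalls))

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0])
score = balanced_accuracy(y_true, y_pred)  # (4/4 + 1/2) / 2 = 0.75
```

Unlike plain accuracy (here 5/6 ≈ 0.83), this metric penalises workflows that neglect minority classes, which is why it is common in AutoML benchmarks with imbalanced datasets.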
During each generation, the selection operator picks popSize individuals (parents) from the current pop population (line 8). A crossover operator, either cxHparams or cxStruct, is applied to each pair of parents with a probability of cxProb (lines 11-18), meaning not all parent pairs undergo recombination. The choice of genetic operators and their specifics will be elaborated upon in Section 4.2. For now, it is important to note that cxHparams is only applicable to parents with at least two hyper-parameters in common (line 12), while cxStruct is used in other cases (line 15). Similarly, two mutation operators, mutStruct and mutHparams, are applied based on the stMutProb probability (lines 20-26), with mutHparams being selected only when mutStruct is not. In practice, a low stMutProb prioritises mutHparams, allowing hyper-parameter values to be modified without changing the workflow structure, unlike mutStruct. Finally, the external archive is updated to retain individuals with the best fitness while ensuring diversity in their predictions (line 29). The importance of fitness and diversity is determined by the divWeight parameter.
Upon completing maxGen generations or exceeding the budget time limit, the archive, containing up to archSize workflows, is returned. These workflows, both accurate and diverse, are then employed to form an ensemble using a weighted majority voting scheme. The weight of each workflow is computed by dividing its fitness by the fitness of the best individual. It is important to note that after the evolutionary algorithm concludes, these individuals are retrained on the complete train set, as cross-validation is performed during evaluation. Detailed explanations of individual encoding, genetic operators, and the archive update procedure will be provided in the subsequent sections.

The genotype of a valid individual is represented by its derivation tree of production rules, as formally defined by the cfg. The phenotype represents the corresponding classification workflow. Figure 1 illustrates an example mapping between genotype and phenotype. In this case, the dataset is preprocessed using principal component analysis to extract new features, followed by the k-nearest neighbour algorithm to build the classification model.
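The weighted majority voting scheme used to combine the archive's workflows can be sketched as follows; the function and variable names are our own, and the predictions and fitness values are fabricated for illustration:

```python
import numpy as np

def ensemble_predict(predictions, fitnesses):
    """Weighted majority vote: each workflow's weight is its fitness
    divided by the fitness of the best workflow in the archive."""
    predictions = np.asarray(predictions)   # shape: (n_workflows, n_samples)
    weights = np.asarray(fitnesses, dtype=float) / max(fitnesses)
    classes = np.unique(predictions)
    votes = np.zeros((len(classes), predictions.shape[1]))
    for idx, c in enumerate(classes):
        # accumulate the weights of all workflows voting for class c
        votes[idx] = ((predictions == c) * weights[:, None]).sum(axis=0)
    return classes[np.argmax(votes, axis=0)]

preds = [[0, 1, 1],   # predictions of three workflows on three samples
         [0, 1, 0],
         [1, 1, 0]]
fit = [0.9, 0.8, 0.5]
final = ensemble_predict(preds, fit)
```

On the third sample the two weaker workflows jointly outvote the single best one, illustrating how the ensemble can override the top workflow's prediction.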
As mentioned in Section 2.2, the CFG defines the sets of terminal (T) and non-terminal (N) symbols, as well as the production rules (P) for deriving valid expressions from the root symbol (S). In the context of AWC, terminal symbols represent specific preprocessing and classification algorithms, along with their respective hyper-parameters. Non-terminal symbols define the derivable elements required to construct a classification workflow. Production rules dictate the derivation steps that generate valid workflows, which are ultimately expressed in terms of terminal symbols, as shown in Figure 1b.

Figure 2 shows the proposed CFG for automated composition of classification workflows. For readability and space reasons, some symbols and production rules have been omitted (see supplementary material). The first two production rules in P determine how the root symbol (<workflow>) can be derived into a classifier (<classifier>). In the first case, it can be preceded by a set of preprocessing methods represented by <prepBranch>. The preprocessing branch allows for an inclusive sequence of preprocessing algorithms (<preprocess>), covering various tasks such as feature extraction or feature selection. Notably, there are no restrictions on the type and sequence of preprocessing algorithms. On the other hand, <classifier> can be derived into a classification algorithm and its corresponding hyper-parameters. It should be noted that adapting the grammar to other machine learning tasks, such as regression, can be achieved by directly adding the corresponding algorithms and hyper-parameters. Additionally, if additional constraints were imposed on workflows, the grammar could be adapted to consider specific algorithms, such as interpretable decision trees.

Genetic operators
Three genetic operators are employed in the evolutionary schema: selection, crossover, and mutation. The selection operator is applied at the beginning of each generation to choose a set of parents for breeding. It uses binary tournament selection, randomly selecting two individuals from pop and taking the one with the best fitness. This process is repeated until popSize individuals are selected.
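Binary tournament selection can be sketched in a few lines; population contents and fitness values below are illustrative:

```python
import random

def tournament_select(population, fitness, rng):
    """Binary tournament: sample two distinct individuals at random,
    return the fitter of the two."""
    a, b = rng.sample(range(len(population)), 2)
    return population[a] if fitness[a] >= fitness[b] else population[b]

pop = ["wf1", "wf2", "wf3"]
fit = [0.6, 0.9, 0.3]
# repeat until popSize parents are selected (here, four draws)
parents = [tournament_select(pop, fit, random.Random(i)) for i in range(4)]
```

Because only relative fitness matters, this scheme applies mild selection pressure: the worst individual can never win a tournament, but middling individuals still get selected regularly.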
Crossover is applied to each pair of parents with a given probability. Two crossover operators are defined. First, cxStruct is a classical GP operator that randomly selects a common non-terminal symbol in both parents to swap their respective subtrees. An example of this crossover is depicted in Figure 3a, with <prepBranch> as the selected non-terminal symbol. Second, cxHparams collects the common hyper-parameters of both parents and uses them to compose two lists of the same length. A one-point crossover is then applied to swap the hyper-parameter values at a randomly selected crossover point. It is important to note that cxHparams requires parents to share at least two hyper-parameters, making it inapplicable in some cases. Figure 3b depicts an example where only the hyper-parameters of the principal component analysis and k-nearest neighbours are eligible for swapping. Consequently, EvoFlow gives preference to cxHparams over cxStruct when it is applicable.
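The cxHparams operator can be sketched as a one-point crossover over the hyper-parameters the two parents share; parents are modelled here as flat dicts and all hyper-parameter names are illustrative:

```python
import random

def cx_hparams(parent1, parent2, rng):
    """One-point crossover over shared hyper-parameters.
    Requires at least two hyper-parameters in common."""
    common = sorted(set(parent1) & set(parent2))
    assert len(common) >= 2, "cxHparams is not applicable"
    point = rng.randrange(1, len(common))        # crossover point
    child1, child2 = dict(parent1), dict(parent2)
    for name in common[point:]:                  # swap values after the point
        child1[name], child2[name] = parent2[name], parent1[name]
    return child1, child2

p1 = {"pca.nComponents": 5, "pca.whiten": True, "knn.nNeighbors": 3}
p2 = {"pca.nComponents": 10, "pca.whiten": False,
      "knn.nNeighbors": 7, "knn.weights": "distance"}
c1, c2 = cx_hparams(p1, p2, random.Random(0))
```

Note that the workflow structures are untouched: only values of shared hyper-parameters are exchanged, which is why the operator needs at least two of them to produce a non-trivial recombination.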
Two variants of the mutation operator are considered. Firstly, mutStruct aims to generate workflows with diverse algorithms. To this end, it randomly selects a non-terminal symbol and rebuilds the tree branch by deriving all non-terminal symbols randomly until only terminal symbols are generated. Figure 4a illustrates an example where the minMaxScaler is removed and the pca algorithm is added, along with its respective hyper-parameters. Secondly, mutHparams randomly modifies the value of a hyper-parameter with a given probability, depending on whether it is related to a preprocessing or a classification algorithm. The probability of altering each preprocessing and classification hyper-parameter is calculated as the inverse of the number of preprocessing and classification hyper-parameters, respectively. For instance, the probability of modifying each preprocessing hyper-parameter in Figure 4b is 0.5, since there is only one preprocessing algorithm with two hyper-parameters, namely nComponents and whiten.
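The mutHparams rule can be sketched as follows; workflows are modelled as flat dicts whose key prefixes mark preprocessing versus classification hyper-parameters, and all names and domains are illustrative:

```python
import random

def mut_hparams(workflow, domains, rng):
    """Mutate each hyper-parameter with probability 1/k, where k is the
    number of hyper-parameters of the same kind (preprocessing vs.
    classification)."""
    mutated = dict(workflow)
    for kind in ("prep.", "clf."):
        names = [n for n in workflow if n.startswith(kind)]
        prob = 1.0 / len(names) if names else 0.0
        for name in names:
            if rng.random() < prob:
                mutated[name] = rng.choice(domains[name])  # resample value
    return mutated

wf = {"prep.pca.nComponents": 5, "prep.pca.whiten": True,
      "clf.knn.nNeighbors": 3}
domains = {
    "prep.pca.nComponents": [2, 5, 10, 20],
    "prep.pca.whiten": [True, False],
    "clf.knn.nNeighbors": [1, 3, 5, 7],
}
child = mut_hparams(wf, domains, random.Random(1))
```

With two preprocessing hyper-parameters, each is mutated with probability 0.5, matching the Figure 4b example in the text; the workflow structure itself is never altered by this operator.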

Update procedure
After applying the genetic operators, the archive is updated with archSize individuals based on both their accuracy and workflow prediction diversity. When the archive is empty in the first generation, the best archSize individuals are directly added based solely on their fitness. The workflow prediction diversity of each individual in the archive is determined by comparing the predictions made during its evaluation with those of the others. As mentioned above, cross-validation is used for evaluation, so the labels of all the samples in the train set are predicted for every evaluated individual. Consequently, these labels can be concatenated to build a prediction vector (x) for each individual in the archive.
Equation 3 shows the computation of workflow prediction diversity (div_i) for an individual i in the archive. Let x_i be the prediction vector of individual i, defined as (x_i1, x_i2, ..., x_in), where n corresponds to the number of samples in train. Similarly, let x_j be the prediction vector of another individual j in the archive, denoted as (x_j1, x_j2, ..., x_jn), where i ≠ j. The workflow prediction diversity between these individuals is calculated by comparing x_i and x_j element-wise, counting the number of samples for which they make different predictions, as shown in Equation 4. This count is then divided by the number of samples in train to obtain the ratio of differences. This process is repeated for the remaining individuals in the archive. The resulting ratios are aggregated and divided by the size of the archive minus one. Thus, div_i represents how dissimilar the predictions of individual i are compared to the other individuals in the archive, on average. Finally, a combined measure of workflow prediction diversity and fitness, denoted as divfit, is calculated using the obtained value as shown in Equation 5, where divWeight is a parameter that determines the weight assigned to each component. This measure is used to sort the individuals in the archive.
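The diversity measure can be sketched in a few lines. Note that the linear combination used for divfit below is an assumption about the exact form of Equation 5, chosen to match the description of divWeight as a weighting parameter:

```python
def diversity(idx, vectors):
    """Average pairwise disagreement (sketch of Equations 3-4):
    the fraction of training samples on which individual idx predicts
    differently from each other archive member, averaged over the
    remaining |archive| - 1 members. vectors is a list of equal-length
    prediction vectors, one per archive member."""
    n = len(vectors[idx])
    total = 0.0
    for j, other in enumerate(vectors):
        if j == idx:
            continue
        disagreements = sum(a != b for a, b in zip(vectors[idx], other))
        total += disagreements / n
    return total / (len(vectors) - 1)

def divfit(div, fitness, div_weight=0.2):
    """Combined score (assumed form of Equation 5): a convex
    combination of prediction diversity and fitness."""
    return div_weight * div + (1 - div_weight) * fitness
```

A div_i of 0 means the individual's predictions coincide with every other archive member on all samples; a value of 1 means it disagrees with all of them everywhere.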
In subsequent generations, when the archive is not empty as it was after evaluating the initial population, the update procedure slightly differs. It begins by selecting those individuals in pop with a fitness greater than 0. Then, for each individual, it computes how diverse its predictions are compared to those individuals already in the archive. Since this individual is not in the archive, two modifications are made to Equation 3 to compute its div value: (1) the denominator of the outer division is limited to |archive|; and (2) the condition j ≠ i is eliminated, as i and j will never refer to the same individual. After computing divfit, individuals are added to the archive, maintaining the sorted order. Finally, the last individuals are trimmed from the archive to retain only the top archSize individuals with the highest divfit value.
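Putting the pieces together, the update step for subsequent generations can be sketched as follows; the dict-based individual encoding ('preds', 'fitness', 'divfit') and the linear divfit combination are assumptions made for this example:

```python
def update_archive(archive, candidates, arch_size, div_weight=0.2):
    """Sketch of the archive update: candidates with fitness > 0 are
    scored by divfit against the current archive members, merged into
    the archive, and the archive is trimmed to the arch_size
    individuals with the highest divfit."""
    def disagreement(p, q):
        return sum(a != b for a, b in zip(p, q)) / len(p)

    for ind in candidates:
        if ind['fitness'] <= 0:
            continue  # only individuals with positive fitness compete
        # Diversity vs. archive members only: denominator is |archive|.
        div = sum(disagreement(ind['preds'], m['preds'])
                  for m in archive) / max(len(archive), 1)
        ind['divfit'] = div_weight * div + (1 - div_weight) * ind['fitness']
        archive.append(ind)
    archive.sort(key=lambda m: m['divfit'], reverse=True)
    del archive[arch_size:]  # keep only the top arch_size individuals
    return archive
```

A candidate whose predictions merely duplicate an existing member thus receives no diversity credit and must rely on fitness alone to enter the archive.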

Experimental settings
An implementation of EvoFlow is provided in Python using the DEAP framework (Distributed Evolutionary Algorithms in Python) [53], which offers functionalities and data structures for implementing evolutionary algorithms. Preprocessing algorithms were obtained from imbalanced-learn and scikit-learn. For classification algorithms, scikit-learn and XGBoost were considered. Implementations are available for download from the supplementary material.
In this section, we first outline the research questions (RQs) that drive our objectives. Then, we describe the methodology employed in the experiments.

Research Questions
The conducted experiments aim to address the following RQs:

• RQ1. How do the AWC-specific genetic operators and ensembles contribute to EvoFlow's model? Given that EvoFlow incorporates two mechanisms designed specifically for the AWC problem, it is essential to assess the individual and combined impact of these mechanisms on the predictive performance of the final models.
• RQ2. How does EvoFlow's effectiveness compare to other AWC approaches using different techniques? As mentioned in Section 3, various proposals have effectively tackled the AWC problem using different techniques. It is crucial to conduct a comprehensive comparison between EvoFlow, based on G3P, and these existing approaches to determine if it achieves state-of-the-art performance.
• RQ3. How does EvoFlow compare in effectiveness to another G3P-based AWC proposal? Similarly to RQ2, it is valuable to analyse the benefits that EvoFlow brings in comparison to other G3P-based AWC tools, ensuring that it advances the use of grammar-based methods for the AWC problem.

For the experimentation, we selected a total of twenty-two classification datasets from the literature [15,23]. Table 1 provides details on each dataset, including the number of features and classes, as well as the sizes of the training and test sets. These datasets offer diversity in terms of their sizes and complexities. For datasets sourced from [15], we used the same data partition defined by the authors. Datasets taken from [23], marked with an asterisk (*), were randomly partitioned, with one-third of the samples allocated for validation, to maintain consistent conditions across all datasets. The generated partitions are publicly available as supplementary material, ensuring reproducibility.

Experimental framework
Table 2 shows the parameter values for EvoFlow. The population size, number of generations, and crossover and structural mutation probabilities were determined through preliminary experiments. For each execution, budgets of 1, 6 and 12 hours were assigned to EvoFlow and the baseline methods, with the 12-hour budget used only for datasets where significant differences were observed between the other budgets. The evaluation budget was set to one-tenth of the total budget to prevent spending the entire execution time on a single individual. The maximum number of derivations, which governs the genotype size, was set to 13. This setting enables the creation of workflows comprising a sequence of up to five algorithms, in accordance with the proposed grammar. The archive size, representing the ensemble size, was limited to 10 to minimise the overhead of retraining the workflows with the full training set after EvoFlow completes. Finally, the diversity weight of 0.2 was chosen based on preliminary experiments, as higher values tend to yield ensembles with workflows exhibiting lower predictive performance. To address the RQs, three experiments were conducted. The first experiment (Section 6) analyses the impact of specific genetic operators and diverse ensembles on the predictive performance of EvoFlow. Our results are then compared with those of Auto-Sklearn [35], TPOT [21], and ML-Plan [27] in the second experiment (Section 7). These approaches employ BO, GP and AI planning, respectively. Finally, a third experiment (Section 8) compares EvoFlow with RECIPE [23], another G3P-based proposal for AWC. This separate comparison is necessary because RECIPE uses an older version of scikit-learn that does not compute the balanced accuracy score, which is the default measure employed by the publicly available implementations of TPOT and Auto-Sklearn at the time of this experimentation. Thus, the F1 score, representing the harmonic mean of precision and recall, was used as the fitness measure for RECIPE. For EvoFlow, the solutions obtained from the previous experiment, optimised based on the balanced accuracy score, were used, although the F1 score was computed for the test sets in this experiment. The hyper-parameters of the baseline methods were set to their default values, with only the budget being modified as indicated above. To ensure result validity, twenty repetitions were conducted using different random seeds. The raw results, along with the experiment scripts, are publicly available as supplementary material.

Experiment 1: Ablation study
To address RQ1, this section analyses the internal mechanisms of EvoFlow, specifically examining the impact of using specific genetic operators and constructing ensembles consisting of diverse workflows. Four versions of the method are considered: (1) basic-EvoFlow, which employs standard GP operators and returns the best workflow; (2) op-EvoFlow, which uses only specific genetic operators; (3) ens-EvoFlow, which focuses solely on building diverse ensembles; and (4) EvoFlow, the complete proposed method that incorporates both specific genetic operators and diverse ensembles. The experiment is conducted with a budget of 1 hour. First, we compare basic-EvoFlow with op-EvoFlow. Table 3 presents the average value obtained for each version. The "Inc/dec (%)" column indicates the percentage increase (positive value) or decrease (negative value) achieved by op-EvoFlow compared to basic-EvoFlow. As observed, both versions yield similar results, with the largest difference occurring in the winered dataset. We also perform a Wilcoxon signed-rank test on the average values across twenty repetitions for each dataset. At a significance level of 0.05, no significant differences in favour of either version are found. As mentioned in Section 3, the practice of combining the best workflows into an ensemble has already been employed in the AWC literature. Thus, comparing basic-EvoFlow with ens-EvoFlow alone might obscure the actual contribution of building diverse ensembles, as the improvement could merely come from having an ensemble. To address this concern, we examine two additional versions: top10-EvoFlow and top10W-EvoFlow. These versions create an ensemble comprising the top ten workflows based solely on their predictive performance. However, top10W-EvoFlow constructs a weighted ensemble, assigning greater importance to workflows with better fitness. Notice that diversity is not considered for these versions. Table 4 presents their balanced accuracy scores. As observed, ens-EvoFlow achieves the highest balanced accuracy score in thirteen out of twenty-two datasets, while basic-EvoFlow outperforms it in five datasets. Regarding top10-EvoFlow and top10W-EvoFlow, they only yield the best results in two and one dataset(s), respectively. Additionally, we perform the Friedman test to check for significant differences in average values across twenty-two datasets. After rejecting the null hypothesis, the post hoc Holm test reveals that ens-EvoFlow outperforms the other versions at a significance level of 0.05. It is worth noting that we chose the Wilcoxon signed-rank test and the Friedman test due to their robustness in non-parametric settings, essential given the complex nature of algorithmic performance in AutoML. As stated by García et al. [54], the Wilcoxon test is effective in dealing with paired samples for comparing different versions of our technique, and the Friedman test is particularly suited for evaluating multiple algorithms across various datasets. This ensures reliable statistical validation, even in the presence of outliers and small sample sizes.
Having established the effectiveness of building diverse workflow ensembles, we proceed to compare EvoFlow with the aforementioned versions. We conduct statistical tests on the results presented in Tables 3 and 4, showing that EvoFlow significantly outperforms basic-EvoFlow, op-EvoFlow, and ens-EvoFlow in four, five and one dataset(s), respectively. None of the other versions outperforms EvoFlow in any dataset. Furthermore, we compare these versions based on their average values across datasets. In this regard, EvoFlow consistently performs significantly better than the other options at a significance level of 0.05, thus demonstrating the beneficial combination of specific genetic operators and diverse workflow ensembles. Therefore, this version is henceforth used in the subsequent comparisons.

Experiment 2: EvoFlow compared to other AutoML proposals
This section addresses RQ2 by comparing EvoFlow with Auto-Sklearn, TPOT, and ML-Plan. Table 5 presents the results of this comparison for budget durations of 1 and 6 hours. It displays the average balanced accuracy score over twenty repetitions, along with the standard deviation per dataset. To statistically validate our proposal, we employ a Wilcoxon signed-rank test with a significance level of 0.05. The p-values are adjusted for each budget and dataset pair using the Holm method. The symbols ▲ or ▼ indicate whether the baseline approaches have significantly better or worse results than EvoFlow. Finally, the table provides a summary of the number of wins and losses for each tool based solely on the average values.
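The Holm step-down adjustment applied to the p-values can be sketched in a few lines; this is a generic implementation of the standard method, not the authors' analysis script:

```python
def holm_adjust(pvalues):
    """Holm step-down correction for multiple comparisons: sort
    p-values ascending, multiply the i-th smallest by (m - i),
    enforce monotonicity over the sorted sequence, and clip at 1."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, adj)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted
```

Compared with a plain Bonferroni correction, Holm's procedure is uniformly more powerful while still controlling the family-wise error rate.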
As observed, EvoFlow significantly outperforms Auto-Sklearn, TPOT, and ML-Plan across most datasets, irrespective of the time budget. For a 1-hour budget, EvoFlow significantly outperforms Auto-Sklearn in fourteen out of twenty-two datasets, with Auto-Sklearn only surpassing it in the amazon dataset. This result may be attributed to amazon having the highest number of features among datasets, excluding the sparse datasets dexter and dorothea. Notice that Auto-Sklearn uses BO, which aims to make as few evaluations as possible, and employs a warm-starting mechanism to start the optimisation process in promising regions of large search spaces. Similarly, EvoFlow significantly outperforms TPOT and ML-Plan in fourteen and thirteen datasets, respectively, with neither of them outperforming EvoFlow in any dataset. Additionally, it is noteworthy that EvoFlow achieves significantly better performance than both TPOT and ML-Plan in eight out of ten multi-class datasets, indicating the suitability of our proposal for such problem domains. Similar trends are observed for a 6-hour budget, where EvoFlow significantly outperforms Auto-Sklearn, TPOT, and ML-Plan in eleven, fourteen and fifteen datasets, respectively. Once again, none of the baseline approaches significantly outperforms EvoFlow in any dataset.
Interestingly, there is little difference between using a 6-hour and a 1-hour budget. In fact, a larger budget can potentially harm results by increasing the risk of overfitting. For TPOT and EvoFlow, a 6-hour budget outperforms a 1-hour budget in five and six datasets, respectively, with gisette, amazon and convex being the common datasets. We also observe that datasets with more features, such as amazon, benefit more from a larger budget than datasets with more samples. However, in two datasets (breastcancer for TPOT and dexter for EvoFlow), the 1-hour budget outperforms the 6-hour budget. Specifically, the breastcancer dataset (the smallest dataset) benefits more from a 1-hour budget for TPOT, while dexter (a sparse dataset) favours the 1-hour budget for EvoFlow. For Auto-Sklearn, a 6-hour budget significantly improves results in seven datasets. However, in three datasets (amazon, semeion, and convex), the 1-hour budget achieves significantly better results. Notably, amazon and convex are among the largest datasets in terms of features and samples, respectively. The warm-starting procedure of Auto-Sklearn could be a contributing factor. As mentioned above, Auto-Sklearn significantly outperforms EvoFlow only in the amazon dataset with a 1-hour budget. Increasing the budget of ML-Plan leads to significant improvements in four datasets without any performance decrease, which can be attributed to its mechanism for avoiding overfitting.
It is worth noting that there is no clear consensus on which datasets benefit from a larger budget. Specifically, there is no dataset for which results significantly improve with a 6-hour budget across all approaches. However, convex and gisette show improved results with a 6-hour budget for three of the approaches. Therefore, we increased the budget to 12 hours for both datasets. In this case, only Auto-Sklearn and EvoFlow exhibit significant improvements, but only for the convex dataset (approximately 1% improvement in both cases). Similar to the 6-hour budget, EvoFlow remains significantly superior to all baselines in both datasets. Here, EvoFlow demonstrates the highest median performance in both scenarios. This finding aligns with the statistical analysis mentioned above, which is visually summarised in Figures 6a (1-hour budget) and 6b (6-hour budget). These diagrams show that there are no significant differences among the three comparison methods (Auto-Sklearn, TPOT, and ML-Plan). To address RQ3, we compare our approach with RECIPE, a pioneering technique applying G3P to tackle the AWC problem. Table 6 presents the results of both approaches in terms of the F1 score, which serves as the fitness function of RECIPE. It should be noted that for EvoFlow, we use the solutions obtained from Section 7, optimised based on the balanced accuracy score, and then compute their F1 score for the purpose of comparison. As observed, RECIPE fails to produce results within the specified budgets for specific datasets (indicated as "-"), and thus, these datasets are not considered in the "wins/loses" row. Despite this, EvoFlow significantly outperforms RECIPE in nine and thirteen datasets for the 1-hour and 6-hour budgets, respectively. Regardless of the budget, we observe that RECIPE does not always complete its optimisation process within the given time limits. This is likely due to RECIPE terminating the optimisation process after five generations without improving the best individual. Examining the 1-hour budget scenario (refer to Figure 7a), the most notable performance differences are observed in the yeast, glass, and germancredit datasets, where EvoFlow surpasses RECIPE's results by 368% (omitted for readability reasons), 17%, and 11%, respectively. It is important to note significant overfitting in the yeast dataset, where RECIPE perfectly classifies all training samples in some runs. For the 6-hour budget scenario (Figure 7b), the largest differences in favour of EvoFlow are found in the same datasets, but the improvements are smaller. In this case, RECIPE shows improvements in the average values for the glass and germancredit datasets, while the results of EvoFlow for these datasets do not improve. Conversely, RECIPE outperforms EvoFlow in three datasets across both budgets: shuttle and winewhite (common to both budgets), and winered. In fact, the most notable performance difference in favour of RECIPE is observed in the winewhite dataset, with an 11% improvement. Finally, a Wilcoxon signed-rank test conducted on the average values of each dataset for both budgets confirms that EvoFlow significantly outperforms RECIPE.

Experiment 3: EvoFlow compared to RECIPE
Regarding the budget comparison, it is worth mentioning that increasing the budget to 6 hours does not lead to significant improvements in the results of RECIPE for any dataset. This is likely due to RECIPE failing to produce any results within the 1-hour budget for the larger datasets, which would benefit the most from a budget increase. In fact, EvoFlow exhibits significant improvements in its results with the budget increase for five datasets, including gisette, amazon, and convex, for which RECIPE fails to generate any results within the 1-hour budget.
As a final note with a focus on the qualitative dimensions, it is worth highlighting that EvoFlow produced superior solutions that could not be replicated by RECIPE due to its grammar specification. This phenomenon is exemplified in executions involving the germancredit dataset. The grammar of RECIPE is designed to generate preprocessing sequences with a specific, fixed order and restricts the occurrence of specific operations to just one instance. These operations, though optional, encompass imputation, normalisation, scaling, feature selection, and feature generation. However, the best workflows, identified in some runs of EvoFlow, incorporated more than one scaling algorithm, executed feature selection prior to scaling, or opted for imputation of missing values at later stages of the preprocessing sequence.

Concluding remarks
We have introduced EvoFlow, a grammar-guided genetic programming algorithm designed to tackle the automated workflow composition problem. The use of a context-free grammar provides flexibility and customisability to our approach, enabling practitioners to adapt EvoFlow according to their specific requirements. Unlike other evolutionary methods in the field, EvoFlow incorporates genetic operators that are specifically designed to optimise workflows, encompassing both their structure and hyper-parameters. While it is common practice to construct ensembles from the best workflows discovered, we observed that, as the evolution progresses, the population converges, resulting in similar workflows with identical predictions, i.e. misclassifying the same samples. To mitigate this issue, EvoFlow incorporates a mechanism for building ensembles that takes into account not only the predictive performance of the workflows but also the diversity of their predictions.
We have empirically validated EvoFlow using a collection of classification datasets from the AWC literature. Initially, we compared different versions of EvoFlow to establish that incorporating specific genetic operators and constructing ensembles of diverse workflows yields superior performance compared to the baselines. The results demonstrate that combining these characteristics significantly outperforms the basic version of EvoFlow, with particular emphasis on the creation of diverse ensembles. We also pitted EvoFlow against Auto-Sklearn, TPOT, ML-Plan and RECIPE, which use Bayesian optimisation, genetic programming, AI planning and grammar-guided genetic programming, respectively. The results show that EvoFlow significantly outperforms these baselines in terms of predictive performance for the given time budgets in up to 68% of the considered datasets, being statistically inferior in only a marginal number of them.
In the future, we plan to add more preprocessing and machine learning algorithms as grammar operators to support other learning tasks such as regression and clustering. Additionally, we believe it would be valuable to develop human-in-the-loop approaches that incorporate the expertise, experience, and intuition of data scientists into the optimisation process. Lastly, we intend to integrate EvoFlow with popular tools like KNIME or RapidMiner to enhance its accessibility and practicality for domain experts.

Supplementary material
The source code of EvoFlow is publicly available, along with a replication package containing all the necessary artefacts to reproduce the experiments outlined in this paper.The package includes the required scripts and information on the Python environments used.Information about the datasets and their partitions is available.Additionally, the raw results of the experiments conducted for both EvoFlow and the baseline methods are provided.The complete statistical analysis, encompassing both unadjusted and adjusted p-values, is also reported.This supplementary material can be accessed from the following Zenodo repository: https://doi.org/10.5281/zenodo.10245033

Figure 1 :
Figure 1: Example of individual

Table 1 :
Datasets used for experiments

Table 2 :
Parameter setting of EvoFlow

Table 3 :
basic-EvoFlow compared to op-EvoFlow in terms of their balanced accuracy score

Table 4 :
Analysis of diverse ensembles in terms of their balanced accuracy score

Table 5 :
EvoFlow compared with Auto-Sklearn, TPOT and ML-Plan in terms of their balanced accuracy score (mean and standard deviation values are displayed)

Table 6 :
Comparison of EvoFlow with the G3P-based approach RECIPE in terms of their F1 score (mean and standard deviation values are displayed)