Introduction

With the development of the evolutionary computation (EC) community, large-scale optimization problems (LSOPs) have become increasingly ubiquitous in the research community [1,2,3] and in real-world applications [4,5,6]. Meanwhile, owing to the curse of dimensionality [7], the search space and the complexity of such problems grow explosively, which rapidly degrades the performance of conventional evolutionary algorithms (EAs). Inspired by the divide-and-conquer strategy, the cooperative coevolution (CC) framework [8] decomposes LSOPs into multiple sub-components and optimizes them alternately, which is an efficient and flexible approach to dealing with LSOPs.

Two critical components in the implementation of the CC framework are the decomposition strategy and the optimization technique. To date, numerous techniques have been proposed to decompose LSOPs, and we briefly survey the development of these grouping methods in “A brief survey of grouping methods”. One of the most popular decomposition mechanisms is differential grouping (DG) [9] and its extensions [10,11,12], which identify additive interactions from the fitness differences of perturbed samples and are regarded as the most precise decomposition mechanisms. However, the original DG requires a high computational cost for decomposition. In the extreme case, DG consumes 1,001,000 fitness evaluations (FEs) for a 1000-D fully separable function, which makes it impractical to extend DG to large-scale expensive optimization problems (LSEOPs) in real-world applications such as aerodynamic design optimization [13], flow-shop scheduling problems [14], and so on. The recursive DG (RDG) [15], however, makes the extension of DG to LSEOPs possible. RDG examines interactions subset-to-subset rather than variable-to-variable in a binary-search fashion and greatly reduces the average FEs to 1.47e4 on the CEC2010 and CEC2013 benchmark functions. Moreover, efficient RDG (ERDG) [16] and merged DG (MDG) [17] further reduce the computational budget and improve the decomposition accuracy, which offers great potential for tackling LSEOPs under the CC framework with DG-based decomposition techniques.

The other component of the CC framework is optimization. Conventional EAs are essentially stochastic search methods [18]; they usually consume many FEs to find an acceptable solution, which severely limits their scalability to expensive optimization problems (EOPs) [19]. To overcome this challenge, a mature approach is to adopt computationally cheap surrogate models to assist the optimization of EAs. This class of algorithms is called surrogate-assisted EAs (SAEAs) [20]. A variety of techniques can be employed to construct surrogate models, such as radial basis functions (RBF) [21], polynomial regression (PR) [22], Gaussian process regression (GPR) [23] or Kriging [24], and artificial neural networks (ANN) [25]. However, the performance of SAEAs is strongly tied to the quality of the surrogate model, and the curse of dimensionality in LSEOPs rapidly degrades the accuracy and quality of surrogate models. Although some works [26,27,28] have adopted dimension reduction techniques and efficient surrogate model management schemes to tackle high-dimensional EOPs, most of them focus on EOPs with fewer than 200 decision variables and rarely reach the 1000-D scale. To meet the severe challenge of 1000-D LSEOPs, we suggest combining the CC framework with SAEAs, which is also the motivation of our research.

Meanwhile, researchers have made several contributions to dealing with LSEOPs. Ivanoe et al. [29] first introduced SAEAs into the CC framework and proposed a surrogate-assisted CC (SACC) optimizer. Ren et al. [30] followed the SACC framework and combined it with RBF-embedded success-history-based adaptive differential evolution (RBF-SHADE-SACC) to further promote the development of SACC. Sun et al. [31] proposed a surrogate ensemble assisted large-scale expensive evolutionary algorithm with random grouping (SEA-LEEA-RG) to solve LSEOPs, in which the random grouping strategy decomposes the original problem, local and global RBF models are constructed with partial and all samples respectively, and the best-predicted sample participates in the subsequent optimization process. Similarly, Sun et al. [32] combined the random grouping strategy with RBF-assisted DE and proposed a large-scale expensive optimization with a switching strategy (LSEO-SS), where the switching strategy controls the usage of the local and global surrogate models, and a unique escape mechanism based on re-initialization is designed to avoid premature convergence. Besides, Gu et al. [33] proposed a surrogate-assisted differential evolution algorithm with adaptive multi-subspace search (SADE-AMSS) for LSEOPs. In the decomposition, SADE-AMSS uses principal component analysis (PCA) and random decision variable selection to construct the multi-subspace, and three efficient search strategies are embedded in SADE adaptively. However, most of these studies focus on optimizers and only adopt random strategies for sub-component division. Although capturing the interactions between decision variables probabilistically is sometimes efficient, and accurate decomposition techniques consume additional FEs for interaction identification, a wrongly determined interaction may steer the optimization in the wrong direction [34]. Therefore, decomposition accuracy and optimization performance are two mutually restrictive metrics in the context of LSEOPs, and it is necessary to integrate high-accuracy, computationally cheap decomposition methods with efficient optimizers to deal with LSEOPs.

This paper proposes a novel decomposition method named efficient dual differential grouping (EDDG) and adopts a surrogate ensemble assisted differential evolution (SEADE) as the basic optimizer. In the decomposition stage, EDDG combines the merits of ERDG and Dual DG (DDG): it can detect additive and multiplicative separability with high efficiency and accuracy. Inspired by RDG2 [12] and RDG3 [35], we also introduce an adaptive threshold for multiplicative interaction identification and limit the maximum scale of sub-components to alleviate the negative effect of the curse of dimensionality in optimization, at the cost of a certain level of decomposition accuracy. In the optimization stage, the adopted SEADE utilizes global and local surrogate models to estimate promising solutions, and the participation of these elites is expected to accelerate the convergence of optimization under limited computational resources. More specifically, the contributions of this paper are summarized as follows:

1. In the decomposition stage, we embed the subset-to-subset version of DDG into ERDG, which can detect both additive and multiplicative interactions efficiently and simultaneously. Inspired by RDG2, our proposed EDDG inherits the adaptive threshold design, and inspired by RDG3, we limit the maximum scale of sub-components and further decompose relatively large-scale sub-components to alleviate the curse of dimensionality when solving LSEOPs. Besides, the paper [36] only proved the variable-to-variable version of DDG, and we provide mathematical support for the subset-to-subset version of DDG.

2. We introduce SEADE as the basic optimizer in the optimization stage. In each generation, SEADE constructs a generalized regression neural network (GRNN) model with all samples in the archive and a GPR model with the samples of the current generation, which correspond to the global and local models. Random search with an improvement-based infill sampling criterion is applied to estimate promising solutions, and the participation of these elites in the optimization is expected to accelerate convergence.

3. We implement a set of experiments to evaluate the performance of our proposal on the CEC2013 suite with fewer FEs than the standard experiment setting. To show the superiority of EDDG, we compare the decomposition accuracy (DA), consumed FEs, and optimization performance with state-of-the-art decomposition methods including RDG, RDG2, ERDG, and MDG. To show the efficiency of SEADE, we compare the optimization results between SEADE and conventional DE. Experimental and statistical results show that our proposal has great prospects for tackling LSEOPs. To the best of our knowledge, little work has been reported on adopting the CC framework and SAEAs to deal with LSEOPs.

The remainder of this paper is organized as follows: the following section covers the related works. Section “Our proposal: SEADECC-EDDG” provides a detailed introduction to our proposal: SEADECC-EDDG. The subsequent section describes the numerical experiments on the CEC2013 suite. Section “Discussion” discusses the experimental results and provides some future topics. Finally, Section “Conclusion” concludes the paper.

Related works

Preliminaries

Cooperative coevolution (CC)

A general CC framework contains three stages: decomposition, optimization, and combination.

Decomposition: the original problem is divided into multiple sub-problems with a certain decomposition method.

Optimization: optimization techniques (e.g., DE, PSO) are applied to optimize the sub-components. Since a sub-solution is usually not a complete solution and its fitness cannot be evaluated directly, the context vector [37] participates in the evaluation of sub-solutions.

Combination: the sub-solutions of all sub-components are combined, and the combined solution is regarded as the optimum of the original problem.

In summary, the pseudocode of CC is shown in Algorithm 1.

Algorithm 1 CC framework
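A minimal Python sketch of this decompose-optimize-combine loop is given below. It is only a schematic, not the implementation used later: `decompose`, `optimize_subcomponent`, and the objective `f` are hypothetical placeholders standing in for any concrete decomposition method and sub-optimizer.

```python
import numpy as np

def cc_optimize(f, dim, lb, ub, decompose, optimize_subcomponent, cycles=10):
    """Generic cooperative-coevolution loop (sketch).

    f: objective taking a full dim-D vector; decompose: callable returning a
    list of index groups; optimize_subcomponent: callable that improves the
    context vector on the given indices and returns the updated values.
    """
    groups = decompose(f, dim, lb, ub)            # decomposition stage
    context = np.random.uniform(lb, ub, dim)      # context vector [37]
    for _ in range(cycles):
        for idx in groups:                        # optimization stage
            # sub-solutions are evaluated by plugging them into the context vector
            context[idx] = optimize_subcomponent(f, context, idx, lb, ub)
    return context, f(context)                    # combination stage
```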

Differential evolution (DE)

DE was first proposed by Storn and Price [38]. Owing to its efficiency, robustness, applicability, and other superior characteristics, DE has been widely applied in many fields such as structural damage detection [39,40,41], neural architecture search [42, 43], and drug design [44, 45]. The conventional DE consists of four steps: initialization, mutation, crossover, and selection.

Initialization: as with most EAs, DE is a population-based optimization approach, and random initialization is the most common way to generate the initial population. Supposing the population size is n and the dimension of the problem is m, Eq. (1) shows the structure of the population X and the generation of each element \(x_{ij}\) in DE.

$$\begin{aligned} \begin{aligned} X&= \begin{bmatrix} x_{11} &{} x_{12} &{} \cdots &{} x_{1m} \\ x_{21} &{} x_{22} &{} \cdots &{} x_{2m} \\ x_{31} &{} x_{32} &{} \cdots &{} x_{3m} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ x_{n1} &{} x_{n2} &{} \cdots &{} x_{nm} \\ \end{bmatrix} \\ x_{ij}&= r_1 \cdot \left( UB_j - LB_j\right) + LB_j, \end{aligned} \end{aligned}$$
(1)

where \(LB_j\) and \(UB_j\) are the lower and the upper bound of the \(j\textrm{th}\) dimension and \(r_1\) is a random number in (0, 1). \(x_{ij}\) represents the value of the \(j\textrm{th}\) dimension of the \(i\textrm{th}\) individual.

Mutation: the most characteristic feature of DE is generating the offspring population with the differential mutation operator, which is also the origin of the name DE. Several variants of mutation strategies have been reported in the literature, and the commonly used schemes are listed as follows:

$$\begin{aligned}&{\textbf {DE/rand/1}}: \nonumber \\&v_{i} = X_{r_1} + F \cdot \left( X_{r_2} - X_{r_3}\right) . \end{aligned}$$
(2)
$$\begin{aligned}&{\textbf {DE/best/1}}: \nonumber \\&v_{i} = X_{best} + F \cdot \left( X_{r_1} - X_{r_2}\right) . \end{aligned}$$
(3)
$$\begin{aligned}&{\textbf {DE/cur-to-rand/1}}: \nonumber \\&v_{i} = X_{i} + F \cdot (X_{r_1} - X_{i}) + F \cdot \left( X_{r_2} - X_{r_3}\right) . \end{aligned}$$
(4)
$$\begin{aligned}&{\textbf {DE/cur-to-best/1}}: \nonumber \\&v_{i} = X_{i} + F \cdot \left( X_{best} - X_{i}\right) + F \cdot \left( X_{r_1} - X_{r_2}\right) , \end{aligned}$$
(5)

where \(r_1, r_2, r_3\) are three mutually distinct integers uniformly sampled from \([1, n]\), and \(X_{best}\) is the current best individual in the population.

Crossover: afterward, the crossover operator generates the trial vector u by Eq. (6)

$$\begin{aligned} \begin{aligned} u^j_i = {\left\{ \begin{array}{ll} v^j_i, \ \text {If} \ r \le CR \ or \ j = j_{rand} \\ X^j_i, \ \text {Otherwise} \end{array}\right. } \end{aligned} \end{aligned}$$
(6)

where r is a random number in (0, 1), \(j_{rand}\) is a random integer in [1, m], and \(CR \in (0, 1)\) is the crossover rate.

Selection: finally, the selection operator chooses the individuals that survive into the next generation. Taking a minimization problem as an example, Eq. (7) shows the selection mechanism of conventional DE, where the superscript g denotes the generation index.

$$\begin{aligned} \begin{aligned} X_{i}^{g+1} = {\left\{ \begin{array}{ll} v_i, \ \text {If} \ f\left( v_i\right) < f\left( X_i^{g}\right) \\ X_i^{g}, \ \text {Otherwise}. \end{array}\right. } \end{aligned} \end{aligned}$$
(7)
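The four steps above can be condensed into a short sketch of DE/rand/1/bin; the population size, F, and CR below are illustrative values only, not the settings used later in the experiments.

```python
import numpy as np

def de_rand_1_bin(f, lb, ub, n=10, F=0.8, CR=0.5, max_gen=100, rng=None):
    rng = rng or np.random.default_rng()
    m = len(lb)
    X = rng.uniform(lb, ub, (n, m))              # initialization, Eq. (1)
    fit = np.array([f(x) for x in X])
    for _ in range(max_gen):
        for i in range(n):
            r1, r2, r3 = rng.choice([k for k in range(n) if k != i], 3, replace=False)
            v = X[r1] + F * (X[r2] - X[r3])      # mutation, DE/rand/1, Eq. (2)
            v = np.clip(v, lb, ub)
            j_rand = rng.integers(m)
            mask = rng.random(m) <= CR
            mask[j_rand] = True
            u = np.where(mask, v, X[i])          # crossover, Eq. (6)
            fu = f(u)
            if fu < fit[i]:                      # selection, Eq. (7)
                X[i], fit[i] = u, fu
    best = int(np.argmin(fit))
    return X[best], fit[best]
```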

RDG and ERDG

Original DG-based decomposition methods are highly accurate but computationally expensive. To overcome this issue, RDG [15] first introduces a binary-search fashion and extends the DG mechanism to the subset-to-subset level. Assume that \(X_1 \subset X\) and \(X_2 \subset X\) are two mutually exclusive subsets of decision variables, i.e., \(X_1 \cap X_2 = \emptyset \). If there are two unit vectors \({{\textbf {u}}}_1 \in U_{X_{1}}, {{\textbf {u}}}_2 \in U_{X_{2}}\), two arbitrary numbers \(l_1, l_2 > 0\), and a candidate solution \({{\textbf {x}}}^*\) in the search space such that

$$\begin{aligned} \begin{aligned}&f\left( {{\textbf {x}}}^*+l_1{{\textbf {u}}}_1+l_2{{\textbf {u}}}_2\right) -f\left( {{\textbf {x}}}^*+l_2{{\textbf {u}}}_2\right) \\&\quad \ne f\left( {{\textbf {x}}}^*+l_1{{\textbf {u}}}_1\right) -f\left( {{\textbf {x}}}^*\right) , \end{aligned} \end{aligned}$$
(8)

then there exist interactions between the decision variables in \(X_1\) and \(X_2\). Based on this corollary, RDG initially assigns the first variable \(x_1\) to subset \(X_1\) and the remaining variables to subset \(X_2\). If \(X_1\) does not interact with \(X_2\) according to Eq. (8), the element in \(X_1\) is determined to be a separable variable; otherwise, \(X_2\) is separated into two equal-sized, mutually exclusive subsets and the identification is executed recursively. Figure 1 shows a demonstration of RDG.

Fig. 1 The decomposition process of RDG

This binary-search strategy reduces the decomposition complexity from \(O(N^2)\) to \(O(N\log (N))\). Furthermore, efficient RDG (ERDG) [16] notices that redundant interaction examinations exist in RDG, which can be avoided using historical information. In Fig. 1, the variable subsets \(X_3\) and \(X_4\) are mutually exclusive and \(X_3 \cup X_4 = X_2\); thus, Eq. (8) can be transformed into Eq. (9)

$$\begin{aligned} \begin{aligned}&f\left( {{\textbf {x}}}^*+l_1{{\textbf {u}}}_1+l_2\left( {{\textbf {u}}}_3+{{\textbf {u}}}_4\right) \right) - f\left( {{\textbf {x}}}^*+l_2\left( {{\textbf {u}}}_3+{{\textbf {u}}}_4\right) \right) \\&\quad \ne f\left( {{\textbf {x}}}^*+l_1{{\textbf {u}}}_1\right) -f\left( {{\textbf {x}}}^*\right) \end{aligned} \end{aligned}$$
(9)

Here, we define

$$\begin{aligned} \begin{aligned} \Delta _1&= f\left( {{\textbf {x}}}^*+l_1{{\textbf {u}}}_1+l_2\left( {{\textbf {u}}}_3+{{\textbf {u}}}_4\right) \right) \\&\quad -f\left( {{\textbf {x}}}^*+l_2\left( {{\textbf {u}}}_3+{{\textbf {u}}}_4\right) \right) \\ \Delta _2&= f\left( {{\textbf {x}}}^*+l_1{{\textbf {u}}}_1\right) -f\left( {{\textbf {x}}}^*\right) \end{aligned} \end{aligned}$$
(10)

If \(X_1\) interacts with \(X_2\), \(X_2\) will be separated into \(X_3\) and \(X_4\), and the interaction identification of \(X_1 \leftrightarrow X_3\) can be computed as

$$\begin{aligned} \begin{aligned} \Delta ^{'}_1&= f\left( {{\textbf {x}}}^*+l_1{{\textbf {u}}}_1+l_2{{\textbf {u}}}_3\right) -f\left( {{\textbf {x}}}^*+l_2{{\textbf {u}}}_3\right) \\ \Delta ^{'}_2&= f\left( {{\textbf {x}}}^*+l_1{{\textbf {u}}}_1\right) -f\left( {{\textbf {x}}}^*\right) = \Delta _2. \end{aligned} \end{aligned}$$
(11)

Similarly to Eq. (11), the interaction of \(X_1 \leftrightarrow X_4\) can be identified by

$$\begin{aligned} \begin{aligned} \Delta ^{''}_1&= f\left( {{\textbf {x}}}^*+l_1{{\textbf {u}}}_1+l_2{{\textbf {u}}}_4\right) -f\left( {{\textbf {x}}}^*+l_2{{\textbf {u}}}_4\right) \\ \Delta ^{''}_2&= \Delta _2. \end{aligned} \end{aligned}$$
(12)

We can then reasonably infer that if \((\Delta _1-\Delta _2) = (\Delta ^{'}_1-\Delta ^{'}_2)\), \(X_1\) does not interact with \(X_4\); otherwise, \(X_1\) interacts with \(X_4\). In either case, the calculation of \(\Delta ^{''}_1\) in Eq. (12) can be avoided to save FEs; more details can be found in [16]. Although the time complexity of ERDG is theoretically identical to that of RDG and its variants, the FEs required in practice are significantly reduced, to an average of 7.62e3 on the CEC2013 benchmark functions.
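As a sketch, the subset-level check of Eq. (8) can be written as below, returning both differences so that \(\Delta _2\) (and the previously computed \(\Delta _1\)) can be reused across later splits of \(X_2\) in the ERDG spirit. The perturbation sizes, the base point, and the threshold eps are illustrative placeholders for whichever adaptive threshold is in use.

```python
import numpy as np

def subset_interaction(f, x_base, idx1, idx2, lb, ub, eps):
    """Check whether variable subsets idx1 and idx2 interact (Eq. (8)).
    Returns the flag together with delta1 and delta2 so they can be reused."""
    step = (ub - lb) * 0.5                                   # illustrative perturbation size
    x1 = x_base.copy();  x1[idx1] += step[idx1]              # perturb X1 only
    x2 = x_base.copy();  x2[idx2] += step[idx2]              # perturb X2 only
    x12 = x_base.copy(); x12[idx1] += step[idx1]; x12[idx2] += step[idx2]
    delta1 = f(x12) - f(x2)
    delta2 = f(x1) - f(x_base)                               # reusable across splits of X2
    return abs(delta1 - delta2) > eps, delta1, delta2
```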

Generalized regression neural network (GRNN)

GRNN is a highly parallel radial basis function network based on a one-pass algorithm [46, 47], and the universal approximation theorem [48] endows GRNN with the ability to approximate any continuous function theoretically. The topology of GRNN is shown in Fig. 2.

Fig. 2 The structure of GRNN [47]

GRNN consists of four layers: the input layer, pattern layer, summation layer, and output layer. Equation (13) summarizes the GRNN logic in an equivalent nonlinear regression formula [49]:

$$\begin{aligned} E[Y \mid X]=\left( \int ^{+\infty }_{-\infty }Yf(X,Y)\textrm{d}Y\right) \Big / \left( \int ^{+\infty }_{-\infty }f(X,Y)\textrm{d}Y\right) , \end{aligned}$$
(13)

where \(X=\{X_1, X_2,..., X_n\}\) denotes the n-D input vector, Y is the label predicted by GRNN, \(E[Y \mid X]\) is the expectation of Y given an input X, and f(X, Y) is the joint probability density of X and Y [50].
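With Gaussian kernels centred on the training samples, Eq. (13) reduces to the familiar kernel-weighted average that GRNN computes in one pass. The sketch below is a minimal NumPy version with an assumed smoothing parameter sigma; the actual implementation in this paper relies on the pyGRNN library instead.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma=0.5):
    """One-pass GRNN prediction: a kernel-weighted average of training labels,
    i.e., a discrete estimate of E[Y | X] in Eq. (13)."""
    # squared Euclidean distances between every query and every training sample
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))      # pattern-layer activations
    return (w @ y_train) / w.sum(axis=1)      # summation and output layers
```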

Gaussian process regression (GPR)

GPR, a popular probabilistic surrogate model, has been widely applied in SAEAs [51,52,53]. Derived from Bayesian theory, GPR performs non-parametric regression by modeling the objective as a Gaussian process [54]. For any input, the corresponding probability distribution over the function f(x) follows the Gaussian distribution:

$$\begin{aligned} \begin{aligned} f(x) \sim \mathcal {GPR}\left( m(x),k\left( x,x'\right) \right) , \end{aligned} \end{aligned}$$
(14)

where m(x) and \(k(x,x')\) are the mean and the covariance functions respectively, which can be expressed by:

$$\begin{aligned} \begin{aligned} m(x)&=E(f(x)) \\ k\left( x,x'\right)&=E\left[ \left( f(x)-m(x)\right) \left( f\left( x'\right) -m\left( x'\right) \right) \right] , \end{aligned} \end{aligned}$$
(15)

where \(E(\cdot )\) represents the expectation. In practice, m(x) is generally set to zero to simplify the computation. \(k(x,x')\), also called the kernel function, expresses the degree of relevance between a target observation in the training data set and the prediction, based on the similarity of the respective inputs.

In the regression problem, the prior distribution of output y can be denoted as

$$\begin{aligned} \begin{aligned} y \sim N\left( 0,k\left( x,x'\right) +\sigma ^2_nI_n\right) \end{aligned} \end{aligned}$$
(16)

where \(N(\cdot )\) represents the normal distribution and \(\sigma ^2_n\) is the noise term. Assuming the training inputs x and the test inputs \(x'\) follow the same Gaussian distribution, the prediction \(y'\) and the training output y follow the joint prior distribution [55]

$$\begin{aligned} \begin{aligned} \begin{bmatrix} y \\ y' \end{bmatrix} \sim N\left( 0, \begin{bmatrix}k(x,x)+\sigma ^2_nI_n &{} k\left( x,x'\right) \\ k\left( x,x'\right) ^T &{} k\left( x',x'\right) \end{bmatrix}\right) \end{aligned} \end{aligned}$$
(17)

where \(k(x,x), k(x,x')\), and \(k(x',x')\) represent the covariance matrices within the training set, between the training and testing sets, and within the testing set, respectively.

To guarantee the performance of GPR, some hyper-parameters \(\theta \) in the covariance function need to be optimized with the n training samples. One efficient approach is to minimize the negative log marginal likelihood \(L(\theta )\) [56]

$$\begin{aligned} \begin{aligned} L(\theta )&=\frac{1}{2}\log \det \lambda (\theta )+\frac{1}{2}y^T\lambda ^{-1}(\theta )y+\frac{n}{2}\log (2\pi ) \\ \lambda (\theta )&=k(\theta )+\sigma ^2_nI_n. \end{aligned} \end{aligned}$$
(18)

After the hyper-parameters of the GPR are optimized, the prediction \(y'\) at the test set \(x'\) can be obtained by calculating the corresponding conditional distribution \(p(y'\vert x',x,y)\):

$$\begin{aligned} \begin{aligned}&p\left( y'\vert x',x,y\right) \sim N\left( y' \vert \bar{y}', \textrm{cov}\left( y'\right) \right) \\&\bar{y}'=k\left( x,x'\right) ^T\left[ k(x,x)+\sigma ^2_nI_n\right] ^{-1}y \\&\textrm{cov}\left( y'\right) =k\left( x',x'\right) \\&\quad - k\left( x,x'\right) ^T\left[ k(x,x)+\sigma ^2_nI_n\right] ^{-1}k\left( x,x'\right) , \end{aligned}\nonumber \\ \end{aligned}$$
(19)

where \(\bar{y}'\) stands for the corresponding mean values of prediction. \(\textrm{cov}(y')\) denotes a variance matrix to reflect the uncertainty range of these predictions. More details of mathematical support can be found in [57].
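In practice the posterior of Eq. (19) can be obtained from an off-the-shelf implementation. The sketch below uses scikit-learn's GaussianProcessRegressor (the library later used for GPR in our implementation); the toy data, kernel, and noise level are chosen purely for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# toy 1-D training data standing in for the samples of the current generation
X_train = np.random.uniform(-5, 5, (20, 1))
y_train = np.sin(X_train).ravel()

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-6, normalize_y=True)
# hyper-parameters are tuned by maximizing the marginal likelihood, cf. Eq. (18)
gpr.fit(X_train, y_train)

X_test = np.linspace(-5, 5, 100).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)  # \bar{y}' and the uncertainty of Eq. (19)
```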

The classification of SAEAs

In the past decades, many SAEAs have been published to deal with EOPs, which can be briefly divided into three categories:

Global surrogate model assisted EAs This kind of method applies a global surrogate model to approximate the whole fitness landscape of the optimization problem. Liu et al. [23] focused on medium-scale EOPs and proposed a surrogate model-aware mechanism to search a small promising area carefully; given the presence of the curse of dimensionality, Sammon mapping is introduced to reduce the dimension of the problem. Yu et al. [58] introduced a generation-based optimal restart strategy to assist social learning PSO for solving medium-scale EOPs, where the global RBF model is re-constructed every few generations with the best samples archived in the historical database. Nuovo et al. [59] adopted an inexpensive fuzzy function to approximate the global fitness landscape, and the samples obtained during evolution are applied to construct and refine the fuzzy approximation function; when the model becomes reliable, it is employed to evaluate solutions, and elites evaluated by the fuzzy approximation function are then evaluated by the real objective function. Nikolaus et al. [60] embedded a global quadratic model and a simple portfolio algorithm into the covariance matrix adaptation evolution strategy (CMA-ES). In real-world applications, global surrogate-assisted techniques have also made contributions: Pan et al. [61] hybridized the gannet optimization algorithm (GOA) and the differential evolution (DE) algorithm with the RBF model to deal with wireless sensor network (WSN) coverage problems and achieved great progress. Han et al. [62] introduced the RBF model into a DE optimizer and proposed a surrogate assisted evolutionary algorithm with restart strategy (SAEA-RS) to tackle the constrained space component thermal layout optimization problem. Stander et al. [63] focused on the popular pressure swing adsorption system optimization problem and proposed a surrogate-assisted NSGA (SA-NSGA) for multi-objective optimization problems. Meanwhile, scholars have noticed that it is not easy to construct a high-fidelity model of the whole fitness landscape. Thus, building a local surrogate model to execute more efficient exploitation has become a popular topic.

Local surrogate model assisted EAs The local surrogate model aims to improve the approximation accuracy of the surrogate and to search for promising solutions in sub-regions. Martinize et al. [64] proposed a surrogate-assisted local search to enhance exploitation and embedded it into MOEA/D to deal with multi-objective EOPs. Lim et al. [65] proposed a generalized surrogate-assisted evolutionary framework addressing two major issues: (1) mitigating the curse of uncertainty robustly, and (2) benefiting from the blessing of uncertainty; a diverse set of approximation techniques is employed to assist the memetic algorithm and improve the effectiveness of local search. Sun et al. [66] designed a fitness estimation strategy and embedded it into PSO to estimate the fitness value of particles based on their parents, ancestors, and siblings.

Ensemble surrogate model assisted EAs The ensemble surrogate model usually consists of a global and a local surrogate model and has been shown to outperform a single model on most problems in practice [67]. Wang et al. [68] applied GRNN and RBF to assist DE: in the global surrogate-assisted phase, DE is employed as the search engine to produce multiple trial vectors and GRNN is responsible for evaluating them, whereas in the local surrogate-assisted phase, the interior point method cooperating with RBF is utilized to refine each individual in the population. Cai et al. [27] utilized the optima obtained from the global and local surrogate models, constructed over the entire search space and over the neighbor region around the personal historical best particle respectively, to guide the optimization direction of PSO. In Wang et al.'s work [69], RBF as the global surrogate model is applied to pre-screen the offspring generated by DE, and the local surrogate model trained with the \(\tau \) best samples is applied to find promising solutions in sub-regions. Furthermore, the surrogate ensemble assisted technique is also a popular approach for real-world expensive optimization problems: Yu et al. [70] proposed a dynamic surrogate-assisted evolutionary algorithm framework for expensive structural optimization, where the adaptive surrogate model selection technology automatically selects the most accurate model from the meta-model pool according to the minimum root mean square error. Chen et al. [71] proposed a surrogate assisted evolutionary algorithm with a dimensionality reduction method for water flooding production optimization, where Sammon mapping is adopted as the dimension reduction method and the Kriging model with the lower confidence bound (LCB) is employed to estimate promising solutions. Zhou et al. [72] developed a hierarchical surrogate-assisted evolutionary algorithm with multi-fidelity modeling technology for multi-objective shale gas reservoir problems, in which both the net present value and cumulative gas production are regarded as objective functions; the low-fidelity model adopts exploration behaviors, and the high-fidelity model generates high-quality solutions as a local search operator. Tang et al. [73] proposed an adaptive dynamic surrogate-assisted evolutionary algorithm for the aerodynamic shape design optimization of transonic airfoils and wings, in which the construction of a high-fidelity global surrogate model is abandoned due to its high computational cost and the local surrogate model on the sub-region is encouraged to implement exploitation behaviors. A large number of surrogate assisted techniques have thus been introduced to research areas and applications and have achieved great success.

A brief survey of grouping methods

In this section, we briefly survey the development of grouping methods and classify their mechanisms into three categories: static decomposition, random decomposition, and perturbation-based decomposition.

Static decomposition CCGA [8], the pioneer of static decomposition methods, divides an N-D problem into N 1-D sub-problems and optimizes them with the genetic algorithm (GA), which is competitive on separable functions but may mislead the direction of optimization on non-separable functions. To address this issue, CCPSO-S\(_{K}\) [37] decomposes an N-D problem into k s-D sub-problems in order, where \(N=k \times s\), and particle swarm optimization (PSO) is adopted as the basic optimizer for each sub-component. However, a remaining problem of static grouping is that if two interacting decision variables are separated initially, there is no opportunity for them ever to be assigned to the same group.

Random decomposition Random-based grouping methods are developed to solve the mentioned problem of static decomposition, and random grouping can be further divided into two categories: fixed sub-component size and dynamic sub-component size.

(1). Fixed sub-component size DECC-I [74] is the first random grouping method; it divides an N-D problem into k s-D sub-components in random order and optimizes them with differential evolution (DE). Later, DECC-G [75] introduced an adaptive weighting strategy and mathematically proved that the probability of assigning two interacting variables to the same sub-component is fairly high. CCPSO-S\(_{K}\)-rg-aw [76] extends the decomposition method of DECC-G with a PSO optimizer.

(2). Dynamic sub-component size An obvious flaw of static decomposition and of random grouping with a fixed sub-component size is that the scale of sub-components is limited manually, while the appropriate value of this parameter differs among problems. For example, fully separable functions prefer small-scale sub-components, whereas a relatively large sub-component size increases the probability of capturing the interactions in non-separable functions. Therefore, DECC-II [74] randomly selects the scale of sub-components s from a predefined range in each coevolution cycle, and MLCC [77] provides a pool of candidate scales \([s_1, s_2,..., s_n]\) and chooses the scale \(s_i\) according to the recent improvement of the context vector.

Perturbation-based approaches In the early development of decomposition methods, researchers attempted to capture the interactions probabilistically, until the appearance of linkage identification (LI) [78, 79]. Based on the mechanism of interaction identification, LI includes two kinds: non-linearity check and monotonicity detection. In this paper, we mainly focus on the non-linearity check; a detailed introduction to monotonicity detection can be found in [78].

Non-linearity check based methods are well-studied techniques. As a quantitative method, LINC approximates the second-order partial derivative to detect the interactions between decision variables. Mathematically,

$$\begin{aligned} \begin{aligned}&\textrm{If} \ \frac{\partial ^2 f(x)}{\partial x_i \partial x_j} = 0 \\&\textrm{Then} \ x_i \ and \ x_j \ are \ separable. \end{aligned} \end{aligned}$$
(20)

However, most of the studies focus on black-box optimization, and the value of the partial derivative cannot be obtained from the fitness landscape directly. Therefore, LINC perturbs in corresponding dimensions of samples to identify the separability:

$$\begin{aligned} \begin{aligned}&\forall s \in Pop: \\&\Delta _i = f(s_i) - f(s), \Delta _j = f\left( s_{ij}\right) - f\left( s_j\right) \\&\textrm{If} \ \vert \Delta _i - \Delta _j \vert < \varepsilon \\&\textrm{Then} \ x_i \ and \ x_j \ are \ separable \end{aligned} \end{aligned}$$
(21)

where \(\varepsilon \) is an allowable error. One detection between two decision variables needs four FEs. Supposing the population size is k and the dimension of the problem is N, the total number of FEs is \(4k\frac{N(N-1)}{2}\). When N reaches 1000 and k equals 1, the necessary FEs amount to 1,998,000, which is completely unacceptable for LSOPs. Differential grouping (DG) [9] extends the identical mechanism of LINC to LSOPs. Considering the high computational cost, DG only calculates the perturbation from the lower bound of the search space toward the upper bound. Neglecting indirect interactions reduces the FEs to \(\frac{N}{2m}(m+N-2)\), where m is the size of the sub-components; in the extreme case that a 1000-D problem is fully separable, the consumed FEs are 1,001,000. A sensitivity test for \(\varepsilon \) is also provided in [9]. Global DG (GDG) [11] introduces pairwise interactions to improve the decomposition accuracy. DG2 [80], a faster and more accurate decomposition method, reduces the FEs to \(\frac{N(N+1)}{2}+1\) by sharing fitness-evaluation information. Furthermore, conventional DG is sensitive to \(\varepsilon \), and the absolute difference \(\vert \Delta _i - \Delta _j \vert \) in Eq. (21) may arise from two sources: genuine interaction and the rounding error of representing real numbers as floating-point numbers. Thus, DG2 proposes an adaptive \(\varepsilon \) based on the IEEE 754 standard. However, DG2 still costs 500,501 FEs to decompose a 1000-D problem, which limits its scalability to higher-dimensional problems.

However, the high computational budget of DG, DG2, and their variants still limits their scalability to LSOPs with more decision variables, such as 5000-D problems, but the introduction of RDG makes this extension possible. RDG [15] detects interactions subset-to-subset in a binary-search fashion, which decreases the computational complexity from \(O(N^2)\) to \(O(N\log (N))\). Moreover, ERDG [16] and merged DG (MDG) [17] avoid the redundant computation of RDG in interaction identification and further save the computational budget in practice. To the best of our knowledge, MDG is the lightest DG-based decomposition method so far.

The above DG-based methods concentrate on additive interaction identification. However, other kinds of separability also exist. Dual DG (DDG) [36] applies a logarithmic operation to transform multiplicative separability into additive separability and then detects the interactions with DG. More specifically, in a multiplicatively separable function \(g({\textbf{x}})\) whose minimum is larger than 0, the \(g_i({\textbf{x}}_i)\) are multiplicatively separable sub-components. We define \(f({\textbf{x}}) = \ln g({\textbf{x}})\), and \(f({\textbf{x}})\) can be rewritten as

$$\begin{aligned} \begin{aligned} f({\textbf{x}})=&\ln g({\textbf{x}}) \\ =&\ln \prod ^{k}_{i=1} g_i({\textbf{x}}_i) \\ =&\sum ^{k}_{i=1} \ln g_i({\textbf{x}}_i), 1<k \le D \end{aligned} \end{aligned}$$
(22)

Therefore, the identification condition of DDG to detect the multiplicative separability can be formulated by Eq. (23)

$$\begin{aligned} \begin{aligned}&\textrm{If} \ f(s), f(s_i), f(s_j), f(s_{ij}) \ \textrm{are positive and} \\&\vert \left( \ln f(s_{ij}) - \ln f(s_j)\right) - \left( \ln f(s_i) - \ln f(s)\right) \vert < \varepsilon \\&\textrm{Then} \ x_i \ and \ x_j \ \textrm{are multiplicatively separable}. \end{aligned} \end{aligned}$$
(23)
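A pairwise sketch of the DDG check in Eq. (23) follows: the same four perturbed evaluations used by DG are reused, and the log transform is only applied when all four values are positive. The perturbation size delta and the threshold eps_multi are illustrative placeholders.

```python
import numpy as np

def ddg_pairwise_separable(f, s, i, j, delta=1.0, eps_multi=1e-6):
    """Return True if x_i and x_j are judged multiplicatively separable (Eq. (23))."""
    s_i, s_j, s_ij = s.copy(), s.copy(), s.copy()
    s_i[i] += delta
    s_j[j] += delta
    s_ij[i] += delta; s_ij[j] += delta
    vals = np.array([f(s), f(s_i), f(s_j), f(s_ij)])
    if np.any(vals <= 0):
        return False                       # log transform undefined: treat as non-separable
    d1 = np.log(vals[3]) - np.log(vals[2]) # ln f(s_ij) - ln f(s_j)
    d2 = np.log(vals[1]) - np.log(vals[0]) # ln f(s_i)  - ln f(s)
    return abs(d1 - d2) < eps_multi
```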

Besides, GSG [81], a general separability grouping method, conducts a comprehensive theoretical investigation that extends the study from the existing additive separability to general separability. The core of this theoretical research is the minimum points shift principle, which can effectively identify general separability.

Our proposal: SEADECC-EDDG

This section introduces our proposal: SEADECC-EDDG in detail. Two parts in our proposal need to be explained: EDDG for decomposition and SEADE for optimization of sub-components.

Decomposition: EDDG

As mentioned above, the motivation for our proposed EDDG is to develop an advanced decomposition technique that combines the efficiency of ERDG with the advantage of DDG, namely the ability to detect additive and multiplicative interactions simultaneously. Therefore, we adopt the decomposition framework of ERDG, embed the determination condition of DDG into it, and thus propose EDDG. Moreover, inspired by RDG2, we design an adaptive threshold for multiplicative interaction identification. Considering that this research focuses on solving LSEOPs, and that relatively large-scale sub-components still consume a large number of FEs to reach acceptable solutions due to the curse of dimensionality, we follow RDG3 and further decompose any sub-component whose scale is larger than a pre-defined maximum scale. In summary, the pseudocode of EDDG is shown in Algorithm 2.

Algorithm 2 EDDG

In Algorithm 2, NaN denotes a non-numeric placeholder value. EDDG first detects the interactions between \(x_1\) and the remaining decision variables, i.e., between \(X_1\) and \(X_2\) (lines 5 to 8). If these two subsets are separable (line 9), EDDG further inspects the size of subset \(X_1\): if \(X_1\) contains more than one decision variable, EDDG assigns \(X_1\) as a non-separable sub-component (line 11); otherwise, its element is classified as a separable decision variable (line 13). If there exist interactions between \(X_1\) and \(X_2\), EDDG extracts the interrelated decision variables from \(X_2\) and assigns them to \(X_1\) (line 17). EDDG repeats the above process until all decision variables are well placed. The main difference between EDDG and ERDG in Algorithm 2 lies in line 20: if the scale of \(X_1\) is larger than the pre-defined maximum sub-component scale MS, \(X_1\) is further divided into multiple sub-components randomly. For fully non-separable and overlapping functions in particular, the accurate decomposition is to place all decision variables into a single sub-component, but optimization under this decomposition cannot be accelerated by the CC framework. Thus, EDDG sacrifices decomposition accuracy and randomly decomposes any sub-component whose scale exceeds MS, and the accuracy of this decomposition reduces to

$$\begin{aligned} \begin{aligned} Acc = \frac{MS}{D} \times 100\%, \end{aligned} \end{aligned}$$
(24)

where D is the dimension of the original problem. Assuming \(MS=100\) and \(D=1000\), the decomposition accuracy for fully non-separable and overlapping functions drops to \(10\%\). However, this process alleviates the curse of dimensionality and accelerates the convergence of optimization at the cost of decomposition accuracy.
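A minimal sketch of this size-limiting step (Algorithm 2, line 20) is given below: an identified group larger than MS is shuffled and cut into chunks of at most MS variables. The shuffle-and-chunk scheme here is one straightforward realization; the method only requires that the oversized group be divided randomly.

```python
import numpy as np

def limit_group_size(group, max_size=100, rng=None):
    """Randomly split an oversized sub-component into chunks of at most max_size."""
    rng = rng or np.random.default_rng()
    if len(group) <= max_size:
        return [list(group)]
    idx = np.array(group)
    rng.shuffle(idx)
    return [idx[k:k + max_size].tolist() for k in range(0, len(idx), max_size)]

# e.g., a 1000-variable non-separable group with MS = 100 becomes ten 100-D groups,
# and the decomposition accuracy of Eq. (24) drops to 100/1000 = 10%.
```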

Algorithm 2 shows the framework of EDDG which mainly inherits the skeleton of ERDG, and Algorithm 3 illustrates the interaction identification between \(X_1\) and \(X_2\).

Algorithm 3 Interact

\(\beta ^{MAX}_{multi}\) is set much larger than \(\varepsilon _{multi}\) so that whenever a negative value exists in FA, where the \(\ln (\cdot )\) operator is invalid, \(X_1\) and \(X_2\) are treated as multiplicatively non-separable. In Algorithm 3, EDDG first calculates the additive difference (line 8) and the multiplicative difference (line 13). If \(X_1\) and \(X_2\) are both additively and multiplicatively non-separable, \(X_2\) is divided into two equal-sized, mutually exclusive subsets \(X^{'}_2\) and \(X^{''}_2\) (line 21), the interaction identification between \(X_1\) and these two subsets is executed, and this process continues until EDDG finds all variables that interact with \(X_1\).

After \(X_2\) is separated into \(X^{'}_2\) and \(X^{''}_2\), if \(X_1\) does not interact with \(X^{'}_2\), we can reasonably infer that \(X_1\) interacts with \(X^{''}_2\) (line 26), so the interaction identification between \(X_1\) and \(X^{''}_2\) can be skipped. Similarly, if \(X_1\) interacts with \(X^{'}_2\) and \(\beta _{add} = {\hat{\beta }}_{add}\) or \(\beta _{multi} = {\hat{\beta }}_{multi}\), EDDG can infer that \(X_1\) does not interact with \(X^{''}_2\) (line 33), which saves FEs during decomposition.
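The core of Algorithm 3 can be sketched as follows: the additive difference and, when all four fitness values are positive, the multiplicative difference are computed from the same four evaluations, so the multiplicative check adds no extra FEs. The variable names mirror the algorithm's notation, but the functions are simplified stand-ins rather than the full recursive procedure.

```python
import numpy as np

def dual_differences(fa):
    """fa = [f(x*), f(x*+l1*u1), f(x*+l2*u2), f(x*+l1*u1+l2*u2)] (four shared FEs).
    Returns the additive and multiplicative differences used in Algorithm 3."""
    f0, f1, f2, f12 = fa
    beta_add = abs((f12 - f2) - (f1 - f0))
    if min(fa) > 0:                        # ln(.) is valid only for positive values
        beta_multi = abs((np.log(f12) - np.log(f2)) - (np.log(f1) - np.log(f0)))
    else:
        beta_multi = np.inf                # forces "multiplicatively non-separable"
    return beta_add, beta_multi

def interacts(beta_add, beta_multi, eps_add, eps_multi):
    # X1 and X2 interact only if they are neither additively nor multiplicatively separable
    return beta_add > eps_add and beta_multi > eps_multi
```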

Our proposed EDDG thus inherits the interaction identification technique of ERDG and embeds the subset-to-subset version of DDG into it (i.e., Algorithm 3, lines 9 to 17). As a flexible decomposition criterion, DDG does not consume extra FEs and only adds a slight computational burden, which means it can be embedded into any existing DG-based framework easily, and the time complexity of EDDG is \(O(N \log (N))\), the same as ERDG. Here, we provide the computational complexity analysis of EDDG in detail.

Four categories of problems exist: fully separable, fully non-separable, partially separable, and non-separable problems with overlaps. Supposing the dimension of the problem is D: for fully separable problems, the interaction is detected \(t=D-1\) times and Algorithm 3 is invoked by Algorithm 2 \(t=D-1\) times, so the consumed FEs of EDDG amount to \(2(D-1)+D-1+1=3D-2\). For fully non-separable problems, the decomposition process forms a binary tree and the interaction is detected \(t=2(D-1)-1=2D-3\) times, while Algorithm 3 is invoked by Algorithm 2 \(t=1\) time, so the necessary FEs for fully non-separable problems are \(2(2D-3)+1+1=4D-4\). For partially separable and overlapping non-separable problems, the computational complexity of ERDG is \(O(D\log (D))\), which has been proven in [16], and the embedded multiplicative interaction identification does not add any extra computational complexity; in summary, the computational complexity of EDDG is also \(O(D\log (D))\).

Moreover, although the original DDG was combined with RDG3 in [36], the subset-to-subset version of DDG was not proven there. Here, we provide mathematical support.

\({\textbf {Theorem 1}}\): Let \(f({{\textbf {x}}})\) be a twice continuously differentiable function and \(X=\{x_1, x_2,..., x_N\}\) be the decision variable set of an N-D problem. \(X_1\) and \(X_2\) are two mutually exclusive subsets with \(X_1 \cup X_2 = X\). \(U_{X_1}\) and \(U_{X_2}\) are two sets of unit vectors projected onto \(X_1\) and \(X_2\) respectively. If there exist two unit vectors \({{\textbf {u}}}^*_1 \in U_{X_1}\) and \({{\textbf {u}}}^*_2 \in U_{X_2}\), two positive real numbers \(l_1, l_2 > 0\), and a candidate solution x in the search space such that

$$\begin{aligned} \begin{aligned} \Delta _1&\ne \Delta _2 \\ \Delta _1&= \ln \left( f\left( {{\textbf {x}}}+l_1{{\textbf {u}}}^*_1\right) \right) - \ln (f({{\textbf {x}}})) \\ \Delta _2&= \ln \left( f\left( {{\textbf {x}}}+l_1{{\textbf {u}}}^*_1+l_2{{\textbf {u}}}^*_2\right) \right) - \ln \left( f\left( {{\textbf {x}}}+l_2{{\textbf {u}}}^*_2\right) \right) . \end{aligned} \end{aligned}$$
(25)

Note that \(f({{\textbf {x}}}+l_1{{\textbf {u}}}^*_1+l_2{{\textbf {u}}}^*_2), f({{\textbf {x}}}+l_2{{\textbf {u}}}^*_2), f({{\textbf {x}}}+l_1{{\textbf {u}}}^*_1), f({{\textbf {x}}})\) are required to be positive; then subset \(X_1\) has multiplicative interactions with subset \(X_2\).

\(Proof \ 1\): The variable-to-variable version of DDG has been proven in [36], which can be formulated as

$$\begin{aligned} \begin{aligned}&\Delta _1 = \ln \left( f\left( s_i\right) \right) - \ln (f(s)) \\&\Delta _2 = \ln \left( f\left( s_{ij}\right) \right) - \ln \left( f\left( s_j\right) \right) \\&\textrm{If} \ \vert \Delta _1 - \Delta _2 \vert < \varepsilon _{multi} \\&\textrm{Then} \ x_i \ and \ x_j \ are \ multiplicatively \ separable \end{aligned} \end{aligned}$$
(26)

where s is a trial solution, \(s_i\) and \(s_j\) perturb s by \(\delta \) in the \(i\textrm{th}\) and \(j\textrm{th}\) dimensions respectively, and \(s_{ij}\) perturbs s by \(\delta \) in the \(i\textrm{th}\) and \(j\textrm{th}\) dimensions simultaneously. Here, we define \(g(s) = \ln (f(s))\). Given that \(f(s) > 0\) always holds, this definition is valid, and Eq. (26) can be transformed into Eq. (27), which is equivalent to the canonical version of DG.

$$\begin{aligned} \begin{aligned}&\Delta _1 = g\left( s_i\right) - g(s)\\&\Delta _2 = g\left( s_{ij}\right) - g\left( s_j\right) \\&\textrm{If} \ \vert \Delta _1 - \Delta _2 \vert < \varepsilon _{multi} \\&\textrm{Then} \ x_i \ and \ x_j \ are \ multiplicatively \ separable \end{aligned} \end{aligned}$$
(27)

The only difference is the allowable error \(\varepsilon _{multi}\). Therefore, we can reasonably extend Eq. (27) to the subset-to-subset version based on the theoretical foundations in RDG [15] and MDG [17]:

$$\begin{aligned} \begin{aligned}&\Delta _1 = g\left( {{\textbf {x}}}^*+l_1{{\textbf {u}}}_1+l_2{{\textbf {u}}}_2\right) - g\left( {{\textbf {x}}}^*+l_2{{\textbf {u}}}_2\right) \\&\Delta _2 = g\left( {{\textbf {x}}}^*+l_1{{\textbf {u}}}_1\right) -g\left( {{\textbf {x}}}^*\right) \\&\textrm{If} \ \vert \Delta _1 - \Delta _2 \vert < \varepsilon _{multi} \\&\textrm{Then} \ X_1 \ and \ X_2 \ are \ multiplicatively \ separable \end{aligned} \end{aligned}$$
(28)

Replacing \(g({{\textbf {x}}})\) with \(\ln (f({{\textbf {x}}}))\) yields the subset-to-subset version of DDG. Thus, Eq. (25) holds, and Theorem 1 is proven.

Another important component of EDDG is the threshold for interaction identification; many works have revealed the sensitivity of the DG mechanism to this threshold. Here, we summarize the popular designs of DG-based methods in Table 1.

Table 1 The design of threshold \(\varepsilon \) in popular DG-based techniques

In Table 1, DG, FII, and DDG use a fixed threshold \(\varepsilon \) for interaction identification, and this strategy cannot adapt to fitness landscapes with various characteristics. RDG adaptively sets its threshold \(\varepsilon \) based on the magnitude of the objective space [11], where \({{\textbf {x}}}_1, {{\textbf {x}}}_2,..., {{\textbf {x}}}_{10}\) are randomly sampled solutions and \(\alpha =10^{-12}\) is suggested in [67]. RDG2 and ERDG adopt an adaptive threshold based on the computational round-off error, where \(\gamma _k = \frac{k\mu _{m}}{1-k\mu _{m}}\) and \(\mu _m\) is the round-off error of the floating-point number system defined by the IEEE 754 standard [83]. MDG adapts the threshold \(\varepsilon \) according to the fitness function value f(x), the dimension size n, and the rounding error of floating-point numbers, where \(f_{max}=\max \{\vert f({{\textbf {x}}}_{l,l}) \vert , \vert f({{\textbf {x}}}_{u,l}) \vert , \vert f({{\textbf {x}}}_{l,m}) \vert , \vert f({{\textbf {x}}}_{u,m}) \vert \}\). We follow the adaptive threshold design of RDG2 and ERDG and extend it to our proposed EDDG, and this dynamic strategy determines the thresholds \(\varepsilon _{add}\) and \(\varepsilon _{multi}\) automatically.
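In the spirit of the RDG2/ERDG design in Table 1, the threshold can be estimated from the floating-point round-off and the magnitudes of the four fitness values involved in a check. The sketch below is one plausible instantiation: the constant \(k=\sqrt{D}+2\) and the reuse of the same rule for eps_multi are assumptions of this sketch, not a verbatim reproduction of the cited formulas.

```python
import numpy as np

def adaptive_threshold(fvals, dim):
    """Round-off-error based threshold (sketch following the RDG2/ERDG design).
    fvals: the four fitness values used in one interaction check."""
    mu = np.finfo(float).eps / 2.0       # unit round-off of IEEE 754 double precision
    k = np.sqrt(dim) + 2.0               # assumed constant, following RDG2
    gamma = (k * mu) / (1.0 - k * mu)    # gamma_k = k*mu / (1 - k*mu)
    return gamma * np.sum(np.abs(fvals))
```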

Optimization: SEADE

This research focuses on optimizing LSEOPs, which have a limited computational budget; thus, we introduce a surrogate ensemble assistance technique into conventional DE. Figure 3 shows the flowchart of SEADE for optimizing a sub-component.

Fig. 3 The flowchart of SEADE

For each sub-component, SEADE first initializes the population and parameters and evaluates the population with the objective function. While the termination criterion is not met, SEADE generates offspring individuals with three strategies: one individual estimated by the global surrogate model, one individual estimated by the local surrogate model, and \(n-2\) individuals constructed by the conventional mutation and crossover schemes, where n is the population size (a minimal sketch of this composition is given below). Then, the selection mechanism chooses the surviving solutions. The following subsections detail the applied techniques.
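As a sketch of the per-generation composition described above; `global_elite`, `local_elite`, and `de_offspring` are hypothetical helpers standing in for the two surrogate searches and the DE operators.

```python
def seade_generation(population, archive, n, global_elite, local_elite, de_offspring):
    """Assemble the offspring of one SEADE generation (sketch)."""
    offspring = [
        global_elite(archive),     # estimated on the GRNN model built from all archived samples
        local_elite(population),   # estimated on the GPR model built from the current generation
    ]
    offspring += [de_offspring(population) for _ in range(n - 2)]  # conventional mutation + crossover
    return offspring
```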

Surrogate ensemble

Universal approximation theorems endow neural networks (NN) with outstanding regression capacity, stating that an NN can represent a wide variety of functions when given appropriate weights; NNs have therefore been extensively applied as global surrogate models in many SAEAs [68, 84, 85]. Gaussian process regression (GPR), also known as the Kriging model, is a highly flexible model that can quantify uncertainty with great precision when selecting update points [86], and it has been employed as the local surrogate model in many pieces of literature [87,88,89]. Therefore, our proposed SEADE combines a global and a local surrogate model, constructed by a generalized regression neural network (GRNN) and GPR respectively, and the elite solutions estimated by these two surrogate models are expected to be of high quality and to accelerate the convergence of optimization.

Another component required to construct the surrogate models is the training data. Figure 4 provides an example of the two training-data selection fashions for model construction. Figure 4b selects the recent data to construct the GPR, which approximates the local fitness landscape around the current individual. Figure 4c selects all historical data to construct the GRNN, and this scheme learns the global information of the fitness landscape.

Fig. 4 Training data for constructing surrogate models; the sample surrounded by the red circle is the selected data. a The distribution of individuals as the generation increases. b The training data for constructing the GPR. c The training data for constructing the GRNN

Besides, given that the fitness values of different samples may differ greatly, we normalize the fitness values of the training data with a max-min scaler before surrogate model construction.
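A small sketch of the fitness normalization mentioned above, applied to the training labels before fitting either surrogate; the epsilon guard against identical values is an implementation detail added here for illustration.

```python
import numpy as np

def minmax_scale(y):
    """Max-min scaling of fitness values to [0, 1] before surrogate construction."""
    y = np.asarray(y, dtype=float)
    span = y.max() - y.min()
    return (y - y.min()) / (span + 1e-12)  # small constant avoids division by zero
```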

Infill sampling criterion

The infill sampling criterion, also known as the acquisition function [90], is adopted to estimate elite solutions on the surrogate model. Here, we use the expected improvement (EI) as the infill sampling criterion, which is formulated in Eq. (29)

$$\begin{aligned} \begin{aligned} \text {EI}(x) = \max \left( f\left( x^*\right) - f(x), 0\right) , \end{aligned} \end{aligned}$$
(29)

where x is the initial solution and f(x) is the expectation observed from the surrogate models. EI determines the next sampling solution \(x^*\) by maximizing the improvement over the initial solution. To search for \(x^*\) on the surrogate model, we simply implement random search as the search engine, and the pseudocode is shown in Algorithm 4.

Algorithm 4 Random search
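A minimal sketch of this random-search infill step: random candidates are drawn within the sub-component's bounds, scored by the surrogate, and the candidate maximizing the improvement criterion of Eq. (29) is returned for exact evaluation. The number of candidates is an assumed parameter of the sketch.

```python
import numpy as np

def random_search_infill(surrogate_predict, lb, ub, f_best, n_candidates=1000, rng=None):
    """Pick the candidate with the largest predicted improvement over the current best."""
    rng = rng or np.random.default_rng()
    cand = rng.uniform(lb, ub, (n_candidates, len(lb)))
    pred = surrogate_predict(cand)                    # surrogate estimate of f for each candidate
    improvement = np.maximum(f_best - pred, 0.0)      # Eq. (29)
    return cand[int(np.argmax(improvement))]
```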

In summary, the pseudocode of our proposed SEADE is shown in Algorithm 5.

Algorithm 5 SEADE

Numerical experiments

In this section, we implement numerical experiments to evaluate our proposal, SEADECC-EDDG. Section “Experiment settings” introduces the experiment settings, and Section “Experimental results” provides the experimental results of our proposal and the compared methods.

Experiment settings

Experimental environments and implementation

SEADECC-EDDG is programmed with Python 3.11 and implemented in Hokkaido University’s high-performance intercloud supercomputer, which is equipped with a CentOS operating system, Intel Xeon Gold 6148 CPU, and 384GB RAM. GRNN and GPR are provided by pyGRNN [91] and scikit-learn [92] libraries respectively.

Benchmark functions

We adopt the CEC2013 suite as our benchmark, which consists of 15 test functions in 5 categories:

(1) \(f_1\) to \(f_3\): fully separable functions;

(2) \(f_4\) to \(f_7\): partially separable functions with 7 non-separable parts;

(3) \(f_8\) to \(f_{11}\): partially separable functions with 20 non-separable parts;

(4) \(f_{12}\) to \(f_{14}\): functions with overlapping sub-problems;

(5) \(f_{15}\): fully non-separable function;

\(f_1\)-\(f_{12}\) and \(f_{15}\) are 1000-D functions, while \(f_{13}\) and \(f_{14}\) are 905-D. The standard CEC2013 setting allows 3,000,000 FEs for decomposition and optimization. Because our research focuses on LSEOPs, we limit the maximum FEs to \(100 \times D\), including both the decomposition and the optimization cost, where D is the dimension size; when D equals 1000, the maximum number of FEs is 100,000.

Compared methods and parameters

Two parts in our proposal need to be evaluated: decomposition and optimization. The compared algorithms are listed in Table 2.

Table 2 A summary of the algorithms under comparison

In the decomposition, the maximum scale of sub-components MS is 100. In the optimization, all decomposition techniques are equipped with DE, where the population size is set to 10, and the scaling factor and crossover rate are set to 0.8 and 0.5 respectively. All compared methods are run for 25 trials to alleviate the impact of randomness. Besides, to investigate the effectiveness of the surrogate ensemble technique, we implement an ablation experiment among DECC-EDDG, SADECC-EDDG with the local surrogate model (SADECC-EDDG-i), SADECC-EDDG with the global surrogate model (SADECC-EDDG-ii), and SEADECC-EDDG to analyze the efficiency of each individual surrogate model in practice.

Performance indicators

This section introduces the metrics to evaluate the performance of EDDG and SEADE.

Metrics for EDDG: We apply three metrics to evaluate the performance of EDDG: decomposition accuracy (DA) [16], consumed FEs, and the optimization performance relative to the compared techniques. Here, we explain the computation of DA:

Let sep and \(sep'\) be the separable decision variables identified by a decomposition method and by the ideal decomposition respectively. \(nonsep=\{g_1, g_2,...,g_n\}\) and \(nonsep'=\{g'_1, g'_2,...,g'_n\}\) are defined analogously for the non-separable groups.

For separable decision variables:

$$\begin{aligned} \begin{aligned} DA_{sep}=\frac{\vert sep \cap sep'\vert }{\vert sep' \vert } \end{aligned} \end{aligned}$$
(30)

For non-separable decision variables:

$$\begin{aligned} \begin{aligned} DA_{nonsep}=\frac{\sum _{g' \in nonsep}\vert g'\vert }{\sum _{g' \in nonsep'}\vert g'\vert } \end{aligned} \end{aligned}$$
(31)

To evaluate the optimization performance of the various decomposition techniques, we run all compared methods for 25 trials and calculate the mean and standard deviation (std). To identify statistical significance, the Kruskal–Wallis test is applied to the fitness values at the end of optimization. If significance exists, we further apply the Holm test, whose p-values are acquired from the Mann–Whitney U test. \(+\), \(\approx \), and − indicate that our proposal is significantly better than, not significantly different from, or significantly worse than the compared method, and the best value is shown in bold.

Metrics for SEADE To reveal the superiority and efficiency of introducing SEADE, we apply the Holm test to the 25 trial runs of DECC-EDDG, SADECC-EDDG-i, SADECC-EDDG-ii, and SEADECC-EDDG. The statistical information is provided as well.

Experimental results

Comparison on decomposition

Table 3 summarizes the decomposition results including DA and consumed FEs among RDG2, ERDG, MDG, and EDDG on CEC2013 benchmark functions.

Comparison on optimization

Table 4 summarizes the optimization results among compared decomposition techniques, and Table 5 lists the optimization results between DECC-EDDG, SADECC-EDDG-i, SADECC-EDDG-ii, and SEADECC-EDDG to evaluate the performance of SEADE.

Table 3 The DA and consumed FEs of RDG, RDG2, ERDG, MDG, and EDDG on CEC2013 suite
Table 4 Optimization results of DECC-RDG, DECC-RDG2, DECC-ERDG, DECC-MDG, and DECC-EDDG
Table 5 Optimization results of DECC-EDDG, SADECC-EDDG-i, SADECC-EDDG-ii, and SEADECC-EDDG

Discussion

In this section, we analyze the performance of our proposal SEADECC-EDDG and list some open topics for future research.

Performance analysis of EDDG

We evaluate the decomposition and optimization performance on four kinds of problems: \(f_1\) to \(f_3\) are fully separable, \(f_4\) to \(f_{11}\) are partially separable with different characteristics, \(f_{12}\) to \(f_{14}\) are overlapping functions, and \(f_{15}\) is a fully non-separable function.

On the fully separable functions \(f_1\) to \(f_3\), EDDG behaves similarly to the compared algorithms RDG, RDG2, ERDG, and MDG on \(f_1\) and \(f_2\): all decision variables are determined to be separable, 100% decomposition accuracy (DA) is achieved, and the optimization results among the compared decomposition techniques show no statistical significance. Meanwhile, we notice that all decomposition algorithms fail to detect the separability of \(f_3\) and place all separable decision variables into one component; we attribute this difficulty to the inherent property of Ackley's function, whose smooth and low-valued fitness landscape directly makes the threshold \(\varepsilon \) overly strict. A similar phenomenon also appears in \(f_6\), the shifted and rotated Ackley's function, which contains 700 separable variables: only RDG can identify a tiny portion of the separable variables, while the remaining decomposition techniques identify this 700-D separable component as non-separable, so the optimization still suffers from the curse of dimensionality. However, our proposed EDDG actively breaks these interactions and further decomposes the component into several sub-components, which reduces the search space exponentially and accelerates the convergence of optimization significantly. Thus, in the statistical analysis of the optimization on \(f_3\), our proposed DECC-EDDG is superior to the compared DECC-RDG, DECC-RDG2, DECC-ERDG, and DECC-MDG.

On the partially separable functions \(f_4\) to \(f_{11}\), EDDG achieves a similar but slightly degraded DA compared with RDG2, ERDG, and MDG. The degradation stems from the extra multiplicative interaction identification, which may classify an additively non-separable interaction as multiplicatively separable and split the corresponding sub-components. However, there is no statistical difference in optimization performance among DECC-RDG, DECC-RDG2, DECC-ERDG, DECC-MDG, and DECC-EDDG. Therefore, the proposed EDDG remains competitive with the compared decomposition algorithms on partially separable functions.
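As a minimal illustration of the multiplicative case (a toy example, not the exact EDDG identification rule), the sketch below applies the same additive DG quantity to a multiplicatively separable function and to its logarithm: the quantity is clearly non-zero in the original space but vanishes after the log transform, which is the kind of interaction the extra identification step targets.

```python
import numpy as np

def dg_lambda(f, x, i, j, delta):
    ei = np.zeros_like(x); ei[i] = delta
    ej = np.zeros_like(x); ej[j] = delta
    return abs((f(x + ei) - f(x)) - (f(x + ei + ej) - f(x + ej)))

# Multiplicatively separable toy: f = (1 + x1^2) * (1 + x2^2), so
# ln f = ln(1 + x1^2) + ln(1 + x2^2) is additively separable.
f = lambda x: (1.0 + x[0] ** 2) * (1.0 + x[1] ** 2)
x = np.array([1.0, 2.0])
print(dg_lambda(f, x, 0, 1, 0.5))                        # non-zero in the original space
print(dg_lambda(lambda z: np.log(f(z)), x, 0, 1, 0.5))   # ~0 after the log transform
```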

On the overlapping functions \(f_{12}\) to \(f_{14}\), RDG, RDG2, ERDG, and MDG correctly identify the interactions between decision variables and place them into a single sub-component. This decomposition severely degrades the optimization performance due to the curse of dimensionality. Meanwhile, our proposed EDDG largely sacrifices DA to achieve optimization acceleration. A demonstration of the relationship between DA and the optimization performance on overlapping functions is shown in Fig. 5.

Fig. 5 A simple example to illustrate the relationship between DA and the optimization performance for the overlapping function

\(f(x)=\sum ^{999}_{i=1}(x_i+x_{i+1})^2\), \(x_i \in [-5,5]\), is a 1000-D overlapping function. Suppose a decomposition algorithm \({\mathcal {A}}\) places all decision variables into one component, so its DA reaches 100%, while a decomposition algorithm \({\mathcal {B}}\) divides the 1000-D problem into two equal-sized sub-components, so its DA deteriorates to 50%. Nevertheless, the optimization under decomposition \({\mathcal {B}}\) may perform better than under \({\mathcal {A}}\): the search space of a sub-component produced by \({\mathcal {B}}\) is \(10^{500}\), whereas the search space of the single component produced by \({\mathcal {A}}\) is \(10^{1000}\), i.e., \(10^{500}\) times larger than the sub-component in \({\mathcal {B}}\). Therefore, balancing the effect of the curse of dimensionality against DA benefits solving LSEOPs based on the CC framework, which is the main reason why our proposed EDDG has a lower DA than RDG, RDG2, ERDG, and MDG and yet DECC-EDDG outperforms DECC-RDG, DECC-RDG2, DECC-ERDG, and DECC-MDG on \(f_{12}\) to \(f_{14}\).
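The following toy computation, under the stated assumption that each variable spans a range of width 10, compares the search-space sizes of the two decompositions on a log scale.

```python
import numpy as np

# Toy version of the overlapping function and the two decompositions above.
def f(x):                                   # x has 1000 components in [-5, 5]
    return np.sum((x[:-1] + x[1:]) ** 2)

dim, width = 1000, 10.0                     # each variable spans a range of width 10
log10_A = dim * np.log10(width)             # one 1000-D component: 10^1000
log10_B = (dim // 2) * np.log10(width)      # each 500-D sub-component: 10^500
print(log10_A, log10_B, log10_A - log10_B)  # A's component is 10^500 times larger
```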

On the fully non-separable function \(f_{15}\), EDDG also sacrifices DA in order to further decompose the relatively large sub-component, and the resulting deterioration of DA compared with RDG, RDG2, ERDG, and MDG can be observed in Table 3. However, the optimization of DECC-EDDG is significantly inferior to that of DECC-RDG, DECC-RDG2, DECC-ERDG, and DECC-MDG. In this situation, the accuracy of the decomposition of decision variables matters more to the optimization than the alleviation of the curse of dimensionality. We therefore re-emphasize the importance of balancing DA against the effect of the curse of dimensionality in a specific large-scale optimization task, which remains a challenging topic for our future research.

Performance analysis of SEADE

Ablation experiments are provided to investigate the efficiency of the surrogate ensemble technique, and the experimental and statistical results can be observed in Table 5. GPR, as a non-parametric regression model, is good at handling relatively low- and medium-dimensional regression tasks, while GRNN can theoretically approximate arbitrary high-dimensional problems. From these numerical experiments and the statistical summary, we conclude that the surrogate ensemble technique can significantly accelerate the convergence of the optimization in most cases, and that the cooperation of the global and local surrogate models outperforms a single surrogate model in practice. In the meantime, we notice an abnormal phenomenon: introducing the surrogate ensemble assistance significantly decelerates the optimization on \(f_{15}\), and we infer that this deterioration is caused by the poor decomposition of EDDG on \(f_{15}\). As observed in Table 4, our proposed EDDG is not efficient on the fully non-separable function \(f_{15}\), and this poor decomposition misleads the direction of the optimization. Although the surrogate ensemble technique can estimate high-quality elite solutions, the poor decomposition distorts the fitness landscape and thus degrades the real quality of the estimated solutions in the original fitness landscape.
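For concreteness, the sketch below outlines one possible realization of such an ensemble with NumPy and scikit-learn: a GRNN-style global model (Nadaraya–Watson regression with a Gaussian kernel) over the historical archive, a GPR local model over the current generation, and an EI criterion plus a random-search pick for estimating elite candidates. The bandwidth, kernel, and candidate-selection details are illustrative assumptions rather than the exact SEADE configuration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def grnn_predict(X_hist, y_hist, X_query, sigma=1.0):
    """Global surrogate: a minimal GRNN, i.e. Nadaraya-Watson regression with a
    Gaussian kernel over the historical archive; sigma is an assumed bandwidth."""
    d2 = np.sum((X_query[:, None, :] - X_hist[None, :, :]) ** 2, axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return (w @ y_hist) / np.maximum(w.sum(axis=1), 1e-12)

def expected_improvement(gpr, X_cand, y_best):
    """EI for minimization, computed from the local GPR's predictive mean/std."""
    mu, std = gpr.predict(X_cand, return_std=True)
    std = np.maximum(std, 1e-12)
    z = (y_best - mu) / std
    return (y_best - mu) * norm.cdf(z) + std * norm.pdf(z)

def estimate_elites(X_hist, y_hist, X_gen, y_gen, X_cand):
    """Pick two elite candidates from the pool X_cand: one by EI on the local
    GPR (current generation), one by random search over the global GRNN."""
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
    gpr.fit(X_gen, y_gen)
    elite_local = X_cand[np.argmax(expected_improvement(gpr, X_cand, y_gen.min()))]
    elite_global = X_cand[np.argmin(grnn_predict(X_hist, y_hist, X_cand))]
    return elite_local, elite_global
```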

Open topics for future research

The above experimental results and analysis show that our proposed SEADECC-EDDG is a promising approach to dealing with LSEOPs; however, there are still some open topics for future improvement:

Estimating promising sub-regions

The CC framework adopts decomposition techniques to investigate the interrelationships between decision variables and divides LSOPs into multiple sub-components, which alleviates the curse of dimensionality. Another technique, search space reduction (SSR), which limits the optimization to promising sub-regions, is also suitable for dealing with LSOPs. Here, we give a simple mathematical explanation. Suppose the dimension of the problem is N and the search range of each decision variable is a; the total search space is then \(a^N\). If an SSR technique can remove 10% of the search space in every dimension, the new search space becomes \((0.9a)^N\), and the proportion between the new and the original search space is \(0.9^N\). When N is [10, 100, 500, 1000], this proportion is \([34.86\%, 2.65e\)-\(03\%, 1.32e\)-\(21\%, 1.74e\)-\(44\%]\). As the dimension increases, the proportion decreases rapidly, and the optimization within the limited promising sub-regions can be significantly accelerated by the SSR technique. In the future, designing criteria for employing the SSR technique is a promising direction for tackling LSOPs.
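The proportions above can be reproduced (up to rounding) with a one-line loop:

```python
# Remaining fraction of the search space after trimming 10% in every dimension.
for N in (10, 100, 500, 1000):
    print(N, f"{100 * 0.9 ** N:.2e} %")
```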

The trade-off between decomposition and optimization

We point out that one conflict in solving LSEOPs based on the CC framework is the allocation of the computational budget between decomposition and optimization. In the more severe case where the computational budget cannot afford a full decomposition of the LSEOP by RDG-based techniques, balancing the resource allocation between decomposition and optimization becomes even more important. A feasible approach is to sacrifice some accuracy to realize the decomposition; its effectiveness has been proven by our experiments and reported in other publications [35]. Specifically, we first manually separate the LSEOP in order or at random, and then RDG-based methods can be applied to decompose each sub-component in parallel. A mathematical explanation is as follows: suppose the dimension of the problem is N and the time complexity of RDG-based methods is \(O(N\log (N))\). If we first divide the problem into k sub-components, the total time complexity becomes \(\sum _{i=1}^k O((N/k)\log (N/k))=O(N\log (N/k))\). Although the initial random or in-order grouping decreases the decomposition accuracy, the FEs needed for decomposition are reduced, which provides stronger scalability for tackling larger-scale EOPs.
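A quick numerical check of this complexity argument, under an assumed cost model of \(N\log _2 N\) FEs for an RDG-style decomposition:

```python
import math

N = 1000
cost = lambda n: n * math.log2(n)   # assumed O(n log n) FE cost of an RDG-style method

print(cost(N))                      # decompose the whole 1000-D problem at once
for k in (2, 5, 10):
    print(k, k * cost(N // k))      # k blocks of N/k variables: N * log2(N/k) in total
```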

Various interaction identification

Our proposed EDDG can detect both additive and multiplicative interactions between decision variables, whereas conventional DG/RDG-based techniques focus only on additive interaction determination. However, other kinds of interactions also exist, such as composite separability [81]. A simple example: if \(f(x)=U(g(x))\), where \(U(\cdot )\) is monotonically increasing over its domain and g(x) is a separable function, then f(x) is a compositely separable function. A remaining question is therefore whether composite separability can be detected with a DG-based mechanism, which is a meaningful topic for promoting the development of the CC community.
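A small numeric check of this example (a hypothetical toy with \(U=\exp \) and \(g(x)=x_1^2+x_2^2\)) shows that the purely additive DG quantity is non-zero, i.e., the standard test flags a spurious interaction; unlike the multiplicative case, no single fixed transform recovers additivity unless \(U\) is known, which is what leaves the question open.

```python
import numpy as np

def dg_lambda(f, x, i, j, delta):
    ei = np.zeros_like(x); ei[i] = delta
    ej = np.zeros_like(x); ej[j] = delta
    return abs((f(x + ei) - f(x)) - (f(x + ei + ej) - f(x + ej)))

# Compositely separable toy: f = exp(g), with U = exp monotone and g(x) = x1^2 + x2^2 separable.
f = lambda x: np.exp(x[0] ** 2 + x[1] ** 2)
print(dg_lambda(f, np.array([0.5, 1.0]), 0, 1, 0.5))   # non-zero: a spurious interaction flag
```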

Conclusion

This paper proposes a novel algorithm named SEADECC-EDDG to solve LSEOPs based on the CC framework. In the decomposition stage, we design EDDG, which inherits the efficiency of ERDG and the multiplicative interaction identification principle of DDG; its active interaction-breaking operation accelerates the optimization at the cost of decomposition accuracy. In the optimization stage, we introduce a surrogate ensemble assisted DE as the basic optimizer. The global surrogate model, constructed by a generalized regression neural network (GRNN) on historical samples, describes the characteristics of the whole fitness landscape, while the local surrogate model, constructed by Gaussian process regression (GPR) on the current generation of solutions, approximates the regularity of the fitness landscape around the recent samples. Elite solutions estimated from the surrogate models through a combination of expected improvement (EI) and random search are expected to accelerate the convergence of the optimization.

In the future, we will focus on introducing the search space reduction (SSR) technique to LSOPs, detecting various kinds of interactions within RDG-based frameworks, and solving more complex LSEOPs.