A new formulation of gradient boosting

In the setting of regression, the standard formulation of gradient boosting generates a sequence of improvements to a constant model. In this paper, we reformulate gradient boosting such that it is able to generate a sequence of improvements to a nonconstant model, which may contain prior knowledge or physical insight about the data generating process. Moreover, we introduce a simple variant of multi-target stacking that extends our approach to the setting of multi-target regression. An experiment on a real-world superconducting quantum device calibration dataset demonstrates that our approach outperforms the state-of-the-art calibration model even though it receives only a paucity of training examples. Further, it significantly outperforms a well-known gradient boosting algorithm, known as LightGBM, as well as an entirely data-driven reimplementation of the calibration model, which suggests the viability of our approach.


Introduction
Like many supervised learning algorithms in the setting of regression, the standard formulation of gradient boosting is entirely data-driven. That is, the boosting machine receives training examples, as well as hyperparameters, such as the fitting criterion, then it initializes its fit of an additive expansion with a constant model. In turn, the boosting machine improves upon the constant model based entirely on the evidence present in the training examples. Then, the boosting machine returns the fitted additive expansion, which is a linear combination of basis functions [1][2][3][4][5][6][7][8][9][10][11].
But in the sciences, and many other disciplines, we may already have a nonconstant model, which is based on domain expertise. So, it seems natural to suggest that we reformulate gradient boosting such that it is able to leverage an existing, possibly nonconstant, model, and improve upon the existing model based entirely on the evidence present in the predictions made by the existing model, as well as the corresponding observations.
In this paper, we fulfill this proposal. To do so, we supplant the constant term in the standard additive expansion with a real-valued single-source prediction term, and we make a corresponding modification to the boosting machine. That is, our boosting machine receives finitely many predictions from an existing model, as well as the corresponding observations and hyperparameters, then it initializes its fit of an additive expansion with a coordinate functional in place of a constant model, which ensures the initial additive expansion always makes the same prediction as the existing model. In turn, the boosting machine improves upon the existing model based entirely on the evidence present in the predictions made by the existing model, as well as the corresponding observations. Then, the boosting machine returns the fitted additive expansion, which requires a prediction from the existing model in order to generate its own more accurate prediction.
Interestingly, a similar idea appears in the setting of multi-target regression, whereby the conventional multi-target stacking approach forms a composition of real vector-valued functions such that the first compositional layer maps an instance from the original domain to a multi-target prediction, then the second compositional layer maps the multi-target prediction to a more accurate multi-target prediction. Accordingly, we introduce a simple variant of multi-target stacking that supplies a learning machine with finitely many multi-target predictions made by an existing model, as well as the corresponding multi-targets and hyperparameters. In turn, the learning machine prepares single-target training examples and invokes the boosting machine to fit an additive expansion for each single-target regression subtask. Then, the learning machine concatenates the fitted additive expansions and returns the resultant real vector-valued function, which predicts a multi-target, given a multi-target prediction from the existing model; see figure 1.

Figure 1. Conceptual representation of our learning framework. To the left of the red-dashed line, we start with finitely many multi-target predictions made by an existing model, as well as the corresponding multi-targets. To the right of the red-dashed line, a learning machine receives training examples, as well as hyperparameters. In turn, the learning machine prepares single-target training examples and invokes the boosting machine to fit an additive expansion for each single-target regression subtask. Then, the learning machine concatenates the fitted additive expansions and returns the resultant multi-target regressor, which predicts a multi-target, given a multi-target prediction from the existing model.
Notedly, multi-target stacking emanates from ensemble learning, and it is a form of regularization, or shrinkage, which is related to ideas on multivariate Gaussian mean estimation, early stopping, multiple linear regression, multi-task learning, and unsupervised pre-training of neural networks [12][13][14][15][16][17][18][19][20][21][22][23][24][25].
Our learning framework arose in the study of automated superconducting quantum device calibration. In this setting, a calibration procedure must infer a collection of control parameters that determine a superconducting quantum device's performance. To do so, the calibration procedure performs a sequence of experiments, where the result from one experiment becomes the input to the subsequent experiment, and so forth. A key experiment gives rise to a multi-target regression problem, where the low rate of calibration data acquisition makes it extremely difficult to outperform the state-of-the-art calibration model with an entirely data-driven approach [26][27][28]. Remarkably, the learning framework in this paper enabled us to leverage energy spectrum predictions made by the calibration model, as well as the measured energy spectrum, and improve upon the calibration model.
In the learning framework section, we review multi-target regression, then we describe the conventional multi-target stacking approach, as well as our variant of multi-target stacking. With regard to the latter, we review the standard formulation of gradient boosting, then we introduce our formulation of gradient boosting, as well as an augmented gradient boosting algorithm with a built-in model selection step.
In the benchmark task section, we review the calibration of a nearest-neighbor coupled linear array of superconducting qubits, then we report the results from an experiment on a real-world calibration dataset. With regard to the experiment, figure 5 shows an optical micrograph of the actual superconducting quantum device. Figures 8 and 9 show the performance of the state-of-the-art calibration model and our learning framework on test examples, where we measure accuracy with absolute and squared error, respectively. Figures 10 and 11 show the performance of our learning framework on test examples, when the embedded boosting machine is either an off-the-shelf gradient boosting algorithm, known as LightGBM, or our formulation of gradient boosting, where we measure accuracy with absolute and squared error, respectively [29].
In the explainable machine learning section, we describe an entirely data-driven reimplementation of the calibration model, then we apply the Shapley additive explanations approach to uncover parameter dependencies in the calibration model [30]. Figures 12 and 13 show the performance of the data-driven reimplementation and our learning framework on test examples, where we measure accuracy with absolute and squared error, respectively.

Learning framework
In this section, we begin with a review of the distribution-free setting of multi-target regression, wherein we assume that a multi-target loss function decomposes over the single-targets. Naturally, this leads us to discuss the independent model and conventional multi-target stacking approaches to multi-target regression, as well as our learning framework, which is a variant of the conventional multi-target stacking approach.

Multi-target regression
In the distribution-free setting of multi-target regression, let X denote the domain. We refer to elements of the domain as instances, and we regard instances as the input to a black box whose behavior we wish to model. If the domain is a Cartesian space with the usual definitions of vector addition and scalar multiplication, then we refer to the entries of an instance as features. Let Y ⊆ R^n denote the codomain, which is a Cartesian n-space with the usual definitions of vector addition, scalar multiplication, and inner product space structure. We refer to n-tuples in the codomain as multi-targets and to entries of multi-targets as single-targets, and we regard multi-targets as the output of the black box. As such, we begin with a sequence of ordered pairs, which we suppose to be random, and we frequently refer to an ordered pair as an example. Accordingly, our goal is to find a multi-target regressor f : X → Y, which is a rule that accurately assigns to each instance some multi-target. In order to formalize the meaning of accuracy, we define a mapping ℓ : Y × Y → R≥0, known as a multi-target loss function, and we denote the nonnegative loss of some multi-target regressor f on an example (X, Y) by ℓ(Y, f(X)). Typical examples of a multi-target loss function, such as absolute error (equation (2)) and squared error (equation (3)), decompose over the single-targets (equation (4)), where Y_j ∈ R denotes the jth single-target, f_j(X) denotes the jth entry of the image of X under the multi-target regressor f, and ℓ_j : R × R → R≥0 denotes the jth single-target loss function. Importantly, a multi-target loss function is a random variable, so it has an associated distribution function. In turn, we define the expected multi-target loss E_{X,Y}[ℓ(Y, f(X))] of the multi-target regressor f, which is equal to a sum of n expected single-target losses whenever we choose a multi-target loss function that satisfies equation (4).
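For concreteness, the decomposition property invoked above can be written out as follows (our reconstruction, consistent with the references to equations (2), (3), and (4) in the text):

```latex
\ell\big(Y, f(X)\big) \;=\; \sum_{j=1}^{n} \ell_j\big(Y_j, f_j(X)\big),
\qquad \text{with, for example,}
\qquad
\ell_j\big(Y_j, f_j(X)\big) = \big|\, Y_j - f_j(X) \,\big|
\quad \text{(absolute error)},
\qquad
\ell_j\big(Y_j, f_j(X)\big) = \big(\, Y_j - f_j(X) \,\big)^2
\quad \text{(squared error)},
```

so that, by linearity of expectation, the expected multi-target loss splits into a sum of n expected single-target losses.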
In any empirical work, we must estimate each expected single-target loss to obtain an estimate of the expected multi-target loss, since we cannot directly calculate an expectation value in the distribution-free learning model. Consequently, we split equation (1) into equation (6), where the quantity in parentheses estimates the expected jth single-target loss. In particular, if the multi-target loss function is absolute error (equation (2)), then the quantity in parentheses is known as the mean absolute error and equation (6) is known as the average mean absolute error. Similarly, if the multi-target loss function is squared error (equation (3)), then the quantity in parentheses is known as the mean squared error and equation (6) is known as the average mean squared error.
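The estimators just described can be sketched as follows (a minimal sketch; the function name is ours, and we take the average in "average mean absolute error" to be the average of the n per-target means):

```python
import numpy as np

def average_mean_error(Y, Y_hat, loss="absolute"):
    # Y and Y_hat have shape (m, n): multi-targets and multi-target
    # predictions. Each column j yields the mean jth single-target loss
    # (the quantity in parentheses); averaging the n per-target means
    # gives the average mean absolute (or squared) error.
    Y, Y_hat = np.asarray(Y, float), np.asarray(Y_hat, float)
    per_example = np.abs(Y - Y_hat) if loss == "absolute" else (Y - Y_hat) ** 2
    per_target = per_example.mean(axis=0)  # estimates each expected single-target loss
    return per_target.mean(), per_target
```
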

Conventional multi-target stacking
Let us begin with the independent model approach, which is a straightforward way to extend supervised learning algorithms to the setting of multi-target regression when they do not natively induce a real vector-valued function. That is, in this approach, a learning machine receives training examples, as well as hyperparameters. Then, for each single-target regression subtask, it slices single-target training examples and invokes an embedded learning algorithm to search a space of single-target regressors f_j : X → R to identify the best real-valued function in the space. Lastly, the learning machine concatenates the fitted single-target regressors into a multi-target regressor f̂ = (f̂_1, f̂_2, . . . , f̂_n). Thusly, it outputs the resultant multi-target regressor, which predicts a multi-target, given an instance, where each entry of the prediction is the image of the instance under a fitted single-target regressor.
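The independent model approach can be sketched as follows; `MeanRegressor` is a toy stand-in for any embedded learning algorithm, and all names are ours:

```python
import numpy as np

class MeanRegressor:
    # Toy single-target regressor: always predicts the training mean.
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

class IndependentModel:
    # Independent model approach: slice single-target training examples,
    # fit one single-target regressor per column of Y, then concatenate.
    def __init__(self, make_base=MeanRegressor):
        self.make_base = make_base

    def fit(self, X, Y):
        Y = np.asarray(Y, float)
        self.models_ = [self.make_base().fit(X, Y[:, j])
                        for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        # Each entry of the prediction is the image of the instance under
        # a fitted single-target regressor.
        return np.column_stack([m.predict(X) for m in self.models_])
```
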
To outperform the naïve independent model approach, the conventional multi-target stacking approach forms a composition of real vector-valued functions through consecutive invocations of the independent model approach. That is, in this approach, a learning machine receives training examples, as well as hyperparameters. Next, it induces a multi-target regressor f̂ with the independent model approach. Subsequently, the learning machine replaces each instance X^(i) in the training examples with either a prediction Ŷ^(i) = f̂(X^(i)) or a concatenation of a prediction and instance (Ŷ^(i), X^(i)), where i = 1, 2, . . . , m_train. Then, it induces a multi-target regressor ĝ with the independent model approach. Lastly, the learning machine composes the multi-target regressors. Thusly, it outputs the resultant composition of multi-target regressors, which predicts a multi-target, given an instance, by either Ŷ = ĝ(f̂(X)) = (ĝ_1(f̂(X)), ĝ_2(f̂(X)), . . . , ĝ_n(f̂(X))) ∈ Y, or Ŷ = ĝ(f̂(X), X) = (ĝ_1(f̂(X), X), ĝ_2(f̂(X), X), . . . , ĝ_n(f̂(X), X)) ∈ Y, according to the instance replacement strategy.
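The two consecutive invocations can be sketched as follows, with ordinary least squares per single-target as a toy stand-in for the embedded learning algorithm (all names are ours):

```python
import numpy as np

def fit_linear_layer(X, Y):
    # One invocation of the independent model approach: an ordinary
    # least squares fit per single-target, returned as a coefficient
    # matrix with one column per single-target regressor.
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W

def predict_linear_layer(W, X):
    X = np.asarray(X, float)
    return np.hstack([X, np.ones((len(X), 1))]) @ W

def conventional_stacking(X, Y, include_instance=False):
    # First compositional layer: instances -> multi-target predictions.
    W_f = fit_linear_layer(X, Y)
    P = predict_linear_layer(W_f, X)
    # Replace each instance with a prediction, or with a concatenation
    # of the prediction and the instance.
    Z = np.hstack([P, np.asarray(X, float)]) if include_instance else P
    # Second compositional layer: predictions -> corrected predictions.
    W_g = fit_linear_layer(Z, Y)

    def predict(X_new):
        P_new = predict_linear_layer(W_f, X_new)
        Z_new = (np.hstack([P_new, np.asarray(X_new, float)])
                 if include_instance else P_new)
        return predict_linear_layer(W_g, Z_new)
    return predict
```
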

Variant of multi-target stacking
In our variant of multi-target stacking, we compose an existing real vector-valued model and a real vector-valued function from a modified gradient boosting approach, where we bypass the formation of the first compositional layer. That is, in our approach, we start with data in the form of equation (1), and we wrangle it into an m × 2n design matrix (equation (7)), where X ∈ X ⊆ R^n denotes a multi-target prediction made by the existing model, Y ∈ Y ⊆ R^n denotes a multi-target, and X_j corresponds to the single-target prediction of Y_j. Next, the learning machine splits equation (7) into single-target training examples (equation (8)) and invokes an embedded boosting machine to induce a single-target regressor for each single-target regression subtask, where j = 1, 2, . . . , n. Lastly, the learning machine concatenates the fitted single-target regressors into a multi-target regressor f̂ = (f̂_1, f̂_2, . . . , f̂_n). Thusly, it outputs the resultant multi-target regressor, which predicts a multi-target, given a multi-target prediction from the existing model. In the next subsection, we review the boosting machine, which is the standard formulation of gradient boosting. Then, in the subsequent subsection, we introduce our modification to the boosting machine, as well as an augmented gradient boosting algorithm that includes a built-in model selection step, which allows us to specify the embedded boosting machine and complete the learning framework discussion.
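The wrangling and splitting steps can be sketched as follows (a minimal sketch; the function names are ours):

```python
import numpy as np

def wrangle_design_matrix(X_pred, Y):
    # Form the m x 2n design matrix of our variant: the first n columns
    # hold the existing model's multi-target predictions, the last n hold
    # the observed multi-targets, so column X_j pairs with single-target Y_j.
    X_pred, Y = np.asarray(X_pred, float), np.asarray(Y, float)
    return np.hstack([X_pred, Y])

def single_target_subtasks(D):
    # Split the design matrix into one training set per single-target
    # regression subtask, as the learning machine does before invoking
    # the embedded boosting machine.
    n = D.shape[1] // 2
    X = D[:, :n]                          # multi-target predictions
    return [(X, D[:, n + j]) for j in range(n)]
```
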

Boosting machine: the standard formulation of gradient boosting
From the foregoing, let V_j = R^X be the set of all single-target regressors under the natural operations of addition of two functions and multiplication of a function by a real number. We equip V_j with the following inner product, so it is a Hilbert space of single-target regressors, where ‖f_j‖ = √⟨f_j, f_j⟩ denotes the norm induced by the inner product. Let H_j ⊆ V_j denote a nonempty set of basis functions that is closed under scalar multiplication, where span(H_j) denotes the smallest subspace that contains the set of basis functions. Also, let l_j : V_j → R be an empirical loss functional, where the single-target loss function appearing in its definition is assumed to be amenable to gradient-based methods. Then, the goal of functional gradient descent is to minimize the empirical loss functional by taking steps in the direction of steepest descent, subject to the constraint that the single-target regressor must be a linear combination of well-defined basis functions. Notedly, this constraint guarantees that the resultant single-target regressor is a well-defined function; and, in what follows, we refer to the linear combination as an additive expansion.
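A reconstruction of the two displayed definitions, consistent with the surrounding text and with the empirical inner product standard in the functional gradient boosting literature, is:

```latex
\langle f_j, g_j \rangle
\;=\;
\frac{1}{m_{\mathrm{train}}} \sum_{i=1}^{m_{\mathrm{train}}}
f_j\!\big(X^{(i)}\big)\, g_j\!\big(X^{(i)}\big),
\qquad
\| f_j \| \;=\; \sqrt{\langle f_j, f_j \rangle},
```

and

```latex
l_j(f_j)
\;=\;
\frac{1}{m_{\mathrm{train}}} \sum_{i=1}^{m_{\mathrm{train}}}
\ell_j\!\Big(Y_j^{(i)},\, f_j\big(X^{(i)}\big)\Big).
```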
Before we derive the method of functional gradient descent, we must formulate the inner product between the gradient of the empirical loss functional at a single-target regressor and an arbitrary single-target regressor, since this mathematical object appears in a quadratic approximation that underlies the derivation. To do so, it is convenient to think of the derivative as a map from the space of functions on the Hilbert space with k continuous derivatives to the space of all maps from the vector space R^m_train to the space of all linear maps from the vector space R^m_train to R, where k is as large as necessary. So the gradient is a special case, from which we can formulate the gradient of the empirical loss functional, where (R^m_train)* denotes the dual space, which consists of linear maps from the vector space R^m_train to the ground field of real numbers R. Then, the gradient of the empirical loss functional at a single-target regressor is a linear functional. Hence, the Riesz representation theorem enables us to formulate the inner product between the gradient of the empirical loss functional at a single-target regressor and an arbitrary single-target regressor, where f_j, g_j ∈ V_j. Thus, we are ready to derive the method of functional gradient descent.
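Under the empirical inner product, the Riesz representation of the gradient gives the following pairing (our reconstruction, consistent with the surrounding definitions):

```latex
\big\langle (\nabla l_j)(f_j),\; g_j \big\rangle
\;=\;
\frac{1}{m_{\mathrm{train}}} \sum_{i=1}^{m_{\mathrm{train}}}
\left[
\frac{\partial\, \ell_j\big(Y_j^{(i)}, u\big)}{\partial u}
\right]_{u = f_j(X^{(i)})}
g_j\!\big(X^{(i)}\big),
\qquad f_j, g_j \in V_j .
```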
To begin the derivation, let f_{j,k} ∈ span(H_j) denote the current additive expansion, which follows from an integral number k of functional gradient descent updates to an initial additive expansion f_{j,0} ∈ H_j. Now, suppose we wish to improve upon the current additive expansion with a functional gradient descent update. To do so, we replace the objective function in the constrained optimization problem with a quadratic approximation that is equivalent to a second-order Taylor approximation of the empirical loss functional at the current additive expansion, where we make an assumption on the Hessian of the empirical loss functional at the current additive expansion, so that the Riesz representation theorem implies the rightmost summand. Notedly, Zheng et al put forth this line of reasoning in a coordinate-dependent way, then Grubb and Bagnell put forth this line of reasoning in a coordinate-independent way, where they study the convergence analysis of functional gradient descent [31, 32].
To complete the derivation, we manipulate the approximant into an equivalent, but more manageable, form, where r_{j,k} = −(∇l_j)(f_{j,k}) denotes the pseudo-residual. Then, we invoke a learning algorithm to fit a basis function to the pseudo-residual in the least squares sense, since the pseudo-residual does not generalize to out-of-sample instances, as pointed out by Friedman [7, 11]. Lastly, we append a weighted version of the fitted basis function to the current additive expansion to update the additive expansion.
To implement the method of functional gradient descent with the boosting machine, we fit an additive expansion through forward stagewise additive modeling, where α_{j,0} denotes a constant term, α_{j,k} denotes an expansion coefficient, b(X; θ_{j,k}) denotes a basis function, which is characterized by a set of parameters θ_{j,k}, and k = 1, 2, . . . , K_j. That is, fitting begins with a constant function f_{j,0}(X) = c, where the constant offset value c is usually equal to 0 or the optimal constant model argmin_c Σ_{i=1}^{m_train} ℓ_j(Y_j^(i), c). Next, a for loop executes the following statements k = 1, 2, . . . , K_j times:

(a) For i = 1, 2, . . . , m_train, compute the pseudo-residual and evaluate it at the current additive expansion.
(b) Invoke a learning algorithm to fit a basis function to the pseudo-residual and learn the set of parameters θ_{j,k}.
(c) Invoke an optimization algorithm to determine the expansion coefficient α_{j,k}.

Lastly, the boosting machine returns the fitted additive expansion f̂_j = f_{j,K_j}. In the following subsection, we show that the standard formulation of gradient boosting generates a sequence of improvements to a constant model. In turn, we show how to formulate gradient boosting so that it generates a sequence of improvements to an existing, possibly nonconstant, model. Then, we introduce an augmented gradient boosting algorithm, which ensures robustness of our modification to the boosting machine.
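Under squared error, steps (a), (b), and (c) can be sketched end to end; the stump basis family, the fixed expansion coefficient (shrinkage), and all names are our illustrative choices, not the paper's:

```python
import numpy as np

def fit_stump(x, r):
    # Least-squares regression stump on one feature: a toy basis-function
    # family that is closed under scalar multiplication.
    best = None
    for t in np.unique(x)[:-1]:            # a split must leave both sides nonempty
        left, right = r[x <= t], r[x > t]
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = float(((r - pred) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lo, hi = best
    return lambda x_new: np.where(np.asarray(x_new) <= t, lo, hi)

def boosting_machine(x, y, K=20, nu=0.1):
    # Standard formulation under squared error: initialize with the
    # optimal constant model, then for k = 1, ..., K
    # (a) compute the pseudo-residual at the current additive expansion,
    # (b) fit a basis function to it in the least squares sense, and
    # (c) append it with a fixed expansion coefficient (shrinkage) nu.
    f0 = float(y.mean())                   # optimal constant under squared error
    expansion = []
    pred = np.full(len(y), f0)
    for _ in range(K):
        r = y - pred                       # pseudo-residual for squared error
        b = fit_stump(x, r)
        expansion.append(b)
        pred = pred + nu * b(x)
    return lambda x_new: f0 + nu * sum(b(x_new) for b in expansion)
```
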

Modification to the boosting machine
To show that the standard formulation of gradient boosting generates a sequence of improvements to a constant model, we split the proof into two cases. In the first case, we assume the constant offset value is c = 0. As such, in iteration k = 1, the loop body updates the additive expansion to a weighted basis function. Then, in iteration k = 2, the loop body updates the additive expansion to a weighted basis function plus a weighted basis function, and so on. Thus, we deduce the following sequence of improvements to the constant zero model. In the second case, we assume the constant offset value is c ≠ 0, and without loss of generality, we suppose it is equal to a nonzero sample mean Ȳ_j = argmin_c Σ_{i=1}^{m_train} ℓ_j(Y_j^(i), c). As such, in iteration k = 1, the loop body updates the additive expansion to the sample mean plus a weighted basis function. Then, in iteration k = 2, the loop body updates the additive expansion to the sample mean plus a weighted basis function plus a weighted basis function, and so on. Thus, we deduce the following sequence of improvements to the constant nonzero model, which follows from Friedman's original formulation of LS_Boost, as well as Bühlmann and Yu's formulation of L2Boosting, with the caveat that an expansion coefficient α_{j,k} = ν corresponds to a shrinkage parameter 0 < ν ≤ 1, where k = 1, 2, . . . , K_j [7, 9, 10]. Therefore, we conclude that the standard formulation of gradient boosting always generates a sequence of improvements to a constant model.
To formulate gradient boosting so that it is able to generate a sequence of improvements to the existing model, we propose to fit an additive expansion of the form f_j(X) = π_j(X) + Σ_{k=1}^{K_j} α_{j,k} b(X; θ_{j,k}) through forward stagewise additive modeling, where π_j : X → X_j denotes the jth coordinate functional, which maps an instance to the jth single-target prediction. That is, fitting proceeds as before, with the caveat that it begins with the jth coordinate functional f_{j,0}(X) = π_j(X) = X_j in place of a constant function. As such, in iteration k = 1, the loop body updates the additive expansion to the jth single-target prediction plus a weighted basis function. Then, in iteration k = 2, the loop body updates the additive expansion to the jth single-target prediction plus a weighted basis function plus a weighted basis function, and so on; see the pseudocode in figure 2. Thus, we deduce equation (10).
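A minimal sketch of the modified initialization, again under squared error; the constant-correction basis family and the names are ours, chosen only to make the initialization f_{j,0}(X) = π_j(X) = X_j concrete:

```python
import numpy as np

def boost_from_existing(x_pred, y, K=50, nu=0.1):
    # Our modification, sketched under squared error: initialize with the
    # coordinate functional, i.e. the existing model's single-target
    # prediction, so the initial additive expansion agrees with the
    # existing model exactly. Each iteration then fits a basis function
    # to the pseudo-residual; here the basis functions are constant
    # corrections, the simplest family closed under scalar multiplication.
    x_pred, y = np.asarray(x_pred, float), np.asarray(y, float)
    pred = x_pred.copy()                   # f_{j,0} = coordinate functional
    shift = 0.0
    for _ in range(K):
        r = y - pred                       # pseudo-residual for squared error
        c = float(r.mean())                # least-squares constant basis fit
        shift += nu * c
        pred = pred + nu * c
    return lambda x_new: np.asarray(x_new, float) + shift
```

With K = 0 the returned regressor reproduces the existing model's predictions unchanged, which is the defining property of the initialization.
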
To ensure robustness of our modification to the boosting machine, we introduce an augmentation to Algorithm 1, which begins with a built-in model selection step; see the pseudocode in figure 3. That is, the built-in model selection step randomly partitions equation (8) into k non-overlapping folds, where k is typically a natural number between 5 and 10, inclusive; see figure 4 for an illustration. Next, it repeats the following two steps k times, with each of the withheld folds used exactly once as the validation examples:

• Of the k folds, withhold one for validation. Supply Algorithm 1 with the remaining k − 1 folds, as well as hyperparameters.
• Independently evaluate the incumbent single-target regression predictions made by the existing model and the candidate fitted additive expansion on the withheld fold from the previous step by computing the average withheld fold loss of the candidate and incumbent.

(We conjecture the existing model accelerates functional gradient descent.)

Figure 4. Illustration of an augmented 5-fold cross-validation procedure. We partition the single-target training examples (equation (8)) into five non-overlapping folds of equal size. In each iteration, we supply Algorithm 1 with four training folds, shown in light blue, then we independently evaluate the single-target regression predictions made by the existing model and the fitted additive expansion on the withheld fold, shown in light green. Lastly, we independently average their five withheld fold results, and we refer to these averages as the incumbent and cross-validation error, respectively.
Subsequently, the built-in model selection step independently averages the incumbent and candidate withheld fold results, and we refer to these averages as the incumbent and cross-validation error, respectively. Then, it selects the existing model, whenever the incumbent error is less than or equal to the cross-validation error. Namely, it breaks and returns the fitted single-target regressor f̂_j(X) = π_j(X) = X_j. Otherwise, it selects Algorithm 1, which means it returns the fitted single-target regressor f̂_j = f_{j,K_j}.
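The selection rule can be sketched as follows (our naming; absolute error stands in for the withheld-fold loss, and `fit_candidate` stands in for Algorithm 1):

```python
import numpy as np

def built_in_model_selection(x_pred, y, fit_candidate, k=5, seed=0):
    # k-fold cross-validate the candidate boosting fit against the
    # incumbent predictions of the existing model, then keep the
    # incumbent whenever its average withheld-fold absolute error is no
    # worse. fit_candidate(x, y) must return a callable regressor.
    x_pred, y = np.asarray(x_pred, float), np.asarray(y, float)
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    cand, incumbent = [], []
    for f in range(k):
        val = folds[f]
        train = np.concatenate([folds[g] for g in range(k) if g != f])
        model = fit_candidate(x_pred[train], y[train])
        cand.append(np.mean(np.abs(y[val] - model(x_pred[val]))))
        incumbent.append(np.mean(np.abs(y[val] - x_pred[val])))
    if np.mean(incumbent) <= np.mean(cand):
        return lambda x_new: np.asarray(x_new, float)  # keep the existing model
    return fit_candidate(x_pred, y)                    # else refit on all folds
```
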
To complete the learning framework discussion, let us recall the step in which the learning machine invokes an embedded boosting machine. Notedly, we intend embedded boosting machine to mean either Algorithm 1 or Algorithm 2.

Benchmark task
In this section, we review the calibration of a nearest-neighbor coupled linear array of superconducting qubits, then we report the results from an experiment on a real-world calibration dataset. In the experiment, we compare our learning framework against the state-of-the-art calibration model, which contains an explicit description of the model Hamiltonian, as well as against a version of our learning framework where the embedded boosting machine is an off-the-shelf gradient boosting algorithm, known as LightGBM, which initializes its fit of an additive expansion with a constant model [26][27][28][29].

Superconducting quantum device calibration procedure
In what follows, we study a nearest-neighbor coupled linear array of superconducting qubits with tunable qubit frequencies and inter-qubit interactions, where each qubit belongs to the span of the ground and first excited states of a nonlinear photonic resonator in the microwave regime. The total Hamiltonian of the nearest-neighbor coupled linear array is approximately described by the Bose-Hubbard model truncated at two local excitations (equation (12)), where n > 1 is the number of qubits, â† (â) is the bosonic creation (annihilation) operator, δ_j is the random on-site detuning, L is the on-site Hubbard interaction, and g_{j,j+1} is the hopping rate between nearest-neighbor sites.
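For reference, the truncated Bose-Hubbard Hamiltonian takes the standard form below, written with the symbols just defined (our reconstruction, up to sign and frequency conventions, of the model the text refers to as equation (12)):

```latex
\frac{\hat{H}}{\hbar}
\;=\;
\sum_{j=1}^{n} \delta_j\, \hat{a}_j^{\dagger} \hat{a}_j
\;+\;
\frac{L}{2} \sum_{j=1}^{n} \hat{a}_j^{\dagger} \hat{a}_j
\big( \hat{a}_j^{\dagger} \hat{a}_j - 1 \big)
\;+\;
\sum_{j=1}^{n-1} g_{j,j+1}
\big( \hat{a}_j^{\dagger} \hat{a}_{j+1} + \hat{a}_{j+1}^{\dagger} \hat{a}_j \big),
```

with the truncation restricting each site to at most two local excitations.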

Figure 5. Optical micrograph of the superconducting quantum device, which shows a nearest-neighbor coupled linear array of nine superconducting qubits and eight interleaving couplers, as well as twenty-six control lines and nine readout resonators. Notedly, the four leftmost qubits and couplers were idle and the n = 5 rightmost qubits and four couplers were operational during the generation of the calibration dataset. From [26]; reprinted with permission from AAAS.
The calibration objective is to learn how to transform time-dependent control pulses, which emanate outside of the cryostat, i.e. the cooling device that encloses the superconducting quantum device, to entries in a matrix representation of equation (12). To do so, a calibration procedure proceeds in two steps [27]. In the first step, the procedure calibrates room temperature time-dependent control pulses to arrive orthogonally, synchronously, and without distortion at the superconducting quantum device. In the second step, the procedure infers a finite number of control model parameters through three substeps. In the first substep, the procedure fits the two lowest transition energies of each qubit as a function of qubit and coupler flux-biases. In the second substep, it benchmarks the collective dynamics of the superconducting quantum device with a many-body Ramsey spectroscopy technique such that all of the qubits are coupled and near resonance with each other [26]. Then, in the third substep, it invokes a numerical optimizer to minimize the absolute error between the measured and sorted eigenenergies obtained in the previous substep and the energy spectrum predictions made by the control model. Thus, the third substep gives rise to a multi-target regression problem, which is the focus of the next subsection.

Experiment on calibration dataset
The calibration dataset pertains to the second step in the aforedescribed calibration procedure. Figure 5 shows the nearest-neighbor coupled linear array of nine superconducting qubits and eight interleaving couplers, where the four leftmost qubits and couplers were idle and the n = 5 rightmost qubits and four couplers were operational during the data collection process. The dataset comprises 136 of each of the following: a collection of five qubit and four coupler biases, which was the input to the optimized control model; an energy spectrum prediction, which was the output from the optimized control model; and an instance of the many-body Ramsey spectroscopy technique, which measured and ascendingly ordered five eigenenergies belonging to equation (12), when it describes a photon hopping in a disordered potential; see appendix A.
With regard to the learning framework, we wrangle the data into equation (7) with shape 136 × 10, where X ⊆ R^5 denotes the domain of spectroscopic predictions from the optimized control model, Y ⊆ R^5 denotes the codomain of measured and ascendingly ordered eigenenergies, and figure 6 shows pairwise correlations of the columns. We randomly split equation (7) into m_train = 95 and m_test = 41 rows such that equation (8) has shape 95 × (5 + 1) in each single-target regression subtask, where j = 1, 2, . . . , 5. We explain how to access the remaining implementation details in appendix D. Figure 7 depicts the built-in model selection step in Algorithm 2 with an augmented learning curve, where the single-target regression subtask is Y3 and we measure accuracy with absolute error, so the appropriate unit is megahertz (MHz). That is, the augmented learning curve shows the training, cross-validation, and incumbent error in blue, orange, and red, respectively, where the single-target training example sizes vary between 23 and 95 ordered pairs, inclusive, and the incumbent error bounds the cross-validation error from above by 0.87 MHz due to the model selection criterion. Evidently, the built-in model selection step tended to choose the optimized control model, when there were less than 51 ordered pairs, and it always chose Algorithm 1, when there were 51, or more, ordered pairs. In general, the built-in model selection step always chose Algorithm 1, when there were 60, or more, ordered pairs in any single-target regression subtask; see figures 18 and 19 in appendix B.

Figure 6. Pairwise correlations of the design matrix columns. The notation Xj refers to a single-target prediction from the state-of-the-art calibration model [26][27][28] and Yj refers to a single-target from an instance of a many-body Ramsey spectroscopy technique [26], where each spectroscopic instance sorts the eigenenergies belonging to equation (12) into ascending order: Y1 ≤ Y2 ≤ · · · ≤ Y5. Of note is the strong pairwise correlation between certain single-targets, such as Y1 and Y2, as well as single-target predictions and single-targets, such as X1 and Y2.

Figure 7. The built-in model selection step in Algorithm 2, depicted with an augmented learning curve, where the single-target training example sizes vary between 23 and 95 ordered pairs, inclusive, and the incumbent error bounds the cross-validation error from above by 0.87 MHz due to the model selection criterion. Evidently, the built-in model selection step tended to choose the state-of-the-art calibration model [26][27][28], when there were less than 51 ordered pairs, and it always chose Algorithm 1, when there were 51, or more, ordered pairs. Notedly, the training and cross-validation errors tend to increase and decrease, respectively, and both errors exhibit random fluctuations that eventually taper off.

Figure 8. Evaluation of the state-of-the-art calibration model [26][27][28] and our learning framework on test examples, where the embedded boosting machine is Algorithm 2 such that Algorithm 2 called Algorithm 1 in every single-target regression subtask. The bar plot shows that our learning framework performs better than the calibration model on the multi-target regression task, as well as every single-target regression subtask, where we measure accuracy with absolute error. Notedly, Roushan et al estimate the minimal average mean absolute error to be 1 MHz [26].

Figure 9. Evaluation of the state-of-the-art calibration model [26][27][28] and our learning framework on test examples, where the embedded boosting machine is Algorithm 2 such that Algorithm 2 called Algorithm 1 in every single-target regression subtask. The bar plot shows that our learning framework performs better than the calibration model on the multi-target regression task, as well as every single-target regression subtask, where we measure accuracy with squared error.
Prior to evaluation on test examples, we induce two multi-target regressors with our learning framework. That is, the first multi-target regressor is the resultant of Algorithm 2, as the embedded boosting machine, where Algorithm 2 calls Algorithm 1 with K j = 1 in every single-target regression subtask. The second multi-target regressor is the resultant of LightGBM, as the embedded boosting machine. Figures 8 and 9 compare the first multi-target regressor against the optimized control model on test examples, where we measure accuracy with absolute and squared error, respectively. Figures 10 and 11 compare the first multi-target regressor against the second multi-target regressor on test examples, where we measure accuracy with absolute and squared error, respectively. In summary, the first multi-target regressor outperforms the optimized control model, and they both significantly outperform the second multi-target regressor.
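For intuition, the core idea of boosting on top of a nonconstant model can be sketched for squared-error loss as follows; this is in the spirit of Algorithm 1 but not a verbatim reproduction, and `boost_prior`, `prior_predict`, and the tree base learner are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_prior(prior_predict, X, y, n_stages=1, learning_rate=0.1):
    """Sketch of gradient boosting whose additive expansion starts at a
    nonconstant prior model rather than a constant fit."""
    stages = []
    current = prior_predict(X)
    for _ in range(n_stages):
        residual = y - current  # negative gradient of squared-error loss
        h = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residual)
        stages.append(h)
        current = current + learning_rate * h.predict(X)

    def predict(X_new):
        out = prior_predict(X_new)
        for h in stages:
            out = out + learning_rate * h.predict(X_new)
        return out

    return predict
```

With n_stages = 1, the sketch mirrors the K j = 1 configuration above: a single correction term added to the prior model's predictions.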

Explainable machine learning
In this section, we describe an entirely data-driven reimplementation of the optimized control model, and we compare it against our learning framework. Then, we apply the Shapley additive explanations approach to uncover parameter dependencies in the optimized control model [30].

Figure 10. Evaluation of our learning framework on test examples, where the embedded boosting machine is either LightGBM [29] or Algorithm 2 such that Algorithm 2 called Algorithm 1 in every single-target regression subtask. The bar plot shows that our learning framework performs significantly better with our formulation of gradient boosting, where we measure accuracy with absolute error. Further, a comparison with figure 8 shows that LightGBM significantly degrades the energy spectrum predictions from the state-of-the-art calibration model [26][27][28]; and LightGBM is much more prone to overfitting in the single-target regression subtasks Y4 and Y5, which have the highest single-target sample standard deviations.

Figure 11. Evaluation of our learning framework on test examples, where the embedded boosting machine is either LightGBM [29] or Algorithm 2 such that Algorithm 2 called Algorithm 1 in every single-target regression subtask. The bar plot shows that our learning framework performs significantly better with our formulation of gradient boosting, where we measure accuracy with squared error. Further, a comparison with figure 9 shows that LightGBM significantly degrades the energy spectrum predictions from the state-of-the-art calibration model [26][27][28]; and LightGBM is much more prone to overfitting in the single-target regression subtasks Y4 and Y5, which have the highest single-target sample standard deviations.

Entirely data-driven reimplementation of the calibration model
In what follows, we refer to constituents from the calibration dataset. That is, we begin with data in the form of equation (1), where X (i) ∈ X ⊆ R 9 denotes an instance in the domain of qubit and coupler bias values, Y (i) ∈ Y ⊆ R 5 denotes a multi-target in the codomain of measured and ascendingly ordered eigenenergies, and i = 1, 2, . . . , 136, so m = 136. Next, we wrangle equation (1) into a design matrix with shape 136 × 14, and we split it into m train = 95 and m test = 41 rows such that the split indices preserve the association with our previous experiment on the calibration dataset. Subsequently, we induce a multi-target regressor with the independent model approach, where the training examples have shape 95 × 14, the single-target training examples have shape 95 × 10, and the embedded learning algorithm is a fully-connected neural network, which follows from model selection [34]; see appendix D.
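The independent model approach above can be sketched with scikit-learn, one single-target regressor per column; the data below are a synthetic stand-in rather than the calibration dataset, and the hidden-layer size and iteration budget are illustrative, not the selected hyperparameters (see appendix D).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic stand-in: 9 bias features, 5 ascendingly ordered targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(95, 9))
Y = np.sort(X @ rng.normal(size=(9, 5)), axis=1)

# Independent model approach: fit one fully-connected network per target.
model = MultiOutputRegressor(
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
)
model.fit(X, Y)
print(model.predict(X[:2]).shape)  # → (2, 5), one prediction per target column
```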
Prior to evaluation on test examples, let us recall the first multi-target regressor from the experiment on calibration dataset subsection, which is the resultant of Algorithm 2, as the embedded boosting machine, where Algorithm 2 calls Algorithm 1 with K j = 1 in every single-target regression subtask. Figures 12 and 13 compare the first multi-target regressor against the multi-target regressor from the independent model approach on test examples, where we measure accuracy with absolute and squared error, respectively. In summary, the first multi-target regressor significantly outperforms the multi-target regressor from the independent model approach.

Figure 12. Evaluation of an entirely data-driven reimplementation of the calibration model and our learning framework on test examples, where the embedded boosting machine is Algorithm 2 such that Algorithm 2 called Algorithm 1 in every single-target regression subtask. The bar plot shows that our learning framework performs significantly better than the data-driven reimplementation, where we measure accuracy with absolute error. Further, a comparison with figure 8 shows that the data-driven reimplementation performs significantly worse than the actual calibration model [26][27][28].

Figure 13. Evaluation of an entirely data-driven reimplementation of the calibration model and our learning framework on test examples, where the embedded boosting machine is Algorithm 2 such that Algorithm 2 called Algorithm 1 in every single-target regression subtask. The bar plot shows that our learning framework performs significantly better than the data-driven reimplementation, where we measure accuracy with squared error. Further, a comparison with figure 9 shows that the data-driven reimplementation performs significantly worse than the actual calibration model [26][27][28].

Shapley additive explanations approach
The Shapley additive explanations approach, known as SHAP, natively supports explainable machine learning in the setting of single-target regression. Accordingly, we supply SHAP with the single-target training examples from the previous subsection so that SHAP invokes a fully-connected neural network to induce a single-target regressor, as in the previous subsection. In turn, SHAP supplies an approximation algorithm with single-target training example predictions from the fitted single-target regressor. Then, the approximation algorithm returns nine SHAP values for each single-target training example prediction, where a SHAP value represents the importance of a qubit or coupler bias feature in a particular single-target training example prediction; we briefly review SHAP in appendix C. Figure 14 shows a summary plot, where the single-target regression subtask is Y1, the horizontal axis represents the SHAP value, the vertical axis represents the importance of a qubit or coupler bias feature, and the color represents the actual qubit or coupler bias value. Evidently, the most important feature is the qubit 8 bias and the second most important feature is the coupler 5/6 bias, where the former corresponds to the 8th qubit site near the physical boundary of the nearest-neighbor coupled linear array and the latter corresponds to the coupler between the 5th and 6th qubit sites near the artificial boundary of the nearest-neighbor coupled linear array; see figure 5. In comparison, the most important feature in the single-target regression subtasks Y3 and Y4 is the coupler 5/6 bias, and the most important feature in the single-target regression subtasks Y2 and Y5 is the coupler 8/9 bias; see figures 20 and 21 in appendix C.

Figure 14. SHAP summary plot for the single-target regression subtask Y1; see figures 20 and 21 in appendix C for the other single-target regression subtasks. The plot indicates the most important feature is the qubit 8 bias, and the least important feature is the coupler 8/9 bias. Moreover, the qubit 8 bias value tends to be high and the coupler 5/6 bias value tends to be low, when their SHAP value is positive. Similarly, the qubit 8 bias value tends to be low and the coupler 5/6 bias value tends to be high, when their SHAP value is negative.
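For intuition about what the approximation algorithm estimates, the exact Shapley value of each feature can be computed by brute force when there are only a few features. The sketch below uses one common baseline choice, replacing out-of-coalition features with fixed background values; the function and variable names are hypothetical, and SHAP itself approximates these sums far more efficiently.

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values for one prediction f(x): features outside a
    coalition S are replaced by their background (e.g. training-mean) values."""
    d = len(x)

    def v(S):
        z = background.copy()
        z[list(S)] = x[list(S)]
        return f(z)

    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            for S in combinations(others, size):
                # Shapley weight for a coalition of the given size.
                weight = factorial(size) * factorial(d - size - 1) / factorial(d)
                phi[i] += weight * (v(S + (i,)) - v(S))
    return phi
```

For a linear model f(z) = w · z with a zero background, the values reduce to w_i x_i, and they always satisfy the efficiency property: the values sum to f(x) minus f(background).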
To complete our study of the parameter dependencies in the optimized control model, let us recall that each instance of the many-body Ramsey spectroscopy technique measures and ascendingly orders the eigenenergies. Accordingly, we might expect that on average, over all instances, the feature dependence would be qualitatively similar in each single-target regression subtask. Indeed, under independent and identically distributed sampling of the input parameters, we would expect the data to exhibit a symmetry under permutation of the local bias and coupling parameters in equation (12). In line with this intuition, we observe a marked dependence on the coupler bias features closest to the physical or artificial boundaries in every single-target regression subtask. However, more generally, the permutation symmetry is broken in the calibration dataset, not least because the model consists of few sites and is patently not well approximated by closed boundary conditions. Some of the individual single-targets, for instance, have a stronger dependence on specific on-site biases than others. This suggests that different sites correlate more strongly with larger or smaller eigenenergies. An example is the aforementioned strong dependence of the single-target Y1 on the on-site bias at site 8. We attribute this to the geometry of the physical configuration, and we note that this asymmetric feature dependence is already present in the predictions made by the optimized control model.

Conclusion
We have developed a formulation of gradient boosting that is able to leverage domain expertise and improve upon an existing, possibly nonconstant, model. Our formulation is based on a modification to the additive expansion and a corresponding modification to the boosting machine. As a corollary of our work, we have established a template that enables the reformulation of other boosting algorithms [3,[35][36][37][38][39]. Additionally, we have introduced a variant of multi-target stacking that extends our approach to the setting of multi-target regression, where the results from our experiment indicate the viability of our approach, especially in the low data regime. In the future, we plan to evaluate our approach on other datasets. Also, it may be interesting to supplant the constant term in the additive expansion with a real-valued multiple-source prediction term and make the corresponding modification to a boosting algorithm.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/a-wozniakowski/scikit-physlearn.
To generate each multi-target in the calibration dataset, an instance of the many-body Ramsey spectroscopy technique begins by setting the parameters in equation (12) such that the on-site detuning is sampled uniformly in [−100, 100] MHz, the hopping rate is sampled uniformly in [0, 50] MHz, and the on-site Hubbard interaction is fixed at 0. Then, the many-body Ramsey spectroscopy technique iteratively applies the time-domain spectroscopy circuit shown in figure 15 to fully resolve the energy spectrum of equation (12) with randomly programmed parameters, where k = 1, 2, . . . , 5 denotes the choice of superposition qubit and readout resonator.
Namely, in the kth run of the time-domain spectroscopy circuit, every qubit starts in the fiducial state |0⟩ and there is no photon in the system. Next, a microwave pulse is applied to the kth qubit, e.g. k = 1 in figure 15, which places the kth qubit in a superposition of the standard basis and initializes a single photon in the system. Subsequently, the system evolves according to the time-independent Hamiltonian (equation (12)) with randomly programmed parameters. Then, a microwave pulse is applied to the kth qubit to measure either ⟨σX⟩ or ⟨σY⟩, and from the measurement of these observables the value ⟨σX⟩ + i⟨σY⟩ is instantiated.
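As a toy numerical illustration: if the system remains in the single-photon sector, the recorded value ⟨σX⟩ + i⟨σY⟩ is, up to state overlaps, a weighted sum of complex exponentials at the eigenfrequencies, so the eigenenergies appear as peaks in its Fourier spectrum. The energies, weights, and sampling window below are illustrative choices, not the device's.

```python
import numpy as np

# Illustrative eigenenergies (MHz) and overlap weights; NOT the device's values.
energies = np.array([5.0, 12.0, 19.0, 27.0, 41.0])
weights = np.array([0.30, 0.25, 0.20, 0.15, 0.10])

# Sample the signal <sigma_X> + i <sigma_Y> on a 4 microsecond window, chosen so
# every eigenfrequency falls exactly on an FFT bin (0.25 MHz resolution).
t = np.arange(2048) / 512.0  # microseconds
signal = (weights * np.exp(2j * np.pi * energies * t[:, None])).sum(axis=1)

# The peaks in the FFT magnitude mark the eigenenergies, which are then
# sorted into ascending order, as in the spectroscopy technique.
spectrum = np.abs(np.fft.fft(signal))
freqs = np.fft.fftfreq(t.size, d=t[1] - t[0])  # MHz
recovered = np.sort(freqs[np.argsort(spectrum)[-5:]])
```

In practice the window length is finite and the frequencies do not sit exactly on FFT bins, so real peak identification must contend with spectral leakage; the bin-aligned choice above merely keeps the sketch exact.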
Once k = 1, 2, . . . , 5 iterations of the time-domain spectroscopy circuit have been completed, there exists an ordered basis of initial states. As such, one can deduce the energy eigenstates through inner product computations. Then, the peaks in the fast Fourier transform of ⟨σX⟩ + i⟨σY⟩ are identified as the eigenenergies of the Hamiltonian, and these eigenenergies are sorted into ascending order such that Y1 ≤ Y2 ≤ · · · ≤ Y5.

To generate each multi-target prediction in the calibration dataset, the control model maps a collection of five qubit and four coupler bias features from an instance of the many-body Ramsey spectroscopy technique to the 5 × 5 single-photon block matrix in the representation of equation (12). Next, a numerical eigensolver produces five eigenenergy approximations, and they are sorted into ascending order such that X1 ≤ X2 ≤ · · · ≤ X5. Figure 16 depicts qubit and coupler bias feature boxplots from top to bottom, respectively, where Qubit j denotes the qubit bias corresponding to qubit site j and Coupler j/j + 1 denotes the coupler bias corresponding to the nearest-neighbor coupler for qubit sites j and j + 1. Figure 17 jointly depicts single-target predictions and single-targets.

Figure 15. Time-domain spectroscopy circuit from the many-body Ramsey spectroscopy technique. Initially, every qubit is in the fiducial state |0⟩, then a microwave pulse is applied to the kth qubit, which places the kth qubit in a linear combination of the standard basis. Subsequently, the quantum system evolves according to the time-independent Hamiltonian (equation (12)) with randomly programmed parameters. Then, a microwave pulse is applied to measure either ⟨σX⟩ or ⟨σY⟩, where σX and σY denote Pauli operators.

Figure 16. Boxplots of qubit and coupler bias features with training example indices.
Here Qubit j denotes the qubit bias corresponding to qubit site j and Coupler j/j + 1 denotes the coupler bias corresponding to the nearest-neighbor coupler for qubit sites j and j + 1.

Figure 18. Augmented learning curve for the single-target regression subtasks Y1 (top) and Y2 (bottom), where we measure accuracy with absolute error, so the appropriate unit is megahertz (MHz). We depict the built-in model selection step in Algorithm 2, where the single-target training example sizes vary between 23 and 95 ordered pairs, inclusive. Evidently, the built-in model selection step always chose Algorithm 1.

Figure 19. Augmented learning curve for the single-target regression subtasks Y4 (top) and Y5 (bottom), where we measure accuracy with absolute error, so the appropriate unit is megahertz (MHz). We depict the built-in model selection step in Algorithm 2, where the single-target training example sizes vary between 23 and 95 ordered pairs, inclusive. Evidently, the built-in model selection step tended to choose the state-of-the-art calibration model [26][27][28] when there were fewer than 60 ordered pairs, and it always chose Algorithm 1 when there were 60 or more ordered pairs in the single-target regression subtask Y4. In contrast, the built-in model selection step always chose Algorithm 1 in the single-target regression subtask Y5.

To access the other hyperparameters, visit the main_body.py file in the paper_results directory.
The off-the-shelf gradient boosting algorithm corresponds to lightgbm.LGBMRegressor, where the other hyperparameter choices are shown in the boost_wout_prior_knowledge.py file in the paper_results directory. The embedded learning algorithm in the entirely data-driven reimplementation of the calibration model corresponds to sklearn.neural_network.MLPRegressor, where the other hyperparameter choices are shown in the supplementary.py file in the paper_results directory. The training example sizes in figures 7, 18, and 19 correspond to 23, 25, 27, 29, 31, 32, 34, 36, 38, 40,