Optimal matching for heterogeneous treatment effect estimation

Advantages of the proposed framework for predicting heterogeneous treatment effects by matching


Introduction
Traditional causal inference is mainly carried out at the population level, focusing on the average treatment effect. However, this approach may fail to capture the nuanced heterogeneity in individual responses. Because treatments may elicit varying effects among individuals, it is important for researchers to infer heterogeneous treatment effects from individual characteristics, enabling them to identify the population that benefits. Exploring the heterogeneity of causal effects leads decision makers to appropriate judgments grounded in the characteristics of individuals. With deepening research on heterogeneous treatment effects, heterogeneity estimation plays an increasingly important role in precision medicine [1], marketing [2], public policy [3], and other fields that involve causal inference at the individual or subgroup level.
The main methods for inferring causality are randomized controlled trials and observational studies. Because randomized trials are time-consuming and costly, researchers often mine causality from observational data. In causal inference, we can generally observe only the potential outcome of an individual under a single treatment condition; we cannot directly observe the counterfactual outcome. The core problem of estimating heterogeneous treatment effects is how to obtain these counterfactual outcomes. In observational studies, missing counterfactual outcomes and confounding bias are the two major challenges in evaluating heterogeneous treatment effects. Many methods have emerged to estimate heterogeneous treatment effects from observational data, such as tree-based methods [4,5], matching methods [6][7][8], meta-learning methods [9,10], and neural networks [11]. Although the real treatment effect is unknown, how to estimate it and how to evaluate the quality of a model's estimates remain worthy of discussion.
Among the methods for estimating heterogeneous treatment effects, matching is a relatively straightforward and interpretable method, but it does not consider extrapolation regions where there are no reasonable matching pairs. Hence, a possible solution to this challenge is to employ a hybrid approach that combines matching with machine learning methods.
In this article, we adopt a combination of matching and machine learning methods to detect heterogeneous treatment effects, introducing a new method we call "TRIMATCH". Specifically, we obtain pseudo individualized treatment effects via matching and then fit a nonparametric regression model to these pseudo treatment effects. Different from previous literature on matching, TRIMATCH considers both the closeness of the match and the balance of the covariates from the perspective of average distance. This multiobjective optimization problem calls for a Pareto solution, which entails simultaneously minimizing the average multivariate distance of matched pairs and the average imbalance. The solution can be derived with the minimum average cost flow algorithm, a tried and tested approach. In contrast to previous bipartite matching methods, we adopt a new matching network structure: the tripartite graph proposed by Zhang et al. [12]. Based on this structure, Zhang et al. [12] used a minimum-cost k-flow algorithm to solve the multiobjective matching problem from the perspective of minimum total matching distance. They showed, theoretically and experimentally, that tripartite graph matching outperforms traditional bipartite graph matching, but their method focuses on fixed-ratio matching; that is, the number of matches per individual is fixed. Fixed-ratio matching is limited by its strong dependence on the sample distribution.
Drawing upon the tripartite graph, the objective of the TRIMATCH matching algorithm in this paper is to find the matching set with variable matching ratios that minimizes the average matching distance. Theoretical and experimental findings indicate that the minimum average matching distance algorithm surpasses its counterpart, the minimum total matching distance algorithm, by generating more matching combinations while keeping the average matching distance low [13]. By incorporating this tripartite structure, we can treat the matching goals separately and improve the quality of the matching algorithm. Compared with regression-based estimation, the new approach first balances the covariates between the treatment and control groups in the observed data via matching, which minimizes the impact of sample imbalance on subsequent predictions. Additionally, TRIMATCH is suitable for analyzing heterogeneous treatment effects with high-dimensional covariates.
The specific contributions of this article are as follows: (1) In terms of estimating heterogeneous treatment effects, we propose a data-driven method for estimating heterogeneous treatment effects from observational data by integrating matching methods from causal inference with machine learning techniques. This approach, called TRIMATCH, combines the strengths of both to produce a robust and accurate estimator. On the one hand, machine learning methods generalize well and can extrapolate reasonably from matched samples, which is difficult for traditional matching methods. On the other hand, machine learning methods are sensitive to the sample distribution, and some "black box" models lack interpretability. Integrating matching not only reduces the negative impact of sample imbalance on machine learning estimates but also improves the interpretability of the model. This study uses simulation experiments to verify the effectiveness and accuracy of TRIMATCH and proves theoretically that the method's estimation error has a tolerable upper bound. This work's integration of machine learning and traditional causal methods is conducive to the development of causal theory in the era of big data, and it provides a reference for exploring further combinations of the two.
(2) Regarding research on matching methods, this article considers the tradeoff between multiple matching objectives from the perspective of average cost and solves the resulting optimization problem with a minimum average cost flow method. In previous literature, researchers usually selected matching samples by minimizing the total matching distance. Gao et al. [7] proposed a single-objective matching method based on the minimum average cost flow algorithm. Building on that research, this article extends the perspective to multiobjective matching. We present theoretical and experimental results showing that the matching set obtained by the proposed method has superior properties compared to prior methods.
The overall structure of the article is as follows. In Section 2, we review methods for estimating heterogeneous treatment effects using matching techniques. Section 3 introduces a step-by-step approach to estimating heterogeneous treatment effects using TRIMATCH. In Section 4, we examine the impact of six different caliper settings on simulated datasets and show the proposed method's performance on them. In Section 5, the method is applied to real-world data and the results are analyzed. The paper concludes with a discussion in Section 6.

Related Work
Matching is an effective method for estimating treatment effects in observational studies; it pairs treated and control units so that the two groups have similar covariate distributions [6]. The match-based approach reduces the estimation bias introduced by observed confounders. The keys to solving the matching problem are choosing a distance metric to define "closeness" and a matching algorithm to implement the matching.
For distance metrics, the most straightforward way to match is exact matching on some discrete covariates, defining the distance as 0 when the covariates are equal and infinity otherwise. Since exact matching does not apply to continuous covariates, Iacus et al. [14] proposed coarsened exact matching (CEM), which converts continuous variables into ordered multicategorical variables within the exact matching framework. This method accounts for the extrapolation region where unmatched units exist. However, as the covariate dimension increases, matching exactly on multiple covariates or coarsened covariates becomes challenging due to computational complexity and data sparsity. For multivariate matching, Euclidean distance [15] or Mahalanobis distance [6] was typically selected in the previous literature. Numerous studies have shown that matching on the propensity score can reduce bias by ensuring balance on the variables highly correlated with treatment assignment [16,17]. The prognostic score can be considered an analog of the propensity score and can be used to produce balance on prognostically relevant variables [18]. Since both the propensity score and the prognostic score must be estimated, score model misspecification can result in low-quality matches. Compared to matching on either score alone, matching on both scores may improve the accuracy of treatment effect estimation and enhance robustness to score model misspecification [19,20]. In the matching process, previous publications recommend including as many confounding factors associated with treatment assignment and the outcome variable as possible, while excluding certain variables, such as post-treatment variables and instrumental variables, to improve the estimate's accuracy [6,21,22].
Various matching algorithms exist to achieve different matching goals. Given a selected distance measure, the most direct and simple method is nearest-neighbor matching. Unfortunately, when the sizes of the treatment and control groups are disproportionate, a one-to-one matching strategy can waste samples. Rubin [15] proposed 1-to-k (or k-to-1) matching, where k is a fixed value whose choice is generally contingent on the tradeoff between estimation bias and variance. If some individuals are spatially distant from their nearest neighbors in a given dataset, a fixed matching ratio will produce matches of inferior quality. Considering the uneven distribution of observational data, variable-ratio matching has been proposed to enhance bias reduction [23,24]. To acquire satisfactory matching pairs with certain desired properties, researchers usually impose constraints on the matching algorithms, such as fine or near-fine balance, exact or near-exact matching on crucial nominal variables, or a maximum number of matches per individual. Soft or hard constraints can be applied to filter out matching pairs that do not meet the criteria corresponding to the matching goal, and there are tradeoffs between different matching targets [25]. The problem of optimal matching satisfying specific properties can be expressed as a constrained optimization problem, solvable by mixed integer programming or network flow methods. Yu et al. [26] explored the connection between directional penalties and integer programming to improve covariate balance in a matched sample. Morucci et al. [27] used a mixed integer programming method to find hyper-box-shaped regions in which the treatment effect is roughly constant. Another approach is the network flow algorithm, in which the matching problem is usually reformulated as a minimum cost flow problem [28,29].

Overall design of the framework
We follow the potential outcomes framework proposed by Donald Rubin [30]. Suppose we have n independent observations (X_i, W_i, Y_i), i = 1, …, n, where Y_i is the observed outcome, W_i ∈ {0, 1} is a binary treatment of interest, and X_i is a P-dimensional vector of covariates. Each unit's potential outcomes are (Y_i(0), Y_i(1)), whose components represent the outcomes of unit i when assigned to the control group and the treatment group, respectively. The observed outcome can be modeled as Y_i = W_i Y_i(1) + (1 − W_i) Y_i(0) [31].
For each unit i, the individual treatment effect (ITE) is defined as τ_i = Y_i(1) − Y_i(0). Since the two potential outcomes of the same individual cannot be observed simultaneously, the true individual treatment effect is unknown. However, it is possible to estimate the average treatment effect among individuals with the same vector of covariates x, referred to as the conditional average treatment effect (CATE) and defined as τ(x) = E[Y(1) − Y(0) | X = x]. Under the following classical assumptions, the CATE can be estimated to capture the heterogeneity of causal effects within the population.
Assumption 1 (Stable Unit Treatment Value Assumption). The potential outcomes of any unit do not depend on the treatments assigned to other units, and there are no different versions of the treatment.
Assumption 2 (Unconfoundedness). Given the covariates, treatment assignment is independent of the potential outcomes for each unit: (Y(0), Y(1)) ⊥ W | X.

Matching as a minimum average cost flow problem
The matching problem is usually modeled as an optimization problem subject to various constraints, and it can be solved in various ways, such as by network optimization techniques. Motivated by Zhang et al. [12], we adopt a tripartite graph technique to construct the matching network structure rather than using a bipartite graph. In contrast to classical bipartite matching, tripartite matching can achieve the goals of matching proximity and covariate balance simultaneously without conflict. The matching problem can be transformed into a minimum average cost flow problem based on a tripartite graph, which can then be solved by general algorithms for that problem.

Tripartite matching structure
Considering the general structure of the network, its basic framework comprises a set of vertices V and a set of edges E. Each edge is of the form (u, v), representing the edge between vertices u and v. In the left part of the tripartite graph, there are T treated subjects, denoted by t_1, …, t_T, and C control subjects, denoted by c_1, …, c_C, where typically T < C. Each subject (treated or control) has a duplicate in the right part of the tripartite graph, denoted by t′_i or c′_j. In addition to the above vertices, a source and a sink exist at the beginning and end of the graph, respectively. The structure of tripartite matching is shown in Fig. 1.
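To make the structure concrete, the following sketch assembles the tripartite network as a plain Python adjacency list. The node names (`source`, `t0`, `c'0`, `sink`, and so on) and the dictionary representation are our own illustrative choices, not notation from the paper.

```python
# Illustrative tripartite matching network as an adjacency list.
# Left part: treated t0..t(T-1) and controls c0..c(C-1);
# right part: duplicates c'j and t'i; plus a source and a sink.

def build_tripartite(T, C):
    """Return an adjacency list {node: [neighbors]} for the tripartite graph."""
    g = {"source": [f"t{i}" for i in range(T)]}
    for i in range(T):
        # in the left part, every treated unit may be paired with any control
        g[f"t{i}"] = [f"c{j}" for j in range(C)]
    for j in range(C):
        # each control is linked to its duplicate in the right part
        g[f"c{j}"] = [f"c'{j}"]
        # each control duplicate may be paired with any treated duplicate
        g[f"c'{j}"] = [f"t'{i}" for i in range(T)]
    for i in range(T):
        g[f"t'{i}"] = ["sink"]
    g["sink"] = []
    return g

g = build_tripartite(T=2, C=3)
```

In a full implementation, each edge would additionally carry a cost and a capacity; here only the connectivity pattern of Fig. 1 is shown.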

Distance metric
To ensure the efficacy of matching estimators, matching designs generally aim to achieve covariate balance and matching closeness within an appropriate matching ratio. The former ensures that the distribution of covariates observed in the treatment group is similar to that in the matched control group, making it closer to a randomized study. The latter requires close pairing for key observed covariates, which can reduce the estimation bias brought by confounders.
Our proposed matching algorithm pursues both of these matching goals simultaneously. The cost of an edge in the left part of the graph serves as the metric of pairing proximity, and the cost of an edge in the right part serves as the metric of covariate imbalance. A penalty term achieves a tradeoff between the two distances.
Regarding close matching, when the covariates are high-dimensional, it is not necessary to match on all covariates, although Glazerman et al. [32] suggest including as many variables as possible that are highly predictive of outcomes and treatment assignment. Meanwhile, Pearl [33] advises excluding instrumental variables that affect only the treatment assignment, because they tend to amplify the bias of treatment effect estimators. We implement feature selection via random forests to screen for the covariates that yield the smallest mean squared error in predicting the outcome variable. For the multivariate distance, the Mahalanobis distance or robust Mahalanobis distance is typically used because the Mahalanobis distance is scale-invariant and accounts for the correlation between variables in multivariate settings, making it superior to the Euclidean distance. When the key covariates are all nominal variables, we can also use the Hamming distance as the distance measure or apply exact matching.
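As a small illustration of the multivariate distance, the sketch below computes the Mahalanobis distance between two units from an estimated inverse covariance matrix; the data are synthetic and the function name is our own.

```python
import numpy as np

def mahalanobis(x, y, VI):
    """Mahalanobis distance between vectors x and y, given inverse covariance VI."""
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ VI @ d))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # pooled covariate sample
VI = np.linalg.inv(np.cov(X, rowvar=False))      # inverse sample covariance
d = mahalanobis(X[0], X[1], VI)
```

With VI equal to the identity matrix, the same function reduces to the Euclidean distance, which is why the Mahalanobis distance can be seen as a correlation-aware generalization of it.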
Regarding balance matching, the propensity score is a balancing score that refers to the probability of a subject receiving the treatment given the covariates, defined as e(x) = P(W = 1 | X = x). Matching on the propensity score tends to produce balance between the treated and untreated groups, consequently removing bias that can arise from covariates that are highly predictive of treatment assignment.
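A minimal sketch of propensity score estimation, assuming a logistic model fitted by plain gradient ascent; the simulated data, hyperparameters, and function name are illustrative assumptions rather than the paper's implementation (which uses logistic regression in R).

```python
import numpy as np

def fit_propensity(X, W, n_iter=500, lr=1.0):
    """Estimate e(x) = P(W = 1 | X = x) by logistic regression (gradient ascent)."""
    X1 = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))     # current predicted probabilities
        beta += lr * X1.T @ (W - p) / len(X)     # average log-likelihood gradient
    return 1.0 / (1.0 + np.exp(-X1 @ beta))

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
true_e = 1.0 / (1.0 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))  # assumed true model
W = rng.binomial(1, true_e)
e_hat = fit_propensity(X, W)
```

The estimated scores can then be used both for the imbalance distance on the right part of the graph and for caliper restrictions.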

Minimum average cost flow
The network flow optimization algorithm is an important method for solving the variable-ratio matching problem. Unlike matching algorithms based on minimum cost flow, our objective is to minimize the average distance of the matches, which corresponds to the minimum average cost flow optimization problem. As Gao et al. [7] showed, a matching method that minimizes the average match distance yields more matched pairs and a lower average distance between pairs than any method that minimizes the total match distance.
The minimum average cost flow problem was put forward by Y. L. Chen [34] to find the feasible flow from a specified node to another node with the minimum average transportation cost; it is essentially a problem of flow allocation and route planning for a given amount of flow. Let G = (N, A) be a network with node set N, edge set A, and fixed cost F. For any edge (i, j) in A, the transportation cost per unit flow is c_ij, the capacity is u_ij, and the flow is f_ij. Let v denote the total flow from the source to the sink. The minimum average cost flow problem can then be formulated as:

minimize (F + Σ_{(i,j)∈A} c_ij f_ij) / v, subject to flow conservation at every intermediate node and 0 ≤ f_ij ≤ u_ij. (1)

The minimum cost flow problem is essentially a linear programming problem and can be solved by general algorithms from the previous literature. However, the minimum average cost flow problem is a nonlinear programming problem for which no general solution is currently known. Following Y. L. Chen [34], when the marginal cost equals the average cost, the corresponding flow is the optimal solution of the minimum average cost flow problem. We use binary search to improve efficiency, and we find the optimal solution by reformulating the minimum average cost flow problem into a minimum cost flow problem given the optimal feasible flow value.
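The binary-search idea can be illustrated on a deliberately simplified variable-ratio matching problem in which each treated unit takes between 1 and kappa controls and controls may be reused (a simplification of the full network model; the cost matrix and the greedy feasibility check are our own illustrative construction). The key fact is that the minimum average cost is at most λ exactly when some feasible matching has nonpositive total adjusted cost Σ(d − λ) over its pairs.

```python
# Parametric (binary-search) solution of a toy minimum average cost matching.
# dist[i][j] is the cost of pairing treated unit i with candidate control j.

def min_adjusted_cost(dist, kappa, lam):
    """Min of sum(d - lam) over feasible matchings: per treated unit, take the
    cheapest adjusted pair (mandatory), then any further pairs with d < lam."""
    total = 0.0
    for row in dist:
        adj = sorted(d - lam for d in row)
        total += adj[0]                            # at least one match is required
        total += sum(a for a in adj[1:kappa] if a < 0)
    return total

def min_average_cost(dist, kappa, tol=1e-9):
    """Binary search for the smallest achievable average matching cost."""
    lo, hi = 0.0, max(max(row) for row in dist)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if min_adjusted_cost(dist, kappa, mid) <= 0:
            hi = mid                               # average cost <= mid is achievable
        else:
            lo = mid
    return hi

dist = [[1.0, 4.0, 6.0], [2.0, 3.0, 9.0]]
lam_star = min_average_cost(dist, kappa=2)
```

In the full method, the inner feasibility check is a minimum cost flow computation on the tripartite network rather than this per-row greedy step, but the monotone binary search over the candidate average cost is the same.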
Based on the network flow model, the problem of finding the minimum average matching distance can be transformed into a minimum average cost flow problem. In the network structure depicted in Fig. 1, the left part of the graph consists of the edges from the source to the treated units and from the treated units to the controls, while the right part comprises the edges from the control duplicates to the treated duplicates and from the treated duplicates to the sink. Since the objective is the optimal solution that minimizes the matching cost, we only need to consider the cost between the control group and the treatment group: in the tripartite graph, the cost of an edge between a treated unit and a control corresponds to their matching distance, and the cost of all other edges is 0. The flow on an edge between a treated unit and a control indicates whether the pair connected by that edge is matched. The minimum average cost flow problem can be reformulated as:

minimize (Σ_{i,j} d₁(i, j) f_ij + λ Σ_{i,j} d₂(i, j) g_ij) / v, (2)

subject to constraints (3)–(6) below, where f_ij and g_ij denote the flows on the treated–control edges in the left and right parts of the graph, d₁ is the proximity distance, d₂ is the imbalance distance, λ is the penalty term, and v is the total flow. The objective function contains two distance metrics representing the costs of the two parts of the network structure: the left part prioritizes pairing individuals who are close on key covariates, while the right part gives preference to matching groups with smaller propensity score gaps. The former determines the matching set, while the latter forces the matching set to satisfy covariate balance. Hence, the resulting match is represented by the feasible flow on the treated–control edges in the left part. In real-world observed data, the number of control units is typically larger than the number of treated units, and researchers are usually more concerned with the treatment effects on the treated in practical applications. Thus, in the matching process, every unit in the treatment group is guaranteed to be matched, while some control units are allowed to remain unmatched. Constraint (3) ensures that each treated unit in the left part is matched with at least one and at most κ control units. Constraint (4) ensures that each treated duplicate in the right part is likewise matched with at most κ control units.
There is a tradeoff between matching accuracy and the number of matched pairs in the selection of the maximum number of matches per individual (denoted κ): a value of κ that is too large will increase the bias of the estimate and reduce computational efficiency, while an overly small value may impractically decrease the number of pairs or, worse, leave the objective with no feasible solution. Following Brito et al. [35], we choose κ as the maximum number of matches per individual. Constraint (5) ensures that the flow out of the source equals the flow into the sink, and constraint (6) guarantees that the inflow and outflow of each unit other than the source and sink are equal; that is, the net flow is 0. Proposition 1. If there is a feasible flow in the abovementioned network structure, then: i. The flow through the left part of the graph equals the flow through the right part; that is, the two parts of the network generate equal numbers of pairs.
ii. For the same control unit in and , the number of pairs is equivalent, but the treated units paired with the control unit are not necessarily identical.
iii. The total distance of the optimal solution generated by tripartite matching is less than or equal to the minimum total distance generated by bipartite matching.
Proof. Constraint (5) requires the outbound flow from the source to equal the inbound flow at the sink, so claim (i) holds. Constraint (6) implies that, for each control unit, the flow out of its node in the left part equals the flow into its duplicate in the right part, and each such flow is 1 or 0; hence, for the same control unit in the two parts, the numbers of pairs are equal, while the treated units paired with the control unit are not necessarily identical, which supports claim (ii). If we force the same control unit in both parts to match the same set of treated units by adding equality constraints on the corresponding flows, the problem becomes equivalent to bipartite matching. In other words, under the same conditions, bipartite matching is more stringent than tripartite matching. Therefore, the optimal solution of bipartite matching must be a feasible solution of tripartite matching, but it is not necessarily the optimal solution of tripartite matching. It follows that the minimum average cost of tripartite matching is at most the minimum average cost of bipartite matching, which proves claim (iii).
As Korte et al. [36] showed, the algorithmic complexity of the minimum cost flow problem depends on the complexity of the network structure: a network with n nodes and m edges is dense if m = Θ(n²) and sparse if m = O(n). The minimum average cost flow problem takes O(T(m, n) log U) arithmetic operations (see Y. L. Chen [34], Lemma 9), where U is the largest edge capacity in our network and T(m, n) is the time required to solve a minimum cost flow problem on a network with m edges and n nodes. In our network structure, we can reduce the complexity of the algorithm by increasing the sparsity of the graph; in other words, incorporating constraints can reduce the number of edges in the network. One viable approach to sparsifying the network is to force individuals into exact matches on key nominal variables. Another is to impose a propensity caliper restriction on potential matches, that is, to discard matching pairs whose propensity score gap exceeds a given caliper threshold.
Depending on the matching objective, calipers of different types and widths can be chosen. In previous literature, the caliper is generally 0.2 pooled standard deviations of the estimated propensity or prognostic score. Leacy et al. [19] showed that models based on the joint use of propensity and prognostic scores perform better than single-score models. In our network structure, different calipers can be used in the two parts of the network, with a loose caliper in the left part and a tight caliper in the right part. To remove ineligible edges from the network, we can apply hard constraints by setting the cost of the edges that do not meet the requirements to infinity or setting the capacity of the undesirable edges to 0. Alternatively, soft constraints remove ineligible edges by penalizing their costs. Compared with hard constraints, the conditions of soft constraints are less stringent, and the methods employed to satisfy them are more adaptable; the problem may remain feasible even when soft constraints are violated, making them the preferable choice in this paper. Nevertheless, when dealing with large volumes of data, hard constraints can help simplify the model and increase the efficiency of the algorithm.
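A hard propensity caliper can be sketched as follows. The 0.2 pooled standard deviation width follows the text, while the function name, the pooling formula (average of the two group variances), and the toy scores are our own illustrative assumptions.

```python
import numpy as np

def caliper_mask(e_treated, e_control, width=0.2):
    """Boolean matrix: True where pair (i, j) survives the propensity caliper.

    Pairs whose propensity gap exceeds `width` pooled standard deviations are
    removed (in the network, their edge capacity would be set to 0)."""
    e_t = np.asarray(e_treated)
    e_c = np.asarray(e_control)
    pooled_sd = np.sqrt((e_t.var(ddof=1) + e_c.var(ddof=1)) / 2)  # assumed pooling
    gap = np.abs(e_t[:, None] - e_c[None, :])
    return gap <= width * pooled_sd

e_t = np.array([0.6, 0.7, 0.8])           # treated propensity scores (toy)
e_c = np.array([0.55, 0.35, 0.78, 0.2])   # control propensity scores (toy)
mask = caliper_mask(e_t, e_c)
```

A soft version would instead add a penalty proportional to the excess gap to the edge cost, leaving the edge in the network.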

CATE estimation
Based on the matching result, each unit's counterfactual outcome can be estimated by the mean outcome of the individuals in its matching set. Let M(i) be the matching set of unit i and |M(i)| the number of matches of unit i:

Ŷ_i(1 − W_i) = (1 / |M(i)|) Σ_{j ∈ M(i)} Y_j.

Let τ̃_i be the proxy of the individual treatment effect, which can be computed as:

τ̃_i = (2W_i − 1)(Y_i − Ŷ_i(1 − W_i)).

Using the potential outcomes framework, Rosenbaum et al. [16] proved that, under the unconfoundedness assumption, W and X are also conditionally independent given the propensity score. Therefore, we can derive an estimator of the conditional average treatment effect conditional on X_re and the propensity score e(X), where X_re refers to covariates relevant to the outcome and X_ir refers to covariates irrelevant to the outcome.
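A minimal sketch of the pseudo-effect computation; the sign convention (2W_i − 1)(Y_i − counterfactual), the toy outcomes, and the matching sets below are our own illustrative assumptions.

```python
import numpy as np

def pseudo_ite(Y, W, match_sets):
    """Pseudo individual treatment effects from a variable-ratio match.

    Each unit's counterfactual is the mean outcome of its matching set;
    the sign flips for control units so the proxy always estimates Y(1) - Y(0)."""
    tau = np.empty(len(Y), dtype=float)
    for i, m in match_sets.items():
        counterfactual = np.mean([Y[j] for j in m])
        tau[i] = (2 * W[i] - 1) * (Y[i] - counterfactual)
    return tau

Y = np.array([5.0, 3.0, 1.0, 2.0])
W = np.array([1, 1, 0, 0])
# matching sets pair each unit with opposite-group units (toy example)
match_sets = {0: [2, 3], 1: [2], 2: [1], 3: [0, 1]}
tau = pseudo_ite(Y, W, match_sets)
```

For the treated unit 0, for example, the counterfactual is the mean of its matched controls, (1.0 + 2.0) / 2 = 1.5, so its pseudo effect is 5.0 − 1.5 = 3.5.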
To estimate the conditional average treatment effect of each unit, we fit an XGBoost regression tree to the pseudo individual treatment effects, which serve as proxies for the true individual treatment effects. The regressors include the key covariates previously screened for matching, which are highly predictive of the outcome variable; covariates uncorrelated with the outcome were already eliminated via random forests in the matching step. The propensity score summarizes the information highly predictive of the treatment indicator. Using the full set of covariates for regression often invites the curse of dimensionality; removing variables irrelevant to the outcome both reduces the dimensionality of the predictors and avoids the influence of redundant variables. Because the propensity score is a balancing score, incorporating it among the predictors helps control confounding and improves prediction accuracy. Moreover, nonparametric regression on the propensity score resembles one-to-one matching on the propensity score, establishing a one-to-one mapping relationship.
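The regression step can be sketched as below. The paper fits XGBoost; here we assume scikit-learn's GradientBoostingRegressor as a readily available stand-in, and the pseudo effects, propensity scores, and covariates are synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Regress pseudo treatment effects on the screened key covariates plus the
# propensity score (stand-in for the paper's XGBoost model; data are synthetic).
rng = np.random.default_rng(2)
n = 400
X_key = rng.normal(size=(n, 3))                       # screened key covariates
e_hat = 1.0 / (1.0 + np.exp(-X_key[:, 0]))            # estimated propensity score
tau_true = 1.0 + 2.0 * X_key[:, 1]                    # assumed heterogeneous effect
tau_tilde = tau_true + rng.normal(scale=0.5, size=n)  # noisy pseudo effects

features = np.column_stack([X_key, e_hat])
model = GradientBoostingRegressor(random_state=0).fit(features, tau_tilde)
cate_hat = model.predict(features)
```

The fitted model smooths the noisy matched proxies into a CATE surface and, unlike pure matching, can extrapolate to regions without good matches.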
As stated in Proposition 2, the expected error of the estimated heterogeneous treatment effects can be effectively reduced by narrowing the distance within the matching group and the difference in prognostic scores (see Appendix B for the proof). Let the final estimated conditional average treatment effect be given by a regressor trained on the pseudo treatment effects, and assume the regression function is Lipschitz continuous; that is, there exists a constant L such that the change in the function value is bounded by L times the distance between its arguments. Under the basic framework of the above model, if the tripartite matching problem is solvable within the given caliper, the expected estimation error admits an explicit upper bound.

Simulation
This section analyzes the results of several simulation experiments carried out to compare the performance of TRIMATCH with canonical models proposed for estimating heterogeneous treatment effects, including match-based and tree-based models. The simulated data were generated from a model in which e(X_i) = expit(γᵀX_i) denotes the propensity score, m(X_i) denotes a prognostic score defined as the expected outcome under control, and τ(X_i) denotes the function we want to estimate, namely, the true individual treatment effect.
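Since the paper's exact generating functions are not reproduced in the extracted text, the sketch below illustrates one data-generating process of this general shape; the coefficient vector γ, the prognostic and effect functions, and the noise scale are our own assumptions.

```python
import numpy as np

# Illustrative simulation design: covariates X, propensity e(X) = expit(gamma'X),
# treatment W ~ Bernoulli(e(X)), outcome Y = m(X) + W * tau(X) + noise.
# All functional forms below are assumptions, not the paper's exact setup.

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate(n=1000, p=10, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    gamma = np.zeros(p)
    gamma[0] = 0.5                       # assumed sparse propensity coefficients
    e = expit(X @ gamma)                 # propensity score e(X)
    W = rng.binomial(1, e)               # treatment assignment
    m = X[:, 1] + X[:, 2]                # assumed prognostic score m(X)
    tau = 1.0 + X[:, 3]                  # assumed true ITE function tau(X)
    Y = m + W * tau + rng.normal(scale=1.0, size=n)
    return X, W, Y, tau, e

X, W, Y, tau, e = simulate()
```

A nonlinear-confounder variant would replace m with a nonlinear function of the covariates while keeping the same overall structure.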
We compare the following estimators to TRIMATCH: (1) propensity score matching (PSM); (2) prognostic score matching (PGM); (3) matching on the joint use of propensity and prognostic scores (PP; unpublished data by Ye et al.); (4) variable-ratio matching based on the minimum average cost flow algorithm on a bipartite graph (COMBO; Gao et al. [7]); (5) a nonparametric Bayesian regression approach using dimensionally adaptive random basis elements with Bayesian additive regression trees (BART; Chipman et al. [37]); and (6) an estimator based on an R-learner implemented by random forests with causal forests (CF; Wager and Athey [4]). We estimate the propensity scores using logistic regression implemented in R [37]. Each method's performance was assessed by the mean absolute error (MAE) between the true and estimated treatment effects, defined as (1/n) Σ_i |τ̂_i − τ_i|, and the mean squared error (MSE), defined as (1/n) Σ_i (τ̂_i − τ_i)².

Comparison with different calipers
To determine the best choice of calipers and hyperparameters for the proposed TRIMATCH method, we compare matching results on the following measures: the average difference in the true propensity score (TP) and in the estimated propensity score (EP) between the treatment and control groups after matching, the average Mahalanobis distance between the treatment and control groups on the key covariates (MAHAL), the distance of each key covariate between the control and treatment groups, and the number of matched pairs (PAIRS). We conduct six simulation settings with different calipers as follows: a. matching without a caliper on the graph's left part, and matching within 0.2 pooled standard deviations of the estimated propensity score caliper on the graph's right part.
b. matching within 0.2 pooled standard deviations of the estimated propensity score caliper on the graph's left part, and matching without a caliper on the graph's right part.
c. matching within 0.2 pooled standard deviations of the estimated propensity score caliper on the graph's left part, and matching within 0.05 pooled standard deviations of the estimated prognostic score caliper on the graph's right part.
d. matching within 0.2 pooled standard deviations of the estimated prognostic score caliper on the graph's left part, and matching within 0.05 pooled standard deviations of the estimated propensity score caliper on the graph's right part.
e. matching within 0.2 pooled standard deviations of the estimated propensity score caliper on the graph's left part, and matching within 0.05 pooled standard deviations of the estimated propensity score caliper on the graph's right part.
f. matching within 0.2 pooled standard deviations of the estimated prognostic score caliper on the graph's left part, and matching within 0.05 pooled standard deviations of the estimated prognostic score caliper on the graph's right part.
We generated experimental data with a sample size of 1000 and covariate dimension 10 based on the linear confounder model, and the simulation experiment was repeated 100 times for each of the six caliper settings above. The final experimental results are shown in Table 1. From the simulation results, matching under caliper setting (a) can effectively reduce the gap between the treatment group and the matched control group and significantly improve matching quality. Among the double-caliper settings, setting (d) shows satisfactory performance in covariate balance and pairing proximity. The prognostic score extracts information about the covariates associated with the outcome, and the propensity score summarizes information relevant to treatment assignment, so we recommend a prognostic score caliper for closeness and a propensity score caliper for balance; that is, the calipers should correspond to the matching objective function and the matching goals. Whether using a single caliper or double calipers, it is advisable to impose a propensity score caliper on the right part of the graph.

Comparison with different estimators
Based on the simulation study results in the previous section, we carry out matching with a caliper of 0.2 pooled standard deviations of the estimated propensity score on the right part of the graph to balance covariates. We consider the performance of each model in the context of linear and nonlinear confounders. All experiments were repeated 100 times on datasets with a sample size of 1000 or 2000. For all matching procedures, we use the glmnet function in the R package glmnet to estimate the propensity score and the prognostic score.
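In a Python setting, the two scores can be estimated analogously to penalized glmnet fits, e.g. with scikit-learn's L1-penalized models. The sketch below uses synthetic data and illustrative model choices; it is not the paper's exact estimation procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Lasso

rng = np.random.default_rng(1)
n, p = 500, 10
X = rng.normal(size=(n, p))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # treatment assignment
Y = X[:, 1] + 2 * T + rng.normal(size=n)          # observed outcome

# Propensity score: P(T = 1 | X), via L1-penalized logistic regression
# (an analogue of a lasso fit with R's glmnet).
ps_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
ps = ps_model.fit(X, T).predict_proba(X)[:, 1]

# Prognostic score: predicted control outcome E[Y | X, T = 0],
# estimated with a lasso regression fit on control units only.
prog_model = Lasso(alpha=0.1).fit(X[T == 0], Y[T == 0])
prog = prog_model.predict(X)
```

Fitting the prognostic model on controls only avoids contaminating it with the treatment effect.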
The box plots in Fig. 2 depict the prediction performance of the different estimators under different sample sizes and data generation models. We use the mean absolute error and mean squared error to measure the accuracy of the model estimates. For both linear and nonlinear confounders, TRIMATCH has the lowest mean absolute error and mean squared error among all methods, and its variance is also relatively low, indicating that the proposed method offers the best stability and accuracy of estimation. As the dimension and complexity of the data model increase, TRIMATCH's accuracy decreases, but only by a small margin, and the standard deviation of the estimation error also remains low, which reflects the robustness of our method. Compared with the fixed ratio matching methods, the variable ratio matching method based on the minimum average cost flow algorithm yields a smaller estimation error, which reflects the superiority of variable ratio matching. Compared with the COMBO method, TRIMATCH eliminates the influence of irrelevant factors in the measurement of matching distance and accounts for the balance between covariates, which reduces estimation bias and improves the stability of the estimation results.

Fig. 2. Comparison between different methods
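The two evaluation metrics used above are straightforward to compute from estimated and true individual effects; a minimal sketch:

```python
import numpy as np

def mae(tau_hat, tau):
    """Mean absolute error of estimated individual treatment effects."""
    return float(np.mean(np.abs(tau_hat - tau)))

def mse(tau_hat, tau):
    """Mean squared error of estimated individual treatment effects."""
    return float(np.mean((tau_hat - tau) ** 2))

tau_true = np.array([1.0, 2.0, 0.5, 1.5])
tau_est = np.array([1.2, 1.8, 0.4, 1.6])
# mae: mean of |0.2, -0.2, -0.1, 0.1| = 0.15
# mse: mean of (0.04, 0.04, 0.01, 0.01) = 0.025
```

In simulations the true effects are known by construction, which is what makes this comparison across estimators possible.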

Application
In this section, we apply TRIMATCH to a real-world dataset to study the heterogeneous treatment effect of college completion on reducing low-wage work, previously analyzed by Brand et al. [38]. The data are derived from the National Longitudinal Survey of Youth 1979 (NLSY79), and we use the same samples and variables as Brand et al. [38]. Following their study, the outcome variable of interest is the proportion of time spent in low-wage work over the person's career, and the treatment denotes whether the individual completed college by age 25. The covariates include (1) sociodemographic factors (male, Black, Hispanic, southern residence at age 14, rural residence at age 14), (2) family background factors (parents' household income, father's highest education, mother's highest education, father's upper-white-collar occupation, two-parent family at age 14, sibship size), (3) cognitive and psychosocial factors (cognitive ability ASVAB, high school college preparatory program, Rotter Locus of Control scale, juvenile delinquency activity scale, educational expectations, educational aspirations, friends' educational aspirations), (4) school factors (school disadvantage scale), and (5) family formation factors (marital status at age 18, had a child by age 18). The dataset consists of 1764 observations, with 347 in the treated group and 1417 in the control group. According to the correlation coefficient matrix of the covariates, parental income is associated with several covariates: it is positively related to father's or mother's highest education, father's upper-white-collar occupation, intact family, high school college preparatory program, educational expectations, educational aspirations, friends' educational aspirations, and the propensity score, while it is negatively related to sibship size, the Rotter Locus of Control scale, Black, Hispanic, rural residence at age 14, and marital and childbearing status by age 18.

Fig. 3a depicts the estimated posterior distributions of heterogeneous treatment effects corresponding to different approximated propensity score percentiles (i.e., to individuals in the sample whose estimated propensity scores are equal or closest to the PS percentiles). Apart from the 100th percentile, the mean of the conditional average treatment effect distribution at each percentile is less than 0, which reflects the positive impact of college completion on reducing the proportion of time in low-wage work. The distributions of the estimated heterogeneous treatment effects near the 0th and 100th percentiles of the propensity score are more dispersed, indicating greater uncertainty in the estimates for individuals located in the region of poor overlap.
When ASVAB cognitive ability is less than −0.5, Fig. 3b shows a roughly positive correlation between the conditional average treatment effect and ASVAB cognitive ability, which means that the lower the cognitive ability, the more significant the effect of college completion on reducing the proportion of time in low-wage work. When ASVAB cognitive ability is at least −0.5, the heterogeneous treatment effects fluctuate more consistently and are less than zero for most individuals, implying that a college education is beneficial in reducing the proportion of time in low-wage work.
As shown in Fig. 4, from the overall sample perspective, college completion positively impacts reducing the proportion of low-paying jobs in an individual's career. This effect is more pronounced for individuals with lower cognitive abilities, as indicated by an ASVAB score of less than −1.1, implying that completing college may be particularly beneficial for individuals with lower cognitive abilities, as it may help them secure higher-paying jobs. The juvenile delinquency activity scale is also an important node for dividing homogeneous subgroups, and individuals who had delinquent behaviors benefit more from completing college than those who did not.
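Splits such as the ASVAB threshold and the delinquency indicator can be discovered by fitting a shallow regression tree to estimated individual effects. The sketch below uses synthetic data; the variable names, thresholds, and effect sizes are illustrative and do not reproduce the NLSY79 analysis.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 1000
asvab = rng.normal(size=n)            # standardized cognitive ability
delinq = rng.integers(0, 2, size=n)   # any delinquent behavior (0/1)
X = np.column_stack([asvab, delinq])

# Synthetic estimated effects: more negative (larger benefit) for
# low-ASVAB individuals and for those with delinquent behavior.
tau_hat = -0.1 - 0.2 * (asvab < -1.1) - 0.1 * delinq + 0.02 * rng.normal(size=n)

# A depth-2 tree on the estimated effects recovers interpretable
# homogeneous subgroups defined by covariate thresholds.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=50).fit(X, tau_hat)
subgroup = tree.apply(X)              # leaf index for each individual
```

Restricting the depth and leaf size keeps the subgroups large and interpretable rather than overfitting the noise in the estimated effects.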

Conclusion
In this article, we propose a two-step approach called TRIMATCH for estimating individual treatment effects. First, we construct a variable ratio matching model based on a tripartite graph, use the minimum average cost flow algorithm to obtain the optimal matching, and create a "near-randomized" sample whose matched outcomes serve as a proxy for the individual treatment effect. Second, based on the matching outcomes, we employ the extreme gradient boosting tree technique to build a prediction model for individual treatment effects.
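The two-step idea can be sketched as follows, with simple k-nearest-neighbor matching standing in for the tripartite-graph optimal matching, and scikit-learn's gradient boosting standing in for the extreme gradient boosting tree. The data are synthetic and all names are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n, p = 600, 5
X = rng.normal(size=(n, p))
T = rng.binomial(1, 0.4, size=n)
tau = 1.0 + X[:, 0]                       # true heterogeneous effect
Y = X[:, 1] + tau * T + 0.1 * rng.normal(size=n)

Xt, Yt = X[T == 1], Y[T == 1]
Xc, Yc = X[T == 0], Y[T == 0]

# Step 1: proxy individual effect for each treated unit, here the
# difference between its outcome and the mean outcome of its k nearest
# controls in covariate space (a stand-in for the variable ratio
# matching obtained from the tripartite graph).
k = 5
proxy = np.empty(len(Xt))
for i, x in enumerate(Xt):
    nn = np.argsort(np.sum((Xc - x) ** 2, axis=1))[:k]
    proxy[i] = Yt[i] - Yc[nn].mean()

# Step 2: regress the proxy effects on covariates with boosted trees
# to obtain a prediction model for individual treatment effects.
model = GradientBoostingRegressor(max_depth=2, n_estimators=200).fit(Xt, proxy)
tau_hat = model.predict(X)
```

The second step extrapolates the matched proxies to all units, including controls, which is what turns matching output into individual-level predictions.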
We validate the effectiveness and accuracy of the proposed approach through both theoretical analysis and empirical experiments. Our experimental findings demonstrate that TRIMATCH effectively enhances the quality of matching not only by reducing the average distance within the matching set but also by improving the balance of covariates between the treatment and control groups after matching. Furthermore, our approach surpasses published alternative methods in terms of the accuracy of individual treatment effect estimation.
The variable ratio matching problem is commonly reformulated as a minimum average cost flow problem based on bipartite graphs. The minimum average cost flow algorithm offers advantages over the minimum cost flow algorithm because the former can produce more feasible flows at a lower average cost. Tripartite graph matching surpasses bipartite graph matching in its ability to address multiobjective matching problems and yields a lower overall matching cost. Our proposed TRIMATCH offers flexibility and generality in the following ways: (1) It imposes no restrictions on the type of covariates, making it suitable for both discrete and continuous variables. (2) The matching ratio is variable and can be adapted to the data distribution by selecting an appropriate matching set size. (3) In the first step of the matching process, TRIMATCH can be combined with covariate balancing techniques such as exact matching and refined balance matching to enhance the balance of the matched covariates. In the second step, other machine learning methods can replace the extreme gradient boosting tree method to construct the prediction model. Compared with other regression-based causal inference methods, matching enables adjustment of covariate balance, and the joint use of matching and regression methods helps to reduce the model's sensitivity to unobserved confounders.
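In the special case of 1:1 matching, the network flow formulation reduces to the classical assignment problem. The sketch below solves that special case with SciPy; it illustrates the bipartite baseline only, not the minimum average cost flow algorithm or the tripartite structure, and the data are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(4)
ps_treated = rng.uniform(size=6)    # estimated PS, treated units
ps_control = rng.uniform(size=15)   # estimated PS, control units

# Cost matrix: propensity score distance for every treated/control pair.
cost = np.abs(ps_treated[:, None] - ps_control[None, :])

# Optimal 1:1 bipartite matching minimizing total cost. Variable ratio
# matching generalizes this to a flow network in which each treated
# node may absorb between 1 and k units of flow from control nodes.
rows, cols = linear_sum_assignment(cost)
total_cost = cost[rows, cols].sum()
```

The optimality of the solution means no other assignment of distinct controls to the treated units can achieve a smaller total cost.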
TRIMATCH exhibits impressive matching quality, but its time and space complexity require further optimization. The matching algorithm can be improved from both the network structure and the solution algorithm: network sparsification methods can simplify the network structure, and other nonlinear programming solvers can be explored for further speedups. In the second step of the proposed method, the extreme gradient boosting tree can be replaced by other machine learning methods, or combined with meta-learners, to explore whether more accurate individual treatment effect estimators exist. Moreover, our method is currently applicable only to binary treatment variables. However, TRIMATCH can be extended to multiple treatments by mapping different treatment groups to different layers of the network structure, where the number of layers corresponds to the number of treatment categories. Furthermore, different weights could be assigned to network layers to prioritize matching tasks accordingly.