EL V.2 Model for Predicting Food Safety Risks at Taiwan Border Using the Voting-Based Ensemble Method

Border management serves as a crucial control checkpoint for governments to regulate the quality and safety of imported food. In 2020, the first-generation ensemble learning prediction model (EL V.1) was introduced to Taiwan's border food management. This model assesses the risk of imported food by combining five algorithms to determine whether quality sampling should be performed on imported food at the border. In this study, a second-generation ensemble learning prediction model (EL V.2) was developed based on seven algorithms to enhance the "detection rate of unqualified cases" and improve the robustness of the model. Elastic Net was used to select the characteristic risk factors, and two new algorithms, the Bagging-Gradient Boosting Machine and Bagging-Elastic Net, were added to construct the new model. In addition, Fβ was used to flexibly control the sampling rate, improving the predictive performance and robustness of the model. The chi-square test was employed to compare the efficacy of "pre-launch (2019) random sampling inspection" and "post-launch (2020–2022) model prediction sampling inspection". For cases recommended for inspection by the ensemble learning model and subsequently inspected, the unqualified rates were 5.10%, 6.36%, and 4.39% in 2020, 2021, and 2022, respectively, significantly higher (p < 0.001) than the 2.09% achieved by random sampling in 2019. The prediction indices established by the confusion matrix were used to further evaluate the prediction effects of EL V.1 and EL V.2: the EL V.2 model exhibited superior predictive performance compared with EL V.1, and both models outperformed random sampling.


Introduction
Taiwan's food supply relies heavily on imports, with a vast array of imported ingredients and products comprising a substantial portion of the population's dietary consumption. This underscores the importance of managing imported food to protect public health and consumer rights. In Taiwan, the number of inspection applications for imported food has grown annually. Between 2011 and 2022, inspection applications increased from 419,000 batches to 723,000 batches, nearly doubling. Given the substantial volume of food imports, conducting border sampling inspections is of great significance for effectively strengthening control over high-risk products and accurately detecting substandard items.
Food risk management and control at Taiwan's border employ food inspection methods that can be primarily classified into two categories: review and inspection. The review is conducted in writing, comparing customs clearance data with product information. The inspection involves sampling selected batches and sending them to authorized inspection laboratories for pesticide, pigment, or heavy metal compound testing; the entire process can be completed in approximately three to seven days. According to Taiwan's border inspection measures, inspection methods can be classified by product type into general food, raw materials, and feed [2][3][4][5][6][7][8][9][10]. In 2016, Marvin et al. proposed that the Bayesian network algorithm can handle diverse big data and facilitate the understanding of driving factors related to food safety via systematic analysis, such as the impact of climate change on food quality, the economy, and human behavior; combined with the data, this algorithm can be used to predict possible food safety risk events [11]. In 2015, Bouzembrak et al. used the Rapid Alert System for Food and Feed (RASFF) of the European Union to construct a Bayesian network model to predict the types of food fraud that can occur in imported products of known food product categories and countries of origin. The findings can assist in border risk management and control and serve as an important reference for EU governments in conducting inspections and law enforcement [2,12,13].
The amount of imported food in the United States is increasing year by year. Due to limited inspection capacity, the Food and Drug Administration has divided the control of border imported food into two stages. The first stage is mainly electronic document review, with only 1% of imported food actually inspected each year. The second stage involves using the Predictive Risk-based Evaluation for Dynamic Import Compliance Targeting (PREDICT) system for risk prediction. Big data are employed to collect relevant data from products and manufacturers for evaluation, determining the risk level of imported goods. The risk factors calculated in the PREDICT system include at least four types of data, such as product risk (epidemic outbreak, recall, or adverse event), regulatory risk (specific factors of the manufacturer itself and past compliance with food safety regulations), factory inspection records of the manufacturer within three years, and historical data of the customs broker (quality analysis of data provided by the customs broker or importer within one year, such as reporting status). These data are used to screen factors related to the product itself for risk score calculation and further propose whether to conduct product sampling inspection [14].
The data sources used by the PREDICT system are mainly import alert and import notification data, domestic inspection and product tracking records, foreign factory inspections (such as equipment inspections), and identification system evaluation. Using these data, the PREDICT system can conduct data mining and analysis, enabling it to use artificial intelligence methods to predict the possible risks of imported goods and intercept them in a timely manner. This approach is undoubtedly the best for countries facing massive imports each year, which need to maintain normal export and import while still taking into account the safety and quality of goods.
Regarding the quality sampling inspection of imported food at the border, there are currently the following international experiences: The United States employs machine learning to assist in border inspection operations, while the European Union deploys methods such as Bayesian network analysis to predict factors that may cause border food risks, and then reports back to EU countries to strengthen their attention to import control. These practices demonstrate that big data applications, such as artificial intelligence and machine learning, can provide better operational quality for government border management and ensure the health and safety of the public. Therefore, this study referred to the data sources and practices of the European Union and the United States to collect risk factors and establish prediction model planning.

Selection of Algorithms
In recent years, ensemble learning has received great attention from researchers and has been widely applied in many fields for various purposes, such as medical diagnosis and disease prediction [15][16][17][18], improvement of patient quality of life [19], Internet of Things (IoT) security [20,21], fault detection and error prediction for industrial processes [22][23][24], advertising and marketing [25], as well as agricultural monitoring, management, and productivity improvement [6,26]. In the food industry, it has been used for productivity improvement in food manufacturing, quality assessment and monitoring, food ingredient identification, food safety, and the quality of food delivery services (FDS). Parastar [10] developed a handheld near-infrared spectroscopy device based on ensemble learning for measuring and monitoring the authenticity of chicken meat, which showed better performance in authenticity testing than common single classification methods such as partial least squares-discriminant analysis (PLS-DA), artificial neural networks (ANN), and support vector machines (SVM). Using a combination of deep learning and ensemble learning techniques on milk spectral data, Neto [6] proposed a method for predicting common fraudulent milk adulterations in the dairy industry; their method outperformed not only common statistical learning methods but also Fourier transform infrared spectroscopy (FTIR), which is typically used for identifying the composition of a sample in the dairy industry. Further, Adak [27] constructed a model with customer reviews of FDS using machine learning and deep learning techniques to predict customer sentiment about FDS. Based on previous studies and following consultation with experts, the following algorithms were used for constructing the new model: Decision Tree C5.0 and CART, Random Forest (RF), Logistic Regression (LR), Naïve Bayes (NB), Elastic Net (EN), and Gradient Boosting Machine (GBM).
These algorithms offer interpretable approaches that are easy for users to understand, so they were adopted in the ensemble learning for EL V.1 and EL V.2. Deep learning was not included, given its low interpretability, but it may be considered in subsequent studies. EL V.1 was constructed using five algorithms: Decision Tree C5.0 and CART, Random Forest (RF), Logistic Regression (LR), and Naïve Bayes (NB). These algorithms exhibit great interpretability and explainability, so they were primarily used for prediction tasks with ensemble-based classification techniques. On the basis of EL V.1, this study intended to construct a model with higher predictive performance and greater computational efficiency. To keep computation time low, Elastic Net (EN) and Gradient Boosting Machine (GBM) were used so that the sampling decision for each batch of cases could be made within one minute. The test results revealed that the computation time could be kept within this limit using EN and GBM; therefore, they were integrated into the construction of EL V.2.

Improvement of Ensemble Learning Model
The ensemble learning model is jointly established by a group of independent machine learning classifiers, combines their respective prediction results, and implements an integration strategy to reduce the total error and improve the performance of a single classifier [28][29][30]. Each classifier may have different generalization capabilities, i.e., different inference abilities for various samples, similar to the opinions of different experts. Finally, combining the output of these individual classifiers can deliver the final classification results, significantly reducing the probability of classification errors in the results [9,30].
For example, Solano [31] proposed an ensemble voting model for solar radiation prediction based on machine learning algorithms; the results of the study show that the weighted average voting method based on random forest and classification boosting has superior performance and also outperforms single machine learning algorithms and other ensemble models. Chandrasekhar [32] used six algorithms (Random Forest, K-Nearest Neighbors, Logistic Regression, Naive Bayes, Gradient Boosting, and AdaBoost Classifier) for voting ensemble learning, which improved the accuracy of heart disease prediction. Alsulami [33] proposed a data mining model including three traditional algorithms (decision trees, Naive Bayes, and random forests) to evaluate student e-learning data and help policy makers make informed and appropriate decisions for their institutions. These methods effectively improve model prediction performance by using three ensemble techniques: bagging, boosting, and voting. The combination of multiple different classifiers has been proven to improve the classification accuracy of the overall classification system [34][35][36][37].
In this study, four methods proposed by scholars were utilized to enhance the diversity of the classification models (or classifiers) within the ensemble learning model: using different training datasets and training different classification models with different parameter settings, algorithms, and characteristic factors [31,38]. In previous studies, five algorithms were used to construct the ensemble learning model EL V.1. To improve and stabilize the predictive performance of the model, in this study, an attempt was made to construct model EL V.2 by adding "algorithmic classification models", adjusting the "factor screening method", and adding "sampling rate control parameters" so that the prediction method for imported food sampling inspection at the border can perform better. Therefore, in addition to the algorithms used in the first-generation ensemble learning model EL V.1 constructed in previous studies (Decision Tree C5.0 and CART, Random Forest (RF), Logistic Regression (LR), and Naïve Bayes (NB)), the newly added algorithms in this study were Elastic Net (EN) and Gradient Boosting Machine (GBM). The classification models constructed from these seven algorithms with the bagging method were strategically integrated using the "majority decision" voting approach. After model construction, predictions for border inspection applications were conducted.
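The "majority decision" integration of seven bagged classifiers can be sketched with scikit-learn. This is an illustrative reconstruction under stated assumptions, not the authors' production code: the dataset is synthetic, a second entropy-based decision tree stands in for C5.0 (which scikit-learn does not provide), and an elastic-net-penalized logistic regression stands in for Bagging-EN.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the border-inspection data (roughly 7:3
# qualified-to-unqualified, as in the resampled training sets).
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.7, 0.3],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Seven bagged base learners (C5.0 approximated by an entropy-based tree,
# Elastic Net by an elastic-net-penalized logistic regression).
bases = {
    "C5.0": DecisionTreeClassifier(criterion="entropy"),
    "CART": DecisionTreeClassifier(criterion="gini"),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "NB": GaussianNB(),
    "EN": LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=0.5, max_iter=5000),
    "GBM": GradientBoostingClassifier(random_state=0),
}
voters = [(name, BaggingClassifier(clf, n_estimators=10, random_state=0))
          for name, clf in bases.items()]

# Hard voting implements the "majority decision" integration strategy:
# a batch is flagged when a majority of the seven classifiers flag it.
ensemble = VotingClassifier(estimators=voters, voting="hard").fit(X_tr, y_tr)
pred = ensemble.predict(X_te)
```

Hard voting is used here because the paper integrates class decisions rather than probabilities; `voting="soft"` would average predicted probabilities instead.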

Materials and Methods
To improve the robustness of EL V.2, improvements were made on the basis of EL V.1 with a more refined method for selecting characteristic risk factors and an increased number of algorithms for classification. The details of the research methodology are described in the following sections.

Data Sources and Analytical Tools
The modeling data for this study were sourced from the food cloud established by the Food and Drug Administration of the Ministry of Health and Welfare of Taiwan. The food cloud is centered around the Food and Drug Administration's Five Systems, including the Registration Platform of Food Businesses System (RPFBS), the Food Traceability Management System (FTMS), the Inspection Management System (IMS), the Product Management Decision System (PMDS), and the Import Food Information System (IFIS). Additionally, it comprises cross-agency data communication, including financial and tax electronic invoices, customs electronic gate verification data, national business tax registration data, industrial and commercial registration data, indicated chemical substance flow data, domestic industrial oil flow data, imported industrial flow data, waste oil flow data, toxic chemical substance flow data, feed oil flow data, and campus food ingredient login and inspection data [39]. After imported food enters Taiwan, it must be declared and inspected through IFIS. Only after approval can the imported food enter the domestic market. The relevant business data must be registered in RPFBS, national business tax registration data, and business registration data. The flow information generated by domestic and imported products entering the market from the border should be recorded in IFIS and FTMS, as well as in electronic invoices and electronic gate goods import and export verification records. All government-conducted product sampling inspection records should be saved in PMDS, IFIS, and IMS. Information related to the company's products can also be accessed via RPFBS and FTMS.
The main sources of this study were border inspection application data, food inspection data, food product flow information, and business registration data from Taiwan's food cloud, as well as international open data databases related to food safety, including gross domestic product (GDP), GDP growth rate, global food security index, corruption perceptions index (CPI), human development index (HDI), legal rights index (LRI), and regional political risk index. A total of 168 factors were included in the analysis. The analytical tools used in the study were R 3.5.3, SPSS 25.0, and Microsoft Excel 2010.

Research Methodology
In this study, we selected food inspection application data of S-type products that had been sampled and had inspection results as the research scope. The data were divided into training, validation, and testing sets. First, different data types and analysis methods of the training set were considered to establish various models. The optimal model was selected from the prediction results obtained by importing the validation set into each model. The entire modeling process was based on previous studies on the construction of the EL V.1 method, and improvements were made to this method to improve the hit rate of unqualified products detected via sampling inspection. According to the execution order, this study can be divided into four stages: "data collection", "data integration and pre-processing", "establishing risk prediction models", and "evaluating prediction effectiveness". "Establishing risk prediction models" included three procedures: "characteristic factor extraction", "data mining and modeling", and "establishing the optimum prediction model". Changes were made in the calculation methods of "characteristic factor extraction" and "data mining", as described below (Figure 1).

Data Collection
The data in this study included the border inspection application database, inspection database, flow direction database, and registration database of the Taiwan Food Cloud, as well as open information related to international food risk. A total of 168 factors were used as the main data source for constructing the risk prediction model (as shown in Table 1).


Integration and Data Pre-Processing
In addition to data noise cleaning, the data needed to be subjected to manufacturer name and product name attribution and data string filing to further integrate the data in accordance with six aspects: manufacturer, importer, customs broker, border inspection, product, and country of manufacture. The integration process included data cleaning, error correction, and attribution.

• Data processing: This step required data segmentation by year to prepare training, validation, and test sets. The training set was divided into two forms: 2011-2017 and 2016-2017. The validation set was data from 2018, and the test set was data from 2019. To realize accurate model prediction, in this study, we first attempted to model these two data forms and then used the validation set to confirm the most suitable time interval for data modeling.

• Selection of characteristic risk factors: This step improved on the first-generation model EL V.1. There were two strategies for extracting characteristic factors. First, the "single-factor analysis" and "stepwise regression" used to extract characteristic factors in EL V.1 were replaced with Elastic Net. Elastic Net combines Lasso regression (L1 regularization) and Ridge regression (L2 regularization). The penalized least-squares objectives can be written as follows (Equations (1)-(3)):

Lasso regression: $\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^{T}\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$ (1)

Ridge regression: $\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^{T}\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2}$ (2)

Elastic Net: $\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^{T}\beta)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^{2}$ (3)

The Lasso (L1) term allows Elastic Net to select characteristic factors: when selecting among highly collinear variables, Lasso regression retains only one of them. The Ridge (L2) term groups correlated independent variables, so highly collinear variables that all affect the dependent variable can coexist in the model rather than only one being retained, as in Lasso regression. Ogutu et al. indicated that, by design, Elastic Net discards variables that have no influence on the dependent variable, which improves the explanatory power and predictive capability of the model; by contrast, retaining all highly collinear independent variables may not increase prediction performance while making the model more complex and unstable [40]. Because this study involved many factors, high collinearity was a concern. To avoid the collinearity among factors that may be overlooked when using "single-factor analysis and stepwise regression" to select factors, Elastic Net was selected to reduce possible bias in the prediction model and improve prediction accuracy.
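The Elastic Net screening step can be sketched as follows. The data, dimensions, and `l1_ratio` below are illustrative assumptions, not the study's actual 168-factor dataset or tuning; the idea is only that factors whose coefficients shrink to exactly zero are dropped.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

# 168 candidate factors with only a few truly informative ones, mimicking
# the high-dimensional, collinear factor pool described above.
X, y = make_regression(n_samples=500, n_features=168, n_informative=10,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# l1_ratio mixes the L1 (Lasso) and L2 (Ridge) penalties; cross-validation
# chooses the overall regularization strength alpha.
enet = ElasticNetCV(l1_ratio=0.5, cv=5, random_state=0).fit(X, y)

# Factors with nonzero coefficients survive the screening step.
selected = np.flatnonzero(enet.coef_)
print(len(selected), "of", X.shape[1], "factors retained")
```

In practice the retained factor set depends on the chosen `l1_ratio`; values closer to 1 behave like pure Lasso and prune more aggressively.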
The second strategy involved modeling based on inspection data from 2011 to 2017, with monthly data from January to October 2018 added over time. The model was updated once a month, and the number of times each characteristic factor was used was counted; with seven algorithms and ten monthly updates, each factor could be selected up to 70 times. Factors used more than once were kept and included in the model required for EL V.2 construction. In this study, a total of 68 characteristic risk factors were obtained (as shown in Table 2); these were the important characteristic factors that participated in EL V.2 modeling. We then conducted modeling based on the training set. In addition to the algorithms used in EL V.1 (Bagging-C5.0, Bagging-CART, Bagging-LR, Bagging-RF, and Bagging-NB), Bagging-EN and Bagging-GBM were added for "data mining and modeling". Bagging trains multiple prediction classifiers for the same algorithm with a non-weighted method, which are then aggregated into the model constructed by the computational classifier. In this study, we used the seven models established by Bagging-C5.0, Bagging-CART, Bagging-LR, Bagging-RF, Bagging-NB, Bagging-EN, and Bagging-GBM, and then ensembled them via the voting rule of "majority decision" as the final ensemble prediction model (Figure 2).

• Establishing the optimum prediction model: Training set resampling. According to historical border inspection application data, the number of unqualified batches accounts for a small proportion of the total number of inspection applications, and modeling based on these data can easily lead to prediction bias. Therefore, in this study, we adopted two resampling methods (the synthetic minority oversampling technique (SMOTE) and proportional amplification) to deal with the data imbalance problem and tested qualified-to-unqualified batch ratios of 7:3, 6:4, 5:5, 4:6, and 3:7 to find the best proportional parameters and imbalanced data processing method.
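A minimal sketch of SMOTE-style oversampling to a target qualified-to-unqualified ratio. The `smote_to_ratio` helper and the data are illustrative assumptions (the study's actual implementation is not described here); it shows the core SMOTE idea of interpolating between a minority point and one of its nearest minority neighbors.

```python
import numpy as np

def smote_to_ratio(X_maj, X_min, ratio=(7, 3), k=5, rng=None):
    """Minimal SMOTE sketch: synthesize minority samples by interpolating
    between each minority point and one of its k nearest minority
    neighbors until the majority:minority ratio reaches `ratio`."""
    rng = np.random.default_rng(rng)
    target_min = int(len(X_maj) * ratio[1] / ratio[0])
    n_new = max(0, target_min - len(X_min))
    if n_new == 0:
        return X_min
    # Pairwise distances among minority samples only.
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbors
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))           # random minority point
        j = nn[i, rng.integers(nn.shape[1])]   # one of its neighbors
        gap = rng.random()                     # interpolation factor
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack([X_min, synth])

# Example: 950 qualified vs 50 unqualified batches, rebalanced to 7:3.
rng = np.random.default_rng(0)
X_maj = rng.normal(0, 1, size=(950, 5))
X_min = rng.normal(2, 1, size=(50, 5))
X_min_new = smote_to_ratio(X_maj, X_min, ratio=(7, 3), rng=0)
print(len(X_min_new))   # -> 407, i.e. int(950 * 3 / 7)
```

Production work would normally use a maintained implementation such as imbalanced-learn's `SMOTE`; proportional amplification, by contrast, simply duplicates existing minority rows rather than synthesizing new ones.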

Repeated modeling
In this study, after the training set was resampled to balance the number of qualified and unqualified cases, the data combination of "time interval (AD)/whether to include the vendor blacklist/data imbalance processing method" was used to reduce misjudgment due to a single sampling error. There were two types of time intervals (AD): 2011-2017 and 2016-2017. Blacklisted vendors were those whose unqualified rate was greater than the average overall unqualified rate. The two methods used for handling data imbalance were proportional amplification and SMOTE. Based on these combinations, a total of six data combinations, labeled A to F, were formed.

Selection of the optimal model
The validation data set was imported into the model to obtain the seven classifiers established by the seven algorithms. The seven classifiers were then combined via ensemble learning, and the optimum prediction model was selected from the predicted results.


Evaluation of the Prediction Effectiveness
In this step, the test set was imported into the model, and the prediction indicators output from the confusion matrix (Table 3), namely the accuracy rate (ACR), F1, positive predictive value (PPV), Recall, and area under the curve (AUC) of the receiver operating characteristic (ROC), were used to evaluate the prediction effect. The purpose was to confirm whether the model can improve the predicted unqualified rate for border inspection applications. Table 3. Types and definitions of confusion matrices.

Type Definition
True Positive, TP: A batch of inspection applications was predicted as unqualified by the model and was actually unqualified.
False Positive, FP: A batch of inspection applications was predicted as unqualified by the model but was actually qualified.
True Negative, TN: A batch of inspection applications was predicted as qualified by the model and was actually qualified.
False Negative, FN: A batch of inspection applications was predicted as qualified by the model but was actually unqualified.

ACR represents the model's ability to discriminate among overall samples. However, because the samples in this study were unbalanced and the number of unqualified samples was small, ACR may tend toward qualified prediction results due to its strong discriminative power for qualified predictions. Therefore, in this study, more emphasis was placed on PPV, Recall, and F1 (Equation (4)). Recall represents the proportion of unqualified products correctly identified by the model out of the total number of unqualified products, Recall = TP/(TP + FN) (Equation (5)). PPV refers to the proportion of products that are actually unqualified among those identified by the model as unqualified, PPV = TP/(TP + FP), making it also known as the unqualified rate (Equation (6)). F1 is the harmonic mean of Recall and PPV, F1 = 2 × PPV × Recall/(PPV + Recall); assuming the weights of the two are equal, the performance of F1 is estimated, and the larger its value, the more favorable it is for increasing the number of detected unqualified products, TP (Equation (7)).
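The confusion-matrix indices above can be computed directly from the four cell counts. The batch counts below are hypothetical, chosen only for illustration; setting β = 1 in the Fβ score recovers F1, while other β values reweight Recall against PPV (the mechanism the study uses to control the sampling rate).

```python
def prediction_indices(tp, fp, tn, fn, beta=1.0):
    """Confusion-matrix indices used for model evaluation; beta=1 gives F1,
    the harmonic mean of PPV and Recall."""
    ppv = tp / (tp + fp)                     # precision / unqualified rate
    recall = tp / (tp + fn)                  # hit rate over all unqualified
    acr = (tp + tn) / (tp + fp + tn + fn)    # overall accuracy
    f = (1 + beta**2) * ppv * recall / (beta**2 * ppv + recall)
    return {"PPV": ppv, "Recall": recall, "ACR": acr, "F": f}

# Hypothetical batch counts for illustration only.
idx = prediction_indices(tp=40, fp=160, tn=780, fn=20)
print(round(idx["PPV"], 3), round(idx["Recall"], 3), round(idx["F"], 3))
# -> 0.2 0.667 0.308
```

Note how ACR here is 0.82 despite the low PPV, which is exactly why accuracy alone is a misleading index on imbalanced inspection data.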
The ROC can be plotted as a curve; the larger the area under the curve, the higher the classification accuracy, and performance can be compared across multiple ROC curves. The area under the curve (AUC) is the ratio of the area under the ROC curve to the total area. The ROC curve is a graphical representation of a binary classification model's performance that clarifies the trade-off between the True Positive Rate, TPR = TP/(TP + FN) (Equation (8)), and the False Positive Rate, FPR = FP/(FP + TN) (Equation (9)), across threshold values. When TPR equals FPR, AUC = 0.5, indicating that the results of model-based sampling inspection are equivalent to those of random sampling inspection and that the prediction model has no classification capability. AUC = 1 indicates a perfect classifier; 0.5 < AUC < 1 indicates that the model is superior to random sampling; AUC < 0.5 indicates that the model is inferior to random sampling (Figure 3).
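AUC can equivalently be computed as the probability that a randomly chosen unqualified (positive) batch receives a higher risk score than a randomly chosen qualified one, which makes the "AUC = 0.5 equals random sampling" interpretation concrete. The helper and toy scores below are illustrative, not from the study:

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """AUC via the Mann-Whitney formulation: the probability that a
    positive sample outscores a negative one, with ties counted half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

# A model that scores positives higher than negatives gives AUC = 1.0;
# identical score distributions give AUC = 0.5 (random sampling).
print(auc_from_scores([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2]))   # -> 1.0
print(auc_from_scores([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))   # -> 0.5
```

In practice a library routine such as scikit-learn's `roc_auc_score` computes the same quantity from the full ROC curve.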
The evaluation index for the effectiveness of model prediction in this study was the confusion matrix. First, the classification prediction results were calculated, and models with an AUC greater than 0.5 (i.e., better than random sampling) were prioritized. Then, a comprehensive evaluation was conducted. This study primarily focused on the unqualified rate to truly reflect the prediction hit rate. Therefore, the main evaluation index was the positive predictive value (PPV), also known as precision, which represented the proportion of samples judged as unqualified by the model that were actually unqualified. Additionally, Recall was the ratio of the number of unqualified products correctly identified by the model to the total number of unqualified products. However, the larger the Recall, the higher the sampling rate. Hence, increasing PPV within the tolerable range of the sampling rate was the most important step, which also indicated the importance of balancing the harmonic mean F1, Recall, and PPV.

Evaluation of the Prediction Effectiveness
In this study, the data from the 2019 test set were used to make predictions through the model, simulating the actual prediction after the model launch for effectiveness evaluation. The evaluation of prediction effectiveness and selection of the optimum prediction model were based on the confusion matrix. The evaluation indicator PPV referred to the proportion of products that were actually unqualified among those identified by the model as unqualified, and Recall referred to the classification accuracy over all unqualified samples. EL V.1 was officially launched to conduct online risk forecasting at the border on 8 April 2020 and was switched to EL V.2 on 3 August 2020 for continuous online real-time forecasting. Therefore, in this study, we compared the unqualified rates in 2020, 2021, and 2022 after the launch with that in 2019 before the launch. The chi-square test was used to evaluate whether there was a significant increase in the unqualified rate with the aid of the risk prediction and sampling of the EL V.2 model constructed in this study, which served as the final evaluation of prediction effectiveness.
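The pre-launch versus post-launch comparison can be sketched as a chi-square test of independence on a 2x2 contingency table. The batch counts below are hypothetical, chosen only to reproduce the reported 2.09% (2019 random sampling) and 5.10% (2020 model-guided sampling) unqualified rates; the study's actual batch totals are not given here.

```python
from scipy.stats import chi2_contingency

# Rows: sampling regime; columns: unqualified vs qualified batch counts.
# Hypothetical totals of 10,000 inspected batches per year.
table = [[209, 9791],    # 2019 random sampling: 2.09% unqualified
         [510, 9490]]    # 2020 model-guided sampling: 5.10% unqualified
chi2, p, dof, expected = chi2_contingency(table)
print(p < 0.001)
```

With samples of this size, the difference in unqualified rates is highly significant, matching the paper's reported p < 0.001; with much smaller batch totals the same rates could fail to reach significance.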

Resampling Method and Optimal Ratio
To overcome the problem of the number of unqualified batches being too small, in this study, we used proportional amplification and the synthetic minority oversampling technique (SMOTE) for resampling to select the best method for dealing with unbalanced data and avoiding deviation in model prediction. To explore the proportional parameter of qualified to unqualified batches, tests were conducted using proportional amplification at 7:3 and SMOTE at 7:3, 6:4, 5:5, 4:6, and 3:7. After pairing with Bagging, 10 iterations were conducted to obtain the average result for each of the seven algorithms. Then, the "majority decision" rule of the ensemble learning method was used to obtain the results, and the predictive effect was observed via PPV and F1. Previous studies found that 10 and 100 iterations of modeling exhibited comparable results, but the time required for 100 iterations was 3-8 times longer than that for 10 iterations. Therefore, 10 iterations were selected for modeling, considering the time limitations.
In this study, we selected the inspection data of S-type food as the training set. After ensemble learning, the results (Table 4) showed that PPV and F1 were highest when imbalanced samples were processed with SMOTE at a ratio of 7:3: F1 was 11.03%, PPV was 6.03%, and Recall was 64.91%. Therefore, this study adopted a 7:3 ratio of qualified to unqualified samples; based on historical experience, a 7:3 ratio was also used for proportional amplification. It was not yet confirmed whether SMOTE or proportional amplification was the more suitable method for processing imbalanced data in this study; therefore, both continued to be included in the evaluation.

Generation of the Optimum Prediction Model
In this study, the "time interval" and "whether blacklisted manufacturers were included" were used as fixed risk factors in the training set, and the unbalanced data processing method of "SMOTE or proportional amplification" was adopted. Six data combinations, named A-F, were thus generated. Subsequently, seven algorithms were adopted for modeling: Bagging-CART, Bagging-C5.0, Bagging-LR, Bagging-NB, Bagging-RF, Bagging-EN, and Bagging-GBM. Together with ensemble learning (EL), a total of 42 models and performance indicator evaluation results were generated, as listed in Table 5.
To construct the optimal prediction model, the first step was to require the model's effectiveness evaluation index AUC to exceed 50%, ensuring that the probability of selecting unqualified batches was greater than under random sampling. Secondly, the top three combinations with the highest F1 values were prioritized: 25.0% for the D7 random forest and 23.0% for both the C8 and D8 ensemble learning models. PPV, another important evaluation indicator, was then examined; among these three, the C8 ensemble method was best at 22.9%. Their Recalls were 29.0%, 23.2%, and 28.3%, respectively, all at acceptable levels. To comply with the practical requirement that the general sampling rate be controlled between 2% and 10%, it was important to note that Recall was closely related to the sampling rate: a higher Recall implied a relatively higher sampling rate. We also compared the number of unqualified batches among the sampled batches to avoid situations where the unqualified rate was high while the sampling rate and the number of unqualified batches were low. In summary, the "C8 ensemble method" was selected as the optimum prediction model.
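The prediction indices used for this selection can be written out explicitly. The counts below are illustrative only (they are not taken from the paper's tables); the functions simply show how PPV, Recall, and F1 follow from the confusion matrix.

```python
def prediction_indices(tp, fp, fn):
    """Confusion-matrix indices used throughout the study:
    PPV    = tp / (tp + fp)  -- unqualified rate among sampled batches,
    Recall = tp / (tp + fn)  -- share of all unqualified batches caught,
    F1     = harmonic mean of PPV and Recall."""
    ppv = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * ppv * recall / (ppv + recall)
    return ppv, recall, f1

# Illustrative counts: 318 batches sampled, 52 of them unqualified,
# 100 unqualified batches missed by the sample.
ppv, recall, f1 = prediction_indices(tp=52, fp=266, fn=100)
print(f"PPV={ppv:.2%}  Recall={recall:.2%}  F1={f1:.2%}")
```

The trade-off described in the text is visible here: raising Recall (catching more unqualified batches) requires sampling more batches, which tends to lower PPV, and F1 balances the two.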
Similar results were obtained when examining the robustness of future predictions for the top three F1 combinations (D7, C8, and D8). Therefore, all 16 combinations in Groups C and D were retained for subsequent real-world prediction simulation to confirm the appropriateness of the selected optimal prediction model.

Model Prediction Effectiveness
In this study, we imported the test set data into the best model, C8, identified in the previous stage, and simultaneously into the combinations with similar evaluation results (C1-C7 and D1-D8) to observe predictive performance. The results showed (Table 6) that the original top three models (C8, D7, and D8) output F1 scores of 21.6%, 14.3%, and 15.8% and PPV values of 16.4%, 10.4%, and 12.3%, respectively, after the test set was imported for effectiveness evaluation. This confirmed that C8 remained the optimum prediction model. Therefore, compared with any single algorithm, the ensemble method in this study achieved an equivalent or better effect and was also more robust.
In 2019, the total number of inspection batches for S-type food was 29,573, and the number of randomly selected batches with inspection results was 4154 (excluding annual inspection batches); these batches were used as the test set for prediction. The number of batches sampled according to the prediction model's recommendation was 318, a recommended sampling rate of 7.66%; the hit rate was 16.35%, with 52 hit batches. Under the original scheme, the overall sampling rate was 10.68%, the unqualified rate was 2.09%, and the number of unqualified batches was 618. The hit rate of model-recommended sampling was thus 7.82 times that of the original random sampling (Table 7). In summary, the results showed that the C8 ensemble method was the optimal model choice for this study, and the effectiveness evaluation confirmed that the hit rate of model-recommended sampling was greater than that of random sampling.

Discussion
To enhance the prediction performance of EL V.2, in this study we employed several methods that differed from EL V.1. These included adjusting the selection approach for characteristic risk factors, incorporating additional algorithms into the model, and utilizing Fβ adjustment to maintain the sampling rate within 2-8% after EL V.2 was launched. Simultaneously, 2% was reserved for random sampling to avoid model overfitting, thereby strengthening the robustness and prediction hit rate of the ensemble model (Table 8).

Fβ Was Employed to Regulate the Sampling Inspection Rate
In this study, it was discovered that during the operation of EL V.1, the risk score distribution of each model varied (Figure 4). Hence, using the same Fβ threshold to regulate the sampling rate for every model was not advisable, and the optimal Fβ threshold was instead set for each model separately through β. The F-value employed in the evaluation model is the harmonic mean of PPV (the unqualified rate in sampling inspection) and Recall (the identification rate of unqualified products in sampling inspection). Fβ adjusts the weights of PPV and Recall according to the β value: the larger the β, the greater the weight of Recall (Equation (10)). The unqualified rate and sampling rate were then evaluated based on the threshold setting.
Table 8. Differences between EL V.2 and EL V.1 modeling methods.

| Item | EL V.1 | EL V.2 | Purpose |
|---|---|---|---|
| Screening of characteristic risk factors | Single-factor analysis and stepwise regression screened characteristic factors using simple statistical methods. | New data are added monthly to participate in modeling, and key factors are then selected for actual participation. | Prevents factor collinearity, making the remaining factors more independent and important. |
| Number of algorithms | 5 algorithms | 7 algorithms | When the prediction effect of several models declines, those with AUC > 50% can still be retained for integration, improving the robustness of the model. |
| Adjustment of model parameters | Fβ regulated the sampling inspection rate; the five models shared a consistent value. | Fβ regulates the sampling inspection rate; the seven models are adjusted independently. | The sampling rate is regulated, with elasticity set at 2-8%. |
In this study, we used Fβ to identify the optimal threshold for each model, i.e., the threshold maximizing the F-value under a given β. We reviewed the model thresholds established via the various algorithms to evaluate the sampling unqualified rate and sampling rate of S-type products from 1 May 2020 to 31 May 2020. The final output is listed in the threshold regulation analysis table for different β values, presented in Table 9.
Taking β = 2.6 as an example, to control the sampling rate at about 7%, the unqualified rate of sampling was 16.67% and the sampling rate was 7.23%, whereas when all classification models used the same threshold, the unqualified rate was 15.45% and the sampling rate was 7.56% (Table 10). This study found that regulating the sampling rate through β can increase the unqualified rate of sampling: if the sampling rate was low, β could be raised to increase it, and if the sampling rate was too high, β could be lowered to reduce it. Therefore, EL V.2 was designed to regulate β according to the required sampling rate. Through the automated generation of optimal thresholds by the model, the accuracy of each model can be enhanced and the effectiveness of sampling management strengthened.

Comparison between Single Algorithm and Ensemble Algorithm
Among the 42 prediction models established in the optimal model selection stage, for each of the six data combinations A-F, both the F1 and PPV of the ensemble learning method ranked in the top three among the eight models per combination when compared with the single algorithms, and their AUCs were all greater than the 50% of random sampling (Table 5). When the test set was further used to simulate actual predictions, the ensemble methods in the C and D data combinations remained in the top three (C8 ensemble method: F1 21.6%, PPV 16.4%, AUC 69.9% > 50%; D8 ensemble method: F1 15.1%, PPV 12.3%, AUC 69.0% > 50%) (Table 6). These results showed that the ensemble method was the most suitable approach for constructing border food prediction models, and its robustness helps ensure that high-risk products are efficiently predicted and detected as unqualified through sampling and inspection, thereby preventing food safety incidents.

Comparison of Prediction Effectiveness between EL V.2 and EL V.1 Models
In this section, we explored whether the second-generation ensemble learning prediction model (EL V.2) constructed in this study (composed of seven algorithms: Bagging-CART, Bagging-C5.0, Bagging-Logistic, Bagging-NB, Bagging-RF, Bagging-EN, and Bagging-GBM) exhibited better predictive performance than the first-generation model (EL V.1) constructed in the previous study using five algorithms (Bagging-CART, Bagging-C5.0, Bagging-Logistic, Bagging-NB, and Bagging-RF). Time intervals in 2020 were selected for the effectiveness evaluation: EL V.1 was analyzed from 8 April 2020 to 2 August 2020 and EL V.2 from 3 August 2020 to 30 November 2020. Using the prediction indices established by the confusion matrix, the results showed the following:
1. The AUC of EL V.1 ranged from 53.43% to 69.03%, while the AUC of EL V.2 ranged from 49.40% to 63.39%. The Bagging-CART model of EL V.2, with an AUC below 50%, was considered unsuitable, but under the majority-decision strategy of ensemble learning its influence was diluted by the other six models. Thus, EL V.2 exhibited better robustness than EL V.1: an advantage of ensemble learning is that when a small number of algorithms perform worse than random sampling, there is a mechanism for eliminating or weakening their influence. The AUC results showed that both EL V.1 and EL V.2 had a greater probability of selecting unqualified cases than random sampling (Table 11). Table 11. AUC comparison between EL V.1 and EL V.2 models.
2. The predictive evaluation indices F1 (8.14%) and PPV (4.38%) of EL V.2 were better than the F1 (4.49%) and PPV (2.47%) of EL V.1, indicating that EL V.2 had better predictive effects (Table 12). These results indicated that EL V.2 had better predictive performance than EL V.1, but it should be noted that the Recall of EL V.2 was about twice that of EL V.1, suggesting a relative increase in the sampling rate. Therefore, determining how to control the sampling rate within the general range (2-10%) while improving the unqualified hit rate was a key consideration after the model's launch.
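The AUC comparison above rests on the interpretation of AUC as the probability that a randomly chosen unqualified batch is ranked above a randomly chosen qualified one; an AUC above 50% therefore means "better than random sampling". A minimal sketch of that rank-based (Mann-Whitney) formulation, with illustrative scores only:

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen unqualified batch
    (y=1) receives a higher risk score than a randomly chosen qualified
    one (y=0); ties count half (Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 3 of the 4 (unqualified, qualified) pairs are ranked correctly.
print(auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))  # 0.75
```

A model at exactly 0.5 ranks no better than chance, which is why the study discards (or outvotes) any base model whose AUC falls below 50%.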

Evaluation of the Effectiveness of the Prediction Model after Its Launch
In this study, we used the ensemble learning method to construct the EL V.1 model, which was launched on 8 April 2020 for sampling inspection prediction of imported S-type food. On 3 August 2020, EL V.1 was replaced by EL V.2. To understand the effectiveness of the model after its launch, the performance from 2020 to 2022 was compared with that of the random sampling method in 2019. The results showed that from 2020 to 2022, after conducting general sampling inspection predictions using the ensemble learning model, the unqualified rates obtained were 5.10%, 6.36%, and 4.39%, respectively, all higher than the unqualified rate of 2.09% in 2019. The overall annual sampling rates were 6.07% in 2020, 9.14% in 2021, and 10.9% in 2022, which were all controlled within the 2-10% range (with figures truncated, not rounded, below the decimal point) (Tables 13 and 14). We further applied the chi-square test, which showed that the ensemble learning method for border food sampling inspection significantly improved the unqualified rate (p < 0.001) (Table 14). Therefore, the ensemble learning model EL V.2, constructed from the seven algorithms used in this study and launched on 3 August 2020, can effectively increase the unqualified rate while maintaining the general sampling rate within the reasonable range of 2-10%. The chi-square test was used to evaluate whether there was a significant difference between the results in the years before and after the launch (2019). "*" means p < 0.05; "**" means p < 0.01; "***" means p < 0.001.
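The pre/post-launch comparison is a standard Pearson chi-square test on a 2x2 contingency table (year x unqualified/qualified among sampled batches). The sketch below uses illustrative counts chosen to be consistent with the reported rates (2.09% in 2019 vs 5.10% in 2020); they are not the paper's actual contingency table. For 1 degree of freedom, the p-value can be computed from the survival function P(X > x) = erfc(√(x/2)).

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (1 df, no continuity correction) for a
    2x2 table: rows = year, columns = (unqualified, qualified) batches.
    p-value via the chi-square survival function for 1 df:
    P(X > x) = erfc(sqrt(x / 2))."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Illustrative counts: 2019 random sampling, 66 unqualified of 3158 sampled
# (~2.09%); 2020 model-recommended sampling, 102 unqualified of 2000 (~5.10%).
chi2, p = chi2_2x2(66, 3092, 102, 1898)
print(f"chi2={chi2:.2f}, p={p:.2e}")  # p < 0.001
```

With rate differences of this size and sample sizes in the thousands, the test statistic is far beyond the p = 0.001 critical value (10.83 for 1 df), matching the "***" significance reported.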
The findings of this study are as follows:
1. EL V.2 is better than random sampling. After the ensemble learning model EL V.2 developed in this study was launched online, the predicted results from 2020 to 2022 were reviewed. Based on the overall general sampling cases throughout each year, the unqualified rate was 3.74% in 2020, 4.16% in 2021, and 3.01% in 2022, all significantly higher than the 2.09% of 2019. Further observation showed that the unqualified rates of cases recommended for sampling inspection through ensemble learning in 2020, 2021, and 2022 were 5.10%, 6.36%, and 4.39%, respectively, significantly higher than the 2.09% under random sampling in 2019.

2. The ensemble learning model should be periodically re-modeled. Based on Table 12, the unqualified rate showed a growing trend from 2019 to 2021 but decreased slightly in 2022 (Figure 5). A further chi-square test showed that the unqualified rate in 2022 was still significantly higher than that in 2019 (p < 0.001) (Table 14). However, for ensemble learning prediction models constructed with various machine learning algorithms, the factors and data required for modeling often change with the external environment and policies. Re-modeling is necessary to adjust for "data drift" or "concept drift" in the real world and prevent model failure. Drift refers to the degradation of predictive performance over time due to hidden external environmental factors; because data change over time, the model's capability to make accurate predictions may decrease. It is therefore necessary to monitor data drift and conduct timely reviews of modeling factors. When collecting new data, data already predicted by the model should be excluded to prevent the new model from overfitting. Enabling the new model to adjust to changes in the external environment will be a sustained effort in the future.
3. A trade-off between the unqualified batch hit rate and computational efficiency needs to be established. While the rejection rate was improved using the model constructed with seven algorithms (EL V.2), approximately 0.1% of batches took the model more than one minute to compute. The model was designed to help inspectors at the border make fast sampling decisions; considering computational efficiency and real-time prediction, random sampling is automatically selected for batches whose computation exceeds 1 min, avoiding delays in border inspection due to model failure.

Research Limitations
When determining the research scope, it was necessary to ensure that each product classification for border inspection applications had unqualified cases and that the number of unqualified cases was not too small. Therefore, for classifications with an unqualified rate of less than 1% in past sampling and fewer than 10 unqualified cases, the original random sampling mechanism was maintained. A product classification was also excluded from the scope of this study when no classification with high product homogeneity and similar inspection items could be found to merge it with. Owing to legal requirements associated with government data, the types, content, and hyperparameters of the risk factors cannot be presented in this paper, to protect information security and confidentiality.

Conclusions
In this study, we constructed a second-generation ensemble learning prediction model, EL V.2. The results showed that EL V.2 exhibited better prediction performance than both random sampling and the first-generation ensemble learning prediction model, EL V.1. Additionally, because the model was composed of seven algorithms, the overall prediction results remained robust under majority-decision voting even when an individual model was inadequate (AUC < 50%).
The outbreak of the COVID-19 pandemic in early 2020 had a worldwide impact on border control measures and on economic and trade exchanges. Compared with the unqualified rate in 2019, 2020 and 2021 saw increases in unqualified cases in Taiwan, likely attributable to the great changes in the origin and quantity of imported goods caused by the pandemic. Another reason for the changes in unqualified rates could be the modification of some related regulations and inspection standards. The effects of these factors on the evaluation of EL V.1 and EL V.2 still require further observation and analysis. Since 2020, Taiwan's border management has gradually introduced an intelligent management operation model. Border management powered by artificial intelligence enables Taiwan to strengthen its risk prediction capabilities and quickly adapt to trends amid rapid changes in the international environment, thereby safeguarding people's health and safety.