Portrait Segmentation Using Ensemble of Heterogeneous Deep-Learning Models

Image segmentation plays a central role in a broad range of applications, such as medical image analysis, autonomous vehicles, video surveillance and augmented reality. Portrait segmentation, which is a subset of semantic image segmentation, is widely used as a preprocessing step in applications such as security systems, entertainment applications and video conferences. A substantial number of deep learning-based portrait segmentation approaches have been developed, since the performance and accuracy of semantic image segmentation have improved significantly with the recent introduction of deep learning technology. However, these approaches are limited to a single portrait segmentation model. In this paper, we propose a novel approach that uses an ensemble method to combine multiple heterogeneous deep learning-based portrait segmentation models and improve the segmentation performance. Two-Models and Three-Models ensembles were evaluated using a simple soft voting method and a weighted soft voting method. The Intersection over Union (IoU) metric, IoU standard deviation and false prediction rate were used to evaluate the performance. Cost efficiency was calculated to analyze the efficiency of segmentation. The experiment results show that the proposed ensemble approach achieves higher accuracy and lower errors than single deep learning-based portrait segmentation models. The results also show that an ensemble of deep-learning models typically increases the use of memory and computing power, although an ensemble can also be more efficient than a single model, achieving higher accuracy with less memory and less computing power.


Introduction
Image segmentation has been one of the most challenging problems in image processing and computer vision for the last three decades. It is different from image classification, as an image classification algorithm has to classify the object which has a particular label, but an ideal image segmentation algorithm has to segment even the unknown objects [1]. Good image segmentation is expected to have uniform and homogeneous segmented regions, but achieving a desired segmented image is a challenging task [2]. Numerous image segmentation algorithms have been developed and reported in the literature. The traditional image segmentation algorithms are generally based on clustering technology, and the Markov model is one of the most well-known approaches. In general, the traditional image segmentation algorithms based on digital image processing and mathematical morphology are susceptible to noise, and require a lot of interaction between a human and computer for accurate segmentation [3][4][5][6][7][8][9][10][11][12][13]. Amongst image segmentation algorithms, Deep Learning (DL)-based image segmentation has created a new generation of image segmentation models, and surpassed traditional image segmentation methods in many aspects [14,15]. DL-based image segmentation aims to predict a category label for every image pixel, which is an important yet challenging task [16]. Ensemble methods can be used for improving prediction performance, and the idea of ensemble method is to build an effective model by integrating multiple models [17]. Model averaging is a general strategy amongst many ensemble methods in machine learning, and its idea is to train several different models separately, and then all the models vote on the output of each model. It is known that neural networks provide a wide variety of solutions, and can benefit from model averaging, even if all of the models are trained on the same dataset [18].
Integrating multiple models to enhance the prediction performance has been studied and analyzed by several researchers in machine learning: Breiman explained a method to improve the performance accuracy by combining several models [19]; Warfield et al. presented an algorithm for combining and estimating the segmentation performance [20]; Rohlfing et al. introduced a new way to combine multiple segmentation and enhance performance [21]; Hansen and Salamon suggested that an ensemble of similar functioning and configured networks increased the predictive performance of the individual network [22]; Singh et al. specified that ensemble segmentation had resulted in better performance than individual segmentation [23]; Yan Zou et al. introduced a rapid ensemble learning by combining deep convolution networks and random forests to reduce learning time using limited training data [24]; and Andrew Holliday et al. suggested a model compression technique to speed up DL segmentation using an ensemble. The authors combined the strengths of different architectures, while maintaining the real-time performance of segmentation [25]. Ishan Nigam et al. proposed an ensemble knowledge transfer to improve aerial segmentation. The authors trained multiple models by progressive fine-tuning and combined the collections of models to improve performance [26]. D. Marmanis et al. suggested an ensemble approach using Fully Convolution Network (FCN) models for the semantic segmentation of aerial images. It has been shown in this work that ensembles of several networks achieve excellent results [27]. Machine Learning contests are usually won by methods using model averaging, for example the Netflix Grand Prize [28]. Y. W. Kim et al. introduced an ensemble approach by combining three DL-based portrait segmentation models, and showed significant improvement in the segmentation accuracy [29].
Semantic image segmentation plays a central role in a broad range of applications such as medical image analysis, autonomous vehicles, video surveillance and Augmented Reality (AR) [14,30]. Portrait segmentation, which is a subset of semantic image segmentation, is generally used to segment a human's upper body in an image, but may not be limited to the upper body. The use of portrait segmentation technology is becoming more and more popular due to the popularity of "selfies" (self-portrait photographs). Portrait segmentation is widely used as a pre-processing step in multiple applications such as security system, entertainment application, video conference, AR, etc. In portrait segmentation, the precise segmentation of the human body is crucial but challenging [31]. A substantial amount of DL-based portrait segmentation approaches have been developed since the performance and accuracy of semantic image segmentation have improved significantly due to the recently introduced DL technology [32][33][34][35][36][37][38][39][40][41]. However, studies related to these approaches are focused on a single portrait segmentation model.
In this paper, we propose a novel approach using ensemble method by combining multiple heterogeneous DL-based portrait segmentation models, and the experiment results show that the proposed approach can produce improved performance over single DL-based portrait segmentation models. The major contributions of this research are as follows: (1) A novel approach of combining multiple heterogeneous DL-based portrait segmentation models was introduced. According to the best of our knowledge, there are no reported works on portrait segmentation using an ensemble of multiple heterogeneous DL models. (2) The efficiency rate of memory and computing power was measured to evaluate the efficiency of portrait segmentation models. We attempted to measure cost efficiency of five state-of-the-art DL-based portrait segmentation models, and proposed an In addition to these major contributions, the minor contributions of our research work are presented below: (4) Six state-of-the-art DL-based portrait segmentation models were experimented on and compared. (5) The Simple Soft Voting method and Weighted Soft Voting method were used to make an ensemble of individual portrait segmentation models. (6) A simple and efficient way to combine the output results of individual portrait segmentation models was introduced. (7) A quantitative experiment was used to evaluate the performance of portrait segmentation models.
Section 2 gives a brief introduction about the portrait segmentation and soft voting method. Section 3 explains portrait segmentation models used in the experiment, the concept of an ensemble, approach to combine portrait segmentation models, evaluation metrics and experiment environment. Section 4 presents the results of the experiment, followed by the brief discussion of the results in Section 5. The summary of experiment and contribution of this work are covered in Section 6.

Portrait Segmentation
In general, segmentation subdivides an image into its constituent regions or objects. The application problem defines the level to which the subdivision is supposed to be carried out, i.e., the segmentation should stop once the object of interest is segmented. The segmentation accuracy determines the success or failure of the application problem. Acknowledging the importance of segmentation applications in the real world specifies the necessity for obtaining highly precise and accurate segmented images. Portrait segmentation in general can be referred as a process of segmenting a person in an image from its background. There are many applications using a portrait segmentation technology. Some popular camera apps can change the background of a selfie photo automatically with one-click. A portrait segmentation is used for online video conferencing, as well as entertainment. Many organizations are using online video conferencing solutions, as many people are working from home since the COVID-19 pandemic. Portrait segmentation technology is used to hide the user's background for privacy, security and aesthetic reasons, while users are attending the online video conferencing. In addition to these, the portrait segmentation technology can be used for content-based image retrieval. For example, a segmented human body in photos can be stored in a database. When a user enters a query to find all photos with similar people, the human body of the entered photo can be segmented and passed to the database to find similar people in the database.
Recently, a substantial amount of DL-based portrait segmentation networks was developed in the literature. Song-Hai Zhang et al. proposed a novel semantic segmentation network called PortraitNet, which was specifically designed for real-time portrait segmentation on mobile devices with limited computational power, and their experimental results demonstrate both high accuracy and efficiency [33]. Hyojin Park et al. introduced SINet, which executes well on mobile devices with 100.6 frames per second, and has a high accuracy of 95.29% [34]. FCN was proposed by Jonathan Long et al., and adapts Entropy 2021, 23,197 4 of 20 the contemporary classification networks into fully convolutional networks and transfers their learned representations by fine-tuning to the segmentation task, achieving a state-of-the-art performance [35]. Mark Sandler et al. proposed a new mobile architecture, MobileNetV2, which improves the performance of mobile models on multiple tasks and benchmarks [36]. Its main contribution is a novel layer module. Andrew Howard et al. presents the next generation of MobileNets called MobileNetV3 in their work, based on a combination of complementary search techniques as well as novel architecture design [37]. Jimei Yang et al. introduced a DL algorithm for Contour Detection with a Fully Convolutional Encoder-Decoder Network (CEDN) [38]. This Fully Convolutional Encoder-Decoder was developed, inspired from a fully convolutional network and unpooling layers from deconvolutional network, and it focuses on detecting higher level object contours. Authors compared it with traditional methods and observed that proposed DL-based contour detections outperformed traditional methods and other DL-based approaches. Xianzhi et al. proposed a boundary-sensitive network for portrait segmentation, and authors showed the DL-based portrait segmentation outperformed the graph cut method [31]. Hyojin Park et al. 
introduced a new extremely lightweight portrait segmentation model named ExtremeC3Net, which even with far fewer parameters, produced competitive segmentation accuracy with PortraitNet [39]. DeepLabV3 Network was introduced by Liang-Chieh Chen et al., which significantly improves over their previous DeepLab versions without DenseCRF post-processing, and attains comparable performance with other state-of-art models [40]. Xi Chen et al. proposed the Boundary-Aware network (BANet), which is a lightweight network architecture that focuses on extracting detailed information in the boundary area to produce a high-quality segmented image in less time [41].
A segmentation model learns a desired task from the given inputs to produce the expected output. In this paper, we combine the outputs of DL-based image segmentation models. These models use the Convolutional Neural Network (CNN) DL algorithm. CNNs have led to the development of state-of-the-art segmentation models, six of which are used as portrait segmentation models in this paper. The FCN model owes its name to its fully convolutional architecture; this model uses a novel method to obtain a better upsampled feature map. The MobileNetV2 model uses a unique architecture to improve the performance on small devices. The other portrait segmentation models used in this paper are built on the Encoder-Decoder architecture. The pooling layers in the encoder module produce a low-resolution feature map containing high-level information, and the decoder module aims to produce a high-resolution segmentation output from the high-level information received from the encoder. Figure 1 shows the general structure of a simple Encoder-Decoder model for portrait segmentation.

Soft Voting Ensemble
In contrast to ordinary learning approaches, which try to construct one learner from training data, ensemble methods try to construct a set of learners and combine them. Generally, an ensemble is constructed in two steps, i.e., generating the base learners and then combining them. To get a good ensemble, it is generally believed that the base learners should be as accurate and as diverse as possible. There are many ensemble learning algorithms. Averaging and voting are the most popular and fundamental ensemble methods for numeric and nominal outputs, respectively [42]. Taking classification as an example, suppose a set of T individual classifiers $\{h_1, \dots, h_T\}$ is given. The task is to combine these classifiers to predict the class label from a set of l possible class labels $\{c_1, \dots, c_l\}$. There are many types of voting, such as majority voting, plurality voting, weighted voting and soft voting. Soft voting is used for individual classifiers that produce class probability outputs. In this method, the individual classifier $h_i$ outputs an l-dimensional vector $(h_i^1(x), \dots, h_i^l(x))^\top$ for the instance x, where $h_i^j(x) \in [0, 1]$ is the output of classifier $h_i$ for the class label $c_j$ and can be regarded as an estimate of the posterior probability. Soft voting is further classified into Simple Soft Voting and Weighted Soft Voting. In simple soft voting, the individual classifiers are treated equally, and the combined output is generated by simply averaging all the individual outputs. The final output for class $c_j$ is given in Equation (1) below [42]:

$$H^j(x) = \frac{1}{T} \sum_{i=1}^{T} h_i^j(x) \qquad (1)$$

In weighted soft voting, the individual outputs are combined with different weights. The weighted soft voting method takes any of the following three forms:

• A classifier-specific weight is assigned to each classifier, and the combined output for class $c_j$ is given in Equation (2) below, where $w_i$ is the weight assigned to the classifier $h_i$.

$$H^j(x) = \sum_{i=1}^{T} w_i \, h_i^j(x) \qquad (2)$$

• A class-specific weight is assigned to each classifier per class, and the combined output for class $c_j$ is given in Equation (3) below, where $w_i^j$ is the weight assigned to the classifier $h_i$ for the class $c_j$.

$$H^j(x) = \sum_{i=1}^{T} w_i^j \, h_i^j(x) \qquad (3)$$

• A weight is applied to each example of each class for each classifier, and the combined output for class $c_j$ is given in Equation (4) below, where $w_{ik}^j$ is the weight of the instance $x_k$ of the class $c_j$ for the classifier $h_i$, and m is the number of instances.

$$H^j(x) = \sum_{i=1}^{T} \sum_{k=1}^{m} w_{ik}^j \, h_i^j(x) \qquad (4)$$

Equation (4) is not often used in practice, since it may involve a large number of weight coefficients. The weights $w_i$ are usually assumed to be constrained, as given in Equation (5) below:

$$w_i \geq 0, \qquad \sum_{i=1}^{T} w_i = 1 \qquad (5)$$

In this paper, simple soft voting using Equation (1) and weighted soft voting using Equation (2) are used to make an ensemble of the individual portrait segmentation models, as they are simple and efficient approaches to obtain effective results.
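As a minimal sketch of the two voting rules used in this paper, the snippet below combines class-probability vectors, assuming Equation (1) is the plain average over the T classifiers and Equation (2) is a weighted sum with one non-negative weight per classifier that sums to 1. The function names and the three example classifiers are hypothetical and only serve the illustration.

```python
import numpy as np

def simple_soft_vote(probs):
    """Simple soft voting: average the class-probability vectors.

    probs has shape (T, l); row i is the output of classifier h_i.
    """
    probs = np.asarray(probs, dtype=float)
    return probs.mean(axis=0)

def weighted_soft_vote(probs, weights):
    """Weighted soft voting with one weight per classifier.

    The weights are assumed to be non-negative and to sum to 1.
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return weights @ probs

# Three hypothetical classifiers, two classes (background, foreground).
outputs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
simple = simple_soft_vote(outputs)
weighted = weighted_soft_vote(outputs, [0.5, 0.25, 0.25])
```

With equal weights of 1/T, the weighted rule reduces to the simple one, which is why simple soft voting can be read as a special case of Equation (2).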

Experimented Portrait Segmentation Models
In this paper, we used PortraitNet (PN), SINet (SN), FCN, MobileNetV2 (MNV2), MobileNetV3 (MNV3) and the CEDN for our experiment. Table 1 shows the model names, the dataset used for training each model, the input image resolution of each model and the dataset used for testing each model.

Model | Training dataset | Input resolution | Test dataset
FCN [35] | PASCAL VOC 2011 | 500 × 500 | EG1800 + CDI
MobileNetV2 [36] | Custom | 128 × 128 | EG1800 + CDI
MobileNetV3 [37] | Custom | 224 × 224 | EG1800 + CDI
PortraitNet [33] | EG1800 | 224 × 224 | EG1800 + CDI
SINet [34] | EG1800 | 224 × 224 | EG1800 + CDI

The AISegment dataset [43], which is publicly available for portrait segmentation training and testing, was used to train the CEDN model. The Custom dataset consists of 18,698 portrait images which are publicly available on websites, and was used to train the pre-trained MobileNetV2 and MobileNetV3 models. The EG1800 dataset [44] consists of 1447 publicly available images for training and 289 images for validation. In this paper, 270 images from the validation set of the EG1800 dataset and 298 validation portrait images collected from publicly available websites (which we call the "CDI" dataset) were used for the experiment.

Ensemble of Portrait Segmentation Models
In machine learning, ensemble models are developed by combining the prediction of multiple individual models to improve the overall predictions. The ensemble model can be created either by combining multiple modelling algorithms or using different training datasets. The main purpose of ensemble method is to reduce the generalization error in the machine learning algorithms. In this paper, we propose the ensemble approach of n portrait segmentation models to enhance the accuracy performance of the single models, where n indicates the number of single models combined at a time. Figure 2 shows the structure of ensemble approach of n portrait segmentation models. Consider a set of T number of portrait segmentation models i.e., {L 1 , L 2 , L 3 , . . . , L T }, where L i are individual portrait segmentation models (First-Level learners). The input images from the EG1800 + CDI datasets are fed to the individual models, and the segmented output masks are obtained and stored. These individual segmented output masks are combined to enhance final outputs. For the ensemble of the outputs of individual models, the simple soft voting method and weighted soft voting method are used in this research. The idea of the ensemble is to learn from the outputs of individual models, and collectively produce a better segmented result.
In this paper, the weighted soft voting method is used as a Meta-Classifier of a stacking ensemble. Stacking is the process of training individual classifiers, called First-Level learners, and combiners, called Second-Level learners or Meta-Learners, that combine them [19,45,46]. The basic idea of stacking ensembles is to train First-Level learners using the original dataset, and the classification results generated by these learners are then used as a new training dataset for the Second-Level learner. Here, the labels of the original dataset are still used as the labels of the new training dataset. Figure 3 shows the diagram of a stacking ensemble, where D is the original dataset {(x1, y1), (x2, y2), ..., (xm, ym)}, L1, ..., LT are the First-Level learners and L is the Second-Level learner; T is the number of learners and m is the number of examples in D. Although heterogeneous learning algorithms are often used as First-Level learners, it is also possible to use homogeneous learning algorithms. In this paper, the segmentation models of Table 1 are used as the First-Level learners.
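The stacking idea can be sketched for two First-Level models as follows. The paper does not state how the Second-Level weights are fitted, so the grid search below is an illustrative stand-in, not the authors' procedure: it picks the weight that maximizes the mean Intersection over Union (intersection divided by union of predicted and ground-truth masks) on the First-Level outputs, using the original labels. The function names, the 0.5 threshold and the toy arrays are all assumptions.

```python
import numpy as np

def iou(pred, truth):
    """Intersection over Union of two boolean masks."""
    union = np.logical_or(pred, truth).sum()
    return np.logical_and(pred, truth).sum() / union if union else 1.0

def fit_weight(prob_a, prob_b, truth, steps=21):
    """Grid-search the weight w of model a (model b gets 1 - w).

    prob_a, prob_b: First-Level probability maps, shape (m, H, W).
    truth: ground-truth masks of the original dataset, shape (m, H, W).
    Returns the first weight that reaches the best mean IoU, and that IoU.
    """
    best_w, best_score = 0.0, -1.0
    for w in np.linspace(0.0, 1.0, steps):
        combined = w * prob_a + (1.0 - w) * prob_b
        preds = combined >= 0.5  # assumed decision threshold
        score = np.mean([iou(p, t) for p, t in zip(preds, truth)])
        if score > best_score:
            best_w, best_score = w, score
    return float(best_w), float(best_score)

# Toy data: model a is right on every pixel, model b is wrong on two.
truth = np.array([[[1, 1], [0, 0]]], dtype=bool)
prob_a = np.array([[[0.9, 0.8], [0.1, 0.2]]])
prob_b = np.array([[[0.9, 0.1], [0.9, 0.1]]])
best_w, best_score = fit_weight(prob_a, prob_b, truth)
```

On this toy input the search settles on a weight above 0.5 for the more accurate model, which is the behaviour one would expect from a learned weight ratio.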

Concept of Combining Segmented Outputs
In this paper, simple soft voting and weighted soft voting as a Meta-Classifier in a stacking ensemble were used to combine the segmented output masks of individual models. In the simple soft voting method using Equation (1), the outputs of individual models are weighted equally. In weighted soft voting using Equation (2), the weights of the outputs of individual models are not equal, and are trained from the results of the First-Level classifiers in the stacking ensemble to find a respective weight ratio. The segmented output masks generated by individual models have binary values, i.e., 1 or 0. These binary values were converted into probability values by applying the Gaussian Blur function. Gaussian Blur is commonly used in image processing to smooth an image and reduce its noise. It uses a non-uniform low-pass filter that preserves low spatial frequencies and reduces noise and negligible details in the image. It belongs to the class of linear smoothing filters, with weights chosen according to the shape of a Gaussian function [47]. The 2-dimensional Gaussian filter is defined in Equation (6) below [48]:

$$G(x, y) = \frac{1}{2\pi\sigma^2} \, e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (6)$$

where G is the Gaussian mask at the location with coordinates x and y, and σ is the standard deviation of the Gaussian. A larger σ produces a stronger smoothing effect. In this paper, Gaussian Blur is used to convert the binary image mask into a probability distribution, which indicates how close each pixel is to the segmented human region. Figure 4 shows the effect of applying Gaussian Blur to a binary image mask. The left matrix shows 9 × 9 pixels that take one of two values, 1 or 0. If the pixel value is 1, it belongs to the segmented foreground region. If the pixel value is 0, it belongs to the background region. The right matrix shows the binary pixel values converted into probability values after applying Gaussian Blur with a 5 × 5 kernel. The closer a pixel value is to 1, the higher its probability of belonging to the segmented foreground region. Converting the binary pixel values into probability values is also useful for evaluating Equations (1) and (2).
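The conversion from a binary mask to per-pixel probabilities can be sketched with NumPy alone. This is an illustration under assumptions: the paper gives the 5 × 5 kernel size but not σ or the border handling, so σ = 1.0 and edge replication are chosen here, and the kernel is normalized to sum to 1 instead of using the analytic constant of the Gaussian. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """size x size Gaussian kernel, normalized so its entries sum to 1."""
    half = size // 2
    ax = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()

def blur_mask(mask, size=5, sigma=1.0):
    """Blur a binary mask; border pixels are replicated (an assumption)."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    rows, cols = mask.shape
    padded = np.pad(mask.astype(float), half, mode="edge")
    out = np.zeros((rows, cols), dtype=float)
    # Accumulate shifted copies of the mask, one per kernel entry.
    for dy in range(size):
        for dx in range(size):
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return out

# 9 x 9 mask: left columns are background (0), right columns foreground (1).
mask = np.zeros((9, 9), dtype=np.uint8)
mask[:, 5:] = 1
prob = blur_mask(mask)
```

Pixels far from the boundary keep values of 0 or 1, while pixels near the boundary take intermediate values, so the blurred masks of several models can be averaged and thresholded.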

Performance Measurement
In this paper, the IoU metric was used for performance measurement. The IoU is a metric used to measure the accuracy of a classifier. When A is a segmented image and B is a ground truth image, IoU is defined as in Equation (7) below:

$$IoU = \frac{|A \cap B|}{|A \cup B|} \qquad (7)$$

where $A \cap B$ is the intersection of A and B, and $A \cup B$ is the union of A and B. In this paper, we used the IoU standard deviation to measure the variance of prediction. The False Negative Rate (FNR) and False Discovery Rate (FDR) were calculated to validate the accuracy enhancement. FNR measures the rate of ground-truth foreground that the segmentation misses, i.e., the segmented area is smaller than the ground truth. FDR measures the rate of segmented area that lies outside the ground truth, i.e., the segmented area is larger than the ground truth. FNR and FDR are given in Equations (8) and (9) below:

$$FNR = \frac{FN}{FN + TP} \qquad (8)$$

$$FDR = \frac{FP}{FP + TP} \qquad (9)$$

where TP, FP and FN are the True Positive, False Positive and False Negative values of the segmented image. It is known that the ensemble of multiple classifiers can reduce the bias error of classifiers. In this paper, we used FDR + FNR to measure the bias error, as the bias error indicates the amount of difference between the prediction and true values.
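The three accuracy measures can be computed from pixel counts as in the sketch below. It assumes the standard definitions: IoU as intersection over union, which equals TP / (TP + FP + FN) for binary masks, FNR = FN / (FN + TP) and FDR = FP / (FP + TP). The function name and the small example masks are hypothetical, and both masks are assumed to contain at least one foreground pixel.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Return IoU, FNR and FDR for a predicted mask and a ground-truth mask."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)    # foreground predicted as foreground
    fp = np.sum(pred & ~truth)   # background predicted as foreground
    fn = np.sum(~pred & truth)   # foreground predicted as background
    return {
        "iou": tp / (tp + fp + fn),
        "fnr": fn / (fn + tp),
        "fdr": fp / (fp + tp),
    }

truth = [[1, 1, 0],
         [1, 1, 0]]
pred = [[1, 1, 1],
        [1, 0, 0]]
metrics = segmentation_metrics(pred, truth)
false_rate = metrics["fnr"] + metrics["fdr"]  # FDR + FNR, used as bias error
```

Here one foreground pixel is missed and one background pixel is wrongly included, so FNR and FDR are equal and |FDR − FNR| is 0, which corresponds to a balanced error.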
To measure the cost efficiency of models, the Memory Efficiency Ratio (MER) and Computing Efficiency Ratio (CER) are used. In general, the efficiency ratio indicates the costs as a proportion of the gain (costs/gain). MER measures the memory efficiency of a model, which indicates the memory size required to gain a certain accuracy. When P is the parameter size of a model and IoU is the accuracy metric of a model given by Equation (7), MER is defined as in Equation (10) below:

$$MER = \frac{P}{IoU} \qquad (10)$$

CER measures the computing power efficiency of the model, which indicates the computing power required to gain a certain accuracy. When C is the number of Floating Point Operations (FLOPs) of a model and IoU is the accuracy metric of a model given by Equation (7), CER is defined as in Equation (11) below:

$$CER = \frac{C}{IoU} \qquad (11)$$
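A small worked example may help in reading these ratios. It assumes the costs/gain form literally, i.e., MER as parameter size divided by IoU and CER as FLOPs divided by IoU, so that a lower value means a more efficient model. The two models and all their numbers are hypothetical and are not taken from the paper's tables.

```python
def efficiency_ratio(cost, iou):
    """Cost per unit of accuracy (costs/gain); lower is more efficient."""
    if not 0.0 < iou <= 1.0:
        raise ValueError("iou must be a fraction in (0, 1]")
    return cost / iou

# Hypothetical models: (parameters in millions, GFLOPs, IoU as a fraction).
models = {
    "small": (2.0, 0.5, 0.95),
    "large": (30.0, 10.0, 0.96),
}
mer = {name: efficiency_ratio(p, iou) for name, (p, _, iou) in models.items()}
cer = {name: efficiency_ratio(c, iou) for name, (_, c, iou) in models.items()}
```

In this made-up case the large model is slightly more accurate, but it pays far more parameters and FLOPs per unit of IoU, so both its MER and CER are worse.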

Experiment Environment
In this paper, two groups of five state-of-the-art portrait segmentation models were used for the experiment. One group was used to evaluate the IoU value, IoU standard deviation and false prediction rate, while the other group was used to evaluate cost efficiency. The Two-Models ensemble and Three-Models ensemble were used to demonstrate the proposed ensemble approach. All possible combinations of two and three models out of the five single models were evaluated for the accuracy and cost efficiency measurements. The outputs of single models were combined using simple soft voting and weighted soft voting as a Meta-Classifier in a stacking ensemble. Equation (1) for simple soft voting and Equation (2) for weighted soft voting were used for the ensemble of single models. The validation images of the EG1800 and CDI datasets were fed to the single models, and then the segmented output masks from these single models were combined using ensemble methods. A total of 568 portrait images of the EG1800 + CDI dataset were used to evaluate the accuracy and cost efficiency of the individual portrait segmentation models and the proposed ensemble models. The experiment results of the Two-Models and Three-Models ensembles were compared with the results of single models and analysed. This experiment was conducted on an Ubuntu 18.04.4 LTS operating system. All the training and testing was performed using a GeForce GTX 1080 GPU.
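The set of ensembles evaluated per group can be enumerated as below. The model abbreviations follow the paper, but which five of the six models belong to each group is an assumption made only for this illustration.

```python
from itertools import combinations

# Assumed membership of one group of five single models (illustrative).
models = ["FCN", "MNV2", "MNV3", "PN", "SN"]

# Every unordered pair and triple of distinct models.
two_model_ensembles = list(combinations(models, 2))
three_model_ensembles = list(combinations(models, 3))
```

For five single models this yields 10 Two-Models ensembles and 10 Three-Models ensembles, and each single model occurs in 4 of the pairs and 6 of the triples.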

Results
In this section, the experiment results and their analysis for the single models, the Two-Models ensemble and the Three-Models ensemble are presented. IoU, IoU standard deviation, FNR and FDR are used to measure the accuracy of the single models and the suggested ensemble models. Table 2 shows the experiment results of five single models. The best results are highlighted in bold. The MNV2 model achieves an IoU of 96.2082%, the highest accuracy among the evaluated single models. The FCN model shows the lowest accuracy, with an IoU of 95.2268%. The false prediction rate (FNR + FDR) is lower when the model has a higher IoU value.

Result of Two-Models Ensemble
In Tables 3 and 4, all possible combinations of the Two-Models ensemble and Three-Models ensemble of the five single models were evaluated using simple soft voting and weighted soft voting as a Meta-Classifier in a stacking ensemble. The first column shows the model combination; the other columns show the IoU, IoU standard deviation and false prediction rate (FNR + FDR). Table 3 shows the experiment result of the Two-Models ensemble. The best results of the experiment are highlighted in the table. In the experiment, the FCN + PN ensemble shows the highest IoU and the lowest false prediction rate for both simple soft voting and weighted soft voting. The FCN + MNV2 ensemble also shows good results, similar to the FCN + PN ensemble. It is also observed that weighted soft voting as a Meta-Classifier in a stacking ensemble produces a higher IoU value than simple soft voting, i.e., for the same model combination, the accuracy of ensemble models using weighted soft voting is better than that of models using simple soft voting.
Table 4 shows the result of the Three-Models ensemble. The best results of the experiment are highlighted in the table. The FCN + MNV2 + SN ensemble shows the highest IoU and the lowest false prediction rate in both simple soft voting and weighted soft voting. As with the Two-Models ensemble, it is observed that weighted soft voting produces higher accuracy than simple soft voting.

IoU Comparison
From Tables 3 and 4, the average IoU, the average IoU standard deviation and the average false prediction rate were calculated for each individual model in both the Two-Models ensemble and the Three-Models ensemble. The average value for a single model is taken over all Two-Models ensembles or Three-Models ensembles in which that model occurs. For example, the IoU of MNV2 in the Two-Models ensemble is the average of the IoU values of all Two-Models ensembles that include MNV2. Table 5 shows the IoU comparison for a single model without ensemble, the Two-Models ensemble using simple soft voting, the Two-Models ensemble using weighted soft voting, the Three-Models ensemble using simple soft voting and the Three-Models ensemble using weighted soft voting. It is observed that the MNV2 model, which has the highest IoU before ensemble, shows the highest IoU in the Two-Models ensemble using weighted soft voting. The FCN model, which has the lowest IoU before ensemble, shows the highest IoU in the Two-Models ensemble using simple soft voting, as well as in the Three-Models ensemble. It also shows the most improvement: its IoU of 95.2268% before ensemble rises to 97.1839% after ensemble. The result shows that our approach does not degrade the accuracy of models with higher IoU, and at the same time improves the accuracy of models with lower IoU. Figure 5 is the graphical representation of the IoU comparison of individual models for no ensemble, the Two-Models ensemble and the Three-Models ensemble. It is observed that the Three-Models ensemble shows the best result, and that the Two-Models ensemble shows a better result than no ensemble.
Table 6 shows the comparison of IoU standard deviation. The IoU standard deviation is a measure of the amount of variation of the IoU values. A low IoU standard deviation indicates that the IoU values tend to be close to the mean IoU. Table 6 presents the IoU standard deviation for the single model without ensemble, the Two-Models ensemble using simple soft voting, the Two-Models ensemble using weighted soft voting, the Three-Models ensemble using simple soft voting and the Three-Models ensemble using weighted soft voting. The IoU standard deviation of the PN model, which is the lowest before ensemble at 3.7915, is reduced to 3.0581. The FCN model, which has the second-highest IoU standard deviation before ensemble at 5.6956, shows the most improvement, with a reduction to 2.8583 after ensemble. The result shows that the Two-Models and Three-Models ensembles produce a better result than the single model in most cases.
Table 7 shows the false prediction rate for the single model, the Two-Models ensemble and the Three-Models ensemble using the weighted soft voting method. In the table, a lower FDR + FNR value means a lower bias error, as the bias error indicates the amount of difference between the predicted and true values.
The result shows that MNV2 has the lowest FDR + FNR value among the single models and in the Two-Models ensemble. Its FDR + FNR value, which is the lowest before ensemble at 3.8657, is reduced to 3.0057 after ensemble. The FCN model shows the most improvement after ensemble. The average of FDR + FNR shows that the Two-Models ensemble and the Three-Models ensemble produce a lower bias error than the single model. It is also observed that the average of FDR + FNR decreases as the number of single models combined in the ensemble increases. In the table, |FDR-FNR| is the absolute value of FDR-FNR, which indicates the difference between the FDR and FNR values. If this value is close to 0, the model produces a well-balanced result in terms of falsely predicted regions. If the value increases, falsely predicted positive regions increase while falsely predicted negative regions decrease, or vice versa. From the average value in the table, it is observed that the |FDR-FNR| values decrease significantly as the number of models in the ensemble increases.
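The quantities compared in Tables 5 to 7 can be illustrated with a short sketch. The equations for IoU, FDR and FNR are given earlier in the paper and are not reproduced in this section, so the code below assumes the standard per-pixel definitions (IoU = TP / (TP + FP + FN), FDR = FP / (TP + FP), FNR = FN / (TP + FN)), reports them as percentages, and uses the population standard deviation. The function names `mask_metrics` and `summarize` are hypothetical and chosen only for illustration.

```python
import numpy as np


def mask_metrics(pred, truth):
    """Per-image IoU, FDR and FNR (in percent) for binary masks.

    Assumes the standard definitions; the paper's own equations
    are outside this section.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))    # correctly predicted portrait pixels
    fp = int(np.sum(pred & ~truth))   # extra regions (raise FDR)
    fn = int(np.sum(~pred & truth))   # missing regions (raise FNR)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    fnr = fn / (tp + fn) if (tp + fn) else 0.0
    return {"iou": 100.0 * iou, "fdr": 100.0 * fdr, "fnr": 100.0 * fnr}


def summarize(preds, truths):
    """Aggregate per-image metrics over a validation set."""
    rows = [mask_metrics(p, t) for p, t in zip(preds, truths)]
    iou = np.array([r["iou"] for r in rows])
    fdr = float(np.mean([r["fdr"] for r in rows]))
    fnr = float(np.mean([r["fnr"] for r in rows]))
    return {
        "mean_iou": float(iou.mean()),
        "iou_std": float(iou.std()),          # population std (assumption)
        "fdr_plus_fnr": fdr + fnr,            # lower means lower bias error
        "abs_fdr_minus_fnr": abs(fdr - fnr),  # near 0 means balanced errors
    }
```

A model that both adds one extra pixel and misses one pixel has equal FDR and FNR, so its |FDR-FNR| is 0 even though FDR + FNR is not; this is why the two quantities are reported separately.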

Examples of FNR Reduction
Figures 6 and 7 show the comparison of reduction in the FNR. The first picture shows the ground truth image, the middle pictures show the segmented image masks of single models and the last picture shows the segmented image mask of the ensemble model. As the FNR indicates the error rate of missing regions compared to the ground truth image, the result shows that the missing regions of single models were recovered after ensemble. In Figure 6, the examples show the reduction of FNR after the ensemble of FCN + MNV2. Figure 7 shows the reduction of FNR after the ensemble of FCN + MNV2 + SN.
Figures 8 and 9 show the comparison of reduction in the FDR. The first picture shows the ground truth image, the middle pictures show the segmented image masks of single models and the last picture shows the segmented image mask of the ensemble model. As the FDR indicates the error rate of extra regions compared to the ground truth image, the result shows that the extra regions of single models were removed after ensemble. In Figure 8, the examples show the reduction of FDR after the ensemble of FCN + MNV2. Figure 9 shows the reduction of FDR after the ensemble of FCN + MNV2 + SN. The results show that the proposed ensemble approaches produce a lower error rate than single models. This means that the proposed ensemble approaches produce better results than the original single models.
Figures 10 and 11 show the segmented results of single models and the proposed ensemble models. The first column is the original image, the second column is the ground truth segmentation, followed by the single models' segmentation, and the last column is the proposed ensemble model's segmentation.
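The way soft voting can both recover missing regions and remove extra regions can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each model outputs a per-pixel foreground probability map, that the combined probability is thresholded at 0.5, and that the weights passed for weighted soft voting are arbitrary illustrative values (the paper's weighting scheme is defined outside this section). The function name `soft_vote` is hypothetical.

```python
import numpy as np


def soft_vote(prob_maps, weights=None, threshold=0.5):
    """Combine per-pixel foreground probabilities from several models.

    weights=None gives simple soft voting (plain average); otherwise the
    weights are normalized and a weighted average is used.
    """
    probs = np.stack([np.asarray(p, dtype=float) for p in prob_maps])
    if weights is None:
        weights = np.ones(len(probs))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Weighted sum over the model axis, leaving an (H, W) probability map.
    combined = np.tensordot(weights, probs, axes=1)
    return combined >= threshold


# Toy 2x2 example with made-up probabilities.
model_a = [[0.9, 0.3], [0.2, 0.1]]  # misses the top-right pixel
model_b = [[0.8, 0.8], [0.1, 0.6]]  # adds a false bottom-right pixel
simple = soft_vote([model_a, model_b])
weighted = soft_vote([model_a, model_b], weights=[3.0, 1.0])
```

In the toy example, simple soft voting keeps the top-right pixel that `model_a` alone would miss and drops the bottom-right pixel that `model_b` alone would add, which mirrors the FNR and FDR reductions shown in Figures 6 to 9.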

Cost Efficiency Analysis
In general, an ensemble of single classifiers can improve on the accuracy of the individual classifiers, but it requires more memory and computing power than a single classifier, as the ensemble model has to execute all of its individual classifiers to combine their results. The total memory and computing power used by an ensemble model is the sum over the individual models included in it. The same results can be observed in our experiments. The averaged results of single models, the Two-Models ensemble and the Three-Models ensemble show that ensemble models generally improve segmentation accuracy, but also increase the use of memory and computing power.
To measure and compare the efficiency of single models and ensemble models, MER using Equation (10) and CER using Equation (11) were used. MER indicates the memory size required to achieve a certain accuracy, where a lower MER means more efficient memory usage. CER indicates the computing power required to achieve a certain accuracy, where a lower CER means more efficient use of computing power. For the experiment, the EG1800 + CDI dataset was fed to the SN, PN, MNV3, MNV2 and FCN models, and the results of the individual models were combined using the Two-Models ensemble and the Three-Models ensemble. The simple soft voting method was used to build the ensembles of single models. Tables 8-10 show the efficiency rate of single models and ensemble models. The first column shows the model name, followed by the number of parameters (Params), FLOPs, IoU value, MER value and CER value of each model. The IoU value of the ensemble models was calculated from the result of the simple soft voting ensemble.
In Table 8, the SN model shows the lowest MER and CER values among the single models. The SN, PN and MNV3 models show lower MER and CER values than MNV2. This means that SN, PN and MNV3 are more cost efficient than MNV2, even though MNV2 shows the best IoU value and is therefore the most accurate in segmentation. Table 9 shows the efficiency rate of the Two-Models ensemble. In the table, Params is the sum of the Params of the two individual models, FLOPs is the sum of the FLOPs of the two individual models and the IoU value is the result of the ensemble model. Compared with the FCN model in Table 8, the FCN + SN ensemble shows lower MER and CER values together with a higher IoU value, which indicates that the FCN + SN combination is a cost-efficient ensemble model with better accuracy than the FCN single model. Table 10 shows the efficiency rate of the Three-Models ensemble. In the table, Params is the sum of the Params of the three individual models, FLOPs is the sum of the FLOPs of the three individual models and the IoU value is the result of the ensemble model. Compared with MNV2, which shows the highest accuracy among the single models in the experiment, the MNV3 + PN + SN ensemble shows lower MER and CER values together with a higher IoU value. The summed Params and FLOPs of MNV3 + PN + SN are also lower than those of the MNV2 single model. These results show that an ensemble model can achieve the same or higher accuracy than single models in a cost-efficient way.
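The bookkeeping behind Tables 8-10 can be sketched as below. Equations (10) and (11) are defined earlier in the paper and are not reproduced in this section, so the sketch assumes the simplest reading consistent with the text: MER is the parameter count divided by IoU and CER is the FLOPs divided by IoU, so that a lower value means less cost per unit of accuracy. The actual equations may differ. The class `ModelCost`, the functions `combine`, `mer` and `cer`, and all numbers are hypothetical placeholders, not values from the tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCost:
    name: str
    params: float  # number of parameters (e.g., in millions)
    flops: float   # floating-point operations (e.g., in GFLOPs)


def combine(models):
    """Cost of an ensemble: the sum of the costs of its members."""
    return ModelCost(
        name=" + ".join(m.name for m in models),
        params=sum(m.params for m in models),
        flops=sum(m.flops for m in models),
    )


def mer(cost, iou):
    """Memory efficiency rate, ASSUMED here to be Params / IoU."""
    return cost.params / iou


def cer(cost, iou):
    """Computing efficiency rate, ASSUMED here to be FLOPs / IoU."""
    return cost.flops / iou


# Placeholder figures for illustration only.
small_a = ModelCost("A", params=1.0, flops=0.25)
small_b = ModelCost("B", params=0.5, flops=0.5)
large = ModelCost("L", params=4.0, flops=1.0)
pair = combine([small_a, small_b])
```

Under these assumed definitions, an ensemble of two small models whose summed Params and FLOPs stay below those of one large model has lower MER and CER than the large model whenever its IoU is at least as high, which is the pattern the text reports for MNV3 + PN + SN against MNV2.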

Discussion
The analysis of the experiment results shows that the IoU, IoU standard deviation and false prediction rate of DL-based portrait segmentation models can be improved through an ensemble. This indicates that the accuracy of segmentation can be enhanced, and that variance and bias errors can be reduced, using an ensemble of DL-based portrait segmentation models. Increasing the number of single models participating in the ensemble from two to three produces higher accuracy and lower prediction errors. The weighted soft voting method can further improve the accuracy of ensemble models built with the simple soft voting method.
The analysis of cost efficiency shows that an ensemble of DL-based models typically increases the use of memory and computing power. It also shows, however, that an ensemble of DL-based models can be more efficient than a single DL-based model, achieving higher accuracy with less memory and less computing power.

Conclusions
In this paper, simple and efficient ensemble approaches for portrait segmentation using six state-of-the-art DL-based portrait segmentation models were proposed, and the experiment results of the Two-Models and Three-Models ensembles were presented and analyzed. The simple soft voting and weighted soft voting methods were used as a Meta-Classifier in a stacking ensemble to combine the individual portrait segmentation models. The images from the validation sets of the EG1800 and CDI datasets were used for the experiment. The IoU metric was used to evaluate the accuracy of single models and the proposed ensemble approach. The IoU standard deviation and false prediction rate were analyzed to evaluate the improvement in variance and bias errors. The efficiency rate was analyzed to evaluate the efficiency of single models and the proposed ensemble approach. The experiment results showed that the IoU value, IoU standard deviation and false prediction rate of single models were significantly improved after ensemble. The ensemble using the weighted soft voting method showed better accuracy than the ensemble using the simple soft voting method. It was also observed that the Three-Models ensemble showed better results than the Two-Models ensemble in terms of accuracy, variance and bias errors. The analysis of cost efficiency showed that the ensemble of DL-based portrait segmentation models typically increased the use of memory and computing power. However, it also showed that an ensemble of DL-based portrait segmentation models can be more efficient than a single DL-based portrait segmentation model, achieving higher accuracy with less memory and less computing power.
In this paper, we evaluated six state-of-the-art DL-based portrait segmentation models and compared them with the proposed ensemble approach. We also introduced methods to improve the performance of DL-based portrait segmentation models using the ensemble approach, as well as methods to evaluate the performance of DL-based portrait segmentation models and their ensembles. We hope these findings will benefit other researchers.

Informed Consent Statement: Not applicable.
Data Availability Statement: Data are available in a publicly accessible repository. The data presented in this study are openly available in references [43,44].