Testing dataset for head segmentation accuracy for the algorithms in the ‘BGSLibrary’ v3.0.0 developed by Andrews Sobral

This dataset consists of video files that were created to test the accuracy of background segmentation algorithms contained in the C++ wrapper ‘BGSLibrary’ v3.0.0 developed by Andrews Sobral. The comparison is based on segmentation accuracy of the algorithms on a series of indoor color-depth video clips of a single person's head and upper body, each highlighting a common factor that can influence the accuracy of foreground-background segmentation. The algorithms are run on the color image data, while the ‘ground truth’ is semi-automatically extracted from the depth data. The camera chosen for capturing the videos features paired color-depth image sensors, with the color sensor having specifications typical of mobile devices and webcams, which cover most of the use cases for these algorithms. The factors chosen for testing are derived from a literature review accompanying the dataset as being able to influence the efficacy of background segmentation. The assessment criteria for the results were set based on the requirements of common use cases such as gamecasting and mobile communications to allow the readers to make their own judgements on the merits of each algorithm for their own purposes.


© 2020 Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Specifications Table

Subject: Computer Science - Computer Vision and Pattern Recognition

Specific subject area: Foreground-background segmentation of the head and upper body.

Type of data: Table, Figures, Video clips

How data were acquired: Intel RealSense Depth Camera D435 featuring a global shutter, large color pixels of 3 µm square, and a depth sensor using disparity mapping from stereo infra-red cameras. Custom acquisition software is available at the GitHub repository https://github.com/scloke/SegTest. Testing was performed on a system running Microsoft Windows 10 with a four-core Intel Xeon E3 processor at 3.5 GHz using Visual Basic and Visual C++ 2019 in a 64-bit address space with 32 GB RAM allocated and an NVIDIA GeForce GTX 1060 6GB graphics card installed. Image processing was done using the libraries in EmguCV 3.2.0 and Accord.Net 3.8.0. The system speed was rated at 475 million floating-point operations per second (MFLOPS) using the Intel Processor Diagnostic Tool 2.10, 64-bit version.

Data format: Raw, Analyzed

Parameters for data collection: The following camera settings were used: structured light projector on, autofocus enabled, autoexposure disabled, automatic white balancing disabled, backlight compensation disabled, and powerline frequency compensation disabled. Capture resolution was 640 × 480 pixels at 30 frames per second (fps) for color data and 90 fps for depth data, with the depth data processed using temporal and spatial smoothing with hole-filling to reduce artefacts. Synthetic paired color and depth frames were motion interpolated from the source frames to generate video clips without any inter-frames. The clips were then saved to Audio Video Interleave (AVI) files using FFMPEG, with dimensions of 640 × 480 pixels at 30 fps. Color clips were encoded using the lossy MPEG-4 Part 2 codec at a bit rate of 4 megabits per second (Mbps), except for the noise clips, which were encoded at 12 Mbps to preserve the noise artefacts. The clips were captured at night under controlled bidirectional diagonal and side lighting with Philips Hue White & Color Ambiance bulbs set to the 'Energize' preset with a color temperature of 6410 K, calculated from the Mired Color Temperature supplied by the Philips Hue Software Development Kit. The camera was placed 120 cm in front of either a plain green screen (standard), a cream-colored screen (camouflage), or with the screen removed (complex), with the subject standing 60 cm in front of the screen and no intervening objects. This resulted in a foreground area that was consistently about half the total background size, which is sufficiently balanced not to distort the measures of segmentation efficacy, yet not so large as to prevent the face detection routine from working properly.

Description of data collection: One of the chosen factors listed in Table 2 was then applied. All clips were 40 s long, with the first 10 s showing just the background. The subject entered the scene at the 10 s mark and stood in the center of the frame while keeping a neutral expression, with the face and upper body fully visible. The comparison period was set to all frames between the 20 and 40 s marks inclusive. The brightness for all clips was normalized by applying the appropriate constant gamma correction to keep the average pixel brightness throughout the clip at 50% of the maximum brightness. For lighting change factors, the lighting conditions were altered mid-way through the comparison period. For the 'Ghost Images' clip, the comparison period was set to one second after the transition, because most algorithms for detection and removal of moving objects, ghosts, and shadows operate in less than a second. The comparison period for the 'Sleeping Foreground' clip was set to 10 s after the transition to allow enough time for the object to be incorporated into the background model. In clips where a subject was present, the face location was determined using libfacedetection by Shiqi Yu, and a seed point and depth were obtained from the center of the bounding rectangle. The 'ground truth' foreground was then extracted from a floating range flood fill starting at the seed point, with a maximum difference of 2 cm between adjacent pixels. Since the depth data from disparity mapping is coarse and lacks edge accuracy, an automated GrabCut algorithm was used to refine the edges of the foreground. The 'ground truth' clips were verified by visual inspection at 4 fps, and adjustments were made with manual GrabCut assistance where the areas of inaccuracy exceeded 5% of the foreground (Fig. 1). For clips without a subject, the depth data was ignored and the 'ground truth' foreground was set to zero. The segmentation foreground was obtained by processing the color data using the appropriate 'BGSLibrary' algorithm with default settings, except for the 'GMG' and 'GMM KaewTraKulPong' routines, which require an earlier version of OpenCV. To speed up processing, the segmentation routines were run in four isolated parallel threads in 'release' mode.

Value of the Data
• While there have been many routines developed for foreground-background portrait segmentation, evaluation of these routines is usually done with video clips under standard conditions. Little is known about their performance under some of the factors which affect segmentation efficacy.
• This data will be useful to researchers who are developing and testing algorithms for portrait segmentation.
• Color and 'ground truth' video clips are available for each of the factors described. A new algorithm can be tested by applying it to the color video clip and generating the segmented video. This can then be compared against the 'ground truth' to obtain any of the metrics of segmentation efficacy.
• Alternatively, the custom software available at https://github.com/scloke/SegTest can be adapted to run the efficacy test automatically, capture new clips, and generate corresponding 'ground truth' clips.
• The custom software also contains C++ headers which can be used to directly interface OpenCV image structures with the algorithms in the 'BGSLibrary'.

Data Description
Background segmentation is the process whereby an object of interest in the foreground of an image, often a person, is separated from the background. Typical applications are news and weather casting, game livestreaming, video conferencing and chat software, photo augmentation (Snapchat filters and beauty apps), and technical help desks [1] .
Accurate segmentation is quite difficult to achieve, and the method of choice depends on the application and the factors governing the process. In this article, the term 'factor' is defined as a characteristic of a particular use case that will influence the efficacy of foreground-background segmentation of an algorithm.
The data consists of a comparison of background segmentation algorithms contained in the C++ wrapper 'BGSLibrary' v3.0.0 developed by Andrews Sobral [2,3]. The comparison is based on segmentation efficacy and speed of the algorithms when applied to a series of indoor color-depth video clips of a single person's head and upper body, each highlighting a common factor as determined from the literature review [4]. The algorithms are run on the color image data, while the 'ground truth' is semi-automatically extracted from the depth data. The purpose of this article is to provide a reference resource for those who are developing applications that require background segmentation of the head and upper body.

Fig. 1. Manual GrabCut correction of raw 'ground truth' images. Note: The raw 'ground truth' image on the left shows a filling defect on the right hairline and an incorrectly segmented part of the desk adjacent to the left shoulder. The colored polygons manually mark out areas for the GrabCut algorithm to correct, being region of interest (orange), true foreground (white), and true background (black).

Fig. 1 demonstrates the process whereby the automatic extraction of 'ground truth' images is corrected after inspection of the video frames at 4 fps, with changes made using a manual GrabCut process. In previous segmentation datasets, the 'ground truth' was extracted manually with teams of data processors hand-drawing boundaries between domains [5]. This was not feasible for the present dataset given the thousands of individual frames in the video clips, so a semi-automated process was used to refine the raw 'ground truth' images obtained from the depth camera. Fig. 2 shows a representative frame at the mid-point of the analysis period from videos after segmentation with some of the BGSLibrary algorithms under standard conditions, with median F1 scores shown.
This is so that visual inspection can pick up large areas of over- and under-segmentation, and whether these areas are well demarcated from the main head and body segment. From these we can see that only the GMM Zivkovic and KDE algorithms have well-defined outlines with only small areas that have been misclassified, thus establishing the cutoff for good segmentation at an F1 score of 0.95 and higher. Similarly, the Eigenbackground, Adaptive-Selective Background Learning, and Adaptive SOM routines gave recognizable outlines for the head and upper body, with misclassified areas that are clearly separated from the main region. Hence, this establishes the cutoff for adequate segmentation at an F1 score of 0.80-0.95.

Table 1 shows a classification scheme for segmentation methods based on the approach used, which was derived from a new systematic literature review [4]. Table 2 lists the factors which affect segmentation efficacy that were obtained from the same review. The folders and files in the data repository were named using the labels in the second table. While there is a rich literature source for segmentation methods, and some of these articles provide good in-class comparisons with related routines, there is a need for an updated and comprehensive cross-class review of this topic, the last of which was published more than five years ago [3,6-9].

Table 3 and Table 4 give the summary results for efficacy and processing time, respectively, for each of the algorithms under the conditions listed in Table 2. The efficacy results are calculated frame-by-frame, and the results listed consist of the median value with a 10th-90th percentile range. Processing time is given as the average for all frames during the assessment period. Table 5 gives a qualitative assessment of each routine's performance according to a set of criteria. This table allows routines to be chosen based on the requirements for a particular use case. It is expected that most readers would utilize this table when evaluating the merits of each routine, with the previous two tables included for reference if further details are needed.

Video clips
The video clips in the Mendeley Data repository are organized into folders with names according to the labels in Table 2 . The individual clips are named according to the segmentation methods listed in Table 1 . Tables 3 and 4 give the segmentation efficacy and processing time for each method, with one of the factors applied. Table 5 gives an assessment of the segmentation algorithms according to a set of criteria.
The 'ground truth' video clips were saved to AVI files encoded with the lossless FFV1 codec so that the individual frames can be extracted for analysis. These files can be viewed using VLC media player (cross-platform), Media Player Classic (Windows 10), or any player which can use the FFV1 video codec.

Common factors affecting segmentation efficacy
There are several common factors which affect indoor segmentation efficacy: 1) image noise, 2) camera jitter and movement, 3) automatic camera settings, 4) illumination and shadows, 5) background initialization, 6) color camouflage, and 7) scene changes such as ghost images, sleeping foregrounds, and dynamic backgrounds. These are listed in Table 2 and described further under 'Simulated factors'.

Note (Table 1): The list of segmentation methods is taken from the C++ wrapper 'BGSLibrary' v3.0.0 by Andrews Sobral [2,3]. Some methods fall into more than one class or utilize a hybrid approach. In these cases, the class listed best describes the novelty of the approach. This library contains only temporal segmentation methods since it is designed to process video sequences.

Algorithm comparison
This experiment was carried out to assess the suitability of the algorithms in the C++ wrapper 'BGSLibrary' v3.0.0 developed by Andrews Sobral for head segmentation, given that it is a mature library that has been continually updated over the past eight years and now includes over 40 routines [2,3].

Table 2 (excerpt):
• Cream-colored screen used for simple camouflage
• No screen used, and background shows a house interior in daylight
Ghost Images 1 GHO:
• Patterned cloth is layered over part of the background at the start, and dropped at the 10 s mark
• No subject is present, and the depth data is ignored
• Comparison period begins one second after the patterned cloth completely leaves the image
Sleeping Foreground 1 SLP:
• Patterned cloth is lowered to cover part of the background at the 10 s mark
• No subject is present, and the depth data is ignored
• Comparison period begins 10 s after the patterned cloth is lowered
Dynamic Background 1 DYN:
• A stand fan is placed in front of the background and turned on
• No subject is present, and the depth data is ignored
• Comparison period begins 10 s after the clip starts
* Where the depth data is ignored, the comparison period starts one second after the transition for the factor or 10 s after the clip starts where no transition occurs.
Note: These factors were chosen based on the literature review.
The camera chosen for this features paired color-depth image sensors, with the color sensor having specifications typical of mobile devices and webcams, which cover most of the use cases for these algorithms. The factors chosen for testing are derived from the literature review as being able to influence the efficacy of background segmentation. The assessment criteria for the results were set based on the requirements of common use cases such as gamecasting and mobile communications to allow the readers to make their own judgements on the merits of each algorithm for their own purposes.

Table 3. Efficacy of segmentation algorithms from the 'BGSLibrary' under specific factors as F1 scores or false positive rates. Note: The segmentation algorithms are listed in Table 1 while the labels for the factors are listed in Table 2. Two routines were not formally tested (GMM KaewTraKulPong and GMG) as they required an earlier version of OpenCV. The efficacy measures are calculated on a per-frame basis, and the three numbers listed in each cell are the 10th centile, median, and 90th centile.

Table 4. Note: The segmentation algorithms are listed in Table 1 while the labels for the factors are listed in Table 2. Two routines were not formally tested (GMM KaewTraKulPong and GMG) as they required an earlier version of OpenCV. The processing time is calculated as an average of all frames during the comparison period for that clip.

Methodology
A series of video clips was captured with an Intel RealSense Depth Camera D435 featuring a global shutter, large color pixels of 3 µm square, and a depth sensor using disparity mapping from stereo infra-red cameras. The following camera settings were used: structured light projector on, autofocus enabled, autoexposure disabled, automatic white balancing disabled, backlight compensation disabled, and powerline frequency compensation disabled.
Capture resolution was 640 × 480 pixels at 30 frames per second (fps) for color data and 90 fps for depth data, with the depth data processed using temporal and spatial smoothing with hole-filling to reduce artefacts. Synthetic paired color and depth frames were motion interpolated from the source frames to generate video clips without any inter-frames. The clips were then saved to Audio Video Interleave (AVI) files using FFMPEG, with dimensions of 640 × 480 pixels at 30 fps. Color clips were encoded using the lossy MPEG-4 Part 2 codec at a bit rate of 4 megabit per second (Mbps) except for the noise clips which were encoded at 12 Mbps to preserve the noise artefacts.
The clips were captured at night under controlled bidirectional diagonal and side lighting with Philips Hue White & Color Ambiance bulbs set to the 'Energize' preset with a color temperature of 6410 K, calculated from the Mired Color Temperature supplied by the Philips Hue Software Development Kit. The camera was placed 120 cm in front of either a plain green screen (standard), a cream-colored screen (camouflage), or with the screen removed (complex), having the subject standing 60 cm in front of the screen, with no intervening objects. This resulted in a foreground area that was consistently about half the total background size, which is sufficiently balanced to not distort the measures of segmentation efficacy, and yet not too big which would prevent the face detection routine from working properly. The factors listed in Table 2 were then applied.
All clips were 40 s long with the first 10 s showing just the background. The subject entered the scene at the 10 s mark and stood in the center of the frame while keeping a neutral expression, with the face and upper body fully visible. The comparison period was set to all frames between the 20 and 40 s mark inclusive. The brightness for all clips was normalized by applying the appropriate constant gamma correction to keep the average pixel brightness throughout the clip at 50% of the maximum brightness.
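The brightness normalization described above can be viewed as a one-dimensional root-finding problem: choose a single gamma so that the clip's mean brightness lands at 50%. The following is an illustrative sketch (not the authors' code), assuming pixel values normalized to [0, 1]; `solve_gamma` and its bisection bounds are hypothetical choices for this example.

```python
# Sketch: solve for a constant gamma that maps the clip's average pixel
# brightness to 50% of maximum. For p in [0, 1], mean(p ** gamma) decreases
# monotonically as gamma increases, so simple bisection suffices.

def mean_brightness(pixels, gamma):
    return sum(p ** gamma for p in pixels) / len(pixels)

def solve_gamma(pixels, target=0.5, lo=0.1, hi=10.0, iters=60):
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_brightness(pixels, mid) > target:
            lo = mid          # result too bright -> larger gamma darkens
        else:
            hi = mid          # result too dark -> smaller gamma brightens
    return (lo + hi) / 2.0

# Example: a dim clip (mean 0.25) pulled up to mean brightness 0.5.
pixels = [0.1, 0.2, 0.3, 0.4]
g = solve_gamma(pixels)
corrected = mean_brightness(pixels, g)
```

Because the source clip here is darker than the target, the solved gamma comes out below 1, which brightens the frames.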
For lighting change factors, the lighting conditions were altered mid-way through the comparison period. For the 'Ghost Images' clip, the comparison period was set to one second after the transition. This is because most algorithms for detection and removal of moving objects, ghosts, and shadows operate in less than a second. The comparison period for the 'Sleeping Foreground' clip was set to 10 s after the transition to allow enough time for the object to be incorporated into the background model.
In clips where a subject was present, the face location was determined using libfacedetection by Shiqi Yu, and a seed point and depth obtained from the center of the bounding rectangle [13] . The 'ground truth' foreground was then extracted from a floating range flood fill starting at the seed point, with a maximum difference of 2 cm between adjacent pixels. Since the depth data from disparity mapping is coarse and lacks edge accuracy, an automated GrabCut algorithm was used to refine the edges of the foreground. The 'ground truth' clips were verified by visually inspecting at 4 fps, and adjustments were made with manual GrabCut assistance where the areas of inaccuracy exceeded 5% of the foreground ( Fig. 1 ). For clips without a subject, the depth data was ignored and the 'ground truth' foreground was set to zero.
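The floating-range flood fill described above can be sketched as a breadth-first search over the depth map, admitting a neighbor whenever its depth differs from the current pixel by at most 2 cm. This is an illustrative pure-Python sketch, not the authors' implementation (which also applies GrabCut edge refinement, omitted here); the toy depth values are invented for the example.

```python
from collections import deque

# Grow a foreground mask from a seed inside the face bounding box, using a
# floating range: each step may differ from its neighbor by at most 2 cm.
def flood_fill_depth(depth, seed, max_diff=2.0):
    rows, cols = len(depth), len(depth[0])
    mask = [[False] * cols for _ in range(rows)]
    queue = deque([seed])
    mask[seed[0]][seed[1]] = True
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc]
                    and abs(depth[nr][nc] - depth[r][c]) <= max_diff):
                mask[nr][nc] = True
                queue.append((nr, nc))
    return mask

# Toy depth map (cm): subject at ~60 cm, background screen at ~120 cm.
depth = [
    [120, 120, 120, 120],
    [120,  60,  61, 120],
    [120,  62,  61, 120],
]
mask = flood_fill_depth(depth, seed=(1, 1))
```

The 60 cm subject-to-screen separation used in the clips makes this split robust: the 2 cm tolerance easily bridges depth variation across the face while never reaching the background.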
The segmentation foreground was obtained by processing the color data using the appropriate 'BGSLibrary' algorithm with default settings, except for the 'GMG' and 'GMM KaewTraKulPong' routines, which require an earlier version of OpenCV (see Table 1). To speed up processing, the segmentation routines were run in four isolated parallel threads in 'release' mode. The 'ground truth' and segmentation clips were saved to AVI files encoded with the lossless FFV1 codec.
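The four-worker parallel scheme could be sketched as below. This is a hedged illustration only: `run_algorithm` is a hypothetical placeholder for invoking one BGSLibrary routine over a clip, and the original software used isolated C++ threads rather than Python.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in: process the clip with the named routine and
# return (routine name, per-frame results). The body is a placeholder.
def run_algorithm(name):
    return name, []

# Dispatch the routines across four parallel workers, mirroring the
# four-thread setup described in the text.
algorithms = ["GMM Zivkovic", "KDE", "Eigenbackground", "Adaptive SOM"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(run_algorithm, algorithms))
```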
Testing was performed on a system using Microsoft Windows 10 with a four-core Intel Xeon E3 processor running at 3.5 GHz using Visual Basic and Visual C ++ 2019 in a 64-bit address space with 32 GB RAM allocated and an NVIDIA GeForce GTX 1060 6GB graphics card installed. Image processing was done using the libraries in EmguCV 3.2.0 and Accord.Net 3.8.0. The system speed was rated at 475 million floating points per second (MFLOPS) using the Intel processor diagnostic tool 2.10, 64-bit version.
Processing time was calculated as the average time taken to process each frame over the comparison period.
Accuracy was defined as follows:

Accuracy_t = (TP_t + TN_t) / (TP_t + TN_t + FP_t + FN_t)    (1)

In Eq. (1), TP (true positives) represent correctly classified foreground pixels, TN (true negatives) represent correctly classified background pixels, FP (false positives) represent incorrectly classified foreground pixels, FN (false negatives) represent incorrectly classified background pixels, N represents the total number of frames, and t represents the frame time. The 'ground truth' is used as the reference for both TP and TN pixels, while the comparison was the segmentation result from the 'BGSLibrary' algorithm. A lower result for accuracy indicates greater deviation from the 'ground truth'.
Precision and recall were defined as follows:

Precision_t = TP_t / (TP_t + FP_t)    (2)

In Eq. (2), precision refers to the proportion of the segmented foreground that has been correctly segmented.

Recall_t = TP_t / (TP_t + FN_t)    (3)

In Eq. (3), recall refers to the proportion of the 'ground truth' foreground that has been correctly segmented.
The F1 score (also known as the balanced F-score or Dice coefficient) was then calculated as the harmonic mean of precision and recall:

F1_t = (2 × Precision_t × Recall_t) / (Precision_t + Recall_t)    (4)

The F1 score ranges between 0 and 1, and a lower result indicates greater deviation from the 'ground truth'.
The False Positive Rate (FPR) was defined as:

FPR_t = FP_t / (FP_t + TN_t)    (5)

In Eq. (5), the FPR refers to the proportion of the negative 'ground truth' that is wrongly segmented as foreground.
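Taken together, the per-frame measures of Eqs. (1)-(5) can be computed directly from a pair of binary masks. The sketch below mirrors the definitions in plain Python (1 = foreground pixel); it illustrates the formulas, not the authors' implementation.

```python
# Per-frame efficacy measures from Eqs. (1)-(5), over flattened binary
# masks where 1 marks a foreground pixel and 0 a background pixel.
def frame_metrics(truth, pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(truth, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))
    accuracy = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (1)
    precision = tp / (tp + fp) if tp + fp else 0.0             # Eq. (2)
    recall = tp / (tp + fn) if tp + fn else 0.0                # Eq. (3)
    f1 = (2 * precision * recall / (precision + recall)        # Eq. (4)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0                   # Eq. (5)
    return accuracy, precision, recall, f1, fpr

# Tiny worked example: one false negative and one false positive.
truth = [1, 1, 1, 0, 0, 0]
pred  = [1, 1, 0, 1, 0, 0]
acc, prec, rec, f1, fpr = frame_metrics(truth, pred)
```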
In general, accuracy is used when the proportion of true results is the key issue, while the F1 score is employed when negative results are more important since it magnifies the effect of incorrect classification. While accuracy is an easy measure to understand, it has the drawback of giving a distorted representation of efficacy when there is a big imbalance in the number of negatives and positives. For face and head segmentation, missing portions of the face or addition of background features to the face are both undesirable, so the F1 score should be the preferred measure of segmentation efficacy.
For the three factors where no subject was present and the 'ground truth' consists of complete background, the preferred measure would be the FPR since all errors consist of false positives. For the 'Ghost Images' and 'Dynamic Background' factors, a low FPR indicates the algorithm has the ability to rapidly remove ghost images and cope with dynamic background objects respectively. However, for the 'Sleeping Foreground' factor, a low FPR indicates that the stationary foreground object has been absorbed into the background model and no longer shows up.
Segmentation efficacy should be calculated on a per-image basis to distinguish algorithms that vary in performance depending on image content, and ought to include a measure of spread in addition to centrality. This is especially important when evaluating algorithms that have a temporal component, as the segmentation result can change even though the images in the clip may be broadly similar.
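The per-frame summary used throughout the tables (10th centile, median, 90th centile) can be sketched as follows; this is an illustrative implementation using linear interpolation between order statistics, and the sample F1 values are invented for the example.

```python
# Summarize per-frame scores by centrality (median) and spread
# (10th and 90th centiles), as reported in Tables 3 and 4.
def percentile(values, q):
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

f1_per_frame = [0.97, 0.96, 0.95, 0.99, 0.90, 0.98,
                0.96, 0.94, 0.97, 0.96, 0.95]
summary = (percentile(f1_per_frame, 10),
           percentile(f1_per_frame, 50),
           percentile(f1_per_frame, 90))
```

A narrow 10th-90th centile range signals a temporally stable algorithm; a wide range flags one whose output drifts even on broadly similar frames.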

Simulated factors
Gaussian and Uniform noise was added to the standard clip according to the method used by López-Rubio, with four levels of each type [11] . Jitter was simulated using a soft rubber mallet to strike the supporting tripod at regular intervals to induce vibrations. This was done to mimic both instability in the camera support as well as residual alignment differences that remain after motion compensation during pre-processing.
The two types of sudden illumination changes which are common in indoor settings come from switching overhead lights or drawing curtains (global change), and from opening and closing room doors which lead to an external light source (directional or local change). This was simulated by dimming and brightening all lights at once (global change), or by doing so with the lights from one direction only (local change). In contrast, the effect of shadows from uneven illumination was simulated by keeping the lights from one direction turned off throughout the clip duration. There was no temporal variance in the illumination, compared to the local change where the lighting was altered partway through the clip.
The requirement for background initialization was simulated by trimming the standard clip to show the subject in a stable stance (no initialization), or with two seconds of clear background (short initialization) at the start. In both cases, the comparison period was shifted to 10 s after the clip start to align with standard clip processing.
Simple color camouflage was simulated by using a cream-colored background screen, which was similar in color to the subject's skin. The subject also changed clothing to match the background color. This color was not exactly matched as it would be an unfair test since even humans would have difficulty differentiating the subject's outline under such conditions. For complex color camouflage, this was simulated by removing the background screen to show a typical indoor scene under daylight. The lighting for the subject was still controlled using the same bidirectional studio lights, and portions of the background were of similar color to the subject's clothing, skin, and hair.
For the final three factors, no subject was present and the 'ground truth' was set to complete background. Simulation of 'Ghost Images' was achieved by draping a complex patterned cloth over part of the background screen and dropping it at the 10 s mark. The comparison period was adjusted to start one second after the cloth completely left the image frame to test whether the algorithm could rapidly remove 'Ghosts'.
Simulation of 'Sleeping Foreground' was done by lowering the same patterned cloth over part of the background screen at the 10 s mark. The comparison period was set to start 10 s after the cloth was completely lowered and had achieved a stable position. This was to assess the algorithm's tendency to absorb stationary foreground objects into the background. A 'Dynamic Background' was simulated using a running stand fan placed in the same position where the subject would normally stand. This was to test the algorithm's ability to cope with background objects that display regular repetitive motion.

Comparison results
The results for comparison testing of efficacy and processing speed for the algorithms in the 'BGSLibrary' are summarized in Table 3 and Table 4, respectively. The measure of segmentation efficacy used was the F1 score, except for the three factors without a foreground subject, where the FPR was used instead. Efficacy was calculated on a per-frame basis, and the results displayed in each cell are the 10th centile, median, and 90th centile, thus showing both centrality and spread. Processing speed was calculated as the average over all frames during the comparison period.

Assessment criteria
The results were assessed according to the following criteria: 1) high efficacy to properly segment the head and upper body, 2) consistency, 3) processing speed short enough for real-time segmentation, and 4) tolerance to the factors tested ( Table 5 ). Good segmentation is defined as a well-demarcated outline for the foreground object with at most small areas that are incorrectly classified but can be separated from the true foreground. Adequate segmentation is defined as a recognizable outline for the foreground object, with larger areas of incorrect classification that can still be separated from the foreground. From Fig. 2 , it can be estimated that good segmentation corresponds to a F1 score of 0.95 and above, while adequate segmentation corresponds to a F1 score from 0.80 to 0.95.
To satisfy the criterion for high efficacy, the median F1 score should be 0.95 or above under standard processing conditions, while the segmentation should be consistent enough that the 10th centile F1 score does not fall below 0.80. For full real-time processing at a frame rate of 30 fps, each frame should require at most 33 ms. Under a more relaxed constraint of 10 fps, where the intervening frames can be motion interpolated, the segmentation time for each frame should be at most 100 ms. An algorithm is considered tolerant of the factor tested if it can still segment adequately, with a median F1 score above 0.80.
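The cutoffs above translate directly into a small rating function. The sketch below is a hedged restatement of the stated criteria; the function and field names are illustrative, not taken from the authors' software.

```python
# Rate an algorithm against the assessment criteria described in the text:
# efficacy (median F1 >= 0.95 under standard conditions), consistency
# (10th centile F1 >= 0.80), processing speed (33 ms for 30 fps, 100 ms
# for 10 fps with motion interpolation), and factor tolerance
# (median F1 > 0.80 under the tested factor).
def assess(median_f1, p10_f1, ms_per_frame, factor_median_f1):
    return {
        "high_efficacy": median_f1 >= 0.95,
        "consistent": p10_f1 >= 0.80,
        "real_time_30fps": ms_per_frame <= 33.0,
        "real_time_10fps": ms_per_frame <= 100.0,
        "tolerant": factor_median_f1 > 0.80,
    }

# Example with invented numbers: accurate and consistent, fast enough
# only for the relaxed 10 fps target.
rating = assess(median_f1=0.96, p10_f1=0.85, ms_per_frame=40.0,
                factor_median_f1=0.82)
```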
To establish a cutoff for the 'Ghost Images' and 'Sleeping Foreground' factors, we refer to Table 3, where the maximum FPR is approximately 0.45, corresponding to full coverage of the patterned cloth over the background. We can then assume that an FPR of less than 5% of the maximum value at the 90th centile (FPR = 0.020) indicates that the tested algorithm can rapidly deal with 'Ghosts', which are no longer visible during the whole comparison period. Similarly, an FPR of less than 5% of the maximum at the 10th centile indicates that the static foreground object has started to integrate into the background during the comparison period.
For the 'Dynamic Background' factor, the highest median FPR is for the 'Simple Gaussian' and 'Fuzzy Gaussian' algorithms. On viewing both, the 'Fuzzy Gaussian' clip has less noise and the maximum FPR representing the area of the moving fan blades can be derived from integrating the segmented foreground over the comparison period of the clip and removing pixel noise with a small kernel median filter. This value was found to be 0.0563, and we can similarly assume that a median FPR less than 5% of this value (FPR = 0.003) indicates that the algorithm can adequately deal with dynamic backgrounds by removing the spinning fan blades.
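Both cutoffs are 5% of an observed maximum FPR; the arithmetic can be checked directly (values as reported in the text, which rounds the first cutoff to 0.020).

```python
# 5%-of-maximum FPR cutoffs derived in the text.
ghost_sleeping_cutoff = 0.05 * 0.45    # max FPR ~0.45 -> 0.0225, reported as 0.020
dynamic_cutoff = 0.05 * 0.0563         # fan-blade FPR 0.0563 -> ~0.0028, reported as 0.003
```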

Limitations
There are several limitations to the methodology of the comparison testing. The first is that the 'ground truth' could not be consistently derived from the depth data since the disparity maps were coarse and unstable, even after extensive pre-processing with temporal and spatial smoothing with hole-filling. While the use of automated GrabCut improved this significantly, there was still a need to manually inspect each frame for quality. Areas which were prone to incorrect segmentation had to be marked out and corrected manually ( Fig. 1 ). This imposed a restriction on the amount of movement the subject could make, since excessive motion would require the markings to be updated every few frames instead of being allowed to propagate for much longer.
The next limitation is that the camera was static and true motion was not tested. This is important since many use cases require a mobile camera. However, this was unavoidable since the quality of the disparity maps would deteriorate even further with movement.
Another limitation is that while the 'memory effect' was demonstrated for several of the routines, it is possible that extending the test period may have detected it in more of them. Examination of the full clips for the routines which satisfied the stability criterion did indicate that towards the end of the clips there was deterioration in segmentation efficacy for some of them.
One more limitation is that there is only a single subject in all the video clips. While this does make the testing conditions consistent and improves the comparability of the results between factors, having more subjects with different combinations of hair and skin color will make the results more generalizable.
The final limitation is that only temporal segmentation routines were tested formally. It is possible that the non-temporal routines may eventually prove to be more suited to head and upper body segmentation. There is however no equivalent to the 'BGSLibrary' for non-temporal algorithms, and this is a gap in the current body of research.

Future directions
The first step would be to use a better depth camera to retake the test video clips, and the upcoming Intel RealSense L515 which uses solid state light detection and ranging (LiDAR) technology seems to be a big improvement on the D435 model used in this study. It has a depth error standard deviation of only 2.5 mm at 1 m distance from the target, gives cleaner contour outlines, has a higher resolution, and scans fast enough to cope with motion. This would remove the need for manual intervention when determining the 'ground truth' and would allow testing of both subject and camera motion.
Another step would be to gather and test non-temporal segmentation algorithms in a new library using the same methodology. Although it will require a lot of effort, this is necessary if we wish to identify suitable routines for head and upper body segmentation, since the routines from the 'BGSLibrary' are poorly suited for this.
The third step would be to expand the series of clips to cover background segmentation of the face only, since some use cases do not require the whole head and upper body. Examples of this would be face expression analysis and computation of facial action units.

Ethics Statement
The only human subject was the first author, from whom informed consent was sought and obtained. The study has ethics approval from the University of Auckland Human Participants Ethics Committee (Ref. 021497).

Funding
This study was paid for using PRESS account funding from the University of Auckland (ID 663710048 ).