Segmentation of body parts of cows in RGB-depth images based on template matching



Introduction
An important indicator for the welfare of cows is the cleanliness of the cow, as it is one of the critical factors influencing bacterial contamination of milk (Zucali et al., 2011), somatic cell count (SCC) and subclinical intra-mammary infection rate (Schreiner and Ruegg, 2003). Cleanliness also has economic consequences, as it influences milk quality, milk quantity and reproduction (Ellis et al., 2007). Cow cleanliness scoring is typically done using a 4- or 5-point scoring method ranging from clean to very dirty (Hultgren and Bergsten, 2001; Reneau et al., 2005; Lombard et al., 2010; Eckelkamp et al., 2016). For the assessment, the cleanliness of body parts (e.g. leg, thigh, belly, flank, rear and udder) is normally assessed separately.
In current practice, cow cleanliness is assessed manually by experts. As this is a time-consuming and expensive procedure, only a limited number of assessments are performed. Human scoring, moreover, is subjective and inconsistent. To improve both the quantity and the quality of cleanliness scoring, there is a strong need for automated scoring systems using camera systems. The development of automatic cleanliness evaluation requires two steps: (1) segmentation of the different body parts in the images and (2) detection of dirt on these segments and calculation of the body hygiene score. In this paper, we focus on the first step, the segmentation of body parts of cows in color-depth (RGB-D) images, as this is a challenging and currently unsolved problem.
Computer-vision methods to segment different cow body parts for cleanliness scoring are scarce in the literature. Here, we discuss related work on the use of computer vision to automate the monitoring of cattle, as well as some relevant studies on body-part detection in humans and other animals. Pezzuolo et al. (2018) evaluated the use of multiple low-cost structured-light depth cameras for several 3D body measurements of dairy cows and reported uncertainties ranging between 3 and 15 mm. For the autonomous monitoring of body weight, Song et al. (2018) developed a method to measure several morphological traits from top-view depth images of the cow. Using these traits in combination with additional information on milk production and parity, the body weight was estimated. Other studies estimate the body condition score (BCS), a commonly used indicator of the nutritional status of beef and dairy cows. In a review, Alvarez et al. (2017) concluded that automatic systems based on image processing can reach human-level performance. Bercovich et al. (2013), for instance, developed an automatic BCS method using a body-shape signature and Fourier descriptors.
Image processing also allows for behavior analysis. Leonard et al. (2019) used an overhead Kinect to monitor and classify sow postures in farrowing stalls. Cangar et al. (2008) developed an automatic image-analysis system to identify the locomotion and posture behavior of pregnant cows prior to calving. Their method records position and body-size measurements over time in order to classify behavior as, e.g., standing or lying, and eating or drinking. Porto et al. (2013) developed a multi-camera system to detect cow lying behavior by training the Viola-Jones object-detection algorithm (Viola and Jones, 2001). Ahn et al. (2017) proposed a method for the detection of cow mounting behavior to determine the optimal time of insemination, using a support vector machine (SVM) classifier with motion-history-image features. To analyze cow locomotion, Song et al. (2018) developed a vision-based system to detect lameness by tracking the locations of the hooves over time. They reported strong correlations with human lameness scores using the timing of the placements of the hooves. Pluk et al. (2012) studied the automatic measurement of touch and release angles of the fetlock joint for lameness detection in dairy cattle using a combination of computer vision and a pressure mat, which was later further improved. Viazzi et al. (2013) developed the body-movement-pattern score, which uses the back posture to classify lameness.
Most of the existing work on body-part detection focuses on human posture recognition from 2D or 3D images. Felzenszwalb and Huttenlocher (2005) presented a computationally efficient framework for part-based modeling and recognition of objects using pictorial structure models, and demonstrated the technique on the detection of facial features and human body parts. Bourdev and Malik (2009) studied the detection, segmentation and pose estimation of people in images based on the detection of so-called poselets, using histogram-of-oriented-gradients (HOG) image features in combination with a support vector machine (SVM) classifier. Simo-Serra et al. (2012) estimated 3D body pose unambiguously by imposing kinematic constraints. Shotton et al. (2011, 2013) used a random forests (RF) classifier with the global 3D centers of probability mass for each body part, using single depth images to locate the positions of 3D skeletal joints. More recently, convolutional neural networks have become popular for the detection of body keypoints for human pose estimation. Li et al. (2019) applied dominant orientation templates and brightness ratio templates to detect group-housed pigs.
Computer-vision methods for the detection of body parts of animals, on the other hand, are scarce in the literature. Pistocchi et al. (2014) applied a structural support vector machine (SSVM) to identify the body parts of dogs using both 2D and 3D features, independently of size and breed. Zhao et al. (2017) applied an RF classifier with local-binary-pattern (LBP)-based depth features to segment eight body regions, i.e., head, neck, body, forelegs, hind legs and tail, to analyze cow behavior. This system concentrates on side views and cannot detect body parts from other viewpoints, which is necessary for a proper evaluation of cow cleanliness. To assess cow behavior, Jiang et al. (2019) applied the FLYOLOv3 deep-learning model to detect key parts (e.g. head, trunk and legs) of dairy cows, reporting an accuracy of 99.18% and an average detection rate of 94%.
A promising method to segment the different cow body parts is to describe the cow's body using a skeleton (or medial axis). Skeletonizing is an effective method to represent and analyze object shapes (Saeed et al., 2010). Shape similarity based on skeleton matching usually performs better than using the full contour or other shape descriptors in the presence of partial occlusion, noise, and articulation of parts (Borgefors et al., 2001). Based on the image skeleton, a skeleton graph can be constructed. Bai et al. (2008) used such skeleton graphs and proposed a robust object detection and segmentation method based on Skeleton Path Similarity Matching (SPSM). Liu et al. (2014) adopted this method to describe and segment images of cows into a number of body parts for the analysis of their gait.
In this paper, we propose a method for body-part segmentation based on template matching for endpoint classification using color-depth (RGB-D) images. We apply the method for the detection of nine body parts of dairy cows in an indoor environment. The novelty of our approach is twofold. Firstly, unlike existing work, which only recognizes the whole body or seven body parts (without the udder) from side views, our color-depth-based approach detects nine different body parts, i.e., head, torso, belly, udder, left front leg, right front leg, left hind leg, right hind leg and tail, from different viewpoints of the cow (side and back). Secondly, we applied a template-based method for endpoint classification, Skeleton Path Similarity Matching (SPSM) (Bai et al., 2008), which provides a robust description of the posture of cows, despite the variation in their appearance and relative pose towards the camera.

Data acquisition
All data were collected on 3 consecutive days, July 1 to July 3, 2017, at the Sifang Experimental Dairy Farm in Datong, China. From the 2000 cows on the farm, 113 lactating Holstein cows were selected for body-part detection. Color and depth images were taken using the Microsoft Kinect™ V2 RGB-D camera. The Kinect V2 provides a depth frame with a resolution of 512 × 424 pixels and a color frame of 1920 × 1080 pixels. The camera was connected to a personal computer and operated using the Kinect for Windows Software Development Kit (SDK) 2.0.
A total of 5070 RGB-D images were taken, of which 2680 (52.86%) images containing a valid standing or walking cow were manually selected. The other images were rejected for different reasons: (1) images not containing a cow's side or back (33.33%, 1690 images), (2) images with incomplete depth data of the cow body due to direct sunlight from the roof gap (7.51%, 381 images), (3) images that contained severe occlusions by another cow (3.00%, 152 images), and (4) images with rapid movement resulting in motion blur (3.30%, 167 images). We performed the selection of images manually, as the focus of this paper is on the segmentation of cow body parts. In a future system, however, the image acquisition should be fully automated. We envision placing the camera at the exit of the milking parlor, providing a standardized pose of the animal, preventing occlusions by other cows, as the animals pass there one at a time, and providing a completely controlled environment without sunlight interference. With the increasing light sensitivity of future cameras, exposure time can be reduced, preventing image blur due to rapid movements.
The accepted data can be separated into four main cases: (a) complete data, side view (645 images), (b) complete data, back view (303 images), (c) some missing data due to body movement (97 images), and (d) some missing data due to sunlight interference (376 images). Examples of accepted and rejected images in the different categories are shown in Fig. 1.

Image processing
The purpose of the developed image-processing method is to recognize and segment rear- and side-view images of cows into eight different body parts, i.e., (i) head, (ii) torso, (iii) left front leg, (iv) right front leg, (v) left rear leg, (vi) right rear leg, (vii) tail, and (viii) udder. Fig. 2 gives an overview of the proposed method. The workflow consists of six steps: (1) restoration, (2) image segmentation, (3) image selection, (4) skeletonization, (5) skeleton classification and (6) body-part segmentation. These steps are described in the following subsections.

Depth-image restoration
The depth images obtained by the RGB-D camera contain random noise. For instance, the cow's head and trunk shown in the red box of Fig. 3(a) lack depth information. The information lost in the depth image (the holes in the cow depth map) therefore needs to be repaired and filled. The restoration is applied to the entire depth frame and consists of two steps. In the first step, the depth images are spatially filtered using a majority (or mode) filter:

D'_t(p) = mode({ D_t(q) | q ∈ N_{5×5}(p) }),

where D_t(p) is the original depth value of pixel p at time t, N_{5×5}(p) is the neighborhood function that gives the set of all pixels q in the 5 × 5 neighborhood of p, and mode(S) returns the mode of the set S, i.e., the value with the majority count among the elements of S. In the second step, the depth images are filtered in the temporal dimension using a weighted averaging filter over the last N frames:

D''_t(p) = Σ_{i=0}^{N−1} w_i · D'_{t−i}(p), with Σ_i w_i = 1,

where we used N = 5 in our experiments. Fig. 3 gives an illustration of the depth-image restoration. The restoration smooths the boundary of the animal with respect to the background by applying the mode filter and the temporal averaging filter. As we use rather small kernel sizes, the shape of the animal is not altered significantly.
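The two restoration steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: the temporal weights w_i are not specified in the text, so a uniform average over the last N frames is assumed, and the function names are ours.

```python
import numpy as np
from scipy import ndimage

def restore_depth(frames, t, kernel=5, n_temporal=5):
    """Sketch of the two-step depth restoration.

    frames: list of 2-D depth images; frames[t] is the current frame.
    Step 1 applies a spatial mode (majority) filter per frame;
    step 2 averages the filtered frames over time (uniform weights assumed).
    """
    def window_mode(values):
        # Return the most frequent depth value in the window.
        vals, counts = np.unique(values, return_counts=True)
        return vals[np.argmax(counts)]

    recent = frames[max(0, t - n_temporal + 1):t + 1]
    filtered = [ndimage.generic_filter(f.astype(np.float64), window_mode,
                                       size=kernel) for f in recent]
    # Step 2: temporal averaging of the mode-filtered frames.
    return np.mean(filtered, axis=0)
```

Note that `generic_filter` with a Python callback is slow on full frames; a production version would use a vectorized mode filter.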

Cow segmentation
To segment the cow from the background, we make use of a background depth image, as depicted in Fig. 4a. This image contains the same scene, but only with stationary objects, i.e., without cows or human beings. Directly subtracting the background depth image, D_b, from the current restored depth image, D''_t, results in poor cow segmentation due to noise and small displacements of the camera. We therefore developed a more robust background-subtraction method based on the histograms of the aforementioned depth images, H_b(k) and H_r(k) (see Fig. 4c and 4d). The method consists of two steps: (1) determining the depth range in which the cow is positioned, and (2) segmenting the image to obtain the cow mask.
Step 1: We first subtract the two histograms, resulting in a cow histogram:

H_c(k) = H_r(k) − H_b(k),

where k is the depth value (in mm) in the range 1000 ≤ k ≤ 4000. This range is restricted by the Kinect measurement range. Fig. 4e gives an example of the resulting cow histogram.

From H_c, we determine the minimum depth, d_min, and maximum depth, d_max, at which the cow appears in the depth image. These threshold values are determined by finding the longest consecutive sequence of depth values in H_c with positive counts (see Fig. 4e). We then determine the mid-depth value:

d = (d_min + d_max) / 2.

Step 2: Now that the depth range in which the cow is positioned is determined, we can obtain the cow mask. We first calculate the mask M_r containing all pixels in the restored depth image within a distance of −500 to +1000 mm from d:

M_r(p) = 1 if d − 500 ≤ D''_t(p) ≤ d + 1000, and 0 otherwise.

The same is done for the background depth image, giving M_b. We then obtain the cow mask by subtracting these two masks:

M_c = M_r − M_b.

As this mask might contain some small bits of noise, we filter the mask using a median filter with a 5 × 5 kernel. Examples of the steps to obtain the cow mask are given in Fig. 5.
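Steps 1 and 2 above can be sketched in NumPy as follows. This is a minimal illustration with our own function names; the final 5 × 5 median filtering of the mask is omitted.

```python
import numpy as np

def cow_mask(depth_restored, depth_background, k_min=1000, k_max=4000):
    """Histogram-based background subtraction (sketch of steps 1 and 2).

    Depth values are in millimetres; the histogram range and the
    [-500, +1000] mm offsets follow the text.
    """
    bins = np.arange(k_min, k_max + 2)
    h_r, _ = np.histogram(depth_restored, bins=bins)
    h_b, _ = np.histogram(depth_background, bins=bins)
    h_c = h_r - h_b                      # cow histogram H_c(k)

    # Longest run of strictly positive counts gives [d_min, d_max].
    best_len, best_start, run_len, run_start = 0, 0, 0, 0
    for i, v in enumerate(h_c):
        if v > 0:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_len, best_start = run_len, run_start
        else:
            run_len = 0
    d_min = k_min + best_start
    d_max = d_min + best_len - 1
    d = (d_min + d_max) / 2.0            # mid-depth value

    # Step 2: threshold both images around d, then subtract the masks.
    m_r = (depth_restored >= d - 500) & (depth_restored <= d + 1000)
    m_b = (depth_background >= d - 500) & (depth_background <= d + 1000)
    return m_r & ~m_b
```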

Image selection
The following requirements are applied to ensure that the image contains a single, well-segmented cow: (1) the width and height of the object mask should be smaller than 510 and 422 pixels, respectively, given the 512 × 424 resolution of the Kinect v2 depth data; (2) the size of the object, calculated as the number of pixels in the object mask, should be larger than 20,000; (3) the length-width ratio should be between 1.17 and 1.80 for a cow's side view, and between 0.42 and 0.97 for a cow's back view; and (4) object masks with holes larger than 500 pixels should be filtered out to ignore incorrect cow-background segmentation close to the hooves connecting two separate legs.
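The selection criteria can be sketched as a simple mask checker. This is an illustration under our own naming; criterion (4), the hole check, is omitted for brevity, and the length-width ratio is interpreted as bounding-box width over height.

```python
import numpy as np

def accept_mask(mask, view):
    """Check selection criteria (1)-(3) for a binary cow mask.

    view: 'side' or 'back'. The hole criterion (4) is not implemented here.
    """
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    if not rows.any():
        return False
    height = rows.nonzero()[0].ptp() + 1
    width = cols.nonzero()[0].ptp() + 1
    if width >= 510 or height >= 422:          # criterion (1)
        return False
    if mask.sum() <= 20000:                    # criterion (2)
        return False
    ratio = width / height                     # criterion (3)
    lo, hi = (1.17, 1.80) if view == 'side' else (0.42, 0.97)
    return lo <= ratio <= hi
```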

Skeletonization
The detection and segmentation of the body parts is based on a skeleton description of the cow mask. To extract the cow skeleton, skeletonization was performed in 3 steps: (1) feature distance transform, (2) thinning, and (3) feature detection.
Step 1: feature distance transform. The feature distance transform (FDT) is adopted from Strzodka and Telea (2004) and is based on the standard distance transform (DT). The DT of a binary image is a well-known shape representation. Given the object boundary Ω of a mask M, the DT at point p in the mask is defined as:

DT(p) = min_{q ∈ Ω} dist(p, q),

where the distance metric is usually Euclidean: dist(p, q) = ‖p − q‖₂. This assigns to every point p the distance to the closest boundary point q. For the full-scale cow mask, M'_c, the FDT labels every point with both its DT value and the closest boundary point q. Skeletons (Fig. 7a), or medial axes, are defined as the set of centers of maximal balls contained in Ω, or the locus of points at equal distance from at least two boundary points. The original skeleton is pruned by removing points with a distance value lower than a threshold T_s (Fig. 7b). Fig. 6 presents the relation between the number of object pixels P_T and the optimal threshold T_s, fitted with a third-order polynomial (R² = 0.75). Step 2: thinning. Thinning is an image-processing operation in which binary image regions are reduced to curved lines of single-pixel thickness that approximate the center skeletons of the regions without breaking the objects apart (Lam et al., 1992). Thinning the initial skeleton from step 1 results in a one-pixel-wide skeleton (Fig. 7c). The skeleton obtained this way is essentially a homotopic skeleton generated by a series of sequential morphological thinning operations using the structuring elements from the Golay alphabet (Golay, 1969) until convergence.
Step 3: skeleton-graph construction. To turn the skeletons of step 2 into skeleton graphs, we consider the endpoints and bifurcation points of the skeleton. Endpoints are detected by applying a 3 × 3 sliding window and counting the number of skeleton pixels. If the center of the window is a skeleton pixel and the total number of skeleton pixels in the window is 2, an endpoint is detected. A similar approach is used for bifurcations, with the constraint that at least 4 pixels in the window are skeleton pixels. This potentially gives multiple detections; a decluttering step interpolates between those points to approximate the true spatial position of the bifurcation. The skeleton branches are then traced from endpoints to bifurcations using an 8-connected-neighbors search. This results in a skeleton graph with the endpoints and bifurcations as vertices and the skeleton branches connecting them as edges (Fig. 7d). Since shorter skeleton branches contain fewer skeleton features for template matching, the length of every branch is calculated as a final step and branches shorter than 10 pixels are removed (Fig. 7e).
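The 3 × 3 window rule for endpoints and bifurcations can be sketched as follows. This is an illustration under our own naming; the decluttering of multiple bifurcation detections described above is omitted.

```python
import numpy as np
from scipy import ndimage

def skeleton_points(skel):
    """Detect endpoints and bifurcation candidates on a 1-pixel-wide
    binary skeleton using the 3x3 window rule.

    Returns (endpoints, bifurcations) as boolean masks.
    """
    skel = skel.astype(bool)
    # Count skeleton pixels in every 3x3 window (centre included).
    counts = ndimage.convolve(skel.astype(int), np.ones((3, 3), int),
                              mode='constant', cval=0)
    endpoints = skel & (counts == 2)      # centre + exactly one neighbour
    bifurcations = skel & (counts >= 4)   # centre + three or more neighbours
    return endpoints, bifurcations
```

As noted in the text, the bifurcation rule can fire on several adjacent pixels around a junction, which is why a decluttering step is needed afterwards.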

Skeleton classification
Once the skeleton graph is obtained, each skeleton branch needs to be classified. To achieve this, we calculate the skeleton paths between adjacent endpoints and perform similarity matching with the templates in the database to obtain the best matching result. Once the category of each skeleton branch is determined, the body can be divided into the different parts.

2.2.5.1. Template matching. Template matching is based on a skeleton-graph-matching method via path similarity proposed by Bai et al. (2008). The set of endpoints of the skeleton graph is denoted as {e_1, e_2, …, e_{N_0}}, where N_0 is the number of endpoints. The set is ordered by angle with respect to the center of gravity, G_c, in anti-clockwise direction, see Fig. 8. A skeleton path from endpoint e_m to e_n is denoted as p(e_m, e_n). G represents the skeleton of the cow under test, and G' represents a template skeleton in the dataset. The shape dissimilarity between two paths, p(e_m, e_n) and p(e'_{m'}, e'_{n'}), in the two skeleton graphs G and G' is calculated using the path distance pd(p(e_m, e_n), p(e'_{m'}, e'_{n'})). Each path is sampled with M_0 = 40 equidistant points q_1, …, q_{M_0} along the path, and the path distance is calculated as:

pd(p(e_m, e_n), p(e'_{m'}, e'_{n'})) = Σ_{i=1}^{M_0} (r_i − r'_i)² / (r_i + r'_i) + α (l − l')² + β (θ − θ')²,

where r_i is the radius of the maximal inscribed circle at sample point q_i, which can be sampled from the distance transform calculated during skeletonization, l is the Euclidean distance between the endpoints e_m and e_n, and θ the angle between them. The relative influence of the inscribed-circle radii at the path sample points, the Euclidean distance between the endpoints and the angle between the endpoints is weighted with the parameters α and β.
In our experiment, we sampled each skeleton path between two adjacent endpoints and compared it with all other skeleton paths. The reason is that there is a strong relationship between r_i, l and θ of two adjacent endpoints. This does mean, however, that more templates are needed to cover all postures of dairy cows, as introduced in Section 2.2.5.2. The path distances between corresponding adjacent-endpoint paths of the test skeleton and a template form the matched distances Dis(i, i). The final dissimilarity F between the test skeleton and a template is then calculated as:

F = Mean + Dev,

where Mean and Dev stand for the mean and standard deviation of Dis(i, i), respectively. The template that minimizes F over all skeleton templates in the database is selected.
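The path distance between two sampled skeleton paths can be sketched as follows. This is an illustrative reading of the formula above, not the authors' code: the weights alpha and beta are placeholders (the paper's values are not given in this text), and all names are ours.

```python
import numpy as np

def path_distance(r, r_prime, l, l_prime, theta, theta_prime,
                  alpha=0.1, beta=0.1):
    """Dissimilarity between two skeleton paths sampled at M0 points.

    r, r_prime: inscribed-circle radii at the sample points of each path.
    l, l_prime: Euclidean distances between the two endpoints of each path.
    theta, theta_prime: angles between the endpoints.
    alpha, beta: illustrative weights for the length and angle terms.
    """
    r = np.asarray(r, float)
    r_prime = np.asarray(r_prime, float)
    # Chi-square-like term over the radii (small epsilon avoids 0/0).
    radius_term = np.sum((r - r_prime) ** 2 / (r + r_prime + 1e-12))
    return (radius_term
            + alpha * (l - l_prime) ** 2
            + beta * (theta - theta_prime) ** 2)
```

In a full matcher, this distance would be evaluated for every pair of adjacent-endpoint paths, and the template minimizing the combined score over all matched paths would be selected.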

Template database.
To create a template set for the template matching, we deployed an imaging system to capture cow images at the same experimental dairy farm on June 19, 2017, independent from the test data (see Fig. 9). The acquired cow images are 8-bit grayscale images with a resolution of 512 × 424 pixels. The resulting database includes 221 cow images, divided by number of endpoints: 3 endpoints (11 images, Fig. 9a), 4 endpoints (50 images, Fig. 9b), 5 endpoints (79 images, Fig. 9c), 6 endpoints (57 images, Fig. 9d), 7 endpoints (19 images, Fig. 9e) and 8 endpoints (5 images, Fig. 9f). The templates were manually annotated.

Recognition of right or left leg.
Discriminant factors Q_fn were defined to distinguish the far and near leg endpoints when two endpoints of foreleg or hind-leg branches were detected (Bo et al., 2014). B_i is the skeleton branch of the pruned skeleton graph G that has endpoint e_i and contains skeleton points b_ik (k = 1, 2, …, L), with corresponding depth values I(x_{b_ik}, y_{b_ik}). With e_i as the starting point, the skeleton points along the branch form a subset, and the discriminant factor for endpoint e_i is defined as the average depth over this subset:

Q_fni = (1/L) Σ_{k=1}^{L} I(x_{b_ik}, y_{b_ik}).

When two endpoints e_i and e_j of G satisfy Q_fni < Q_fnj, e_i is classified as belonging to the skeleton branch of the proximal (near) cow limb and e_j to the branch of the distal (far) limb; otherwise, e_i corresponds to the distal and e_j to the proximal limb. Proximal and distal are then used to distinguish the left and right limbs of the cow. The cow skeleton branches can thus be divided into eight groups: head, left foreleg (LFL), right foreleg (RFL), left hind leg (LHL), right hind leg (RHL), torso, udder and tail. Fig. 10 shows automatically generated template-matching results. For the back view, an additional endpoint class "body" was added to the endpoint classification to recognize the pin bones of thin cows.
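The discriminant factor is simply the average depth along a leg branch; the branch with the smaller average depth is the nearer limb. A minimal sketch (our naming):

```python
import numpy as np

def discriminant_factor(depth_image, branch_points):
    """Q_fn for one leg branch: average depth over the branch's skeleton
    points, given as a list of (row, col) coordinates."""
    depths = [depth_image[r, c] for r, c in branch_points]
    return float(np.mean(depths))

def classify_near_far(depth_image, branch_i, branch_j):
    """Label two leg branches as ('near', 'far'): the branch with the
    smaller average depth is the proximal (near) limb."""
    q_i = discriminant_factor(depth_image, branch_i)
    q_j = discriminant_factor(depth_image, branch_j)
    return ('near', 'far') if q_i < q_j else ('far', 'near')
```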
Step 1. Edge detection. Firstly, each body region is covered using inscribed circles along its skeleton branch from endpoint to bifurcation, in the order LHL, torso, LFL, RFL, head, RHL, tail for the side view, and torso, LHL, RHL, head, LFL, RFL, tail for the rear view. The obtained mask S_1(x, y) provides a rough segmentation of the body parts in the depth image.

Fig. 10. Result of template matching for cow side (a) and cow back (b).

To obtain a more detailed segmentation, a region-growing algorithm is used (Gonzalez and Woods, 2018), constrained by the full cow segment and the other body parts, preventing overlap between body parts. To separate adjacent body regions (e.g. tail and hind legs), the Scharr operator is used to compute the gradient in the x and y directions over the entire depth image. In the images shown in Fig. 11, the different body parts are marked automatically. The partial derivatives G_x(p) and G_y(p) of the image f(p) along the horizontal and vertical directions define an edge in the 16-bit depth image (without background) as:

E(p) = 1 if sqrt(G_x(p)² + G_y(p)²) > T_G, and 0 otherwise,

where T_G stands for the depth-gradient threshold; in this experiment T_G = 5.
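The Scharr-based edge mask can be sketched as follows; a minimal SciPy illustration with our own constants and names, not the authors' implementation.

```python
import numpy as np
from scipy import ndimage

# Standard Scharr kernels for horizontal and vertical derivatives.
SCHARR_X = np.array([[-3, 0, 3],
                     [-10, 0, 10],
                     [-3, 0, 3]], float)
SCHARR_Y = SCHARR_X.T

def depth_edges(depth, t_g=5.0):
    """Edge mask on a background-free depth image: Scharr gradient
    magnitude thresholded at T_G."""
    gx = ndimage.convolve(depth.astype(float), SCHARR_X, mode='nearest')
    gy = ndimage.convolve(depth.astype(float), SCHARR_Y, mode='nearest')
    return np.hypot(gx, gy) > t_g
```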
Step 2. Automatic seeded region growing. Seeded region growing is a procedure that groups pixels or sub-regions into larger regions based on predefined criteria; it requires seeds as additional input. The basic approach is to start with a set of seed points and grow the regions by appending to each seed those neighboring pixels that have properties similar to the seed. The algorithm applied in this study is summarized as follows. Initially, each region consists of a single seed point R_j^0, determined by the endpoint of the corresponding skeleton branch; R_j^i (j = 0, …, 6) represents the i-th growing step of region R_j (j = 0, …, 6). The 4-neighborhood pixels around the seed are checked one by one to determine whether they are similar to the seed. If so, the pixel is added to R_j^i; otherwise it is disregarded. Each region is thus allowed to grow iteratively until no more similar neighboring pixels are found.

Fig. 11. Results of udder detection. a. body-part detection for cow's side; b. calculation of udder center, radius and color filter for cow's side; c. body-part segmentation result for cow's side; d. body-part detection for cow's back; e. calculation of udder center, radius and color filter for cow's back; f. body-part segmentation for cow's back.

2.2.6.2. Udder segmentation. The program initially calculates the radius and center position of the potential udder circle and then converts the RGB potential udder region to HSV space. The H channel is used to characterize the color range for udder detection. Udder detection comprises the following steps: image registration, udder center and radius calculation, and color filtering.
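The seeded region growing described in Step 2 above can be sketched as a breadth-first fill. This is a generic illustration under our own naming and an assumed depth-similarity criterion; the paper's exact similarity test and region constraints may differ.

```python
import numpy as np
from collections import deque

def region_grow(depth, seed, cow_mask, threshold=20.0):
    """4-neighbourhood seeded region growing.

    Starting from one seed pixel, append neighbours that lie inside the
    cow mask and whose depth is within `threshold` mm of the seed depth.
    """
    h, w = depth.shape
    region = np.zeros((h, w), bool)
    region[seed] = True
    seed_val = depth[seed]
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < h and 0 <= nc < w and not region[nr, nc]
                    and cow_mask[nr, nc]
                    and abs(depth[nr, nc] - seed_val) <= threshold):
                region[nr, nc] = True
                queue.append((nr, nc))
    return region
```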
Step 1: A 2D chessboard calibration method was applied to register the color image to the depth image. The GML C++ calibration toolbox v0.75 was adopted to calibrate the color camera and depth camera separately and to calculate the intrinsic parameters, rotation matrices and translation vectors for both cameras.
Step 2: The skeleton structure is adopted to locate the potential udder center for the cow's side view. The udder center P_u is located on the maximal inscribed circle (green circle in Fig. 11a) of the bifurcation P_j of the hip skeleton branch. R_T is the circle radius, indicated by the red circle (Fig. 12a), calculated from a formula obtained by curve fitting (MATLAB 2016b) on the template database, where P_T is the total number of pixels covered by the cow body. For the back view, the udder radius and center position are calculated in the rear region, shown as the grey area in Fig. 11b. The center Q_u, long axis R_L and short axis R_S describe the elliptic udder for the cow's back (Fig. 11c):

Q_u = ((Xr_max + Xr_min)/2, (Yr_max + Yr_min)/2), R_L = (Xr_max − Xr_min)/2, R_S = (Yr_max − Yr_min)/2,

where Xr_max and Xr_min represent the maximal and minimal x values of the edge points of the rear zone, respectively, and Yr_max and Yr_min the maximal and minimal y values of the edge points of the rear zone, respectively.
Step 3: The potential udder region in RGB is converted to HSV color space, as this is more closely related to human color perception. Red in the H channel is characterized by hue values within (0, 20) or (260, 360) degrees. The red circles on the black background (Fig. 13a and c) show the results of the color filtering, in which the udder region is enhanced. A 3 × 3 median filter is then used to further smooth the results obtained after dilation and erosion. The color images shown in Fig. 12c and d represent the result of full-body detection from the side and back, respectively.
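The hue-based red filter can be sketched as follows. This is a minimal NumPy illustration under our own naming: it implements the standard RGB-to-hue conversion and keeps pixels whose hue falls in the (0, 20) or (260, 360) degree ranges stated above.

```python
import numpy as np

def red_udder_filter(rgb):
    """Mask of reddish pixels in an H x W x 3 float RGB image in [0, 1].

    Hue is computed per pixel (degrees); pixels with hue in (0, 20) or
    (260, 360) are kept, matching the red ranges used for udder detection.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    maxc = rgb.max(axis=-1)
    minc = rgb.min(axis=-1)
    delta = maxc - minc
    hue = np.zeros_like(maxc)
    nz = delta > 0
    rmax = nz & (maxc == r)
    gmax = nz & (maxc == g) & ~rmax
    bmax = nz & ~rmax & ~gmax
    # Standard piecewise RGB -> hue conversion, wrapped to [0, 360).
    hue[rmax] = (60 * (g - b)[rmax] / delta[rmax]) % 360
    hue[gmax] = 60 * (b - r)[gmax] / delta[gmax] + 120
    hue[bmax] = 60 * (r - g)[bmax] / delta[bmax] + 240
    return ((hue > 0) & (hue < 20)) | ((hue > 260) & (hue < 360))
```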

Experimental setup
The performance of body-part detection was evaluated using the accuracy of the endpoint classification and of the body-part segmentation.

Skeleton classification
In this experiment, we evaluated the performance of the skeleton-classification method. We randomly selected 1421 side-view images taken from 113 cows and 859 rear-view images from 75 cows. The performance is evaluated by presenting the results in a normalized confusion matrix, showing the correct classifications and misclassifications. Whether a body part was correctly classified by the skeleton-classification method was determined by 3 experts; in the case of discordance among the experts we used majority voting.
The overall performance is measured using the classification rate (CR):

CR = N_c / N_T,

where N_c is the number of correctly classified images and N_T is the total number of images. The skeleton in an image is considered correctly classified when its skeletal branches are correctly classified.
To further analyze the performance, we classify images into three classes:
• Correct Solution: all branches appear visually correct.
• Partially Correct Solution: at least half of the branches appear visually correct.
• Mostly Wrong Solution: the branches are visually perceived as wrongly detected.

Body-part segmentation
In this experiment, we examine the accuracy of the proposed body-part segmentation method. A dataset containing 100 side-view images and 100 rear-view images taken from 50 cows was used to evaluate the performance. All images were manually labelled to obtain the ground-truth segmentation (GT). Fig. 12 shows an example of manual segmentation from both the side and rear view. For each image in this dataset, the body-part segments resulting from the method were compared to the corresponding GT labelling. The quality of segmentation was evaluated using pixel-level accuracy, sensitivity and specificity:

accuracy = (TP + TN) / (TP + TN + FP + FN), sensitivity = TP / (TP + FN), specificity = TN / (TN + FP),

where TP, TN, FP and FN denote the numbers of true-positive, true-negative, false-positive and false-negative pixels with respect to the GT body-part mask.
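The pixel-level metrics can be computed from a predicted mask and its GT mask as follows (standard definitions; function name is ours):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Pixel-level accuracy, sensitivity and specificity between a
    predicted body-part mask and the ground-truth mask."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, sensitivity, specificity
```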

Skeleton classification
Tables 1 & 2 show the results of the skeleton-classification experiment for the side and back views, respectively, in the form of normalized confusion matrices. The rows represent the true body parts and the columns the output of the method. The skeleton-classification method outputs eight classes: head, torso, left front leg (LFL), right front leg (RFL), left hind leg (LHL), right hind leg (RHL), hip and tail. "Others" denotes spurious skeleton branches that do not correspond to any body part. Unidentified skeleton branches were classified as "Missing". The classification rate for all parts is higher than 88.94%.
The diagonal elements of the confusion matrices (Tables 1 & 2) show the proportions of correctly classified instances; the bold values indicate the correct-classification rates. In the first row, the first element shows the proportion of skeleton branches belonging to the 'head' class and classified as 'head'. The second element shows the proportion of skeleton branches belonging to the 'head' class but misclassified as 'torso'. The third element shows the proportion misclassified as LFL (left front leg), and so on.
The tables show that, in general, the skeleton classification is successful. For most body parts, the classification rate is around 94-97%. Classification of the tail has a somewhat lower classification rate of 89% for the back views and 91% for the side views. The reason is that the tail is a thin structure that is regularly not visible in the cow silhouettes. Moreover, the tail often swings, which can result in the tail being filtered out by the temporal averaging filter in the image-restoration step.
For side-view cows, only one or two skeletal branches were wrongly classified per image. For instance, the "RHL" was classified as tail (Fig. 13a) and the "tail" was classified as others (Fig. 13b). In the rear-view images, the wrong solutions can be divided into partially correct solutions with only one wrong endpoint (43.37%) and solutions with less than or equal to half of the skeletal branches wrong (56.63%). For instance, the "head" was classified as part of the body (Fig. 13c) and the "body" was classified as tail (Fig. 13d). The wrong classification of LHL and tail shown in Fig. 13a caused incorrect segmentation of the LHL, udder and tail. The wrong classification of the tail shown in Fig. 13b caused wrong segmentation of the LHL.
Although Table 2 also shows high correct-classification rates for the back views, classifying endpoints is somewhat more complicated when viewed from the back. Fig. 13d shows a wrong skeleton classification for the back view: the branches of head and tail were wrongly segmented, causing wrong segmentations of head and tail, and the RFL was missed. Observing the results, we noted that "tail" and "head" are most frequently wrongly classified (e.g. as body or forelegs) or missed by the method. Most of the errors involved the swing of the tail, mainly when depth images were imprecise or noisy due to sensor inaccuracies caused by direct sunlight from the roof gap.

Body-part segmentation
The average accuracy, sensitivity and specificity are shown in Tables 3 & 4. For the detection of the body parts of side-view and rear-view dairy cows, the average accuracies were 95.98% and 90.67%, the sensitivities 70.58% and 44.93%, and the specificities 97.03% and 95.07%, respectively.
In terms of accuracy, the torso accuracy was the lowest for both the side and the rear view. We compared our system with the method of Zhao et al. (2017), which employs RF to recognize eight body parts of dairy cows from the side using depth images, and noted that our recognition rate for the torso is lower. Most of the errors involve wrong solutions in which the hind legs, udder or tail were recognized as part of the torso. For the cow's side, the tail sensitivity was the lowest. For the cow's rear, the head, forelegs and tail have lower sensitivity, because the rear-view head and front legs cannot be detected if they exceed the detection range of the sensor. Note, however, that the assessment sites for cow cleanliness comprise the belly, udder, near hind leg, rear and lower legs (the udder and the area around the udder), excluding the head, the two front legs and the tail. The lower sensitivity is mainly due to the sensitivity of the Kinect to strong illumination changes, which resulted in cow masks with many holes, missing body parts and imprecise depth images. This problem can be partially mitigated by the adoption of a stereo camera with a higher resolution than the Kinect, at an increased system cost. Our study made use of the second version of the Kinect camera with its specific resolution and sensitivity; a different 3D camera (e.g. laser-based) might be required to reduce illumination sensitivity and extend the application inside the barn.

Discussion and conclusion
We have shown, for the first time, cow body-part detection for both the side and rear views of cows. The 3D depth and color images were acquired with a Microsoft Kinect v2 sensor and used together with distance-transform values to classify the cow body parts; the body parts were estimated by classifying the branches of the skeleton. The segmentation of the body parts of dairy cows was tested in this study, and the results showed that:

1. In the classification of the skeleton of the dairy cow, body parts were partially lost during pre-processing due to the distance limit and the sensitivity of the sensor used for image acquisition, so the precisions from the side were higher than those from the rear. The average detection rates for side-view and rear-view cows were 95.39% and 94.24%, respectively. In the detection of the body parts of side-view and rear-view cows, the average accuracies were 95.98% and 90.67%, the sensitivities 70.58% and 44.93%, and the specificities 97.03% and 95.07%. The average computing time per RGB-D image was 4.49 s (Alienware NA75E-7HL, 2.8 GHz processor, 16 GB memory).

2. The segmentation of the body parts of the cow is the foundation for the evaluation of its body cleanliness. The cleanliness of the cows did not significantly affect the segmentation of the body parts; the quality of the body segmentation, however, directly affects the accuracy of the subsequent dirt detection. The assessment areas for cleanliness comprise the belly, udder, near hind leg, rear and lower legs. With the head, torso, front legs and tail excluded, the key parts for the cleanliness evaluation could be segmented accurately: the average accuracies were 96.16% and 92.32%, the sensitivities 77.82% and 66.82%, and the specificities 98.06% and 96.72%, indicating that the method is accurate enough for segmentation of the key body parts of cows.

3.
The practical application of the body-segmentation system is as a component of a welfare indicator (cleanliness evaluation). For commercial farm conditions, factors such as camera location, background lighting and the milking routine will all need to be considered. Further research should investigate methods for cleanliness evaluation on the selected body parts, e.g. belly, udder, rear and legs. The target of this work is body segmentation of lactating dairy cows, because cleanliness assessment is mainly aimed at cows involved in milking. Other types of cattle, or calves, have a different bone structure, musculature and coat, which changes the shape and outline of the animal, so a new template dataset would be needed for them.
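As a small illustration of the distance-transform cue mentioned in the conclusions (a sketch under simplified assumptions, not the study's implementation): the distance transform of a binary cow mask encodes local thickness, so thick regions such as the torso can be told apart from thin regions such as the legs. The toy mask below is hypothetical.

```python
import numpy as np
from scipy import ndimage

# Toy binary cow mask: a thick horizontal "torso" with one thin "leg".
mask = np.zeros((9, 12), dtype=bool)
mask[1:4, 1:11] = True  # torso, 3 px thick
mask[4:8, 5:6] = True   # leg, 1 px wide

# Euclidean distance of every foreground pixel to the nearest background pixel.
dist = ndimage.distance_transform_edt(mask)

# Local thickness separates part types: the torso midline lies farther
# from the background than any leg pixel.
torso_thickness = dist[2, 5]  # 2.0
leg_thickness = dist[6, 5]    # 1.0
```

Sampling these distance values along the skeleton branches gives each branch a thickness profile that supports labelling it as torso, leg, head or tail.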
In this paper, we have shown that images of cows can be successfully segmented into separate body parts in the majority of cases. Compared with the body parts segmented by previous methods (Zhao et al., 2017; Jiang et al., 2019), the segmentation in this study is finer and more robust for both side-view and rear-view cows, which is essential for scoring the cleanliness of dairy cows. With new training data, the proposed method could also be used to segment other animals (e.g. horses or pigs).
The presented work forms the basis of an automatic cleanliness-scoring system. A dirt-detection algorithm still needs to be developed and combined with the proposed body-part segmentation method to assess the cleanliness of each body part. Whereas the current work used manually selected images of side and rear views of standing or walking cows, a future system will need to acquire such images automatically. This can be achieved by placing the camera at the exit of the milking parlor, where the cows pass one by one in a predefined pose and where the illumination can be fully controlled, so that the system is not hampered by disturbances caused by sunlight.

Declaration of Competing Interest
The authors declare that there is no conflict of interest.