A dicentric chromosome identification method based on clustering and watershed algorithm

Aiming at the problem of low efficiency of dicentric chromosome identification counting under the microscope, this paper presents a joint processing algorithm combining clustering and watershed. The method first uses clustering and watershed algorithm to segment the original chromosome image, and then identifies the individual chromosomes. The results show that when the equivalent width Y parameter is selected m = 1, n = 1, the true positive rate of dicentric chromosome identification is 76.6%, and positive predictive value is 76.6% in high dose, which is higher than the threshold algorithm for the true positive rate (63.9%) and positive predictive value (63.5%). The number of identified dicentric chromosomes can be used for dose estimation. When 500 cells are used for identification and dose estimation, the dose estimation pass rate can reach 80% in high dose. But for low dose, more cells should be used to identify to increase the dose estimation pass rate.

can remove the impurities in the original image and perform the initial segmentation. The watershed algorithm can further segment the lightly contiguous chromosomes so that the single chromosomes in the chromosome cluster can be completely segmented. Using this method can not only maintain the original morphology of chromosomes, but also effectively avoid the large number of pixel loss caused by linear compression in the segmentation process. And segmentation of a single chromosome or clumps only accounts for tens of KB, greatly reducing the generated process memory and improving the processing speed. The segmented single chromosomes are identified by centromeres. The algorithm has a true positive rate (TPR) 76.6% and a positive predictive value (PPV) of 76.6% in high dose. The number of identified dicentric chromosomes can be used to estimate the dose of the population exposed to the radiation source, for low dose radiation, more identified cells should be used, and for high dose radiation, the number of identified cells can be appropriately reduced. When 500 cells are used for identification and dose estimation, the dose estimation pass rate can reach 80% in high dose.

Methods
To perform the segmentation and identification of chromosome, a chromosome image of cells in metaphases must first be obtained. The chromosome images used in this article are derived from regular Giemsa-stained slides. The chromosome image acquisition system consists of three parts (Fig. 2): (1) Color microscope, using an OLYMPUS optical microscope with an oil-immersed 100X objective. (2) CCD camera, using Lumenera's camera, the camera connected to the computer through the USB interface, real-time display and photographing. (3) Motion console and storage system. The computer sends instructions by Ethernet to make the motion console move according to the default path while the microscope moves up and down to photograph the captured chromosome clumps and store it in the computer.
The algorithms were developed in Visual Studio 2013, and the software was implemented by C++, which consists of four modules. They are image acquisition module, image processing module, image analysis module and data storage management analysis module. Software support library includes OpenCV and other third-party  Chromosome segmentation. Segmentation is an important step in chromosome analysis. The segmentation of chromosome will affect chromosome centromere identification and the accuracy of karyotype analysis. The chromosome segmentation includes the segmentation of chromosomes and backgrounds, and the segmentation of chromosome clumps. And the segmentation of chromosome clumps is divided into simple adhesion segmentation and overlapped cross segmentation. In this paper, the chromosomes and backgrounds are first segmented by clustering algorithm. The K-Means clustering algorithm is mainly used to implement the automatic clustering. It is an unsupervised machine learning algorithm and is widely used. There are many kinds of clustering algorithms. This paper uses the K-Means++ algorithm for segmentation. The main parameters of the K-Means++ algorithm are samples, clusterCount, termcrit, and attempts. The metaphase split phase images are taken as samples, and the clusterCount is chosen to be 2, which separates the background from the chromosome object, the termcrit is chosen to be 40, and the attempts is selected as 3. After the first clustering is completed, a large number of chromosome clusters are generated, and a small number of individual chromosomes are generated. By simple feature judgment, the screened chromosome clumps are re-segmented. For slightly clumped chromosome clumps, the watershed method is a commonly used segmentation method. In the paper, the clustering algorithm is combined with the watershed algorithm to segment the chromosome clumps. In the second segmentation, partial clustering algorithm parameters such as clusterCount and termcrit are appropriately adjusted for the chromosome clumps. After using the K-Means++ clustering algorithm, since the central parts of chromosomes are darkly stained at the opposite edges, this clustering algorithm makes the chromosomes thinner, that is, discarding the lighter part of the edge and retaining the deeper part of the center, thus making the large-scale adhesion chromosomes separate. However, the segmented chromosome is relatively thinner than the chromosome segmented in the first step. In order to ensure the consistency of the chromosome morphology of the whole cell, using the idea of watershed algorithm, the thinner segmented chromosome is used as a seed point. "Watering" one pixel at a time centered on each seed point until two different seed points meet the core "waters". At this time, different chromosomes are labeled with the seed point as a reference point, and each chromosome can be separated according to the label.
For overlapping chromosome clumps, although some papers propose some solution algorithms, such as threshold segmentation method 1 , deep learning after sampling from the artificially segmented chromosomes 2 , geometry-based segmentation method 3 , IAFCM (improved adaptive fuzzy C-Means algorithm) 5 , fuzzy c-means clustering algorithm and watershed algorithm 6 , CPOOS (classification-driven partially occluded object segmentation) 17 . But mostly for a specific type overlapping chromosomes, and segmentation does not have universal applicability, and even if the segmentation is performed according to the algorithms in the papers, the segmented single chromosomes are prone to misidentify the centromeres in the subsequent chromosome centromere identification algorithm. Therefore, for overlapping chromosome clumps (Fig. 3), this type is used to identify by manual interaction.
Centerline extraction. Many operations on chromosomes require the centerline of chromosomes, such as classification of chromosomes 4,14 . Many features related to shape and structure, such as width and density profiles, can be extracted based on the centerline of the chromosome. The small deviations in extracting these real features may lead to identification and classification errors 18 . When the extraction of the chromosome centerline is completed, the identification and counting of the centromeres of single chromosomes can be performed according to the centerline.
This article's processing method draws the minimum circumscribed rectangle of each chromosome firstly. Using the ratio of the area of the chromosome to the area of the smallest rectangle as a measure, when the ratio is less than a certain value, it indicates that the degree of chromosome bending is severe. When the ratio is close to 1, it indicates that the chromosome is straight. For the extended type chromosome, the axis of symmetry that parallels to the long side of the minimum circumscribed rectangle is directly used as the centerline of the chromosome. For the more severely curved chromosomes, use the method 19 to extract the skeleton. In order to overcome the problem of small bifurcations and small holes when extracting the skeleton by this algorithm, the chromosomes are eroded, dilated, smoothed borders and filled the internal pores. To overcome the fact that the skeleton extracted by this algorithm is significantly shorter than the chromosome, the approximate slope is calculated at both ends of the skeleton, and then an empirical value length is extended to solve the extraction of short skeleton.
Centromere identification. Centromeres are also called kinetochores. Chromosome centromeres refer to the pair of spherical structures that are located in the centromere area and the outer surface of the two chromatids and can be colored by special methods. The number of dicentric chromosomes in the human peripheral blood lymphocytes is used to detect the presence of chromosomal abnormalities in an individual. Or when a radiation accident occurs, the number of dicentric chromosomes is used to estimate the exposure of the human body to radiation. According to the radiation dose, it can improve the patient's efficient and timely treatment.
Observing and analyzing the chromosome image, we find that there are three distinct differences in the image characteristics of the centromere and non-centromere area. The chromosome usually has a smaller width at the centromere, a smaller gray value, and the distribution of gray values is more uniform, so these differences are usually used as the characteristics of the centromere identification. For example, the projection vectors in the horizontal and vertical directions are calculated by adding up the values of the pixels along the projection line to determine the centromere position 12 . The identification of centromeres can be identified using fuzzy sets and neighborhood masks 13 . The pixel and distance are computed to find centromeres 14 . The identification of centromeres can also be used to calculate the number of centromeres by concavity 8 , but both angle and curvature can cause noise pollution 20 . The multiple identification methods of centromeres mainly use the width or gray values. The accuracy of these identification methods is not very high. Therefore, the method of combining width and gray values is used to identify the location of centromeres. The background gray value of the single chromosome after division is set to 0, and the gray value of the region of the chromosome itself is not processed. Let the coordinate of each point of the center axis extracted in II-B be P i (x i , y i ), and the corresponding gray value is M i . Since the gray value is smaller, negate it, denote it as G i : i j ij ij , the negation of the corresponding gray value is G ij , and the Euclidean distance from the vertical point P i to the point P i Q i in vertical line is D ij : Define a new parameter: the equivalent width Y i , then the equivalent width at point P i is: For a single chromosome after extracting the central axis, the process of identifying the centromere according to the equivalent width is as follows: 1. A single chromosome after extracting the central axis is taken as input, and the equivalent width curve of the point on the central axis is obtained, and one-dimensional low-pass filtering is performed thereon. 2. For the filtered equivalent width curve, the trend of the equivalent width is fitted with a straight line, and the difference between the point value on the fitted line and the value corresponding to the filtered point is used to generate a difference curve. 3. Derivate the difference curve and perform one-dimensional low-pass filtering to find all extreme points 4. For all the minimum values, find the difference between the maximum value of the left and right sides of the point, and the difference between the left side is recorded as A and the right side is recorded as B. 5. A threshold T is set according to a large amount of data. When A > T or B > T, and A > T/2 and B > T/2, the minimum point is a centromere point.
According to the number of centromere points, the chromosome can be judged as dicentric chromosomes, or monocentric chromosomes, or multicentric centromere chromosomes. Dose estimation. Radiation sources that are usually exposed to the human body are X rays, γ rays, and occasionally neutrons. The radiation dose estimation for humans are often using a γ ray curve. As the uncertainty of counting can be caused by slides or observation of individual differences in chromosome centromeres, a confidence interval is introduced to express uncertainty, using a 95% confidence interval as a criterion 21 . As the Poisson distribution of detected aberrations in the overexposed sample and the uncertainty in the calibration curve that is close to the normal distribution, it is difficult to calculate the confidence limits. Savage 22 , Merkle 23 , and Szluinska 24 have been analyzed and discussed this problem. Merkle's method is the simplest, and considers both the Poisson error on the aberration yield and the errors on the dose curve to be taken into account.
For the dose-effect curve established on counting a large number of cells, the change of the curve is small compared with the change of the distortion rate of the subject, which can be neglected. As shown in the Fig. 4 and Table 1, the confidence interval can be calculated through the following four steps.
1. Assuming that M cells are analyzed and contain X dicentric chromosomes, the distortion yield is: The dose-effect curve is a linear square model(Y = C + αD + βD 2 ), the estimated dose D can be obtained by solving the equation: 2 2. Assuming the Poisson distribution, X U and X L are obtained from the standard statistical table of the expected Poisson's distribution limit 25 . The Y U and Y L are: Calculate the dose at the intersection of Y L and the curve, which is the lower confidence limit(D L ). 4. Calculate the dose at the intersection of Y U and the curve, which is the upper confidence limit(D U ).

Results and Discussion
Segmentation results. The chromosome segmentation includes the segmentation of chromosomes and backgrounds, the segmentation of chromosome clumps, and the segmentation of chromosome clumps is divided into simple adhesion segmentation and overlapped cross segmentation. After the chromosome and background segmentation of the original picture, the initial clustering segmentation can generate single chromosomes (Fig. 5). And when segmenting, the algorithm does not change the original morphology of the chromosome.
The simple characterization of the first-segmented chromosome clumps is performed, and the re-segmented chromosome clumps are selected. These selected chromosome clumps are taken as input to segment by clustering and watershed algorithm, it is obvious that the clustering segmented chromosomes are one circle smaller than the original chromosomes. In order to ensure the consistency of chromosome morphology, regarding the clustering segmented chromosomes as seed points, performing watershed segmentation can separate slightly sticky chromosome clumps (Fig. 6).
Most of the chromosomes are derived from human peripheral blood lymphocytes after gamma-irradiation. After being stained with Giemsa, they are placed on a microscope platform and scanned automatically. The data consist of metaphase split images taken from photographs of different doses of slides. Data contain 15,000 metaphase images, which are divided into data set 1, data set 2, and data set 3 for 5,000 images of 1 Gy,2 Gy,4 Gy radiation dose. And each data set contains 10 groups. For the data set 1, 2, 3, the software of using the clustering + watershed algorithm segmented the average group of 19542, 20128, 20732 objects, averaging 39,40,41 objects per metaphase. When the threshold is used, the software segmented the average group of 14654, 15244, 16178 objects, averaging 29, 30, 32 objects per metaphase.
Extraction centerline results. The centerline is extracted from the single segmented chromosomes. For the straight type, or curved type can directly obtain the centerline (Fig. 7a). For the hole type, or bifurcation type, the centerline can be obtained after the chromosome has been eroded, dilated, smoothed borders and filled the internal pores (Fig. 7b,c). The centerlines of most single chromosomes can be extracted, except for some specially shaped chromosomes. The centerlines of these chromosomes will produce a severe shift. However, since the occurrences of this type is infrequent, the effect on the dicentric chromosome recognition results almost can be neglect. Dicentric chromosome identification results. As shown in II-A, a new method for chromosome segmentation is designed based on the clustering algorithm and the watershed algorithm. The segmented single X X L X U , X D X X L X U , X D X X L X U , X D X X L X U , X D   Fig. 8. It can be clearly seen that Fig. 8a,b have two dicentric chromosomes which are identified correctly. But Fig. 8c contains three dicentric chromosomes, only two dicentric chromosomes have been identified because there is no outwardly extending chromosome arm at the unrecognized part A. Therefore, it is more difficult to identify this type dicentric chromosome. Foe three data sets, experts identified all dicentric chromosome and also labeled false positive dicentric chromosomes. The experts also judged the identified dicentric chromosomes after the software identification and corrected the number of identified dicentric chromosomes.
The dicentric chromosomes in three data sets were identified for the (m = 1, n = 1) values and compared with the threshold algorithm, the results are shown in Fig. 9. PPV and TPR are used to measure the identification of algorithms among different methods. PPV indicates the identification accuracy of the dicentric chromosomes,  and TPR indicates the identification rate of the dicentric chromosomes. As can be seen from Fig. 9, compared with the threshold algorithm, the clustering + watershed algorithm has better results on TPR and PPV. Especially in high-dose radiation, the clustering + watershed algorithm has the TPR of 76.6% and the PPV of 76.6%, both of which exceed three-quarters, showing good identification results. At low dose, due to the relatively low radiation dose, the formation of dicentric chromosomes is less, normal chromosomes are more. It is prone to mis-segmentation, which will lead to low identification accuracy (30-40%). Dose estimation results. Dose estimation was performed on 30 groups of identification results. The dose curve was based on the dose-effect curve of the dicentric chromosomes fitted in our laboratory (8). The relative deviation of the estimated dose ≤20% is regarded as qualified. The pass rate for each data set is shown in Fig. 10.
As can be seen from Fig. 10, the higher dose, the higher pass rate of the dose estimate. Therefore, when dose estimation is performed, the number of cells should be selected more for low-dose radiation, and the number of cells can be appropriately reduced for high doses.

Conclusion
This paper proposes a segmentation method based on clustering algorithm and watershed algorithm to segment the chromosome cluster, and then extract the central axis from the segmented single chromosome. According to the position of the central axis, the dicentric chromosomes are identified by combining the two factors of gray scale and distance. After manually identifying the identified dicentric chromosomes, the number of dicentric chromosomes is obtained, which can be used to estimate radiation doses. The results are as follows:  (1) The proposed automatic segmentation and identification method for dicentric chromosomes has the true positive rate (TPR) 75.6% and the positive predictive value (PPV) of 60%, which is higher than the method using threshold algorithm. (2) By comparing the different dose, it is found that the higher dose, the higher true positive rate and positive predictive value can be obtained, especially the positive predictive value. (3) The yield and pass rate of dose estimation depend on the amount of radiation dose received. For low dose radiation, the more cells should be identified, and for high dose radiation, the number of identified cells can be appropriately reduced. When 500 cells are used for identification and dose estimation, the dose estimation pass rate can reach 80% in high dose.
Statistics. Data were tested for normal distribution. Differences between groups were analyzed using the paired Student t test (IBM SPSS Statistics v. 17.0, IBM, Armonk, NY). All values are expressed as mean ± standard deviation (SD). Statistical significance was accepted for values of P < 0.05.

Data Availability
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.