An Interactive Java Statistical Image Segmentation System: gemident

Supervised learning can be used to segment/identify regions of interest in images using both color and morphological information. A novel object identification algorithm was developed in Java to locate immune and cancer cells in images of immunohistochemically-stained lymph node tissue from a recent study published by Kohrt, Nouri, Nowels, Johnson, Holmes, and Lee (2005). The algorithms are also showing promise in other domains. The success of the method depends heavily on the use of color, the relative homogeneity of object appearance, and on interactivity. As is often the case in segmentation, an algorithm specifically tailored to the application works better than using broader methods that work passably well on any problem. Our main innovation is the interactive feature extraction from color images. We also enable the user to improve the classification with an interactive visualization system. This is then coupled with the statistical learning algorithms and intensive feedback from the user over many classification-correction iterations, resulting in a highly accurate and user-friendly solution. The system ultimately provides the locations of every cell recognized in the entire tissue in a text file tailored to be easily imported into R (Ihaka and Gentleman 1996) for further statistical analyses. This data is invaluable in the study of spatial and multidimensional relationships between cell populations and tumor structure. This system is available at www.GemIdent.com together with three demonstration videos and a manual.


Introduction
We start with an overview of current practices in image recognition and a short presentation of the clinical context that motivated this research. We then describe the software and the complete workflow involved; the last two sections present technical details and potential improvements. The interactive algorithm, although developed to solve a specific problem in histology, works on a wide variety of images. One example is locating the oranges in a photograph of an orange grove (see Fig. 1).
Figure 1: The original image (left), a mask superimposed on the original image showing the results of pixel classification (center), and the original image marked with the centers of the oranges (right).

Any image with few relevant colors, such as green and orange in the example above, in which the objects of interest vary little in shape, size, and color, can be analyzed using our algorithm. First, we describe the application to cell recognition in microscopic images.

Previous research
As emphasized in recent reviews (Ortiz de Solórzano, García Rodriguez, Jones, Pinkel, Gray, Sudar, and Lockett 1999; Mahalanobis, Kumar, and Sims 1996; Wu, Gauthier, and Levine 1995; Wu, Barba, and Gil 1998; Yang and Parvin 2003; Gil, Wu, and Wang 2002; Fang, Hsu, and Lee 2003; Kovalev, Harder, Neumann, Held, Liebel, Erfle, Ellenberg, Neumann, Eils, and Rohr 2006a), automated computer vision techniques applied to microscopy images are transforming the field of pathology. Not only can computerized vision techniques automate cell-type recognition and count cells objectively, but they also enable a more objective approach to cell classification, providing at the same time a hierarchy of quantitative features measured on the images. Recent work on character recognition (Chakraborty 2003) shows how efficient interactivity can be in image recognition problems, with the user pointing out mistakes in real time, thus providing online improvement. In modern jargon, we call this interactive boosting (Freund and Schapire 1997). Current cell image analysis systems such as EBimage (Skylar and Huber 2006) and Midas (Levenson 2006; Collins 2007) do not provide these interactive visualization and correction features.

Kohrt et al. (2005) showed that T-cell and dendritic cell populations within axillary lymph nodes of patients with breast cancer are significantly decreased in patients who relapsed. No study thus far has examined the spatial variability in lymphocyte populations and phenotypes as related to lymph node-infiltrating tumor cells and clinical outcome. The location of tumor-dependent immune modulation has significant sequelae, given the critical role of lymph nodes in activation of the immune response. This suggests that tissue architecture could yield clues to the workings of the immune system in breast cancer, which can be investigated through spatial analysis of the different cell populations. The limiting step to date has been locating every cell.

gemident
This algorithm has been engineered into the software package named gemident (Kapelner, Holmes, and Lee 2007). The distribution, implemented in Java, includes an easy-to-use GUI with four panels: color (or stain) training, phenotype training/retraining (see Fig. 6), classification, and data analysis (see Fig. 7), with a final data output into a text file which is easy to input and analyse in R. The Java implementation ensures full platform independence. The distribution also includes support for image sets derived from the BacusLabs Bliss Imager (Bacus 2007a), the BacusLabs NanoZoomer (Bacus 2007b), and the CRI Nuance Multispectral Imager (Nuance 2007).
In this paper we focus on the software itself, its use in conjunction with R, and the developments engineered to analyze multispectral images in which the chromatic markers (called chromagens) are separated a priori. Figure 2 shows gemident employed in the localization of cancer nuclei in a typical image from Kohrt et al. (2005). The algorithm internals are detailed in Section 2, but we give a brief summary here for potential users who do not need the technical details.
Our procedure requires interactive training: the user defines the target objects sought in the images. In the breast cancer study, these are the cancer nuclei themselves. The user must also provide examples of the "opposite" of the cancer nuclei, i.e. the "non-object." Fig. 3 shows an excerpt from the training step. New images can then be classified into cancer-nucleus and non-cancer-nucleus regions using a statistical learning classifier (Hastie, Tibshirani, and Friedman 2001). Using simple geometric techniques, the centers of the cancer nuclei can then be located and an accurate count tabulated.
The user can then return, examine the mistakes, correct them, and retrain the classifier. In this way a more accurate classifier is iteratively constructed. The process is repeated until the user is satisfied with the results.

Algorithm

Image Acquisition
Any multiple-wavelength set of images can be used, as long as they are completely aligned. In this example, the images were acquired via a modified CRI Nuance Multispectral Imager (Nuance 2007). The acquisition was completely automated via an electronically controlled microscope, with each image decomposed into subimages. Multiple slides were sequentially scanned via an electronically controlled cartridge that feeds and retracts slides to and from the stage. The setup enables scanning of multiple slides with large histological samples (some 10 mm+ in diameter and spanning as many as 5,000+ subimages, or stages) at 200X magnification.
Each raw exposure is a collection of eight monochromatic images corresponding to the eight unique wavelength filters scanned (for information on dimension reduction in spectral imaging, see Lee, Woodyatt, and Berman 1990). Eight filters span the range of the visual spectrum while remaining few enough to keep image acquisition reasonably fast and the amount of data manageable. These multispectral images are composed of 15-bit pixels whose values correspond to the optical density at that location. The files for each wavelength of each subimage, or stage, are typically TIF files of about 3MB each.
Prior to whole-tissue scanning, a small representative image is taken that contains lucid examples of each of the S chromagens. The user selects multiple examples of each of the S chromagens. This information is used to compute a "spectral library." The same spectral library can then be used for every histological sample that is stained with the same chromagens.
After whole-tissue scanning, the intermediate images are combined using proprietary spectral unmixing algorithms (Nuance 2007) to yield S orthogonal chromagen channel images. These "spectrally unmixed" images are also monochromatic and composed of 15-bit pixels whose values correspond to a pseudo-optical density.
We use the spectrally unmixed images directly as score matrices, which we call F_s, where s is the chromagen of interest. The program has been used with various score-generation mechanisms, and the quality of the statistical learning output has proved quite robust to these changes. Multispectral microscopy has the advantage of providing a clean separation of the chromagen signals.

Training Phase

1. Interactive Acquisition of the Training Sets for Objects of Interest
For each of the P phenotypes, or categories of objects of interest, a training set is built: the user interactively chooses example points, t = (i, j), in the images where the phenotype is located, forming the lists T_1^+, T_2^+, ..., T_P^+. In addition, the user interactively chooses example points where none of the P phenotypes are located, i.e. the "non-object," forming the list T^-.
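The bookkeeping behind these lists is simple; the following Java sketch shows one way to hold the T_p^+ lists and T^-. The class and method names are our own illustration, not gemident's actual API.

```java
import java.awt.Point;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the training-set bookkeeping described above:
// one list of example points T_p^+ per phenotype, plus one list T^- for
// the "non-object" points. Names are assumptions, not gemident's API.
public class TrainingSets {

    final Map<String, List<Point>> positives = new HashMap<>();
    final List<Point> negatives = new ArrayList<>();

    // Record a clicked example pixel (i, j) for the given phenotype.
    void addPositive(String phenotype, int i, int j) {
        positives.computeIfAbsent(phenotype, k -> new ArrayList<>()).add(new Point(i, j));
    }

    // Record a clicked "non-object" pixel (i, j).
    void addNegative(int i, int j) {
        negatives.add(new Point(i, j));
    }

    public static void main(String[] args) {
        TrainingSets t = new TrainingSets();
        t.addPositive("cancer_nucleus", 201, 51);  // a clicked cancer-nucleus pixel
        t.addNegative(10, 10);                     // a clicked "non" pixel
        System.out.println(t.positives.get("cancer_nucleus").size());  // prints 1
    }
}
```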

Learning Phase
2. Feature Definition

a) Define a "ring," c_r, to be the 0/1 mask of the locus of pixels t lying a distance r ± ε away from the center point t_o, where ε refers to the error inherent in discretizing a circle (see Fig. 4). Generate the set of rings C = {c_r : r = 0, 1, 2, ..., R}, where R is the maximum radius of the largest phenotype to be identified; R can be modified by a multiplicative factor which controls the amount of additional information beyond R to be included.

b) For a point t_o = (i_o, j_o) in the training set and for a ring c_r, create a "ring score," q_{s,r}, by summing the scores for all points in the mask:

    q_{s,r}(t_o) = Σ_{t ∈ c_r(t_o)} F_s(t).

c) Repeat for all rings in C and for all S score matrices to generate the observation record L_i of length S × R by vectorizing the q_{s,r}.

d) Compute an observation record for each point t in the training sets, for all phenotypes and for the "non" category, and append to each observation record L_i a categorical variable recording the phenotype.
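To make the ring features concrete, here is a minimal Java sketch of steps a) and b). The class name, the ±0.5 discretization band standing in for ε, and the zero-padding at image borders are our assumptions, not gemident's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the "ring score" feature described above.
// Names (ringMask, ringScore, the score matrix F) are illustrative.
public class RingScore {

    // Offsets (di, dj) whose Euclidean distance from the center is
    // within +/- 0.5 of r -- a simple discretization band for the ring c_r.
    static List<int[]> ringMask(int r) {
        List<int[]> mask = new ArrayList<>();
        for (int di = -r - 1; di <= r + 1; di++) {
            for (int dj = -r - 1; dj <= r + 1; dj++) {
                double d = Math.sqrt(di * di + dj * dj);
                if (Math.abs(d - r) <= 0.5) mask.add(new int[]{di, dj});
            }
        }
        return mask;
    }

    // q_{s,r}(t_o): sum of the score matrix F_s over the ring of radius r
    // centered at (i0, j0); out-of-bounds pixels contribute zero.
    static double ringScore(double[][] F, int i0, int j0, int r) {
        double sum = 0.0;
        for (int[] off : ringMask(r)) {
            int i = i0 + off[0], j = j0 + off[1];
            if (i >= 0 && i < F.length && j >= 0 && j < F[0].length) sum += F[i][j];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] F = new double[9][9];
        for (double[] row : F) java.util.Arrays.fill(row, 1.0);
        // With F identically 1, the score equals the ring's pixel count.
        System.out.println(ringScore(F, 4, 4, 2));  // prints 12.0
    }
}
```

Vectorizing these scores over all r and all S score matrices yields the observation record L_i of step c).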
3. Creation of a Classifier

All observation vectors are concatenated row-wise into a training data matrix, and a supervised learning classifier that will be used to classify all phenotypes is created.
In this implementation we use the statistical learning technique known as "random forests," developed by Breiman (2001), to automatically rank features among a large collection of candidates. This technique was compared to a suite of other learning techniques in a cell recognition problem by Kovalev, Harder, Neumann, Held, Liebel, Erfle, Ellenberg, Neumann, Eils, and Rohr (2006b), who found it to be the best technique, being both the most accurate and the least variable of those compared. As with all supervised learning techniques (Hastie et al. 2001), the method depends on the availability of a good training set of images in which the pixels have already been classified into several groups.
At this point, all previously generated data can be discarded. Most machine learning classifiers, including our version of random forests, provide information on which scores q_{s,r} are important in the classification (see the example in Figure 12).

Classification Phase

4. Pixel Classification
For each image I to be classified, an observation record L(t_o) is created for each pixel (as in Step 2) and then evaluated by the classifier. The result is stored in a binary matrix, with 1 coding for the phenotype and 0 for its opposite. There are P binary matrices, one for each phenotype:

    B_1, B_2, ..., B_P.
To enhance speed, k pixels can be skipped at a time, and the B_p's can then be closed k times (dilated then eroded using a 4N mask; Gonzalez, Woods, and Eddins 2004) to fill in the resulting holes.
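A minimal Java sketch of one closing pass (dilation followed by erosion with the 4N mask); the method names are illustrative, and repeating the dilate/erode pair k times would correspond to closing k times as described above.

```java
// Sketch of closing a binary matrix with a 4-neighbor (4N) structuring
// element: one dilation pass, then one erosion pass. Names are illustrative.
public class Closing {

    // The 4N mask: the center pixel plus its four edge-neighbors.
    static final int[][] N4 = {{0, 0}, {1, 0}, {-1, 0}, {0, 1}, {0, -1}};

    // A pixel is set after dilation if any pixel under the mask was set.
    static boolean[][] dilate(boolean[][] b) {
        int n = b.length, m = b[0].length;
        boolean[][] out = new boolean[n][m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                for (int[] d : N4) {
                    int ii = i + d[0], jj = j + d[1];
                    if (ii >= 0 && ii < n && jj >= 0 && jj < m && b[ii][jj]) out[i][j] = true;
                }
        return out;
    }

    // A pixel survives erosion only if every pixel under the mask is set.
    static boolean[][] erode(boolean[][] b) {
        int n = b.length, m = b[0].length;
        boolean[][] out = new boolean[n][m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {
                out[i][j] = true;
                for (int[] d : N4) {
                    int ii = i + d[0], jj = j + d[1];
                    if (ii < 0 || ii >= n || jj < 0 || jj >= m || !b[ii][jj]) out[i][j] = false;
                }
            }
        return out;
    }

    static boolean[][] close(boolean[][] b) { return erode(dilate(b)); }

    public static void main(String[] args) {
        // A plus-shaped blob with its center pixel missing.
        boolean[][] b = new boolean[5][5];
        b[1][2] = b[3][2] = b[2][1] = b[2][3] = true;
        System.out.println(close(b)[2][2]);  // the hole is filled: prints true
    }
}
```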

Post-processing
Each pixel is now classified. Contiguously classified regions, called "blobs" from here on, are post-processed in order to locate the centers of the relevant objects, such as the middle of an orange or of a cell nucleus.
Define the matrices C_1, C_2, ..., C_P to hold the centroid information for each phenotype.
There are many algorithms for blob analysis. We used a simple one, summarized below; for a more detailed description, see Appendix A, and for an example of the results it produces, see Fig. 5.

A sample distribution of the training blob sizes is created by reconciling the user's training points and counting the number of pixels in each blob using the floodfill algorithm. We calculate the 2nd percentile, the 95th percentile, and the mean. For each blob obtained from the classification, these sample statistics are used in a heuristic rule that discards blobs that are too small and quarantines those that are too large; the blobs that are too large are split up based on the mean blob size. Each blob's centroid is located by averaging the blob's coordinates.

After classification (and post-processing if desired), the results from the B's and the C's can be superimposed onto the original training image. The user can mark false negatives (adding points to the T_p^+'s) and false positives (adding points to T^-). The observation records can then be regenerated, the supervised learning classifier relearned, and the classification rerun with greater accuracy. This retraining-reclassification process can be repeated until the user is satisfied with the results.
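The floodfill-and-centroid step can be sketched as follows. This minimal Java version labels blobs with 4-connectivity and averages their coordinates; it omits the size-based discarding, quarantining, and splitting rules described above, and its names are our own, not gemident's.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Sketch of the blob post-processing described above: floodfill visits
// each contiguous region of 1-pixels in a binary matrix B_p, and each
// blob's centroid is the mean of its pixel coordinates.
public class BlobCentroids {

    // Returns one {rowCentroid, colCentroid} pair per blob (4-connectivity).
    static List<double[]> centroids(boolean[][] b) {
        int n = b.length, m = b[0].length;
        boolean[][] seen = new boolean[n][m];
        List<double[]> out = new ArrayList<>();
        int[][] n4 = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {
                if (!b[i][j] || seen[i][j]) continue;
                // Iterative floodfill from this unvisited 1-pixel.
                ArrayDeque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[]{i, j});
                seen[i][j] = true;
                long si = 0, sj = 0, count = 0;
                while (!stack.isEmpty()) {
                    int[] p = stack.pop();
                    si += p[0]; sj += p[1]; count++;
                    for (int[] d : n4) {
                        int ii = p[0] + d[0], jj = p[1] + d[1];
                        if (ii >= 0 && ii < n && jj >= 0 && jj < m
                                && b[ii][jj] && !seen[ii][jj]) {
                            seen[ii][jj] = true;
                            stack.push(new int[]{ii, jj});
                        }
                    }
                }
                out.add(new double[]{(double) si / count, (double) sj / count});
            }
        return out;
    }

    public static void main(String[] args) {
        boolean[][] b = new boolean[4][6];
        b[0][0] = b[0][1] = b[1][0] = b[1][1] = true;  // 2x2 blob, centroid (0.5, 0.5)
        b[3][5] = true;                                 // single-pixel blob at (3, 5)
        for (double[] c : centroids(b))
            System.out.println(c[0] + "," + c[1]);
    }
}
```

Counting the pixels visited per blob during the same pass yields the blob-size sample on which the percentile rules operate.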

Visualization Features
In the early stages of the workflow, there are several graphical helpers showing where the training points are. An overview window shows where the individual subimages are situated and how they are labeled (see Section 4 for an example). There is also a rare-event finder that displays only certain colors, useful when searching for rare cell types or for the colocalization of two colors.
The user can also interactively identify points by turning them on and off with a mouse click.A magnification window improves the accuracy with which phenotype points are chosen.
In the retraining panel, varying the opacity of the phenotype binary marker matrix B has proven quite useful (see Fig. 6). The user has an interactive view of the points where there are type I errors and can add points to improve that error rate.

First Order Statistics and Output Overview
The data analysis panel (Fig. 7) generates histograms and summary reports. Most importantly, the program allows the user to save the cell center data as a text file that can be used by any specialized statistical package; in the example below we show the use of the spatstat package in R.

A complete example
Immunohistochemical staining (IHC) refers to the process of identifying proteins in the cells of a tissue slice through the specificity with which certain antibodies bind to particular antigens present on the cell.
Combining a stain called a chromagen with the antibodies allows visualization of these proteins and reveals their localization within the field. IHC is thus widely used to discover the in situ distributions of different types of cells. Until recently, most of the data collection was done by manual cell counting. In this example we show how different types of immune cells, as well as cancer cells, were detected and localized in large numbers using statistical learning for image segmentation.

Training Set Helper
When the image set is first opened, a training helper appears. Each number represents one of the snapshots (called stages) captured individually by the system.
The target phenotypes in this case were Dendritic cell nuclei, T-cell nuclei, Tumor cell nuclei, and unspecific cell nuclei (called other_cell).

Classified Zones
After the user specifies a set of training pixels for each phenotype by clicking on them, the training set is completely classified: every pixel in the images is assigned a phenotype or put in the 'Non' group. The output is then zones, or blobs, of the different phenotypes.

Interpreting the Classifiers
In order to do the classification, GemIdent creates a random forest (see Section 2, Step 3). When the random forest classifier has been created, a graphic is displayed illustrating the importance of each feature in the classifier (see Figure 12). Each bar graph represents a chromagen, the x-axes indicate the ring score's radius, and the y-axes indicate relative importance (with the largest bar being the most important). This illustration is important to the user because it serves as an additional check to spot mistakes.

Figure 9: An assembly of all the individual stages into the complete scan. The parts of the images that we want to train and classify are indicated in green, those that will be discarded have red numbers, and the red disks show the training set.
Figure 12: Bar charts for interpreting the distance at which the chromagens influence the classifier. The hematoxylin was found to be very important at r ∈ [0, 5]: the forest learned that cancer cell nuclei, T-cell nuclei, and unspecific cell nuclei all have hematoxylin-rich centers. Ferangi Blue was found to be most important at r ∈ [3, 6], indicating that the classifier learned that the T-cells are positive at their membranes. Vulcan Red was found to be most important at r ∈ [6, 15], indicating that cancer cells are positive on their membranes and that their diameter varies dramatically. DAB was found to be important at r ∈ [1, 6], indicating that dendritic cells appear devoid of a nucleus and vary in size.

After the centroids have been determined, the program can evaluate its errors. The type I errors are shown in a small window and written to the summary file in the output directory, which in our case showed:

filename,Num_Dendritic_cells,Num_T_cells,Num_other_cells,Num_Tumor
stage_0096,23,482,479,44
stage_0093,29,230,259,615
stage_0126,3,46,166,816
stage_0147,18,1203,907,

Data Analysis with R
The centroid data are loaded into R from text files, one per phenotype, each with five columns: the stage number (the subimage to which the cell belonged), the local X and Y coordinates, and the global X and Y coordinates. For instance, the file 99_4525D-Tumor.txt contains:

,locX,locY,globalX,globalY
stage_0096,201,51,4040,13037
stage_0096,214,91,4053,13077
stage_0096,220,76,4059,13062
stage_0096,230,107,4069,

The analyses performed on these data show the spatial landscape of the lymph node immune cells quite clearly.
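For readers who want to consume this file outside R, a minimal Java sketch of parsing one line of the format shown above follows; the Centroid record and parse method are our own illustration, not part of gemident.

```java
// Illustrative sketch: parsing one line of the centroid text file
// (stage id, local X/Y, global X/Y). Requires Java 16+ for records.
public class CentroidFile {

    record Centroid(String stage, int locX, int locY, int globalX, int globalY) {}

    // Split a comma-separated line into its five fields.
    static Centroid parse(String line) {
        String[] f = line.split(",");
        return new Centroid(f[0], Integer.parseInt(f[1]), Integer.parseInt(f[2]),
                Integer.parseInt(f[3]), Integer.parseInt(f[4]));
    }

    public static void main(String[] args) {
        Centroid c = parse("stage_0096,201,51,4040,13037");
        // prints stage_0096 -> (4040,13037)
        System.out.println(c.stage() + " -> (" + c.globalX() + "," + c.globalY() + ")");
    }
}
```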

Conclusions and Future Uses
The success of the method rests on the collection of all the data features in a neighborhood of each pixel, and on the selection by random forests of the pertinent features for each particular phenotype. The iterative boosting enables the user to increase accuracy to a satisfactory level.
Although we have concentrated here on static multispectral images, fluorescent images can be classified in a similar way. Instead of using distributions in colorspace to obtain scores, density estimation can be used to compute scores in the one-dimensional space of the fluorescent-layer intensity images.
Furthermore, the algorithm is not restricted to static images. Film is a composition of images called "frames" displayed over time; identification can be done in moving images by using the changing frames as the "z-axis," with scores computed via sums over sphere surfaces instead of sums over rings. The algorithm can also be generalized to phenotype identification in n dimensions, and it may be applied to the identification of objects in satellite imagery, face recognition, automatic fruit harvesting, and countless other fields.
There is no doubt that images will continue to provide data to solve many biological mysteries. The gemident project is a step towards combining human expertise and statistical learning to supply a feasible and objective data collection process in the presence of high case-to-case variability.

Figure 2: The original image (left), a mask superimposed on the original image showing the results of pixel classification (center), and the original image marked with the centers of the nuclei (right).

Figure 3: An example of training in a microscopic image of an immunohistochemically stained lymph node. The cancer membranes are stained with Fast Red (displaying maroon). There is a background nuclei counterstain that appears blue. The phenotype of interest is cancer cell nuclei. Training points for the nuclei appear as red diamonds and training points for the "non-nuclei" appear as green crosses.

Figure 4: Example of a ring mask: c_4 superimposed onto a typical cancer nucleus from the Kohrt images. White pixels designate the points which participate in the ring's score.

Figure 5: Left: an excerpt of the B_cancer matrix. Right: the results of the centroid-finding algorithm superimposed on the marked image.

Figure 6: Screenshot of the retraining tool, which allows the user to vary the opacity of the B_p's (the sliders in the center bottom) and highlights classification mistakes (the red circles) for easy correction. The example image displayed was scanned at 40x by the optical BacusLabs Bliss Imager (Bacus 2007a).

Figure 7: An example of the command-line data analysis panel. The histogram displayed within the panel is the result of a function that finds the distances from all phenotype A centroids to their closest phenotype B neighbors. The binary image is B_CancerMembrane. The open PDF document is the autogenerated report, which includes a thumbnail view of the entire image set, counts and type I error rates for all phenotypes, as well as a transcript of the analyses performed.

Figure 8: A tumor-invaded lymph node as it appears through the microscope. At this resolution we can see the red zones, which are tumor-invaded.

Figure 10: Slides of axillary lymph nodes were imaged and loaded as a new project into Gemident. Each slide had the same chromagen profile: CD1a targeting dendritic cells was stained with 3,3'-diaminobenzidine (DAB) (appearing brown in the screenshots), CD4 targeting T-cells was stained with Ferangi Blue (appearing bright green in the screenshot above), and AE1/AE3 targeting cytokeratin within breast cancer cells was stained with Vulcan Red. In addition, slides were stained with blue hematoxylin to reveal all nuclei. The intensity at which each of the chromagens is displayed can be adjusted using the "Color Adjustment" dialog box (shown on the left). The pixels chosen as training points are shown as crosses surrounded by grey disks. The color of the crosses indicates the training phenotype chosen (green for Tumor, black for Non).

Figure 11: After the training sets have been classified, the pixels in the different phenotypes appear as blobs.

Figure 13: The next step is performed by requesting that the program Find Centers. This starts the centroid-finding algorithm, which designates one pixel as the cell center, thus enabling cell counting and cell localization. Here the cell centroids are marked by black diamonds filled with the relevant phenotype's color code.

Figure 14: Type I errors are displayed surrounded by red circles. Consulting similarities among these errors allows the user to choose which points to add to the training set to improve the accuracy.

Figure 15: Output from the estimation of spatial densities using kernel density estimates from spatstat.