Optimal breast cancer diagnostic strategy using combined ultrasound and diffuse optical tomography

Ultrasound (US)-guided near-infrared diffuse optical tomography (DOT) has demonstrated great potential as an adjunct to US imaging alone for breast cancer diagnosis, especially in reducing unnecessary benign biopsies. However, DOT data processing and image reconstruction remain slow compared to the real-time speed of US. Real-time or near real-time diagnosis with DOT is an important step toward the clinical translation of US-guided DOT. To address this need, we present a two-stage diagnostic strategy that is both computationally efficient and accurate. In the first stage, benign lesions are identified in near real-time by a random forest classifier acting on the DOT measurements and the radiologists' US diagnostic scores. Any lesions that cannot be reliably classified by the random forest classifier are passed on to the second stage, which begins with image reconstruction. Functional information from the reconstructed hemoglobin concentrations is then used by a Support Vector Machine (SVM) classifier for diagnosis at the end of the second stage. This two-step classification approach, which combines perturbation data and functional features, improves classification as measured by the receiver operating characteristic (ROC) curve. Using this two-step approach, the area under the ROC curve (AUC) is 0.937 ± 0.009, with a sensitivity of 91.4% and specificity of 85.7%.


Introduction
Breast cancer is the most common cancer in women worldwide, with approximately 1.67 million new cases each year [1]. Despite improvements in detection and diagnosis, more than half a million patients die from this disease annually. Breast cancer is a spectrum of diseases with different histologic subtypes, grades, and biologic and metabolic activities, resulting in a wide range of functional differences [2]. Benign breast disease also encompasses a heterogeneous group of diseases that vary in vascular content, proliferative index, metabolic activity, and risk of breast cancer [3].
Multiple imaging modalities are currently used for breast cancer screening and diagnosis. X-ray mammography is the predominant modality for both screening and diagnostic imaging. Breast ultrasound (US) is the second most common diagnostic imaging modality and is also used for screening average to moderate risk women with dense breast composition [4][5]. Due to its high cost and limited access, MRI is reserved for screening high risk women and has application to a very narrow group of diagnostic indications. While the characteristics of malignant and benign breast lesions are well established by conventional imaging techniques [6][7], their overlapping appearances result in approximately one million image-guided breast biopsies each year in the United States, most yielding benign results [8]. An optical tomography system that reveals functional differences in breast abnormalities could greatly improve diagnostic accuracy and reduce the number of benign biopsies.
One major challenge for dual-modality breast cancer diagnosis is DOT's relatively slow data processing and image reconstruction speed as compared to the real-time imaging capabilities of US. Near real-time diagnosis is critical for the clinical translation of a US-guided DOT dual-modality technique. The approach described here employs a random forest classifier, an ensemble learning method that has been used widely in medical imaging applications [22]. It makes a decision based on the majority vote of many individual decision trees that are trained on predictive features [23]. Random forest classifiers have demonstrated promising results for computer aided breast cancer diagnosis utilizing US [24], mammogram [25] and biopsy data [26].
In this study, we investigate a two-stage diagnostic strategy for clinically managing breast cancer. The first stage seeks to identify benign lesions in near real-time, based on radiologists' US scores and DOT measurements in the form of perturbation data that have not undergone image reconstruction. This is accomplished with a random forest classifier. Intermediate lesions that cannot be identified as benign with high confidence are flagged, and functional images are subsequently reconstructed off-line from the DOT measurements. In the second stage of the diagnostic strategy, features are extracted from the reconstructed DOT images, and a Support Vector Machine (SVM) classifier is employed for diagnosis. The proposed diagnostic strategy showed a significant improvement over diagnosis based on DOT functional features and US alone, increasing the area under the ROC curve (AUC) from 0.892 to 0.937. To the best of our knowledge, this is the first time a two-step automated diagnostic strategy has been proposed with near real-time assessment capability for the majority of benign lesions.

Patients and ultrasound BI-RADS grading
A total of 188 patients were studied to evaluate the proposed diagnostic scheme. Based on biopsy results, 47 patients had malignant lesions (mean age 59 years; range 34-94 years) and 141 had benign lesions (mean age 48 years; range 17-82 years). Of the benign lesions, 32% were fibroadenomas, 24% fibrocystic changes, 9% fat necrosis/inflammatory changes, 14% proliferative lesions, 17% complex cysts, and 4% lymph nodes and breast tissue. Of the malignant lesions, 68% were stage 1 cancers, 26% were stage 2 to 4 cancers, and 6% were DCIS. The clinical study was approved by the local Institutional Review Board and was compliant with the Health Insurance Portability and Accountability Act. Informed consent was obtained from each patient. Data used in this study were obtained from an earlier study, and patient data were de-identified [21].
For each lesion, a sequence of US images was obtained and retrospectively reviewed by two radiologists who were blinded to the optical results and final diagnosis. The lesions were graded using the Breast Imaging Reporting and Data System (BI-RADS) for US. Each lesion was given one of four grades, 4A, 4B, 4C or 5, based on the level of suspicion of malignancy. BI-RADS 4A refers to ≤10% likelihood of malignancy, while 4B, 4C and 5 denote 10% to 50%, 50% to 95% and ≥95% likelihood of malignancy, respectively [6]. In the classification process, the BI-RADS grades (4A to 5) were encoded as numbers from 0 to 3 in order of increasing suspicion, i.e., 0 for 4A, 1 for 4B, 2 for 4C, and 3 for 5. The numerical scores from the two radiologists were used in the random forest classifier as two features in addition to the 12 perturbation features introduced later.
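The ordinal encoding described above can be sketched in a few lines of Python. The function name `encode_birads` and the example grades are illustrative assumptions, not part of the original study code.

```python
# Ordinal encoding of US BI-RADS grades as described in the text:
# 4A -> 0, 4B -> 1, 4C -> 2, 5 -> 3 (increasing suspicion).
BIRADS_TO_SCORE = {"4A": 0, "4B": 1, "4C": 2, "5": 3}


def encode_birads(grade_reader_1, grade_reader_2):
    """Return the two radiologists' grades as two numeric features.

    Hypothetical helper: the name and signature are assumptions.
    """
    return [
        BIRADS_TO_SCORE[grade_reader_1.upper()],
        BIRADS_TO_SCORE[grade_reader_2.upper()],
    ]


# Hypothetical lesion graded 4B by one reader and 4C by the other.
features = encode_birads("4b", "4C")
print(features)
```

These two values would then be appended to the perturbation features to form the classifier input.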

DOT system
The US-guided frequency-domain DOT system consisted of a hand-held probe with 9 optical sources and 10 parallel detectors, with a US transducer located in the middle [27]. Four laser diodes with wavelengths of 740, 780, 808 and 830 nm were sequentially switched by 4×1 and 1×9 optical switches to deliver light modulated at 140 MHz to 9 different source positions on the probe. Backscattered light was collected via 10 light guides on the probe. The output of each detector channel was further amplified, sampled by an analog-to-digital converter (ADC), and stored in a PC. Frequency and phase measurements were extracted from the detected signal using the Hilbert transform. Each data acquisition took 2-3 seconds. Multiple data sets were acquired at the lesion area and at the mirror position on the contralateral normal breast, referred to as the reference breast. The entire data acquisition took ∼5 minutes to complete, and the perturbation data were calculated immediately afterwards.

DOT measurements and perturbation features
The DOT perturbation, $U_{sc}$, is defined as the normalized difference between the lesion and reference measurements, and is related to the differential absorption of the lesion and the reference normal tissue. For the $i$th source-detector pair,

$$U_{sc}(i) = \frac{U_l(i) - U_r(i)}{U_r(i)} = \frac{A_l(i)}{A_r(i)}\, e^{j\left(\phi_l(i) - \phi_r(i)\right)} - 1, \qquad i = 1, \dots, m, \tag{1}$$

where $m$ is the number of measurements, and $U_l(i) = A_l(i)e^{j\phi_l(i)}$ and $U_r(i) = A_r(i)e^{j\phi_r(i)}$ are the lesion and reference measurements, respectively. A two-dimensional representation of the DOT perturbation measurements is shown in Fig. 1(a) for a benign lesion and in Fig. 1(b) for a malignant lesion. The unit circle represents the expected boundary for perturbation data. The convex hull, or envelope, of the data distribution is marked by a black polygon. For benign lesions, the perturbation is skewed toward the positive real axis or evenly distributed around both the positive and negative real axes, whereas the perturbation for a malignant lesion is skewed toward the negative real axis because the high absorption of cancer lowers the ratio $A_l(i)/A_r(i)$ in Eq. (1) [28]. This difference in data distribution is quantified by features extracted from the perturbation that are useful in differentiating benign from malignant lesions.
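As a minimal sketch of the perturbation calculation, assuming the normalized difference takes the form (U_l − U_r)/U_r, the following Python computes one complex value per source-detector pair. The amplitudes and phases are synthetic illustration values, not measurements from the study.

```python
import cmath


def perturbation(a_l, phi_l, a_r, phi_r):
    """Normalized difference (U_l - U_r) / U_r per source-detector pair.

    Assumes U = A * exp(j * phi) for both lesion and reference data.
    """
    u_sc = []
    for al, pl, ar, pr in zip(a_l, phi_l, a_r, phi_r):
        u_l = al * cmath.exp(1j * pl)   # lesion-side measurement
        u_r = ar * cmath.exp(1j * pr)   # reference-side measurement
        u_sc.append((u_l - u_r) / u_r)
    return u_sc


# Synthetic example: the lesion side is more absorbing, so A_l < A_r.
a_r, phi_r = [1.0, 0.8, 0.5], [0.10, 0.20, 0.30]
a_l, phi_l = [0.6, 0.5, 0.4], [0.10, 0.25, 0.30]
u_sc = perturbation(a_l, phi_l, a_r, phi_r)
for value in u_sc:
    print(round(value.real, 3), round(value.imag, 3))
```

With lower lesion amplitudes the real parts come out negative, which matches the skew toward the negative real axis described for malignant lesions.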
Two sets of DOT data features were extracted from the perturbation measurements: morphological features from the convex hull of the data distribution, and histogram features. Four features were extracted from the convex hull: the area, perimeter, moment of inertia, and centroidal polar moment. The moment of inertia, $I_m$, is a quantitative measure of the resistance of an object to angular acceleration. The centroidal polar moment, $I_p$, denotes the resistance of the object to torsion or twisting. They are defined as

$$I_m = \int r^2 \, dm, \qquad I_p = \int r^2 \, dA, \tag{2}$$

where $dm$ and $dA$ are the differential mass and area elements, respectively, and $r$ is the distance from the axis of rotation to these elements. Additional details are provided in Appendix A. For each lesion, all measurements were compiled to generate two separate univariate histograms of the real and imaginary perturbations. A representative example of a univariate histogram for a benign lesion is shown in Fig. 2(a) for the real perturbation and Fig. 2(b) for the imaginary perturbation. The corresponding histograms for a malignant lesion are shown in Fig. 2(d) and Fig. 2(e). From each histogram, six features (the mean, standard deviation, skewness, kurtosis, energy, and entropy) were extracted, giving 12 features in total from the two univariate histograms. The real and imaginary perturbations were also used together to obtain a bivariate histogram, as shown in Fig. 2(c) for a benign case and Fig. 2(f) for a malignant case. Four features (the mean distance from the centroid, the standard deviation of the distance from the centroid, multivariate skewness, and multivariate kurtosis) were calculated from each bivariate histogram. A two-tailed t-test was performed for each feature to calculate a p-value, which is an estimate of the predictive capability of that feature.
All features were ranked in ascending order of p-value, and features with p-values less than 0.05 were used in the classification. A total of 12 features were found to be significant and were used in the random forest classifier described below.
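The convex-hull and univariate histogram features described in this section can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the study's implementation: the hull is built with Andrew's monotone chain, the polar moment is taken about the hull centroid using the standard polygon formulas, energy is taken as the sum of squared bin probabilities, entropy is in bits, and the bin count is arbitrary. The mass moment of inertia is omitted because its exact definition is deferred to Appendix A.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_features(points):
    """Area, perimeter and centroidal polar moment of the convex hull.

    Requires a non-degenerate hull (non-zero area).
    """
    hull = convex_hull(points)
    area2 = cx = cy = ixx = iyy = perimeter = 0.0
    for (x0, y0), (x1, y1) in zip(hull, hull[1:] + hull[:1]):
        c = x0 * y1 - x1 * y0
        area2 += c
        cx += (x0 + x1) * c
        cy += (y0 + y1) * c
        ixx += (y0 * y0 + y0 * y1 + y1 * y1) * c
        iyy += (x0 * x0 + x0 * x1 + x1 * x1) * c
        perimeter += math.hypot(x1 - x0, y1 - y0)
    area = area2 / 2.0
    cx, cy = cx / (6.0 * area), cy / (6.0 * area)
    polar_about_origin = (ixx + iyy) / 12.0
    # Parallel-axis shift from the origin to the hull centroid.
    polar = polar_about_origin - area * (cx * cx + cy * cy)
    return area, perimeter, polar


def histogram_features(values, n_bins=10):
    """Six univariate features; requires at least two distinct values."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    skewness = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    kurtosis = sum((v - mean) ** 4 for v in values) / (n * std ** 4)  # non-excess
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    probs = [c / n for c in counts]
    energy = sum(p * p for p in probs)
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return {"mean": mean, "std": std, "skewness": skewness,
            "kurtosis": kurtosis, "energy": energy, "entropy": entropy}


# Synthetic perturbation samples as (real, imaginary) pairs; not patient data.
samples = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
area, perimeter, polar = hull_features(samples)
real_features = histogram_features([x for x, _ in samples], n_bins=2)
print(area, perimeter, polar, real_features)
```

For a uniform-density hull the mass moment of inertia is proportional to the polar moment, so whether the two features carry distinct information depends on the weighting defined in the appendix.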

Random forest classifier
A random forest is an ensemble of decision tree classifiers in which each decision tree independently casts a vote for a class based on a randomly chosen subset of all features. The final outcome of the forest is the majority vote of all the trees. In this study, a total of 14 features, comprising 12 perturbation features and the two sets of US BI-RADS scores from the two radiologists, were used for classification. The random forest classifier consisted of 50 classification and regression trees (CART). Each tree works on 6 features randomly selected from the 14. Information gain is used to find the best split at each decision tree node. Decision trees can also safely handle correlated features: once a feature has been used to split the samples, the information gain on the split samples in the child node is lower for correlated features [29][30]. Random forest classifiers can also offer an attractive bias-variance trade-off if suitably configured. To realize this, we limited the depth of each decision tree to five and set the minimum number of samples required to split a node to four. To simplify the random forest classifier, a Bonferroni correction could be applied to the group of features extracted from the same data set (for example, the histogram features) to obtain corrected p-values and reevaluate the significance of each feature. However, this approach may increase the false negative rate of feature selection and miss important features [31]. For decision trees in a random forest classifier, the optimal feature at each node is selected based on the information gain between that node and its child nodes, so a feature that is not informative will not be selected. The open-source Python machine learning library scikit-learn was used to build and train the random forest classifier.
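A configuration along these lines might look as follows in scikit-learn. The mapping of the description onto constructor arguments is an assumption: `criterion="entropy"` stands in for information gain, and `max_features=6` limits the features considered at each split, which is how scikit-learn implements feature subsampling and may differ from a strictly per-tree subset. The training data are random stand-ins, not study data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 64 training lesions x 14 features
# (12 perturbation features + 2 BI-RADS scores in the paper).
X_train = rng.normal(size=(64, 14))
y_train = np.array([0] * 32 + [1] * 32)   # 0 = benign, 1 = malignant
X_train[y_train == 1, 0] -= 1.5           # give the toy classes some separation

forest = RandomForestClassifier(
    n_estimators=50,        # 50 trees
    max_features=6,         # 6 of the 14 features considered per split
    criterion="entropy",    # entropy-based information gain
    max_depth=5,            # tree depth limited to five
    min_samples_split=4,    # at least four samples to split a node
    random_state=0,
)
forest.fit(X_train, y_train)

# Hard per-tree votes for the benign class (class index 0), one count per lesion.
benign_votes = sum(
    (tree.predict(X_train) == 0).astype(int) for tree in forest.estimators_
)
print(benign_votes[:5])
```

Counting hard votes per tree, rather than calling `forest.predict`, exposes the vote totals that a vote-count threshold would operate on.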

Image reconstruction and functional feature extraction
In DOT image reconstruction, the 3-dimensional breast volume to be reconstructed, underneath the 10 cm probe, is represented by voxels, with finer voxels within the lesion area identified by the co-registered US image and coarser voxels in the background region [32]. Fitted optical properties from the contralateral reference breast measurements are used to calculate a weight matrix $W$ (Eq. (3)) for the voxels. The total absorption of each voxel is reconstructed and then divided by the voxel volume to obtain the differential optical absorption coefficient, $\delta\mu_a$. The inverse problem is linearized by use of the Born approximation to obtain a linear equation relating the changes in the optical absorption coefficients to the perturbation measurements, $U_{sc}$:

$$[U_{sc}]_{m \times 1} = [W_L,\; W_B]_{m \times n} \begin{bmatrix} \delta\mu_{aL} \\ \delta\mu_{aB} \end{bmatrix}_{n \times 1}, \tag{3}$$

where $W_L$ and $W_B$ are the voxel weights in the lesion and background, respectively; $\delta\mu_{aL}$ and $\delta\mu_{aB}$ are the unknown optical properties of voxels in the lesion and background, respectively; and $n$ is the total number of voxels to be reconstructed. The optical absorption coefficients were reconstructed by solving an L2-regularized unconstrained optimization problem using the conjugate gradient method [33]:

$$\widehat{\delta\mu_a} = \arg\min_{\delta\mu_a} \left( \left\| U_{sc} - W\,\delta\mu_a \right\|_2^2 + \lambda \left\| \delta\mu_a \right\|_2^2 \right), \tag{4}$$

where $\lambda$ is the regularization parameter, which is proportional to the tumor size and the largest singular value of the weight matrix $W$. Oxy- and deoxy-hemoglobin concentrations, $C_{HbO_2}$ and $C_{Hb}$, were calculated from the four-wavelength absorption maps using the extinction coefficients, $\varepsilon$, of the two chromophores at each wavelength. Functional features were extracted from the reconstructed hemoglobin maps. Three features were calculated from all lesion images: $C_{HbO_2}$, $C_{Hb}$, and total hemoglobin $C_{tHb}$. Since $C_{tHb}$ is the sum of $C_{HbO_2}$ and $C_{Hb}$, only two of these features are independent. In general, $C_{Hb}$ is much lower than $C_{HbO_2}$ for breast lesions and is therefore less robust in computation, so we chose $C_{tHb}$ and $C_{HbO_2}$ for classification. The light shadowing effect was also used as a functional imaging feature [34].
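The step from four-wavelength absorption values to hemoglobin concentrations can be illustrated with a small least-squares fit. This sketch assumes the usual two-chromophore linear model, in which the absorption at each wavelength is the extinction-weighted sum of the two concentrations, and ignores any constant factors or units. The extinction values below are made-up placeholders chosen only so the example runs; they are not tabulated coefficients and must be replaced with published values in any real use.

```python
def unmix_hemoglobin(mu_a, eps_hbo2, eps_hb):
    """Least-squares fit of mu_a[k] = eps_hbo2[k]*C_HbO2 + eps_hb[k]*C_Hb.

    Solves the 2x2 normal equations directly; inputs are per-wavelength lists.
    """
    s11 = sum(e * e for e in eps_hbo2)
    s22 = sum(e * e for e in eps_hb)
    s12 = sum(a * b for a, b in zip(eps_hbo2, eps_hb))
    b1 = sum(e * m for e, m in zip(eps_hbo2, mu_a))
    b2 = sum(e * m for e, m in zip(eps_hb, mu_a))
    det = s11 * s22 - s12 * s12
    c_hbo2 = (s22 * b1 - s12 * b2) / det
    c_hb = (s11 * b2 - s12 * b1) / det
    return c_hbo2, c_hb, c_hbo2 + c_hb   # last value is total hemoglobin


# PLACEHOLDER extinction values for 740, 780, 808 and 830 nm (arbitrary units).
# Illustration only; not real coefficients.
eps_hbo2 = [0.4, 0.7, 0.9, 1.0]
eps_hb = [1.3, 1.0, 0.8, 0.7]

# Synthetic voxel: generate absorption from known concentrations, then recover.
true_hbo2, true_hb = 60.0, 20.0
mu_a = [eo * true_hbo2 + eh * true_hb for eo, eh in zip(eps_hbo2, eps_hb)]
c_hbo2, c_hb, c_thb = unmix_hemoglobin(mu_a, eps_hbo2, eps_hb)
print(c_hbo2, c_hb, c_thb)
```

With four wavelengths and two unknowns the system is overdetermined, which is why a least-squares fit is used here rather than a direct inversion.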
In reflection geometry, large tumors are more likely to show a light shadow effect in reconstructed DOT images because more photons are absorbed by the top portion of the tumor and fewer photons penetrate deeper. This reduces the signal-to-noise ratio received at source-detector pairs with longer separations and therefore yields low reconstructed absorption values in deeper layers. To quantify the light shadow effect, the shadow parameter was calculated as the ratio of the average $C_{tHb}$ in the topmost layer in depth to the average $C_{tHb}$ of the underlying layers. Examples of a malignant lesion (a)-(b) and a benign lesion (c)-(d) are given in Fig. 3 to demonstrate the light shadow effect. The shadow parameter of the malignant lesion is 4.52, and that of the benign lesion is 2.12. The three functional features used by the SVM classifier in the second step of diagnosis are therefore $C_{tHb}$, $C_{HbO_2}$ and the light shadow parameter.
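The shadow parameter can be sketched as a simple ratio of layer averages. One detail is an assumption here: the denominator is taken as the mean of the per-layer averages of the deeper layers, whereas the study may have pooled all deeper voxels instead. The layer values are synthetic.

```python
def shadow_parameter(thb_layers):
    """Ratio of mean tHb in the top layer to mean tHb of the deeper layers.

    thb_layers: list of depth layers (top first); each layer is a list of
    voxel total-hemoglobin values. Needs at least two layers.
    """
    layer_means = [sum(layer) / len(layer) for layer in thb_layers]
    top = layer_means[0]
    deeper = sum(layer_means[1:]) / len(layer_means[1:])
    return top / deeper


# Synthetic three-layer lesion: strong top layer, weak deeper layers.
layers = [[80.0, 100.0], [30.0, 30.0], [10.0, 10.0]]
ratio = shadow_parameter(layers)
print(ratio)
```

A larger ratio indicates stronger shadowing, in line with the higher value reported for the malignant example than for the benign one.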

Support vector machine
The Support Vector Machine (SVM) is a widely used binary classifier that finds the optimal hyperplane separating two classes by maximizing the margin between them. In this study, we use a linear SVM. For a collection of feature vectors $\{x_i\}$ and associated class labels $y_i \in \{-1, +1\}$, we find the optimal hyperplane $w^T x + b = 0$, where $w$ is the weight vector and $b$ is the bias term. The following optimization problem is solved to find $w$:

$$\min_{w,\, b} \; \frac{1}{2} w^T w + C \sum_{i} \max\left(0,\; 1 - y_i \left(w^T x_i + b\right)\right), \tag{6}$$

where $C$ is the regularization parameter. The open-source Python machine learning library scikit-learn was used to build and train the SVM classifier.
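To make the linear SVM objective concrete, the sketch below evaluates the standard soft-margin (hinge-loss) objective for a given hyperplane. It assumes that form of the objective; the function names and the toy data are illustrative only.

```python
def decision_value(w, b, x):
    """Signed score w^T x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_objective(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses, with labels in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - yi * decision_value(w, b, xi)) for xi, yi in zip(X, y)
    )
    return margin_term + C * hinge


# Toy 2-D data: two points on the correct side with margin >= 1,
# and one point of class -1 on the wrong side of the hyperplane.
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, -1]
value = svm_objective([1.0, 0.0], 0.0, X, y, C=1.0)
print(value)
```

Only the misplaced third point contributes hinge loss, which shows how `C` trades margin width against training errors.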

Two-step classification
Breast lesions were diagnosed in two steps. Immediately after data acquisition, perturbation features were extracted and US BI-RADS scores were obtained from the radiologists. These perturbation features and BI-RADS scores were used in a random forest classifier to identify lesions having a high probability of being benign. The number of decision tree votes required to declare a lesion benign was set very high so that the false negative rate in the near real-time assessment would be very small or zero. Two-thirds of the malignant samples and the same number of benign samples were used for training, and the rest were used for testing. Recall that our sample set contained 47 malignant cases and 141 benign cases. The training set therefore comprised 32 malignant and 32 benign cases, and the test set comprised the remaining 15 malignant and 109 benign cases. For the second-step SVM classifier, all 32 malignant cases were again used in training; for benign cases, however, only lesions with intermediate malignancy probability that were not filtered out in the first step were used in training. Again, hyperparameters were selected by 5-fold cross-validation performance. The test set of 15 malignant and 109 benign cases was not employed for training or validation. Thus, the test data were unseen by both the random forest and SVM classifiers. This entire two-step process was repeated 20 times with different random train-test splits, as illustrated in Fig. 4. Note that the chart in Fig. 4 shows the workflow used to reanalyze previously collected data in this study. In the first step of training, the perturbation features and US BI-RADS scores of all training samples are used to train a random forest classifier. Each decision tree outputs a binary decision of either benign or malignant for each training sample. In general, if more than half of the decision trees in the forest provide a benign decision, that sample is assumed to be benign.
However, in this classification scheme, the threshold on the number of decision tree votes needed to declare a lesion benign was set as high as necessary to avoid false negatives in the first step. A greedy search was applied to find the threshold. Initially the threshold, i.e., the number of votes required to declare a lesion benign, was set to the total number of decision trees, which is 50. The threshold was then decreased in steps of 1 as long as there were no false negatives. In this way, the minimum threshold that provided 100% training sensitivity in the first step was obtained. In testing, a sample was classified as 'confirmed benign' in the first step only when the number of trees voting 'benign' was greater than or equal to the threshold. In the second step of diagnosis, image reconstruction was performed to obtain hemoglobin maps for the remaining samples. The maximum total hemoglobin, maximum oxy-hemoglobin and light shadow parameter were extracted from the maps. These functional features were then used to classify the rest of the samples using the SVM classifier.
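The greedy threshold search described above can be sketched as follows. The function name, argument layout and vote counts are illustrative assumptions; the logic starts at the full number of trees and lowers the threshold one vote at a time while no malignant training sample would be called benign.

```python
def find_vote_threshold(benign_votes, labels, n_trees=50):
    """Lowest benign-vote threshold giving no false negatives in training.

    benign_votes[i]: number of trees voting 'benign' for training sample i.
    labels[i]: 0 for benign, 1 for malignant.
    If a malignant sample already receives n_trees benign votes, the
    search cannot do better and n_trees is returned.
    """
    threshold = n_trees
    while threshold > 1:
        candidate = threshold - 1
        creates_false_negative = any(
            votes >= candidate and label == 1
            for votes, label in zip(benign_votes, labels)
        )
        if creates_false_negative:
            break
        threshold = candidate
    return threshold


# Synthetic training votes: three benign lesions, two malignant lesions.
votes = [50, 48, 30, 12, 5]
labels = [0, 0, 0, 1, 1]
threshold = find_vote_threshold(votes, labels)
confirmed_benign = [v >= threshold for v in votes]
print(threshold, confirmed_benign)
```

In this toy case the search stops one vote above the highest benign vote count received by any malignant training sample.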
The objective function for the random forest is the misclassification cost. If the true label is $y_i$ and the predicted label is $\hat{y}_i$ for the $i$th sample, the objective function is defined in terms of the following quantities: $n$ is the number of samples, $x_{i,\mathrm{split}}$ is the split value used for splitting the parent node on feature $x_i$, and $w_{y_i \hat{y}_i}$ is the misclassification cost of a sample with true label $y_i$ that is predicted as $\hat{y}_i$. A detailed explanation of the misclassification cost and the minimization of the cost function can be found in [35]. For the second step using the SVM, the objective function is given in Eq. (6). In the first step, the random forest classifier filters out benign cases with low malignancy probabilities.
In the second step, the SVM classifies the remaining cases. The trade-off between these two steps is controlled by the threshold on the number of decision tree votes needed to declare a lesion benign in the first step. If the threshold is high, only a few benign lesions can be classified in the first step, and its false negative rate is lower. If the threshold is low, more benign lesions can be identified, but the false negative rate can be higher. Note that the threshold on the number of decision tree votes is not fixed: for each random train-test split, it was chosen to minimize false negatives in the training data.
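At test time, the routing between the two steps can be sketched as below. The function and field names are hypothetical, the second stage is represented by a callable so that reconstruction only runs when needed, and the 0.5 cutoff on the second-stage probability is an assumption for illustration.

```python
def two_step_diagnose(benign_votes, vote_threshold, second_stage):
    """Route one lesion through the two-step scheme.

    benign_votes: number of trees voting 'benign' for this lesion.
    vote_threshold: threshold found on the training data.
    second_stage: zero-argument callable standing in for image
        reconstruction plus SVM; returns a probability of malignancy.
    """
    if benign_votes >= vote_threshold:
        # Step 1: confirmed benign in near real-time, no reconstruction.
        return {"step": 1, "label": "benign", "p_malignant": None}
    p = second_stage()
    label = "malignant" if p >= 0.5 else "benign"
    return {"step": 2, "label": label, "p_malignant": p}


# Hypothetical lesions with a training-derived threshold of 40 votes.
fast = two_step_diagnose(45, 40, lambda: 0.8)
slow = two_step_diagnose(20, 40, lambda: 0.8)
print(fast)
print(slow)
```

Raising `vote_threshold` sends more lesions to the slower second stage, which is the trade-off discussed above.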
Finally, to evaluate and visualize the importance of each feature in both steps of diagnosis, we used a random forest to calculate the feature importance for all features, including the perturbation features, functional features and US BI-RADS scores. In the random forest, the importance of each feature is the average information gain attributable to that feature across all the trees [36]. The 15 most important features found by this method are shown in Fig. 5. The importances are normalized by their sum, so they add up to 1. In terms of computation speed, for 20 runs of random train-test splits, random forest classification takes approximately 5 to 6 seconds, while the SVM takes only 1 to 1.5 seconds. The computation was performed on an Intel Core i5 3.0 GHz CPU with 8 GB RAM.

Performance evaluation
To evaluate the performance of the classification algorithms, we computed the probability of malignancy from the respective classifier for each sample in the test set. The receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) were used as the performance measures. Specifically, the threshold on the malignancy probability used to assign the malignant or benign class is varied to produce the ROC curve. For classification using the SVM, the probability is obtained from the distance of a test sample from the optimal hyperplane: the distance is passed as an argument to a sigmoid function to obtain a probability estimate. The probability threshold is then varied from 0 to 1 to generate true positive and false positive rates and obtain a ROC curve. For the random forest, the probability of a prediction is related to the proportion of decision trees that voted for the predicted label. In the two-step approach, the first step yields only true negatives and false negatives, with no true positives or false positives. The true positives and false positives generated in the second step, together with the sums of the true negatives and false negatives from both steps, are used to generate a combined ROC curve.
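The ROC construction described above can be sketched in plain Python. The sigmoid parameters `a` and `b` are hypothetical placeholders (in practice they would be fitted on training data), ties in score are handled together, and the AUC uses the trapezoidal rule. The scores and labels are synthetic.

```python
import math


def sigmoid_probability(distance, a=1.0, b=0.0):
    """Map a signed hyperplane distance to a probability estimate."""
    return 1.0 / (1.0 + math.exp(-(a * distance + b)))


def roc_curve(scores, labels):
    """Sweep the threshold over descending scores; labels are 0 or 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Consume all samples sharing this score before emitting a point.
        while i < len(order) and scores[order[i]] == current:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k + 1] - fpr[k]) * (tpr[k + 1] + tpr[k]) / 2.0
        for k in range(len(fpr) - 1)
    )


# Synthetic malignancy probabilities and ground truth (1 = malignant).
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
fpr, tpr = roc_curve(scores, labels)
area = auc(fpr, tpr)
print(fpr, tpr, area)
```

One possible way to fold first-step decisions into the same curve is to give confirmed-benign lesions a score of 0, so that they count as negatives at every positive threshold; this is an illustrative choice, not necessarily how the combined ROC was computed in the study.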
Twenty runs with different random train-test splits were performed for each classifier. The mean AUC denoted how well the classifier could separate benign and malignant classes and the standard deviation indicated the robustness of the classifier for varying training and testing data sets. Sensitivity and specificity were calculated at a threshold of 0.5 from the mean ROC curve.
To evaluate the radiologists' performance, the sensitivity and specificity were calculated based on BI-RADS scores: 4A and 4B were grouped as benign, and 4C and 5 as malignant.
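The grouping rule above translates directly into a sensitivity and specificity calculation. The following sketch uses made-up grades and biopsy outcomes purely for illustration; the function name is an assumption.

```python
# Grades treated as a malignant call; 4A and 4B are treated as benign.
MALIGNANT_GRADES = {"4C", "5"}


def sensitivity_specificity(grades, truth):
    """grades: BI-RADS strings; truth: 1 for malignant, 0 for benign."""
    tp = fn = tn = fp = 0
    for grade, is_malignant in zip(grades, truth):
        called_malignant = grade in MALIGNANT_GRADES
        if is_malignant and called_malignant:
            tp += 1
        elif is_malignant:
            fn += 1
        elif called_malignant:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical reader grades and biopsy results for six lesions.
grades = ["4A", "4B", "4C", "5", "4B", "4C"]
truth = [0, 0, 0, 1, 1, 1]
sens, spec = sensitivity_specificity(grades, truth)
print(sens, spec)
```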

Perturbation feature selection and random forest classifier
A total of 20 features were extracted from the perturbation data; box plots and p-values of all features are shown in Fig. 6. The first 12 features are statistically significant and were used in the first-step random forest classifier. The random forest classifier identified 64.8% (±4.7%) of the benign lesions, about 70 of the 109 benign test cases on average, with a 1.9% false negative rate in testing. Note that lesions with low malignancy probability were filtered out and diagnosed as benign, so there were no true positives or false positives in this step.

Clinical study results
The BI-RADS scores from both radiologists are shown in Table 1. Using BI-RADS scores only, and grouping 4A and 4B as benign and 4C and 5 as malignant, the sensitivities for radiologists I and II were 70.9% (±0.3%) and 85.6% (±0.2%), and the specificities were 90.8% (±2.2%) and 63.5% (±2.4%), respectively. The ROC curves for radiologists I and II are shown in Fig. 7(a) and 7(b), with AUC values of 0.848 ± 0.003 and 0.783 ± 0.031, respectively. The blue curve and the light blue shading denote the mean and standard deviation of the 20 ROC curves obtained from 20 runs. Using functional features only in the SVM classifier, the AUC was 0.781 ± 0.048, as shown in Fig. 7(c), with a sensitivity of 82.5% (±4.2%) and specificity of 72.9% (±1.0%). Using BI-RADS scores along with functional features in the SVM classifier improved the AUC to 0.892 ± 0.027 (Fig. 7(d)), with a sensitivity of 90.2% (±1.9%) and specificity of 74.5% (±1.3%).
The proposed two-step diagnosis significantly improved the AUC to 0.937 ± 0.009 (Fig. 7(e)), with a sensitivity of 91.4% (±0.6%) and specificity of 85.7% (±0.8%). In the first step of the two-step method, 64.8% of the benign samples were classified as benign by the random forest classifier. Even though a zero false negative rate was enforced in training, 1.9% (±0.6%) of malignant samples were misclassified as benign in the first step in testing.
It is interesting to compare the performance of the aforementioned ROC curves with the ROC curve of random forest with all features included in one step. As shown in Fig. 7(f), the AUC was 0.923 ± 0.020, which is statistically the same as the two-step classification (see 3.3). The sensitivity and specificity were 90.3 ± 0.7% and 82.1 ± 1.2%, respectively. However, the proposed two-step approach provides near real-time diagnosis capability. AUCs for all different diagnostic schemes are summarized in Table 2.

Comparison of ROC curves
To compare the performance of the different diagnostic methods, we used DeLong's method to compare their ROC curves. An open-source MATLAB software package written by Sun et al. was used [36]. The p-values obtained are given in Table 3. As seen from the table, the performance of functional features combined with US BI-RADS scores is statistically significantly better than that of functional features alone. The proposed two-step method also performs significantly better than functional features combined with US BI-RADS scores. The proposed two-step diagnostic method is statistically equivalent to using all the features combined; however, the advantage of the two-step method is the near real-time diagnosis and immediate recommendation for patients who have benign lesions.

Discussion and summary
In summary, a novel breast cancer diagnostic strategy based on two-step classification was proposed and validated with a large pool of patient data. This strategy involves near real-time automated assessment using a random forest classifier to filter out highly probable benign lesions based on perturbation data and US BI-RADS scores. Lesions that cannot be identified as benign with high confidence are flagged, and their functional images are subsequently reconstructed off-line from the corresponding DOT measurements. In the second stage of the diagnostic strategy, features are extracted from the reconstructed DOT images, and a Support Vector Machine (SVM) classifier is employed for diagnosis. Functional feature extraction can take up to two hours, including manual US image segmentation and optical image reconstruction with artifact evaluation. However, these steps are critical for high diagnostic accuracy. The random forest classifier reliably identified more than half of the benign lesions in near real-time, shortly after the perturbation features were extracted and the radiologists' BI-RADS scores were available. In practice, BI-RADS scores are typically available within a few minutes after the patient exam. Such rapid diagnosis helps advance clinical management by identifying highly probable benign lesions and allowing physicians to comfortably recommend follow-up instead of biopsy or surgical removal of the lesions. Additionally, US BI-RADS evaluation is highly dependent on the radiologist's experience, whereas the random forest classifier combines sensitive perturbation data with the BI-RADS scores to provide an improved diagnosis over that of a radiologist alone.
The two-step diagnosis scheme improves the specificity of a breast cancer diagnosis over a diagnosis based on the BI-RADS score and DOT-derived functional parameters only. This improvement is due to the diagnosis of highly probable benign lesions by the random forest classifier. A lower standard deviation across multiple cross-validations indicated this approach is very robust to different training and testing datasets and hence more reliable. Introducing perturbation features in the first step improved the overall diagnostic performance and facilitated better clinical management of the benign lesions to reduce unnecessary biopsies. Although a hemoglobin map is reconstructed from perturbation data, the tumor size and location provided by co-registered US and the breast tissue optical background properties are also used in the reconstruction process. The tumor size and location define the fine mesh area and location, and the background optical properties are used to calculate the weight matrix. Thus, for similar perturbation data, the reconstructed functional features can be different for different background properties, and lesion dimensions and locations. Our results suggest that this additional information employed when reconstructing functional features is valuable to further differentiate benign and malignant lesions.
For large benign lesions, even if the absorption coefficient is high, the hemoglobin concentration map shows less light shadowing and a more uniform distribution in depth, which is critical in differentiating large benign lesions from malignant tumors. For low-grade carcinomas (14.63% in this study), the detection sensitivity of DOT can be lower because of the low level of tumor angiogenesis; however, the distorted tumor morphology evaluated by US BI-RADS is very helpful in improving diagnosis. Additionally, certain types of fibroadenomas are vascularized and present as false positives to DOT; however, their well-circumscribed morphology in the US image can help rule out malignancy.
This study has the limitation that radiologists' evaluations were done on stationary ultrasound images. Real time assessment of ultrasound images while examining the patient may improve the overall diagnostic performance. Additionally, with other diagnostic information, such as mammograms and patient family history, the overall diagnostic performance can be further improved. In current practice, suspicious lesions found from x-ray mammograms are referred to ultrasound for targeted examination of the lesion regions. Based on real-time ultrasound, the attending radiologist will provide BI-RADS score and make a recommendation of follow-up or biopsy immediately after the US exam. Our proposed study flow will immediately provide recommendation based on the optimal strategy reported in this manuscript which can minimize the biopsy recommendations for a large portion of benign lesions. This is a direction that we are pursuing in on-going clinical studies.
The majority of the biopsy patient population has benign findings, and the benign-to-malignant ratio is 3:1 in this study. We chose balanced data sets for training because a bias in the training set can increase the false negative rate. With a biased training set, the classifier could be biased toward the majority class, which is benign findings, and there would be more false negatives in the testing results. We used majority-class undersampling [37] to reduce the training bias. The only problem with bias in the testing set is that the accuracy of the classifier will be skewed toward the majority class. Thus, we used AUC rather than accuracy to evaluate the proposed algorithms. The 1.9% false negative rate of the first-step random forest classifier warrants discussion. In the twenty runs of random train-test splits, two malignant cases were categorized as benign in some tests. One case is a small ductal carcinoma (5 mm measured by US) with low vascular content, and the other is a small mixed ductal and lobular carcinoma (3 mm measured by US) with low vascular content. Both lesions are intermediate grade, and both radiologists scored them as 4B. This type of false negative is difficult to avoid; however, including the x-ray mammogram reading could add another parameter to help identify this type of small tumor. Additionally, a 3 to 6 month follow-up recommendation would also allow a small window to monitor the development of this type of small tumor.
The proposed two-step diagnostic strategy, which employs a random forest classifier as a first step to filter out low-suspicion benign lesions during the patient's US exam, has great potential to streamline breast diagnostic workflows by suggesting short-term follow-up rather than biopsy. Based on a large patient pool, 64.8% of the benign lesions were identified by the first-step random forest classifier with a 1.9% false negative rate. The next step, using an SVM classifier that combines DOT total hemoglobin functional maps with other diagnostic image features, provides high overall performance in breast cancer diagnosis, with an AUC of 0.937. The reported two-step diagnostic strategy can be generalized to diffuse optical tomography guided by other modalities for the optimal management of breast cancer diagnosis.