Direct identification of breast cancer pathologies using blind separation of label-free localized reflectance measurements

Breast tumors are blindly identified using Principal Component Analysis (PCA) and Independent Component Analysis (ICA) of localized reflectance measurements. No particular theoretical model of the reflectance needs to be assumed, and the resulting features are shown to discriminate between breast pathologies. Normal, benign and malignant breast tissue types in lumpectomy specimens were imaged ex vivo, and a surgeon-guided calibration of the system is proposed to overcome the limitations of the blind analysis. A simple, fast, linear classifier is proposed that requires no training information for the diagnosis. A set of 29 breast tissue specimens was diagnosed with a sensitivity of 96% and a specificity of 95% when discriminating benign from malignant pathologies. The proposed hybrid PCA-ICA combination enhanced diagnostic discrimination and provided tumor probability maps, while intermediate PCA parameters reflected tissue optical properties.


Introduction
Breast cancer continues to be the most diagnosed cancer among women, comprising 23% of all female cancers. Small, non-invasive lesions detected at an early stage can, however, be treated successfully with breast conserving therapy (BCT), which includes local tumor excision followed by moderate-dose radiation therapy [1]. Early invasive breast cancers (stage I and stage II) nevertheless have a high risk of recurrence when residual disease is left at or near the cut edge. In fact, BCT has been demonstrated to be as effective as mastectomy only when no residual disease is left on the margins [2], thereby minimizing the need for a more radical therapy such as mastectomy. Despite its therapeutic predictive value, most studies report high variability in the number of patients treated with BCT who have residual disease, demonstrating a lack of standardization in margin delineation [3].
Light scattering spectroscopy has been applied broadly to identify residual disease in resected breast tissues by detecting changes in the scattering spectrum induced by morphological variations in the size and number density of cells and in the tissue extracellular matrix [4]. Natural heterogeneity in light scattering from tissue morphology has been observed, and its spatial distribution can be used to improve discrimination between tissue subtypes [5]. Consequently, the scanning spectroscopy system demonstrated in [6] was designed to be maximally sensitive to elastic scattering, although some partial coupling with hemoglobin absorption has been observed. Signal localization is employed in the illumination and detection paths to preserve the weakly scattered spectrum. Optical properties, namely the reduced scattering and absorption coefficients, are traditionally parameterized according to theoretical models of light scattering in turbid media. These models are valid only when specific physical conditions are met in the data acquisition geometry [7]. These conditions are sometimes impossible to fulfill completely, which yields uncertainty in the separation of absorption and scattering signatures. Typically, an analytical solution to the problem of diffuse reflectance from turbid media such as biological tissues exists only for idealized systems, such as a point source in a semi-infinite medium [8]. Models also assume light incidence on an optically homogeneous medium, which is only approximately true for biological tissues. Furthermore, single-fiber reflectance measurements do not accurately recover the photon pathlength, limiting absolute quantification of optical parameters [9]. A model-free approach would therefore be a great asset.
Blind Signal Separation (BSS) is a set of signal processing techniques able to decouple information arising from multiple sources. These techniques can therefore be employed to decouple the information generated by absorption and scattering in tissue for unique acquisition geometries. They have been used extensively for removing interference and noise and for feature extraction from optical signals [10][11][12][13]. Principal Component Analysis (PCA) performs a change of basis and finds new uncorrelated projections, while Independent Component Analysis (ICA) creates a new feature space of independent components, independence being a stronger statistical condition than uncorrelation [14]. The validity of ICA has been demonstrated previously in diverse scenarios. It has been shown to enhance classification features from mammograms for breast cancer detection [15] and has also been applied to other cancer types: maps of tumor probability have been extracted from the ICA of RGB fluorescence images taken from the skin [16]. PCA is typically employed to reduce data dimensionality and to enhance the performance of ICA algorithms [17]. PCA and ICA have been applied effectively to un-mix distinct exogenous fluorophores in multispectral opto-acoustic tomography data [10] and to remove or study blood absorbance in NIRS signals [11,12]. Nevertheless, the overall superiority of ICA over PCA remains to be proven [13].
Here, a feature-extraction method is presented to discriminate benign from malignant pathologies in resected breast tissue, and its diagnostic performance is validated against histology, the diagnostic gold standard. No analytical model is used to extract diagnostic components from the scattering spectrum. Instead, PCA is used to transform the spectral data into an uncorrelated feature space that reduces data dimensionality and eliminates cross talk between hemoglobin absorption and scattering signatures. Then, ICA optimized by PCA is used to predict the breast tissue subtype. Finally, the sign ambiguities associated with the BSS algorithm are resolved by a user-guided, soft calibration process.

Optical imaging data from breast tissues and the modeling of reflectance
Localized measurements of broadband reflectance from resected breast tissues were obtained from previous work [18], using a custom-built, quasi-confocal acquisition geometry [6]. This system separates weakly scattered from multiply scattered light by spatial confinement of the illumination and detection spot sizes (~100 µm). The system employs a broadband, fiber-coupled tungsten-halogen light source operating in the 510-785 nm spectral waveband. An optical fiber coupled to a CCD-based spectrometer was used for confocal spectroscopic detection. The spectral resolution of the system provides 512 spectral images for each sample.
Samples of freshly resected breast tissue acquired during breast conserving surgery were obtained directly from the Department of Pathology at Dartmouth-Hitchcock Medical Center when there was tissue in excess of that required to make a clinical diagnosis. Tissues were 1-2 cm² in area with a thickness of 3-5 mm. Immediately after each imaging procedure, each sample was formalin-fixed and paraffin-embedded, then stained with Hematoxylin and Eosin (H&E) for pathology correlation. In total, 29 resected tissues were imaged and, on each specimen, several regions of interest (ROIs) were further evaluated by a pathologist for precise co-registration with the optical maps. Overall, 48 different ROIs were identified; they were not uniform in size, with diameters from 500 µm to 0.2 cm. Tissues were characterized as benign, malignant or adipose, as summarized in Table 1.
An analytical solution that accurately describes the diffuse reflectance arising from turbid media such as biological tissues has not yet been demonstrated. In spite of this, under the spatial constraints of the system it is possible to model the measured backscattered reflectance with the aid of an empirical approximation. To compare reflectance modeling parameters with the PCA-ICA analysis, an empirical approximation validated in a previous study [18] was used as a reference for the blindly obtained results. This model is shown in Eq. (1): Here, A is the scattering amplitude, b is the scattering power, ρ the pathlength, $C_{HbT}$ the concentration of hemoglobin, and

Multivariate linear analysis and BSS
Principal Component Analysis (PCA) and Independent Component Analysis (ICA) are linear processing techniques characterized by their simplicity and low computational load. Equation (2) shows the linear transformation that describes both processes:
$$\mathbf{s} = \mathbf{W}\,\mathbf{x} \qquad (2)$$
where x comprises the reflectance information per tissue sample, measured at N locations (approximately 4000 pixels or observations per sample in this case) and in M different spectral bands (M = 512); W is the mixing matrix that represents the linear operation applied to the original data x to provide s; and s contains the resulting multivariate data, i.e. the decoupled signals or scores, which are the representation of the raw data x in the new component space. The process can also be described in the opposite direction, i.e. the measured data as a linear mix of the components, as indicated in Eq. (3):
$$\mathbf{x} = \mathbf{A}\,\mathbf{s} \qquad (3)$$
where A is the matrix of coefficients or loadings. PCA and ICA algorithms do not require any training, modeling, supervision or prior signal information, and they are accordingly considered Blind Signal Separation (BSS) techniques.
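The relation between Eqs. (2) and (3) can be illustrated with a minimal numerical sketch. The array sizes, the random data and the use of the pseudo-inverse to obtain W are assumptions made only for illustration; they are not the procedure used in this work, where W is estimated blindly by PCA and ICA.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 8, 200, 3              # hypothetical sizes: bands, pixels, components

A = rng.normal(size=(M, K))      # loadings: one spectral signature per column
s = rng.normal(size=(K, N))      # scores: one row per component, one column per pixel
x = A @ s                        # Eq. (3): data as a linear mix of the components

# Assumption for illustration: A is known, so W can be taken as its pseudo-inverse
# (W @ A = I when A has full column rank).
W = np.linalg.pinv(A)
s_rec = W @ x                    # Eq. (2): scores recovered from the data

print(np.allclose(s_rec, s))
```

In the blind setting neither A nor s is known, so W has to be estimated from the statistics of x alone, which is what PCA and ICA provide.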

Linear mixture of components
PCA and ICA assume linear mixtures. If the natural (Napierian) logarithm is applied to the empirical expression in Eq. (1), the reflectance can be written as a linear sum of the reflectance spectral parameters, which can then be compared with the PCA-ICA results, as shown in Eq. (4). The expression for the spectrum then becomes as shown in Eq. (5), where the columns of matrix A directly become the spectral features of the tissue components, $\sigma_n(\lambda)$, and are related to the properties of those components; the sources s are the blindly extracted parameters, which might be related to the contribution of tissue to the scattering and absorption phenomena.

PCA to uncorrelate components and compress spectral data
Principal Component Analysis (PCA) is usually employed as a technique to reduce the number of variables in a data set with a minimal loss of information and to search for a more significant data representation. However, the physical meaning of these new variables is not always straightforward.
PCA assumes a linear approximation of the problem, such as the one described in 2.4.1. The covariance matrix C of the input data x must be calculated, assuming that x is a mean-centered version of the initial reflectance data. Since the covariance matrix is symmetric, it can be decomposed as in Eq. (6):
$$\mathbf{C} = \mathbf{E}\,\mathbf{D}\,\mathbf{E}^{H} \qquad (6)$$
where D is a diagonal matrix containing the eigenvalues of C and the columns of E are the eigenvectors of C. The mixing matrix in Eq. (5), W, is defined as the Hermitian transpose of E, $\mathbf{W} = \mathbf{E}^{H}$. This matrix W transforms the input data x into the uncorrelated components in vector s, which are ordered according to the contribution of their eigenvalues to the total variance of the data set. Focusing on Eq. (4), the components in vector s could represent the contributions $s_n$ to the variation of the spectra, and $\sigma_n(\lambda)$ would be the normalized spectral variation of the tissue components. The first few rows of matrix W could extract those tissue properties, while the remaining components, with small associated eigenvalues, are related to noise. A criterion must then be established to decide how many uncorrelated components, L, are maintained from the initial M = 512. The chosen criterion is to maintain the L < M largest eigenvalues whose joint variance is above a specific threshold, as shown in Eq. (7):
$$\frac{\sum_{q=1}^{L} D(q,q)}{\sum_{q=1}^{M} D(q,q)} \geq \text{threshold} \qquad (7)$$
where D(q,q) is the q-th eigenvalue of the covariance matrix C.
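The PCA step of Eqs. (6) and (7) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `pca_keep_variance`, the (bands x pixels) data layout and the default threshold of 0.99 are all hypothetical.

```python
import numpy as np

def pca_keep_variance(x, threshold=0.99):
    """PCA by eigendecomposition of the covariance matrix.

    x: (M, N) array, M spectral bands by N pixels (assumed layout).
    Returns W (L, M), scores s (L, N), all eigenvalues d (descending), and L.
    """
    x = np.asarray(x, dtype=float)
    xc = x - x.mean(axis=1, keepdims=True)     # mean-center each spectral band
    C = xc @ xc.T / (xc.shape[1] - 1)          # M x M covariance matrix
    d, E = np.linalg.eigh(C)                   # Eq. (6); eigh returns ascending order
    d, E = d[::-1], E[:, ::-1]                 # reorder to descending eigenvalues
    kept = np.cumsum(d) / d.sum()              # left-hand side of Eq. (7) for L = 1..M
    # Smallest L whose kept variance reaches the threshold (capped at M).
    L = min(int(np.searchsorted(kept, threshold)) + 1, d.size)
    W = E[:, :L].T                             # first L rows of W = E^H
    return W, W @ xc, d, L
```

Because the rows of W are eigenvectors of C, the covariance of the returned scores is diagonal, which is the uncorrelation property exploited in the following sections.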

ICA to identify independent latent factors
Independent Component Analysis (ICA) is also a multivariate linear blind separator, but it uses higher-order statistics instead of covariance to extract the new set of linearly unmixed components. Since statistical independence is a stronger condition than uncorrelation, more accurate diagnostic maps can be obtained. This work assumes that there exist properties of malignant tissue that are statistically independent from those of other tissue types, such as normal or adipose. This hypothesis is based on the differences in absorbance and scattering generated by each tissue condition. All measurements of spectral reflectance could be used to discriminate between tissue types, but this is frequently not optimal and always computationally demanding [20]. Here, PCA is proposed to reduce the data dimensionality [21], and the data arising from this preprocessing step are then analyzed with ICA. This is the reason why ICA results cannot be spectrally interpreted as in Eq. (4): the dimension of the data is no longer the 512 spectral components but the very few components retained after PCA. In fact, the situation is similar to that of Eq. (3), but now x, i.e. the detectors, are the uncorrelated components maintained after the PCA analysis, and s, i.e. the sources, are the independent components (ICs), which are expected to be more diagnostically discriminating. Figure 1 summarizes the whole analysis procedure used to obtain the maps of tumor probability. PCA is first applied to the logarithm of the initial reflectance data set, which contains 512 images, one per wavelength (Fig. 1(a)). Then, a few uncorrelated components are maintained (Fig. 1(b)) and are input into the ICA algorithm. A tumor probability map (Fig. 1(c)) is computed from the resulting independent components, which, because of the stronger condition mentioned above, are expected to be more diagnostically relevant and better unmixed than the principal components obtained in the preceding analysis stage.
To obtain the independent components, a FastICA algorithm based on maximization of the fourth statistical moment, i.e. kurtosis, was employed. It is computationally simple, fast and requires little memory [21]. Both PCA and ICA are subject to ambiguities. PCA is based on the singular value decomposition (SVD), and there is mathematically no way to avoid the sign ambiguity that arises because each pair of singular vectors enters as a multiplicative term [21]. In ICA, the variance of the independent components cannot be determined [20]; the magnitudes of the independent components may be fixed, but this still leaves the sign ambiguous. Several strategies have been tested to deal with the ambiguity problems of BSS analysis. The sign ambiguity of PCA can be resolved by correlation with the spectral signatures of tissue chromophores, whereas ICA in this case does not permit a spectral interpretation. Concerning order ambiguity, while PCA components are ordered by variance, the intrinsic order ambiguity of ICA prevents a discriminating ranking of the independent components [20].
FastICA is an iterative algorithm that starts from an initial guess. If this initial guess is not fixed, the algorithm begins with a random matrix, which results in different signs and orders of the output signals even when it starts from the same set of measurements. The W matrix resulting from PCA, i.e. the uncorrelated coefficients, is proposed as the initial seed to limit this ambiguity effect. Additionally, surgeon guidance is proposed to compensate for this ambiguity. Visual inspection of the results reveals that one significant independent component is sufficient to distinguish benign from malignant pathologies. The surgeon could guide the selection of the significant component or, alternatively, the significant component could be identified by cross-correlating a digital photograph of the sample with its spectrally derived ICA parameters, mimicking the surgeon's viewpoint. Even so, the sign ambiguity of the selected independent component would still induce an error in the tissue category assignment. To this end, a calibration method is employed in which the user, ultimately the surgeon, specifies a set of known pixels, e.g. obviously malignant tissue at the center of the lesion. Informed by this initial information, ICA then provides a map of tumor extent.
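The effect of fixing the initial guess can be sketched with a minimal kurtosis-based FastICA written in NumPy. This is an illustrative sketch, not the implementation used in this work: the function names, the deflation scheme and the use of the identity matrix as the seed in the whitened PCA space (taken here as the analogue of seeding with the PCA matrix) are assumptions.

```python
import numpy as np

def whiten(x):
    """Center x (K, N) and rotate/scale it so that its covariance is the identity."""
    xc = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(xc))
    return (E / np.sqrt(d)).T @ xc            # z = D^(-1/2) E^T x

def fastica_kurtosis(z, W0, n_iter=200, tol=1e-10):
    """Deflation FastICA with the cubic (kurtosis) nonlinearity.

    z: whitened data (K, N); W0: fixed initial guess (K, K), one row per component.
    Returns an orthogonal unmixing matrix W such that y = W @ z.
    """
    K = z.shape[0]
    W = np.zeros((K, K))
    for i in range(K):
        w = W0[i] / np.linalg.norm(W0[i])
        for _ in range(n_iter):
            # Fixed-point update for kurtosis: E{z (w^T z)^3} - 3 w
            w_new = (z * (w @ z) ** 3).mean(axis=1) - 3.0 * w
            # Deflation: remove projections onto components already found
            w_new -= W[:i].T @ (W[:i] @ w_new)
            w_new /= np.linalg.norm(w_new)
            # Convergence is tested up to sign, since w and -w are equivalent
            converged = abs(abs(w_new @ w) - 1.0) < tol
            w = w_new
            if converged:
                break
        W[i] = w
    return W
```

With a fixed seed such as `np.eye(K)`, repeated runs on the same data return identical components; with a random seed, the rows of W may come out in a different order or with flipped signs. The sign of each row remains arbitrary in either case, which is why the calibration described next is still required.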

Well-known points strategy
Whether the sign ambiguity in ICA needs to be corrected, and how this correction is implemented, varies among applications [20], although the majority of approaches rely on supervised classifiers applied after the ICA process. In the present tissue diagnosis application, different pairs of tissue regions (adipose-benign, adipose-malignant, benign-malignant) become well separated by the PCA-ICA combination, but the sign ambiguity prevents a completely blind assignment of tissue categories. In the validation against the pathologist-based diagnosis, it is precisely the sign of the selected independent component that differentiates between tissue diagnoses. Two diagnostic possibilities can be considered, corresponding to the positive-sign and negative-sign regions. However, the same sign is not always associated with the same pathology in different patients.
The proposed procedure therefore needs some a priori knowledge about the sign associated with each tissue type, since this information is required to specify the tissue category (malignant or non-malignant) in the final guidance map. Taking advantage of both their experience and the information provided by pre-interventional techniques, surgeons are able to clearly identify the tumor center and healthy tissue. The main difficulty they face is the accurate delineation of the malignant area far from the center. This is where the proposed guidance map would be of great interest. Once surgeons have located malignant and non-malignant centers, these points work as calibration points that allow the algorithm to identify the actual sign of the malignant regions. To emulate this surgeon selection, 25 pixels in each ROI were selected as "well-known" points, i.e. points with a certain, verified diagnosis. A detection mask can then be easily created. This process is summarized in Fig. 2.
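The calibration with well-known points can be sketched as follows. The function names, the (row, column) point format and the rule of comparing the mean IC value at the two sets of points are assumptions introduced for illustration; the sketch only shows how a few known pixels suffice to fix the sign before the detection mask is created.

```python
import numpy as np

def calibrate_sign(ic_map, malignant_pts, nonmalignant_pts):
    """Flip the IC map, if needed, so that known malignant pixels are positive-side.

    ic_map: (H, W) array with the selected independent component.
    malignant_pts, nonmalignant_pts: lists of (row, col) "well-known" points.
    """
    mal = np.mean([ic_map[r, c] for r, c in malignant_pts])
    non = np.mean([ic_map[r, c] for r, c in nonmalignant_pts])
    return ic_map if mal > non else -ic_map

def detection_mask(calibrated_map):
    """After calibration, positive values are labeled malignant."""
    return calibrated_map > 0
```

Because only the sign is estimated, the same mask is obtained whether ICA returns the component or its negative, which is the invariance the calibration is meant to provide.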

Results and discussion
The goal of this paper is to design a blind data analysis, i.e. one without model fitting, to segment tumor from normal tissue in lumpectomy specimens using localized measurements of broadband reflectance. This blind analysis is designed to discriminate between areas within a single tissue sample and not between samples. This is acceptable for margin detection because the success of BCT is measured by accurate tumor delineation within each patient. The results compare the performance obtained by PCA, by PCA-ICA and by the extraction of optical parameters according to an empirical approximation to Mie theory [18]. The metrics used to assess performance are the probability of detection and of false alarm (sensitivity and specificity) of PCA-ICA and of PCA itself.

PCA results: uncorrelation has weaker diagnostic ability than independence
For blind signal separation, PCA is applied to reduce the dimensionality of the broadband reflectance data, to estimate the number of components used to inform a diagnosis, and to analyze their diagnostic relevance.
The kept-variance curve has a different slope for each tissue sample. A dynamic threshold based on the derivative of the kept-variance curve was therefore selected empirically: the L maintained components are those for which the kept variance changes by more than 0.2% with respect to the previous set of L-1 components. The resulting number of maintained uncorrelated components varies from 2 to 7 across the data set and is usually 3. Figure 3 shows the cumulative variance plots of two different samples: normal-adipose and malignant-adipose. The first few components correspond to the large eigenvalues, while the components on the right part of the graph have small eigenvalues and are presumed to be related to noise. The reflectance spectral map of sample 1 (normal-adipose) is more uniform than that of sample 2 (malignant-adipose) because of their different tissue composition. As a consequence of this spectral difference, the proposed dynamic criterion would select 5 components for sample
When the A matrix of PCA coefficients of Eq. (3) is inspected qualitatively, the first principal coefficient displays a constant spectral tendency, while the second shows an exponential or negative-logarithm behavior. Figure 4 represents the mean of the first three principal coefficients over the 29 samples. The optical system is optimized to detect scattering rather than absorption [18]. Nonetheless, the third principal coefficient exhibits a high correlation with absorption by hemoglobin.
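The dynamic stopping rule described above can be sketched as follows. The function name is hypothetical, and the 0.2% criterion is interpreted here as an absolute increase of 0.002 in the cumulative fraction of kept variance; this interpretation is an assumption, since the text does not state whether the change is absolute or relative.

```python
import numpy as np

def select_components(eigenvalues, min_gain=0.002):
    """Number of components L kept by the dynamic threshold.

    A component is kept while adding it raises the kept (cumulative) variance
    fraction by more than min_gain (0.2% of the total variance by default).
    """
    d = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # descending
    gain = d / d.sum()            # kept(L) - kept(L-1) for L = 1..M
    below = np.nonzero(gain <= min_gain)[0]
    L = int(below[0]) if below.size else d.size
    return max(L, 1)              # always keep at least the first component

# Example with hypothetical eigenvalues: three dominant components plus noise
print(select_components([70.0, 25.0, 4.9, 0.05, 0.03, 0.02]))
```

For the hypothetical eigenvalues in the example, the first three components each add more than 0.2% of the total variance and the fourth does not, so three components are retained.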
Considering the Mie linear approximation of reflectance as noted in Eq. (4), the similarity with PCA results is found as stated by Eq. If this supposition of likeness were right, similarity between the PCA scores and the model-based parameters should be found. Table 2 shows the correlation between the mean PCA scores and the optical parameters extracted by fitting the empirical approximation given by Eq. (1), in order to assess the contribution of scattering power and hemoglobin absorption to each score. Although a high correlation with Mie scattering power is found for PC2 (Fig. 4), and hemoglobin absorption is usually collected in PC3 (Fig. 4), this relationship does not necessarily define PC2 and PC3 as scattering power and hemoglobin, as would be the case in a conventional model-fitting extraction. However, some similarities are found between the behavior of the statistical features (PCA scores) and that of the optical features (scattering and absorption from the model), which may suggest that the BSS analysis accounts for the physical variation of tissue parameters. Figure 5 shows the map of PC2 scores (Fig. 5(a)) for a specific tissue sample compared with the scattering power map (Fig. 5(b)) obtained from Eq. (1). A high correlation between both maps can be observed, which is also shown in the associated scatter plot (Fig. 5(c)). Figure 6 represents the influence of hemoglobin on the scores of PC3. In the digital photograph of the sample (Fig. 6(a)) some blood pools can be observed. The spectral variation of PC3 (Fig. 6(b)) shows similarities with the hemoglobin spectrum, and the map of PC3 scores exhibits high values in the areas where the blood pools are located (Fig. 6(c)).
ICA was then applied to the three most significant PCA components, determined by kept variance, to extract the independent maps of each sample for improved discrimination. These independent features were used to classify the tissues, and the results were compared with those of the uncorrelated features to check whether the stronger statistical condition of independence translates into better classification accuracy.

ICA for the extraction of independent maps
While the PCA ambiguity is easy to resolve through correlation with the spectral signatures of tissue chromophores, FastICA is an iterative algorithm that introduces two types of ambiguity (sign and order), and such a study is not so straightforward. Because of this, and as mentioned above, the PCA matrix of coefficients is used as the initial seed for the ICA algorithm. Even under this premise, it was not possible to deal with these ambiguities analytically.
By visual inspection it seems easy to determine which component is most discriminating, but automation of this analysis is desirable. The digital photograph of each tissue sample also provides useful information, as it mimics the surgeon's vision. Correlation with the digital photograph is proposed as a fast way to determine which ICA score is of most interest for diagnostic purposes. This procedure could emulate the surgeon's behavior. Figure 7 shows one of the tissue samples, together with the probability of detection ($P_d$) and the degree of correlation (R) with the digital photograph of the tissue (Fig. 7(a)) when the last IC (Fig. 7(b)) or the penultimate IC (Fig. 7(c)) is selected. The H&E section is also shown (Fig. 7(d)) for visualization purposes. In this sample, the last IC exhibits the highest correlation with the digital photograph and also achieves the highest probability of detection when the results are validated against the ROI information provided by the pathologist. As explained in the previous section, after the most appropriate independent score has been chosen, the "well-known points" calibration is performed to resolve the sign ambiguity. The selected ICA score is expected to be the best for classification purposes. It is also expected to be the one most closely related to single scattering, since interference from absorption and other attenuation contributions is minimized by the optical set-up and is therefore expected to be associated with the discarded independent component maps. Table 3 shows the mean discrimination and standard deviation of each cluster of PCA scores compared with those attained with ICA. Figure 8 shows the maps of tumor probability for different samples compared with the digital photograph, the H&E section and the pathologist's diagnosis.
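The photograph-based selection of the independent component can be sketched as follows. The function name is hypothetical, and the sketch assumes that the photograph has been converted to grayscale and co-registered to the same pixel grid as the IC maps; the absolute value of the correlation is used because the sign of each IC is arbitrary at this stage.

```python
import numpy as np

def select_ic_by_photo(ic_maps, photo_gray):
    """Pick the IC map most correlated (in absolute value) with the photograph.

    ic_maps: (K, H, W) array of independent component maps.
    photo_gray: (H, W) grayscale photograph on the same pixel grid (assumed).
    Returns the index of the selected IC and the |R| value of every map.
    """
    p = np.asarray(photo_gray, dtype=float).ravel()
    r = np.array([abs(np.corrcoef(np.asarray(m, dtype=float).ravel(), p)[0, 1])
                  for m in ic_maps])
    return int(np.argmax(r)), r
```

The returned |R| values correspond to the degree of correlation reported for each candidate IC, and the selected index is the map that would then be passed to the well-known points calibration.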
The PCA-ICA process provides the highest separation between the different pathologies with very similar standard deviations, achieving a high degree of agreement with the pathologist's decision. This makes it easier to implement a linear classifier, which is fast and computationally simple. Then, once the best ICA score for classification purposes has been selected, a probability of tumor can be calculated according to the values in the ICA space. Figure 8 shows the results for 4 different samples with diverse pathologies: malignant (purple), non-malignant (white) and adipose (cyan). The last column shows the tumor probability maps, and good agreement can be found by visual inspection when they are compared with the H&E section and the pathologist's ROIs. For a more quantitative assessment, Table 4 shows the sensitivity and specificity obtained when detecting malignant points, together with a comparison of strategies to fix and select the diagnostic map. The best results depend on the chosen ICA score and are better when the score is fixed by visual inspection. Correlation with the digital photograph of the specimen helps to determine the best score, but its performance is somewhat lower than that provided by the first PCA score.
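For reference, the sensitivity and specificity figures used in this comparison follow the standard definitions, which can be sketched as below. The function name and the boolean encoding (True for malignant) are assumptions for illustration; the labels in the example are hypothetical and are not data from this study.

```python
import numpy as np

def sensitivity_specificity(pred, truth):
    """Sensitivity and specificity of a binary malignancy mask.

    pred, truth: boolean arrays of the same shape (True = malignant).
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)        # malignant pixels detected as malignant
    fn = np.sum(~pred & truth)       # malignant pixels missed
    tn = np.sum(~pred & ~truth)      # non-malignant pixels correctly rejected
    fp = np.sum(pred & ~truth)       # non-malignant pixels flagged as malignant
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels for five pixels
sens, spec = sensitivity_specificity([1, 1, 0, 0, 1], [1, 1, 1, 0, 0])
print(sens, spec)
```

In the setting of this work, `pred` would be the detection mask restricted to the pathologist's ROIs and `truth` the corresponding histology labels.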

Conclusions
The feasibility of using PCA and ICA to blindly detect and localize breast tissue pathologies is proposed and successfully verified in this paper. To preserve the tissue properties, elastic scattering, which requires low optical power, is used. The analysis proposed here is designed to discriminate between tissue areas within a single sample, and not between samples, with the ultimate goal of guiding the surgeon during Breast Conserving Therapy.
Computationally efficient BSS analysis has been directly applied to 512 optical localized reflectance breast measurements, instead of reflectance model fitting, to readily identify their corresponding cancer pathologies. Reflectance is directly obtained from endogenous tissue properties, mainly scattering from tissue morphology, without injection of contrast agents that require expensive biocompatibility studies and regulatory approval for clinical use.
PCA reduces the dimension of the data set from the initial 512 spectral bands to just 3-5 uncorrelated components. The latter exhibit significant similarities with the parameters extracted using an empirical model based on Mie theory, specifically the scattering power and the hemoglobin absorption spectrum. They can additionally be used as classification features by applying a linear threshold. However, uncorrelation is a weaker statistical condition than independence, so ICA was employed to compare classification results. The combined PCA-ICA analysis provided the most significant diagnostic maps, including tumor probability information. Discriminating spectral information, sometimes lost in empirical approximations of light scattering, contributes here to a better separation of tissue types.
Sign ambiguities limiting discrimination by ICA have been resolved by selecting some "well-known" points, which the surgeon can provide in a real scenario, to calibrate the analysis. However, the ambiguity arising from the order in which the scores are generated has been a challenge. The criterion selected to confront this ambiguity is to correlate the ICA results with the corresponding digital photograph of the tissue. The best sensitivity-specificity attained with ICA is 96%-95%, while the proposed "photograph correlation for selection" yielded 93%-81%. A loss in classification performance is therefore induced if the selection of the best score is not optimized. However, both ICA solutions are still a better choice than the selection of the second PCA score, which yields 86%-74%.
Furthermore, a substantial correlation between tumor probability and H&E maps is also obtained, which suggests that a future application of the system could be margin delineation. The goal of this approach is not to diagnose malignancy but to map its extent. During surgery the tumor is already localized, so seeding of the algorithm by the surgeon is feasible.
To conclude, this contribution validates and optimizes the ability of PCA and ICA to blindly detect breast tissue pathologies. Tissue features related to elastic scattering and blood absorption have been extracted from label-free localized reflectance measurements using neither training information nor empirical models, although this result still needs to be confirmed with tissue-simulating phantoms of known optical properties. Even so, PCA and ICA extract significant features that provide a map of tumor probability for use in an intraoperative context.