Hyperspectral image segmentation: a preliminary study on the Oral and Dental Spectral Image Database (ODSI-DB)

ABSTRACT Visual discrimination of clinical tissue types remains challenging, with traditional RGB imaging providing limited contrast for such tasks. Hyperspectral imaging (HSI) is a promising technology providing rich spectral information that can extend far beyond three-channel RGB imaging. Moreover, recently developed snapshot HSI cameras enable real-time imaging with significant potential for clinical applications. Despite this, the investigation into the relative performance of HSI over RGB imaging for semantic segmentation purposes has been limited, particularly in the context of medical imaging. Here we compare the performance of state-of-the-art deep learning image segmentation methods when trained on hyperspectral images, RGB images, hyperspectral pixels (minus spatial context), and RGB pixels (disregarding spatial context). To achieve this, we employ the recently released Oral and Dental Spectral Image Database (ODSI-DB), which consists of 215 manually segmented dental reflectance spectral images with 35 different classes across 30 human subjects. The recent development of snapshot HSI cameras has made real-time clinical HSI a distinct possibility, though successful application requires a comprehensive understanding of the additional information HSI offers. Our work highlights the relative importance of spectral resolution, spectral range, and spatial information to both guide the development of HSI cameras and inform future clinical HSI applications.


Introduction
During an intervention, the physician or surgeon has to continuously decode the visual information into tissue types and pathological conditions. Decisions on how to continue with the intervention are taken as a result of this process. Hyperspectral cameras can capture visual information far beyond the three red, green, and blue (RGB) wavelengths that the naked human eye (and common endoscopes) can perceive. This additional data provides extra cues that facilitate the identification and characterization of relevant tissue structures that are otherwise imperceptible (Shapey et al. 2019; Ebner et al. 2021). To name a few examples, hyperspectral information has been used as an input to visualize tissue structures occluded by blood (Monteiro et al. 2004) or ligament tissue (Zuzak et al. 2008), display tissue oxygenation and perfusion (Best et al. 2011; Chin et al. 2017), classify images as cancerous/normal (Lu et al. 2017; Fei et al. 2017; Bravo et al. 2017; Beaulieu et al. 2018), and improve the contrast between different anatomical structures such as liver/gallbladder (Zuzak et al. 2007), ureter (Nouri et al. 2016) and facial nerve (Nouri et al. 2016). The output of these systems is typically a classification score, an overlay with a segmentation, or a contrast-enhanced image. In this work, we focus on segmentation.
Image segmentation is a building block of many computer-assisted medical applications. In dentistry, the two most common diagnostic visualization strategies are visual inspection (RGB imaging) and X-ray imaging (Hyttinen et al. 2020). RGB imaging serves a multitude of purposes, including patient instruction and motivation, medico-legal reasons, treatment planning, liaison with the dental laboratory, assessment of the baseline situation (when seeing a new patient), and progress monitoring (Ahmad 2009a,b). RGB imaging also provides valuable information for soft-tissue diagnostics and some surface features of hard tissue. However, the information it captures is restricted to the capabilities of the human eye, with spectral characteristics dictated by the central wavelengths of the short, middle, and long wavelength-detecting cones in the retina (450 nm, 520 nm, 660 nm). Unlike RGB imaging, X-ray provides anatomical and pathological information on hard-tissue structures such as the teeth and the alveolar bone. This additional information comes at the expense of exposing the patient to ionizing radiation and to potential risks derived from the use of contrast agents.
In contrast to RGB imaging, hyperspectral cameras allow us to capture additional information beyond the usual three RGB bands. This new set of images, corresponding to narrow and contiguous wavelength bands, forms the reflectance spectrum of the sample (in our case, the sample is the inside of the mouth). Although the applications and possible benefits of hyperspectral imaging (HSI) are an active field of research (Shapey et al. 2019; Manifold et al. 2021; Ebner et al. 2021; Seidlitz et al. 2022), we foresee that patients could potentially benefit from this technology in two different ways. First, the reflectance spectrum could be used to extract tissue properties and produce a range of pseudo-color images that enhance the visualization capabilities of clinicians (Fält et al. 2018), for example by displaying or highlighting imaging biomarkers or clinical conditions that are barely visible or not perceivable in RGB (Best et al. 2013; Zherebtsov et al. 2019; Hyttinen et al. 2018, 2019). Second, hyperspectral images could provide computer-assisted diagnosis methods with additional information that can help improve the accuracy of detecting and diagnosing lesions (Boiko et al. 2019). The work presented in this manuscript is aimed at assessing the latter possible benefit.
The research hypothesis we are working with is that there may be perceivable differences in the reflectance spectrum of diseased tissue compared to that of healthy anatomy. For example, the average reflectance spectrum of all the pixels of a healthy tooth might be different to that of one affected by a certain condition (e.g. early-stage cavities). A preliminary step to the development of such quantitative dental and oral biomarkers is to segment the different anatomical structures accurately. In this scenario, a question that quickly arises is whether we can obtain an improved segmentation accuracy with state-of-the-art deep neural network architectures designed for 2D RGB image analysis by simply replacing the RGB input with a hyperspectral cube. Similarly, from a deep neural network design perspective, it is interesting to see how accuracy changes when we just use spectral information (i.e. when we classify pixels individually) or when we also use spatial information (i.e. when we segment images). That is, we aim to discover how the segmentation accuracy changes when reducing the information available from N hyperspectral bands to the usual three RGB bands. This is indeed one way to assess a lower bound of the added value of hyperspectral information above and beyond RGB.

Contributions
In this work, we provide baseline results for the segmentation of the Oral and Dental Spectral Image Database (ODSI-DB) (Hyttinen et al. 2020). We evaluate how the segmentation performance changes when using different spatial and spectral resolutions as data inputs, guiding future developments in the field of dental reflectance and hyperspectral image segmentation. We provide the training code along with the models validated in this work. Additionally, we propose an approach to reconstructing RGB images from raw hyperspectral data that improves on the one employed in ODSI-DB. This is useful to compensate for missing spectra, as occurs in the images captured with the Nuance EX camera used in ODSI-DB. Our code to perform this conversion is also made available.

Related work
Recently, a literature review on deep learning techniques applied to hyperspectral medical images has been published (Khan et al. 2021). In the following paragraph we summarise some of the most recent work involving hyperspectral images in the context of surgery, and how the performance varies across different computer-assisted applications when using hyperspectral bands as opposed to traditional RGB imaging.
In Garifullin et al. (2018), the authors segmented the retinal vasculature, optic disc and macula using 30 spectral bands (380-780 nm). Their results showed an improvement of 2 percentage points (pp) for vessels and optic disc and 6 pp for the macula when comparing deep learning (DL) models trained on hyperspectral versus RGB images. In Ma et al. (2019, 2021), the authors employed hyperspectral images for tumor classification and margin assessment. In that work, the model proposed by the authors for tissue classification on surgical specimens achieved a pixel-wise average AUC of 0.88 and 0.84 for hyperspectral and RGB, respectively. In Wang et al. (2021), the authors reported a difference of 2 pp in Dice coefficient for the segmentation of melanoma in histopathology samples of the skin when comparing the performance of a 2D U-Net on RGB and hyperspectral images. Similarly, Trajanovski et al. (2021) showed that the Dice coefficient for the segmentation of tongue tumors increases from 0.80 to 0.89 when using hyperspectral information. Despite this recent body of work, to the best of our knowledge, there is no current benchmark on ODSI-DB, which, unlike the datasets used in the previous literature (mostly targeting binary classification), has a considerably higher number of classes (35) and a substantial number of patient samples (> 200).

Dataset details
The ODSI-DB dataset (Hyttinen et al. 2020) contains 316 images (215 are annotated, 101 are not) of 30 human subjects. The 215 annotated images are partially labelled, and the number of annotated pixels varies from image to image. The annotated pixels can belong to 35 possible classes. The number of annotated pixels per class is shown in Tab. A1 in the appendix. ODSI-DB contains images captured with two different cameras: 59 annotated images were taken with a Nuance EX (CRI, PerkinElmer, Inc., Waltham, MA, USA), and 156 were obtained with a Specim IQ (Specim, Spectral Imaging Ltd., Oulu, Finland). The pictures taken by the Nuance EX contain 51 spectral bands (450-950 nm with 10 nm steps), and those captured by the Specim IQ have 204 bands (400-1000 nm with approximately 3 nm steps). The reflectance values for the images are in the normalized range [0, 1].

Reconstruction of RGB images from hyperspectral data
Although ODSI-DB contains RGB reconstructions of the hyperspectral images, we have observed that the method used to generate them does not compensate for the lack of hyperspectral information in the 400-450 nm range for the Nuance EX images. The lack of this information results in a yellow artifact in the reconstructed RGB images from the Nuance EX (see Fig. 1). We thus provide an alternative RGB reconstruction.

Our RGB reconstruction
Figure 1. Exemplary reconstructed RGB images provided in the ODSI-DB dataset compared to our RGB reconstructions. As can be observed, the RGB images reconstructed from the hyperspectral images captured with the Nuance EX are affected by a yellow artifact. This artifact is not present in those reconstructed from the Specim IQ. This occurs because the Nuance EX does not capture the 400-450 nm range, which carries information relevant to reconstructing the blue channel (and, to a lesser degree, the red channel).
Our RGB reconstruction follows the method proposed by Magnusson et al. (2020), where the hyperspectral images are first converted to CIE XYZ and then to sRGB.
The conversion from CIE XYZ to linear sRGB is a linear transformation where the X, Y, and Z channels contribute largely to the red, green, and blue channels of the linear sRGB image, respectively. When converting hyperspectral images to CIE XYZ, the contribution (i.e. the weight) of each hyperspectral band to the CIE XYZ image is defined by a color matching function (CMF). We used the standard CIE 1931 CMF, shown in Fig. 2. As this figure shows, the hyperspectral bands in the 400-450 nm range carry a considerable weight in the reconstruction of the Z (blue) channel, and a minor weight in the reconstruction of the X (red) channel.
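As an illustration of the linear transformation mentioned above, the following minimal sketch applies the widely published XYZ (D65) to linear sRGB matrix, rounded to four decimals. Whether the authors' code uses exactly these rounded coefficients is an assumption, and the function name `xyz_to_linear_srgb` is hypothetical.

```python
# Commonly published XYZ (D65 white point) to linear sRGB matrix,
# rounded to four decimals. Assumption: the paper's code may use
# a higher-precision version of the same matrix.
XYZ_TO_LINEAR_SRGB = (
    (3.2406, -1.5372, -0.4986),   # red row: dominated by X
    (-0.9689, 1.8758, 0.0415),    # green row: dominated by Y
    (0.0557, -0.2040, 1.0570),    # blue row: dominated by Z
)


def xyz_to_linear_srgb(xyz):
    """Map one (X, Y, Z) triple to a linear (R, G, B) triple (no clipping)."""
    return tuple(
        sum(weight * channel for weight, channel in zip(row, xyz))
        for row in XYZ_TO_LINEAR_SRGB
    )


# The D65 white point should map to (approximately) white.
d65_white = (0.9505, 1.0000, 1.0890)
linear_rgb = xyz_to_linear_srgb(d65_white)
```

The largest coefficient of each row sits on the diagonal, which is the sense in which X, Y, and Z contribute mostly to red, green, and blue. It also explains why a deficit in Z (missing 400-450 nm bands) lowers the blue channel and produces a yellow tint.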
As the Nuance EX hyperspectral images do not have any information in this range, we miss a substantial amount of the information needed to reconstruct the Z (blue) channel correctly, which is why the images look yellow (see the blue medical-grade glove in Fig. 1). On the other hand, the images captured with the Specim IQ camera have information in the 400-450 nm range (the camera range is 400-1000 nm). Therefore, the RGB reconstructions look realistic and do not display the yellow tint seen in those from the Nuance EX.
The purpose of the modified RGB reconstruction proposed in this section is to compensate for the missing information (380-450 nm for the Nuance EX and 380-400 nm for the Specim IQ). With this modification, we aim to make the RGB images produced from both cameras look alike before they are processed by the convolutional network. To do so and account for the missing wavelengths, we modify the original CIE CMF shown in Fig. 2. The modification consists of taking the CMF in the missing range (e.g. 380-450 nm for the Nuance EX), flipping it over the vertical axis at the start of the captured wavelengths (450 nm for the Nuance EX), and summing it with the original CMF. More formally, the modified color matching functions are defined as
$$\bar{x}_n(\lambda) = \bar{x}(\lambda) + \bar{x}_c(\lambda), \qquad \bar{y}_n(\lambda) = \bar{y}(\lambda) + \bar{y}_c(\lambda), \qquad \bar{z}_n(\lambda) = \bar{z}(\lambda) + \bar{z}_c(\lambda) \tag{1}$$
where $\bar{x}$, $\bar{y}$, $\bar{z}$ are the original CIE 1931 2-deg color matching functions (CMFs) (Smith and Guild 1931) shown in Fig. 2 (left), $\bar{x}_c$, $\bar{y}_c$, and $\bar{z}_c$ are the additive corrections to compensate for the missing information, and $\bar{x}_n$, $\bar{y}_n$, and $\bar{z}_n$ are the corrected CMFs (shown in Fig. 2 center and right). As different cameras are missing different wavelength ranges, the additive corrections must be different: the CMF correction for the Nuance EX mirrors the 380-450 nm portion of the original CMFs about 450 nm, and the CMF correction for the Specim IQ mirrors the 380-400 nm portion about 400 nm. The original CMF, along with those modified for the Specim IQ and Nuance EX images, are shown in Fig. 2. Due to the nature of the proposed CMF modifications, the modified RGB reconstruction can be seen as a color normalization that transforms the input data to a common RGB space, easing the learning of the segmentation from a dataset containing a mix of Nuance EX and Specim IQ images.
Once we have the modified CMFs, the conversion from hyperspectral to RGB integrates, over wavelength, the corrected CMFs weighted by f and g, where f is the spectral density of the sample (i.e. the continuous version of the hyperspectral image), and g is the spectral density of the illuminant, D65 in our case. As these functions ($\bar{x}_n$, $\bar{y}_n$, $\bar{z}_n$, f, g) are typically sampled at different wavelengths, we interpolate all of them with a PCHIP 1-D monotonic cubic interpolator. We use the composite trapezoidal rule to evaluate the integral (with the image wavelengths as sample points).
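The flipping operation and the integration described in this section can be written out as follows. This is a sketch derived from the prose description and not necessarily the authors' exact formulation. The symbol $\lambda_0$ (first captured wavelength, 450 nm for the Nuance EX and 400 nm for the Specim IQ), the closed interval bounds, and the normalisation constant $k$ are assumptions introduced here for illustration.

```latex
% Mirrored correction: the part of the CMF below \lambda_0 (down to 380 nm)
% is reflected about \lambda_0 and added on top of the captured range.
\bar{x}_c(\lambda) =
\begin{cases}
  \bar{x}(2\lambda_0 - \lambda), & \lambda_0 \le \lambda \le 2\lambda_0 - 380\,\mathrm{nm} \\
  0, & \text{otherwise}
\end{cases}
\qquad \text{(analogously for } \bar{y}_c \text{ and } \bar{z}_c\text{)}

% Nuance EX:  \lambda_0 = 450 nm, correction non-zero on [450, 520] nm.
% Specim IQ:  \lambda_0 = 400 nm, correction non-zero on [400, 420] nm.

% Conversion to CIE XYZ with the corrected CMFs (k is an assumed
% normalisation constant; a common colorimetric choice is
% k = 1 / \int g(\lambda)\,\bar{y}(\lambda)\,d\lambda).
X = k \int f(\lambda)\, g(\lambda)\, \bar{x}_n(\lambda)\, d\lambda, \quad
Y = k \int f(\lambda)\, g(\lambda)\, \bar{y}_n(\lambda)\, d\lambda, \quad
Z = k \int f(\lambda)\, g(\lambda)\, \bar{z}_n(\lambda)\, d\lambda
```

Under this reading, the weight that the original CMF assigns to the unobserved wavelengths is transferred to the nearest observed wavelengths, so the total weight of each CMF is approximately preserved.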
After the RGB conversion, following the proposal by Magnusson et al. (2020) to avoid images looking overly dark, we apply the following gamma correction to all the RGB pixels
$$\gamma(x) = \begin{cases} 12.92\,x, & x \le 0.0031308 \\ 1.055\,x^{0.416} - 0.055, & \text{otherwise} \end{cases} \tag{5}$$
where x is either the red, green, or blue intensity of each pixel (the correction is applied to all the RGB channels).
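A minimal sketch of the gamma correction in Eq. 5 is given below. It uses the exponent 0.416 as printed in the text (the sRGB standard uses 1/2.4, of which 0.416 is a truncation). The function names are hypothetical, and the input is assumed to be linear RGB in [0, 1].

```python
def gamma_correct(x):
    """Piecewise gamma correction of Eq. 5 for one linear intensity in [0, 1]."""
    if x <= 0.0031308:
        return 12.92 * x                    # linear segment near black
    return 1.055 * x ** 0.416 - 0.055       # power-law segment (exponent as printed)


def gamma_correct_pixel(rgb):
    """Apply the same correction independently to the R, G and B channels."""
    return tuple(gamma_correct(channel) for channel in rgb)


# Example: a dark linear pixel becomes noticeably brighter after correction.
corrected = gamma_correct_pixel((0.0, 0.2, 1.0))
```

Black and white are left unchanged while mid-tones are lifted, which is what prevents the reconstructed images from looking overly dark.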

Types of input and segmentation model
In this study, we evaluate model performance for pixel classification on the ODSI-DB dataset when different forms of data input are employed (see Fig. 3). We use rgbimage to refer to the RGB image reconstructed from the whole spectral range using a colour matching function, as explained in Sec. 3.2. As the hyperspectral images in ODSI-DB have a different number of bands (450-950 nm with 10 nm steps for the Nuance EX, and 400-1000 nm with 3 nm steps for the Specim IQ), we linearly interpolate the images from both cameras to a fixed set of 170 evenly-spaced bands in the 450-950 nm range. As in most of the recent body of work in biomedical segmentation, backed up by the state-of-the-art results in most biomedical segmentation challenges, we chose the encoder-decoder 2D U-Net (Ronneberger et al. 2015) as our go-to model to build the segmentation baseline. For the pixel-wise experiments (rgbpixel and spixel), a network with an equivalent number of 1 × 1 filters and skip layers was used. The network hyperparameters for each input type were tuned on a random 10% of the images contained in the training set.
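The band resampling described above can be sketched for a single pixel spectrum as follows. This is a pure-Python illustration and not the authors' implementation. The helper names are hypothetical, and the Specim IQ grid is assumed here to be evenly spaced between 400 and 1000 nm, which the text only states approximately.

```python
from bisect import bisect_right


def linspace(start, stop, num):
    """Return `num` evenly spaced values from start to stop (inclusive)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def resample_spectrum(src_wl, src_values, dst_wl):
    """Linearly interpolate one pixel spectrum from src_wl onto dst_wl.

    src_wl must be sorted; dst_wl is expected to lie within its range.
    """
    out = []
    for wl in dst_wl:
        # Right-hand neighbour index, clamped so that hi - 1 and hi are valid.
        hi = min(max(bisect_right(src_wl, wl), 1), len(src_wl) - 1)
        lo = hi - 1
        t = (wl - src_wl[lo]) / (src_wl[hi] - src_wl[lo])
        out.append((1.0 - t) * src_values[lo] + t * src_values[hi])
    return out


TARGET_WL = linspace(450.0, 950.0, 170)      # common 170-band grid
nuance_wl = linspace(450.0, 950.0, 51)       # 51 bands, 10 nm steps
specim_wl = linspace(400.0, 1000.0, 204)     # 204 bands, ~2.96 nm steps (assumed even)

# Toy spectra standing in for one pixel from each camera.
nuance_pixel = [0.001 * wl for wl in nuance_wl]
specim_pixel = [0.5 for _ in specim_wl]
nuance_170 = resample_spectrum(nuance_wl, nuance_pixel, TARGET_WL)
specim_170 = resample_spectrum(specim_wl, specim_pixel, TARGET_WL)
```

After resampling, pixels from both cameras share the same 170-dimensional feature vector, so a single network can be trained on the mixed dataset.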

Hyperspectral vs RGB as feature vectors
Hyperspectral images have a higher spectral resolution than RGB images. As we are interested in seeing whether this increased resolution translates into a higher degree of discrimination among feature vectors, we run t-SNE (van der Maaten and Hinton 2008) on 10 million randomly selected pixels, with an equal number of pixels picked from each image. For visualisation purposes, we reduce the dimensionality from hyperspectral and RGB to 2D. The hyperparameters used for the experiment were 10, 200, and 1000 for perplexity, learning rate, and number of iterations, respectively. The initialization was performed with PCA. The visualization of the 2D features is shown in Fig. 4. As can be observed in this figure, in the Nuance EX visualization, the hyperspectral plot shows a boundary between oral mucosa and gingiva (attached and marginal) which is blurred in RGB, and also a clearer separation between attached and marginal gingiva themselves. In the Specim IQ visualization, the hyperspectral plot displays a sharper edge between skin and gingiva (attached and marginal), and also between hair and specular reflections.

[Figure 4 panels: Nuance EX (RGB, hyperspectral) and Specim IQ (RGB, hyperspectral).]

Evaluation protocol
For performance testing purposes, the 215 annotated images provided in ODSI-DB are partitioned into training (90%) and testing (10%). The training/testing split was generated randomly. We consider a training/testing split as valid when all the classes are represented in the training set (so we can perform inference on images containing any of the classes). To generate a valid training/testing split, we proceed as follows. For each class, we make a list of all the images that contain pixels of that class. We randomly pick one of those images and put it in the training set. After looping over all the classes, the training set contains at least one image with pixels of each class. We split the remaining images into training and testing with a probability p=0.5. As there are classes whose pixels can be found only in one or two images of the dataset, not all the classes are present in the testing split. This is, for example, the case of the classes fibroma, makeup, malignant lesion, fluorosis, and pigmentation. Therefore, for reporting purposes, we ignore heavily underrepresented classes and concentrate on those tissue classes with at least 1 million pixel samples: skin, oral mucosa, enamel, tongue, lip, hard palate, attached gingiva, soft palate, and hair. In addition, we report class-based results and image-based results. The class-based results are computed by taking all the annotated pixels contained in the testing images as a single set. A confusion matrix is then built for each class, where the positives are the pixels of that class, and the negatives are the pixels of all the other classes. Sensitivity, specificity, accuracy and balanced accuracy (arithmetic mean of sensitivity and specificity) are reported for each class. An average across classes is also reported. To obtain image-based results, we compute the average accuracy across all the images of the testing set, where the accuracy of any given image is calculated as the ratio of the number of pixels accurately classified (regardless of the class) to the number of pixels annotated in the image.
The class-based results are shown in Tables 1, 2, 3, 4 for the input modes rgbpixel, spixel, rgbimage, simage, respectively. The class-based comparison between RGB and hyperspectral pixel inputs led to close results, except for the enamel and lip classes, where hyperspectral pixels helped improve the performance by 4 pp and 6 pp, respectively. The balanced accuracy over classes showed a slight improvement of 1.1 pp when using multiple bands. However, when comparing the accuracy averaged over images, where better-represented classes (i.e. skin, oral mucosa, enamel, tongue, lip) have a higher weight, hyperspectral pixel inputs improved the accuracy by 10 pp over RGB pixel inputs.
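The split procedure and the two kinds of metrics defined in the evaluation protocol can be sketched as below. This is a hedged illustration and not the released code. The function names, the `image_classes` mapping, and the `seed` argument are hypothetical, and labels are assumed to be flat lists of annotated pixels only.

```python
import random


def make_valid_split(image_classes, p_train=0.5, seed=0):
    """Split image ids so that every class appears in the training set.

    image_classes: dict mapping image id -> set of classes annotated in it.
    """
    rng = random.Random(seed)
    train, test = set(), set()
    all_classes = sorted(set().union(*image_classes.values()))
    # Step 1: for each class, move one randomly chosen image containing it to training.
    for cls in all_classes:
        candidates = sorted(i for i, c in image_classes.items() if cls in c)
        train.add(rng.choice(candidates))
    # Step 2: assign every remaining image to training with probability p_train.
    for image_id in sorted(image_classes):
        if image_id not in train:
            (train if rng.random() < p_train else test).add(image_id)
    return train, test


def one_vs_rest_metrics(y_true, y_pred, cls):
    """Sensitivity, specificity, accuracy and balanced accuracy for one class.

    Assumes the class has at least one positive and one negative pixel.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy, (sensitivity + specificity) / 2


def image_accuracy(y_true, y_pred):
    """Fraction of annotated pixels of one image that are correctly classified."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

For the class-based tables, `one_vs_rest_metrics` would be called once per class on the pooled pixels of all testing images, whereas `image_accuracy` would be called per image and then averaged, which gives more weight to well-represented classes.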
When comparing the class-based rgbimage and simage results, a mild improvement is observed when using the extended spectral range. The average balanced accuracy achieved was 74.39% for RGB reconstructions, and 76.18% for hyperspectral images. These results, along with the 10 pp gap when moving from single-pixel inputs to images, suggest that, without considering DL architecture changes, the spatial information is the main driver of segmentation performance in dental imaging. Nonetheless, common dental conditions such as calculus, gingiva erosion, and caries are related to two classes in particular, attached gingiva and enamel. While the enamel is distinct from the rest of the tissue, and relatively trivial to spot in an RGB image, the attached gingiva is better segmented when hyperspectral information is available.
The image-based results (see Table 5) display a clear performance gap (> 10 pp) between RGB and hyperspectral pixel inputs. This gap comes down to 2 pp when comparing RGB to hyperspectral images.

Conclusions
In this work we performed an ablation study to discern how the availability of spatial and spectral information impacts the segmentation performance. We reported baseline results for four types of input data: single RGB pixels, hyperspectral pixels, RGB images and hyperspectral images. In addition, we provided an improved method to reconstruct the RGB images from the hyperspectral data provided in ODSI-DB.
We reported a mild improvement in the segmentation results on ODSI-DB when using hyperspectral information.However, the main driver of segmentation performance for the dental anatomy present in the dataset seems to be the availability of spatial information.It is when moving from pixel classification to full image segmentation that we reported the largest rise in segmentation performance.
Future work can proceed in several directions. An interesting research question is whether, by means of hyperspectral imaging, we can mitigate the annotation effort, which is one of the current issues in the CAI field. That is, the mild improvement in segmentation performance achieved with hyperspectral inputs could potentially be exploited to annotate fewer images without sacrificing segmentation performance. This would be particularly interesting for the field of hyperspectral endoscopy, as it represents an additional benefit in favour of the use of hyperspectral endoscopes. Another future direction is the exploration of convolutional architectures that take advantage of the hyperspectral nature of the data. Current state-of-the-art models such as U-Net have been optimised for RGB images; hence, by simply replacing the input we run the risk of not taking full advantage of the hyperspectral information available.
TV is supported by a Medtronic / RAEng Research Chair [RCSRF1819\7\34]. CH is supported by an InnovateUK Secondment Scholars Grant (Project Number 75124). For the purpose of open access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

Disclosure statement
TV, ME, and SO are co-founders and shareholders of Hypervision Surgical. TV also holds shares in Mauna Kea Technologies.

Figure 2. Original and modified color matching functions.

Figure 3. The four types of input compared are single RGB pixels (rgbpixel), single hyperspectral pixels (spixel), RGB images (rgbimage), and hyperspectral images (simage). The format used to define the dimensions of the input is H, W, C, where H, W, and C represent height, width, and channels, respectively.

Figure 4. t-SNE of 10 million randomly selected pixels (evenly distributed across images) from the ODSI-DB dataset. The dataset contains images captured with two different cameras, Nuance EX (51 bands) and Specim IQ (204 bands), hence the separate plots.

Table 1. Class-based pixel classification results for the rgbpixel mode. To generate this table, all the pixels contained in the images of the testing set are considered as a single set. All the results are provided as percentages.

Table 2. Class-based pixel classification results for the spixel mode. To generate this table, all the pixels contained in the images of the testing set are considered as a single set. All the results are provided as percentages.

Table 3. Class-based pixel classification results for the rgbimage mode. To generate this table, all the pixels contained in the images of the testing set are considered as a single set. All the results are provided as percentages.

Table 4. Class-based pixel classification results for the simage mode. To generate this table, all the pixels contained in the images of the testing set are considered as a single set. All the results are provided as percentages.

Table 5. Image-based accuracy for the different input types. The presented accuracy is the average over the images in the testing set. The accuracy for a single image is computed as the ratio of the number of pixels correctly predicted to the total number of annotated pixels in the image. As when presenting class-based results, only the tissue classes with more than 1M pixels in the dataset have been considered. Accuracy results are provided as percentages.