Computerized Medical Imaging and Graphics

Image texture is a very important component in many types of images, including medical images. Medical images are often corrupted by noise and affected by artifacts. Some of the texture-based features that should describe the structure of the tissue under examination may also reflect, for example, the uneven sensitivity of the scanner within the tissue region. This in turn may lead to an inappropriate description of the tissue or incorrect classification. To limit these phenomena, the analyzed regions of interest are normalized. In texture analysis methods, image intensity normalization is usually followed by a reduction in the number of levels coding the intensity. The aim of this work was to analyze the impact of different image normalization methods and the number of intensity levels on texture classification, taking into account noise and artifacts related to uneven background brightness distribution. Analyses were performed on four sets of images: modified Brodatz textures, kidney images obtained by means of dynamic contrast-enhanced magnetic resonance imaging, shoulder images acquired as T2-weighted magnetic resonance images, and CT heart and thorax images. The results will be of use for choosing a particular method of image normalization, based on the types of noise and distortion present in the images. © 2020 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Image texture is one of the most important components of many types of images, including medical images. Visualization by means of various medical imaging modalities represents the properties of internal organs and tissues in terms of texture. The texture of the cross-sections of such structures, observed for example in tomographic imaging, provides information that can further support the diagnostic process. Texture parameters reflect the physiological properties of tissues. This enables, inter alia, the segmentation of organs, the detection of lesions and an assessment of the degree of pathological changes. The significance of texture analysis for aided imaging diagnostics has been demonstrated for all kinds of imaging modalities, including computed tomography (CT) (Thevenot et al., 2013), magnetic resonance imaging (Larroza et al., 2017), ultrasound (Chrzanowski et al., 2008) and optical imaging. Medical images are often corrupted by noise; techniques for its removal are documented in review papers (Ouahabi, 2013; Priya et al., 2017; Somkuwar and Bhargava, 2013). There are two general issues with noise removal techniques. Such techniques can be tedious (Pieciak et al., 2017), and their application often causes changes to the image content, especially at the texture level (Aja-Fernández et al., 2009). Thus, the susceptibility of texture analysis methods to noise should be considered.
The results of texture analysis depend on the image acquisition parameters. These include image resolution, the size of the field of view (FoV), the matrix size and the background brightness heterogeneity. In addition, analyses performed on images of different patients often use different values for the acquisition parameters. This may cause the brightness and contrast of individual regions of interest (ROIs) to differ. The values can vary for ROIs within an image (Styner et al., 2005) and for ROIs in subsequent images. As a result, the values of some textural features depend not only on the texture, but also on the average brightness and/or contrast of the ROI (Materka et al., 2000). For this reason, some of the features that should describe the structure of the tissue under examination may also reflect, for example, the scanner's uneven sensitivity within the analyzed tissue region (Materka and Strzelecki, 2013). This may in turn lead to an inappropriate description of the tissue or incorrect classification. To limit these phenomena, the ROIs are normalized (Loizou et al., 2009).
The aim of this work was to analyze the impact of different image normalization methods on the results of texture classification, with attention to noise and artifacts related to uneven background brightness distribution. The results may be useful for choosing a particular method of image normalization based on the types of noise and interference present in the images. In texture analysis methods, image intensity normalization is usually followed by a reduction in the number of levels coding the intensity. For this reason, the influence of the number of intensity levels on the accuracy of texture classification was also examined. This study constitutes an extension of the research described in (Kociolek et al., 2018). The analyses presented here were performed for textures from the Brodatz album (Brodatz, 1966), for kidney area images obtained by means of dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) (Eikefjord et al., 2017), for shoulder T2-weighted MRI images (Obuchowicz et al., 2019) and for thorax CT images. Brodatz textures were used for the simulation of various types of noise and background non-uniformity artifacts. The paper is structured as follows. Section 2 presents the concept of normalization and intensity level reduction for texture analysis. Section 3 describes the data utilized in our study as well as the experimental procedure. Experimental results are discussed in Section 4. A summary of the experiments is presented in Section 5.

Normalization and intensity level reduction for texture analysis
A typical procedure for obtaining texture-based features is shown in Fig. 1. Normalization usually constitutes the first step of the procedure. It stretches the image histogram in the ROI over the entire available intensity range. This has two effects: there is more contrast between the bright and dark structural elements creating the investigated texture, and the influence of the local mean intensity is reduced. Both of these effects usually improve the quality of the features obtained. The next step in the feature estimation procedure is the reduction of the number of intensity levels used for texture representation. Usually, the intensity ranges in ROIs are smaller than the intensity range of the whole image. For example, for 16 bit coded images there are 65 536 intensity levels, so normalization to the full range will produce a sparse histogram. Moreover, in the case of features based on the gray level co-occurrence matrix (GLCM) (Haralick, 1979; Haralick et al., 1973), which is a two-dimensional matrix, the matrix for 16 bit images will have 65 536 × 65 536 = 4 294 967 296 elements. A matrix of this size would cause an unnecessary increase in the computation time needed to estimate GLCM-based features.
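To make the cost argument concrete, a minimal sketch of GLCM estimation for a single pixel offset is given below (the function and variable names are ours, not QMaZda's); the matrix has levels × levels entries, which is why the number of intensity levels is reduced before estimation:

```python
import numpy as np

def glcm(img, levels, dx=1, dy=0):
    """Gray level co-occurrence matrix for a single pixel offset.

    `img` must already be quantized to integers in [0, levels - 1];
    the result has levels x levels entries, so a 16 bit image without
    level reduction would need a 65536 x 65536 matrix (~4.3e9 entries).
    """
    m = np.zeros((levels, levels), dtype=np.int64)
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            m[img[y, x], img[y + dy, x + dx]] += 1
    return m

# Tiny quantized image (3 levels): the 3x3 matrix records how often
# each ordered pair of levels co-occurs at horizontal offset 1.
quantized = np.array([[0, 0, 1],
                      [0, 1, 1],
                      [2, 2, 2]])
m = glcm(quantized, levels=3)
```

For a 61 × 61 ROI only 3660 pixel pairs exist per offset, so almost all entries of a large matrix would be zero; reducing the number of levels keeps the matrix densely populated.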
Normalization and reduction of intensity levels are very often performed simultaneously. It is worth noting that in some software packages the number of intensity levels is reduced to the range 0, . . ., 2^k − 1. The procedure can be described by Eq. (1):

N(x, y) = round[ (I(x, y) − min_norm) / (max_norm − min_norm) · (2^k − 1) ]   (1)

where min_norm is the minimum normalized value, max_norm is the maximum normalized value, and k is the number of image bits per pixel after normalization. The image normalization techniques investigated in this study were as follows:
• min-max - in this type of normalization, min_norm and max_norm from Eq. (1) are the minimum and maximum intensities taken directly from the histogram. Such normalization is useful if the histogram of the investigated texture is range uniform (Fig. 2a).
• 1 %-99 % - for this normalization, min_norm is the intensity at which the cumulative histogram reaches 1 % and max_norm is the intensity at which it reaches 99 %. This type of normalization should be useful, for example, for range-uniform intensity distributions with a limited number of outliers (Fig. 2b). Such outliers are very often caused by dead pixels in imaging sensors.
• ±3σ - the range of intensities for this normalization is defined as min_norm = μ − 3σ and max_norm = μ + 3σ, where μ is the mean intensity and σ is the standard deviation of the image intensities in the ROI. Such normalization is useful when the intensity histogram of the texture is close to a Gaussian distribution.
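The three normalization schemes, combined with the quantization of Eq. (1), can be sketched as follows. This is an illustrative sketch only: the function and parameter names are ours, QMaZda's actual implementation may differ, and values falling outside the 1 %-99 % or ±3σ range are clipped here to the valid code range:

```python
import numpy as np

def normalize(roi, method="min-max", k=8):
    """Normalize ROI intensities and quantize to 2**k levels per Eq. (1)."""
    roi = roi.astype(np.float64)
    if method == "min-max":            # extremes taken from the histogram
        lo, hi = roi.min(), roi.max()
    elif method == "1-99":             # 1st and 99th percentile intensities
        lo, hi = np.percentile(roi, [1, 99])
    elif method == "3sigma":           # mean +/- 3 standard deviations
        mu, sigma = roi.mean(), roi.std()
        lo, hi = mu - 3 * sigma, mu + 3 * sigma
    else:
        raise ValueError(method)
    n = np.rint((roi - lo) / (hi - lo) * (2 ** k - 1))
    return np.clip(n, 0, 2 ** k - 1).astype(np.int64)

roi = np.array([[0, 255],
                [128, 64]])
out = normalize(roi, "min-max", k=3)   # codes 0..7
```

All ROIs entering one classification experiment must, of course, be normalized with the same method and the same k, as discussed later in the paper.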

Material and methods
The analyses described in this study were performed on four sets of images.

Test set 1 -Brodatz images
The first set consisted of modified images from the Brodatz album (Brodatz, 1966).
Twelve textures from the Brodatz album were used to build the test set (Fig. 3). The original 256 gray level images, with dimensions of 512 × 512 pixels, were downloaded from (SIPI, 2020). Fig. 4 shows the intensity histogram for an example Brodatz image without any added transformations or distortions.
The original images were converted into 32 bit signed format and the following image distortion combinations were applied to each image:
• no distortion;
• Gaussian noise with standard deviations of 6.4, 12.8 and 25.6, constituting respectively 2.5 %, 5 % and 10 % of the full intensity range of the original image;
• Rician noise with an s parameter of 6.4, 12.8 and 25.6, constituting 2.5 %, 5 % and 10 % of the full intensity range of the original image;
• low frequency background intensity non-uniformity in the form of an added intensity gradient from the top left to the bottom right of the image;
• a combination of Gaussian noise and background intensity non-uniformity;
• a combination of Rician noise and background intensity non-uniformity.
Additive Gaussian noise was obtained by the application of Eq. (2):

I_GN(x, y) = I(x, y) + k · rand(μ = 0, σ = 1)   (2)

where I_GN(x, y) is the intensity of the image with Gaussian noise, I(x, y) is the intensity of the original image, rand(μ = 0, σ = 1) is a randomly generated value with a Gaussian distribution of mean μ = 0 and standard deviation σ = 1, and k is the noise level coefficient. Rician noise was generated in the way described in (Ridgway, 2020) by means of Eq. (3):

I_RN(x, y) = sqrt( [I(x, y) + s · rand(μ = 0, σ = 1)]² + [s · rand(μ = 0, σ = 1)]² )   (3)

where I_RN(x, y) is the intensity of the image with Rician noise, s is the noise level coefficient, and the two rand(μ = 0, σ = 1) terms are drawn independently. Background intensity non-uniformity was created by the addition of an intensity gradient to the image, according to Eq. (4).
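The distortion models above can be sketched in a few lines of numpy; Eq. (4) is not reproduced in the text, so the linear top-left to bottom-right ramp used here is our assumption about its form:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, k):
    """Eq. (2): I_GN = I + k * rand(mu=0, sigma=1)."""
    return img + k * rng.standard_normal(img.shape)

def add_rician_noise(img, s):
    """Eq. (3): magnitude of a complex signal whose real and imaginary
    parts carry independent Gaussian noise of standard deviation s."""
    re = img + s * rng.standard_normal(img.shape)
    im = s * rng.standard_normal(img.shape)
    return np.sqrt(re ** 2 + im ** 2)

def add_gradient(img, strength=255.0):
    """Background non-uniformity: a linear intensity ramp rising from
    the top-left to the bottom-right corner (assumed form of Eq. (4))."""
    h, w = img.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return img + (xx + yy) * strength / (h + w - 2)
```

Note that Rician noise, unlike Gaussian noise, is signal dependent and biases the mean intensity upward, which is why it distorts the ROI histograms differently.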
After the application of distortion, an intensity offset of 1000 was added to each image in order to remove negative intensity values. The resulting images were converted and stored in 16 bit Tagged Image File format. Fig. 5 shows a small piece of an example test image (taken from the upper left corner of texture Br02 from Fig. 3), accompanied by the same image with an offset of 1000 added and by the image combined with both the offset and intensity non-uniformity (the latter two represented in pseudocolors for better visualization of the introduced artifacts). In total, 13 combinations of distortion were applied to each of the 12 base images. For each image, a set of 64 uniformly distributed non-overlapping ROIs was defined. Each ROI is a square of 61 × 61 pixels. Of the 64 ROIs, 13 were used as a training set and the remaining 51 were used as the test set (Fig. 7). Fig. 8 shows histograms for images without distortion for an ROI located close to the center of the image (ROI #28, marked in Fig. 7).
As can easily be seen, each image has a different histogram and none of the histograms closely follows the model histograms from Fig. 2. Fig. 9 shows histograms of an example image (namely Br02) and selected ROIs after the application of 5 % Rician noise and background non-uniformity.
Clearly, noise modifies the shape of the histogram and low frequency background non-uniformity offsets intensities inside the ROI.

Test set 2 -kidney images (DCE MRI)
The second set analyzed in this work consisted of 130 dynamic contrast-enhanced magnetic resonance images (DCE-MRI). This was a subset of the images described in (Eikefjord et al., 2017) and used in (Kociołek et al., 2019). The original set consists of 20 sequences for 10 healthy nonsmoking volunteers, acquired by means of a 1.5 T Siemens Magnetom Avanto scanner (Erlangen, Germany) using a standard phased-array coil. Two acquisitions for each volunteer were taken at an interval of 7 days. Each time sequence includes 74 volumetric images. Each volumetric image was acquired by means of a 3D spoiled gradient-recalled (SPGR) pulse sequence with the following parameters: flip angle FA = 20°; echo time TE = 0.8 ms; repetition time TR = 2.36 ms; parallel imaging factor PIF = 3; field of view 425 mm × 425 mm × 90 mm; voxel size 2.135 mm × 2.135 mm × 3 mm. This results in a 3D volume of 192 × 192 × 30 voxels, which can be treated as 30 slice images with dimensions of 192 × 192 pixels. The data were obtained from the Haukeland University Hospital, Bergen, Norway. All processed image data were anonymized before being handed to us. All volunteers gave their written informed consent for participation in the examinations. The study protocol was approved by the Institutional Review Board at the Haukeland University Hospital, Bergen, Norway.
Each time sequence captures the filtration action of the kidneys. A bolus injected into the bloodstream of the patient is filtered out to the urinary tract. As it passes through the kidney structures, it causes changes in the kidney image. Although the kidney region does not constitute a homogeneous region, in (Kociołek et al., 2019) we found that the values of multiple textural features are related to the state of the kidney during the filtration process. In this work, we therefore use textural features computed over the whole kidney region. Each image was accompanied by a corresponding ROI mask. These masks were prepared semi-manually. We noticed that for a given slice the shape of the kidney remained unchanged during respiratory movements. Thus, the shape of the kidney was manually outlined for the first time frame. For consecutive frames, only the position of the mask was adjusted manually in order to match the position of the kidney. For each time sequence, slices containing a clear view of the medulla and pelvis were selected. Usually there were two such slices in each time sequence. This gives 2960 images (2 slices × 74 time frames × 2 sequences × 10 patients). From the available images, we handpicked 130 belonging to one of three classes:
• no contrast agent in the kidney (43 images);
• contrast agent in the right kidney cortex (42 images);
• contrast agent in the pelvis (45 images).
Half of the images from each class were selected randomly as the training set. The other half was used as the test set. Fig. 10 shows an example of a DCE-MRI image with the superimposed contour of the right kidney ROI for two volunteers.

Test set 3 -shoulder images (MRI T2 weighted)
The third set analyzed in this work was composed of T2-weighted MRI images acquired with different acquisition times. The study protocol was designed according to the guidelines of the Declaration of Helsinki and the Good Clinical Practice Declaration Statement. Special care was taken regarding personal data safety, with all images anonymized before processing. Written approval to conduct the study was obtained from the Ethics Committee of the Jagiellonian University (no. 1072.6120.15.2017, dated 20.06.2017). This was a subset of the images described in (Obuchowicz et al., 2019). The original set consists of 12 T2-weighted coronal volumetric images of the shoulder taken from patients between 23 and 56 years of age. Volumetric images were acquired using a 1.5 T Siemens Essenza scanner (Erlangen, Germany) equipped with 12 dedicated table coils and 8-channel shoulder coils. A regular GRAPPA1 sequence and three shortened sequences, GRAPPA2, GRAPPA3 and GRAPPA4, were performed using parallel imaging. Each volumetric image for GRAPPA1 was acquired by means of a T2-weighted sequence with the following parameters: flip angle FA = 150°; echo time TE = 101 ms; repetition time TR = 3.54 ms; phase oversampling of 100; slice spacing 3.9 mm; field of view 200 mm × 200 mm × 60 mm; voxel size 0.625 mm × 0.625 mm × 3 mm. GRAPPA2 has the acquisition time reduced by a factor of 2, GRAPPA3 by a factor of 4, and GRAPPA4 by a factor of 8. Each GRAPPA sequence results in a 3D volume of 320 × 320 × 20 voxels, which can be treated as 20 slice images with dimensions of 320 × 320 pixels. As can be seen from Fig. 12, reducing the acquisition time deteriorates image quality, introducing distortions and noise.
We handpicked 71 images of the shoulder. On each image, three circular ROIs with diameters of 21 pixels were marked (Fig. 12). These contained the humerus epiphysis (bone), muscle and fat regions, respectively. The training set consisted of 36 images. The remaining 35 images were used as the test set.

Test set 4 -thorax images (CT)
The last set analyzed in this work consisted of heart CT cross-sections obtained from the same raw data with different reconstruction filters. Images were acquired using a Siemens SOMATOM CT scanner (Erlangen, Germany). The parameters used were as follows: X-ray tube current TC = 1082 A; X-ray tube voltage TV = 120 kV; exposure time TE = 285 ms; generator power P = 65 W; slice thickness 0.6 mm; field of view 205 mm × 205 mm; pixel size 0.4 mm × 0.4 mm; image size 512 × 512. Three different reconstruction kernels were applied, with different degrees of image sharpness and different purposes (depending on the type of organ for which they provided the best visualization). These were: HeartView smooth (B26f), Medium smooth (B30f), and HeartView medium (B36f). Sharper filters generate higher noise in the reconstructed images. To partly remove this noise, the SAFIRE iterative reconstruction-based filter was used at level 3 (Grant and Raupach, 2012). It was demonstrated in (Völgyes et al., 2017) that the application of different reconstruction kernels (resulting in different noise levels) significantly affects the resulting image quality, including the possibility of low contrast tissue detection. The influence of CT heart image quality (expressed in terms of the applied reconstruction kernels) on the possibility of building an accurate 3D/4D model of the heart was investigated in (Bielecka and Piórkowski, 2015). Our aim was to determine whether image normalization can influence tissue classification in such data, given the different noise levels generated by the kernels.
Particular care was taken to ensure the safety of personal data. All processed image data were anonymized before being handed to us. Written informed consent for the processing and publication of anonymized clinical images was obtained from the management of the Department of Diagnostic Imaging at Jagiellonian University Medical College. The usual requirement for informed consent from patients was waived in view of the retrospective nature of the research.
Five slices were obtained for two patients. On each image, three sets of circular ROIs with diameters of 35 pixels were marked (Fig. 13a): one containing the muscle region, the second containing the heart region and the third containing the aorta region. (Fig. 13 shows CT images of a thorax obtained using the three reconstruction filter kernels B26f, B30f and B36f, together with histograms of example ROIs; three ROI groups are marked on the B26f image: red circles - muscle, green circles - heart, blue circles - aorta. For interpretation of the references to colour in the figure, the reader is referred to the web version of this article.) The ROIs for the muscle and heart did not overlap, while the ROIs for the aorta partially overlapped. In total, there were 45 ROIs in the muscle region, 56 ROIs in the heart region and 20 ROIs in the aorta region. Five images, three slices from the first patient and two from the second, were used as a training set. The remaining five images were used as a test set.

Test methodology
Three normalization schemes were tested: min-max, 1 %-99 % and ±3σ, followed by a reduction in the number of intensity levels (intensity binning/quantization) to 8, 16, 32, 64, 128 and 256 levels (corresponding to 3, 4, 5, 6, 7 and 8 bits coding intensity). Fig. 14 shows an example image from the MR kidney test set after the application of the normalization methods with 8, 32, and 128 intensity levels. Normalization was performed based on the intensity distribution of the right kidney region. Fig. 15 shows histograms of the right kidney region for the images from Fig. 14.
For each image, ROI and combination of normalization scheme and number of intensity bins, a set of textural features was calculated, drawn from six families: Gabor transform-based, GLCM-based, gradient-based, gray level run length matrix (GLRM)-based, histogram of oriented gradients (HOG)-based and wavelet-based features. Feature calculations (including normalization and quantization) were performed using the QMaZda software (Szczypinski et al., 2017). A detailed description of the features can be found in (Szczypiński, 2018).
For each type of distortion, a series of classification experiments was performed, involving features calculated for each combination of normalization scheme and quantization level. The classification experiments were based on linear discriminant analysis (LDA) built into the MzReport tool from the QMaZda package. Each classification experiment involved two phases:
• supervised training performed on the training set;
• a testing phase performed on the testing set.
During the training phase, a series of binary classifiers was trained separately for each pair of classes. The following algorithm was used in the training phase for each classifier:
1) All feature vectors associated with the training set were loaded into the MzReport tool.
2) All feature values were standardized according to their mean and standard deviation.
3) The best combination of three features was found for each pair of classes according to their discriminative power using LDA.
As a result of the training phase, a set of (K² − K)/2 classifiers for K classes was obtained, each utilizing the best ranked combination of three features.
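The pairwise training scheme and the unanimity decision rule used later in the testing phase can be sketched as follows. This is a simplified illustration: a plain two-class Fisher LDA stands in for QMaZda's MzReport implementation, and the exhaustive search for the best three-feature subset is omitted (all features are used):

```python
import numpy as np
from itertools import combinations

def fit_lda2(X0, X1):
    """Two-class Fisher LDA: project onto w = Sw^-1 (mu1 - mu0) and
    threshold at the midpoint between the projected class means."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0.T) + np.cov(X1.T)          # pooled within-class scatter
    w = np.linalg.solve(Sw + 1e-9 * np.eye(X0.shape[1]), mu1 - mu0)
    t = w @ (mu0 + mu1) / 2.0
    return w, t                               # vote for class 1 iff w @ x > t

def train_pairwise(X, y):
    """One binary classifier per unordered class pair: (K^2 - K)/2 in total."""
    return {(a, b): fit_lda2(X[y == a], X[y == b])
            for a, b in combinations(sorted(set(y)), 2)}

def classify(clfs, x, classes):
    """Unanimity rule: a sample is assigned to a class only if all K - 1
    classifiers involving that class vote for it; otherwise None is
    returned ("no decision", counted as a classification error)."""
    votes = {c: 0 for c in classes}
    for (a, b), (w, t) in clfs.items():
        votes[b if w @ x > t else a] += 1
    winners = [c for c in classes if votes[c] == len(classes) - 1]
    return winners[0] if len(winners) == 1 else None

# Three synthetic, well-separated classes in a 2D feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2))
               for c in [(0, 0), (10, 0), (0, 10)]])
y = np.repeat([0, 1, 2], 20)
clfs = train_pairwise(X, y)
```

For K = 3 classes this yields (3² − 3)/2 = 3 classifiers, and a sample must win both of its class's pairwise votes to be assigned.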

Results
Classifiers trained during the training phase were used in the testing phase of the experiment. In the testing phase, feature vectors calculated for the test set were fed to the set of classifiers. This resulted in a confusion matrix, shown in Fig. 16.
The sample under consideration was assigned to a given class only if all (K − 1) classifiers for that class agreed. Otherwise, no decision was reported. This situation was also treated as a classification error (the "no decision" column in Fig. 16). Fig. 17 presents the results of classification experiments for Brodatz images distorted by means of Rician noise and background non-uniformity (detailed results for all considered distortions can be found in the Supplementary Information). The classification error is reported for each combination of normalization technique and number of intensity levels, separately for three noise levels (2.5 %, 5 % and 10 %) and six feature groups. In the case of Gabor-based features, ±3σ normalization gives the best classification results. Changing the number of intensity levels has a limited influence on the classification results, although there is a slight increase in error for small numbers of intensity levels. This set of features was relatively immune to moderate levels of noise (2.5 % and 5 %) and to background non-uniformity.
For GLCM-based features, there was no clear trend in classification error versus normalization method and number of intensity levels. However, the lowest classification error overall was observed for GLCM-based features. The maximum observed error was below 10 %, regardless of the number of intensity levels and normalization scheme. For low and moderate noise levels (2.5 % and 5 %, Rician and Gaussian), the maximum observed error was below 6 %. Similarly, the differences in classification error between different combinations of normalization method and number of intensity levels were low, with a maximum of 6 %.
For gradient-based features, better classification results were obtained mostly for large numbers of intensity levels (128 and 256), but there was no major improvement in comparison to low numbers of intensity levels. For moderate noise levels (2.5 % and 5 %, Rician and Gaussian) without background non-uniformity, 1 %-99 % normalization performs better than the other two methods. For moderate noise levels with background non-uniformity, ±3σ normalization gives the lowest classification error. With higher noise levels, 1 %-99 % normalization performs similarly to ±3σ normalization and both methods perform better than min-max normalization.
The best classification results for GLRM-based features, in all combinations of the tested distortions, were obtained for the lowest numbers of intensity levels. All normalization schemes performed similarly, although ±3σ normalization gave the lowest classification error in 8 out of 14 cases.
For HOG-based features, with a low noise level (2.5 %) and with a combination of a moderate noise level (5 %) and background non-uniformity, the best classification results were obtained for 1 %-99 % normalization. Slightly worse results were obtained for ±3σ normalization. In the case of moderate noise levels without background non-uniformity and with a high noise level (10 %), ±3σ normalization performed slightly better than 1 %-99 % normalization. In most cases, min-max normalization gave the worst classification. The number of intensity levels had a limited influence on the classification results, but in most cases higher numbers of intensity levels gave slightly better classification.
For wavelet-based features, with a low noise level (2.5 %) min-max normalization usually gave the best classification results. For moderate and high noise levels, the best results were usually obtained using ±3σ normalization. Changing the number of intensity levels had a limited influence on the classification outcome, but in some cases the use of 8 intensity levels introduced a slightly higher classification error. Fig. 18 shows the results of classification experiments performed on DCE-MRI images of kidneys.
The kidney images were obtained by means of the DCE-MRI technique, so Rician noise may be present. For Gabor transform-based features, the lowest classification error was observed for min-max normalization with 32 or 64 intensity bins. For GLCM-based features, the best results (classification error equal to 0) were obtained for all normalization methods with 8-64 intensity levels, for 1 %-99 % normalization with 128 intensity levels, and for min-max normalization with 256 intensity levels. For gradient-based features, the lowest classification error was obtained with 1 %-99 % normalization and 32-256 intensity levels. For GLRM-derived features, the best classification results were obtained for ±3σ normalization with 8 intensity levels. A slightly higher but comparable error level was obtained for ±3σ normalization with 64 and 256 intensity levels. For HOG-based features, the lowest classification error was obtained for min-max normalization with 8 intensity levels. For wavelet-based features, the best classification was obtained for min-max normalization with 16 intensity levels. Fig. 19 shows the classification error in the analysis of shoulder MRI images acquired with the GRAPPA1, GRAPPA3 and GRAPPA4 sequences. Only GLCM, GLRM and HOG-based features are presented here, since for the other feature families the minimal classification error was well over 40 %.
For GLCM-based features, min-max normalization with 256 intensity levels gave the lowest classification error for the GRAPPA1 sequence. A similar but slightly higher classification error was obtained for ±3σ normalization with the same number of intensity levels. In the case of the GRAPPA3 sequence, the best classification results were achieved for 1 %-99 % normalization with 8 intensity levels. However, ±3σ normalization with 256 intensity levels gave a similar result. In the case of the GRAPPA4 sequence, the best classification results were achieved for min-max normalization with 32 intensity levels.
In the cases of the GRAPPA1 and GRAPPA3 sequences, the best classification results for GLRM-based features were achieved for 1 %-99 % normalization with 256 intensity levels. In the case of GRAPPA4, the best classification was obtained for min-max normalization with 256 intensity levels.
For HOG-based features, min-max normalization with 16 intensity levels gave the lowest classification error for the GRAPPA1 sequence. A similar but slightly higher classification error was obtained for ±3σ normalization with 256 intensity levels.
In the cases of the GRAPPA3 and GRAPPA4 sequences, the best classification results were achieved for ±3σ normalization with 256 intensity levels. Fig. 20 shows the classification error in the analysis of thorax CT images obtained by means of three reconstruction filters, B26f, B30f and B36f. As in the case of the MRI shoulder results, only GLCM, GLRM and HOG-based features are presented here, because for the other feature families the minimal classification error was well over 30 %.
For GLCM-based features, 1 %-99 % normalization with 256 intensity levels gave the lowest classification error for the B26f reconstruction filter. In the case of the B30f filter, the lowest error was obtained for min-max normalization with 256 intensity levels. However, 1 %-99 % normalization with the same number of intensity levels gave a similar result. For the B36f reconstruction filter, min-max normalization with 128 intensity levels allowed for the lowest classification error. A slightly higher classification error was obtained for 1 %-99 % normalization with 256 intensity levels.
In the case of the B26f reconstruction filter, the best classification result for GLRM-based features was obtained with 1 %-99 % normalization and 8 intensity levels. However, 1 %-99 % normalization with 256 intensity levels gave a similar result. With the B30f filter, the best classification was obtained using ±3σ normalization and 8 intensity levels. A similar but slightly higher classification error was obtained for 1 %-99 % normalization with 256 intensity levels. In the case of the B36f reconstruction filter, the best classification results were achieved with min-max normalization and 16 intensity levels, as well as with min-max and 1 %-99 % normalization with 256 intensity levels.
For HOG-based features and images obtained using the B26f reconstruction filter, the best classification results were obtained with 1 %-99 % normalization and 128 intensity levels. In the case of the B30f filter, the lowest error was achieved for 1 %-99 % normalization with 64 intensity levels. However, 1 %-99 % normalization with 128 intensity levels gave a similar result. For images obtained using the B36f reconstruction filter, the best classification results were achieved with ±3σ normalization and 16 intensity levels.

Discussion and summary
It should be emphasized that changing the normalization technique or the number of intensity levels may lead to a completely new texture feature set. This means that a given feature formula calculated for a different normalization technique and/or number of intensity levels will result in different numerical values. Moreover, such feature values may be uncorrelated. Fig. 21 shows three combinations of normalization and number of intensity levels for the sum of squares feature, calculated based on the GLCM created for direction 0° and offset 1. All three combinations gave different feature values for given image/ROI combinations. The Pearson correlation coefficient between min-max normalization with 256 intensity levels and min-max normalization with 128 intensity levels is equal to 1.0. Between min-max normalization with 256 intensity levels and ±3σ normalization with 256 intensity levels, the Pearson correlation is equal to 0.55. Table 1 shows the percentage of features with a Pearson correlation coefficient greater than 0.9 for different combinations of normalization and numbers of intensity levels, with min-max normalization and 256 intensity levels as the reference.
Some features (Gabor, WDT and HOG-based) were resistant to gray level reduction, while others (GLRM and GLCM-based) were not. It is also worth noting that ±3σ normalization had the greatest effect on the texture properties, so few features correlate with those estimated for the reference texture. In any case, it should be remembered that for a given feature all the analyzed ROIs need to be normalized in the same way, with a constant number of intensity levels.
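The correlation analysis behind Table 1 can be illustrated with a short sketch: the Pearson coefficient is computed between the values of one feature, estimated over the same set of ROIs under two different normalization/quantization settings (the data below are synthetic, for illustration only):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between the values of one feature computed
    over the same ROIs under two normalization/quantization settings."""
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

# A setting change that merely rescales a feature leaves it perfectly
# correlated with the reference; a nonlinear change lowers the coefficient.
r_rescaled = pearson([1, 2, 3, 4], [10, 20, 30, 40])
r_nonlinear = pearson([1, 2, 3, 4], [1, 4, 9, 16])
```

This is why, for example, min-max normalization with 256 versus 128 levels can correlate at 1.0 while a switch to ±3σ normalization drops the correlation substantially.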
Lack of normalization can render features useless for classification problems. A very good example is provided by GLCM-based features. An intermediate step for these features is the calculation of a gray level co-occurrence matrix, which is a two-dimensional (or second-order) histogram. For the 16-bit images utilized in this study, without reducing the intensity levels, such a histogram would be a matrix of 65536 × 65536 elements, i.e. over 4 billion elements. Assuming 32-bit integers as matrix elements, such a structure would occupy 16 GB of memory. On the other hand, it would be a sparse matrix, since there are only 61 × 61 = 3721 pixels in each of the considered ROIs for the Brodatz textures. A simple reduction of the number of intensity levels without normalization leads to a situation in which the pixel values are represented by an extremely limited number of intensities. In the case of the modified Brodatz texture-based images, the intensity values for each image fall between 950 and 1550. Simple binning into 256 intensity levels will result in a maximum of 3 consecutive (2, 3 and 4) intensities. Similar behavior may be observed for the MRI images, where the intensities in the kidney region fall between 0 and 700, and those of bone, muscle and fat fall between 0 and 550. For the tested CT images, the intensities in the considered ROIs fall between 850 and 1350. The application of a normalization technique results in a better representation of the texture in the new intensity range.
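The effect of binning without normalization can be reproduced with a short sketch. The ROI below is synthetic, and the floor-division binning convention is our assumption, so the exact bin indices may differ from those quoted above:

```python
import numpy as np

# A full GLCM over the 16-bit range would need 65536**2 32-bit counters:
print(65536 ** 2 * 4 / 2 ** 30, "GiB")   # -> 16.0 GiB

# Hypothetical 16-bit ROI whose intensities occupy only the 950-1550 range,
# mimicking the modified Brodatz images described in the text.
rng = np.random.default_rng(1)
roi = rng.integers(950, 1551, (61, 61))

# Naive binning of the full 16-bit range into 256 levels: the whole
# 950-1550 span collapses into a few consecutive bins.
naive = roi // 256                       # bin width = 65536 / 256
print(len(np.unique(naive)), "distinct levels without normalization")

# Min-max normalization first, then binning: the texture spreads over
# (almost) the full 256-level range.
norm = (roi - roi.min()) / (roi.max() - roi.min())
binned = np.minimum((norm * 256).astype(int), 255)
print(len(np.unique(binned)), "distinct levels after min-max normalization")
```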
There are several papers concerning the use of different normalization methods in texture analysis. In (Dzierżak et al., 2019), different normalization methods were applied to the analysis of spinal CT images; it was demonstrated that image normalization significantly worsened the accuracy of the automatic diagnosis of osteoporosis based on CT images of the spine. In (Loizou et al., 2009), five different normalization techniques were tested during the analysis of brain MRI images; this work showed that min-max normalization provides more stable feature values for brain lesion analysis. In (Collewet et al., 2004), normalization techniques were tested in the analysis of cheese MRI images, and it was demonstrated that image normalization improves soft cheese texture classification accuracy.
However, to the best of the authors' knowledge, this is the only publication in which different normalization techniques have been tested in a comprehensive way, for different numbers of intensity levels, on images with various distortions obtained by means of various imaging techniques.
The research, of course, has limitations. Only three classes of biomedical images, obtained using two imaging methods (CT, MRI), were used. It cannot be assumed that these images were disturbed by noise from perfect Rician and Gaussian distributions, as was the case in the tests performed on the textures from the Brodatz album. Regardless of the noise, however, these images had artifacts related to reduced acquisition time (MRI) or different reconstruction kernels (CT). Therefore, the results of the analyses obtained for these images did not always coincide with those obtained for natural textures. Nevertheless, the results allow several conclusions to be drawn that should hold for a broader class of images.
A few general conclusions can be drawn from the Brodatz experiment:
1) For GRLM-based features, there is a clear rule that low numbers of intensity levels should be used. The improvement in classification results for small numbers of intensity levels is due to the construction principles of the run length matrix: for small numbers of intensity levels, GRLM is less sensitive to noise, because strips of pixels with the same intensity are not broken.
2) For high noise distortions, both Gaussian and Rician, ±3σ normalization gives a lower classification error than min-max and 1 %-99 % normalization. This is caused by the fact that the addition of noise modifies (stretches) the histogram in such a way that it becomes similar to a Gaussian distribution, for which ±3σ normalization is preferred.
3) For Gabor transform-based features, the normalization technique of choice is ±3σ. This normalization gives the minimum classification error for all tested numbers of intensity levels, regardless of the distortion type.
4) For wavelet features, ±3σ normalization gives either the lowest or very close to the lowest discrimination error for most distortion combinations and all tested numbers of intensity levels. The only exceptions occur for low levels of Gaussian noise and nondistorted images, in which case min-max normalization gives the best results.
5) For HOG-based features, 1 %-99 % normalization is preferred for low noise levels as well as for the combination of moderate noise and background nonuniformity. For other distortion combinations, ±3σ normalization is better.
6) For gradient-based features, 1 %-99 % normalization gives either the lowest or very close to the lowest discrimination error for most distortion combinations. The only two exceptions are low and moderate Rician noise combined with background nonuniformity, and low Gaussian noise with background nonuniformity.
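For reference, the three normalization schemes compared throughout this work can be sketched as follows; the clip-to-range behavior and the function names are our assumptions rather than the exact implementation used in the experiments:

```python
import numpy as np

def normalize(roi, method):
    """Map ROI intensities to [0, 1] with one of the three schemes."""
    if method == "min-max":
        lo, hi = roi.min(), roi.max()
    elif method == "1-99":                    # 1 %-99 % percentile clipping
        lo, hi = np.percentile(roi, [1, 99])
    elif method == "pm3sigma":                # mean +/- 3 standard deviations
        mu, sigma = roi.mean(), roi.std()
        lo, hi = mu - 3 * sigma, mu + 3 * sigma
    else:
        raise ValueError(method)
    return np.clip((roi - lo) / (hi - lo), 0.0, 1.0)

def to_levels(norm_roi, levels):
    """Quantize a [0, 1] ROI into `levels` integer gray levels."""
    return np.minimum((norm_roi * levels).astype(int), levels - 1)

rng = np.random.default_rng(2)
roi = rng.normal(1200.0, 80.0, (61, 61))      # synthetic Gaussian-like ROI
for method in ("min-max", "1-99", "pm3sigma"):
    q = to_levels(normalize(roi, method), 16)
    print(method, q.min(), q.max())
```

Note that both the percentile and the ±3σ schemes deliberately clip outlying intensities, which is exactly the mechanism that can cut off the bright pelvis tail discussed below for the kidney images.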
Unfortunately, these findings do not hold for the kidney experiment. This is due to the fact that for images where the contrast agent is flushed to the pelvis, the histograms differ significantly from a Gaussian distribution (Fig. 11c). Such histograms have a relatively long tail of large values, corresponding to intensities from the bright but relatively small pelvis area. Both ±3σ and 1 %-99 % normalization can clip this tail, causing the pelvis to blend into the kidney cortex area. This can degrade the classification results for certain features, which is especially visible for Gabor and wavelet transform-based features.
The GLCM-based features require additional comment. It is difficult to identify any clear trend in terms of discrimination performance for different normalizations or numbers of intensity levels. On the other hand, for all tested image sets the lowest discrimination error was obtained for GLCM-based features. Across all combinations of distortion tested in the Brodatz experiment, the maximal discrimination error was 9.3 % and the minimal discrimination error 0.8 %. In the kidney experiments, the minimal and maximal errors were 0 % and 1.5 %, respectively. It seems that such features clearly describe the properties of the Brodatz and kidney textures and that their values are sufficiently distinctive to provide good classification of the analyzed classes. In such cases, the normalization scheme does not significantly influence classification accuracy, even if not all of the features maintain their values after normalization (see Table 1). This means that the proper selection of parameters which uniquely represent texture properties is crucial for further analysis.
For the MRI shoulder and CT heart datasets, it can be noticed that ±3σ normalization works less well than in the case of the kidney images. This can be explained by the fact that none of the images in these datasets follow a Gaussian distribution (see Fig. 12 and Fig. 13). Therefore, ±3σ normalization misrepresents the gray level distribution, usually because the intensity range is overstretched to fit it to a normal distribution. Thus, other normalization schemes, such as 1 %-99 % or min-max, are a better choice for such medical images, at least as far as texture classification is concerned.
Another observation related to the medical datasets is that, contrary to the Brodatz images, in the case of GLCM and GRLM-based features a reduction in classification error was observed with increasing numbers of intensity levels. In this case, the noise that affected both the MR and CT images was not additive Gaussian (as it was in the experiments with the Brodatz data), so no averaging effect occurred. When an image is affected by additive noise, the reduction in the number of gray levels corresponds to averaging the whole image data, including the noise. When the noise mean is equal to zero, as it was in the experiment performed on the Brodatz images, the averaging of additive noise leads to its reduction (in terms of both the standard deviation and the maximum amplitude). In the medical images analyzed here, the noise was usually not additive (Rician in the case of the MR data, or kernel-dependent in the reconstructed CT images). In such cases, reducing the number of intensity levels does not reduce image noise. Moreover, the information reduction caused by this process deteriorates classification accuracy in some cases (see e.g. the GLCM features for MR GRAPPA 1 and for CT data, all kernels).
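The suppression of weak additive zero-mean noise by gray level reduction can be illustrated numerically (the synthetic ramp image and the noise level are our assumptions): when the quantization bin width is large relative to the noise standard deviation, most pixels keep the gray level of the noise-free image.

```python
import numpy as np

rng = np.random.default_rng(3)
clean = np.linspace(0.0, 1.0, 61 * 61).reshape(61, 61)    # noise-free ramp
noisy = np.clip(clean + rng.normal(0.0, 0.02, clean.shape), 0.0, 1.0)

def requantize(img, levels):
    """Bin a [0, 1] image into `levels` integer gray levels."""
    return np.minimum((img * levels).astype(int), levels - 1)

# Fraction of pixels whose gray level is altered by the additive noise:
# with 8 levels the bin width (0.125) dwarfs the noise sigma (0.02), so
# far fewer pixels change level than with 256 levels.
flips = {L: np.mean(requantize(noisy, L) != requantize(clean, L))
         for L in (256, 8)}
print(flips)
```

This is also the mechanism behind the GRLM rule from the Brodatz experiment: fewer altered levels means fewer broken runs of constant intensity.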
Finally, the best texture classification ability, for both the Brodatz textures and the biomedical images analyzed in this study, was obtained with GLCM-based features. The number of such features is much larger in comparison to the other parameter groups, as stated in subsection Test Methodology. Of course, some may be correlated (especially the same features calculated for different inter-pixel distances). However, it is still easier to find a 3-element feature subset from a larger input space that is optimal for classification than to search a parameter space of much lower dimensionality. On the other hand, our analysis of the literature shows that the statistical parameters estimated from the co-occurrence matrix are the most commonly used texture parameters. Perhaps this is because those features were among the first to be identified (they were developed in the 1970s (Haralick, 1979; Haralick et al., 1973)). Nevertheless, they have repeatedly demonstrated their usefulness in describing a wide class of textures found in biomedical and natural images. Of course, other methods of texture analysis should not be rejected a priori (unless there is evidence of their limited usefulness for a given class of images). Biomedical textures are characterized by very high complexity and variability, so it is worth investigating a whole series of features in their analysis.
To summarize, for Gabor transform, gradient, HOG and wavelet-based features the number of intensity levels has a minimal influence on the classification error. On the other hand, this error depends on the normalization method. For GLCM and GRLM-based features, both the normalization method and the number of intensity levels influence the classification error.
In the case of MRI and CT images, the lowest classification error can usually be obtained with large numbers of intensity levels (mainly 256 in our case). The normalization method of choice for such modalities is 1 %-99 %, since it gives the most stable results (usually the best or close enough to the best). The other two normalization methods should be considered if 1 %-99 % normalization does not provide sufficiently low classification error.
In the case of images affected only by additive noise, ±3σ normalization with low numbers of intensity levels may be considered for GLCM and GRLM-based features.

Author contributions
Marcin Kociołek is the main author of the presented research and wrote a large part of the text. He chose the methodology and performed all the data analysis.
Michał Strzelecki oversaw the whole project and initiated the study. He made a major contribution to the literature review and to the discussion part of the text.
Rafał Obuchowicz provided the thorax CT and shoulder T2-weighted MRI images. He consulted on all medical aspects of the paper and also contributed to the text describing the CT and MRI data.

Declaration of Competing Interest
The authors whose names are listed immediately below certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers' bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript.

Funding source
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Human subjects study statement
All human-related imaging data used in this study were received from external institutions. All processed image data was anonymized before it was handed to us. The institutions which acquired the data followed the guidelines of the Declaration of Helsinki and the Good Clinical Practice Declaration Statement. Special care was taken regarding personal data safety, with all images anonymized before processing. Written acceptance to conduct the study was obtained from the appropriate Ethics Committees.