MS-Net: Multi-Segmentation Network for the Iris Region Using Deep Learning in an Unconstrained Environment

Iris segmentation is a significant phase in the iris recognition process because segmentation errors cascade into all subsequent phases; it is therefore important that these errors are minimised. The U-Net architecture, a deep learning approach, was previously adopted for this task, but its performance suffers when iris images are deformed by the noise factors found in unconstrained (non-ideal) environments. Scratches, blurriness, dirt, specular reflections and other noise factors are among the challenges posed when eyeglasses are present in the original images. Segmentation performance is further degraded by exploding or vanishing gradients and by the loss of information. This paper proposes a multi-segmentation network, called MS-Net, based on a deep learning approach, that captures high-level semantic features while maintaining spatial information to improve the accuracy of iris segmentation. MS-Net consists of three principal segments: a feature encoder network, a multi-scale context feature extractor network (MSCFE-Net) and a feature decoder network. The MSCFE-Net is constructed from a dilated residual multi-convolutional network module and a pyramid pooling residual model based on an attention convolutional module. In addition, the proposed MS-Net contains dense connections within the feature decoder network to decrease training difficulty when only a few training samples are available. The accuracy of MS-Net was evaluated on the CASIA-Iris.V4-1000 and UBIRIS.V2 databases, on which it achieved overall accuracies of 97.11% and 96.128%, respectively. Experiment results show that MS-Net achieves better results than earlier methods used for the same purpose.


I. INTRODUCTION
Traditionally, authentication made use of passwords, smart cards, PIN codes and patterns. These techniques are being replaced because they were misused and people were easily deceived [1]. Biometric recognition technologies are being adopted for many important tasks, such as unlocking a smartphone, withdrawing cash from ATMs, identifying travellers at airports, border control, recording attendance [2], [3] and making purchases at retail outlets [4]. The need for reliable, quick and secure methods has led to the emergence of physiological and behavioural models in biometric recognition systems [5]. Physiological biometrics involves recognising the geometry of the iris [6], fingerprint [7], face [8], retina or hand. The behavioural model authenticates by recognising signature, voice or gait [9], [10]. Among all of these, iris recognition has proven capable of high distinctiveness, reliability and good performance [11] in terms of the false rejection rate (FRR) and false acceptance rate (FAR) [12], [13]. The most recent methods either combine deep-learning features with a traditional pipeline [14], [15] or adapt an existing model that has been trained for iris recognition [16]. (The associate editor coordinating the review of this manuscript and approving it for publication was Siddhartha Bhattacharyya.)
Security systems need to be efficient, especially at public entry points such as airports, where people are put through numerous time-consuming screenings and stressful processes. Hence, it is crucial that a biometric authentication system be implemented to provide the shortest process time, with minimal user involvement. Delays can be shortened and the economic burden lightened if swift and secure systems were available [17].
Recent developments in iris recognition have focused on overcoming challenges in unconstrained environments to improve recognition rates. However, real-life situations present more complex scenarios, such as low resolutions, off-angles, scaling, rotations, occlusions by eyelashes, distortions, cropped images, blurred or out-of-focus images and noise from eyeglasses [18], [19], [20], [21], [22], [23], [24], [25]. Presently, hand-held mobile devices employ iris recognition in selected areas, replacing traditional techniques as a more reliable method of authentication [26]. The iris segmentation (IS) phase is the most critical phase of an iris recognition system (IRS) because its accuracy is fundamental to the accurate performance of all subsequent phases, and hence of the entire system. The IS phase has been shown to be the most computationally challenging step in an IRS [27], [28], [29], [30], [31]. This phase has traditionally been tackled with handcrafted, custom-designed segmentation methods [15], [32] that have achieved significant results on several databases of various qualities.
Considering the success of deep learning (DL) approaches in overcoming many other challenges in producing accurate iris images, researchers are performing in-depth studies of the convolutional neural network (CNN) to further improve the accuracy of current IS techniques [16], [33], [34], [35]. Past experiments on iris images have shown that errors in the segmentation phase impact the overall accuracy of an IRS [36]. Segmentation errors propagate into all subsequent phases, leading to even more recognition errors. The challenges of IS are due to noise from eyeglasses [1], [3], [25], [37]. IS techniques can also fail because this noise is sometimes misconstrued as limbus or pupillary boundaries [38]. Thus, the IS phase of an iris recognition system still faces challenges for images with eyeglasses in unconstrained environments [25], [31], [39]. These challenges in fact drive the development of an IRS, because improving IS techniques significantly increases the accuracy of biometric recognition. DL techniques are better than traditional techniques at handling poor-quality samples of iris images [23]. Therefore, the significance of this study lies in improving the accuracy of IS in the presence of noise such as scratches, blurriness, dirt, specular reflections and other noise factors caused by eyeglasses.
To tackle the challenges encountered in IS due to eyeglasses, the multi-segmentation network (MS-Net), based on deep CNNs (DCNNs), is proposed to segment the iris region. MS-Net extracts the iris region and detects distinguishing features produced by the simultaneous processing of information from multiple sources. MS-Net improves the accuracy of interconnected tasks because it is equipped with multi-task learning abilities and multi-loss functions that capture the differences and similarities between multiple samples. The architecture of the proposed MS-Net method comprises three main networks: the feature encoder network, the multi-scale context feature extractor network (MSCFE-Net) and the feature decoder network. Within the MSCFE-Net, the authors propose the Dilated Residual Multi-Convolutional Network (DRMC-Net) module to encode high-level feature maps and obtain deeper features using four convolution branches with multi-scale dilated convolutions. The authors also introduce a Pyramid Pooling Residual model based on an Attention Convolutional (PPR-AC) module, which was modelled on a pyramid pooling network module [40]. PPR-AC improves feature extraction from the multi-scale receptive field in high-level feature maps; the aim of this strategy is to avoid a degradation in performance when faced with problematic situations of noticeable scale differences. In this network, residual connectivity is used to prevent the gradient from vanishing. In the feature decoder network, a dense block is inserted directly after the convolution layer of each block to speed up training and minimise the number of parameters required to enhance performance.
The fundamental contributions of this paper are:
a) It introduces an MS-Net architecture based on DL to address different types of noise caused by eyeglasses. MS-Net learns the representative features of an iris image from a small amount of training data.
b) It proposes the DRMC-Net and PPR-AC modules to enhance the receptive field and capture multi-scale features from high-level semantic feature maps, combining them with the encoder-decoder network to improve the segmentation of iris images.
c) The MS-Net architecture uses a dense block in the decoder network structure to improve segmentation accuracy. MS-Net has fewer parameters, requires less training and segments faster, making it suitable for practical applications.
d) It evaluates the accuracy of the proposed MS-Net architecture on two publicly available databases, using five evaluation metrics.
The remaining sections of this paper are organised as follows: Section II presents an overview of related works on IS, comprising traditional and DL techniques; Section III gives details of the MS-Net method proposed in this paper; Section IV describes the experiment setup, covering databases, implementation details and evaluation metrics; experiment results are explained in Section V; and Section VI ends this paper with our conclusions.

II. RELATED WORKS
This section presents a brief review of the important works on IS using traditional and DL techniques.

A. TRADITIONAL TECHNIQUES
Most IS techniques use traditional image processing such as the Hough Transform (HT), morphological operators, Histograms of Oriented Gradients (HOG), thresholding, Daugman's method, the Canny edge detector (CED) and active contours. The basic IS techniques were introduced by Daugman [41] using the Circular Hough Transform (CHT) and Integro-Differential Operator (IDO) methods [42]. Sahmoud & Abuhaiba [43] proposed a 2-step method for IS
by employing the k-means clustering technique to first determine the expected iris region on a vertical CED output of the iris image, followed by the CHT method to determine the iris radius and centre. An algorithm was developed to detect and separate the upper eyelid and to remove the non-iris regions. The k-means clustering algorithm was used to shorten the CHT processing time for iris edge detection. However, this method had difficulty localising the upper and lower eyelids, a step necessary to determine the natural edge of the iris. Chung et al. [44] used CHT to roughly locate the pupils by determining the outer and inner edges of the iris with the Orientation Matching Transform (OMT) method. Then, to obtain a better iris region, the Delogne-Kåsa circle fitting (DKCFS) technique was employed to remove outlier details of the coarse inner and outer iris edges. Accuracy is affected when very noisy eye images are combined with possible binarisation failure. Improving the accuracy of the IS phase faces challenges such as off-angles, reflections from eyeglasses, occlusion by eyelids and eyelashes, rotation and different scanners. Shang et al. [24] developed a CHT based on Canny's edge detection technique [45] for segmenting the iris region. CHT identifies boundaries by exploiting the duality between points on a boundary and the coordinate parameters of that boundary's centre and radius [46]. These parameters were determined from the weight of the matrix established by the radius threshold.
Gangwar et al. [47] proposed an adaptive-filtering method, IrisSeg, that takes a coarse-to-fine approach to segmenting the pupil and iris edges. The method exploits dynamic thresholding assisted by local features to segment the pupil. The iris edge is located using adaptive filters in the polar space, and the result is then refined in the Cartesian space. The performance of this technique may be significantly poorer on images in the visible wavelength spectrum (VIS) because they are more complicated than near-infrared (NIR) images. Abdullah et al. [48] developed a method for IS using iris and pupil segmentation. The pupillary boundary was identified using the thresholding method, while the iris boundary was detected by a fusion-based shrinking and expanding of an iterative active contour (AC). This method was more robust for closed-eye detection, using the Hue-Saturation-Intensity (HSI) method and separating the eyelids in the iris images. The challenge of this method is the longer processing time of the iteration process. Taking the unconstrained (non-ideal) environment into account, Sardar et al. [49] proposed a soft-computing method for IS based on rough entropy, localising the pupillary edge by reducing spatial and grey-level ambiguity and employing circular sector analysis (CSA) to determine the limbus boundary. The weakness of this method is the noise that affects the segmented pupil due to the threshold value. Khan et al. [29] devised an IS method to attain the best real-time results, using a Field Programmable Gate Array (FPGA). This technique removes the background from the image and then applies morphological operations to locate the pupil region. The proposed non-iterative technique was implemented on an FPGA to achieve high speed and high accuracy using less memory.
However, it failed to determine the pupillary boundary in the presence of thick eyebrows and eyelashes, which worsened the accuracy of the visual spectrum images due to the poor contrast between the pupil and the iris in VIS environments. Although all these techniques achieved good speed and accuracy, their overall performance was still influenced by various environmental factors like varying illumination, off-angles, eyelids, eyelashes, reflections, blurring, various types of noise from eyeglasses and the ghost effect. Therefore, adopting a DL approach in this research could result in IS improvement.

B. DEEP LEARNING TECHNIQUES
This section explains the main contributions of DL techniques to iris segmentation, which have been adopted to overcome the challenges of traditional techniques and improve accuracy. Arsalan et al. [16] introduced a 2-phase IS system based on the CNN approach to detect the natural boundary of the iris in unconstrained environments. The first phase takes a rough iris boundary and uses a modified CHT to define the region of interest (RoI) by slightly increasing the radius of the iris region. In the second phase, a CNN, namely a fine-tuned VGG-face model, is applied to the RoI. The output of the CNN layers is two types of features by which the iris and non-iris pixels are classified to identify the correct iris edges. The difficulty is the unavailability of large public databases with which to investigate and train the DL approach (a large number of 21 × 21 iris images are needed to train CNN techniques). Despite the impressive results achieved previously, IS still faces challenges in adapting existing methods to low-quality iris images in unconstrained environments, which involve blurriness, occlusions, off-angles, eyeglasses, rotations, low resolutions, reflections and other issues. Bazrafkan et al. [10] designed the Fully Convolutional Deep Neural Network (FCDNN) based on semi-parallel deep neural networks (SPDNN), which integrates several deep structured networks to allow the use of a number of parameters similar to the SegNet-Basic introduced by Badrinarayanan et al. [50].
Bazrafkan & Corcoran [51] proposed a U-shaped, 13-layer system based on a CNN to segment the iris region. The method has no pooling layers, to avoid unwanted artifacts generated by the various networks. The terminal prediction of the system can utilise the traditional binary cross-entropy loss function. These techniques utilise DL approaches to achieve high IS performance. However, under non-ideal (unconstrained) conditions, such as with eyeglasses, off-angles, blurring, reflections and varying degrees of illumination, there is still potential for better accuracy in detecting the boundaries of the actual iris. Arsalan et al. [3] introduced a deep-learning, densely
connected, fully convolutional network called IrisDenseNet to improve IS performance in unconstrained environments. IrisDenseNet combines two main methods: the densely connected convolutional network (DenseNet) method [52] and the SegNet method introduced in [50]. There are challenges to this method. The first is the need to keep mini-batch sizes small, because training takes longer due to its dense connectivity. The second is its high false-positive and false-negative error rates, the latter mainly due to pixels of eyelashes or pupils whose values are roughly similar to those of the iris region. In addition, reflection noise caused by eyeglasses or a dark iris region is still prevalent. To improve the accuracy of IS, Arsalan et al. [1] suggested a deep-learning approach based on the Fully Residual Encoder-Decoder Network (FRED-Net), also known as a semantic segmentation network. FRED-Net is based on the architecture of SegNet as proposed in [50], using 8 convolutional layers rather than 13. FRED-Net uses residual skip connections between the convolutional layers of both the encoder and the decoder networks. The limitations of this technique are that the accuracy of the semantic segmentation relies on the number of training iris images, and that false-positive and false-negative errors remain. Hence, to improve the performance of the technique proposed in [3], Li et al. [13] introduced an IS method that redesigns the CNN structure into a Faster R-CNN network for segmentation of the iris region. A Faster R-CNN was designed by Zeiler and Fergus [53]. Their method involves the VGG-16 model suggested by Simonyan and Zisserman [54], as presented in Othman et al. [55]. The procedure was constructed using the Gaussian mixture model (GMM), taking into account the expectation-maximisation (EM) method proposed by Bilmes [56] and by Figueiredo and Jain [57].
The constructed algorithm was applied to locate the pupillary region and detect the limbus edge points to enhance the limbus boundary localisation technique [58]. Lastly, the limbus and pupillary boundaries were used to locate the iris region. To further enhance the suggested method [13], merged heterogeneous techniques that consider the strength of the CNN approach and the quickness of the traditional approach can be applied. Accuracy can also be improved by using a semantic segmentation technique and the proposed algorithm.
Lozej et al. [15] introduced an end-to-end deep-learning method for IS based on the well-known U-Net architecture. This technique still faced challenges caused by several factors, including occlusion due to eyelashes and/or eyelids, noisy pixels around the edge of the iris region, rotation, off-angles, over- or under-illumination, blurring and noise caused by eyeglasses. To improve the IS performance of the U-Net architecture, Lian et al. [2] proposed the Attention U-Net (ATT-UNet) technique. It builds on the U-Net originally suggested by Ronneberger et al. [59], adding an attention mask guided by more distinctive features for classifying pixels of the iris and non-iris regions. ATT-UNet first selects a bounding box of the possible iris region and produces an attention mask. The attention mask is then combined with the distinguishing feature maps as a weighted function. ATT-UNet has an architecture similar to VGG-16 but lacks fully connected layers and uses the pre-trained ImageNet template to initialise the weights, as mentioned by Krizhevsky et al. [60]. Meanwhile, challenges such as noise due to eyeglasses, occlusion caused by hair and eyelashes, blurring, reflection, bad illumination and off-angles still exist. Zhang et al. [61] designed four network architectures combining the dilated convolution (DC) proposed by Yu and Koltun [62] with the U-Net architecture of Ronneberger et al. [59]. The DC technique extracts more information from the eye images and accordingly enhances the efficiency of IS. In the first three architectures of the proposed network, a partial DC was merged with the U-Net architecture (PD-UNet), while in the fourth, a full DC was merged with the U-Net architecture (FD-UNet). The FD-UNet method extracts features using DC rather than the original convolution, allowing for better processing of image information. The accuracy of IS was further enhanced by adjusting the parameters of the network.
Current techniques for semantic segmentation based on a deep learning approach show favourable accuracy for iris segmentation, but still face challenges in unconstrained environments, such as the presence of various types of noise due to eyeglasses, illumination variations, occlusion due to eyelids and eyelashes, blurring, off-angles, ghost effects and reflections. Thus, we propose a multi-segmentation network called MS-Net, based on a deep learning approach, to improve the accuracy of iris segmentation in the challenging situations caused by the presence of eyeglasses. Table 1 compares a number of previous methods with the proposed MS-Net method for iris segmentation.

III. ARCHITECTURE OF THE PROPOSED MS-NET MODEL
This section provides details of the architecture of our proposed MS-Net model. It was developed based on the original U-Net model, aimed at enhancing U-Net in addressing the IS task. The purpose of this enhancement is to enable the modified model to better identify distinguishing features for separating the iris and non-iris pixels in images with eyeglass challenges.
The proposed MS-Net architecture (Figure 1) consists of three main elements: the feature encoder network (marked in navy blue), the MSCFE-Net (marked in straw yellow) and the feature decoder network (marked in saffron yellow). Another essential, newly introduced part of this architecture is the skip connection (for concatenation), which concatenates each encoded feature map with the corresponding decoded map.

A. THE FEATURE ENCODER NETWORK
The two architectural branches of MS-Net function similarly to the optic nerves of the human eye. They extract features using two similar frameworks, producing a single feature map from a single input image for each branch [63]. In the encoder, each block in the down-sampling path contains two 3 × 3 convolutions and dropout layers followed by a 2 × 2 max-pooling layer. The layers in both branches use batch normalisation, ReLU and zero-padding; the encoding process is represented by the navy-blue bands in Figure 1. This dual-input structure is capable of extracting prominent features thanks to its simultaneous processing of information from multiple images of the eye. The encoder halves the image resolution at each down-sampling step and doubles the number of feature channels.
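As an illustration, the 2 × 2 max-pooling step that halves the spatial resolution at each encoder block can be sketched as follows. This is a minimal NumPy sketch, not the actual MS-Net implementation; it assumes a single-channel feature map with even dimensions:

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max-pooling: keep the largest activation in each
    non-overlapping 2x2 window, halving height and width."""
    h, w = fmap.shape
    # Split H and W into (h//2, 2) and (w//2, 2) blocks, then take
    # the maximum over each 2x2 block.
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```

Each application of this step halves both spatial dimensions, which is why the encoder compensates by doubling the number of feature channels at each level.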
The encoder is designed as a Siamese [64], [65] structure to allow the model to pick out more distinguishing features and details by simultaneously processing information from two sources (similar to seeing with both eyes at the same time). Sharing the same weights means using the same method to extract features from both eye images. In addition, neighbouring inputs from the two eye images provide temporal contiguity [66], [67]. The weight-sharing technique also reduces the number of network parameters, avoiding overfitting given the small number of training samples.

B. MULTI-SCALE CONTEXT FEATURE EXTRACTOR NETWORK (MSCFE-NET)
MSCFE-Net consists of the DRMC-Net and the PPR-AC models, as indicated by the straw-coloured bands in Figure 1. The proposed module extracts multi-scale semantic information of context features and produces high-level maps by first encoding that information, then suppressing noise pixels and finally, enhancing the receptive field. The PPR-AC model consists of a residual pyramid pooling model and an attention convolutional network.
1) DILATED CONVOLUTION (DC)
In this study, DC was applied to efficiently calculate the undecimated wavelet transforms in the technique mentioned in [85], as applied previously in the context of a DCNN by [68], [77] and [86]. This technique permits the responses of any layer to be calculated at any required resolution. It can be seamlessly combined with the training phase of the network but can also be applied after a network is trained. Our proposed MS-Net model applies four DC rates, as shown in Figure 2. DC is a robust tool that controls the resolution of features calculated by DCNNs, modifies the fields of the filter view to capture multi-scale information and generalises the traditional convolution process. Mathematically, DC produces the output signal F(x) at any location (x) from the input signal M(x) and a convolution filter N(k) of length K and dilation rate r, as illustrated in Equation (1):

F(x) = Σ_{k=1}^{K} M(x + r · k) N(k)    (1)

Note that conventional convolution is the special case in which the dilation rate r = 1. We refer interested researchers to [74] for more details about the DC model.
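To make the role of the dilation rate concrete, the following is a minimal NumPy sketch of a one-dimensional dilated convolution over valid positions (illustrative only; the function name and the lack of padding are our own choices, not part of the paper):

```python
import numpy as np

def dilated_conv1d(signal, kernel, rate):
    """Dilated 1-D convolution: kernel taps are spaced `rate`
    samples apart, enlarging the receptive field without adding
    parameters. Only fully valid positions are computed (no padding)."""
    K = len(kernel)
    span = rate * (K - 1)                  # receptive-field extent minus one
    out = np.empty(len(signal) - span)
    for x in range(len(out)):
        # sum over kernel taps spaced `rate` apart (0-based k)
        out[x] = sum(signal[x + rate * k] * kernel[k] for k in range(K))
    return out
```

With rate = 1 this reduces to an ordinary convolution; increasing the rate widens the receptive field while keeping the same three kernel weights.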

2) RESIDUAL STRUCTURE (RS) MODEL
DCNNs typically integrate the features of input objects at various levels. The deeper the network, the richer the extracted features and the stronger the ability to learn; therefore, a network with a more profound architecture should be able to extract richer features. However, simply increasing the number of layers does not enhance performance; instead, it might cause information loss, gradient vanishing and degraded network performance when a deep network is trained [71], [87].
To prevent degradation, He et al. [71] and Zhang et al. [87] applied the idea of residual learning, introduced in the Residual Network (ResNet) technique. This means that the network no longer learns the total outputs. Residual learning allows the number of layers in the deep network to be raised in order to generate a map of the region of interest while avoiding degradation problems. This method effectively simplifies the learning objective and reduces the computational complexity of F(x) by implementing a feature map function Y(x), where x is the network input, as described in Equation (2):

Y(x) = F(x) + x    (2)

The residual block provides several advantages. It supports information flow and facilitates parameter optimisation by permitting the gradient to back-propagate readily over the connections. Furthermore, the residual learning structure adds no extra parameters, which means that enhancements are achieved without increasing the model's computational complexity. Figure 3(a) illustrates the architecture of the original ResNet network [71], which consists of multiple residual learning blocks. Our proposed method avoids gradient vanishing and information loss without increasing the deep network's parameters or computational complexity, as illustrated in Figure 3(b). In addition, the skip connection step that we introduced efficiently protects the original features during transmission and prevents information loss in the network.
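The identity shortcut of Equation (2) can be sketched in a few lines of NumPy. Here F is a toy one-layer transform (scale followed by ReLU) chosen purely for illustration; any learned sub-network could take its place:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, w):
    """Equation (2): Y(x) = F(x) + x. The identity term lets the
    gradient pass through the skip path unchanged, which is what
    prevents it from vanishing in very deep stacks."""
    return relu(w * x) + x   # toy residual F(x) = ReLU(w * x)
```

Note that when the residual branch contributes nothing (w = 0), the block degenerates to the identity, so adding such blocks can never make the representation worse, which is the core argument for residual learning.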

3) DILATED RESIDUAL MULTI-CONVOLUTIONAL NETWORK MODULE (DRMC-NET)
Residual connections [71] and Inception [88] are two classical DL-based architectures. Inception [88] introduces several designs with different receptive fields to extend the convolutional network architecture, which enhances the performance of the networks at lower computational cost. ResNet [71] enabled the training of very deep networks, using shortcut connections to avoid degradation problems such as vanishing and exploding gradients. Szegedy et al. [41] suggested an Inception-ResNet architecture that combines Inception [88] and ResNet [71], thereby inheriting the benefits of both methods.
Using DC [85] and Inception-ResNet [86], [89], the DRMC-Net module is proposed to encode high-level feature maps. DRMC-Net employs four parallel dilated convolutions with 3 × 3 kernels and different dilation rates (r), as illustrated in Figure 4. In the proposed model, DRMC-Net uses four rates, namely 1, 3, 6 and 12. The branch with the highest rate of 12 contains a 1 × 3 convolution and a 3 × 1 convolution. The technique is akin to Inception architectures that use several different receptive fields. In addition, a 1 × 1 convolution is employed at the end of each branch to improve linear activation and computational efficiency. The final design is provided with a shortcut mechanism (as in ResNet) to add the resulting features to the original ones. By merging DCs at various dilation rates, the DRMC-Net module can extract semantic features from objects of various sizes and enhance the receptive field while keeping the computational cost unchanged.
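The benefit of the four parallel branches can be quantified with the standard receptive-field formula for dilated convolutions (the formula is standard for DC; the rates below are the ones stated for DRMC-Net):

```python
def effective_extent(kernel, rate):
    """Spatial extent covered by a dilated kernel along one axis:
    rate * (kernel - 1) + 1."""
    return rate * (kernel - 1) + 1

# DRMC-Net's four parallel branches: 3x3 kernels with
# dilation rates 1, 3, 6 and 12.
fields = [effective_extent(3, r) for r in (1, 3, 6, 12)]
```

The four branches therefore see 3-, 7-, 13- and 25-pixel-wide contexts with the same number of weights per branch, which is how the module captures objects of different sizes without extra computational cost.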

4) PYRAMID POOLING RESIDUAL MODEL BASED ON ATTENTION CONVOLUTIONAL (PPR-AC) MODULE
IS is a complicated process when eyeglasses are present because they cause large differences in pixel intensity. Eyeglasses result in blurriness, specular reflections, occlusions, scratches and shadows. It is also difficult to extract features from multi-scale receptive fields in high-level feature maps, which degrades performance in problematic situations with significant scale differences. The Pyramid Pooling Module (PPM) [40] is an effective technique for extracting multi-scale features from a single eye image. A representation of the distinguishing features in the region of interest can be obtained using the pyramid pooling residual model based on the attention convolutional (PPR-AC) module, which improves multi-scale feature extraction. The general architecture of the suggested PPR-AC is illustrated in Figure 5. PPR-AC consists of a Pyramid Pooling Residual Network model (PPR-Net) with skip connections and an Attention Convolutional Network (AC-Net). The idea of extracting multi-scale features using the PPM is to enhance the capture of features from the various receptive fields of the object. To avoid gradient vanishing and information loss, a skip connection step was added to both phases of PPR-AC. The AC-Net module was used to improve accuracy at a lower level of computational complexity by selectively emphasising beneficial features. The feature map obtained from the DRMC-Net module is fed into the PPR-AC module as input. Figure 5(a) illustrates PPR-Net, which uses four different pooling sizes, namely 16 × 16, 8 × 8, 4 × 4 and 2 × 2. To obtain different representations of a subregion, a 1 × 1 convolution is used to decrease the dimensions of the original features, followed by up-sampling and concatenation of the layers to generate a representation of the feature map.
Finally, this representation is fed into AC-Net to extract the final distinguishing features, which contain both local and global information of the context, as shown in Figure 5.
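The pool-then-upsample-then-concatenate pattern of PPR-Net can be sketched as follows. This is a minimal NumPy sketch under simplifying assumptions (single channel, nearest-neighbour upsampling, no 1 × 1 convolution, and map dimensions divisible by every bin size); it is not the paper's implementation:

```python
import numpy as np

def pool_and_upsample(fmap, bins):
    """One PPM branch: average-pool `fmap` (H x W) into a bins x bins
    grid, then nearest-neighbour upsample back to H x W."""
    h, w = fmap.shape
    pooled = fmap.reshape(bins, h // bins, bins, w // bins).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, h // bins, 0), w // bins, 1)

def pyramid_pooling(fmap, bin_sizes=(2, 4, 8, 16)):
    """Stack the original map with its multi-scale pooled views,
    mirroring PPR-Net's four pooling levels (2x2 ... 16x16)."""
    branches = [fmap] + [pool_and_upsample(fmap, b) for b in bin_sizes]
    return np.stack(branches)   # shape: (1 + len(bin_sizes), H, W)
```

Coarse bins summarise global context while fine bins keep local detail; concatenating them gives the decoder access to both at every pixel.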
• ATTENTION CONVOLUTIONAL NETWORK MODEL (AC-NET)
Excessive noise in eye images due to eyeglasses makes the iris information unclear, which hampers detection and segmentation of the iris region. The Attention Convolutional Network (AC-Net) model is designed to improve the accuracy of our proposed network, whilst requiring less computational complexity, by extracting the distinguishing features before they are decoded. The main aim of AC-Net is to pick out the information most relevant to the task from the various types of information in the region of interest. Our experiments showed that network performance was enhanced by including AC-Net in the architecture of the suggested network. The AC-Net used in this research includes two convolutional layers, one average-pooling layer, one up-sampling layer and a sigmoid layer. All layers use zero-padding and ReLUs. The architecture and main parameters of the AC-Net model are illustrated in Figure 5(b).
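The final sigmoid gating step, by which attention re-weights the feature map, can be sketched as follows (an illustrative NumPy sketch; in AC-Net the attention scores come from the convolution and pooling layers described above, whereas here they are simply passed in):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(features, scores):
    """Gate a feature map with sigmoid attention weights in (0, 1):
    high scores keep a feature almost intact, low scores suppress it."""
    return features * sigmoid(scores)
```

Because the weights stay strictly between 0 and 1, the gate can only attenuate features, never amplify them, which is what suppresses eyeglass noise while leaving informative iris pixels largely untouched.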

C. THE FEATURE DECODER NETWORK
The decoding in MS-Net is performed by two branches with the same configuration, shown as the saffron-coloured part of Figure 1. The decoding process consists of convolution operations, dense connections and up-sampling; it decodes the high-level features extracted by the encoder and MSCFE-Net. The dense connection blocks were employed to increase training speed and reduce the number of parameters required, thereby enhancing performance. Skip connections link the encoder and decoder to address the low resolution of high-level semantic features and the information loss across consecutive convolutional layers and repeated up-sampling operations. Finally, a 1 × 1 convolution layer reduces the feature channels and a sigmoid function produces a pair of output maps. The effectiveness of MS-Net is most apparent in images with poor illumination and complex noise from eyeglasses such as blurriness, reflections, shadows, scratches and dirt; for this reason, we included this structure in the decoder module, as it further enhances segmentation.
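One decoder stage described above can be sketched as follows. This is an illustrative NumPy sketch under assumptions: 1 × 1 linear maps stand in for learned convolutions, the growth rate, channel counts and number of dense layers are arbitrary, and nearest-neighbour up-sampling is used. It shows only the combination of up-sampling, skip concatenation, a densely connected block, and the final 1 × 1 convolution with sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def upsample(x, f=2):
    return np.repeat(np.repeat(x, f, axis=0), f, axis=1)

def dense_block(x, weights):
    # dense connections: every layer receives all earlier feature maps
    feats = [x]
    for W in weights:
        inp = np.concatenate(feats, axis=-1)
        feats.append(np.maximum(inp @ W, 0.0))   # conv + ReLU stand-in
    return np.concatenate(feats, axis=-1)

rng = np.random.default_rng(2)
deep = rng.standard_normal((8, 8, 32))      # high-level features from MSCFE-Net
skip = rng.standard_normal((16, 16, 16))    # encoder features via skip connection

x = np.concatenate([upsample(deep), skip], axis=-1)   # (16, 16, 48)
g = 8                                                 # growth rate (assumed)
weights = [rng.standard_normal((48 + i * g, g)) * 0.1 for i in range(3)]
x = dense_block(x, weights)                           # (16, 16, 48 + 3 * 8)

W_out = rng.standard_normal((x.shape[-1], 1)) * 0.1
mask = sigmoid(x @ W_out)                             # 1x1 conv + sigmoid -> mask
print(mask.shape)
```

The skip concatenation restores spatial detail lost in the encoder, while the dense block reuses every intermediate feature map, which is what keeps the parameter count low when training on few samples.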

D. LOSS FUNCTIONS
During the training of the proposed MS-Net framework for IS, binary cross entropy was adopted as the segmentation loss function, defined as follows:

Loss = −(1/J) Σ_{i=1}^{J} [M_i log(N_i) + (1 − M_i) log(1 − N_i)]

where N_i ∈ [0, 1] denotes the output of the last layer of the proposed framework, M_i ∈ [0, 1] denotes the ground-truth label of the iris image and J denotes the number of pixels in each iris image.
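The loss above can be computed directly; the snippet below is a minimal worked example on four hypothetical pixel predictions, with an epsilon clip added (a standard numerical safeguard, not stated in the paper) to avoid log(0).

```python
import numpy as np

def bce_loss(N, M, eps=1e-7):
    """Binary cross entropy over J pixels: N = network outputs, M = ground truth."""
    N = np.clip(N, eps, 1.0 - eps)   # numerical safeguard against log(0)
    return -np.mean(M * np.log(N) + (1.0 - M) * np.log(1.0 - N))

M = np.array([1.0, 0.0, 1.0, 0.0])   # ground-truth labels (iris = 1)
N = np.array([0.9, 0.1, 0.8, 0.2])   # sigmoid outputs of the last layer
print(bce_loss(N, M))                # small loss: predictions match the labels well
```

Confident predictions on the correct side of 0.5 give a loss near zero, while confident wrong predictions are penalised heavily, which drives the per-pixel masks toward the ground truth.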

IV. EXPERIMENT SETUP
The following three sections describe the databases, implementation details and evaluation metrics.

A. DATABASES
In this study, the CASIA-Iris.V4-1000( 1 ) and UBIRIS.V2( 2 ) databases were used to evaluate the accuracy of the proposed techniques. Both databases are used to evaluate the performance of the IS task and the IRS. Details of the two databases are as follows:
1) CASIA-Iris.V4-1000 is a subset of the CASIA-Iris.V4 [38] database provided by the Center for Biometrics and Security Research (CBSR). The dataset contains 20,000 NIR images taken from 1,000 subjects, covering the left and right eyes with 1 to 10 samples each. All images have a resolution of 640 × 480 pixels, were captured with an IKEMB-100 camera and are in (.jpeg) format. The database includes samples with eyeglasses and specular reflections; for subjects who wear eyeglasses, it includes images both with and without eyeglasses as a standard for measuring performance. A subset of 700 images was used for the training phase and 400 for the testing phase.
2) UBIRIS.V2 [90] is an iris database captured in an unconstrained (non-ideal) environment under visible-light illumination. It contains two sets of images (left and right eyes) for 261 subjects, giving 522 (2 × 261) eye classes. The database comprises 11,102 VIS images in total, captured using a Canon EOS DSLR camera with a focal length of 400 mm, an exposure time of 1/200 s and an ISO of 1600. All images are in (.tiff) format with a resolution of 800 × 600 pixels. The database includes samples with eyeglasses and specular reflections. A subset of 700 images was used for the training phase and 500 for the testing phase.

B. IMPLEMENTATION DETAILS
The simulation platform for conducting our experiment is PyCharm, using Keras with TensorFlow backend.
The experiment was implemented on a laptop with an Intel(R) Core(TM) i7-9750H processor, an Nvidia GeForce GTX 1660 Ti GPU (6 GB GDDR6), 32 GB DDR4 RAM and a 512 GB NVMe SSD, running 64-bit Windows 10 Home. The proposed networks were trained on the CASIA-Iris.V4-1000 and UBIRIS.V2 databases. The number of epochs was set at 500, with 10 steps per epoch, for both U-Net and our proposed network. Augmentation was not used during the training phase. The Adam optimiser was employed with a learning rate of 10⁻⁴ and zero decay. Each network was then tested on two iris image settings, with and without eyeglasses. The images with eyeglasses posed several challenges because of blurriness, reflections, dirt, scratches, shadows and other issues.
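The training setup above can be sketched in Keras (the framework used in this work). This is an illustrative configuration, not the authors' code: the tiny two-layer model and random tensors are stand-ins for MS-Net and the iris databases, and the epoch count is reduced here so the snippet finishes quickly (the paper uses 500 epochs with 10 steps per epoch).

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Tiny stand-in model; the real network is MS-Net as described in Section III.
model = models.Sequential([
    layers.Input((64, 64, 1)),
    layers.Conv2D(8, 3, padding="same", activation="relu"),
    layers.Conv2D(1, 1, activation="sigmoid"),   # per-pixel iris probability
])

# Adam with learning rate 1e-4 (decay defaults to zero), binary cross-entropy loss
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

# Random stand-ins for iris images and ground-truth masks; no augmentation
x = np.random.rand(4, 64, 64, 1).astype("float32")
y = (np.random.rand(4, 64, 64, 1) > 0.5).astype("float32")
model.fit(x, y, epochs=1, verbose=0)   # paper setting: epochs=500, steps_per_epoch=10
```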

C. EVALUATION METRICS
The performance evaluation of the proposed IS method and the comparisons were carried out using the following five evaluation metrics: Accuracy (ACC), whose value lies between 0 and 1, as defined by Equation (4); the Equal Error Rate (EER), whose value ranges from 0 to 1, as defined by Equation (5); Precision (Pr), as defined by Equation (6); Recall (Rec), as defined by Equation (7); and the F-Measure (FM), as defined by Equation (8).

V. EXPERIMENT RESULTS
This section evaluates the segmentation results of the proposed MS-Net and the original U-Net on two databases, measured using five metrics. The three objectives of this evaluation are to measure the accuracy of MS-Net, to compare the accuracy of MS-Net with that of U-Net [15] and to compare MS-Net with a number of state-of-the-art methods.

A. EVALUATION OF THE PROPOSED MS-NET METHOD
The segmentation performance was evaluated against three main criteria: with eyeglasses, without eyeglasses, and both combined (overall). The proposed MS-Net method was compared with the original U-Net method [15]. The original U-Net is an open-source algorithm and was tested on the same CASIA-Iris.V4-1000 and UBIRIS.V2 databases. Figure 6 shows the changes in training loss (a and c) and training accuracy (b and d) for the original U-Net (a and b) and the proposed MS-Net (c and d). The x-axis represents the number of epochs, while the y-axis represents the loss and the accuracy of the training phase per batch over epochs 0 to 500. A comparison between the original U-Net algorithm and the proposed MS-Net method shows that the accuracy of MS-Net is higher and its loss lower, suggesting that MS-Net improves the convergence rate and accuracy of the training phase, besides achieving significant optimisation. Figure 7 presents a comparison of IS without eyeglasses on samples from the CASIA-Iris.V4-1000 and UBIRIS.V2 databases. It shows the original eye images (a and g), the ground truth (b and h), segmentation by U-Net (c and i), the false positive and false negative pixels of the U-Net results in green and blue, respectively (d and j), segmentation by the proposed MS-Net method (e and k) and the false positive and false negative pixels of the MS-Net results in green and blue, respectively (f and l). The masks produced by MS-Net are more accurate than those produced by U-Net. Figure 8 follows the same layout as Figure 7, but with eyeglasses; these segmentation results were also obtained using samples from the CASIA-Iris.V4-1000 and UBIRIS.V2 databases. The results show that the proposed MS-Net method performs IS more reliably than U-Net on samples with eyeglasses.
Generally, eyeglasses are a significant source of noise in non-cooperative and unconstrained iris identification environments. The following effects were considered noise in iris images with eyeglasses [38], [39]: specular reflections, occlusions, shadows, dirt, scratches and blurriness. In addition, the edges and frames of eyeglasses may be misidentified as limbus or pupillary boundaries by public IS methods such as the open-source OSIRIS [55]. These problems frequently lead to erroneous segmentation of the iris region, which further degrades the accuracy of the IRS. For this reason, the authors propose the MS-Net architecture to improve the accuracy of IS for iris images with eyeglasses. Figure 9 presents the IS results obtained by the U-Net algorithm compared to the proposed MS-Net method on samples from the CASIA-Iris.V4-1000 and UBIRIS.V2 databases that are challenging because of eyeglasses. As can be observed, certain iris images were very difficult to process due to noise from eyeglasses, and the original U-Net [15] algorithm segmented the iris region in these images inaccurately. In contrast, the proposed MS-Net method successfully segmented the iris in those challenging samples and produced more accurate IS, as shown in Figure 9. MS-Net detects the iris region with high accuracy despite challenges caused by eyeglasses, such as blurriness, specular reflections, dirt and scratches.
Some samples with eyeglasses were poorly segmented by both the original U-Net algorithm and the proposed MS-Net method, as shown in Figure 10. The noise generated by the eyeglasses, such as specular reflection or blurring, poses a significant challenge to segmenting the iris region. Nevertheless, the IS results obtained by the proposed MS-Net are better than those of the original U-Net. Iris images in the UBIRIS.V2 database are visible-light (VIS) images, so additional challenges can arise from low contrast. Even after the iris images in the database were relabelled by human annotators, some labels remain suspect and require further investigation.
As shown in Figures 7, 8, 9 and 10, there are two types of errors: false positives (FP) and false negatives (FN). For both the original U-Net method and the proposed MS-Net method, an FP error (shown in green) occurs when a non-iris pixel is predicted to be an iris pixel, whereas an FN error (shown in blue) occurs when an iris pixel is predicted to be a non-iris pixel. FP errors are caused by scratch and blur noise due to eyeglasses, or by non-iris pixels whose values are close to those of the iris region, while FN errors are caused by reflection noise due to eyeglasses or by dark regions.
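The green/blue error overlays used in the figures can be produced as follows. This is a small illustrative routine: the paper specifies green for FP and blue for FN, while colouring true-positive pixels white and the rest black is an assumption for visual clarity.

```python
import numpy as np

def error_overlay(pred, gt):
    """RGB error map: false positives green, false negatives blue."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    h, w = gt.shape
    rgb = np.zeros((h, w, 3), dtype=np.uint8)    # background (true negatives) black
    rgb[pred & gt] = (255, 255, 255)             # correctly segmented iris (assumed white)
    rgb[pred & ~gt] = (0, 255, 0)                # FP: non-iris predicted as iris -> green
    rgb[~pred & gt] = (0, 0, 255)                # FN: iris predicted as non-iris -> blue
    return rgb

gt   = np.array([[1, 1], [0, 0]])    # ground-truth mask
pred = np.array([[1, 0], [1, 0]])    # predicted mask
img = error_overlay(pred, gt)
print(img[0, 1], img[1, 0])   # the missed iris pixel is blue, the spurious one green
```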

B. COMPARISON OF THE PROPOSED MS-NET METHOD WITH THE ORIGINAL U-NET NETWORK
The proposed MS-Net method was evaluated and compared with the original U-Net method [15] in terms of F-Measure (FM), Accuracy (ACC), Precision (Pr), Recall (Rec) and Equal Error Rate (EER), as defined by Equations (4) through (8). Table 2 lists the comparison between the original U-Net and the proposed MS-Net on the CASIA-Iris.V4-1000 and UBIRIS.V2 databases. The MS-Net method achieved better results than the original U-Net method on both databases, with error rates of 0.011 and 0.033 on CASIA-Iris.V4-1000 and UBIRIS.V2, respectively. The proposed MS-Net method consistently achieved excellent IS results on iris images taken in both constrained (ideal) and unconstrained (non-ideal) environments, demonstrating its strong generalisation capacity across databases. Figure 11 is a graphical representation of the performance of several IS techniques. Our proposed MS-Net is compared with previous state-of-the-art methods based on the traditional approach [91] and on CNN methods for IS [10], [15], [34], [39], [61], [92], [93], [94], [95], [96], [97], [98], [99], [100], [101], [102]. FM testing was conducted to compare the performance of MS-Net with earlier methods on the same open database, UBIRIS.V2. The MS-Net method achieves better FM results than the other iris segmentation methods tested.

VI. CONCLUSION
The accuracy of an IS technique is essential in determining the accuracy of an IRS. Our proposed multi-segmentation network (MS-Net) framework, based on a DL approach, enhances the IS of images with eyeglasses. This technique has been tested, evaluated and compared with other techniques. MS-Net was effective in improving the accuracy of interconnected tasks, leading to multi-task learning by capturing the differences and similarities between multiple samples. In terms of architecture, MS-Net consists of a feature encoder network, a multi-scale context feature extractor network (MSCFE-Net) and a feature decoder network. In MSCFE-Net, the DRMC-Net model was adopted to encode high-level feature maps and obtain deeper features using four convolution branches with multi-scale dilated convolution. PPR-AC was also introduced to improve the extraction of multi-scale receptive-field features in high-level feature maps and to avoid drops in performance when faced with significant scale differences. The proposed method contains dense connections within each branch of the feature decoder network; these dense connections reduce training difficulty, as only a few training samples of eye images are used. Experiments on the proposed MS-Net method showed that it is able to increase the accuracy of IS, particularly in cases that involve eyeglasses.
In future work, we plan to design more efficient techniques that exploit spatial associations between the iris mask, the outer iris edge and the inner iris edge to enhance iris segmentation accuracy. We also plan to improve the speed of the proposed MS-Net method by designing heterogeneous networks that combine the speed of traditional methods with the power of the deep learning approach, using different forms of ResNet, to make it more efficient for iris segmentation tasks. Additionally, our method could be applied to other segmentation tasks, such as medical image segmentation (brain anatomy, melanoma, vessel segmentation, cancer segmentation [skin, liver, breast and lung], fractures, etc.), dental image segmentation, market segmentation and facial segmentation.