Semantic and structural image segmentation for prosthetic vision

Prosthetic vision is being applied to partially recover the retinal stimulation of visually impaired people. However, the phosphenic images produced by the implants have very limited information bandwidth due to the poor resolution and lack of color or contrast. The ability of object recognition and scene understanding in real environments is severely restricted for prosthetic users. Computer vision can play a key role to overcome the limitations and to optimize the visual information in the prosthetic vision, improving the amount of information that is presented. We present a new approach to build a schematic representation of indoor environments for simulated phosphene images. The proposed method combines a variety of convolutional neural networks for extracting and conveying relevant information about the scene such as structural informative edges of the environment and silhouettes of segmented objects. Experiments were conducted with normal sighted subjects with a Simulated Prosthetic Vision system. The results show good accuracy for object recognition and room identification tasks for indoor scenes using the proposed approach, compared to other image processing methods.


Introduction
Retinal degenerative diseases such as retinitis pigmentosa and age-related macular degeneration cause loss of vision due to the gradual degeneration of the sensory cells in the retina [1,2]. Retinal prostheses are currently the most promising technology to improve vision in patients with such advanced degenerative diseases [3][4][5][6]. These devices elicit visual perception by electrically stimulating retina cells. As a result, implanted patients are able to see patterns of spots of light called phosphenes that the brain interprets as a visual information [7][8][9]. Current retinal prosthetic devices are limited to hundreds of electrical receptors, which produce a very limited visual elicitation [10][11][12]. From the actual technologies for retinal implants [13], one of the most active line of research is based on implants with a micro camera that captures external stimuli and a processor that converts the visual information in microstimulations in the implant, as can be seen in Fig 1. Following the computer image paradigm, we can say that the visual information evoked by the implants has very low spatial resolution and very limited dynamic range (only few levels of stimulus intensity are perceived as different) [14][15][16]. Intuitively, from an information theory perspective, the process from the external sensor input to the implant stimuli is analogous to taking a high definition image and convert it to a low resolution, grayscale image with just a few grey levels. Thus, a large amount of visual information is lost. Prosthetic vision allows users to recognize objects with simple shapes, to see people's silhouettes in bright light or detect motion [17], but high level tasks require more precise visual cues and a deeper interpretation of the information.
Recent developments in implants might result in an improved resolution and performance of the visual elicitation [18], but quality would still be several orders of magnitude lower than a current micro camera. Alternatively, the visual information gathered by the external camera could be processed prior to being transferred to the retinal electrodes. Image processing can be used to extract and highlight relevant information from the external camera. This information can be presented with visual cues that help to understand the perceived scene by the implanted subject. Several studies have already been conducted testing specific cues for object recognition [19][20][21][22][23][24][25], reading [26][27][28][29], facial recognition [30,31] or navigation [32][33][34][35][36] in the context of prosthetic vision.
One of the most basic image processing tasks from the cognitive, but also from the computational level, is the segmentation of the image in different regions [37][38][39]. From a statistical point of view, this corresponds to the problem of clustering. Rooted on the Aristotelian laws of association, early research in perception from Gestalt psychologists found the importance of the principles of grouping [40]. These principles state that our brain tends to group image elements based on proximity, color, shape or other similarities. Although some of the Gestalt ideas are controverted, the principles of grouping have been supported by posterior empirical research [41,42]. From a computational perspective, image segmentation dates back to the seminal work of Minsky and Pappert [43] followed by several works in the 60s and 70s [44,45]. At that time, segmentation was based on grouping elements as belonging to the same object. Adelson [46] proposed to group elements based on abstract textures and materials, advocating the idea of seeing stuff rather than things. This was the stepping stone for modern semantic segmentation, where the objective is to group the image regions based on labels with semantic meaning, without relying on individual objects [47]. Furthermore, the use of semantic labels transforms the clustering problem into a classification problem. Recent research using deep learning has gone one step further to produce instance-aware semantic segmentation [48]. In this case, we are back to the concept of seeing things by grouping pixels of single objects, but including a semantic label for the object. The external and internal components include a micro camera, a transmitter, a external processing unit and a implanted electrode array. First, the external camera acquires an image. Then, the external processor converts the image to a suitable pattern of electrical stimulation of the retina through an electrode array. For visually impaired people, basic scene understanding is essential for many everyday tasks and it also facilitates subsequent tasks of finer perception. In this work, we use segmentation to provide a basic visual representation of indoor scenes for prosthetic users. We combine both semantic segmentation and instance segmentation. We use instance segmentation to highlight relevant objects in the scene. This has a double purpose: on one hand, we are able to reduce visual clutter, which becomes indistinguishable noise in a low resolution implant array; on the other hand, the grouping highlights the silhouette of the object, making it more distinguishable. One of the main problems of using object silhouettes for recognition is the lack of sense of scale or perspective. Thus, we rely on a second semantic segmentation component to extract structural informative edges of the scenes, such as wall and ceiling intersections. Those edges provide an intuitive representation of the 3D structure of the room as concluded in [49], where it is shown that the results with the structural edges are significant and better than the results obtained without edges for scene recognition. The idea of combining instance and semantic segmentation has been previously studied in the computer vision literature with different approaches [50,51] and it has shown to be of great benefit for holistic scene understanding [52]. The limiting case where every pixel of the image has a semantic label and instance id is called panoptic segmentation [53].
Current state of the art methods for image segmentation are mostly based on deep neural networks [47]. Most recent developments of semantic and instance-aware segmentation are based on Fully Convolution Networks (FCN) [47,48,[54][55][56]. A FCN is an architecture based on convolutional layers with added upsampling layers with skip connections to allow for detailed pixel prediction on arbitrary-sized inputs [57]. Similar approaches, like the U-net architecture, are able to provide accurate pixel prediction [58]. In this work, we use two different types of FCN-based segmentation to highlight the information available in the image and to present the most useful information to the user: PanoRoom [54] for semantic segmentation of structural elements and Mask-RCNN [55] for instance segmentation of relevant objects.

Subjects
Eighteen subjects with normal vision volunteered for the formal experiment. The subjects (four females and fourteen males) were between 20 and 57 years old.
Ethics statement. The research process was conducted according to the ethical recommendations of the Declaration of Helsinki and was approved by the Aragon Autonomous Community Research Ethics Committee (CEICA) that evaluates human research projects, human biological samples or personal data. The research protocol used for this study is noninvasive, purely observational, with absolutely no-risk for any participant. There is no personal data collection or treatment and all subjects were volunteers. Subjects gave their informed written consent after explanation of the purpose of the study and possible consequences. The consent allowed the abandonment of the study at any time. All data were analyzed anonymously.

Stimuli
We use a two step process to generate the stimuli used in the experiments. First, we process the original color image with the different methods stated in the following subsections. This generates a grayscale or binary image which corresponds to the signal to activate the electrodes of the retinal implant. Then, we use simulated prosthetic vision by generating the phosphene pattern as an image in a computer screen. The phosphene simulation has been designed to represent the descriptions of phosphene perception reported by retinal prothesis patients.
The following sections describe the three segmentation methods used in the experiments to process the input images and generate the activation of the phosphenes. First, our proposal based on semantic segmentation with artificial neural networks (SIE-OMS). To our knowledge, this is the first work that have used deep learning models in this setup. Therefore, we have used two standard processing methods as baselines: a) detecting the silhouettes and structure within the scene with a standard edge detector (Edge), and b) generating the stimulus directly from the input image luminance (Direct). Examples of the resulting effect are shown in Fig 2. For reproducibility purposes, all the stimuli images used in the experiments can be found in the dataset available online Image dataset: https://doi.org/10.6084/m9.figshare. 11493249.v4. SIE-OMS. We propose to combine two FCNs to select and highlight informative elements in indoor scenes as an intelligent way of activating the phosphenes. Specifically, we extract structural informative edges (SIE) and object masks and silhouettes (OMS) to later combined both, SIE and OMS, to build our proposed schematic representation of the scene (SIE-OMS), as can be seen in Fig 3. This idea comes from our previous study, where the results concluded that the representation of SIE in the schematic representation of the scene is significant and produce better results in object and scene recognition for SPV than the schematic representation without edges [49].
Structural informative edges (SIE). One of the main problems in the recognition of scene elements based on silhouettes is the lack of sense of scale or perspective. The scale and the structure of the scene can be achieved by detecting the structural informative edges (SIE), that is, those main edges formed by the intersection of the walls, floor and ceiling of the room. These edges can be seen in Fig 4. Our approach is based on the model by Fernandez-Labrador et al. [54] for indoor scenes. Similarly to the object masks network described below, this method is also based on a FCN for pixel classification. In this case, the network was trained to estimate probability maps representing the room structural edges, even in the presence of clutter and occlusions. The architecture of the network is an the encoder-decoder structure [54]. The encoder is built from a ResNet-50 model [59], pre-trained on the ImageNet dataset, with the final layer replaced with a decoder that jointly predicts layout edges and corners locations already refined. The output of the model is an unique branch whose output has two channels, corners and edges maps. In the decoder, the model employs skip-connections from the encoder to the decoder concatenating 'up-convolved' features with their corresponding features from the contracting part. In order to improve the training phase, Fernandez-Labrador et al. suggest to performe preliminary predictions in different resolutions which are concatenated and feed back to the network. The loss function for training is a pixel-wise sigmoid cross-entropy, regularized by the L1-norm of the network parameters. The loss function was minimized by using Adam with an initial learning rate of 2.5e −4 and exponentially decayed  by a rate of 0.995 every epoch. They also applied 0.3 dropout In this context, dropout refers to the technique used in deep learning to prevent overfitting. Do not confuse with the dropout of phosphenes. rate and 5e −6 weight decay and a batch size of 16 which allowed the learning of more complex rooms. In this work, we have used a model pre-trained with the LSUN dataset [60].
Object masks and silhouettes (OMS). We perform instance segmentation of objects using the original architecture of Mask R-CNN [55] which is partially represented in Fig 5. This method is an extension of Faster R-CNN [61] with several improvements and an extra branch to segment the object masks. The first part of the network, called a Region Proposal Network (RPN), proposes object bounding box candidates on the input image. These candidates are called regions of interest (ROIs). It also generates feature maps from the whole image. In our case, we used the Feature Pyramid Network (FPN) [62]. The second module is a RoIAlign layer that pools a small feature map for the object region from those extracted by the FPN. Then, it aligns each ROI to the feature map. This is the backbone architecture. Then, the model splits in two branches, as can be seen in Fig 5. The box branch is based on the classification component of Faster R-CNN. It generates two outputs for each ROI: a) the class of the object present in the ROI and, b) a refined object bounding box using a regression model. The mask branch is a convolutional neural network that takes the high probability regions selected by the ROI classifier -box branch-and generates a binary masks of the object. Then, it uses upsampling and deconvolution layers to scale the predicted masks to the size of the ROI bounding box which gives the final masks, one per object. Regarding the loss function for the model, it is composed by the total loss in doing classification, generating bounding box and generating the mask. The mask loss was defined only on positive RoIs and the mask target was the intersection between an RoI and its associated ground-truth mask. The training was performed with a batch size of 16 and for 160k iterations. The learning rate was 0.02 which was decreased by 10 at the 120k iteration. They also used a weight decay of 0.0001. For this work, we have used a pre-trained model on the COCO dataset [55,63]. Thus, we have only considered the object classes that were already defined in the pre-trained model. In order to speed up computation and remove spurious detections we removed the object classes of clearly small objects (e.g.: scissors, banana, etc.) as the scale of the scene in the image would not have allow it to be identified, and non-indoor objects (e.g.: car, tree, etc.). Once the object masks have been generated by the network, we highlighted the contour of each mask to avoid confusion on overlapping masks, as can be seen in Fig 6. We also performed morphological operations to reduce the aliasing effect when translated to phosphene images.
Dealing with occlusions. Although this algorithm has achieved good results for object segmentation, there are more complicated cases, such as images with overlapping objects or scenes with occlusions, where the view of one object may be blocked by other objects. In that case, we could use a depth sensor, such as an RGB-D camera, or a stereo camera to estimate the depth. Alternatively, there are some works to estimate the depth purely, based on monocular information [64]. As a proof of concept, we found that the probability score for the detection network was correlated to the level of occlusion of each object. In non-occluded objects its form is complete and therefore its recognition is more likely detected. In contrast, the form of the occluded objects is not complete and therefore its recognition is less likely to be detected. That is, a high probability is most likely to appear in objects that are in the front. Thus, we stacked the instances from the least to the highest probability, leaving the objects with the highest score overlapping the objects with the least score, as can be seen in Fig 6. This was confirmed experimentally for our setup, which is a simple problem with a limited number of classes and scenarios. In the case of a more general environment with more classes it would be necessary to use a more complex model, but that is beyond the scope of the paper.
The final representation of the SIE-OMS method is a superposition of both parts, SIE and OMS, always assuming the edges as background and object masks as foreground.
Baseline methods. We have considered two baseline methods that are the most used in the literature and that follow a completely different structure to our SIE-OMS model [65][66][67][68]. We compared SIE-OMS with two baseline methods used in retinal prothesis: a) a direct method that converts the input image directly to the phosphene map by averaging the brightness on the region covered by each phosphene, and b) a standard edge detector to extract brightness contours (see Fig 2). The direct method has proved to be very effective in scenes where high contrast predominates [65]. Edge detectors have also been previously used for prothesis vision and phosphene images [66][67][68]. Since the contours of an image holds much information, edge extraction is a useful method of encoding and selecting the information contained in an image. The drawback here is that the understanding of a complete scene in low vision represented by edges may be more challenging because the amount of clutter. For example, Sanocki et al. investigated how complicated is an edge extractor method comparing object recognition with and without removal of background clutter with edge images [69]. The results showed that the increase in the number of edges greatly increase the complexity. For the edge detector, we used the Canny implementation from the scikit image Python package with the default parameters [70]. In this case, we also added morphological operations (dilation) to reduce aliasing without adding clutter.
Phosphene simulation. As commented before, first, the input images are processed by one of the three methods (SIE-OMS, Direct, Edge) resulting in the grayscale images from Fig  2. Then, these grayscale images are used to activate the phosphene map. In this work, we have used a simulated phosphene map on a computer screen, but the same activation images could be directly applied to the retinal implant.
Based on previous studies with simulated prosthetic vision [9,71,72], we approximate the phosphenes as grayscale circular dots with a Gaussian luminance profile -each phosphene has maximum intensity at the center and gradually decays to the periphery, following a Gaussian function-. The intensity of a phosphene is directly extracted from the intensity of the same region in the processed image. The size and brightness are directly proportional to the quantified sampled pixel intensities. For our phosphenic images, the array of phosphenes was limited to 32 x 32 (1024 electrodes) and 8 different luminance levels according to the number of luminance levels attainable in human trials using retinal prostheses [9,22]. We also included a 10% dropout of electrodes, which is a standard value used in the literature [72]. The dropout percentage has shown that significantly affects the performance of recognition tasks decreasing recognition accuracy as the dropout percentage increases [73]. The complete process of phosphene generation can be found in the Supplementary material (see S1 Appendix).

Experimental setup
Most of the SPV configurations are usually based on a computer screen for the presentation of static or dynamic phosphene images [29, 74,75]. This methodology allow controlled evaluation of normally sighted subject response and task performance which is fundamental to know the way humans perceive and interpret phosphenized renderings. SPV also offers the advantage of adapting implant designs to improve the perceptual quality using image processing techniques without involving implanted subjects. In our case, the participants were normal sighted subjects seated on a chair facing a computer screen at 1m distance resulting in a 20 degrees simulated field of view, as can be seen in Fig 7. For the formal experiment, subjects were recruited to complete two tasks: object recognition and room identification. The recognition accuracy was analyzed after the trials. Each trial consisted of a sequence of images presented randomly to the subject with the proposed SIE-OMS stimuli method and the two baseline methods (Edge, Direct), as can be seen in Fig 8. At the beginning of the experiment, a white dot was displayed in the center of the screen indicating where the subjects had to maintain the fixation sight until the beginning of the task.
Next, each phosphenic image appeared for 10 seconds and switched for the next image automatically. This procedure was repeated for the other test images. To avoid distractions in the participants, they verbally indicate the type of objects seen in each image and their selection of room type keeping the fixation sight on the screen. The responses of each image were annotated by the experimenter. If the subjects did not respond within the 10 seconds that the image is displayed, the result of the test image was considered not answered (NA). If the subjects were only able to respond to one of the two phases of the experiment, only the unanswered phase was considered as not answered. Every 15 test images we made a pause of 30 seconds. The complete experiment took approximately 15 minutes.
The experiments were conducted using a public database of indoor scenes [76]. All the images from the database are still life scenes, from arbitrary scenarios, locations, clutter, cameras and lightning conditions. Some images are from old phone cameras with very poor quality and resolution to be more challenging as a computer vision benchmark. Thus, we replaced some images with the first results of querying Google Images with the room label, that also matched the database features (e.g., still life, mid-wide view. . .). For each of the six categories, we randomly selected 50 images. Hence, we conducted the experiment using 300 images from different indoor environments. The original images were processed using our proposed method and the two baseline approaches, resulting in 900 phosphenic images. Prior to beginning the experiment, subjects were informed about the number of images in the experiment (54 images per subject). Subjects were unaware that multiple image processing strategies were phosphenes (rows: bathroom, bedroom, dining room, kitchen, living room and office, respectively). Each column shows: a) input images, b) images processed using the Edge method, c) images processed using the Direct method and d) images processed by our SIE-OMS method, respectively. used in the experiment, although a screen with four images were shown to the subjects at the beginning of the test. These demo images were not included in the experiment, to avoid learning effects. Subjects were informed that all scenes were indoor scenarios, but they were not informed about the types of room, neither the object classes, nor the number of objects in each image. The types of room studied were: bathroom, dining room, living room, kitchen, office and bedroom. No subject identified a type of room or scene not belonging to that list. In most of the tests, the objects identified by the subjects were: chair, table, couch, toilet, bath, sink, bed, oven/microwave, refrigerator, laptop. This coincides with the list of classes used for our SIE-OMS which was selected without looking at the database and before conducting any trial or test. As commented before, the object classes were those already included in the pre-trained model. However, in two images with the direct method, a couple of subjects were able to find a window that our system did not detect because the class was not included. Furthermore, in a couple of cases a subject wrongly identified wardrobe and door in images containing a fridge.

Results
The following section shows the results of the experiment. We analyze separately the results of object recognition phase and room identification phase. We show the percentage of correct responses in both tasks and we include 95% confidence intervals. We also differentiate between incorrect response and no answer. Table 1 and Fig 9 show the global results for object recognition and room identification tasks considering the proposed stimuli generator (SIE-OMS) and the two baseline methods (Edge and Direct). The analysis of the average correct responses for both tasks reveals a significant difference between methods (p< 0.001). In both tasks, the results show a considerably better performance of SIE-OMS compared to the other methods. The SIE-OMS method has the highest percentage of correctly identified objects (62.78%) compared to Edge (19.17%) and Direct (36.83%) methods. Likewise, there is a clear increase in the percentage of success in the room identification of SIE-OMS versus Edge and Direct method. The number of unanswered responses for our method was also smaller, indicating that there was no difficulty in the comprehension of most of the images. In contrast, it is worth noting the high percentage of unanswered responses for the Edge method, reaching more than 70% of the scenes. Figs 10 and 11 show the results for the object recognition and room identification tasks for each room-type, respectively. As before, when comparing the baseline methods versus our approach, the highest number of correct responses is obtained for SIE-OMS method for all room types. Besides, the largest difference in results was obtained comparing the Edge method versus the SIE-OMS method (p<0.001) for all room-types. However, there was no significant difference for kitchen type in Direct vs SIE-OMS (p = 0.464). Similarly, there is a significant difference for living room (p<0.05), office (p<0.01) and bathroom (p<0.01). On the other Table 1

. Global object recognition (OR) and room identification (RI) values for each phosphenic stimuli method.
Comparison of mean responses and standard deviation grouped by type of phosphenic image method (Edge, Direct and SIE-OMS). 95% of confidence interval for the mean difference. hand, the results of room identification task for each room-type (Fig 11) provide additional support for the SIE-OMS method since this method also has the best percentage of correct responses in each room-type, exceeding 85% for the cases of bedroom and dining room. In the same way as in the identification of objects, the case of the kitchen obtained the worst results, followed by the office case. Taken together, these findings indicate that SIE-OMS method was significantly effective improving object recognition and room identification, yet also significantly more effective than the baseline methods, Edge and Direct. Fig 12 shows four examples of failed and successful tests from the three methods. The two top rows show a bathroom and a bedroom scene where the identification of the objects and room was a success for all the methods. This is due to the location of a characteristic object with a clear silhouette in the center of the image that also helps in the identification of the room. Contrary, the bottom rows show a kitchen and an office where the recognition of the objects and the identification of the type of room failed in all cases as a result of the lack of distinguishable shapes (rectangle silhouettes) and visual clutter.

Performance analysis of SIE-OMS
We also analyzed the performance of the proposed SIE-OMS method. The SIE-OMS system detected all the clearly visible objects of the scenes and even most of the occluded objects that matched the selected classes. Structural edges also improve the performance of our method. Recovering the main structure of the room provide sense of scale or perspective of the objects Higher scores in correct responses indicate that subjects were able to recognize the type of room in each test image. Higher ratios in non responses indicate that subjects were not able to recognize the type of room in each image. The SIE-OMS method obtained the highest score of the three methods in all room-type compared with Edge and Direct methods. In the same way as in the identification of objects, results also showed how the most difficult room was the kitchen. ��� = p<.001; �� = p<.01; � = p<.05; ns = p>.05. All t-tests paired samples, two-tailed. https://doi.org/10.1371/journal.pone.0227677.g011 Image segmentation for prosthetic vision and hence a better understanding of the 3D scene. Table 2 shows the confusion matrix of room-type based on answered images (correct and incorrect responses). Table 3 shows the confusion matrix of the room-type based on the total images of the test (correct, incorrect and no answer).  Concerning the performance of our method, the recall and the precision are high, reaching in some cases up to 97% ( Table 2). The diagonal elements show the number of correct classifications for each class. Hence, most of confusions are found in bathroom, living room and office. Office was confused with bathroom because of the similarity of shape between some chairs and toilets. In addition, office was confused by bedrooms since many of them usually have study desks in the bedrooms. There were other less relevant cases where the dining room was confused with an office since both are composed of chairs and tables. This confusion can be explained because the database is from Northamerican locations, while the subjects live in Spain where apartments commonly join the dining room and the kitchen.
Note that when the unanswered responses are taken into account (Table 3), the recall for the kitchen case decreased significantly (from 88.89% to 48.00%). This means that the kitchen room is more difficult to be identified. This low performance in the kitchen identification is mainly because the information provided turned out to be very limited in this case. For instance, ovens, microwaves and fridges with a rectangular shape masks were sometimes confused with windows, doors or wardrobes (which are object classes not considered by our system).

Discussion
The visual information in interpretation of the phosphene simulation is an important issue due to the limited capabilities of retinal implants. Low resolution, limited dynamic range and narrow visual field are some of the limitations present in current retinal prostheses [8,9]. Furthermore, Nanduri et al. [7] showed that phosphenes are not perfectly located in the visual field corresponding to a specific grayscale pixel. Electrically elicited phosphenes change in form and size with increasing amplitude. However, depending on the type of device the perceptual distortions will be affected differently. For example, in retinal prostheses this distortion produce visual effects called comets that might result in a substantial loos of information [77]. Other devices, such as optogenetic technologies, may suffer a loss of temporal resolution, while cortex implants suffer from crosstalk [77]. This fact results in loss of visual information which affects patient perception. However, there are research groups using computer vision approaches to try to expand the perceptible visual field in implanted patients to provide useful information in the peripheral vision [78,79].
Another important limitation of retinal implants is phosphene dropout, which has been reported in retinal prostheses trials [17,80] as a result of very high threshold values needed to elicit phosphenes in areas with a high number of degenerating nerve cells. Clinical trials by Thompson et al. [81] indicated that the dropout rate has significant effects on the speed and Table 3. Confusion matrix results for room identification based on the total images (correct, incorrect and no answer (NA)) using SIE-OMS method. Image segmentation for prosthetic vision accuracy of recognition tasks. Similarly, Cao et al. [82] showed that the accuracy and efficiency in writing tasks decrease as the variability of distortion and dropout percentage increase.

Actual/Predicted
To overcome the limitations of implants, SPV researchers have tried to optimize the image presentation to deliver the effective visual information in daily activities. For example, Vergnieux et al. [32] limited the cues in a virtual scene using different renderings methods, highlighting structural cues such as the edges of different surfaces for navigation. For the same purpose, Perez et al. [35] proposed a phosphene map coding using a ground representation of the obstacle-free space and a ceiling representation based on vanishing lines. Wang et al. [22] proposed two image representation strategies using background subtraction to segment moving elements for object recognition. Similarly, Guo et al. [23] and Li et al. [24] proposed two image processing strategies based on a saliency segmentation technique. For scene recognition, McCarthy et al. [83] presented a visual representation based on intensity augments in order to emphasise regions of structural change.
In terms of complex scene understanding, just few SPV studies have been proposed [24,84]. It is well established that in realistic environment, which is made of complex scenes, the observer is forced to select relevant elements [85]. That is necessary to quickly understand the meaning of a scene as well as for object search. For instance, the set of objects in the environment give rise to a corresponding set of representations in the observer. Each representation describe the identity, location, and meaning of the item it refers to finally forming a literal representation of the environment. Some research on the visual perception of subjects has shown that because the fixation of the gaze changes in a short period of time when an environment is observed, the content of the scene can not be integrated into a complete and detailed representation [86,87]. This suggests that such complete and detailed representations are not needed to obtain a meaning of the scene. Just a few set of object and scene elements are enough to provide access to semantic information [88].
A well-known result in psychophysics highlight that grouping elements in a scene are fundamental for scene understanding [42]. First, the grouping of pixels in a region defines a contour. In many cases, shape alone permits recognition of objects. Biederman et al. [86,89] demonstrated that the silhouettes of the objects are generally very easy to identify and to recognize. The silhouette conveys only part of the visual information needed for the interpretation of an object. Concretely, the concepts such as convexities, concavities, or inflections of contours allow the observer to infer the surface geometry [90]. However, this bottom-up perception can be computed first and to help any top-down search to converge to the right answer. This can help to understand the visual scene through the interpretation of its content. However, in order to fully understand a scene, it is not only important the identification of individual objects comprising the scene but also their relative locations and relations [88]. Based on this idea, the segmentation of the scene into elements with semantic meaning becomes a key point in low vision.
The state of the art in semantic segmentation include deep learning algorithms. Specifically, FCNs have proven to be successful in various recognition tasks such as semantic segmentation of images. In this work, we use two FCNs to select and highlight useful information in indoor scenes such as relevant object masks and silhouettes and structural edges which recover the main structure of the scene providing sense of scale or perspective of the objects. Even though deep learning methods are known for being resource-hungry during training, they can achieve real time performance for prediction even in mobile or embedded devices [91]. Thus, our method could be easily integrated in an implant device.
The performance of the proposed visual stimuli, the SIE-OMS method, was investigated for object recognition and room identification tasks. We introduced the effect of dropout with a 10% of phosphenes omitted at random. This effect has been shown in other studies that decreases the performance of subjects in daily tasks, although it is known that with practice it will improve performance [7,77,79,81]. Our results show that generating phosphene images by extracting specific segments of the scene such as structural informative edges and objects shapes are effective at improving object recognition and room identification. Moreover, the SIE-OMS method produces a large improvement on object recognition and room type identification compared to standard methods in SPV. Here, we have taken the pre-trained neural network model of He et al. [55] with the same classes as it had pre-defined without modifying any of them. The pre-defined classes coincided with those classes detected by users with the Direct method, which does not depend on the model of [55], since both methods, the SIE-OMS and the Direct, are independent. Note the case of the "window" class that was detected by users with the Direct method but was not a pre-defined class in the model [55]. We consider large objects since the scale of the images allowed a complete view of the room and that could also be identified by the users with the Direct method. However, object appearance alone is not enough for accurate object recognition in certain scenes. Since the only piece of visual data that our system uses for each object is its shape, the introduction of complementary information such as the object label could make recognition easier and avoid confusion between objects with similar shape. These factor will be considered in future studies for more realistic practices. We also note that structural edge detection is fundamental for performing tasks such as self-orientation and building a mental map of the environment. Finding such structure is crucial for personal mobility with retinal prothesis, where the bandwidth of image information that can be represented per frame is quite restricted. Overall, we can affirm that the perception and comprehension of the scene can be obtained with just a few set of elements represented in the environment.

Conclusion
We present a new visual representation of indoor environments for prosthetic vision, which emphasizes the scene structure and object shapes. By combining the output of two FCN for structural informative edges and object masks and silhouettes, we have demonstrated how different scenes and objects can be quickly recognized even under the restricted conditions of prosthetic vision. Our results demonstrate that our method is well suited for indoor scene understanding over traditional image processing methods used in visual prostheses. The key idea of our current results is that, with only a few significant elements of the scene, it is possible to obtain a good perception of the environment, even in complex and occluded scenes. This work can be used to help visually impaired people to significantly improve their ability to adapt to the surrounding environment.