Image and Video Processing for Visually Handicapped People

This paper reviews the state of the art in the field of assistive devices for sight-handicapped people. It concentrates in particular on systems that use image and video processing for converting visual data into an alternate rendering modality that will be appropriate for a blind user. Such alternate modalities can be auditory, haptic, or a combination of both. There is thus a need for modality conversion, from the visual modality to another one; this is where image and video processing plays a crucial role. The possible alternate sensory channels are examined with the purpose of using them to present visual information to totally blind persons. Aids that either already exist or are still under development are then presented, where a distinction is made according to the final output channel. Haptic encoding is the most frequently used, by means of either tactile or combined tactile/kinesthetic encoding of the visual data. Auditory encoding may lead to low-cost devices, but the high information loss incurred when transforming visual data into auditory data must be dealt with. Despite a higher technical complexity, audio/haptic encoding has the advantage of making use of all of the user's available sensory channels.

information therefore needs to be simplified and transformed in order to allow its rendition through alternate sensory channels, usually auditory, haptic, or auditory-haptic.
As the statistics above show, the large majority of visually impaired people are not totally blind but suffer from impairments such as short sight, which decreases visual acuity; glaucoma, which usually affects peripheral vision; and age-related macular degeneration, which often leads to a loss of central vision. Due to the prevalence of these impairments, and hence the need for mass-produced aids, these low-vision (as opposed to blindness) aids are often of low technicality; examples are magnifiers, audio books, and spelling watches. In other words, the need for mass-market computerized devices with image processing capabilities is not strongly felt. There are some exceptions, such as screen readers coupled with zoom (possibly operating directly on JPEG and MPEG data) and OCR capabilities, possibly with added Braille and/or vocal output. A noted work was the development of the Low Vision Enhancement System (LVES) [2], where significant efforts were put on portability, ergonomics, and real-time video processing. A head-mounted display with eye tracking was used; processing included spatial filtering, contrast enhancement, spatial remapping, and motion compensation. Other systems include ad hoc image/video processing that compensates for a particular type of low-vision impairment. Typical techniques used are zooming, contrast enhancement, or image mapping (e.g., [3][4][5][6][7]). These low-vision aids will not be described further in this article, which concentrates on aids for the totally blind. Note however that some of the devices for the totally blind also target partially sighted users.
One of the long-term goals of research in the domain of assistive aids for blind persons is to allow a totally sightless user to perceive the entire surrounding environment [8][9][10]. This not only requires performing some form of scene interpretation, but also that the user be able to build a mental image of his/her environment. An important factor to take into consideration is then the age at onset of blindness, whether from birth or later. According to [11], mental images are a specific form of internal representation, and their associated cognitive processes are similar to those involved in other forms of perception. The mental image is obtained through an amodal perceptual process. The term "amodal" was established following several studies made on congenitally blind people, which proved that a mental image is not uniquely based on visual perception [12]. In the case of a blind person, the mental image is usually obtained through the use of haptic and auditory perception. Kennedy [13] claimed that congenitally blind subjects could recognize and produce tactile graphic pictures including abstract properties such as movement. He also recognized that blind people are able to understand and utilize perspective transformations, which is contested by [14]. According to Arditi's point of view, congenitally blind users cannot access purely visual properties, which include the perspective transformations. Studies reported by [15] also revealed that congenitally blind people were able to generate and use mental images from elementary tactile pictures. However, they suffer from imagery limitations when tactile images increase in complexity. These limitations are caused by their spatial perceptual deficit due to their blindness and by the high attentional load associated with the processing of spatial data. Hatwell also assumed that the haptic spatial perceptions of congenitally blind people are systematically less efficient than those of late blind persons.
This comes from the visual-haptic cross-modal transfer that took place during the infancy of the late blind, which increased the spatial perceptual quality of this sensory system. This set of observations showed that early and late blind people are able to generate mental images, although the process is harder for early blind persons. Furthermore, associations of colors with objects will only be known at an abstract level by people who have never experienced sight. In any case, the content of nonvisual pictures must first be simplified, in order to minimize the cognitive processing necessary for recognition.
Another very important issue is the development of orientation, mobility, and navigation aiding tools for the visually impaired. The ability to navigate spaces independently, safely, and efficiently is a combined product of motor, sensory, and cognitive skills. Sighted people use the visual channel to gather most of the information required for this mental mapping. Lacking this information, people who are blind face great difficulties in exploring new spaces. Research on orientation, mobility, and navigation skills of people who are blind in known and unknown spaces, indicates that support for the acquisition of efficient spatial mapping and orientation skills should be supplied at two main levels: perceptual and conceptual [16,17].
At the perceptual level, the deficiency in the visual channel should be compensated for by information perceived via other senses. The haptic, audio, and smell channels become powerful information suppliers about unknown environments. Haptics is defined in Webster's dictionary as "of, or relating to, the sense of touch." Fritz et al. [18] define haptics as follows: "tactile refers to the sense of touch, while the broader haptics encompasses touch as well as kinaesthetic information, or a sense of position, motion, and force." For blind individuals using the currently available orientation, mobility, and navigation aids, haptic information is commonly supplied by the white cane for low-resolution scanning of the immediate surroundings, by palms and fingers for fine recognition of object form, texture, and location, and by the feet regarding navigational surface information. The auditory channel supplies complementary information about events, the presence of others (or machines or animals), or estimates of distances within a space [19].
At the conceptual level, the focus is on supporting the development of appropriate strategies for an efficient mapping of the space and the generation of navigation paths. Research indicates that people use two main scanning strategies: route and map strategies. Route strategies are based on linear (and therefore sequential) recognition of spatial features, while map strategies, considered to be more efficient than the former, are holistic in nature. Research shows that people who are blind use mainly route strategies when recognizing and navigating new spaces, and as a result, they face great difficulties in integrating the linearly gathered information into a holistic map of the space.

The remainder of this article is organized as follows. Section 2 presents the main alternate sensory channels that are used to replace sight, namely haptic, auditory, and their combination. Direct stimulation of the nervous system is also discussed. The following sections then review various assistive devices for totally blind users, classified according to the alternate modality used. This classification was preferred over one based on which image/video processing techniques are employed, as many systems use a variety of techniques. It was also preferred over a description based on the situation in which a device would be used, since various systems aim at multipurpose functionality. Section 3 thus discusses systems relying on the haptic channel, historically the first to appear. Section 4 concerns the auditory channel, while Section 5 discusses the use of the combination of auditory and haptic modalities. Section 6 discusses these devices from a general viewpoint and concludes the article.

ALTERNATE SENSORY CHANNELS AND MODALITY REPLACEMENT FOR THE TOTALLY BLIND
Sight loss creates four types of limitations, regarding communication and interaction with others, mobility, manipulation of physical entities, and orientation in space (e.g., [20][21][22]). To compensate for total or nearly total visual loss, modality replacement is brought into play, which is the basic development concept in multimodal interfaces for the disabled. Modality replacement can be defined as the use of information originating from various modalities to compensate for the missing input modality of the system or the users. The most common modality used to replace sight is touch, more precisely the haptic modality composed of two complementary channels, tactile and kinesthetic [23]. The tactile channel concerns awareness of stimulation of the outer surface of the body, and the kinesthetic channel concerns awareness of limb position and movement. Haptic perception is sequential and provides the blind with two types of information of complementary nature: semantic ("what is it?") and spatial ("where is it?") [24]. At the end of the process, both types of information are combined to form the mental image. Two strategies are applied when exploring a physical object using the hand, based on macro- and micromovements. Macromovements perform a global analysis, while micromovements consider details; assistive devices should therefore allow for these two types of exploration. The use of the haptic modality to replace the visual one can be accomplished in two ways, that is, physically and virtually. In the physical interaction, the user interacts by hand with real models, such as 3D map models or Braille code maps. In the virtual interaction, the user interacts with a 3D virtual environment using a haptic device that provides force/tactile feedback and makes the user feel like he/she is touching a real object.
The physical haptic interaction is in general more efficient than the virtual one, due to the intuitive way of touching objects with the hands instead of using an external device for interaction. However, virtual haptic interaction is more flexible: many 3D virtual environments and virtual objects can be rapidly designed, and it is reported that, with proper training, the user can easily manipulate a haptic device to navigate in 3D virtual environments [25].
The other main replacement modality is hearing. Whereas touch plays the key role in the perception of close objects, hearing is essential for distant environments. A sound is characterized by its nature and its spatial location [26]. Monaural hearing can be sufficient in a number of situations, although binaural hearing plays an important role in the perception of distance and orientation of sound sources. Assistive devices that use the hearing channel to convey information should thus not prevent normal hearing; they should only become active at the user's request (unless an alert needs to be conveyed). The audio and haptic modalities can also be used jointly, as is the case with some of the assistive aids that are presented below.
Research aiming at directly stimulating the visual cortex, thus bypassing alternate sensory channels, has been active for decades. Intracortical microstimulation is performed by means of microelectrodes implanted in the visual cortex (e.g., [27][28][29]). When stimulated, these electrodes generate small visual percepts known as phosphenes, which appear as light spots; simple patterns can then be generated. An alternative approach consists in the design of artificial retinas (e.g., [30,31]). In addition to technical, medical, and ethical issues, these devices require that at least parts of the visual pathways are still operating: the optic nerve in the case of the artificial retina, as well as the visual cortex. Direct cortical or retinal stimulation will not be discussed further, but it should be noted that such apparatuses call for sophisticated, real-time image processing to simplify scenes in such a way that only the most meaningful elements remain.
Regarding the available aids for the visually impaired, they can be divided into passive, active, and virtual reality aids. Passive aids provide the user with information before his/her arrival in the environment; examples are verbal descriptions, tactile maps, strip maps, Braille maps, and physical models [17,32,33]. Active aids provide the user with information while navigating, for example, Sonicguide [34], Kaspa [35], Talking Signs or embedded sensors in the environment [36], and the Personal Guidance System, based on satellite communication [37]. The research results indicate a number of limitations in the use of passive and active devices, for example, erroneous distance estimation, underestimation of spatial components and object dimensions, low information density, or misunderstanding of the symbolic codes used in the representations.
Virtual reality has been a popular paradigm in simulation-based training and in the game and entertainment industries [38]. It has also been used for rehabilitation and learning environments for people with disabilities (e.g., physical, mental, and learning disabilities) [39,40]. Recent technological advances, particularly in haptic interface technology, enable blind individuals to expand their knowledge by using artificially made reality through haptic and audio feedback. Research on the implementation of haptic technologies within virtual navigation environments has yielded reports on their potential for supporting rehabilitation training with sighted people [41,42], as well as with people who are blind [43,44]. Previous research on the use of haptic devices by people who are blind relates to areas such as identification of objects' shapes and textures [45], mathematics learning and graph exploration [46,47], use of audio and tactile feedback for exploring geographical maps [48], virtual traffic navigation [49], and spatial cognitive mapping [50][51][52].

Tactile encoding of scenes
As seen above, two fairly different haptic modalities can be used: tactile and kinesthetic. Tactile devices are likely the most widely used to convey graphic information. Historically, the first proposed system dates back to 1881, when Grin [53] proposed the Anoculoscope. This system would have projected an image onto an 8 by 8 array of selenium cells which, depending on the amount of impinging light, would have controlled electromechanical pin-like actuators. The system was however never actually realized, "for lack of funding" as the inventor stated.
Coming to more recent work, some guidelines should be followed in order for an image to be transformed into a form suitable for tactile rendition. The tactile image should be as simple as possible; details make tactile exploration very difficult. Attention should be paid to the final size of objects; some resizing might be necessary. Crossings of contours should be avoided by separating overlapping objects; contours should be closed. Text, if present, should be removed or translated into Braille. As image processing practitioners know, performing such image simplification is no mean feat, and various solutions have been proposed (e.g., [54][55][56]). They have in common a chain of processing that includes denoising, segmentation, and contour extraction. Contours are closed to eliminate gaps, and short contours are removed, resulting in a binary simplified image. In some cases, regions enclosed by closed contours have been filled in with textures.
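As a rough illustration of such a processing chain, the following sketch uses a box-blur denoiser, a global-mean segmentation, and a 4-neighbour boundary test as simple stand-ins for the more elaborate denoising, segmentation, and contour-extraction stages cited above; the parameter values are arbitrary, and the contour-closing and short-contour-removal stages are omitted:

```python
import numpy as np

def simplify_for_tactile(img, blur_passes=2):
    """Toy denoise -> segment -> contour chain for tactile rendition.

    Real systems would also close contour gaps, drop short contours,
    and translate embedded text into Braille; this sketch stops at a
    binary contour map.
    """
    f = img.astype(float) / 255.0
    h, w = f.shape
    # denoising: repeated 3x3 box blur
    for _ in range(blur_passes):
        p = np.pad(f, 1, mode="edge")
        f = sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    # crude segmentation: global threshold at the mean intensity
    seg = f > f.mean()
    # contour extraction: flag pixels whose 4-neighbourhood is not uniform
    p = np.pad(seg, 1, mode="edge")
    edges = np.zeros_like(seg)
    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        edges |= seg ^ p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    return edges.astype(np.uint8)

# synthetic scene: a bright square on a dark background
scene = np.zeros((64, 64), dtype=np.uint8)
scene[16:48, 16:48] = 200
tactile = simplify_for_tactile(scene)
```

On this synthetic scene the result is a thin closed ring around the square, which is exactly the kind of binary simplified image a raised-pin display can render.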
A critical issue is then how to render these images in tactile form. Two families of supports coexist, allowing for either static or dynamic rendering. Static images are in general produced by means of specific printers that heat up paper on which a special toner has been deposited; under the influence of the heat, this toner swells and therefore gives a raised image. Such static raised images are routinely used in many places; often however these images are prepared by hand and little image processing is involved.
Supports that permit dynamic display of images can be mechanic-tactile with raising pins, vibrotactile, electrotactile where small currents are felt in particular locations, and so on (see [57] for a comprehensive review). The earliest system using a head-mounted camera and a dynamic display was the Electrophtalm of Starkiewicz and Kuliszewski (1963), later improved to allow for 300 vibrating pins [58]. Around the same time, the TVSS (Tactile Vision Substitution System) was developed, with a 1024-pin vibrating array located on the abdomen of the user (e.g., [59,60]). A noted portable device using a small dynamic display of 24 by 6 vibrating pins is the Optacon, first marketed in 1970 by Telesensory Systems Inc. (Mountain View, Calif, USA) [61], and used until recently (e.g., [62,63]). The user could pass a small camera over text or images, and the corresponding pins would vibrate under a finger. In terms of image processing, the image transformation in such dynamic-display systems was based on simple thresholding of grey-level images, where the threshold could be varied by the user.
Purely tactile rendition of scenes using dynamic displays suffers from several drawbacks. First, the information transfer capacity of the tactile channel is inherently limited; not more than a few hundred actuators can be used. Second, such displays are technologically difficult to realize and costly; they are also difficult to use for extended periods of time. Finally, apart from reading devices such as the Optacon, real-time image/video scene simplification is needed, which is difficult to achieve with real scenes. The tactile channel is therefore often complemented with the auditory channel, as described in Section 5.

Tactile/kinesthetic encoding of scenes
The basic principle here is to provide the user with force feedback and possibly additional tactile stimuli. Such approaches have become popular owing to the development of virtual reality force-feedback devices, such as the CyberGrasp [64], the PHANTOM family of devices [65], the Phantograph [66], gaming devices, or simply the Logitech WingMan force-feedback mouse [67]. Force feedback allows rendering the feel of an object or of a surface. For instance, [68] investigated different methods for representing various forms of picture characteristics (boundary or shape, color, and texture) using haptic rendering techniques. A virtual fixture mechanism allows following contours as if one were guided by virtual rails. When the force-feedback pointer is close enough to the line, this mechanism pulls the end effector towards the line. Surface textures were also rendered by virtual bump mapping.
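The pulling behaviour can be sketched as a simple spring law: whenever the end effector comes within a capture radius of the contour, a Hookean force attracts it toward the closest point on the contour. This is a minimal 2D sketch; the gain and radius values are hypothetical, and real devices such as the PHANTOM work in 3D with additional damping terms:

```python
import numpy as np

def fixture_force(pointer, a, b, k=40.0, capture_radius=0.02):
    """Spring force pulling the end effector toward contour segment a-b
    when the pointer lies within the capture radius (metres, N/m)."""
    a, b, p = (np.asarray(v, dtype=float) for v in (a, b, pointer))
    ab = b - a
    # parameter of the closest point on the segment, clamped to [0, 1]
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    closest = a + t * ab
    d = closest - p
    if np.linalg.norm(d) > capture_radius:
        return np.zeros(2)        # outside the fixture: no guidance force
    return k * d                  # Hooke spring toward the contour
```

A pointer hovering 1 cm above a horizontal contour is pulled straight down onto it, while a pointer far from the contour, or already on it, feels no force at all, which is what makes the "virtual rail" unobtrusive.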
Colwell et al. [44] carried out a series of studies on virtual textures and 3D objects. They tested the accuracy of a haptic interface for displaying the size and orientation of geometrical objects (cube, sphere). They also studied whether blind people could recognize simulated complex objects (i.e., sofa, armchair, and kitchen chair). Results from their experiments showed that participants might perceive the size of larger virtual objects more accurately than that of smaller ones. Users also may not understand complex objects from purely haptic information. Therefore, additional information, such as from the auditory channel, has to be supplied before the blind user can explore the object. Other studies reported by [49] tested the recognition of geometrical objects (e.g., cylinders, cubes, and boxes) and mathematical surfaces, as well as navigation in a traffic environment. Results showed that blind users recognize realistic complex objects and environments more easily than abstract ones.
In [69], a method has been proposed for the haptic perception of greyscale images using pseudo-3D representations of the image. In particular, the image is first filtered so as to retain only its most important texture information. Next, the pseudo-3D representation is generated using the intensity of each area of the image. The user can then navigate the 3D terrain and access the encoded color and texture properties of the image.
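The intensity-to-relief idea can be sketched in a few lines: height is made proportional to brightness, and the local slope can be fed back as a lateral force so the user feels the terrain. This is a simplified stand-in for the method of [69], whose actual filtering and force model are more elaborate; the stiffness value is arbitrary:

```python
import numpy as np

def heightmap_and_force(grey, x, y, k=1.0):
    """Map pixel intensity to pseudo-3D height and return, for probe
    position (x, y), the height plus a lateral force pushing downhill
    (along the negative intensity gradient), so the relief can be felt."""
    h = grey.astype(float) / 255.0          # brightness -> elevation in [0, 1]
    gy, gx = np.gradient(h)                 # slope along rows (y) and columns (x)
    return h[y, x], -k * np.array([gx[y, x], gy[y, x]])

# a horizontal brightness ramp yields a constant slope to the right
ramp = np.tile(np.arange(64, dtype=np.uint8) * 4, (64, 1))
height, force = heightmap_and_force(ramp, x=32, y=32)
```

On the ramp, the probe is pushed back toward the darker (lower) side and feels no force along the constant-intensity direction, which conveys the image gradient haptically.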
Recently, Tzovaras et al. [25] developed a prototype for the design of haptic virtual environments for the training of the visually impaired. This highly interactive and extensible haptic VR training system allows visually impaired users to study and interact with various virtual objects in specially designed virtual environments, while allowing designers to produce and customize these configurations. Based on the system prototype and the use of the CyberGrasp haptic device, a number of custom applications have been developed. The training scenarios included object recognition/manipulation and cane simulation (see Figure 1), used for performing realistic navigation tasks. The experimental studies concluded that the use of haptic virtual reality environments provides the blind with an alternative means of harmlessly learning to navigate in specific virtual replicas of existing indoor or outdoor areas.

AUDITORY ENCODING FOR VISION SUBSTITUTION
Fish [70] describes one of the first known works that used the auditory channel to convey visual information to a blind user. 2D pictures were coded by tone bursts representing dots corresponding to image data. Image processing was minimal. The vertical location of each dot was represented by the tone frequency, while the horizontal position was conveyed by the ratio of the sound amplitudes presented to each ear through binaural headphones. At about the same time appeared the "K Sonar-Cane," a device that allowed navigation in unknown environments [71]. By combining a cane and an ultrasound torch, it was possible to perceive the environment by listening to a sound coding the distance to objects and, to some extent, object textures via the returning echo. The sound image was always centered on the axis pointed at by the sonar. Scanning with that cane only produced a one-dimensional response (as if using a regular cane with enhanced and variable range) that did not take color into account. Some related developments used miniaturized sonars mounted on spectacles.
Later, Scadden [72] was reportedly the first to discuss the use of interface sonification to access data. Regarding diagrams, their nonvisual representation has been investigated by linking touch (using a graphical tablet) with auditory feedback. Kennel [73] presented diagrams (e.g., flowcharts) to blind people using multilevel audio feedback and a touch panel. Touching objects (e.g., diagram frames) and applying different pressures triggered feedback concerning the frame and the interrelations between frames. Speech feedback was also employed to express the textual content of the frame. More recent works regarding diagram presentation include for instance [74,75]. Using speech output, Mikovec and Slavik [76] defined an object-oriented language for picture description. In this approach, an image was defined by a list of objects in the picture. Every object was specified by its definition (position, shape, color, texture, etc.), its behavior ("is driving"), and its interrelations with other objects. These interrelations were either hierarchical ("is in") or not (for groups of objects without hierarchical relation). The description was then stored in an XML document. To obtain the picture description, the blind user worked with a specific browser which went through the objects composing the image and read their information.
The direct use of the physical properties of sound is another method to represent spatial information. Meijer [77] designed a system ("The vOICe") that uses a time-multiplexed sound to represent a 64 × 64 grey-level picture. Every image is processed from left to right, and each column is listened to for about 10 milliseconds. Each pixel is associated with a sinusoidal tone, whose frequency corresponds to its vertical position (high frequencies are at the top of the column and low frequencies at the bottom) and whose amplitude corresponds to its brightness. Each column of the picture is defined by superimposing the vertical tones. This head-centric coding does not keep a constant pitch for a given object when one nods the head, because of the elevation change. In addition, interpreting the resulting signal is not obvious and requires extensive training. Capelle et al. [78] proposed the implementation of a crude model of the primary visual system. The implemented device provides two resolution levels corresponding to an artificial central retina and an artificial peripheral retina, as in the real visual system. The auditory representation of an image is similar to that used in "The vOICe," with distinct sinusoidal waves for each pixel in a column and each column being presented sequentially to the listener.
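The column-scan coding just described is easy to reproduce: each roughly 10-millisecond frame is a sum of sinusoids, one per row, with frequency decreasing from top to bottom and amplitude given by pixel brightness. The following is a sketch of the principle only; the sample rate and the 500-5000 Hz frequency range are assumptions, not values from the original system:

```python
import numpy as np

FS = 16000                     # audio sample rate (Hz) -- assumed
COL_MS = 10                    # roughly 10 ms per image column, as above
F_LO, F_HI = 500.0, 5000.0     # assumed range of the vertical tones

def sonify(img):
    """Left-to-right scan: row -> sine frequency, brightness -> amplitude."""
    n_rows, n_cols = img.shape
    n = FS * COL_MS // 1000                  # samples per column frame
    t = np.arange(n) / FS
    freqs = np.linspace(F_HI, F_LO, n_rows)  # top row = highest pitch
    out = []
    for c in range(n_cols):
        amps = img[:, c].astype(float) / 255.0
        # superimpose one sinusoid per row, weighted by pixel brightness
        col = (amps[:, None] * np.sin(2 * np.pi * freqs[:, None] * t)).sum(axis=0)
        out.append(col / n_rows)             # normalize the superposition
    return np.concatenate(out)

# a single bright pixel in the top-left corner sounds only during the
# first 10 ms frame, as a high-pitched tone
img = np.zeros((64, 64), dtype=np.uint8)
img[0, 0] = 255
audio = sonify(img)
```

The 640-millisecond output for a full 64 × 64 frame also makes the training difficulty mentioned above tangible: every scene is a dense chord sequence that the listener must learn to parse.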
Hollander [79] represented shapes using a "virtual speaker array." This environment was defined with a virtual auditory spatialization system based on specific head-related transfer functions (HRTFs) [26]. The auditory environment directly mapped the visual counterpart; a pattern was rendered by a moving sound source that traced, in the virtual auditory space, the segments belonging to the pattern. Gonzalez-Mora et al. [80] have been working on a prototype for the blind in the Virtual Acoustic Space Project. They have developed a device which captures the form and the volume of the space in front of the blind person's head and sends this information, in the form of a sound map, through headphones in real time. Their original contribution was to apply the spatialization of sound in three-dimensional space with the use of HRTFs.

Figure 2: Schematic representation of the SeeColor targeted mobility aid. A user points stereo cameras towards the portion of a visual scene that will be sonified. Typical colors, here green for the traffic light and yellow for the crosswalk, are transformed into particular musical instrument sounds: flute for the green pixels, and piano for the yellow ones. These sounds are rendered in a virtual 3D sound space which corresponds to the observed portion of the visual scene. In this sound space, the music from each instrument appears to originate from the corresponding colored pixel locations: upper-right for the flute, bottom-center for the piano.
Rather than trying to somehow directly map scene information into audio output, it is also possible to perform some form of image or scene analysis in order to obtain a compact description that can then be spoken to the user. This is typically the case with devices for reading books, such as the Icare system [81]. Programs that look for textual captions in images also fall into this category; they can be very useful, for instance, for accessing web pages in which text is often inlaid in images. Similarly, diagram translators make it possible to describe the content of schematics. Applications that are more sophisticated in terms of image or video processing often address mobility and life in real, unfamiliar environments. Where mobility is concerned, there is a need for systems embedded in portable computers such as PDAs. One example that targets unfamiliar environments concerns the design of a face recognition system, where images acquired by a miniature camera located on spectacles are analyzed and the results then transmitted by a synthetic voice [82].
Concerning developments revolving around navigation, Eddowes and Krahe [83] present an approach for detecting pedestrian traffic lights using color video segmentation and structural pattern recognition. The NAVI (Navigation Assistance for Visually Impaired) system uses a fuzzy-rule-based object identification methodology and outputs results through stereo headphones (e.g., [84]). In [85,86], methodologies are discussed for the detection of pedestrian crossings and their orientation, and for the estimation of their lengths. A vision-based monitoring application is presented in [87]; it concerns the detection of significant changes from ceiling-mounted cameras in a home environment, in order to generate spoken warnings when appropriate.
A project currently conducted in one of our laboratories (Geneva), called SeeColor, aims at achieving a noninvasive mobility aid for blind users that uses the auditory pathway to represent frontal image scenes in real time [88,89]. Ideally, the targeted system will allow visually impaired or blind subjects who have previously experienced sight to build coherent mental images of their environment. Typical colored objects (signposts, mailboxes, bus stops, cars, buildings, sky, trees, etc.) will be represented by sound sources in a three-dimensional sound space that reflects the spatial position of the objects (see Figure 2). Targeted applications are the search for objects that are of particular use to blind users, the manipulation of objects, and navigation in an unknown environment. SeeColor presents two novel aspects. First, pixel colors are encoded by musical instrument sounds, in order to emphasize colored objects and textures that will contribute to building consistent mental images of the environment. Second, object depth is (currently) encoded by signal duration, with four possible values corresponding to four depth ranges. In terms of image and video processing, images coming from stereo cameras are processed in order to decrease the number of colors and retain only the most significant ones. Work is underway concerning the extraction of intrinsic color properties, in order to discard as much as possible the effect of the illuminants. Another aspect under investigation concerns the determination of salient regions, both spatially and in depth, to be able to suggest to the user where to focus attention [90]. Experiments were first conducted to demonstrate the ability to learn associations between colors and musical instrument sounds. The ability to locate and associate objects of similar colors has been validated with 15 participants who were asked to make pairs from socks of different colors.
The current prototype is now being tested as a mobility aid, where a user has to follow a line painted on the ground in an outdoor setting (see Figure 3); real-time sonification combined with distance information obtained from the stereo cameras allows quite accurate user displacement.
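The color and depth encoding described above can be sketched as a lookup: a pixel's nearest reference color selects an instrument, and its depth range selects one of four signal durations. The green/flute and yellow/piano pairs come from Figure 2; the remaining palette entries, the depth boundaries, and the duration values are hypothetical placeholders, not SeeColor's actual parameters:

```python
import numpy as np

# Reference colors; green -> flute and yellow -> piano follow Figure 2,
# the other instrument assignments are invented for this sketch.
PALETTE = {"red": (255, 0, 0), "green": (0, 255, 0),
           "blue": (0, 0, 255), "yellow": (255, 255, 0)}
INSTRUMENT = {"red": "trumpet", "green": "flute",
              "blue": "violin", "yellow": "piano"}
DEPTH_EDGES_M = (1.0, 2.0, 4.0)       # four depth ranges (hypothetical bounds)
DURATIONS_MS = (90, 160, 250, 350)    # one signal length per range (hypothetical)

def encode_pixel(rgb, depth_m):
    """Return (instrument, duration_ms): the nearest palette color picks
    the instrument, and the depth range picks how long the sound lasts."""
    rgb = np.asarray(rgb, dtype=float)
    # nearest reference color in RGB space (squared Euclidean distance)
    name = min(PALETTE, key=lambda k: np.sum((rgb - PALETTE[k]) ** 2))
    band = sum(depth_m > e for e in DEPTH_EDGES_M)   # depth range index 0..3
    return INSTRUMENT[name], DURATIONS_MS[band]
```

For instance, a greenish traffic-light pixel at half a metre maps to a short flute note, while a distant yellowish crosswalk pixel maps to a long piano note, so both identity and distance are conveyed in a single sound.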

AUDITORY/HAPTIC ENCODING FOR VISION SUBSTITUTION
In view of the limitations of the auditory or haptic channels taken independently, it makes sense to combine them in order to design auditory/haptic vision substitution systems. The first multimodal system for presenting graphical information to blind users was the Nomad [91], where a touch-sensitive tablet was connected to a synthetic voice generator. Parkes [92] presents and discusses a suite of programs and integrated hardware called TAGW, where TAG stands for Tactile Audio Graphics. Systems with similar functionalities allowing the rendering of diagrams were also realized, for instance by [73,93], with emphasis on hierarchical auditory navigation. In these systems, the graphical information to render has to be manually prepared beforehand in order to associate particular vocal information with image regions. Commercial tactile tablets with auditory output exist, such as the T3 tactile tablet from the Royal National College for the Blind, UK [94]. The T3 is routinely used in schools for visually handicapped pupils, for instance to allow access to a world encyclopaedia.
The possibility to render more complex information has also been investigated. Kawai and Tomita [95] describe a system that uses stereo vision to acquire 3D objects and render them on a 16 × 16 display of raising pins. Synthetic voice is added to provide more information about the objects being presented. Grabowski and Barner [96] extended the system developed by Fritz by adding sonification to the haptic representation. In this approach, the haptic component was used to represent topological properties (size, position) while sonification mapped purely visual characteristics such as colors or textures. More recently, a framework was developed [97] for generating haptic representations, called force fields, of scenes captured through a simple camera. The advantage of this approach lies in the fact that the force fields, once generated, can be stored and processed independently from their source as a standalone means of scene representation. The framework in [97] has been used with videos of 3D map models and can also be used with aerial videos for the potential generation of urban force fields. The resulting force fields are rendered using either the Phantom Desktop or the CyberGrasp haptic device.
An auditory-haptic system that uses force-feedback devices complemented by auditory information was designed by [22,98,99]. In a first phase, a sighted person has to prepare the image to be rendered by sketching it and associating auditory information with key elements of the drawing. This phase should ultimately be made automatic through the use of image segmentation methods, but this had not been fully implemented, as the project concentrated on the rendering aspects and on evaluation. The associated auditory cues differed depending on whether the part to sonify was a contour or a surface. In the case of surfaces, the blind user obtained auditory feedback when crossing the object and/or during the whole time he/she pointed to the object surface. Auditory cues were either tones whose pitch depended on the touched object, or spoken words. In addition, haptic feedback describing the object surface was simulated using either a friction or a textural effect. Contours were rendered using kinesthetic feedback, by a virtual fixture force based on a virtual spring that attracted the mouse cursor towards the contour (see Figure 4). Experiments were first conducted with a Logitech WingMan force-feedback mouse. Its working space (2.5 cm × 2 cm) was found too limited, confirming the assumption of [100]; a specific force-feedback pointing device was thus built, providing an 11.5 cm × 8 cm workspace.
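The virtual-spring contour rendering just described can be sketched as a simple Hooke's-law force pulling the cursor toward the nearest contour point. The stiffness value and the point-sampled contour representation below are assumptions for illustration, not parameters of the original system.

```python
# Minimal sketch of a virtual-spring "fixture" force: the force-feedback
# device pulls the cursor toward the nearest contour point, with a force
# proportional to the distance (Hooke's law). Stiffness K is an assumed value.

import math

K = 0.8  # spring stiffness (force per unit distance), illustrative

def spring_force(cursor, contour_points):
    """Force vector attracting the cursor toward the nearest contour point."""
    nearest = min(contour_points, key=lambda p: math.dist(p, cursor))
    dx = nearest[0] - cursor[0]
    dy = nearest[1] - cursor[1]
    return (K * dx, K * dy)  # F = K * displacement toward the contour

# A square's top and bottom edges sampled as points; the cursor sits
# slightly below the top edge and is pulled back toward it.
contour = [(x, 0.0) for x in range(11)] + [(x, 10.0) for x in range(11)]
print(spring_force((5.0, 2.0), contour))  # pulled toward (5, 0)
```

In an actual device loop, this force would be recomputed at the haptic update rate and sent to the force-feedback mouse, so that releasing the cursor lets it "snap" onto the contour.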
In [97], a very promising approach has been presented for the auditory-haptic representation of conventional 2D maps. A series of signal processing algorithms is applied to the map image to extract the structural information of the map (streets, buildings, and so on) and the symbolic information (street names, special symbols, crossroads, and so on). The extracted structural information is displayed as a grooved line map that is perceived through the Phantom haptic device. The generated haptic map is then augmented with all the symbolic information, which is displayed either using speech synthesis in the case of street names, or using haptic interaction features such as friction and haptic texturing. For example, higher friction values are set for the crossroads, while haptic texturing is used to distinguish between special symbols of the map, such as hospitals. At run time, the user interacts with the grooved line map, and whenever a point of special interest is reached, the corresponding haptic or auditory information is displayed.
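The run-time dispatch just described (speech for street names, friction or texture for other map features) amounts to a lookup from feature type to rendering effect. The following sketch shows the idea; the feature names and effect values are invented for the example and are not those of [97].

```python
# Sketch of a symbol-to-haptic-effect dispatch for a grooved line map.
# Feature types and values below are assumed, purely for illustration.

HAPTIC_MAP = {
    "crossroad": {"effect": "friction", "value": 0.9},  # high friction
    "street":    {"effect": "friction", "value": 0.3},  # low-friction groove
    "hospital":  {"effect": "texture",  "value": "ridged"},
    "monument":  {"effect": "texture",  "value": "bumpy"},
}

def on_contact(feature_type, name=None):
    """Decide what to render when the haptic cursor reaches a map feature."""
    if name is not None:
        return ("speech", name)  # e.g., speak the street name aloud
    entry = HAPTIC_MAP[feature_type]
    return (entry["effect"], entry["value"])

print(on_contact("crossroad"))                    # higher friction at crossroads
print(on_contact("street", name="Main Street"))   # spoken street name
```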
In [101], an agent-based system that supports multimodal interaction for providing educational tools for visually handicapped children is described. Interaction modalities are auditory (vocal and nonvocal) and haptic; the haptic interaction is accomplished using the PHANTOM manipulator. A simulation application allows children to explore natural astronomical phenomena, for instance by navigating through virtual planets. Regarding mobility, Coughlan and Shen [102] and Coughlan et al. [103] address the needs of blind wheelchair users. Their system uses stereo cameras in order to build an environment map. They have also developed specific algorithms to estimate the position and orientation of pedestrian crossings. It is planned to transmit information to the user using synthetic speech, audible tones, as well as tactile feedback.

CONCLUSIONS
As can be seen from the references, research on vision substitution devices has been active for over a century. Systems aiming at totally replacing the sense of sight for blind persons can be categorized according to the alternate modality used to convey the visual information: haptic (tactile and/or kinesthetic), auditory, and auditory/haptic. The use of several modalities is relatively recent, and this trend will certainly continue, since there is a clear benefit in exploiting all possible interaction channels.
Since vision is essentially parallel, the rather sequential nature of these modalities implies a fundamental limitation for all visual aids. A given modality requires some specific preparation of the information. The auditory channel processes an audio signal that is sequential in time, but also allows for some form of parallel processing of the various sound sources composing the stimulus. This "sequential-parallel" capability is for instance used in the SeeColor project described above: a user sequentially focuses on various portions of a scene, and each portion is mapped into several simultaneous sound sources. The haptic modality should provide for both global and local analyses; although rather sequential in nature, some form of global parallel exploration is possible when using more than one finger.
For a long time, image/video processing (if any) has remained fairly simple. In many cases, images are prepared manually before being presented to the system. Otherwise, image/video processing can consist of simple thresholding operations, or of image simplification techniques based on denoising and contour segmentation. Region segmentation is used for instance to allow region filling with predefined textures. Specific image processing techniques such as contrast enhancement, magnification, and image remapping are used for low-vision aids where the impairment to be compensated is well characterized spatially or in the frequency domain. There is now a clear trend towards using the most recent scene analysis techniques for static images and videos. Object recognition and video data interpretation are performed in order to describe the semantic content of a scene. One reason for this increasing use of fairly involved methods, besides their maturation, is the possibility to embed complex algorithms in portable computers with high processing capabilities.
It is a fact that research in vision replacement benefits more and more from progress made in computer vision and in video and image analysis. Many other issues must however be solved. In terms of human-computer interaction, there is a need to better adapt to user needs in terms of ergonomics and ease of interaction. Attention has to be paid to the appearance of systems to make their use acceptable in public environments (although nowadays wearing "funny looking" devices is not as critical as it was in the 1970s). Regarding evaluation, it is not easy to find potential users interested in participating in experiments, especially knowing that the devices they are testing will most likely not make it to the market. Not to be neglected is the economic aspect. The number of totally blind persons is large in absolute terms, and will increase in relative terms due to the ageing of the population, but the vast majority of sightless persons cannot easily afford expensive apparatus. Governments should therefore come into play, by providing direct subsidies to those in need as well as funding for research in this area (which is now the case, as for instance the 6th and 7th European research programs include such topics).
In conclusion, it is felt that, with the current possibilities for miniaturization of wearable devices, the advent of more sophisticated computer vision and video processing techniques, and the increase in public funding, more and more visual substitution devices will appear in the future and, very importantly, will gain acceptance amongst their potential users.

ACKNOWLEDGMENTS
This work is supported by the Similar IST Network of Excellence (FP6-507609). T. Pun, P. Roth, and G. Bologna gratefully acknowledge the support of the Swiss Hasler Foundation and of the Swiss "Association pour le bien des aveugles et amblyopes," as well as the help at various stages of their projects from André Assimacopoulos, Simone Berchtold, Denis Page, and of Professors F. de Coulon (retired) and A. Bullinger (retired) for having helped a long time ago one of the authors (T. Pun) on this fascinating and hopefully useful research topic. Thanks also to many blind persons who have helped us along the years, in particular, Marie-Pierre Assimacopoulos, Alain Barrillier, Julien Conti, and Céline Moret.