Artificial Intelligence for Cochlear Implants: Review of Strategies, Challenges, and Perspectives

Automatic speech recognition (ASR) plays a pivotal role in our daily lives, offering utility not only for interacting with machines but also for facilitating communication for individuals with partial or profound hearing impairments. The process involves receiving the speech signal in analog form, followed by various signal processing algorithms to make it compatible with devices of limited capacities, such as cochlear implants (CIs). Unfortunately, these implants, equipped with a finite number of electrodes, often result in speech distortion during synthesis. Despite efforts by researchers to enhance received speech quality using various state-of-the-art signal processing techniques, challenges persist, especially in scenarios involving multiple sources of speech, environmental noise, and other adverse conditions. The advent of new artificial intelligence (AI) methods has ushered in cutting-edge strategies to address the limitations and difficulties associated with traditional signal processing techniques dedicated to CIs. This review aims to comprehensively cover advancements in CI-based ASR and speech enhancement, among other related aspects. The primary objective is to provide a thorough overview of metrics and datasets, exploring the capabilities of AI algorithms in this biomedical field, and summarizing and commenting on the best results obtained. Additionally, the review will delve into potential applications and suggest future directions to bridge existing research gaps in this domain.


I. INTRODUCTION
In the symphony of modern technology, automatic speech recognition (ASR) emerges as a conductor, orchestrating seamless interaction between humans and machines. This transformative technology has quietly become an integral part of our daily lives, influencing how we communicate, access information, and even navigate the intricacies of healthcare. The significance of ASR extends beyond its role in facilitating human-computer interaction; it permeates diverse applications such as voice assistants and virtual agents, speech-to-text conversion, and identity verification, and it holds particular promise in the realm of biomedical research [1], [2]. ASR bridges the gap between spoken language and digital communication, enabling the conversion of spoken words into written text with remarkable accuracy. The pervasiveness of ASR technology is evident in the devices we use daily, including smartphones, smart speakers, and voice-activated virtual assistants, all seamlessly responding to our spoken commands and queries. The convenience it brings to our lives is undeniable, offering a hands-free and efficient mode of interaction that has become second nature.
ASR is also pivotal in authentication systems, safeguarding the security and privacy of sensitive information. The integrity of audio speech can be verified through ASR-based techniques such as adversarial attack detection [3], steganalysis [4]-[6], speech biometrics [7], and more. Beyond the realm of communication, ASR finds itself at the heart of various applications, each playing a unique role in different domains. Speaker recognition, a facet of ASR, is not merely confined to enhancing security measures. It has evolved into a versatile tool employed in healthcare, where the identification of individuals through their unique vocal signatures holds promise for personalized patient care. This is particularly relevant in scenarios where quick and secure authentication is crucial, such as accessing medical records or authorizing medical procedures. Event recognition, another dimension of ASR, is a game-changer in sectors ranging from security to healthcare. In the former, ASR algorithms analyze audio data to automatically detect and categorize specific events, reinforcing surveillance capabilities. In healthcare, event recognition becomes a powerful tool for monitoring and early detection of health-related events, recognizing speech in noisy environments [8], assessing the severity of dysarthria in a person's speech [9], and more. In the context of cardiac health, ASR can aid in identifying anomalies in heart sounds, potentially enabling early intervention and preventive measures [10]. Source separation, the ability to discern and isolate individual sound sources from complex audio signals, is a boon in fields like entertainment and music production. However, its significance extends into the realm of biomedical research, where ASR plays a pivotal role in decoding the intricate language of physiological signals. In the context of cochlear implants (CIs), source separation becomes a critical component in enhancing the auditory experience for individuals with hearing impairments.
CIs, designed to restore hearing in individuals with severe hearing loss or deafness, rely on ASR for optimizing their functionality. ASR contributes significantly to the improvement of speech perception in CI users by enhancing the processing and interpretation of auditory signals. CIs work by converting sound waves into electrical signals that stimulate the auditory nerve, bypassing damaged parts of the inner ear. ASR complements this process by aiding in the recognition and translation of spoken language. The technology plays a crucial role in optimizing speech understanding for CI users by refining the interpretation of varied speech patterns, tones, and nuances. Moreover, ASR in the context of CIs extends beyond basic speech recognition. It contributes to the recognition of environmental sounds, facilitating a more immersive auditory experience for individuals with hearing impairments. This is particularly significant in enhancing the quality of life for CI recipients, allowing them to navigate and engage with their surroundings more effectively.

A. RELATED WORK
Many reviews have been written in the context of CIs. For example, [11] discussed the advantages offered by machine learning (ML) for cochlear implantation, such as analyzing data to personalize treatment strategies. ML enhances accuracy in speech processing optimization, surgical anatomy location prediction, and electrode placement discrimination. Besides, the review delves into applications including optimizing CI fitting, predicting patient threshold levels, and automating image-guided CI surgery. It also discusses novel opportunities for research, emphasizing the need for high-quality data inputs and addressing concerns about algorithm transparency in clinical decision-making for improved patient care. Similarly, the review by Manero et al. [12] details the benefits of employing artificial intelligence (AI) to enhance CI technology, involving adaptive sound processing, acoustic scene classification, and auditory scene analysis. The authors discuss AI-driven advancements aiming to optimize sound signals, adapt to diverse environments, and improve speech perception for individuals with hearing loss, ultimately enhancing their overall quality of life.
Additionally, the review [13] explores three main topics: direct-speech neuroprosthesis, which involves decoding speech from the sensorimotor cortex using AI, including the synthesis of produced speech from brain activity; a top-down exploration of pediatric cochlear implantation using ML, delving into its applications in pediatric cochlear implantation; and the potential of AI to solve the hearing-in-noise problem, examining its capabilities in addressing challenges related to hearing in noisy environments. Moreover, the review [14] critically examines the current landscape of tele-audiology practices, highlighting both their constraints and potential opportunities. Specifically, it explores intervention and rehabilitation efforts for CIs, focusing on remote programming and the concept of self-fitting CIs. Recently, a 2023 review by Henry et al. [15] comprehensively examined noise reduction algorithms employed in CIs. Maintaining a general classification based on the number of microphones used, single or multiple channels, the analysis extends to recent studies showcasing a growing interest in ML techniques. The review culminates with an exploration of potential research avenues that hold promise for future advancements in the field. Table 1 offers a comparative analysis of the proposed review in contrast to other discussed AI-based CI reviews and surveys.

B. STATISTICS ON INVESTIGATED PAPERS
Recently, there has been a surge in publications related to AI-based CIs. The review methodology entails defining the search strategy and study selection criteria. Criteria for inclusion, such as keyword relevance and impact, shape the quality assessment protocol. A comprehensive search was conducted on databases such as Scopus and Web of Science. Keywords were extracted for theme clustering, resulting in a formulated query to gather advanced AI-based CI studies. The research query retrieves references from papers containing the keywords "Cochlear implant" or "Hearing loss" and "Artificial intelligence" in their abstracts, titles, or authors' keywords. It subsequently refines these papers, focusing on those that also include "Machine learning," "Deep learning," or "Reinforcement learning." Figure 1 illustrates the most frequently used keywords by the authors in the titles, abstracts, and keywords of the selected papers.
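For illustration, such a query can be expressed in Scopus-style syntax roughly as follows; the exact field codes and operators used by the authors are not specified in the text, so this is an indicative form only:

TITLE-ABS-KEY("cochlear implant" OR "hearing loss") AND TITLE-ABS-KEY("artificial intelligence") AND TITLE-ABS-KEY("machine learning" OR "deep learning" OR "reinforcement learning")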
Figure 2 illustrates the distribution of these papers by

C. MOTIVATION AND CONTRIBUTION
The motivation behind conducting a comprehensive review on CIs stems from the imperative to critically assess and consolidate the current state of AI applications in this crucial field. CIs have revolutionized auditory rehabilitation for individuals with hearing impairment, and integrating ML and deep learning (DL) techniques holds immense potential for further advancements. This review seeks to fill a significant gap in the existing literature by providing a detailed analysis of recent AI-based CI frameworks. The primary objective is to present a nuanced understanding of the landscape, categorizing frameworks based on ML and DL methodologies, available datasets, and key metrics. By addressing this gap, the review aims to offer valuable insights for researchers, clinicians, and technologists involved in the development and improvement of CI technologies. Furthermore, the exploration of advanced DL algorithms, such as transformers and reinforcement learning (RL), in the context of CIs underscores the potential for transformative breakthroughs. Ultimately, this research review aspires to contribute to the enhancement of CI technologies, fostering innovation and improving the quality of life for individuals with hearing impairment. The principal contributions of this paper can be succinctly outlined as follows:
• Detailing the assessment metrics associated with AI and CIs, and elucidating the extensively utilized datasets, whether publicly accessible or generated, employed to validate AI-based ASR for CI methodologies.
• Identifying existing research gaps, offering insights, and proposing novel ideas to address these gaps. Additionally, exploring potential avenues for future research to deepen comprehension and provide valuable guidance for subsequent investigations.
The subsequent sections are organized as follows: Section II delves into the background of speech processing, outlining datasets and metrics. Section III discusses the methodology employed for CIs based on AI. Section IV presents the medical applications and impact of applying AI to CIs. Section V offers a comprehensive discussion of research gaps, future directions, and perspectives. Finally, Section VI concludes the paper with implications and future research directions.

II. BACKGROUND

A. COCHLEAR IMPLANTS
CIs are devices designed to restore hearing in individuals with partial or severe deafness [16]. CIs comprise an external part with a microphone and speech processor and an internal part with a receiver-stimulator and electrode array, as shown in Figure 3. They convert sounds into electrical signals, stimulating the auditory nerve to enable sound perception in individuals with profound hearing loss [16].
The incoming sound is divided into multiple frequency channels using bandpass filters and then processed by envelope detectors. Non-linear compressors adjust the dynamic range of the envelope for each patient. The compressed envelope amplitudes are then utilized to modulate a fixed-rate biphasic carrier signal. A current source converts voltage into pulse trains of current, which are delivered to electrodes placed along the cochlea in a non-overlapping manner. This stimulation method is called continuous interleaved sampling (CIS). Another coding strategy, known as the advanced combination encoder (ACE), uses a greater number of channels and dynamically selects the "n-of-m" bands with the largest envelope amplitudes (prior to compression). Only the corresponding "n" electrodes are stimulated. A popular device widely used for CIs, such as the Cochlear Nucleus, typically has 22 channels.
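To make the CIS/ACE signal chain above concrete, the following minimal NumPy/SciPy sketch implements an ACE-style n-of-m frame: band-pass filtering, envelope extraction, compression, and selection of the n largest bands. The filter bank layout, envelope estimator, compression curve, and all constants are illustrative assumptions, not the parameters of any commercial processor.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

FS = 16000          # sampling rate (Hz) -- illustrative
M_CHANNELS = 22     # total analysis bands, e.g. a 22-electrode device
N_SELECT = 8        # "n-of-m": number of bands stimulated per frame

def analysis_bands(fs=FS, m=M_CHANNELS, f_lo=200.0, f_hi=7000.0):
    """Log-spaced band edges covering the speech range (illustrative)."""
    edges = np.geomspace(f_lo, f_hi, m + 1)
    return [butter(4, (edges[i], edges[i + 1]), btype="bandpass",
                   fs=fs, output="sos") for i in range(m)]

def ace_frame(signal, bands, n=N_SELECT):
    """One ACE-style frame: band-pass -> envelope -> compress -> pick n largest."""
    envelopes = np.array([np.abs(hilbert(sosfilt(sos, signal))).mean()
                          for sos in bands])
    selected = np.argsort(envelopes)[-n:]          # n-of-m maxima selection
    compressed = np.log1p(100.0 * envelopes)       # stand-in for the non-linear map
    stim = np.zeros_like(compressed)
    stim[selected] = compressed[selected]          # only n electrodes stimulated
    return stim

# usage: an 8 ms frame of a synthetic vowel-like tone
t = np.arange(int(0.008 * FS)) / FS
frame = np.sin(2 * np.pi * 500 * t) + 0.3 * np.sin(2 * np.pi * 2500 * t)
print(ace_frame(frame, analysis_bands()))
```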
The sound processor, which usually contains a microphone, battery, and other components, can be worn either behind the ear (BTE) or off the ear (OTE). A headpiece holds a transmitter coil, positioned externally above the ear, while internally a receiver coil, stimulator, and electrode array are implanted. The sound processor includes a digital signal processor (DSP) with memory units (maps) that store patient-specific information. An audiologist configures these maps during the fitting process, adjusting thresholds for each electrode, including T-levels (the softest current levels audible to the CI user) and C/M-levels (current levels perceived as comfortably loud), as well as the stimulation rate or programming strategy. Data (pulse amplitude, pulse duration, pulse gap, etc.) and power are sent through the skull via a radio-frequency signal from the transmitter coil to the receiver coil. The stimulator decodes the received bitstream and converts it into electric currents to be delivered to the cochlear electrodes. High-frequency signals stimulate electrodes near the base of the cochlea, while low-frequency signals stimulate electrodes near the apex.
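The fitting parameters described above (T-levels and C-levels) essentially define, per electrode, the electrical dynamic range onto which the compressed envelope is mapped. The sketch below illustrates that mapping with an assumed logarithmic loudness-growth function; the function shape, base constant, and units are hypothetical placeholders for values that are set during clinical fitting.

```python
import numpy as np

def map_to_current(env, t_level, c_level, base=256.0):
    """Map a normalised envelope value (0..1) onto the electrical dynamic
    range of one electrode, bounded by its T-level and C-level.
    The logarithmic loudness-growth shape and 'base' are illustrative."""
    env = np.clip(env, 0.0, 1.0)
    loudness = np.log1p(base * env) / np.log1p(base)   # compressive growth
    return t_level + loudness * (c_level - t_level)    # clinical current units

# usage: one electrode fitted with T = 100 and C = 180 (arbitrary units)
for e in (0.0, 0.1, 0.5, 1.0):
    print(f"envelope {e:.1f} -> current {map_to_current(e, 100, 180):.1f}")
```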
The CI stimulates the auditory nerve afferents, which connect to the central auditory pathways. However, compared to individuals with normal hearing (NH), CI users face more difficulties in speech perception, particularly in noisy environments. Hearing loss can be caused by various factors, including natural aging, genetic predisposition, exposure to loud sounds, and medical treatments. Damage to the hair cells in the inner ear often leads to a reduced dynamic range of hearing, as well as decreased frequency selectivity and discriminative ability in speech processing. To evaluate the effectiveness of CIs in speech perception amid noise, listening tests involving both normal-hearing individuals and CI users are commonly conducted. These tests typically employ a combination of speech utterances from a recognized speech corpus and background noises such as speech-weighted noise and babble. Alternatively, vocoder simulations can be utilized alongside speech intelligibility metrics. While sentence-based tests are frequently employed, other stimuli such as vowels, consonants, and phonemes are also used. As a result, noise reduction techniques are increasingly employed to enhance the performance of CIs in challenging environments [15].

B. DATASETS
Researchers have utilized numerous datasets to validate their proposed schemes, comprising both widely recognized publicly available sets and locally generated ones. These datasets fall into two categories: speech or images. Table 2 provides a summary of these datasets, detailing their characteristics, citing studies that have utilized them, and indicating their availability through links or references.

C. METRICS
Multiple evaluation metrics are utilized during the training and validation of any DL model, including AI-based CI models. These metrics, derived from the confusion matrix, are widely known and applicable across various data types such as speech or images. They include accuracy (Acc), sensitivity (Sen), recall (Rec), specificity (Spe), precision (Pre), F1 score (F1), and the receiver operating characteristic (ROC) curve. Moreover, different metrics play roles in prediction tasks. For instance, intersection over union (IoU) assesses overlap, and mean absolute error (MAE) quantifies absolute differences. For a comprehensive understanding of the metrics discussed, including their equations, refer to the details provided in [5], [39]. Other metrics that are widely used for CIs are summarized in Table 3.
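For reference, the confusion-matrix metrics and the prediction metrics mentioned above can be computed as in the generic sketch below (binary case, axis-aligned boxes for IoU); it does not correspond to any specific study's evaluation code.

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Binary confusion-matrix metrics (Acc, Sen/Rec, Spe, Pre, F1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / len(y_true)
    sen = tp / (tp + fn)            # sensitivity = recall
    spe = tn / (tn + fp)
    pre = tp / (tp + fp)
    f1 = 2 * pre * sen / (pre + sen)
    return dict(Acc=acc, Sen=sen, Spe=spe, Pre=pre, F1=f1)

def mae(y_true, y_pred):
    """Mean absolute error for regression-style predictions."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(confusion_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
print(mae([0.5, 0.7], [0.4, 0.9]), iou((0, 0, 2, 2), (1, 1, 3, 3)))
```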

III. TAXONOMY OF CI-BASED AI TECHNIQUES
Several artificial intelligence techniques have been employed to enhance the efficacy of CIs. While some rely on 1D data, others process information in a 2D image format. Figure 4 summarizes the AI algorithms utilized, alongside the features employed and hybrid AI methodologies. Additionally, Table ?? provides a summary of DL-based techniques utilized in CI hearing devices.

A. CI-BASED AI IMPLEMENTATION
CI programming involves adjusting device settings to optimize sound perception for individual users. This includes setting stimulation levels, electrode configurations, and signal processing parameters to enhance speech understanding and auditory experiences based on patient feedback and objective measures. In 2010, Govaerts et al. [41] described the development of an intelligent agent, called fitting to outcomes expert (FOX), for optimizing CI programming, as illustrated in Figure 5. The agent analyzes map settings and psychoacoustic test results to recommend and execute modifications to improve outcomes. The tool focuses on an outcome-driven approach, reducing fitting time and improving the quality of fitting. It introduces principles of AI into the CI fitting process. The study proposed objective measures and grouped electrode settings as strategies to reduce fitting time.
Similarly, the works in [42]-[46] all employed FOX for programming CIs. Vaerenberg et al. [42] discuss the use of FOX for programming CI sound processors in new users. FOX modifies maps based on specific outcome measures using heuristic logic and deterministic rules. The study showed positive results and optimized performance after three months of programming, with good speech audiometry and loudness-scaling outcomes. The paper highlights the importance of individualized programming parameters and the need for outcome-based adjustments rather than relying solely on comfort. In [43], computer-assisted CI fitting using FOX was assessed for its impact on speech understanding. Results from 25 recipients showed that 84% benefited from the suggested map changes, significantly improving speech understanding thanks to the learning capacity of FOX. This approach offers standardized, systematic CI fitting, enhancing auditory performance.
The COCH gene, also referred to as the cochlin gene, encodes the cochlin protein, is situated on chromosome 14 in humans, and is primarily expressed in the inner ear. Cochlin predominantly functions within the cochlea, a spiral-shaped structure involved in the process of hearing, contributing to its structural integrity and proper operation. Wathour et al. [44] discuss the use of AI in CI fitting through two case studies. The first case involves a 75-year-old woman who received a left-ear implant due to gradual and severe hearing loss in both ears without a clear cause. In the second case, a 72-year-old man with a COCH gene mutation causing profound hearing loss in both ears underwent a right-ear implant to assess whether CI programming using the AI-based FOX software could improve CI performance. The results showed that AI-assisted fitting led to improvements in auditory outcomes for adult CI recipients who had previously undergone manual fitting. The AI suggestions helped improve word recognition scores and loudness-scaling curves. Similarly, Waltzman et al. [45] incorporated AI in programming CIs, aiming to assess the performance and standardization of AI-based programming on fifty-five adult CI recipients. The results showed that the AI-based FOX system performed better for some patients, while others had similar results; however, the majority preferred the FOX system.

B. ML-BASED METHODS
ML is a subfield of AI that focuses on developing algorithms and statistical models that enable computer systems to improve their performance on a specific task by learning features from input data. The research in [47] is one such ML-based contribution, and further representative studies are discussed below; Table 2 summarizes the datasets used across these works.

TABLE 2: Summary of the datasets used to validate AI-based CI methods (name, size, number of speakers/words/classes, description, studies that used the dataset, and availability).
• TIMIT (6300 utterances, 630 speakers; used in [23], [24], [25]; Link 2): The dataset consists of phonemically and lexically transcribed speech from American English speakers belonging to diverse demographics and dialects. It provides comprehensive information with time-aligned orthographic, phonetic, and word transcriptions. Additionally, each utterance is accompanied by its corresponding 16-bit, 16 kHz speech waveform file, ensuring a complete and detailed dataset for analysis and experimentation in ASR and acoustic-phonetic studies.
• GSC (18 hours, 30 words; used in [26]; Link 3): The dataset was gathered using crowd-sourcing. It consists of 65,000 recordings, each lasting one second, and contains 30 brief words. Among these words, 20 commonly used ones were spoken five times by the majority of participants, while 10 other words (considered unfamiliar) were spoken only once.
• Simulated living rooms (used in [28]; Link 6): The dataset features simulated living rooms with static sources, including a single target speaker, an interferer (competing talker or noise), and a large target speech database of English sentences produced by 40 British English speakers.
• DEMAND (560, 6): The dataset consists of 15 recordings capturing acoustic noise in various environments. These recordings were made using a 16-channel array, with microphone distances ranging from 5 cm to 21.8 cm.
• THCHS-30 (35 hours, 50; used in [30]; Link 8): The dataset is a free Chinese speech corpus accompanied by resources such as a lexicon and language models.
• BCP (55938, 20; used in [31], [32]): The Bern Cocktail Party (BCP) dataset contains cocktail-party scenarios with individuals wearing CI audio processors and a head-and-torso simulator. Recorded in an acoustic chamber, it includes multi-channel audio, image recordings, and digitized microphone positions for each participant.
• iKala (252 clips of 30 s, 206; used in [33]; Link 9): The dataset comprises audio recordings consisting of vocal and backing-track music with a sampling rate of 44,100 Hz. Each music track is a stereo recording, where one channel contains the singing voice and the other channel contains the background music. All tracks were performed by professional musicians and featured a group of six singers, evenly split between three females and three males.
• MUSDB (150 tracks, 4 sources; used in [33]; Link 10): The dataset is a collection of music tracks specifically designed for music source-separation research. It consists of professionally mixed songs across various genres, with individual tracks isolated for vocals, drums, bass, and other accompaniment.
• CQ500 (491 scans; used in [34]; Link): The dataset includes anonymized DICOM files, along with the interpretations provided by radiologists. The interpretations were conducted by three radiologists who have 8, 12, and 20 years of experience in interpreting cranial CT scans, respectively.

In addition, Torresen et al. [48] discuss the use of ML techniques to streamline the adjustment process for CIs. The goal is to predict optimal adjustment values for new patients based on data from previous patients. By analyzing data from 158 former patients, the study shows that while fully automatic adjustments are not possible, ML can provide a good starting point for manual adjustment. The research also identifies the most important electrodes to measure for predicting the levels of other electrodes. This approach has the potential to reduce programming time, benefit patients, and improve speech recognition scores, particularly for young children and patients with post-lingual deafness.
Henry et al. [49] investigate the importance of acoustic features in optimizing intelligibility for CIs in noisy environments. The study employs ML algorithms and extracts acoustic features from speech and noise mixtures to train a deep neural network (DNN). The results, using various metrics, reveal that frequency-domain features, particularly Gammatone features, perform best for normal hearing, while Mel-spectrogram features exhibit the best overall performance for hearing impairment. The study suggests a stronger correlation between STOI and NCM in predicting intelligibility for hearing-impaired listeners. The findings can aid in designing adaptive intelligibility enhancement systems for CIs based on noise characteristics.
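As an illustration of the kind of feature pipeline such studies rely on, the sketch below extracts log-Mel features from paired noisy and clean signals, the sort of input/target representation a DNN intelligibility or enhancement model can be trained on. The parameters and the use of librosa are assumptions made for illustration; Gammatone features, also discussed above, would require a dedicated filter bank and are not shown.

```python
import numpy as np
import librosa

def mel_features(noisy, clean, sr=16000, n_mels=64):
    """Frame-level log-Mel features from a noisy mixture paired with a
    clean reference (illustrative parameters)."""
    def logmel(x):
        m = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=512,
                                           hop_length=128, n_mels=n_mels)
        return np.log(m + 1e-8).T            # shape: (frames, n_mels)
    return logmel(noisy), logmel(clean)

# usage: synthetic 1-second example (white noise added to a tone)
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.random.randn(sr)
x, y = mel_features(noisy, clean, sr)
print(x.shape, y.shape)
```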
TABLE 3: An overview of the metrics employed for evaluating CI methods.
• Minimum mean square error (MMSE): a statistical estimation technique used in speech enhancement to minimize the mean square error between the estimated and true clean speech signals; the MMSE estimator is x̂_MMSE = E[X | Y].
• Short-time objective intelligibility (STOI): a metric used to assess the intelligibility of time-frequency weighted noisy speech. It is based on the idea that human speech perception relies on the availability of important acoustic features in short time frames [40].
• Source-to-distortion ratio (SDR), source-to-artifact ratio (SAR), and source-to-interference ratio (SIR): metrics that objectively assess and compare speech source-separation algorithms based on accuracy and the minimization of distortions and interference. SDR gauges source-separation quality by comparing true source power to introduced distortion, SAR evaluates the separation from artifacts or noise, and SIR measures the ratio of true source power to interference after separation.

FIGURE 5 (caption): FOX processes this information and generates fitting suggestions as its output. When integrated with proprietary outcome and CI fitting software, the shaded boxes represent its functionality, while the unfilled boxes represent its standalone capability [41]. Audiqueen is a dataset with A and E (A&E) phoneme discrimination.

Moreover, the research in [50] focuses on imputing missing audiogram data. The missing data was found to be non-uniform, with inter-octave frequencies being less commonly tested. The multiple imputation by chained equations (MICE) method, which safely imputed up to six missing data points in an 11-frequency audiogram, consistently outperformed other models. This study highlights the importance of imputation techniques in maximizing datasets in hearing healthcare research. Xu et al. [51] explore the objective discrimination of bimodal speech using frequency following responses (FFRs). The study investigates the neural encoding of the fundamental frequency (f0), also called pitch [52], and temporal fine structure cues (TFSC) in simulated bimodal speech conditions. The results show that increasing acoustic bandwidth enhances the neural representation of f0 and TFSC components in the non-implanted ear. Moreover, ML algorithms successfully classify and discriminate FFRs based on spectral differences between vowels. The findings suggest that the enhancement of f0 and TFSC neural encoding with increasing bandwidth is predictive of perceptual bimodal benefit in speech-in-noise tasks. FFRs may serve as a useful tool for objectively assessing individual variability in bimodal hearing. The research conducted by Crowson et al. [53] aimed to predict postoperative CI performance using supervised ML. The authors used neural networks and decision tree (DT)-based ensemble algorithms on a dataset of 1,604 adults who received CIs. They included 282 text and numerical variables related to demographics, audiometric data, and patient-reported outcomes. The results showed that the neural network model achieved a 1-year postoperative performance prediction root mean square error (RMSE) of 0.57 and a classification accuracy of 95.4%. When both text and numerical variables were used, the RMSE was 25.0% and the classification accuracy was 73.3%. The study identified influential variables such as preoperative sentence-test performance, age at surgery, and specific questionnaire responses. The findings suggest that supervised ML can predict CI performance and provide insights into factors affecting outcomes. In the same context of prediction, Mikulskis et al. [54] focus on predicting the attachment of broad-spectrum pathogens to coating materials for biomedical devices, as illustrated in Figure 6. The authors employ ML methods to generate quantitative predictions of pathogen attachment for a large library of polymers. This approach aims to accelerate the discovery of materials that resist bacterial biofilm formation, reducing the rate of infections associated with medical devices. The study highlights the need for new materials that prevent bacterial colonization and biofilm development, particularly in the context of antibiotic resistance. The results demonstrate the potential of ML in designing polymers with low pathogen attachment, offering promising candidate materials for implantable and indwelling medical devices. Similarly, Alohali et al.
[55] focus on using ML algorithms to predict post-operative electrode impedances in CI patients. The study used a dataset of 80 pediatric patients and considered factors such as patient age and intraoperative electrode impedance. The results showed that the best algorithm varied by channel, with Bayesian linear regression and neural networks providing the best results for 75% of the channels. The accuracy level ranged between 83% and 100% in half of the channels one year after surgery. Additionally, the patient's age alone showed good prediction results for 50% of the channels at six months or one year after surgery, suggesting it could be a predictor of electrode impedance. Recently, Zeitler et al. [56] developed supervised ML classifiers to predict acoustic hearing preservation in patients undergoing CI surgery. The classifiers were trained using preoperative clinical data from 175 patients. The analysis revealed associations between various factors and hearing preservation outcomes. The random forest classifier demonstrated the highest mean performance in predicting outcomes.
ML showed potential for predicting residual acoustic hearing preservation and improving clinical decision-making in cochlear implantation.

C. CNN-BASED METHODS
Convolutional neural networks (CNNs) are a class of DL algorithms widely used in computer vision tasks. Their architecture includes convolutional layers that automatically learn hierarchical features from input data. The core convolution operation for 2D data is defined by the equation S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n), where I represents the input 2D data, K is the convolutional kernel, and S is the output feature map. CNNs excel at recognizing spatial patterns, making them essential in image recognition, object detection, and other visual tasks. Additionally, there exist 1D CNNs, which are effective for sequential data analysis, such as in natural language processing or time-series applications. CNNs are widely used for CIs because of the interdisciplinary nature of the field, which involves aspects of neurobiology, signal processing, and medical technology. For example, the work in [57] introduces a novel pathological voice identification system using signal processing and DL. It employs CI models with bandpass and optimized gammatone filters to mimic human cochlear vibration patterns. The system processes speech samples and utilizes a CNN for final pathological voice identification. Results show discrimination of pathological voices with F1 scores of 77.6% (bandpass) and 78.7% (gammatone). The paper addresses voice pathology causes, compares filter models, and proposes a non-invasive, objective assessment system. It contributes to the field with a comprehensive performance analysis, achieving high accuracy and demonstrating effectiveness compared to related works. Additionally, in the scheme proposed by Wang [17], a fully convolutional neural network (FCN) model is evaluated for enhancing speech intelligibility under mismatched training and testing conditions. Using 2,560 Mandarin utterances and 100 noise types, the study compares the FCN with traditional MMSE and deep denoising auto-encoder (DDAE) models. Two sets of experiments are conducted for normal and vocoded speech. The FCN model demonstrates superior performance, maintaining clearer speech structures, especially in the mid-low frequency regions crucial for intelligibility. Objective evaluations using STOI scores and a listening test confirm the FCN's effectiveness under challenging SNR conditions, outperforming MMSE and DDAE. The study suggests the FCN as a promising choice for electric and acoustic stimulation (EAS) speech processors.
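The convolution equation above can be implemented directly; the toy function below computes the "valid" output feature map S for a 2D input I and kernel K exactly as written (a didactic sketch, not an efficient or framework-based implementation).

```python
import numpy as np

def conv2d_valid(I, K):
    """Direct implementation of S(i, j) = sum_m sum_n I(i+m, j+n) K(m, n)
    ('valid' correlation, as written in the equation above)."""
    h, w = K.shape
    H, W = I.shape
    S = np.zeros((H - h + 1, W - w + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i + h, j:j + w] * K)
    return S

# usage: a vertical-edge kernel applied to a toy 2D "spectrogram"
I = np.tile(np.array([0., 0., 1., 1.]), (4, 1))
K = np.array([[1., -1.], [1., -1.]])
print(conv2d_valid(I, K))
```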
Moving on, the research paper in [58] presents a novel approach to optimize stimulus energy for CIs. A CNN was developed as a surrogate model for a biophysical auditory nerve fiber model, significantly reducing simulation time while maintaining high accuracy. The CNN was then used in conjunction with an evolutionary algorithm [59] to optimize the shape of the stimulus waveform, resulting in energy-efficient waveforms and potential improvements in CI technology. Traditional computational models of the cochlea, which represent it as a transmission line, are computationally expensive due to their cascaded architecture and the inclusion of nonlinearities. As a result, they are not suitable for real-time applications such as hearing aids, robotics, and ASR. To address these conditions, the study in [60] presents a hybrid approach, called CoNNear, which combines CNNs, capable of performing end-to-end waveform predictions in real time, with computational neuroscience to create a real-time model of human cochlear mechanics and filter tuning. The CNN filter weights were trained using simulated basilar-membrane (BM) displacements from cochlear channels, and the model's performance was evaluated using basic acoustic stimuli. The CoNNear model is designed to capture the tuning, level-dependence, and longitudinal-coupling characteristics of human cochlear processing. It converts acoustic speech stimuli into BM displacement waveforms across 201 cochlear filters. Its computational efficiency and ability to capture human cochlear characteristics make it suitable for developing human-like machine-hearing applications.
The research paper in [26] explores the utilization of a CNN in simulating speech processing with CIs. The study investigates the effect of channel interaction, a phenomenon that degrades spectral resolution in CI-delivered speech, on learning in neural networks. By modifying speech spectrograms to approximate CI-delivered signals, the CNN is trained to classify them. The findings suggest that, early in training, the presence of channel interaction negatively impacts performance. This indicates that the spectral degradation caused by channel interaction conflicts with perceptual expectations acquired from high-resolution speech. The study highlights the potential for reducing channel interaction to enhance learning and improve speech processing in CI users, particularly those who have adapted to high-resolution speech.
Schuerch et al. [35] focus on the objectification of intracochlear electrocochleography (ECochG) using AlexNet, a CNN architecture, to automate and standardize the assessment and analysis of cochlear microphonic (CM) signals in ECochG recordings for clinical practice and research. The authors compared three different methods for detecting CM signals: correlation analysis, Hotelling's T2 test, and DL.
The DL algorithm performed the best, followed closely by Hotelling's T2 test, while the correlation method slightly underperformed. The automated methods achieved excellent discrimination performance in detecting CM signals, with an accuracy of up to 92%, providing fast, accurate, and examiner-independent evaluation of ECochG measurements.
Moreover, Arias et al. [21] present a methodology for speech processing using CNNs. The study aims to improve the representation learning capabilities of CNNs by combining multiple time-frequency representations of speech signals. The proposed approach involves generating multi-channel spectrograms by combining the continuous wavelet transform, Mel spectrograms, and Gammatone spectrograms. These spectrograms are utilized as input data for the CNN models. The effectiveness of the approach is evaluated in two applications: automatic detection of speech deficits in CI users and phoneme class recognition. The results demonstrate the advantages of using multi-channel spectrograms with CNNs, showcasing improved performance in speech analysis tasks. The convolutional recurrent neural network with gated recurrent units (CGRU) architecture is utilized, as illustrated in Figure 7. The input sequences consist of 3-channel inputs created by combining Mel spectrograms, cochleagrams, and the continuous wavelet transform (CWT) with Morlet wavelets. Convolution is applied solely on the frequency axis in order to preserve the time information. The resulting feature maps are subsequently fed into a 2-stacked bidirectional gated recurrent unit (GRU). A softmax function is employed to predict the phoneme label for each speech segment in the input signal.
The paper [22] introduces a novel method for automatically detecting speech disorders in CI users using a multi-channel CNN. The model processes 2-channel input comprising Mel-scaled and Gammatone filter bank spectrograms derived from speech signals. Testing on 107 CI users and 94 healthy controls demonstrates improved performance with 2-channel spectrograms. The study addresses a gap in the acoustic analysis of CI user speech, proposing a DL approach with potential applications beyond CI users. Experimental results indicate the effectiveness of the proposed CNN-based method, offering promise for speech disorder detection and potential extensions to other pathologies or paralinguistic aspects that employ Mel-frequency cepstral coefficients (MFCCs) and Gammatone frequency cepstral coefficients (GFCCs) as features.
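A minimal sketch of how such multi-channel time-frequency inputs can be assembled is shown below. It stacks a log-Mel spectrogram with a constant-Q representation as a stand-in for the Gammatone/cochleagram and CWT channels used in the cited works; the choice of representations, librosa usage, and all parameters are illustrative assumptions.

```python
import numpy as np
import librosa

def multichannel_spectrogram(y, sr=16000, n_bands=64, hop=128):
    """Stack two time-frequency views of the same signal into a multi-channel
    'image' for a CNN (log-Mel and constant-Q views used as stand-ins)."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                         hop_length=hop, n_mels=n_bands)
    cqt = np.abs(librosa.cqt(y=y, sr=sr, hop_length=hop, n_bins=n_bands))
    frames = min(mel.shape[1], cqt.shape[1])       # align frame counts
    chans = [np.log(mel[:, :frames] + 1e-8), np.log(cqt[:, :frames] + 1e-8)]
    return np.stack(chans, axis=0)                  # (channels, n_bands, frames)

# usage
sr = 16000
y = 0.5 * np.sin(2 * np.pi * 300 * np.arange(sr) / sr)
print(multichannel_spectrogram(y, sr).shape)
```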
For 2D CNNs, the work in [61] introduces image-guided cochlear implant programming (IGCIP), enhancing CI outcomes using image processing. IGCIP segments the intracochlear anatomy in computed tomography (CT) images, aiding electrode localization for programming. The scheme addresses challenges in automating this process due to varied image acquisition protocols. The proposed solution employs a DL-based approach, utilizing CNNs to detect the presence and location of inner ears in head CT volumes. The CNNs are trained on a dedicated dataset and achieve 95.97% classification accuracy. Results indicate potential for automatic labeling of CT images, with a focus on further 3D algorithm development. As noted above, the work in [58] follows a similar machine-learning route for a different goal: its CNN surrogate of a biophysical auditory nerve fiber model, combined with an evolutionary algorithm, optimizes the stimulus waveform and offers an efficient replacement for the original model, allowing larger-scale experiments and potential improvements in CI technology.
The work proposed in [62] introduces the sliding-window-based CNN (SlideCNN), a novel DL approach for auditory spatial scene recognition with limited annotated data. The proposed method converts auditory spatial scenes into spectrogram images and utilizes SlideCNN for image classification. Compared to existing models, SlideCNN achieves a significant improvement in prediction accuracy, with a 12% increase. By leveraging limited annotated samples, SlideCNN demonstrates an 85% accuracy in detecting real-life indoor and outdoor scenes. The results have practical implications for analyzing auditory scenes with limited annotated data, benefiting individuals with hearing aids and CIs.
The paper [63] focuses on advancing laser bone ablation in microsurgery using 4D optical coherence tomography (OCT). The challenge lies in automatic control without external tracking systems. The paper introduces a 2.5D scene flow estimation method using a CNN for OCT images, enhancing laser ablation control. A two-stage approach involves lateral scene flow computation followed by depth flow estimation. Training is semi-supervised, combining ground truth error and reconstruction error. The method achieves an MEE of (4.7 ± 3.5) voxels, enabling markerless tracking for image guidance and automated laser ablation control in minimally invasive cochlear implantation. Recently, Almansi et al. [64] presented a radiological software prototype for detecting and classifying normal and malformed inner ear anatomy using cropping algorithms and a CNN to analyze CT images. The software achieved an average accuracy of 92.25% for cropping inner ear volumes and an AUC of 0.86 for classifying normal and abnormal anatomy. Additionally, Jehn et al. [65] aimed to improve auditory attention decoding (AAD) for CI users using a CNN. EEG data from 25 CI users showed that the CNN decoder achieved a maximum decoding accuracy of 74% for a decision window of 60 seconds. Besides, the work in [66] introduces a method for detecting dysphonic voice using cochleagram images and a pre-trained CNN, achieving 95% accuracy with sentence samples. Furthermore, [67] proposes a method combining high-resolution spiral CT scanning with DL techniques for diagnosing auriculotemporal and ossicle-related diseases. The study utilizes a CNN-UNet model to extract sub-pixel information from medical images of the cochlea. The results demonstrate that this approach improves diagnostic efficiency and enhances understanding of these complex diseases.

D. GAN-BASED METHODS
A generative adversarial network (GAN) is a type of AI model consisting of two neural networks, a generator and a discriminator, engaged in a competitive learning process, as presented in Figure 8.
The generator aims to create realistic data, such as images, while the discriminator tries to differentiate between real and generated samples. This adversarial training dynamic leads to the refinement of the generator's output, generating increasingly authentic data. The objective is for the generator to produce data that is indistinguishable from real samples. The training process is represented by the minimax game framework, with the GAN objective function given by min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))], where E_{x∼p_data(x)} and E_{z∼p_z(z)} indicate the expected values over real data samples x and noise samples z, respectively, G generates samples, D discriminates between real and generated samples, and p_data and p_z are the distributions of real data and noise, respectively. Using a GAN, the research in [78] proposes a DL-based method for reducing metal artifacts in post-operative CT imaging. The method utilizes a 3D-GAN trained on a large number of pre-operative images with simulated metal artifacts. The GAN generates artifact-free images by reducing the metal artifacts. The effectiveness of the method is evaluated quantitatively and qualitatively, showing promising results compared to classical artifact reduction algorithms. The approach overcomes the challenges of post-operative assessment of cochlear implantation caused by metal artifacts, and it does not require registration of pre- and post-operative images. The 3D-GAN improves spatial consistency and is applicable to various types of artifacts. In addition, Wang et al. in their paper [70] propose a 3D metal artifact reduction algorithm for post-operative high-resolution CT imaging. The algorithm is based on a GAN that uses simulated, physically realistic CT metal artifacts created by CI electrodes. The generated images are used to train the network for artifact reduction. The metal artifact reduction GAN-based method, as described in [70], utilizes a three-step process for reducing metal artifacts. Firstly, a simulation is performed to replicate CI positioning. Secondly, a physical simulation of CI metal artifacts is conducted. Lastly, a 3D GAN is trained using both simulated and preoperative datasets. The generator component of the GAN generates an image with reduced metal artifacts, while the discriminator network is responsible for determining whether the input image contains metal artifacts or not. The method was evaluated on clinical CT images of CI post-operative cases and outperformed other general metal artifact reduction approaches. The paper introduces a novel approach that combines the physical simulation of metal artifacts with a 3D-GAN, providing a promising solution for improving the visual assessment of post-operative imaging in CT.
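The minimax objective above translates into alternating discriminator and generator updates. The sketch below shows one such training step with toy fully connected networks and the common non-saturating generator loss; the architectures, optimizer settings, and data are placeholders, not those of the cited 3D-GAN or artifact-reduction models.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_batch):
    b = real_batch.size(0)
    z = torch.randn(b, latent_dim)
    fake = G(z)

    # Discriminator: maximise log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    loss_d = bce(D(real_batch), torch.ones(b, 1)) + bce(D(fake.detach()), torch.zeros(b, 1))
    loss_d.backward()
    opt_d.step()

    # Generator: non-saturating form, maximise log D(G(z))
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(b, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

print(train_step(torch.randn(8, data_dim)))   # one step on random "real" data
```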
Similarly, also for CI metal artifact reduction, a conditional generative adversarial network (cGAN) was proposed by Wang et al. [71]. The approach involves training a cGAN to learn a mapping from artifact-affected CTs to artifact-free CTs. During inference, the cGAN generates CT images with the artifacts removed. Additionally, a band-wise normalization method was proposed as a preprocessing step to improve the performance of the cGAN. The method was evaluated on post-implantation CTs of recipients, and the quality of the artifact-corrected images was quantitatively assessed using P2PE. The results demonstrate promising artifact reduction, outperforming previously proposed techniques. The authors also evaluate the quality of the artifact-corrected images using a quantitative metric based on segmentations of intracochlear anatomical structures. Specifically, the segmentation results obtained from a previously published method were compared between real pre-implantation CTs and artifact-corrected CTs generated by the proposed method, with the ASE used as the metric to assess the accuracy of the segmentation.
E. RNN-BASED METHODS
Recurrent neural networks (RNNs) process sequential data by maintaining a hidden state that is updated at each time step from the current input and the previous state, for example h_t = σ(W h_{t−1} + U x_t + b), where σ is an activation function, typically the hyperbolic tangent or rectified linear unit (ReLU). Several speech processing techniques that target CIs are based on RNNs. CI users struggle with music perception, and many studies have shown that enhancing music vocals improves their enjoyment. The study described by Gajęcki et al. [80] explores source-separation algorithms to remix pop songs by emphasizing the lead singing voice. Deep convolutional autoencoders (DCAE), deep recurrent neural networks (DRNN), multilayer perceptrons (MLP), and non-negative matrix factorization (NMF) were evaluated through perceptual experiments involving CI recipients and normal-hearing subjects. The results show that the MLP and DRNN perform well, providing minimal distortions and artifacts that are not perceived by CI users. The paper also highlights the benefits of implementing an MLP for real-time audio source separation to enhance music for CI users, owing to its reduced computation time. In addition, the study described in [27] proposes a speech separation framework for CI users using TasNet and RNN-EVD. TasNet, a non-causal multiple-input multiple-output (MIMO)-based method, is employed as the speech separation module. RNN-EVD, which combines RNNs with EVD, is utilized to preserve spatial cues. The framework aims to effectively separate speech and reduce interaural level difference (ILD) errors. The RNN-EVD network is trained using ∆ILD as the objective, and an additional SNR term is added to the loss function for convergence. The experimental results demonstrate the effectiveness of the proposed framework in preserving ILD cues for CI users in various hearing scenarios. Borjigin et al. [38] explore the use of DNN algorithms, specifically an RNN and SepFormer, a Transformer-based algorithm, in speech separation applications to improve speech intelligibility for CI users under multi-talker interference. The algorithms were trained with a customized dataset and tested with thirteen CI listeners. Both the RNN and SepFormer significantly improved speech intelligibility in noise without compromising speech quality, indicating the potential of DNN algorithms as a solution to multi-talker noise interference.
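Separation quality in such studies is typically scored with SDR-style measures (Table 3). As a simple, self-contained example, the scale-invariant SNR below compares a separated estimate against its reference source; it is used here generically and is not claimed to be the exact loss or metric of the cited works.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (dB) between a separated
    estimate and its reference source."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    target = (np.dot(estimate, reference) / (np.dot(reference, reference) + eps)) * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

# usage: a slightly noisy estimate of a 220 Hz tone
t = np.linspace(0, 1, 16000)
ref = np.sin(2 * np.pi * 220 * t)
est = ref + 0.05 * np.random.randn(t.size)
print(f"SI-SNR: {si_snr(est, ref):.1f} dB")
```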
The long short-term memory (LSTM) network, an enhanced version of the RNN, addresses limitations observed in RNNs under specific conditions [5], [81]. Unlike the RNN, the LSTM excels at preserving past information, making it suitable for tasks with long-term dependencies. Comprising LSTM units forming layers, each unit regulates information flow through input, output, and forget gates, allowing for prolonged retention of crucial information. The forward pass equations (4) illustrate this process [5]:

A_f = σ(W_f L_i(t) + U_f L_j(t−1) + b_f),
A_i = σ(W_i L_i(t) + U_i L_j(t−1) + b_i),
A_j = σ(W_j L_i(t) + U_j L_j(t−1) + b_j),
V_c(t) = A_f ⊙ V_c(t−1) + A_i ⊙ tanh(W_c L_i(t) + U_c L_j(t−1) + b_c),
L_j(t) = A_j ⊙ tanh(V_c(t)).      (4)

The symbols L_i and L_j denote the input and output, while A_f, A_i, and A_j represent the activation vectors for the forget, input, and output gates. V_c is the cell state vector, σ denotes the sigmoid activation function, and ⊙ denotes element-wise multiplication. This LSTM structure, with weight matrices W and U and bias vector b, is outlined in [82].
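The gate equations (4) map directly onto code. The sketch below performs one LSTM forward step using the section's symbols (A_f, A_i, A_j, V_c, L_i, L_j); the weight shapes and random initialization are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(L_i, L_j_prev, V_c_prev, W, U, b):
    """One LSTM forward step written with the section's symbols:
    A_f, A_i, A_j are the forget/input/output gate activations,
    V_c is the cell state, L_i the input and L_j the output."""
    A_f = sigmoid(W["f"] @ L_i + U["f"] @ L_j_prev + b["f"])
    A_i = sigmoid(W["i"] @ L_i + U["i"] @ L_j_prev + b["i"])
    A_j = sigmoid(W["j"] @ L_i + U["j"] @ L_j_prev + b["j"])
    V_c = A_f * V_c_prev + A_i * np.tanh(W["c"] @ L_i + U["c"] @ L_j_prev + b["c"])
    L_j = A_j * np.tanh(V_c)
    return L_j, V_c

# usage with random weights: input size 4, hidden size 3
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
W = {k: rng.standard_normal((n_h, n_in)) for k in "fijc"}
U = {k: rng.standard_normal((n_h, n_h)) for k in "fijc"}
b = {k: np.zeros(n_h) for k in "fijc"}
L_j, V_c = lstm_step(rng.standard_normal(n_in), np.zeros(n_h), np.zeros(n_h), W, U, b)
print(L_j)
```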
Recently, several schemes for CIs utilizing the LSTM have been proposed in the literature. The study described by Lu et al. in 2020 [72] introduces a speech training system designed for individuals with hearing impairments, such as those with CIs, as well as individuals with dysphonia, utilizing automated lip-reading recognition. The system combines a CNN and an RNN to compare mouth shapes and train speech skills. It includes a speech training database, automatic lip-reading using a hybrid neural network, matching lip shapes with sign-language vocabulary, and drawing comparison data. The system enables hearing-impaired individuals to analyze and improve their vocal lip shapes independently. It also supports the use of medical devices for correct pronunciation. Experimental results demonstrate the system's effectiveness in correcting lip shape and enhancing speech ability. The proposed model utilizes ResNet50, MobileNet, and LSTM networks for accurate lip-reading recognition. Later on, the paper published by Chu et al. in 2021 [83] proposes a causal DL framework for classifying phonemes in CIs to enhance speech intelligibility. The authors trained LSTM networks using features extracted at the time-frequency resolution of a CI processor. They compared CI-inspired features (log STFT power spectrum, log ACE power spectrum, and log-Mel filterbank) with traditional ASR features. The results showed that CI-inspired features outperformed traditional features, providing slightly higher levels of performance. The authors claim that this study is the first to introduce a classification framework with the potential to categorize phonetic units in real time in a CI, offering possibilities for improving speech recognition in reverberant environments for CI users. Similarly, the research presented by Jeyalakshmi et al.
[73] focuses on predicting CI scores for children aged 5 to 10 using a reconfigured LSTM network, as illustrated in Figure 9. The proposed architecture aims to enhance language development skills in children with auditory deprivation; this could be achieved by guiding CI programming through the analysis of cross-modal data obtained from previously programmed patients. The research utilizes visual cross-modal plasticity and visual evoked potentials to discover patterns in the data that can predict outcomes for future patients. The proposed methodology involves the use of an LSTM network and ESCSO to identify optimal weights. The results demonstrate the superiority of the ESCSO-based LSTM technique over other methods. In Figure 9, "oz," "cz," "t5," and "t6" refer to specific electrode placements or positions on the scalp in the international 10-20 system for electroencephalography (EEG) or event-related potential (ERP) recordings. These positions represent specific areas on the scalp where electrodes are attached to measure electrical activity in the brain. The amplitude represents the intensity or strength of the electrical signal detected at a particular point on the scalp, reflecting the neural activity in the corresponding brain region. The parameters "N75," "P100," and "N145" refer to specific components or peaks of ERPs obtained from EEG recordings. ERPs are electrical responses recorded from the brain in response to specific stimuli or events, and they reflect the neural processing associated with those stimuli. Besides, I/P represents inputs, O/P outputs, and W weights. Recently, [77] proposed a neural network model based on a Bi-LSTM architecture for classifying hearing loss types using tonal audiometry data. The model achieves 99.33% classification accuracy on external datasets. The system can assist general practitioners in independently classifying audiometry results, reducing the burden on audiologists and improving diagnostic accuracy. The study, which may help assist CI patients, aims to surpass the current SOTA accuracy rate of 95.5% achieved through DTs.

F. AE-BASED METHODS
An autoencoder (AE) is a type of neural network designed for unsupervised learning, tasked with encoding input data into a compressed representation and decoding it back to the original form. Examples include variational autoencoders (VAEs), which balance data compression with generative modeling, convolutional autoencoders (CAEs), which employ convolutional layers for efficient feature learning and reconstruction, and sparse AEs, which induce sparsity, promoting selectivity in the feature representation, among others. The encoding equation typically involves a mapping function, such as h = f(x), where h is the encoded representation and x is the input. The decoding equation is the reconstruction of the input, often expressed as r = g(h), where r is the reconstructed output and g is the decoding function. AEs find applications in data compression, denoising, feature learning, and more. Recently, many research papers on CIs based on AEs have been proposed.
As a case in point, the paper [18] addresses the pivotal objective of enhancing speech perception for CI users in noisy conditions, recognizing the critical role of noise reduction (NR) in this pursuit. The proposed method, named DDAE-NR, has been proven effective in restoring clean speech. The study focuses on evaluating the DDAE-based NR using envelope-based vocoded speech, mimicking CI devices. The procedure of DDAE-based NR can be split into two main stages: training and testing. During the training phase, a collection of pairs of noisy and clean speech signals is prepared. These signals are initially transformed into the frequency domain using a fast Fourier transform (FFT). The logarithmic amplitudes of the noisy and clean speech spectra are then used as inputs and outputs, respectively, for the DDAE model.
Key findings underscore the superior intelligibility of DDAE-based NR in vocoded speech compared to SOTA conventional methods, indicating its potential implementation in CI speech processors. However, the study acknowledges the use of noise-vocoded speech simulation for evaluation and emphasizes the need for further validation with real CI recipients in clinical settings, addressing potential inconsistencies in the transition to actual CI devices.
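A minimal sketch of this training-stage pairing is given below: log-magnitude spectra of a noisy/clean pair are used as input and target of a small fully connected autoencoder (the encoder/decoder mapping h = f(x), r = g(h) from the section introduction). The framing parameters, network size, and optimizer are illustrative and do not reproduce the DDAE-NR architecture of [18].

```python
import numpy as np
import torch
import torch.nn as nn

def log_spectrum(x, n_fft=256, hop=128):
    """Frame-wise log-magnitude spectra used as DDAE-style inputs/targets."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1))
    return torch.tensor(np.log(spec + 1e-8), dtype=torch.float32)

# toy noisy/clean pair standing in for a training corpus
sr = 8000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 300 * t)
noisy = clean + 0.2 * np.random.randn(sr)
X, Y = log_spectrum(noisy), log_spectrum(clean)

dim = X.shape[1]
ddae = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),      # encoder h = f(x)
                     nn.Linear(64, dim))                  # decoder r = g(h)
opt = torch.optim.Adam(ddae.parameters(), lr=1e-3)
for _ in range(5):                                        # a few training steps
    opt.zero_grad()
    loss = nn.functional.mse_loss(ddae(X), Y)
    loss.backward()
    opt.step()
print(float(loss))
```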
A zero-delay deep autoencoder (DAE) is proposed in [23] for compressing and transmitting the electrical stimulation patterns generated by CIs. The goal is to conserve battery power in wireless transmission while maintaining low latency, which is crucial for speech perception in CI users. The DAE architecture is optimized using Bayesian optimization and the STOI. The results show that the proposed DAE achieves equal or superior speech understanding compared to audio codecs, with reference vocoder STOI scores at 13.5 kbit/s. This approach offers a promising solution for efficient, real-time compression of CI stimulation patterns, addressing the constraints of low latency and battery power consumption. Moreover, the research in [68] focuses on achieving accurate segmentation of the vestibule in CT images, a crucial step for the clinical diagnosis of congenital ear malformations and for CIs. The challenges addressed include the small size and irregular shape of the vestibule, which make segmentation difficult, and the limited availability of labelled samples due to high labour costs. To overcome these challenges, the proposed method introduces a vestibule segmentation network within a basic encoder-decoder framework. Key innovations include the incorporation of a residual channel attention (Res-CA) block for channel attention, a global context-aware pyramid feature extraction (GCPFE) module for global context information, an active contour with elastic loss (ACE-Loss) function for detailed boundary learning, and a deep supervision (DS) mechanism to enhance network robustness. The network architecture utilizes ResNet34 as the backbone with skip connections for multi-level feature fusion. Results showcase high performance and are supported by comprehensive comparisons, ablation studies, and visualized segmentation outcomes. The study also acknowledges limitations, such as reliance on professional annotations.
In addition, the study presented in [85] aims to enhance the accuracy and robustness of intracochlear anatomy (ICA) segmentation, a vital component in preoperative decisions, insertion planning, and postoperative adjustments for CI procedures. The ICA includes structures such as the scala tympani (ST), the scala vestibuli (SV), and the active region (AR). The researchers employed two segmentation methods, an active shape model (ASM) and DL based on a 3D U-Net AE, and combined them to achieve improved accuracy and robustness. A two-level training strategy involved pretraining on clinical CTs using the ASM and fine-tuning on specimens' CTs with ground truth. Results demonstrated that the DL methods outperformed the ASM in accuracy. While a trade-off between accuracy and robustness was observed, the combined DL and ASM approach showed improvements in both aspects. The study concludes that the proposed DL and ASM method effectively balances accuracy and robustness for ICA segmentation, highlighting the potential of DL-based methods, especially when integrated with an ASM, to enhance CI procedures.
The min-max similarity (MMS) methodology proposed in [69] represents a groundbreaking approach to semi-supervised segmentation networks, particularly in the context of medical applications such as endoscopic surgical tool segmentation and CI surgery. MMS is introduced through dual-view training with contrastive learning, utilizing classifiers and projectors to create positive and negative pairs. The inclusion of a pixel-wise contrastive loss ensures the consistency of unlabeled predictions. In the evaluation phase, MMS was tested on four public endoscopic surgical tool segmentation datasets and a manually annotated CI surgery dataset. The results demonstrate its superiority over SOTA semi-supervised and fully supervised segmentation algorithms, both quantitatively and qualitatively. Notably, MMS successfully recognized unknown surgical tools, provided reliable predictions, and achieved real-time video segmentation with an impressive inference speed of about 40 frames per second. This signifies the potential of MMS as a highly effective and efficient tool in medical image segmentation, showcasing its applicability in real-world surgical scenarios.
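The consistency idea behind a pixel-wise contrastive loss can be sketched as follows; this is a generic InfoNCE-style formulation, not the exact MMS objective, and in practice pixel embeddings would be subsampled rather than compared exhaustively.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(z1, z2, temperature=0.1):
    """Pixel-wise contrastive loss between two views of the same unlabeled image.
    z1, z2: (N, C, H, W) projector outputs; corresponding pixels form positive
    pairs, all other pixels in the batch act as negatives (InfoNCE style)."""
    n, c, h, w = z1.shape
    a = F.normalize(z1.permute(0, 2, 3, 1).reshape(-1, c), dim=1)   # (N*H*W, C)
    b = F.normalize(z2.permute(0, 2, 3, 1).reshape(-1, c), dim=1)
    logits = a @ b.t() / temperature          # similarity of every pixel pair
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)   # positive pairs lie on the diagonal
```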
Similarly, the primary aim of the study in [86] is to devise an automated method for the segmentation and measurement of the human cochlea in ultra-high-resolution (UHR) CT images. The objective is to explore variations in cochlear size to enhance outcomes in cochlear surgery through personalized implant planning. Initially, the input scans undergo a two-step process using a detection module and a pixel-wise classification module for cochlea localization and segmentation, respectively, using an AE as illustrated in Figure 10. The detection module reduces the search area for the classification module, improving algorithm speed and reducing false positives. Both modules are trained on image patches, allowing for a larger training set by generating multiple examples from each scan. The segmented cochlear structure then proceeds to a final module that combines DL and thinning algorithms to extract patient-specific anatomical measurements. DL is employed in each step to leverage its ability to learn directly from input data, providing automatic results without the need for user-adjustable parameters during testing.
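A schematic of this detect-then-classify flow is given below; `detector` and `classifier` are hypothetical trained models, and the patch size and interfaces are placeholders for illustration only.

```python
import numpy as np

def segment_cochlea(scan, detector, classifier, patch=32):
    """Two-stage pipeline: a detection module proposes a coarse bounding box,
    then a pixel-wise classifier labels only patches inside that box.
    `detector` and `classifier` are placeholder trained models."""
    (z0, z1), (y0, y1), (x0, x1) = detector.predict_roi(scan)   # coarse ROI
    roi = scan[z0:z1, y0:y1, x0:x1]
    mask = np.zeros_like(roi, dtype=np.uint8)
    for z in range(0, roi.shape[0] - patch + 1, patch):
        for y in range(0, roi.shape[1] - patch + 1, patch):
            for x in range(0, roi.shape[2] - patch + 1, patch):
                p = roi[z:z+patch, y:y+patch, x:x+patch]
                mask[z:z+patch, y:y+patch, x:x+patch] = classifier.predict(p)
    full = np.zeros_like(scan, dtype=np.uint8)
    full[z0:z1, y0:y1, x0:x1] = mask          # place the ROI mask back into the scan
    return full
```

Restricting the classifier to the detected ROI is what yields the reported gains in speed and false-positive reduction.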

G. RL-BASED METHODS
Deep reinforcement learning (DRL) is a subfield of ML that enables agents to learn and make decisions in complex environments. It involves training an agent to interact with an environment, learn from the outcomes of its actions, and optimize its behavior over time [87]. In traditional RL, agents learn by trial and error, receiving feedback in the form of rewards or penalties for their actions. DRL additionally incorporates DNNs, making it capable of learning complex patterns and representations from raw data. This allows DRL agents to handle high-dimensional input spaces, such as images or sensor data, and make more sophisticated decisions. Figure 11 illustrates the principle of RL.
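The trial-and-error principle can be illustrated with tabular Q-learning, the simplest form of RL; in DRL the Q-table below would be replaced by a DNN. The environment interface (`reset`, `step`, `sample_action`) is a hypothetical minimal one, not any specific library API.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: the agent acts, observes reward and next state,
    and updates its action-value estimates; DRL replaces Q with a neural net."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration vs. exploitation
            a = env.sample_action() if np.random.rand() < epsilon else int(Q[s].argmax())
            s_next, r, done = env.step(a)     # environment feedback (reward, next state)
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() * (not done) - Q[s, a])
            s = s_next
    return Q
```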
Radutoiu et al. [34] present a novel method for accurately localizing regions of interest (ROIs) in the inner ear using DRL. The proposed method addresses the challenge of robust ROI extraction in full head CT scans, which is crucial for CI surgery. The approach utilizes communicative multi-agent RL and landmarks specifically designed to extract orientation parameters. The method achieves an average estimated error of 1.07 mm for landmark localization. The extracted ROIs demonstrate an IoU of 0.84 and a Dice similarity coefficient of 0.91 over 140 full head CT scans, showing promising results for automatic ROI extraction in medical imaging. In addition, Lopez et al. present in [88] a pipeline for characterizing the facial and cochlear nerves in CT scans using DRL. Key landmarks around these nerves are located using a communicative multi-agent DRL model. The pipeline includes automated measurement of the cochlear nerve canal diameter, extraction and segmentation of the cochlear nerve cross-section, and path selection for facial nerve characterization. The pipeline was developed and evaluated using 119 clinical CT images.
The results show accurate characterizations of the nerves in the cochlear region, providing reliable measurements for computer-aided diagnosis and surgery planning. The proposed approach demonstrates the potential of DRL for landmark detection in challenging medical imaging tasks.

IV. APPLICATIONS OF DL-BASED MEDICAL CI
This section explores the application of deep learning in the field of cochlear implants, encompassing tasks such as speech denoising and enhancement, segmentation for precise identification and analysis of cochlear structures, thresholding, imaging, localization of the CI, and more. Figure 12 provides a comprehensive overview of AI-based applications for CIs and their associated benefits. Furthermore, Table 5 summarizes various applications based on AI techniques, highlighting their performance, pros, and cons.

A. SPEECH DENOISING AND ENHANCEMENT
The integration of ML and DL has proven invaluable in the field of CIs. Researchers have harnessed these technologies to tackle numerous challenges and enhance speech perception for individuals with hearing impairments. The works in [19], [89] employed the DDAE approach to reduce unwanted background noise in speech signals. In particular, Lai et al. [89] devised an NR system that combines a noise classifier with a DDAE, specifically tailored for Mandarin-speaking CI recipients. The proposed schemes in [24], [28], [90] aim to perform end-to-end speech denoising, with the goal of enhancing speech intelligibility in noisy environments. Gajecki et al. [24], [28] employed DNNs to develop the Deep ACE method, while Healy et al. [90] pursued a related end-to-end denoising approach. Together, these works illustrate the broad spectrum of applications of AI, particularly DL techniques, in addressing challenges related to noise reduction and enhancing speech intelligibility in CI applications.
Moreover, Kang et al. [30] used DL-based speech enhancement algorithms to optimize speech perception for CI recipients. Their approach achieved a balance between noise suppression and speech distortion by experimenting with different loss functions. Hu et al. [91] developed environment-specific noise suppression algorithms for CIs using ML techniques. They improved the processed sound by classifying and selecting envelope amplitudes based on the SNR in each channel. Banerjee et al. [92] employed online unsupervised algorithms to learn features from the speech of individuals with severe-to-profound hearing loss, aiming to enhance the audibility of speech through modified signal processing. Li et al. [93] developed an improved NR system for CIs using DL, specifically a DDAE, combined with knowledge transfer technology. Their goal was to enhance speech intelligibility in noisy conditions. Fischer et al. [31] utilized DL-based virtual sensing of head-mounted microphones to improve speech signals in cocktail party scenarios for individuals with hearing loss, resulting in enhanced speech quality and intelligibility, particularly in noisy environments. These studies exemplify the versatility of AI and DL in addressing various challenges associated with CIs, including NR, speech enhancement, and improved speech perception. Furthermore, Chu et al. [25] explore the application of ML algorithms to mitigate the effects of reverberation and noise in CIs, with the aim of improving speech intelligibility for individuals with severe hearing loss.
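The role of the loss function in balancing noise suppression against speech distortion, as explored in these works, can be sketched as follows; the network is left abstract, and the SI-SNR formulation shown is one common choice rather than the loss used in any specific study above.

```python
import torch
import torch.nn as nn

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR: rewards noise suppression while penalizing
    distortion of the target speech. est, ref: (batch, samples)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return -10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps).mean()

def train_step(model, noisy, clean, optimizer, loss_fn=si_snr_loss):
    """One optimization step; loss_fn can be swapped (e.g., nn.MSELoss(), SI-SNR)."""
    optimizer.zero_grad()
    loss = loss_fn(model(noisy), clean)
    loss.backward()
    optimizer.step()
    return loss.item()
```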

B. IMAGING
DL methods have revolutionized CI applications by leveraging imaging data for enhanced analysis and optimization. Hussain et al. [98] employed image analysis tools, such as the Oticon Medical Nautilus software, to automatically detect landmarks and extract clinically relevant parameters from cochlear CT images. This approach provides valuable insights into cochlear morphology, facilitating the development of less traumatic electrode arrays for cochlear implantation. Zhang et al. [61] focused on automatically detecting the presence and location of inner ears in head CT images, aiming to assist in image-guided CI programming for patients with profound hearing loss. Regodic et al. [99] introduced an algorithm that utilizes a CNN for automatic fiducial marker detection and localization in CT images, enhancing registration accuracy, reducing human errors, and shortening intervention time in computer-assisted surgeries. Margeta et al. [94] presented Nautilus, a web-based research platform that employs AI and image processing techniques for automated cochlear image analysis. This platform enables accurate delineation of cochlear structures, detection of electrode locations, and personalized pre- and post-operative metrics, facilitating clinical exploration in cochlear implantation studies. Reported limitations include a small dataset, potential variability in image quality, a lack of external validation, and limited assessment of clinical utility and computational requirements. Li et al. [100] proposed the integration of DL techniques into a clinical µCT system to optimize imaging performance, improve reconstruction accuracy, and enhance diagnostic capabilities in temporal bone imaging and other clinical applications. Wang et al. [78] addressed the reduction of metal artifacts in post-operative CI CT imaging using a 3D GAN, enabling better analysis of electrode positions and assessment of CI insertion. These advancements highlight the significant role of DL, ML, and AI in leveraging imaging data for improved CI analysis, design, and surgical procedures.

TABLE 5 (excerpt): performance and limitations of selected DL applications for CIs (code availability | reference | year | model).
- No | [96] | 2023 | UNETR: explores the feasibility of a DL method based on the UNETR model for automatic segmentation of the cochlea in temporal bone CT images. Performance: DSC = 0.92. Limitations: small dataset, variability in image quality, and no specification of computational requirements.
- No | [97] | 2019 | 3D U-Net: a two-level training approach using a DL method to accurately segment the intra-cochlear anatomy in head CT scans, combining an active shape model-based method with a 3D U-Net model. Performance: DSC = 0.87. Limitations: limited dataset, variable image quality, lack of external validation, limited assessment of clinical utility, and no specification of computational requirements.
- No | [36] | 2023 | AlexNet: assesses repeatability, thresholds, and tonotopic patterns using a DL-based algorithm, providing insights into inner ear function and potential clinical applications. Performance: Acc = 83.8%. Limitations: potential dependence on the quality of the input data, limited generalizability to different patient populations or implant systems, and the need for further external validation and comparison with expert visual inspection.
- No | [65] | 2024 | CNN: explores the use of CNNs to improve the decoding of selective attention to speech in CI users, aiming to enhance their listening experience in challenging environments. Performance: Acc = 74%. Limitations: small sample size of 25 CI users, limiting the generalizability of the findings, and implant-induced electrical artifacts in the EEG recordings, potentially affecting decoding accuracy.
In addition to the previous advancements, DL and AI have been applied to various other aspects of CI applications using imaging data. Chen et al. [68] utilize AI for accurate vestibule segmentation in CT images, which plays a crucial role in the clinical diagnosis of congenital ear malformations and in CI procedures. Kugler et al. [101] employ AI techniques to accurately estimate instrument pose from X-ray images in temporal bone surgery, enabling high-precision navigation and facilitating minimally invasive procedures. Waldeck et al. [102] develop an ultra-fast algorithm that utilizes automated cochlear image registration to detect misalignment in CIs, significantly reducing the time required for diagnosis compared to traditional multiplanar reconstruction analysis. Finally, Chen et al. [103] focus on creating a three-dimensional finite element model of the brain based on magnetic resonance imaging (MRI) data to analyze and optimize the current flow path induced by CIs. This application of AI contributes to the improvement of future implant designs. These innovative approaches demonstrate the diverse applications of DL, ML, and AI in CI research, ranging from scene understanding to precise segmentation, instrument pose estimation, misalignment detection, and implant design optimization.

C. SEGMENTATION
DL, ML, and AI have revolutionized CI segmentation, enabling precise identification and analysis of cochlear structures in various imaging modalities. Li et al. [96] applied a UNETR model to automatically segment cochlear structures in temporal bone CT images, enhancing surgical planning and cochlear implantation outcomes. Reda et al. [104] developed an automatic segmentation method for intra-cochlear anatomy in post-implantation CT scans, facilitating the customization of sound processing strategies for individual CI recipients. Moudgalya et al. [105] employed a modified V-Net CNN to segment cochlear compartments in µCT images, enabling precise quantification of local drug delivery for potential treatment of sensorineural hearing loss. Wang et al. [106] focused on metal artifact reduction and intra-cochlear anatomy segmentation in CT images using a multi-resolution multi-task deep network, benefiting CI recipients. Heutink et al. [86] developed a DL framework for the automatic segmentation and analysis of cochlear structures in ultra-high-resolution CT images, providing accurate measurements for personalized implant planning in cochlear surgery. Zhang et al. [97] utilized a 3D U-Net DL method to achieve accurate segmentation of intra-cochlear anatomy in head CT images, facilitating optimal programming of CIs and improving hearing outcomes. These studies highlight the significant impact of DL, ML, and AI in advancing CI segmentation, ultimately leading to improved patient care and treatment outcomes. Recently, Zhu et al. [107] proposed an uncertainty-aware dual-stream network, called UADSN, for facial nerve segmentation in CT scans for cochlear implantation surgery. UADSN combines 2D and 3D segmentation streams and uses a consistency loss to improve accuracy in uncertain regions. The network achieves superior performance compared to other methods on a facial nerve dataset, with an emphasis on topology preservation.
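Many of these segmentation networks are trained with overlap-based objectives; a minimal soft Dice loss, given here as a generic example rather than the loss of any particular cited method, looks as follows.

```python
import torch

def soft_dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - 2|A∩B| / (|A| + |B|), computed on predicted
    probabilities so it is differentiable; pred and target have shape (N, C, ...)."""
    pred = pred.flatten(1)
    target = target.flatten(1)
    inter = (pred * target).sum(dim=1)
    dice = (2 * inter + smooth) / (pred.sum(dim=1) + target.sum(dim=1) + smooth)
    return 1 - dice.mean()
```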

D. THRESHOLDING
DL, ML, and AI have been instrumental in the field of CIs, particularly in thresholding applications. Kuczapski et al. [108] developed a software tool that utilizes AI to estimate and monitor effective stimulation threshold (EST) levels in CI recipients. By leveraging patient data, audiograms, and fitting settings, this tool aids in the fitting process and predicts changes in hearing levels, enhancing personalized care. Botros et al. [109] introduced AutoNRT, an automated system that combines ML and pattern recognition to measure electrically evoked compound action potential (ECAP) thresholds with the Nucleus Freedom CI. This objective fitting system streamlines clinical procedures and ensures precise and efficient threshold measurements. Furthermore, Schuerch et al. [36] utilized a DL-based algorithm to objectively evaluate and analyze ECochG signals. This algorithm enables the assessment of ECochG measurement repeatability, comparison with audiometric thresholds, and identification of signal patterns and tonotopic behavior in CI recipients. Through the integration of DL, ML, and AI, these studies have significantly advanced thresholding techniques in CI applications, leading to improved fitting accuracy, streamlined procedures, and objective evaluation of signal responses.

E. LOCALIZATION OF CI
DL methods have been instrumental in CI localization applications, providing accurate and automated solutions. Chi et al. [95] proposed a DL-based method for precise localization of electrode contacts in CT images. Their approach utilized cGANs to generate likelihood maps, which were then processed to estimate the exact location of each contact. Radutoiu et al. [34] focused on the automatic extraction of ROIs in full head CT scans of the inner ear. By leveraging AI, they achieved high precision in ROI localization, facilitating accurate surgical planning for insertion. Noble et al. [110] and Zhao et al. [111], [112] developed AI-based systems to automatically identify and position electrode arrays in CT images. These technologies enable large-scale analyses of the relationship between electrode placement and hearing outcomes, leading to potential advancements in implant design and surgical techniques. Heutink et al. [86] employed DL for the automatic segmentation and localization of the cochlea in ultra-high-resolution CT images. This approach allows for precise measurements that can be used in personalized planning, reducing the risk of intra-cochlear trauma and optimizing surgical outcomes. These studies showcase the significant contributions of DL and AI in localization applications, enabling accurate and efficient identification, positioning, and analysis of electrode arrays and facilitating improved surgical planning and outcomes. Burkart et al. [113] investigate the influence of sound source position and electrode placement on the stimulation patterns of CIs under noise conditions. The study utilizes a measurement setup that simulates realistic listening scenarios. The results reveal that the effectiveness of CI noise reduction systems is influenced by these factors, and that AI fitting algorithms should be considered to optimize CI performance.
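As a simplified illustration of how a likelihood map (such as the cGAN outputs in [95]) could be converted into discrete contact coordinates, the following peak-picking sketch selects the strongest local maxima; the number of contacts and the minimum peak distance are illustrative assumptions, not values from the cited work.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def electrode_peaks(likelihood, n_contacts=22, min_distance=3):
    """Pick the n_contacts strongest local maxima of a 3D likelihood map as
    estimated electrode-contact locations."""
    local_max = (likelihood == maximum_filter(likelihood, size=2 * min_distance + 1))
    coords = np.argwhere(local_max)
    scores = likelihood[local_max]
    order = np.argsort(scores)[::-1][:n_contacts]   # strongest peaks first
    return coords[order]
```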

F. OTHER
DL techniques have been employed in various other CI applications, showcasing their potential to enhance hearing outcomes and improve device performance. Bermejo et al. [114] introduced a decision support system using a novel probabilistic graphical model to optimize CI parameters based on audiological tests and the current device status, aiming to optimize the user's hearing ability. Castaneda et al. [115] focused on the use of blind source separation (BSS) and independent component analysis (ICA) to identify auditory evoked potentials (AEPs) and isolate artifacts in children with CIs, enabling improved assessment of auditory function. Incerti et al. [116] investigated the impact of varying cross-over frequency settings for EAS on binaural speech perception, localization, and functional performance in adults with CIs and residual hearing, providing valuable insights for personalized device programming. Katthi et al. [117] developed a DL framework based on canonical correlation analysis (CCA) to decode the auditory brain, establishing a strong correlation between the audio input and brain activity measured through EEG recordings. This research has implications for decoding human auditory attention and improving CIs by leveraging the power of DL.
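The CCA idea of [117], projecting audio and EEG features into maximally correlated subspaces, can be illustrated with a minimal example using scikit-learn; the feature dimensions, number of components, and random placeholder data are assumptions for illustration only.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical features: rows are time frames, columns are feature dimensions.
audio_feats = np.random.randn(1000, 16)   # e.g., filterbank envelopes
eeg_feats = np.random.randn(1000, 64)     # e.g., band-passed EEG channels

cca = CCA(n_components=4)
cca.fit(audio_feats, eeg_feats)
a_proj, e_proj = cca.transform(audio_feats, eeg_feats)

# The correlation of each canonical component pair indicates how strongly the
# audio stimulus is reflected in the recorded brain activity.
corrs = [np.corrcoef(a_proj[:, k], e_proj[:, k])[0, 1] for k in range(4)]
print(corrs)
```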

V. OPEN ISSUES AND FUTURE DIRECTIONS
While significant strides have been achieved in integrating AI into CIs, numerous research lacunae persist, offering avenues for further advancements in the field. Several potential realms warrant exploration in future studies.
Real-time signal processing and personalized design: Investigating real-time adaptive signal processing methods employing AI algorithms has the potential to enhance sound processing for CI recipients, yielding improved speech intelligibility outcomes. Enhancements in adaptability to dynamic acoustic environments and real-time optimization of stimulation parameters could substantially improve CI performance. The authors have observed a gap in the study and implementation of AI models tailored for CIs on real-time platforms such as field programmable gate arrays (FPGAs). Further research in this burgeoning area holds promise for adapting a variety of existing AI models to real-time operation.
Besides, tailoring CIs to meet the unique needs of individual users poses a significant challenge. Investigating AI-driven methodologies that leverage personal data to personalize device configurations based on factors such as physiological, auditory, and neural feedback during mobility can enhance both individual outcomes and overall satisfaction.
Predicting long-term effects: Gaining insight into the enduring outcomes of CIs is vital for enhancing patient selection, counseling, and device advancement. Utilizing AI methods to sift through extensive datasets can pinpoint the predictive elements influencing sustained success. These factors may encompass pre-implantation attributes, surgical approaches, and auditory rehabilitation. Constructing predictive models using AI algorithms can furnish valuable perspectives on long-term consequences, thereby informing clinical judgments.
Incorporating multiple sensory modalities: CIs traditionally prioritize the reinstatement of auditory experiences. Yet, enriching the perception and comprehension of sound can be achieved by integrating additional sensory dimensions, such as vision and touch, resulting in a multi-modal approach.
Investigating AI-driven techniques that amalgamate inputs from various senses to enhance speech recognition, spatial sound perception, and overall auditory understanding presents a promising direction for future exploration. Besides, CIs in both ears, when paired with AI algorithms, can enhance speech comprehension. By analyzing sound patterns from both implants, AI adjusts settings to optimize signal processing, improving the overall accuracy and clarity of speech perception for users with bilateral implants and enhancing their auditory experience and communication abilities, as investigated in [118]. However, extensive research is still required to tailor such solutions to CI hardware capabilities, taking into account computational cost and AI model complexity.
Empowering AI-based CI using DTL: Deep transfer learning (DTL) is a highly efficient DL technique enabling the transfer of knowledge from pre-trained models, trained on millions of speech corpora and/or images, to train smaller models when data availability is limited [119], [120]. This approach offers significant advantages in producing lightweight AI models suitable for devices with limited computational resources, such as CIs. Only a limited number of studies have explored the impact of DTL on CIs, as demonstrated in [93], and the topic has received relatively little attention from researchers. We anticipate further exploration of this promising technique, particularly through the utilization of various DTL sub-techniques, such as domain adaptation, transductive methods like cross-lingual transfer, cross-corpus transfer, zero-shot learning, and fine-tuning, among others [2].
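A minimal fine-tuning sketch of the DTL idea is shown below, assuming an ImageNet-pretrained ResNet-18 from a recent torchvision release as the source model and a small binary CI-related task as the target; the backbone, frozen-layer split, and head size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Load a pretrained backbone and reuse it as a frozen feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False                  # freeze the transferred knowledge

# Replace the classification head for the (small) CI-specific task,
# e.g., a hypothetical two-class decision.
backbone.fc = nn.Linear(backbone.fc.in_features, 2)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# Training then updates only the lightweight head, which suits the limited
# labelled data and computational budget typical of CI applications.
```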
Ensuring data privacy through FL: Federated learning (FL) facilitates collaborative model training across decentralized devices by aggregating local updates rather than centralizing data. This preserves user privacy and enhances model performance, which is particularly beneficial in healthcare applications [121]. Gathering comprehensive datasets is challenging due to rare anomaly cases and privacy concerns. FL addresses this by training models on distributed, encrypted data from multiple sources, ensuring privacy while maintaining efficacy. Researchers have yet to fully explore FL-based model building for CIs, neglecting the potential to construct efficient AI models capable of accommodating diverse classes. Further investigation into this promising technique is warranted, with potential for significant advancements in model robustness and versatility. Moreover, this approach could lead to the development of a pretrained model built with FL, which could be seamlessly integrated with DTL.
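A single federated-averaging (FedAvg) round can be sketched as follows, assuming each participating site exposes only a data loader and a local training routine; the unweighted average is a simplification of the standard dataset-size-weighted FedAvg update.

```python
import copy
import torch

def federated_average(global_model, client_loaders, local_train_fn):
    """One FedAvg round: each site trains a copy of the global model on its own
    data, and only the resulting weights are averaged on the server."""
    client_states = []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)
        local_train_fn(local, loader)         # runs entirely on the client's data
        client_states.append(local.state_dict())
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        # unweighted mean of client parameters (simplified FedAvg)
        avg[key] = torch.stack([s[key].float() for s in client_states]).mean(dim=0)
    global_model.load_state_dict(avg)
    return global_model
```

Because only weights leave each site, the raw patient recordings or scans never need to be centralized, which is the property that makes FL attractive for CI datasets.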
Transformer-based CI techniques: AI researchers have adopted the CNN-LSTM model, which excels at capturing spatial and temporal features, enhancing performance in sequential data tasks [8], [122]. However, Transformer-based ASR techniques, such as connectionist temporal classification (CTC), bidirectional encoder representations from Transformers (BERT), and others, have been shown in the literature to have the potential to greatly enhance the functioning of ASR [123]-[125]. By leveraging the self-attention mechanism, Transformers can improve speech intelligibility by effectively suppressing background noise and modeling long-range dependencies. They can also aid in acoustic scene analysis, separating and prioritizing important auditory information in complex environments. Transformers can build language models that enhance ASR systems, improving speech comprehension for users. Additionally, Transformers enable personalized sound processing by adapting stimulation patterns and processing parameters based on user-specific preferences. They facilitate multi-modal integration, combining audio and visual inputs to enhance speech recognition and sound localization. Furthermore, Transformers support long-term learning and adaptation, continually optimizing CI performance over time. These advancements offer promising prospects for improving auditory experiences and the overall quality of life for CI users.
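As a sketch of how self-attention could be applied to CI speech enhancement, the following model encodes noisy spectrogram frames with a Transformer encoder and predicts a per-bin mask; the layer sizes and masking formulation are illustrative assumptions, not a published CI architecture.

```python
import torch
import torch.nn as nn

class TransformerMasker(nn.Module):
    """Transformer encoder over spectrogram frames; self-attention captures
    long-range temporal context and outputs a per-bin enhancement mask."""
    def __init__(self, n_bins=257, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.proj = nn.Linear(n_bins, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=512,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Sequential(nn.Linear(d_model, n_bins), nn.Sigmoid())

    def forward(self, noisy_spec):               # (batch, frames, n_bins)
        mask = self.head(self.encoder(self.proj(noisy_spec)))
        return mask * noisy_spec                 # masked (enhanced) spectrogram
```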
Exploring chat-bot-based CI capabilities: Chat-bot techniques offer several opportunities to enhance the functioning of CIs. They can provide real-time support, troubleshooting, and personalized rehabilitation programs for users, empowering them to address common issues and improve their auditory skills. Chat-bots enable remote monitoring, allowing users to share data and receive adjustments to their device settings without in-person appointments. They also offer emotional and psychological support, fostering a sense of community and well-being. Chat-bots contribute to data collection for research and development, aiding in the improvement of CI technology and rehabilitation protocols. Additionally, chat-bots employ ML to continuously learn from user interactions, improving their responses and understanding over time. These techniques have the potential to enhance the overall user experience, outcomes, and accessibility of CI services. For example, in [126], the effectiveness of ChatGPT-4 in providing postoperative care information to CI patients was evaluated. Five common questions were posed to ChatGPT-4, and its responses were analyzed for accuracy, response time, clarity, and relevance. The results showed that ChatGPT-4 provided accurate and timely responses, making it a reliable supplementary resource for patients in need of information.

VI. CONCLUSION
This review has provided a comprehensive overview of advancements in AI algorithms for CI applications and their impact on ASR and speech enhancement. The integration of AI methods has brought cutting-edge strategies to address the limitations and challenges faced by traditional signal processing techniques in the context of CIs. Moreover, the application of AI in CIs has led to the emergence of new datasets and evaluation metrics, offering alternative methods for validating proposed schemes without the need for human surgical intervention and traditional tests. The review highlighted the role of ASR in optimizing speech perception and understanding for CI users, contributing to the improvement of their quality of life. ASR not only enhances basic speech recognition but also aids in the recognition of environmental sounds, enabling a more immersive auditory experience. Furthermore, ASR finds applications in authentication systems, event recognition, source separation, and speaker recognition, extending its reach beyond communication. Various AI algorithms, belonging to ML and DL, have been explored in the context of CIs, demonstrating promising results in speech synthesis and noise reduction. These algorithms have shown the potential to overcome challenges associated with multiple sources of speech, environmental noise, and other complex scenarios. The review has summarized and commented on the best results obtained, providing valuable insights into the capabilities of AI algorithms in this biomedical field. Moving forward, the review suggests future directions to bridge existing research gaps in the domain of AI algorithms for CIs. It emphasizes the need for high-quality data inputs, algorithm transparency, and collaboration between researchers, clinicians, and industry experts. Addressing these aspects will facilitate the development of more accurate and efficient AI algorithms for CIs, ultimately benefiting individuals with hearing impairments. The integration of advanced AI algorithms has the potential to revolutionize the field of CIs, enabling individuals with hearing impairments to better communicate and engage with the world around them. Continued research and development in this area hold great promise for the future of CI technology.

• The implementation of CIs, along with a comprehensive elucidation of the taxonomy encompassing ML- and DL-based CIs, is thoroughly expounded upon. Additionally, recommended frameworks for AI-based CIs are thoroughly discussed and succinctly summarized in tables for enhanced clarity.
• Providing detailed insights into the applications of ML and DL within the domain of CIs, encompassing functions such as denoising and speech enhancement, segmentation, thresholding, imaging, as well as CI localization, along with various other functionalities.
• Delving into the existing gaps in AI-driven CI, of-

FIGURE 2: Bibliometric analysis of the papers included in this review. (a) Distribution of papers over the last years. (b) Percentage breakdown of paper types included in this review.

Evaluation metrics referenced in this review include the following:
- Dice coefficient similarity (DCS): $\mathrm{DCS} = \frac{2|A \cap B|}{|A| + |B|}$. A widely used similarity metric in image segmentation tasks, measuring the overlap between the predicted segmentation mask $A$ and the ground-truth mask $B$; it is used, for example, to evaluate the vestibule segmentation network.
- Average surface distance (ASD): $\mathrm{ASD} = \frac{\sum_{a \in S(A)} \min_{b \in S(B)} \|a - b\| + \sum_{b \in S(B)} \min_{a \in S(A)} \|b - a\|}{|S(A)| + |S(B)|}$. A commonly used evaluation measure in medical image segmentation that quantifies the average distance between the surfaces of two segmented objects, typically the predicted segmentation and the ground truth.
- Average volume difference (AVD): $\mathrm{AVD} = \max\big(d(A, B), d(B, A)\big)$. Quantifies the difference in volume between a predicted segmentation and a reference (ground-truth) segmentation.
- Signal-to-noise ratio (SNR): $\mathrm{SNR} = 10 \log_{10}\!\left(\frac{\mathrm{Signal}}{\mathrm{Noise}}\right)$ dB. A measure of the quality of the speech signal, commonly used to evaluate the quality of the stego-speech (the speech signal after hidden information has been embedded); a lower SNR indicates that the steganography technique has introduced more distortion to the speech signal.
- Average surface error (ASE): $\mathrm{ASE} = \frac{\sum_{i,j} E_{ij}}{N}$, where $E_{ij} = |I_1(i, j) - I_2(k, l)|$ is the point-to-point error (P2PE) between corresponding points on the measured and reference surfaces, $(i, j)$ and $(k, l)$ are pixel positions in the first and second images, and $N$ is the total number of correspondence points; the average error is normalized by the imaging system's dynamic range.
- Interaural level difference (ILD): $\mathrm{ILD} = 20 \log_{10}\!\left(\frac{L}{R}\right)$. A psychoacoustic metric measuring the sound level difference between the left ($L$) and right ($R$) ears, reflecting cues essential for sound localization and spatial awareness in auditory perception.
- Mean endpoint error (MEE): measures the average absolute difference between the true $L_i$ and estimated $\hat{L}_i$ endpoint locations across $N$ utterances.

FIGURE 4: Taxonomy of the employed AI techniques for CI.

FIGURE 6: Diagram illustrating the procedures utilized in the production of microarrays, analysis of pathogens, data modeling, and forecasting the attachment of pathogens to novel polymers [54].
In 2023, Huang et al. proposed in [84] a DL-based sound coding strategy for CIs, called ElectrodeNet. By leveraging DNN, CNN, and LSTM architectures, ElectrodeNet replaces conventional envelope detection in the ACE strategy. Objective evaluations using measures such as STOI and NCM demonstrate strong correlations between ElectrodeNet and ACE. Additionally, subjective tests with normal-hearing listeners confirm the effectiveness of ElectrodeNet in sentence recognition for vocoded Mandarin speech. The study extends ElectrodeNet with ElectrodeNet-CS, incorporating channel selection (CS) through a modified DNN network. ElectrodeNet-CS produces N-of-M compatible electrode patterns and performs comparably or slightly better than ACE in terms of STOI and sentence recognition. This research showcases the feasibility and potential of deep learning in CI coding strategies, paving the way for future advancements in AI-powered CI systems.

The two examples show how the cochlea detection task can benefit from the proposed multi-scale approach. In particular, the second example shows how false positives (i.e., the connected auditory canal incorrectly detected by the 70-voxel-side CNN) are reduced and corrected in the final probability mask.

FIGURE 12: Taxonomy of AI-based applications for CIs and their benefits.

TABLE 2: List of publicly available datasets used for CI applications.
MHINT is a test resource developed in two versions: MHINT-M for use in Mainland China and MHINT-T for use in Taiwan. The development of MHINT took into consideration the tonal nature of Mandarin, recognizing the importance of lexical tone in designing the test.

TABLE 4: Summary of some proposed methods based on different DL techniques. When comparing a work with numerous existing schemes, only the best-performing one is highlighted.
The hidden state at time $t$, denoted $h_t$, is computed from the input $x_t$, the previous hidden state $h_{t-1}$, and the model parameters $W$ and $U$, with $b$ representing the bias term. The hidden state update takes the standard recurrent form $h_t = f(W x_t + U h_{t-1} + b)$, where $f$ is a nonlinear activation function. A reported error of approximately half of that obtained with a previously proposed technique is noted in the table. Gogate et al. [79] propose a robust real-time audio-visual speech enhancement framework for CIs. By leveraging a GAN and DNNs, the framework effectively addresses visual and acoustic speech noise in real-world environments. Experimental results demonstrate significant improvements in speech quality and intelligibility, offering potential benefits for CI users in noisy social settings.

TABLE 5: Summary of the performance and limitations of specific DL applications dedicated to CIs. In cases where multiple tests are conducted, only the best performance is reported.
The DL model comprises siren noise at 6 dB, a classifier, and the DDAE. Transfer learning (TL) is incorporated to help reduce the number of parameters in the model.