Identifying Genetic Variants Associated With Noise-induced Hearing Loss Based on a Novel Strategy for Evaluating Individual Susceptibility

Background: The overall genetic profile for noise-induced hearing loss (NIHL) remains to be explored. Here we used a novel machine learning (ML) strategy to evaluate individual susceptibility to NIHL and identify the underlying genetic variants based on a subsample of participants with extreme phenotypes. Methods: Demographic and audiometric data of 5,539 shipbuilding workers from large cross-sectional surveys were included in four ML algorithms to predict the hearing level. The area under the curve (AUC) and prediction accuracy were used to assess the performance of the classification models. We screened 300 participants who were misclassified by all four ML models, with extreme phenotypes implying that they were either highly susceptible or resistant to NIHL, and used whole-exome sequencing (WES) to identify the underlying variants associated with NIHL risk among the NIHL-susceptible and NIHL-resistant individuals. Subsequently, candidate risk loci were validated in a large independent noise-exposed cohort, followed by a meta-analysis. Results: With 10-fold cross-validation, the performances of the four ML models were robust and similar, with average AUC and accuracy ranging from 0.783 to 0.798 and 73.7% to 73.8%, respectively. The phenotypes of the NIHL-susceptible group and NIHL-resistant group were significantly different (all p<0.001). After WES analysis and filtering, 12 novel variants contributing to NIHL susceptibility were identified and replicated. The meta-analyses showed that the rs41281334 A allele of CDH23 (OR=1.506, 95% CI=1.106-2.051) and the rs12339210 C allele of WHRN (OR=3.06, 95% CI=1.398-6.700) were significantly associated with increased risk of NIHL after adjustment for conventional risk factors. Conclusions: This study identified two novel genetic variants, CDH23 rs41281334 and WHRN rs12339210, associated with NIHL risk, based on a potential approach for evaluating individual susceptibility using ML models.

Background
Noise is one of the most common pollutants in industrial settings and communities. Long-term exposure to hazardous noise can result in noise-induced hearing loss (NIHL), which has become the second most frequent form of sensorineural hearing loss, after age-related hearing impairment [1]. It is estimated that 16% of disabling hearing loss in adults is attributable to occupational noise worldwide, ranging from 7% to 21% in various subregions [2,3], posing a huge burden to health care [4].
NIHL is a heterogeneous disease induced by interactions between genetic and environmental factors. It is known that the risk of noise-induced damage to the auditory system depends on the noise intensity and duration of exposure [5]. However, there are undeniably large variations in individual susceptibility to noise exposure, and such differences are likely regulated by genetic background [6,7]. Twin studies estimated the heritability of NIHL to be approximately 36% [8], and several heterozygous and homozygous knockout mice, including SOD1, CDH23, and MYH14 models [9-11], have been shown to be more sensitive to noise than their wild-type littermates, demonstrating strain-specific variation in sensitivity. To date, many case-control studies screening for tag single nucleotide polymorphisms (SNPs) have been conducted in noise-exposed human populations, involving putative susceptibility genes that are known to play functional roles in the inner ear [12]. However, this method is inherently biased and limited in its scope for the discovery of novel mutations [13], and the overall genetic profile for NIHL remains to be explored.
To proceed with the study of NIHL susceptibility genes, an appropriate quantification and selection of workers who are highly susceptible or resistant to noise is crucial. High-throughput sequencing technology such as whole exome sequencing (WES) [14] is a highly effective approach for discovering genes underlying multifactorial diseases, and the extreme phenotype design will increase the power of a genome-wide search for susceptible or protective genetic variants [15,16]. Yet, scientific consensus is still lacking regarding screening protocols for identifying individuals who are at the two extreme ends of the NIHL distribution. Previously, susceptible individuals were selected with hearing loss at 3, 4, or 6 kHz after matching with controls in the aforementioned tag SNP studies. Other studies have estimated the individual susceptibility to NIHL based on the acoustic reflex [17] or on a regression model between hearing loss and noise exposure dose [18]. The sample sizes of these studies were relatively small, and they failed to take other risk factors into consideration. For example, aging [19], the use of alcohol and tobacco [20], and exposure to other harmful chemicals [21] aggravate the progress of NIHL, while female sex is a protective factor [22]. Another study explored the classification of noise-susceptible and noise-resistant workers based on the ISO 1999:1990 standard [23]; however, the results depended on the statistical distribution of hearing threshold levels in a specific noise-exposed population and were therefore arbitrary. In addition, while the notch audiogram at 3-6 kHz has been considered the marker for the differential diagnosis of NIHL from other types of sensorineural hearing loss [24,25], extended high-frequency (EHF) audiometry has been advocated in recent years and has shown a great advantage in the early identification of hearing loss due to various causes, especially noise exposure [26,27]. Thus, quantifying NIHL only at the traditional frequencies may introduce bias.
Machine learning (ML) techniques have been successfully used for building medical predictive classification models [28,29]. One major advantage of ML is its ability to identify nonlinear relationships between multiple variables and a targeted outcome (in our case, the contribution of various factors to NIHL), which makes it possible to assess individual susceptibility by integrating multiple factors. One recent study used ML to predict NIHL with limited variables including age, exposure duration, and the statistical metric of noise kurtosis [30]. That study focused only on the prediction accuracy of the algorithms, which ranged from 76.6% to 83%, while the reasons for the prediction errors were not further analyzed. Participants who are misclassified by ML, and who therefore deviate significantly from the predicted results, primarily have abnormal phenotypes. Some unknown mechanisms or factors are likely responsible for these deviations. In the case of NIHL, these mechanisms could be attributed to genetic variants.
In this study, we aimed to evaluate the individual susceptibility to NIHL in a large-scale noise-exposed population and investigate the potential genetic variants using the misclassified participants with extreme phenotypes screened by ML models. We hypothesized that WES analysis might identify variants associated with extreme susceptibility or resistance to NIHL. To our knowledge, this is the first study to discover variants using exome sequencing and an extreme phenotype study design for NIHL.

Study participants
The study was conducted in a large shipyard in Shanghai and comprised several phases. It was initiated in June 2015, after which the employees underwent occupational health examinations annually. The most recent auditory and demographic data were collected from a total of 6,840 Chinese noise-exposed workers, primarily comprising grinders, welders, stampers, and cutters, all of whom are exposed to high levels of noise. A structured questionnaire was completed through face-to-face interviews to collect demographic features, occupational history, medical history, smoking and alcohol drinking habits, family history including genetic and drug-related hearing loss, the use of hearing protection devices, and exposure to other harmful chemicals. The exclusion criteria are shown in Fig. 1. This study was approved by the Institutional Ethics Review Board of the Shanghai Sixth People's Hospital affiliated with Shanghai Jiao Tong University (Approval No: 2017-136) and was registered in the Chinese Clinical Trial Registry (registration number: ChiCTR-RPC-17012580). Potential consequences and benefits were explained, and written informed consent was obtained from each participant before this study.

Noise exposure estimation
Industrial noise was measured with an ASV5910-R digital recorder (Aihua Instruments; Hangzhou, China) across the work areas of different jobs according to the national standard of China [31]. The long-term equivalent (Leq) noise level was adopted as the primary exposure metric and measured three times at each spot. The mean Leq value at each spot was transformed into the 8-h continuous equivalent A-weighted sound pressure level (Leq8h, shown in Table 1). Cumulative noise exposure (CNE) was calculated from Leq8h over the years of on-duty time (Leq-total, dBA·year): CNE = Leq8h + 10 × lg(T), where lg is the base-10 logarithm and T is the duration of employment in years [32].
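The CNE formula above can be illustrated with a minimal Python sketch. The function name, argument names, and the example input values are hypothetical and chosen only for illustration; they are not taken from the study data.

```python
import math


def cumulative_noise_exposure(leq8h_dba: float, years: float) -> float:
    """Return CNE = Leq8h + 10 * lg(T) in dBA·year.

    leq8h_dba: 8-h continuous equivalent A-weighted level (dBA).
    years: duration of employment T in years (must be positive).
    """
    if years <= 0:
        raise ValueError("duration of employment must be positive")
    return leq8h_dba + 10 * math.log10(years)


# Hypothetical worker: 85 dBA for 10 years gives 85 + 10 * lg(10) = 95.0
print(cumulative_noise_exposure(85.0, 10.0))
```

Because the duration term is logarithmic, each tenfold increase in employment duration adds 10 dBA·year to the CNE.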

Machine learning for individual susceptibility assessment
Predictive modelling and performance evaluation
Different ML algorithms have different advantages. The following four supervised algorithms were used for the classification of hearing impairment: adaptive boosting (AdaBoost) [33], multi-layer perceptron (MLP) [34], random forest (RF) [35], and support vector machine (SVM) [36]. Age, sex, CNE, smoking, and alcohol drinking status were used as inputs for predictive modeling of the PTA 3-6 & 10-12.5 dichotomy. To train and validate the models, 10-fold cross-validation was adopted. In short, the entire dataset was randomly divided into 10 subsets using the caret package in R v3.6.1; nine of them were used for modeling, and the remaining one for validation. This step was repeated for 10 runs, and the parameters of each algorithm were adjusted to ensure that the model had the best classification performance, which was estimated by two indexes: the area under the receiver operating characteristic (ROC) curve (AUC) and prediction accuracy. The reported accuracy and AUC are the averages over the 10 cycles. The algorithms were implemented using the randomForest, adabag, monmlp, and e1071 libraries. When building the AdaBoost and RF models, default parameters were utilized. For the MLP model, we used five nodes in the first hidden layer and 15 ensembles to fit, and set a cut-off value of 0.5 in prediction probability for dichotomous classification. Regarding the SVM algorithm, hyper-parameters including gamma and cost were initially determined by 10-fold cross-validation, and the best combination was applied to train the classifier.
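The evaluation loop described above (10 folds, with AUC and accuracy averaged over the folds) can be sketched in plain Python. This is only an illustration under stated assumptions: the study used the caret, randomForest, adabag, monmlp, and e1071 packages in R, whereas the helper names here (`k_fold_indices`, `auc`, `accuracy`, `cross_validate`) and the `fit` callback interface are hypothetical. The AUC is computed by the rank-based (Mann-Whitney) definition.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def auc(labels, scores):
    """Probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case of each class")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, cutoff=0.5):
    """Fraction of cases whose thresholded score matches the label."""
    preds = [1 if s >= cutoff else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def cross_validate(features, labels, fit, k=10, seed=0):
    """Train on k-1 folds, score the held-out fold, and average AUC and accuracy.

    fit(train_x, train_y) must return a function mapping one feature row
    to a predicted probability of the positive class.
    """
    folds = k_fold_indices(len(labels), k, seed)
    aucs, accs = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        predict = fit([features[i] for i in train], [labels[i] for i in train])
        scores = [predict(features[i]) for i in held_out]
        truth = [labels[i] for i in held_out]
        aucs.append(auc(truth, scores))
        accs.append(accuracy(truth, scores))
    return sum(aucs) / len(folds), sum(accs) / len(folds)
```

Any of the four classifiers could be plugged in through `fit`, which keeps the evaluation protocol identical across models so that their averaged metrics are comparable.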
Individual susceptibility assessment and extreme individual selection
Individual susceptibility was assessed based on whether an individual was correctly classified, by comparing the predicted label with the actual label. We considered the few participants who were misclassified and had abnormal phenotypes to be either highly susceptible or resistant to NIHL. For example, those who were predicted to be in the B-HL group but actually had severe hearing loss were regarded as susceptible to NIHL, and conversely, those who were predicted to be in the W-HL group but actually had better hearing were regarded as resistant to NIHL. To avoid errors arising from any single model, the selection of extreme individuals for the next step of exome sequencing was strictly limited to subsamples of the two groups that were misclassified by all four models.
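The selection rule above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the dictionary layout, and the participant identifiers are hypothetical, and the sketch assumes that W-HL denotes the group with worse hearing and B-HL the group with better hearing, as the group comparison in the Results suggests.

```python
def select_extremes(actual, predictions_by_model):
    """Split participants misclassified by every model into two extreme groups.

    actual: dict mapping participant id -> true label ("B-HL" or "W-HL").
    predictions_by_model: dict mapping model name -> dict of id -> predicted label.
    Returns (susceptible_ids, resistant_ids).
    """
    susceptible, resistant = [], []
    for pid, truth in actual.items():
        preds = [model_preds[pid] for model_preds in predictions_by_model.values()]
        if all(p != truth for p in preds):
            if truth == "W-HL":
                # Every model predicted better hearing than was observed.
                susceptible.append(pid)
            else:
                # Every model predicted worse hearing than was observed.
                resistant.append(pid)
    return susceptible, resistant


# Hypothetical toy example with three participants and four models.
actual = {"p1": "W-HL", "p2": "B-HL", "p3": "B-HL"}
predictions = {
    "AdaBoost": {"p1": "B-HL", "p2": "W-HL", "p3": "B-HL"},
    "MLP": {"p1": "B-HL", "p2": "W-HL", "p3": "W-HL"},
    "RF": {"p1": "B-HL", "p2": "W-HL", "p3": "B-HL"},
    "SVM": {"p1": "B-HL", "p2": "W-HL", "p3": "B-HL"},
}
print(select_extremes(actual, predictions))  # (['p1'], ['p2'])
```

Requiring unanimity across models is the conservative choice: a participant such as `p3`, who is misclassified by only one model, is more likely to reflect that model's error than an unusual phenotype.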

Identification and replication of risk variants
The procedure for genomic DNA preparation and exome sequencing analysis is described in the supplementary material. To identify the most likely pathogenic mutations, functional variants were filtered as follows: (1) considering the limited sample size and the risk of false-negative signals, SNPs with a Fisher's exact test p value < 0.05 for genetic association, or with marginal significance (0.05 < p < 0.10), were selected; (2) the analysis was restricted to non-synonymous (missense), stop-gain/loss (nonsense), and splicing variants because changes in amino acids may affect biological functions; (3) the minor allele frequency of the mutation was less than 0.1 in one of the 1000 Genomes data (1000g_all), gnomAD data (gnomAD_ALL and gnomAD_EAS), and ExAC public databases; and (4) the mutation loci lay within candidate genes that have been shown to be involved in several crucial pathways, including oxidative stress, potassium ion circulation, heat shock proteins, notch signaling, and apoptosis signaling, or within monogenic genes of hereditary hearing loss [6,12,37] (see supplementary Table S2 for the gene list). From the filtered results, candidate pathogenic rare mutations were obtained by removing loci that merely reflect diversity between individuals. Two additional groups of noise-exposed participants were selected from the total sample for replication: 1,077 individuals with an average hearing threshold < 25 dB HL but with age and noise exposure dose as high as possible were classified as the low-risk group, and 1,031 individuals with an average hearing threshold ≥ 25 dB HL but with age and noise exposure dose as low as possible were classified as the high-risk group. The demographic characteristics of the two groups are summarized in Supplementary Table S1. Candidate SNPs were genotyped using the ligation detection reaction and SNaPshot assay. Ten percent of the samples were randomly selected and genotyped repeatedly for quality control, and the concordance was > 99.9%.
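The four filtering criteria can be expressed as a single predicate, sketched below. All field names (`fisher_p`, `functional_class`, `maf`, `gene`), the functional class labels, and the example records are hypothetical; real annotation output will use different column names. Two points are assumptions rather than statements from the text: criterion (3) is read as "below the cut-off in at least one database", and a frequency that is missing from a database is not counted as evidence of rarity.

```python
# Assumed labels for the retained functional classes (criterion 2).
FUNCTIONAL_CLASSES = {"missense", "stopgain", "stoploss", "splicing"}
# Population frequency sources named in criterion 3.
MAF_FIELDS = ("1000g_all", "gnomAD_ALL", "gnomAD_EAS", "ExAC")


def passes_filters(variant, candidate_genes, p_cutoff=0.10, maf_cutoff=0.1):
    """Return True if a variant record satisfies criteria (1) to (4)."""
    # (1) significant or marginally significant association
    if variant["fisher_p"] >= p_cutoff:
        return False
    # (2) protein-altering or splicing variant
    if variant["functional_class"] not in FUNCTIONAL_CLASSES:
        return False
    # (3) low frequency in at least one reference database (assumed reading)
    mafs = [variant["maf"].get(name) for name in MAF_FIELDS]
    if not any(m is not None and m < maf_cutoff for m in mafs):
        return False
    # (4) located in a pre-specified candidate gene
    return variant["gene"] in candidate_genes


# Hypothetical records; gene names and values are placeholders.
candidates = {"GENE_A", "GENE_B"}
variants = [
    {"id": "var1", "fisher_p": 0.03, "functional_class": "missense",
     "maf": {"gnomAD_EAS": 0.02}, "gene": "GENE_A"},
    {"id": "var2", "fisher_p": 0.03, "functional_class": "synonymous",
     "maf": {"gnomAD_EAS": 0.02}, "gene": "GENE_A"},
    {"id": "var3", "fisher_p": 0.08, "functional_class": "stopgain",
     "maf": {"1000g_all": 0.25}, "gene": "GENE_B"},
]
kept = [v["id"] for v in variants if passes_filters(v, candidates)]
print(kept)  # ['var1']
```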

Statistics
Continuous data are presented as mean ± standard deviation (SD) and were compared between groups using the Mann-Whitney test given their skewed distribution. Categorical data are expressed as number (%) and were compared using Pearson's χ2 test. The Hardy-Weinberg equilibrium test was performed before association analysis. The allelic frequencies between the NIHL-susceptible and NIHL-resistant groups were compared using Fisher's exact test, and logistic regression was used to compare the difference in genotype distributions between the two independent validation groups under an additive model (AA = 0, Aa = 1, aa = 2; a is the minor allele); odds ratios (ORs) with 95% confidence intervals (CIs) are presented. Statistical analyses were performed using SPSS 24.0 (IBM, Armonk, NY, USA) or PLINK v1.9 [38]. Combined ORs from the two stages were calculated using Comprehensive Meta-Analysis software (Biostat, Englewood, NJ, USA) with a fixed- or random-effect model after testing for heterogeneity. Differences were considered significant when p < 0.05.
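To make the pooling step concrete, the sketch below combines per-stage ORs with a fixed-effect inverse-variance model on the log-OR scale. It is an illustration only: the study used the Comprehensive Meta-Analysis software, which may differ in detail, the random-effect alternative and the heterogeneity test are not shown, and the function names and input values are hypothetical. Standard errors are recovered from the reported 95% CIs under a normal approximation.

```python
import math

Z95 = 1.959963984540054  # two-sided 95% quantile of the standard normal


def log_or_and_se(odds_ratio, lower, upper):
    """Recover the log OR and its standard error from an OR and its 95% CI."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z95)
    return math.log(odds_ratio), se


def fixed_effect_meta(studies):
    """Pool (OR, lower, upper) tuples by inverse-variance weighting.

    Returns the pooled OR with its 95% CI as (or, lower, upper).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for odds_ratio, lower, upper in studies:
        beta, se = log_or_and_se(odds_ratio, lower, upper)
        weight = 1.0 / se ** 2  # more precise studies weigh more
        weighted_sum += weight * beta
        weight_total += weight
    pooled = weighted_sum / weight_total
    pooled_se = math.sqrt(1.0 / weight_total)
    return (
        math.exp(pooled),
        math.exp(pooled - Z95 * pooled_se),
        math.exp(pooled + Z95 * pooled_se),
    )


# Hypothetical inputs: two stages reporting the same OR and CI.
pooled_or, pooled_lo, pooled_hi = fixed_effect_meta([(2.0, 1.0, 4.0), (2.0, 1.0, 4.0)])
print(round(pooled_or, 3), round(pooled_lo, 3), round(pooled_hi, 3))
```

With two identical inputs the pooled OR is unchanged while the interval narrows, which is the expected behavior of inverse-variance weighting.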

Results
Demographic characteristics
Figure 1 outlines the study procedures and criteria for participant exclusion. A total of 5,539 individuals with the most recent hearing data were included for classification model construction, and their demographic characteristics are summarized in Table 1. The whole study sample was divided into two groups based on hearing impairment level. Compared with the B-HL group, the W-HL group was older and had a longer career length, a larger average cumulative noise exposure dose, and a higher proportion of male workers, smokers, and drinkers. Notably, hearing impairment at the different frequency ranges was significantly worse in the W-HL group than in the B-HL group. All differences were statistically significant at p < 0.01.

Machine learning for individual susceptibility
The prediction performances of the four algorithms were robust and similar, with no significant difference in terms of accuracy and AUC. The MLP algorithm achieved the highest accuracy of 73.8% with an AUC of 0.797 for prediction, followed by AdaBoost

Discovery and validation of variants in the two extreme groups
After quality control of the WES data, we discovered 993,409 SNPs and 207,683 short indels. Following the filtering criteria described above, we screened 1,104 non-synonymous low-frequency mutations located in the exons. All SNPs were in Hardy-Weinberg equilibrium (p > 0.05). The presence of NIHL-associated genes in our cohort was further investigated. We retained the remaining loci that were enriched in the NIHL-susceptible group but were absent or observed at low frequencies in the NIHL-resistant group. Finally, with a threshold p value of 0.1, 12 novel variants were identified; detailed information and allele frequencies are displayed in Table 2. Given the sample size and the low mutation frequency, 10 variants in the NPC1, GJB2, EPHA2, TCIRG1, CDH23, KITLG, PTPRQ, WHRN, OTOGL, and ADGRV1 genes were significantly (P < 0.05) associated with the risk of NIHL, whereas two other mutations in the PTPRQ and KARS genes were marginally (0.05 < P < 0.1) associated with the risk of NIHL. To validate the effects of these SNPs, we further sought to replicate them in the independent samples. The rs10862089 variant could not be genotyped due to unqualified primer amplification; the remaining loci were successfully replicated in both the 300 WES samples and the 2,108 independent validation samples. Since the rs199632510 variant was not detected in the NIHL-resistant group, it was excluded from the following statistical analysis. Table 3 shows the association effects of the 10 SNPs with NIHL risk under the additive model. After adjustment for age, sex, CNE, smoking, and drinking status, rs1805084 of NPC1, rs72474224 of GJB2, rs41281334 of CDH23, rs147541734 of PTPRQ, and rs117041419 of KARS were significant in the 300 WES samples, while only rs41281334 of CDH23 and rs12339210 of WHRN were significant in the independent validation samples. We also performed a meta-analysis for these SNPs combining the results of the two cohorts. As shown in Table 3, the risk allele A of rs41281334 (OR = 1.506, 95% CI = 1.106-2.051) and the risk allele C of rs12339210 (OR = 3.06, 95% CI = 1.398-6.700) conferred a higher risk of NIHL.
Table note: Chr., chromosome; AA, amino acid. P values are the results of Fisher's exact test; the odds ratio (OR) with 95% confidence interval (CI) shown is for the minor allele.

Discussion
The present study proposes the utilization of ML algorithms for the assessment of individual susceptibility to NIHL based on prediction error, and investigates potential genetic variants through misclassified individuals with extreme phenotypes. In the first phase, all of the ML models exhibited robust and similar performances. Most importantly, we focused on the individuals who were misclassified by all four ML models, whose large deviations from the predicted results imply that genetic variation may be involved. In the second phase, we used WES to explore the underlying variants associated with NIHL risk and validated them in a large independent cohort. With this novel two-stage approach, we screened two subgroups of individuals with opposite phenotypes and identified, for the first time in the Chinese Han population, two novel variants, rs41281334 and rs12339210, that significantly increased the risk of NIHL.
Since hearing restoration is not currently available, computational methods are needed to identify susceptible individuals and novel candidate genes, so that high-risk individuals can be recognized early and prevented from working in noisy environments. ML algorithms are superior to traditional statistical methods owing to their flexibility and evolving performance in exploring the complex associations between risk factors and the development of NIHL. Although we expect the prediction accuracy of ML models to be as high as possible, it is limited when only phenotypic data are used, because genetic factors are not captured. All four models performed well in our study, achieving similar accuracy and AUC, with a prediction accuracy of approximately 74%. Those who were misclassified by all four ML models had extreme phenotypes; namely, the degree of their hearing loss did not conform with their age, noise exposure level, and other factors, indicating that this novel strategy could be applied to distinguish high-risk individuals in noise-exposed populations.
Several studies employing an extreme phenotype design have been successful in identifying rare variants [39,40]. In this study, we retrieved a wide range of previously reported NIHL-related candidate genes for discovering novel variants. Other unknown genes may also contribute to the variation in susceptibility, but we have not listed them for lack of definite evidence. It is recognized that replication of findings in independent populations is much more important than obtaining highly significant p values [41]. Thus, to avoid false-positive results, replication in independent sample sets and combined meta-analyses were performed. Despite significant phenotypic differences in our samples, most of the candidate loci showed negative results after adjustment. Notably, we identified the variant rs72474224 of GJB2, which is consistent with a recent study in which knock-in homozygous mice carrying the human p.V37I variant (c.109G>A) were more vulnerable to noise damage [42]; however, we failed to replicate this association in the validation samples. The CDH23 gene has been reported to be linked to NIHL susceptibility [43,44]. CDH23 encodes cadherin 23, which is expressed in a variety of structures within the inner ear. In our study, the A allele of rs41281334 increased NIHL risk by almost 1.5-fold. Although WHRN variants have not been reported to be associated with NIHL, the risk allele C of rs12339210 significantly increased NIHL risk by 3-fold even after adjustment in our study. The WHRN mRNA transcripts are expressed in the organ of Corti, vestibular, and retinal tissues, where the encoded protein aids in assembling large multiprotein complexes that play a role in maintaining stereocilia length [45]; mutations that disturb this function lead to autosomal recessive non-syndromic deafness type 31 (DFNB31) or Usher syndrome [46]. We speculate that the nonsynonymous mutations rs41281334 and rs12339210 located in exons might influence the expression level of the genes, but that heterozygous carriers do not initially have profound congenital deafness because of the weak genetic efficacy. However, their inner ear may be more vulnerable to noise exposure, and they may experience hearing loss earlier than a person without these variants.
Several limitations of this study should be mentioned. First, the data in this study were derived from cross-sectional surveys without long-term follow-up, which is critical to confirm the susceptibility of these workers. Second, the small sample size of the WES made it difficult to achieve genome-wide significance. In addition, all participants were from the Chinese Han population, which may introduce inherent bias due to the impact of ethnic differences on genetic polymorphism, and the results should be validated in multiple ethnic populations.

Conclusions
The present study provides a novel strategy to evaluate individual susceptibility to NIHL based on the prediction error of ML models and highlights an application for prescreening high-risk individuals from noise-exposed populations before gene sequencing. We also expanded the mutation spectrum of NIHL susceptibility genes and validated the association between CDH23 rs41281334 and WHRN rs12339210 variants and NIHL susceptibility for the first time with human genetic evidence; this should be followed up in larger cohorts and verified by functional studies.

Consent for publication
Not applicable.

Availability of data and materials
The data of this study are not publicly available due to the issue of intellectual property but are available from the corresponding author on reasonable request.