Environment Knowledge-Driven Generic Models to Detect Coughs From Audio Recordings

Goal: Millions of people are dying due to respiratory diseases, such as COVID-19 and asthma, which are often characterized by some common symptoms, including coughing. Therefore, objective reporting of cough symptoms utilizing environment-adaptive machine-learning models with microphone sensing can directly contribute to respiratory disease diagnosis and patient care. Methods: In this work, we present three generic modeling approaches – unguided, semi-guided, and guided approaches considering three potential scenarios, i.e., when a user has no prior knowledge, some knowledge, and detailed knowledge about the environments, respectively. Results: From detailed analysis with three datasets, we find that guided models are up to 28% more accurate than the unguided models. We find reasonable performance when assessing the applicability of our models using three additional datasets, including two open-sourced cough datasets. Conclusions: Though guided models outperform other models, they require a better understanding of the environment.


A. Motivation
According to the World Health Organization (WHO), over 6.5 million people have died worldwide since the COVID-19 outbreak in November 2019 [1]. COVID-19 has become one of this century's most devastating respiratory diseases due to its high death toll and long-lasting health complications. In addition to COVID-19, a range of inflammatory respiratory diseases, including chronic obstructive pulmonary disease (COPD), asthma, and many others, cause substantial mortality and morbidity. According to the Centers for Disease Control and Prevention (CDC), more than 15 million Americans are affected by COPD, and more than 150 thousand die of COPD each year, i.e., 1 death every 4 minutes due to COPD [2]. According to the Asthma and Allergy Foundation of America, on average, 10 Americans die of asthma every day [3]. While these respiratory diseases are spreading human suffering and upending the lives of billions of people around the globe, they share some symptoms. For example, common symptoms of COVID-19 are dry cough, fever, muscle or body aches, congestion, breathing difficulty, and fatigue, according to the CDC [4]. Similarly, patients with COPD experience coughing and difficulty in breathing [5]. Furthermore, asthma patients suffer from coughing and wheezing [6]. Coughing is thus one of the major symptoms of several respiratory diseases, including lung cancer, cystic fibrosis, aspiration, and bronchitis [7]. Therefore, an early and better understanding of cough and its patterns can help to assess people's condition and diagnose a disease. This is difficult with traditional approaches that rely on viral tests (based on samples from the nose and mouth) or antibody tests [8], chest X-ray or spirometry tests [9], blood tests, pulse oximetry, and sputum tests [10], [11], because they require time and resources that are not available at most primary care access points or at homes.
Automated and continuous reporting of cough symptoms using continuous smartphone-microphone sensing and predictive machine learning models can help us overcome the limitations of current approaches. Such smartphone-based objective cough reporting can help not only to detect people's conditions early but also to monitor patient conditions remotely. However, machine learning models are often trained in certain environments (e.g., clinics or homes) consisting of a known set of ambient sounds or noises [12], [13], [14] and may not generalize to new environments due to the lack of prior knowledge about the new backgrounds, i.e., unknown acoustical conditions or settings. For instance, models developed targeting forced coughs [15], [16] assume an ideal environment with low to no background noises, and models/apps developed for nocturnal cough detection [12], [17] assume relatively stable environments comprised of known continuous noises, such as air conditioner noises, that do not change frequently compared to dynamic daytime outdoor environments comprised of a wide range of known and unknown background noises. But incorporating prior knowledge about the environment is not always possible, especially for a new user. Therefore, designing a system that does not need prior knowledge about the background and can adapt over time will be helpful.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ VOLUME 4, 2023
On the other hand, some researchers have used deep neural network models to detect coughs since they can be easily deployed on edge devices, including smartphones [15], [16], [43]. For instance, a team of researchers has detected bronchitis, bronchiolitis, pertussis, and healthy coughs using a convolutional neural network (CNN) with a precision score above 0.8 [44]. Another group of researchers has detected COVID-19, pertussis, bronchitis, and healthy coughs using CNN and SVM classifiers with an overall accuracy of around 0.88 [16]. However, the implementation has relied on two major components: (1) a smartphone app to record a user's forced coughs and (2) a cloud server to process and detect coughs from the smartphone audio recordings. Another team has detected symptomatic COVID-19 coughs and healthy coughs through a similar system consisting of deep neural networks and smartphone-server integration with an area under the receiver operating characteristic curve of 0.88 [43]. These phone-server integration-based implementations have raised privacy concerns since privacy-sensitive raw audio data from a user's smartphone are sent to a remote cloud server.
To overcome privacy concerns, researchers have developed smartphone-based systems that do not require data upload [15]. However, this work still requires a user to cough in front of a smartphone microphone. These recordings are often captured in an ideal environment with relatively low background noise [15], [16]. Additionally, this approach may miss natural coughs (e.g., sleep-time coughs), which can better represent a user's state, and it is not applicable to dynamic environments with different background noises at varying intensities. Similarly, models/apps developed targeting nocturnal environments fail to work in dynamic daytime environments [12], [17]. Hence, there is a need for a generic cough detection system that can be used continuously and does not initially need prior knowledge about the environment. Over time, the system can adapt as it gets a better understanding of the environment.

C. Contribution
Adapted from our previous work [13], this work presents a trade-off between the availability of knowledge about a user's environment (i.e., the user's familiarity with the environment) and model performance when detecting coughs from smartphone-microphone audio recordings. We consider three modeling approaches based on the availability of knowledge about the environments (Section II-B): unguided (no prior knowledge is needed, but it is not the best-performing model), semi-guided (some, but no specific, prior knowledge is needed, resulting in a better-performing model than the unguided model), and guided (specific prior knowledge is needed, and it is the best performer). Compared to our previous work, the models in this work are tailored to a user's knowledge about the surroundings. For example, an unguided model suits a user who has no prior knowledge about the background, whereas semi-guided or guided models suit a user who has some knowledge about the surroundings. In this work, we utilize dynamic first and second temporal derivatives in addition to the Mel-frequency cepstral coefficients used in the previous work. In addition to the prior classifiers, we use gradient boosting, which works better than the other classifiers in most cases.
In this work, we test the applicability of our models using six distinct datasets, including two respiratory disease-specific datasets. The first three datasets (Sections II-C1-II-C3) are used to develop and determine the best models. To determine the applicability of generic cough models, we use three additional cough datasets (Sections II-C4-II-C6), including respiratory disease-specific COVID-19 and COPD datasets.
We find that the guided models can achieve around 12%-28% higher accuracy and F1 score when compared to the unguided models (Sections III-A and III-B). Additionally, the semi-guided models perform relatively better than the unguided models. Therefore, semi-guided models can be an intermediate step between the unguided and the guided models for situations where a user does not have a clear idea about the environment at the beginning but, over time, gains a better understanding of environment-specific data.

II. MATERIALS AND METHODS
While developing models to deploy in real-life settings, knowing the environments and the number of classes is always a major challenge. This problem is even more severe while developing models to detect a particular type of sound, e.g., cough, with or without the presence of various background noises in unconstrained natural life. In Fig. 1, we present three modeling schemes with or without some knowledge about environments.
Fig. 1. Knowledge-driven modeling schemes with cough sounds (class-1) and non-cough sound categories (class-0); m stands for the total number of instances from the cough or non-cough class, r (= 5 for guided, or 15 for semi-guided) stands for the total number of non-cough sound types used in class-0, and n stands for the total number of instances per non-cough sound type (i.e., n = m/r) when modeling. Later, in Tables I and II, we present the values of these parameters when discussing our train-test methodology in Section II-D2.
This section introduces three categories of non-cough sounds used in this work, followed by our three environment knowledge-based modeling approaches. Then, we introduce the multiple datasets we utilize in this work, followed by our approaches to processing the data and developing models.

A. Non-Cough Sound Categories
We utilize the following three categories of environmental sounds to construct our non-cough class (i.e., class-0).
• Category#1 (Animal sounds): As a representation of animal sounds, we utilize five types of sounds, i.e., frog, crow, cricket, rooster, and dog sound recordings.
• Category#2 (Human-made sounds): We use snoring, breathing, sneezing, laughing, and throat clearing (T/C) sound recordings as a representation of human-made sounds.
• Category#3 (Hardware sounds): As a representation of hardware sounds, we include washing machine (W/D), door knock (D/K), vacuum cleaner (V/C), engine, and air conditioner (A/C) sound recordings.

B. Knowledge-Driven Modeling Schemes
As depicted in Fig. 1, in this work, we present three modeling schemes based on a user's prior knowledge about the environments. Our modeling approaches are:

1) Unguided Models:
In this approach, we develop models assuming that a user does not have any prior knowledge about the environment composed of various sounds, except the target sound, e.g., cough (class-1). Therefore, we develop the unary (one class) models using only cough instances (m), as demonstrated in Fig. 1. In this unary modeling approach, part of the cough instances from class-1 will be used as non-cough instances depending on the values of the outlier threshold parameter, which will be presented in more detail in the "Parameter Optimization" section (Section II-E4). In the case of unary models, no non-cough instances will be used for model training. This will be further discussed in the "Training-Test Splits" section (Section II-D2). Though this type of model has broader applicability, it may underperform compared to the models developed with some prior knowledge about the environments.
2) Guided Models:
In this approach, we assume that a user has a detailed understanding of the environments and the different noises in the backgrounds, unlike in the case of the unguided models. We develop three separate binary guided models, each considering one of the three background sound categories (Section II-A) as class-0 (non-cough class). Each sound category comprises five types of sounds (i.e., r = 5 as demonstrated in Fig. 1), and n = m/r random non-cough instances will be picked uniformly from each of the five types of sounds for class balancing. This will be further discussed in the "Training-Test Splits" section (Section II-D2). In all cases, class-1 is composed of cough events. While the binary models developed from one type of environment are expected to work well in a similar type of environment, those models may struggle in other types of environments. For example, while models trained considering the presence of five types of animal sounds work well for similar backgrounds, they are expected to struggle when deployed or tested in environments with hardware noises.

3) Semi-Guided Models:
In this approach, we assume that a user has a better understanding of the background environment than in the case of the unguided models, but not as detailed as in the case of the guided models. Therefore, we utilize the coughs (class-1) and r = 15 types of non-cough sounds when developing binary models for the semi-guided environments (Fig. 1). For class balancing, n = m/r random non-cough instances will be picked uniformly from each of the 15 sound types presented in Section II-A. This will be further discussed in the "Training-Test Splits" section (Section II-D2). Models developed this way are expected to work better than the unguided models, but worse than the guided models.
In Section II-E2, we present the naming convention of different unguided, guided, and semi-guided models developed and tested in this work.
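The n = m/r balancing used by the guided and semi-guided schemes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `per_type_counts` and `build_class0`, the clip pool, and the rule for distributing the remainder are assumptions. It uses m = 106, the value given later in Section II-D2, and shows why n is 21-22 for r = 5 and 7-8 for r = 15 when m is not divisible by r.

```python
import random

def per_type_counts(m, r):
    """Split m class-0 instances as evenly as possible across r sound types."""
    base, extra = divmod(m, r)
    # 'extra' types receive one additional instance so the counts sum to m.
    return [base + 1 if i < extra else base for i in range(r)]

def build_class0(pool, m, seed=0):
    """Draw a balanced class-0 set from a {sound_type: [clip_ids]} pool."""
    rng = random.Random(seed)
    types = sorted(pool)
    rng.shuffle(types)  # randomize which types get the extra instance
    counts = per_type_counts(m, len(types))
    picked = []
    for sound_type, n in zip(types, counts):
        picked += [(sound_type, clip) for clip in rng.sample(pool[sound_type], n)]
    return picked

# Hypothetical pool: 5 animal sound types with 40 clip ids each.
animal_pool = {t: list(range(40)) for t in ["frog", "crow", "cricket", "rooster", "dog"]}
class0 = build_class0(animal_pool, m=106)
print(len(class0), per_type_counts(106, 5), per_type_counts(106, 15)[:3])
```

For the semi-guided scheme, the same helper would be called with a pool holding all 15 sound types.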

C. Audio Datasets
In this manuscript, all our modeling and model performance assessments are based on six different audio datasets collected using smartphone microphones. To develop models and determine the best models, we utilize three datasets: 1) Environmental Sound Classification (abbreviated as ESC) dataset, 2) FreeSound dataset, and 3) Urban Sound 8 K (abbreviated as US-8 K) dataset. To test the applicability of our models, we use three additional datasets: 4) SoundSnap (abbreviated as SNP) dataset, 5) Coswara COVID-19 (abbreviated as COVID-19 or COVID) dataset, and 6) chronic obstructive pulmonary disease (abbreviated as COPD) dataset. When developing models, we also consider three categories of non-cough sounds to constitute class-0.

1) ESC Dataset:
To train and test different models, we utilize the Environmental Sound Classification (ESC) dataset [45], which is composed of 50 distinct sound types with 40 5-second-long labeled clips per type. The audio clips are recorded at a 44.1 kHz sampling frequency. We mainly use this dataset to obtain our training cough instances. Each cough audio recording comprises multiple two- or three-phase cough events [13]. In addition to healthy people's cough sounds, we use this dataset to obtain three categories of background sounds: 1) five types of animal sounds (i.e., frog, crow, cricket, rooster, and dog sounds), 2) four types of human-made sounds (i.e., snoring, breathing, sneezing, and laughing sounds), and 3) four types of hardware sounds (i.e., door knock (D/K), washing machine (W/D), vacuum cleaner (V/C), and engine sounds).
2) FreeSound Dataset:
We consider the FreeSound dataset [46] to obtain throat clearing (T/C) sounds as one of the five types of human-made sounds used in this work. We obtained 37 clips that are 2.58 ± 4.2 seconds long and sampled at 44.6 ± 4.2 kHz frequency. For model development and noise augmentation, we use throat clearing (T/C) clips as common background noise. During binary-model training, these noise clips are used as part of class-0.

3) US-8 K Dataset:
To gather air conditioner sounds, i.e., one of the five types of hardware sounds used in this work, we utilize the Urban Sound 8 K (US-8 K) dataset [47], which is composed of 8732 labeled sound clips obtained from 10 urban sound types. Clips are up to 4 seconds long and sampled at a frequency of 44.1 kHz. From this dataset, we consider 40 randomly picked air conditioner (A/C) sound clips as a source of common background sounds (class-0) while developing models.

4) SNP Dataset:
To determine the robustness of our models trained from ESC-coughs, we consider the SoundSnap (SNP) dataset [48] to obtain test cough sounds from healthy people. Each audio clip consists of multiple cough events and is recorded at a sampling frequency of 46.65 ± 11.10 kHz. Therefore, we segment these cough clips into events (discussed in Section II-D1).

5) COVID-19 Dataset:
To determine the applicability of our models trained from the ESC-coughs, we use the cough recordings gathered from the Coswara COVID-19 dataset [49], [50]. The Coswara COVID-19 dataset is still growing. Audio clips are recorded at a sampling rate of 47.82 ± 0.83 kHz. This dataset contains breathing, coughing, and speech sounds collected from healthy and unhealthy participants. We collect cough and breathing sounds from participants who tested positive for COVID-19. Throughout this manuscript, we use the terms "COVID" and "COVID-19" interchangeably to indicate this dataset and the coughs obtained from it.

6) COPD Dataset:
We also collect coughs from a set of 12 patients (average age of 56.2 ± 0.9 years) with chronic obstructive pulmonary disease and name the dataset the COPD dataset [14]. We recorded coughs using the RecForge II smartphone application at a sampling frequency of 44.1 kHz. We kept smartphones around one meter away from the subjects. We utilize this COPD dataset to test the applicability of models developed from the ESC-coughs.

D. Data Processing
Since we obtain data from various sources, we first modify the sampling frequency of all cough and non-cough audio events to a fixed sampling frequency of 44.1 kHz before any further processing. Next, we go through the following steps.

1) Audio Segmentation and Cough Event Extraction:
In this manuscript, we use various types of non-cough data that are already labeled. On the other hand, each cough clip contains multiple cough events, each with either two or three phases [13]. Therefore, we follow a two-fold approach to collect cough event ground truths from audio clips. First, we use the Audacity desktop application [51] to load the audio clips and then perform a visual and auditory inspection to determine cough events and their phases before cropping/segmenting and storing them. Next, we automate the process by developing an energy threshold-based audio segmentation followed by a phase classification approach, similar to the method developed in our previous work [12]. In Table I, we present a summary of cough events obtained from various datasets.
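The energy threshold-based segmentation step can be illustrated with a generic sketch. The details of the method from [12] are not given in this section, so the frame length, the threshold, the mean-energy criterion, and the function name `segment_by_energy` below are all assumptions chosen for illustration. The phase classification step is not covered.

```python
def segment_by_energy(samples, frame_len, threshold, min_frames=1):
    """Return (start, end) sample indices of runs of frames whose
    mean energy exceeds the threshold."""
    segments, start = [], None
    n_frames = len(samples) // frame_len
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold and start is None:
            start = f                      # a candidate event begins
        elif energy <= threshold and start is not None:
            if f - start >= min_frames:    # keep only long-enough runs
                segments.append((start * frame_len, f * frame_len))
            start = None
    if start is not None and n_frames - start >= min_frames:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments

# Toy signal: silence, a loud burst, silence, a second (quieter) burst.
signal = [0.0] * 40 + [0.8, -0.8] * 20 + [0.0] * 40 + [0.5, -0.5] * 10
print(segment_by_energy(signal, frame_len=10, threshold=0.01))
```

On real recordings, the threshold would need to be tuned to the background noise level, which is one reason the manual inspection step remains useful as ground truth.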
2) Training-Test Splits:
For class balancing, we start with the same number of instances, m = 106, from class-1 (cough class) and class-0 (non-cough class). As presented in Table II, for class-0, we uniformly pick the samples from the five types of sounds (i.e., r = 5 for each of the three guided models) or 15 types of sounds (i.e., r = 15 for the semi-guided models), gathered from the three sound categories. When splitting into train-test sets, we first randomly split the m = 106 original coughs 10 times using a 90%-10% mutually exclusive train-test split to perform 10 rounds of training and testing. This way, each split consists of around 96 (i.e., 106 × 0.9, rounded up) train and 10 (i.e., 106 − 96 = 10) test coughs. Similarly, we pick the same number of random train-test non-cough instances uniformly from r = 5 (in the case of each guided model) or r = 15 (in the case of the semi-guided model) non-cough sound types. Thereby, when developing guided models, we randomly pick n = 21-22 instances per sound type from one of the three non-cough sound categories (animal, human-made, or hardware), each consisting of r = 5 non-cough sound types, as class-0. Similarly, when developing semi-guided models, we randomly pick n = 7-8 instances per sound type from the r = 15 non-cough sound types spanning the three non-cough sound categories (animal, human-made, and hardware) as class-0.
In each split, we also consider the 17 augmentations (presented in the next section, i.e., Section II-D3) of each training cough event/instance in the training set. Similarly, we also consider the 17 augmentations of each test cough event/instance along with the original cough events/instances. Thereby, we obtain a total of 1728 (i.e., 96 * (1+17)) training instances and 180 (i.e., 10 * (1+17)) test instances from each class with mutual exclusion between train-test sets.
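The split sizes and instance counts above can be reproduced with a short sketch. The constants come from the text; rounding the training size up (95.4 to 96), the random seed, and the helper name `random_split` are assumptions.

```python
import math
import random

M = 106            # cough instances per class (from the text)
TRAIN_FRAC = 0.9   # 90%-10% split
N_AUG = 17         # augmentations per instance (Section II-D3)
N_ROUNDS = 10      # rounds of training and testing

def random_split(ids, train_frac, rng):
    """Shuffle ids and return mutually exclusive (train, test) lists."""
    shuffled = ids[:]
    rng.shuffle(shuffled)
    n_train = math.ceil(len(shuffled) * train_frac)  # 95.4 -> 96
    return shuffled[:n_train], shuffled[n_train:]

rng = random.Random(42)
splits = [random_split(list(range(M)), TRAIN_FRAC, rng) for _ in range(N_ROUNDS)]

train, test = splits[0]
train_total = len(train) * (1 + N_AUG)   # originals plus their augmentations
test_total = len(test) * (1 + N_AUG)
print(len(train), len(test), train_total, test_total)
```

Splitting the original events first and augmenting afterwards keeps every augmented copy on the same side of the split as its source event, which preserves the mutual exclusion between train and test sets.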

3) Data Augmentation:
In real-world settings, audio cough recordings are altered due to variations in a user's physical and mental conditions (excitement, tiredness, exercise, and numerous other states) as well as changes in the environments, i.e., backgrounds. To imitate these changes and capture the associated variations in audio recordings when developing models, we augment original cough and non-cough events gathered from the US-8 K, FreeSound, and ESC datasets using various pitch shifts and time stretches. With these augmentations, we introduce data variation to train a model that is more resistant to overfitting. We use 14 pitch shifts (±0.5, ±1, ±1.5, ±2, ±2.5, ±3, ±3.5) and three time stretches (0.25, 0.5, and 0.75).
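The 17 augmentations per event can be enumerated as a parameter grid. This sketch assumes the seven pitch-shift magnitudes run from 0.5 to 3.5 in steps of 0.5, which yields the 14 shifts the text counts; the unit of the shifts is not stated here, and the tuple labels are hypothetical. The actual audio transformation would be done by an audio library and is not shown.

```python
# Pitch shifts: magnitudes 0.5 to 3.5 in steps of 0.5, each with both signs
# (7 magnitudes x 2 signs = 14 shifts).
pitch_magnitudes = [0.5 * k for k in range(1, 8)]
pitch_shifts = [sign * mag for mag in pitch_magnitudes for sign in (+1, -1)]

# Time-stretch rates listed in the text.
time_stretches = [0.25, 0.5, 0.75]

# One (kind, parameter) entry per augmented copy of an event.
augmentations = ([("pitch_shift", s) for s in pitch_shifts]
                 + [("time_stretch", r) for r in time_stretches])
print(len(pitch_shifts), len(time_stretches), len(augmentations))
```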

4) Feature Extraction:
In this work, we primarily use Mel-frequency cepstral coefficients (MFCCs) [52], a widely used spectral feature extraction method in speech recognition. In this feature extraction method, frequency bands are adapted to human perception levels. However, MFCCs alone are static features and can miss local temporal dynamics. Therefore, we also use the first and second temporal derivatives (Δ and Δ-Δ) to mitigate this limitation of MFCCs. This combination of dynamic features and static MFCCs can be useful to increase the accuracy and the robustness of various audio event detection systems [53]. Thereby, we compute 40 MFCCs, 40 Δ, and 40 Δ-Δ features, i.e., a set of 120 candidate features from every cough and non-cough event.
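To illustrate how first and second temporal derivatives extend static coefficients, the sketch below uses the common regression-based delta formula over neighbouring frames. The window width, the edge handling, and the function names are assumptions; the paper does not state how the derivatives were computed or how per-frame values are summarized into one vector per event.

```python
def delta(frames, width=2):
    """Regression-based temporal derivative of a per-frame feature sequence.

    frames: list of frames, each a list of coefficients.
    width:  neighbouring frames used on each side (an assumed value).
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        row = []
        for c in range(len(frames[t])):
            acc = 0.0
            for k in range(1, width + 1):
                nxt = frames[min(t + k, n - 1)][c]   # repeat edge frames
                prv = frames[max(t - k, 0)][c]
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out

def stack_features(mfcc):
    """Concatenate static, delta, and delta-delta coefficients per frame."""
    d1 = delta(mfcc)
    d2 = delta(d1)
    return [m + a + b for m, a, b in zip(mfcc, d1, d2)]

# Toy input: 6 frames of 40 coefficients that grow linearly over time.
mfcc = [[float(t)] * 40 for t in range(6)]
features = stack_features(mfcc)
print(len(features[0]))  # 40 static + 40 delta + 40 delta-delta
```

For the linear toy input, the delta of an interior frame equals the slope of the ramp, which is a quick sanity check on the formula.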

E. Model Development
In this section, we first present the classifiers and model naming conventions used in our modeling approach presented in this work. Next, we present the optimization steps.

Poly. kernel: $K(y_i, y_j) = (1 + \gamma\, y_i^{\top} y_j)^{d}$ (1)
RBF kernel: $K(y_i, y_j) = \exp\left(-\gamma \lVert y_i - y_j \rVert^{2}\right)$ (2)
where $\gamma$ and $d$ represent the "scale" and "degree" parameters, respectively. In the equations, $y_i$ and $y_j$ represent the two feature vectors. Also, we use the parameter C to indicate the misclassification penalty/cost. For unary models, we use the support vector machine with polynomial and RBF kernels provided by the scikit-learn machine learning package.
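As a small numerical illustration of the two kernels, the sketch below evaluates the standard polynomial and RBF forms with scale parameter γ and degree d. Treating these standard forms as the ones used in this work is an assumption, and the function names and example vectors are hypothetical.

```python
import math

def poly_kernel(yi, yj, gamma, d):
    """Polynomial kernel: (1 + gamma * <yi, yj>) ** d."""
    dot = sum(a * b for a, b in zip(yi, yj))
    return (1.0 + gamma * dot) ** d

def rbf_kernel(yi, yj, gamma):
    """RBF kernel: exp(-gamma * ||yi - yj||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(yi, yj))
    return math.exp(-gamma * sq_dist)

# Two toy feature vectors.
yi, yj = [1.0, 2.0], [0.5, -1.0]
print(poly_kernel(yi, yj, gamma=0.5, d=2), rbf_kernel(yi, yi, gamma=0.5))
```

The RBF kernel of a vector with itself is always 1, while the polynomial kernel grows with the inner product, so the two kernels respond differently to feature scaling.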

2) Model Naming Convention:
Throughout this manuscript, we follow a standard naming convention when referring to different models developed using different classifiers and sound categories. We use a compound term "Modeling approach-Classifier type", followed by "Classifier abbreviation-Number of sound types used to make class-0", followed by "(sound category abbreviation)" to refer to a specific model. For example, "G-B RF-5 (M)" is used to indicate an optimal "guided" model trained with a "binary random forest (RF)" classifier using the "five types of human-made sounds as class-0." Similarly, "G-B GB-5 (A)" and "G-B RF-5 (H)" are used to indicate optimal "guided" models trained with "gradient boosting (GB)" and "random forest (RF)" classifiers using five types of background sounds gathered from the "animal" and "hardware" sounds, respectively, as class-0. Class-1 consists of coughs, as always. Similar to guided models, we use "S-B RF-15" to refer to an optimal "semi-guided" model trained with a "binary random forest (RF)" classifier using fifteen types of sounds (gathered from the three sound categories) as class-0 and cough sounds as class-1. Since the negative class is comprised of all three sound categories, we simply drop the sound category from the term. Finally, we use "U-U SVM" to refer to an optimal "unguided" model trained with a "unary support vector machine (SVM)" classifier using only cough sounds. Since we do not have any non-cough sounds, we drop the class-0 constituting sound type count and sound categories from the term.
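The naming convention can be summarized as a small formatting helper. The function `model_name` and its argument names are hypothetical and only restate the rules above; the abbreviations and example labels come from the text.

```python
CATEGORY_ABBR = {"animal": "A", "human-made": "M", "hardware": "H"}
APPROACH_PREFIX = {"guided": "G-B", "semi-guided": "S-B", "unguided": "U-U"}

def model_name(approach, classifier, n_types=None, category=None):
    """Build a label such as 'G-B RF-5 (M)', 'S-B RF-15', or 'U-U SVM'."""
    name = f"{APPROACH_PREFIX[approach]} {classifier}"
    if n_types is not None:      # unary models have no class-0 sound types
        name += f"-{n_types}"
    if category is not None:     # dropped when all three categories are used
        name += f" ({CATEGORY_ABBR[category]})"
    return name

print(model_name("guided", "RF", 5, "human-made"))
print(model_name("semi-guided", "RF", 15))
print(model_name("unguided", "SVM"))
```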
3) Feature Optimization:
We consider the "Select the K Best" approach to determine the most dominant feature sets for binary classifier-based guided and semi-guided models. While training a model, we choose different sets of features and calculate the performance (ACC and F1 scores) of the model using the 90% training data of a random split. We finally compute the average of the 10 scores obtained from 10 separate splits for a specific type of model with a particular feature count. From our experiment, we find that K = 120 is an optimal choice for the best guided and semi-guided models. Similarly, we consider a variance-based approach (i.e., smallest or largest variance) to select different sets of influential features for the unary classifier-based unguided models. From our experiment, we find that the 70 smallest-variance features are a good compromise for the best unguided models. In Table III, we present various classifiers/models with their optimal feature count.
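The variance-based selection used for the unguided models can be sketched as ranking feature columns by their variance over the training instances and keeping the k smallest. The function name and the toy matrix are hypothetical; whether features were scaled before ranking is not stated in the text, so no scaling is applied here.

```python
from statistics import pvariance

def smallest_variance_features(X, k):
    """Return the indices of the k columns of X with the smallest variance."""
    n_features = len(X[0])
    variances = [pvariance([row[j] for row in X]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j])
    return sorted(ranked[:k])

# Toy matrix: 4 instances x 3 features; column 1 is constant and
# column 2 varies the most.
X = [[1.0, 5.0, 0.0],
     [2.0, 5.0, 10.0],
     [1.5, 5.0, -10.0],
     [1.0, 5.0, 20.0]]
print(smallest_variance_features(X, k=2))
```

In the setting described above, X would hold the 120 candidate features of the training coughs and k would be 70.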

4) Parameter Optimization:
When training models with the 90% data, we utilize grid search to determine optimal values for different parameters from a range of values, which includes the degree, d ∈ [1, 3]. For each candidate set of parameter values, we calculate the performance (ACC and F1 scores) of the model. We finally compute the average of the 10 scores obtained from 10 separate splits for a specific type of model with a particular set of parameter values. In Table III, we present various classifiers/models with their associated set of parameters and their optimal values.

III. RESULTS
In this manuscript, we consider recall, accuracy (ACC), false positive rate (FPR), precision, false negative rate (FNR), and F1 score to compare the performance of different modeling approaches. Additionally, we consider the area under the receiver operating characteristic curve (AUC-ROC) for the binary classifier-based models.
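For reference, the count-based metrics listed above follow directly from a confusion matrix with cough as the positive class. The sketch below uses hypothetical counts, not results from this work, and omits guards against empty classes.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the reported metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)              # sensitivity, equal to 1 - FNR
    fpr = fp / (fp + tn)                 # specificity is 1 - FPR
    precision = tp / (tp + fp)
    return {
        "recall": recall,
        "precision": precision,
        "FPR": fpr,
        "FNR": fn / (tp + fn),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "F1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for 180 cough and 180 non-cough test instances.
scores = binary_metrics(tp=162, fp=27, tn=153, fn=18)
print({k: round(v, 3) for k, v in scores.items()})
```

AUC-ROC is not derivable from a single confusion matrix because it needs the classifier's scores across thresholds, which is why it is reported only for the binary classifier-based models.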

A. Unguided Model Evaluation
As discussed in Section II-B1, we develop unguided models using unary SVM with the "Polynomial kernel" (Poly.) (1) or the "Radial Basis Function" (RBF) kernel (2). After training, we apply our trained models on the 10% test data (discussed in Section II-D2). In Table III, we summarize the test performance values of different unguided models trained with the unary SVM classifiers utilizing only cough events. In the table, models are presented with their optimal parameter values and feature counts. We observe that the Poly. kernel-based unguided model (highlighted rows in the table) always outperforms the RBF kernel-based unguided model. While testing the unguided models in environments with the presence of 15 types of background sounds, we observe that the Poly. kernel-based model achieves 38% higher accuracy and ≈ 20% higher F1 score than the RBF kernel-based model. Additionally, in the case of the unary SVM Poly. kernel-based unguided model, we observe on average ≈ 15% higher accuracy when testing in environments with hardware sounds compared to environments with animal or human-made sounds (Table III).

B. Guided and Semi-Guided Model Evaluation
As discussed in Sections II-B2 and II-B3, we develop guided and semi-guided models using binary models. After training, we apply our trained models on the 10% test data (discussed in Section II-D2). In Table III, we summarize the test performance of different guided and semi-guided models trained from cough events (class-1) and non-cough events (class-0). In the table, models are presented with their optimal parameter values and feature counts. While the gradient boosting (GB) guided model works the best for the environments comprised of animal sounds (i.e., class-0), the random forest (RF) guided models work the best for the environments comprised of human-made sounds and hardware sounds (i.e., class-0) (highlighted rows in the table). Among the three best guided environment models, we observe that the RF-based guided model for the environments with human-made sounds has the lowest average accuracy of .89 ± .04. Compared to this human-made environment guided model, the other two models achieve ≈ 7% (i.e., (.95 − .89)/.89 × 100%, for both animal and hardware) higher accuracy. The lowest performance in environments with human-made background sounds can be explained by the close similarity between coughs and other human-made sounds compared to animal and hardware background sounds.
TABLE IV: SUMMARY OF THE STATE-OF-THE-ART AND OUR WORK
In the case of semi-guided models, the RF classifier-based model works the best in environments with the presence of all 15 types of background sounds (i.e., the last block of seven models in Table III). We also test the best semi-guided model, i.e., RF, on the three categories of environments separately, and we obtain average accuracies of .9 ± .02 (animal), .84 ± .03 (human-made), and .91 ± .03 (hardware), respectively. This finding supports our previous findings from testing three separate guided models on their relevant environments.
Comparison with the state-of-the-art: We primarily use sensitivity (SEN = 1 − FNR), specificity (SPE = 1 − FPR), and AUC-ROC to compare the performance of our models with some benchmarks (Table IV). While our models achieve the highest SEN and AUC-ROC compared to other works, they suffer from lower SPE compared to others [40], [54], [55]. However, we tested our models with a wider range of non-cough events than other works [40], [41], [54], [55]. Moreover, some works use ECG, thermistor, chest belt, accelerometer, and contact data in addition to audio data [55].

C. Model Comparison
In this section, we present the performance comparison among unguided, semi-guided, and guided models using different datasets. First, in Section III-C1, we present the model comparison when testing on the part of the three known datasets (i.e., ESC, FreeSound, and US-8 K datasets), but we keep the train and test sets mutually exclusive and change the background environments. In this comparison, we use the accuracy measure to compare different models when testing on various cough sounds (class-1) and non-cough sounds (class-0) separately.
Next, in Section III-C2, we compare the applicability of models (trained from the ESC, FreeSound, and US-8 K datasets) when testing on cough samples obtained from three unknown datasets, i.e., SNP, COPD, and COVID datasets. In this comparison, we primarily have test cough sounds (class-1) obtained from the unknown datasets. Therefore, we use the accuracy measure for performance comparison.

1) Model Comparison Using Known Datasets With Varying Environments:
In Fig. 2, we observe that, in general, guided and semi-guided models outperform the unguided model. When testing the three types of models on cough sounds, we observe that guided models, except the "G-B RF-5 (M)", perform better than the semi-guided model, which outperforms the unguided model, i.e., the "U-U SVM" model. Similarly, when comparing the three guided models, we observe that models trained and tested in similar environments outperform the other two models trained from different environments. For example, when testing on the five types of animal sounds, the "G-B GB-5 (A)" guided model outperforms the other two guided models trained from human-made sounds (i.e., "G-B RF-5 (M)") and hardware sounds (i.e., "G-B RF-5 (H)"). Similarly, the "G-B RF-5 (M)" guided model works the best when tested on human-made sounds, and the "G-B RF-5 (H)" guided model performs the best when tested on hardware sounds. Compared to animal and hardware sounds, human-made sounds lead to lower performance when applying the best guided models to their relevant sounds/environments. This is similar to what we have observed and discussed in Section III-B.
In the upper part of Table V, we summarize the test accuracy values of various models using different datasets. In general, we observe improvements when moving from unguided to semi-guided to guided models. Compared to the unguided models, we achieve an increase in average accuracy of 18% (cough), 22% (animal), 14% (human-made), and 7% (hardware) using the best guided models, decided based on the highest confidence. All three types of models underperform when applied to human-made sounds.
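One possible reading of choosing the best guided model "based on the highest confidence" is sketched below. This is an assumption made for illustration only: each guided model is assumed to output a probability that a sample is a cough, confidence is taken as the distance of that probability from chance, and the model whose prediction is most confident is selected. The function name and the probability values are hypothetical; only the model labels come from the text.

```python
def pick_most_confident(probabilities):
    """Select the guided model whose prediction is most confident.

    probabilities: {model_name: P(cough)} for a single sample (assumed input).
    Returns (model_name, predicted_label, confidence).
    """
    best_name, best_label, best_conf = None, None, -1.0
    for name, p_cough in probabilities.items():
        # Confidence of the predicted class, whichever class it is.
        confidence = max(p_cough, 1.0 - p_cough)
        if confidence > best_conf:
            best_name = name
            best_label = 1 if p_cough >= 0.5 else 0
            best_conf = confidence
    return best_name, best_label, best_conf


# Hypothetical per-model cough probabilities for one test sample
example = {"G-B GB-5 (A)": 0.55, "G-B RF-5 (M)": 0.40, "G-B RF-5 (H)": 0.10}
choice = pick_most_confident(example)
```

Here the hardware-trained model is selected because its non-cough prediction is the most confident of the three.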
Next, we further investigate the detailed performance of three types of models while testing on one of the five types of sounds within each category. In Fig. 3(a), we summarize the performance values of different models utilizing boxplots. In general, unguided models perform the worst among the three types of models, guided models with similar backgrounds perform the best, and among the three guided models human-made sound data-driven models perform the worst in the case of individual sound types. These findings are very similar to what we have observed so far in the case of aggregated analysis. Additionally, in Fig. 3(a), we find that compared to unguided models, guided models are, on average, around 65% (when tested on laughing and throat-clearing (T/C) sounds) and 20% (when tested on breathing, sneezing, and snoring) more accurate. That is, the difference between the unguided and the best guided model is huge when tested on laughing and throat clearing (T/C), compared to the remaining three human-made sounds (i.e., breathing, sneezing, and snoring).
Next, we investigate the low performance of all models when testing on the "laughing" and "throat clearing" (T/C) sounds (observed in Fig. 3(a)). In Fig. 3(b), we use the t-distributed stochastic neighbor embedding (t-SNE) plot, which is a way to explore the relationship among high-dimensional neighbors in a two-dimensional plane, to compare the data distribution of the "laughing", "throat clearing" (T/C), and "cough" sounds.
We also plot the t-SNE distribution of "sneezing" sounds (one of the sound types where models achieve high accuracy) to better understand the issues that lead to low performance when classifying the "laughing" and "throat clearing" sounds compared to other human-made sounds, such as "sneezing." In the figure, every data sample is a two-dimensional representation of the 120 features (obtained from the three sets of features, i.e., MFCCs, Δ, and Δ-Δ) of that sample. We observe that "laughing" and "throat clearing" sounds overlap with "cough" sounds more than the "sneezing" sounds do; thereby, classification models find it more challenging to distinguish "laughing" or "throat clearing" sounds from "cough" sounds compared to other human-made sounds, such as "sneezing."

2) Model Applicability Comparison Using Unknown Datasets: In the lower part of Table V, we find that guided models achieve relatively higher accuracy compared to the unguided and semi-guided models as before. Guided models achieve the highest average accuracy of 1.0 for the ESC coughs (with 10 random 90%-10% train-test splits) and the lowest average accuracy of 0.92 for COPD coughs. Furthermore, these guided models achieve at least 0.96 average accuracy when tested on SNP or COVID cough datasets.
In Fig. 3(c), we present a more detailed analysis of model performance using boxplots when testing the three types of ESC-cough trained models (i.e., unguided, semi-guided, and guided models) on various cough datasets. We find that "G-B RF-5 (H)" models, i.e., guided models developed using the ESC coughs as class-1 and hardware sounds as class-0, in general achieve higher accuracy than the other two types of guided models trained with class-0 comprised of animal sounds or human-made sounds, i.e., "G-B GB-5 (A)" and "G-B RF-5 (M)" models. In the figure, the "G-B RF-5 (H)" models consistently achieve more than 0.95 accuracy across all types of coughs, except the COPD coughs, where the models achieve more than 0.9 accuracy. On the other hand, guided models developed using class-0 comprised of animal sounds or human-made sounds are less accurate, i.e., lower than 0.75, when applied to COPD or COVID cough datasets.
Next, we utilize the t-SNE plot to investigate model performance variation across various cough datasets. In Fig. 3(d), we present the distribution of various coughs gathered from all four cough datasets, i.e., SNP (blue pentagons), ESC, COPD (pink crosses), and COVID (red triangles). When assessing model test performance, we use the original cough events obtained from the SNP, COPD, and COVID datasets. However, we use both original and augmented (17 augmentations discussed in Section II-D3) ESC coughs to train-test models using 90%-10% splits. Therefore, in the t-SNE plot, we consider both the original (black squares) and augmented (green circles) versions of ESC coughs, but the original coughs from SNP, COPD, and COVID datasets.
In Fig. 3(d), we find that SNP cough instances (blue pentagons) and ESC cough instances (black squares and green circles) are completely overlapped. Therefore, models developed with ESC coughs can easily identify SNP coughs. However, the COVID cough instances (red triangles) and COPD cough instances (pink crosses) create two clusters that are separable from the ESC coughs. Thereby, ESC-cough trained models struggle to identify COPD and COVID cough instances. Compared to the COVID cluster, the COPD cluster is composed of more instances. Therefore, ESC-cough trained models underperform when applied to identify COPD coughs compared to COVID coughs.
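The repeated random 90%-10% train-test splitting mentioned above for the ESC coughs can be sketched as follows. This is a minimal sketch rather than the procedure used in this work: the function name, the seed, and the sample count are hypothetical, and the handling of augmented versus original coughs across splits is not modeled.

```python
import random


def repeated_splits(n_samples, n_repeats=10, test_fraction=0.1, seed=0):
    """Return a list of (train_indices, test_indices) pairs.

    Each repeat shuffles all sample indices independently and holds out
    `test_fraction` of them for testing; train and test never overlap.
    """
    rng = random.Random(seed)  # fixed seed (assumed) for reproducibility
    n_test = max(1, round(n_samples * test_fraction))
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        splits.append((indices[n_test:], indices[:n_test]))
    return splits


# Hypothetical example: 50 samples, 10 random 90%-10% splits
splits = repeated_splits(n_samples=50)
```

A model would then be trained and tested once per split, and the reported value would be the average accuracy over the 10 repeats.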

IV. DISCUSSION
In this work, we attempt to develop three types of generic cough models based on a user's prior knowledge about the surrounding environment and try to detect different types of coughs, including coughs obtained from patients with two respiratory diseases (COVID-19 and COPD). We find that a user can expect to get better performance (ACC or F1 score) when identifying cough and non-cough sounds utilizing the best guided models compared to the unguided models. However, the guided models require a user to have a better understanding of the environment compared to the unguided models, where a user does not need prior knowledge about the surroundings. We also find that semi-guided models perform relatively better than the unguided models. Thereby, when a user does not have any idea about the environment, the user can start with the unguided models. As time passes and the user gains some idea about the environment, semi-guided models can replace the unguided models. Finally, when the user has a clear idea about the environment, guided models can replace the semi-guided models to provide a highly accurate decision.
We find that ESC cough-trained generic unguided, semi-guided, and best guided models achieve consistent accuracy across unknown datasets, except the COPD dataset (lower part of Table V). Therefore, disease-specific models can be developed to detect chronic coughs, such as COPD coughs. However, in the case of a sudden pandemic outbreak, such as COVID-19, it is difficult to find enough data from patients to train disease-specific models during the early stage of the outbreak. In such cases, we can start with generic cough models, and over time, we can develop mixed models from the generic cough models using transfer learning. Mixed models will require relatively fewer disease-specific coughs than the disease-specific models trained from more extensive disease-specific data.
A major limitation of our work is the limited number of cough and non-cough events and the unavailability of different non-cough human sounds obtained from patients in unknown datasets. However, we augment the original cough sounds to create the effect of changes in the natural environment and a user's physical condition or mental state, and randomly split the entire dataset 10 times when developing models to circumvent various issues, including overfitting and data sparsity. Therefore, our findings show promise, which can further be investigated and validated with a large-scale, extended-period longitudinal study with varying diseases, patient demographics, types of non-cough human sounds, and advanced models.
Furthermore, the drop in performance when testing healthy people's cough models on patients could be due to differences in voiced phases (e.g., frequency and noise) between coughs from patients and healthy people. Also, the voiced phase does not always appear and may get confused with some parts of laughing or throat clearing. Confounding factors, such as device variability, may affect data distribution (e.g., Fig. 3(d)) and model performance. Also, in real-world deployment, model performance can be affected by device positioning and placement. To overcome the barrier effect, some standard techniques can be adopted [56], [57], [58], [59]. All of these will require detailed investigation and are beyond the scope of this work. Additionally, in a real-world deployment, as the system transitions from unguided to semi-guided or guided models over time, the system can identify different background sounds using sound classification approaches [60], [61], [62] and retrain the initial unguided cough model to obtain more robust models utilizing relevant background sounds by following approaches similar to federated learning, environment knowledge broadcast among users, and place discovery [63], [64], [65], [66], [67], [68]. These are beyond the scope of this manuscript.
While our findings in this work show the promise of developing models to detect cough symptoms utilizing a user's background environment knowledge about the presence of different types of sounds, the effective applicability of such models for disease diagnosis depends on many other factors, including detecting other symptoms (e.g., breathing difficulty) and integration of the self-reported subjective symptoms in addition to objective predictions [69], [70], [71], [72], [73], [74], [75]. Additionally, people's medical history and health records can be integrated for better diagnosis of diseases and people's conditions. This will require careful investigation with additional large-scale longitudinal studies with diverse subject groups and diseases.

V. CONCLUSION
When evaluating our modeling approaches (i.e., unguided, semi-guided, and guided modeling approaches) using 10 random splits, we find that a user can expect to get 12%-28% higher accuracy and F1 score when identifying cough and non-cough sounds utilizing guided models compared to the unguided models (Table III). We also find that semi-guided models outperform the unguided models. While this work shows the feasibility of the approach, additional studies will be required for the clinical validation of models before commercializing the work.