User authentication system based on human exhaled breath physics

Mukesh Karunanethy; Rahul Tripathi; Mahesh V. Panchagnula; Raghunathan Rengaswamy

doi:10.1371/journal.pone.0301971

Abstract

This work, in a pioneering approach, attempts to build a biometric system that works purely based on the fluid mechanics governing exhaled breath. We test the hypothesis that the structure of turbulence in exhaled human breath can be exploited to build biometric algorithms. This work relies on the idea that the extrathoracic airway is unique for every individual, making the exhaled breath a biomarker. Methods including classical multi-dimensional hypothesis testing approach and machine learning models are employed in building user authentication algorithms, namely user confirmation and user identification. A user confirmation algorithm tries to verify whether a user is the person they claim to be. A user identification algorithm tries to identify a user’s identity with no prior information available. A dataset of exhaled breath time series samples from 94 human subjects was used to evaluate the performance of these algorithms. The user confirmation algorithms performed exceedingly well for the given dataset with over 97% true confirmation rate. The machine learning based algorithm achieved a good true confirmation rate, reiterating our understanding of why machine learning based algorithms typically outperform classical hypothesis test based algorithms. The user identification algorithm performs reasonably well with the provided dataset with over 50% of the users identified as being within two possible suspects. We show surprisingly unique turbulent signatures in the exhaled breath that have not been discovered before. In addition to discussions on a novel biometric system, we make arguments to utilise this idea as a tool to gain insights into the morphometric variation of extrathoracic airway across individuals. Such tools are expected to have future potential in the area of personalised medicines.

Citation: Karunanethy M, Tripathi R, Panchagnula MV, Rengaswamy R (2024) User authentication system based on human exhaled breath physics. PLoS ONE 19(4): e0301971. https://doi.org/10.1371/journal.pone.0301971

Editor: Sandip Varkey George, University of Aberdeen, UNITED KINGDOM

Received: September 14, 2023; Accepted: March 26, 2024; Published: April 22, 2024

Copyright: © 2024 Karunanethy et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are available from the ‘Harvard Dataverse’ database (DOI: https://doi.org/10.7910/DVN/MKVJQT).

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors declare there is a patent application on the discussed technology filed at Indian Patent Office, under the title ‘Exhaled breath based user authentication and diagnosis’ (PATENT PENDING - 202241065024). This does not alter our adherence to PLoS ONE policies on sharing data and materials.

Introduction

Human exhaled breath is largely turbulent. During exhalation, air is forced out of the lungs through the trachea by the contracting diaphragm. To start with, the Reynolds number (a dimensionless quantity defined as the ratio of inertial to viscous forces within a fluid) associated with flow through the trachea is sufficiently high, typically ranging from around 2300 for silent breathing to over 9000 for vigorous breathing, indicating a highly turbulent flow [1–4]. In addition, as the air passes through the trachea, it interacts with the complex internal structures of the upper respiratory tract, leading to complexity in the flow [3, 5]. The upper respiratory tract consists of the larynx, the pharynx, and the oral cavity. Owing to the complexity of the interaction between air that is already turbulent [3, 4] and the upper respiratory tract, we hypothesize that the turbulent signatures in the exhaled air are unique and identifiable from person to person. A plausible way to test this hypothesis is to build a user authentication system that answers the question of whether a human subject can be classified purely on the basis of the fluid dynamics of the exhaled breath, essentially serving the purpose of a biometric user authentication system. Such a system is a real-time system that verifies a user’s identity using any measured feature pertaining to the user’s physiology or behaviour. Thus, authentication can be broadly seen as comprising two classes of methods: physiological biometrics (e.g., fingerprints, iris scans, facial recognition, etc.) and behavioural biometrics (e.g., gait analysis, voice ID, breathing gesture [6], etc.). There are two major modes of deployment of a user authentication/access system [7]: (i) user confirmation or verification, and (ii) user identification. In the confirmation mode, a user declares his or her identity, which is to be confirmed. In this case, the user’s biometric data is compared to a specific set of data of the same person obtained during an enrollment process. In the identification mode, a user does not disclose his or her identity. In that case, the user’s data is compared with all registered data in the database of bona fide users, and the user is identified. We will discuss algorithms for testing the two biometric modes in this manuscript and argue that exhaled breath contains sufficient information to implement both biometric modes.

Human exhaled breath has proven to be a non-invasive diagnostic tool for a spectrum of medical problems as well. [8] studied the diagnosis of malarial infection by analysing the breath composition, or “breathprint” which contains a series of volatile organic compounds (VOCs) produced by the P. falciparum-infected erythrocytes. They built a nearest mean binary classifier with leave-1-breath-sample-out cross-validation scheme to assign predictions. The European Respiratory Society (ERS) technical standard [9] reported that the fraction of nitric oxide in exhaled gas is a potential biomarker for lung diseases. [10] showed the potential of breath-based metabolomics (breathomics) in personalised medicine. Mass spectrometry is one of the main platforms used for data profiling in these techniques. In their study, [11] reported enhancements required in the analysis of single exhaled breath metabolomic data for the unique identification of patients with acute decompensated heart failure. [12] made attempts to develop a breath analyzer system to measure blood glucose levels and to classify diabetic/non-diabetic patients using a support vector machine (SVM) classifier based on acetone levels in breath measured using chemical sensors. [13] reviewed various breath sampling methods with a bibliometric study. [14–16] studied the potential advantages of breath tests as a non-invasive technique with potential biomarkers in disease diagnosis. The above efforts in the literature proving exhaled breath as a biomarker largely involve the analysis of its chemical composition by various techniques. In other words, these studies have shown that the compounds present in exhaled air produce a molecular signature. There exists no evidence in the literature of any attempt to develop an identifier purely based on the fluid dynamic aspects of the exhaled airflow.

Respiratory flow measurements are widely performed using spirometers and pneumotachographs. Inspirational flow patterns in humans were studied using measurements from a cycloergometer to theoretically estimate mechanical work during inhalation by [17]. [18] studied the human respiratory flow patterns using pneumotachographic flow measurements at the mouth. Hot wire anemometry (HWA) has been used by several researchers in the past for respiratory flow measurements. [19] demonstrated the application of HWA in respiratory flow measurements in small animals. [20] investigated the performance of a constant temperature hot wire anemometer (CT-HWA) system for respiratory gas flow rate measurements. The study demonstrated that a CT-HWA will meet the response requirements and be insensitive to changes in temperature and humidity that are frequently experienced in respiratory flows. In the research by [21] and later by [22], it was shown that CT-HWA can be used to measure fluid flow in the forced oscillations technique applied to the human respiratory system, as a substitute for the pneumotachograph. Other studies reporting the implementation of CT-HWA for measuring expiratory flow parameters are by [23, 24]. [25] showed that CT-HWA can be used as a flow transducer for spirography. In conclusion, HWA is a robust tool for obtaining time-resolved turbulence signature measurements in flows. Most of the work in the literature has taken advantage of the HWA data for flow rate calculations, effectively using it only as an alternative for spirometry-based studies. We propose to use HWA measurements (the complete time series of instantaneous velocity data) of turbulence in human exhaled breath as input signals for the development of a biometric system.

Behavioural biometrics use a person’s gestures, such as gait patterns or breathing gestures. Recent work by [6, 26] revealed the prospects of exploiting breathing acoustics for user authentication. They built a new behavioural biometric signature called BreathPrint based on audio features acquired from a microphone sensor in smartphones, wearables and other IoT devices. [6] deployed a conventional machine learning model based on the Gaussian mixture model (GMM), while [26] established the feasibility and evaluated the performance of Recurrent Neural Network (RNN)-based deep learning models. A novel WiFi-based breathing estimator, UbiBreathe, developed by [27], works as a respiratory monitoring system based on the received signal strength (RSS) data from a nearby WiFi-enabled device. A continuous user verification system was developed using this approach by [28] for round-the-clock user verification, built on user-specific respiratory features derived from waveform morphology analysis and fuzzy wavelet transformation. A deep learning-based scheme also detects the existence of spoofing attacks. [29] developed a speaker recognition system, BreathID, based on breath biometrics. Breath during speech is usually considered trivial or a noise component. They showed that unique breath features can be formulated by a template matching technique for speaker recognition.

In summary, the use of HWA and, more broadly, breath turbulence measurements as a tool for biometric authentication has not been attempted in the literature. Conventional biometric systems such as voice, face, and fingerprint recognition have their own disadvantages. There is a need to develop more sophisticated biometric systems that could make use of internal physiological features of the human body. We attempt to build a novel user authentication system based on human exhaled breath, using the principles of multidimensional hypothesis testing and machine learning. This system is different from an acoustics-based biometric system, since it does not require vocal data from the human subject and is built solely on the fluid dynamic information contained in the exhaled breath.

The experimental dataset and methodology

A measurement-based study was employed to develop algorithms for biometric authentication. Measurements of the exhaled breath were made using a Dantec Dynamics® 55P11 hot wire probe. It consists of a 5 μm diameter, 1.25 mm long platinum-coated tungsten wire, which acts as the sensor. A Dantec Dynamics MiniCTA® 54T42 module housed the CT-HWA’s signal processing and output system. The hot wire probe was calibrated using a standard procedure of simultaneous measurement of the flow velocity and the anemometer voltage. The calibration was performed using a Dantec Dynamics StreamLine Pro® automatic calibrator, over a velocity range of 0–5 m/s. Using this procedure, we were able to determine the calibration constants from an assumed velocity-voltage relation. This relation is a least-squares polynomial fit of order 4 in the velocity-voltage space, as shown in Fig 1. In the current study, the raw voltage time series was itself used in all the analyses. This helps us avoid frequent re-calibration of the probe. The initial calibration was performed only to make sure that the voltage and velocity signals are monotonically positively correlated (as can be inferred from the least-squares fit in Fig 1).


Fig 1. Calibration curve for the hot wire anemometer.

A fourth order least square fit of the experimental data (shown as maroon dotted line) becomes the calibration curve for the hot wire anemometer in use. The polynomial equation of the fourth order fit is shown inside the plot.

https://doi.org/10.1371/journal.pone.0301971.g001

Participants

94 participants were recruited to take part in this study, following the ethical approval from the Institutional Ethics Committee (IEC) of the Indian Institute of Technology Madras, Chennai, India (IITM—IEC Protocol No. IEC/2018–03/MP/01). The participants were students of the Indian Institute of Technology Madras. Their age ranged from 21 to 27 years. Data were collected only once (one set of 10 breath samples) per participant. Volunteers with epileptic disorder were excluded from participation. The experimental data collection was carried out between 8th and 17th January, 2019. All volunteers who participated in this study had given their written informed consent. The recorded time series data were analyzed anonymously.

Data collection and analysis

A schematic of the experimental setup is shown in Fig 2A. It consists of a mouthpiece assembled into an aluminium channel of circular cross-section, which housed the hot-wire probe aligned to its axis to measure the streamwise component of the turbulent exhaled flow velocity. The human subjects exhaled through their mouths into the experimental measurement setup. The nose was clipped during data recording to ensure that all the exhaled air passed through the oral cavity before entering the experimental setup. Each human subject was provided with a new disposable plastic mouth-piece to wrap their mouth around, through which the subjects exhaled. Obstruction of the flow by the tongue was avoided by placing the mouth-piece above the tongue. Data were obtained in exhalation trials lasting about 1.5 seconds each, with 10 trials recorded per subject. Each time series was recorded by sampling the voltage response at 10 kHz. This gives 15000 data points per time series, the relevance of which is discussed in the following sections. The time series signal from a typical exhalation trial is shown in Fig 2B. We investigated the multifractal properties of the time series, since our analysis, discussed in detail in Part 1 of S1 Text, found that human exhaled breath displays multifractality. This was performed using the well-known technique called multifractal detrended fluctuation analysis (MFDFA) developed by [30].


Fig 2. Experimental setup and recorded time series.

(A) Depiction of the experimental setup for data collection. It consists of a disposable mouth-piece, a mouth-piece mount housing a hot wire anemometer and a data acquisition system. (B) A typical human exhalation velocity signal measured using a standard hot wire anemometer. The time signals were sampled at 10kHz for 1.5 seconds.

https://doi.org/10.1371/journal.pone.0301971.g002

Given a set of time series signals from a library of users, our algorithm comprises segmentation, normalization, feature extraction and subdivision of the feature set into training and testing sets. The training dataset became part of the enrolled database, whereas the testing dataset was used for testing the performance of the authentication algorithms. The enrollment and algorithm testing depend on the type of algorithm being used. More details of user authentication systems are discussed in the section titled User confirmation algorithms.

Time series segmentation, normalization and selection

Segmentation of time series is a standard practice in many data analysis techniques to obtain dividing points on a signal with or without stationarity. In machine learning problems with limited availability of time series samples, segmentation is of vital importance. By performing an efficient segmentation on the basis of certain statistical measures, we can obtain sufficient number of samples to train and test machine learning models. Fig 2B is a plot showing the instantaneous voltage response from the hot wire probe for 1.5 seconds. It was obtained by sampling at a frequency of 10kHz, giving us a sufficiently resolved long series to perform segmentation without losing any significant information on the flow physics.

We will now discuss the segmentation process. Each time signal was divided into 19 overlapping segments using a window size equal to 1/10th the length of the signal and a sliding width of half the segment size. Note that machine learning models may tend to overfit the training data when there is a large number of overlapping segments. The purpose of using overlapping windows was to capture the end effects of the time series segments during feature extraction. The chosen segment width and sliding width are therefore justified, as each part of the time signal appears in at most two segments. This gives 1500 data points in each segment, making it sufficiently long to reliably extract features using the tools discussed in this manuscript. As a result, a maximum of 190 representative time blocks become available for the analysis for each user. Each of the time signals was normalized before feature extraction, making the time series comparable across realizations. This also makes all signals independent of the sensor being used for the measurement, since the features rely only on the temporal correlation structure in the series and not on the raw data values. This approach can be termed sensor-agnostic. Regardless of whether the time series signal is measured using a hot-wire/film probe or a laser-based technique, the performance of the algorithm will not be affected, as long as there are sufficient data points to properly capture the temporal structure in the flow. We then build an algorithm which works with these features, which are invariant to the absolute value of the time series. z-score normalization, popularly known as standardization, was used to normalize the time series. To perform z-score normalization, the mean of the entire time series is subtracted from each data point in the time series. Then, the resulting values are divided by the standard deviation of the time series. This scales the data so that it has a mean of zero and a standard deviation of one. The resulting normalized time series will have values that represent the number of standard deviations away from the mean. The z-score normalization has the form shown in Eq 1.

$$z(i) = \frac{x(i) - \mu_t}{\sigma}, \qquad i = 1, 2, \ldots, N \tag{1}$$

where z(i) is the normalized time series, x(i) is the original time series of length N, μ_t is the mean of the time series, and σ is the standard deviation of the time series. The time series becomes unitless after normalization.
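The normalization and segmentation steps described above can be illustrated with a minimal sketch. It assumes NumPy, a synthetic random signal standing in for a recorded voltage trace, and the population standard deviation (the text does not say which estimator was used); the function names `zscore` and `segment` are hypothetical.

```python
import numpy as np

def zscore(x):
    """Standardize a 1-D series to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    # Assumption: population standard deviation (ddof=0).
    return (x - x.mean()) / x.std()

def segment(x):
    """Split x into windows of len(x)//10 samples, sliding by half a window."""
    x = np.asarray(x)
    width = len(x) // 10
    step = width // 2
    starts = range(0, len(x) - width + 1, step)
    return np.stack([x[s:s + width] for s in starts])

# Synthetic stand-in for one 1.5 s exhalation sampled at 10 kHz.
rng = np.random.default_rng(0)
signal = rng.normal(loc=2.0, scale=0.3, size=15000)

z = zscore(signal)
blocks = segment(z)
print(blocks.shape)  # (19, 1500)
```

With 10 recordings per user, stacking the blocks from every recording gives the 190 time blocks mentioned in the text.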

MFDFA was performed on all normalised time series, and it revealed that not all spectra exhibit the expected shape. The general shape of a multifractal spectrum is convex or more precisely an inverted parabola, with the peak occurring at the central moment. This convex shape signifies the presence of multifractal scaling, indicating that different parts of the time series exhibit distinct scaling behaviors. Certain time segments were observed to result in a spectrum with folds or distortions. Fig 3 shows an example of such a distortion. The multifractal spectrum for a time signal and three randomly chosen segments X, Y and Z from the same time series are displayed. Fig 3A shows the entire time signal and the chosen segments. Out of the three segments, X and Z show a typical spectral shape, whereas segment Y consists of a fold towards the left hand side of the spectrum (see Fig 3B).


Fig 3. Multifractal spectra for different segments of a time signal.

The multifractal spectra corresponding to the entire time signal (maroon) and time segments X, Y and Z (black, bounded by gray band) in (A) are shown in (B). Segments X and Z exhibit an inverted-parabola shape, whereas the spectrum of segment Y has a distortion.

https://doi.org/10.1371/journal.pone.0301971.g003

There could be several reasons for the appearance of folds in the multifractal spectrum. (i) They could occur due to irregularities or data artifacts in the time series itself, such as noise or outliers, which may arise from inconsistent exhalation by the user during data acquisition. For example, suppose that during the 1.5-second trial the user exhales abruptly for the first 1 second, and the breath velocity then decays steadily for the remaining 0.5 seconds. The segment that falls between these two regions might contain irregularities, which could introduce inconsistencies in the scaling behaviour. (ii) The spectrum may be affected by the non-stationarity of the time series, that is, when the statistical properties change with time, for instance due to a change in breath velocity. (iii) Spectral folds might even arise due to the finite size of the time segment, since a limited number of data points may not capture the scaling properties at different scales. Investigating the type of distortions or the reason behind this behaviour of the spectrum for certain time segments fell outside the scope of this work. Instead, we made use of this behaviour as an indicator to judge whether a segment is valid or not. All segments which showed non-convex singularity spectra were discarded in our analysis. Segments which produced a spectral width of less than 0.05 were also rejected, since they exhibit a very low degree of multifractality. These two strategies effectively make MFDFA a tool for time series selection, ahead of further feature extraction and analysis. Any time signal which contains a significant number of segments with inconsistent scaling behaviour can be rejected using this tool during the data recording step itself. A numerical example discussing how a multifractal singularity spectrum can have non-convex shapes can be found in [31].
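The two rejection rules can be sketched as a simple check on a precomputed spectrum. The text does not give the exact numerical criterion used to detect folds, so the test below is an assumption: a spectrum is accepted only if α varies monotonically along the moment ordering, the slope of f(α) decreases steadily (an inverted-parabola-like dome), and the width is at least 0.05. The function name `is_valid_spectrum` and the tolerance are hypothetical.

```python
import numpy as np

def is_valid_spectrum(alpha, f, min_width=0.05, tol=1e-9):
    """Accept a spectrum only if it is fold-free, dome-shaped and wide enough.

    alpha, f: singularity strengths and spectrum values, ordered by moment q.
    """
    alpha = np.asarray(alpha, dtype=float)
    f = np.asarray(f, dtype=float)
    d_alpha = np.diff(alpha)
    # Assumed fold criterion: alpha reverses direction along the q ordering.
    if not (np.all(d_alpha > 0) or np.all(d_alpha < 0)):
        return False
    if alpha[0] > alpha[-1]:
        alpha, f = alpha[::-1], f[::-1]  # work with increasing alpha
    slopes = np.diff(f) / np.diff(alpha)
    dome = np.all(np.diff(slopes) <= tol)  # slope never increases
    wide_enough = (alpha[-1] - alpha[0]) >= min_width
    return bool(dome and wide_enough)

# Synthetic spectra for illustration only.
alpha = np.linspace(0.8, 1.4, 21)
f = 1.0 - ((alpha - 1.1) / 0.3) ** 2
folded_alpha = np.concatenate(([0.90, 0.85], alpha[2:]))
narrow_alpha = np.linspace(1.00, 1.03, 21)
narrow_f = 1.0 - ((narrow_alpha - 1.015) / 0.015) ** 2

print(is_valid_spectrum(alpha, f))              # True
print(is_valid_spectrum(folded_alpha, f))       # False
print(is_valid_spectrum(narrow_alpha, narrow_f))  # False
```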

Feature extraction

Features were extracted from normalized time signals using various time series feature extraction techniques. Unlike other physiological biometric systems, where image-based patterns or features are used as templates to match an individual’s identity, our input data is a time series from an individual, which requires feature extraction. Several features of the time series were studied in order to develop insights into the data. The multifractal spectral information was incorporated into our analysis by including it in the set of features. The fact that the time series contains information pertaining to the correlation structure becomes relevant to machine learning algorithms. In keeping with this principle, we extract a set of three important features from the spectrum: (i) β, the abscissa corresponding to the spectral maximum, (ii) ω, the width of the spectrum, and (iii) ϵ, the bias or asymmetry parameter of the spectrum. The parameters β, ω and ϵ are dimensionless. These features are visualised on the multifractal spectrum of an exhaled breath time signal in Fig 4. Our analysis also showed clear differences between spectra; i.e., parameters such as β, ω and ϵ were different for different time signals. Several other multifractal spectral features have also been considered in the literature [31–33]. We chose these three features for simplicity and because they capture the most important descriptors of a multifractal spectrum. Investigating how distinctive these features are is of interest to this work.
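As an illustration, the three spectral features could be read off a computed spectrum as follows. This is a sketch only: β and ω follow the definitions given above, but the text does not give a formula for ϵ, so the normalized left/right imbalance used here is an assumed definition, and `spectrum_features` is a hypothetical name.

```python
import numpy as np

def spectrum_features(alpha, f):
    """Return (beta, omega, epsilon) for a singularity spectrum f(alpha)."""
    alpha = np.asarray(alpha, dtype=float)
    f = np.asarray(f, dtype=float)
    beta = alpha[np.argmax(f)]          # abscissa of the spectral maximum
    omega = alpha.max() - alpha.min()   # spectral width
    left = beta - alpha.min()
    right = alpha.max() - beta
    # Assumed asymmetry measure: 0 for a symmetric spectrum,
    # positive when the right branch is longer than the left.
    epsilon = (right - left) / omega
    return beta, omega, epsilon

# Synthetic symmetric spectrum for illustration only.
alpha = np.linspace(0.8, 1.4, 21)
f = 1.0 - ((alpha - 1.1) / 0.3) ** 2
beta, omega, epsilon = spectrum_features(alpha, f)
print(round(beta, 3), round(omega, 3), round(abs(epsilon), 3))  # 1.1 0.6 0.0
```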


Fig 4. The multifractal spectrum.

Plot of the spectrum of singularities f(α) against the singularity strength α, computed for an exhalation time series segment. The parameters β, ω and ϵ are the features that characterize a multifractal spectrum.

https://doi.org/10.1371/journal.pone.0301971.g004

In addition to the use of MFDFA as a feature extraction algorithm, we also use an automated time series feature extraction algorithm named tsfresh (Time Series FeatuRe Extraction on the basis of Scalable Hypothesis tests) developed by [34]. The tool generates over 700 time series features using 63 different time series characterization methods. The following discussion pertains to the preparation of dataset for model building, training and testing of the algorithms. A consolidated pipeline of the algorithm towards model library building including time series normalization, and selection, followed by feature extraction and reduction, is shown in Fig 5.


Fig 5. Flow chart of the algorithm.

Flow chart showing the algorithm pipeline, including time series normalization, filtering, feature extraction, feature reduction, and data splitting into training and testing. The time signal shown here is one of the segments of the original time series. Note that the representation of blue bar for training dataset and green bar for testing dataset will be consistent in further discussions in this manuscript. The training data of all users were used for building ⁿC₂ binary classifier models, which becomes the process known as enrollment.

https://doi.org/10.1371/journal.pone.0301971.g005

Features extracted by these algorithms from all available time series are concatenated and passed through a low-variance filter. This was done to remove those feature columns with a variance below a given threshold, which in our case was 1%. The rationale behind applying this low-variance filter was to eliminate features that exhibit very little variation across instances. Such low-variance features may not provide useful insights for classification tasks. Furthermore, highly correlated features were removed from the feature set. A correlation threshold of 80% was chosen for this purpose. Removing features by these techniques reduces the dimensionality, simplifies the model, and potentially improves model performance by focusing on more informative features. All features which were derived from the absolute values of the time series, such as maximum/minimum values, quantile information, etc., were disregarded. For example, including the mean value of a signal would bias the algorithms and allow them to classify on the basis of the mean value itself, which was undesired. It was observed that different human subjects were able to exhale in different velocity bands depending on their lung capacity. The filtered feature matrix thus obtained is a stack of vectors from each time series sample available, and it consisted of approximately 450 time series features. This feature space is high dimensional and may contain redundant features that can be excluded. The reduced feature set will also reduce the computational complexity of the algorithms. We adopted a feature selection method using binary random forest classifiers. Binary classifiers were built on pairwise combinations of the users’ feature datasets. The importance of the features can be quantified for every random forest binary classifier by estimating how much the random forest’s performance would suffer if a given feature were to be eliminated. This impurity-based feature importance developed by [35] was used for picking the top features. The top 10 most prevalent features among users were chosen as the feature space after computing the top 10 features from each classifier. In the later sections of this manuscript, the methods used for model construction and the physical insights of these features will be described. The reduced feature matrix thus obtained contains features of all the users in the database. For each user, the dataset was split into training (60%) and test (40%) sets. It is important to note that this splitting was done after shuffling between groups of features corresponding to the 19 time blocks for each subject. Recall that there were 190 time signals for each user in the database, with each set of 19 signals coming from a single recorded time series (see subsection titled Time series segmentation, normalization and selection). Shuffling without grouping would result in the same information being spread across the training and testing datasets, which was undesired. By grouping, we made sure that out of 10 exhaled breath samples, 6 became part of the training set and 4 became part of the test set. The training feature set was used to build the model library and the test feature set was used for user confirmation/identification tests.
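The filtering and grouped splitting steps can be sketched on synthetic data as follows. Several details are assumptions rather than the authors' implementation: the 1% variance threshold is read as an absolute variance of 0.01, the 80% correlation threshold as an absolute Pearson correlation of 0.8 with a greedy keep-first rule, and the helper names are hypothetical. The random forest importance ranking is not reproduced here.

```python
import numpy as np

def drop_low_variance(X, threshold=0.01):
    """Keep columns whose variance exceeds the threshold (assumed absolute)."""
    keep = np.flatnonzero(X.var(axis=0) > threshold)
    return X[:, keep], keep

def drop_correlated(X, threshold=0.8):
    """Greedily keep a column only if |r| <= threshold with every kept column."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], np.array(keep)

def grouped_split(groups, n_train_groups=6, seed=0):
    """Assign whole recordings (groups) to train or test, never splitting one."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(groups))
    train_mask = np.isin(groups, shuffled[:n_train_groups])
    return train_mask, ~train_mask

# Synthetic stand-in for one user: 10 recordings x 19 segments, 5 features.
rng = np.random.default_rng(1)
base = rng.normal(size=(190, 3))
X = np.column_stack([
    base[:, 0],
    2.0 * base[:, 0] + 0.01 * rng.normal(size=190),  # near-duplicate of column 0
    1.0 + 1e-3 * rng.normal(size=190),               # almost constant
    base[:, 1],
    base[:, 2],
])
groups = np.repeat(np.arange(10), 19)  # recording index of each segment

X_var, _ = drop_low_variance(X)
X_red, _ = drop_correlated(X_var)
train, test = grouped_split(groups)
print(X_red.shape, train.sum(), test.sum())  # (190, 3) 114 76
```

Splitting on the recording index rather than on individual segments is what keeps overlapping segments of one exhalation from appearing on both sides of the split.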

Building of model library

We have formulated the multi-class classification problem as a series of binary classification problems. Several studies have explored the application of pairwise binary classifiers for handling multi-class problems. A description of this technique, which is also known in the literature as class binarisation or round robin classification, can be found in [36]. [37–39] are a few others who have studied class binarisation for multi-class classification. In order to perform tests with a machine learning based algorithm, it was necessary to build binary classifier models using binary combinations of the training datasets, and these models were stored in a model library. Computational simulations were set up to evaluate the performance of the user confirmation and identification algorithms. Let us briefly see how the model library grows with the addition of users to the existing database of users. This is known as the enrollment mode of the biometric system. Suppose there are n distinct users U₁, U₂, …, U_n in the current state of the users’ database. ⁿC₂ binary classifier models can be built, which make up the complete model library. With the addition of a user, the updated size of the users’ database becomes n + 1. Therefore, the size of the model library increases by n and becomes ⁿ⁺¹C₂. This growth can be expressed as

$${}^{n+1}C_2 = {}^{n}C_2 + n \tag{2}$$

This means that when a new user is added to the users’ database, n additional binary classifier models are to be built and stored in the model library. Expectedly, this follows a second-order power-law variation of the form y = ax^m with the multiplication factor a ≈ 0.5 and exponent m ≈ 2.
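The growth of the model library can be checked numerically with a few lines. This is only an arithmetic illustration using the standard library; `library_size` is a hypothetical helper name.

```python
from math import comb

def library_size(n_users):
    """Number of pairwise binary classifiers for n_users enrolled users."""
    return comb(n_users, 2)

# Enrolling one more user adds exactly n new models.
for n in (2, 10, 93):
    assert library_size(n + 1) - library_size(n) == n

# nC2 = n(n - 1)/2, which approaches 0.5 * n**2 for large n.
print(library_size(94))                    # 4371
print(round(library_size(94) / 94**2, 3))  # 0.495
```

For the 94 subjects in this dataset, the complete library therefore holds 4371 binary models.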

User confirmation algorithms

Two different user confirmation algorithms were built using the extracted feature data. The first approach was based on statistical hypothesis testing, which involves the testing of a null hypothesis against an alternative hypothesis. The second approach was based on machine learning models. For the machine learning based algorithm, the training data were used to build random forest binary classifier models, thereby creating a library of models. For the hypothesis testing based algorithm, no model building is needed, and the predictions are made based on the hypothesis test results between a user’s test data and the available training data, making it an instance-based algorithm. These algorithms will be referred to as UCA.HT (User Confirmation Algorithm - Hypothesis Testing) and UCA.ML (User Confirmation Algorithm - Machine Learning) in later sections. Hotelling’s T² test [40], which is a multidimensional version of Student’s t-test, was used in UCA.HT.

Confirmation algorithm based on hypothesis testing

The use of hypothesis testing as an instance-based binary classifier has been attempted in the literature. [41] compared the machine learning approach and the statistical testing based on p-variations, and the idea of instance-based classification by hypothesis testing was investigated by [42]. [43] provided a detailed description of how binary decision problems can be formulated as hypothesis testing and/or binary classification. In a system based on hypothesis testing, the library comprises the training datasets of all the users. Since we are building an algorithm which is intended to work alongside a machine learning algorithm, we formulate the hypothesis test based algorithm to work on binary pairs of users. To be more precise, the library will comprise training datasets of pairs of users. It will be referred to as user-pair data in further discussions. Fig 6 shows a flow chart of the user confirmation algorithm based on hypothesis testing principles. The equality-of-means test was performed between a test dataset and each training dataset in the pairs present in the library to infer whether the null hypothesis is to be rejected or not, as depicted in Fig 6. Here, the null hypothesis states that the two samples come from the same distribution (H₀: μ_a = μ_b), and the alternative hypothesis states that the samples come from different distributions (H₁: μ_a ≠ μ_b). A detailed description of the test statistic and formulation of Hotelling’s T² test can be found in the original work by [40].
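For readers unfamiliar with the test, the standard two-sample form of Hotelling’s T² statistic with a pooled covariance matrix can be sketched as follows. This is the textbook formulation, not necessarily the exact implementation used in this work; the data are synthetic, the sample sizes (114 training and 76 test segments, 10 features) are chosen only to mirror the split described earlier, and `hotelling_t2` is a hypothetical name.

```python
import numpy as np

def hotelling_t2(a, b):
    """Two-sample Hotelling T^2 with pooled covariance.

    a, b: arrays of shape (n_a, p) and (n_b, p), with p >= 2 features.
    Returns T^2, the F-scaled statistic, and its degrees of freedom.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, p = a.shape
    n_b = b.shape[0]
    diff = a.mean(axis=0) - b.mean(axis=0)
    pooled = ((n_a - 1) * np.cov(a, rowvar=False)
              + (n_b - 1) * np.cov(b, rowvar=False)) / (n_a + n_b - 2)
    t2 = (n_a * n_b) / (n_a + n_b) * diff @ np.linalg.solve(pooled, diff)
    # Under H0, f_stat follows an F distribution with (p, n_a + n_b - p - 1) dof.
    f_stat = (n_a + n_b - p - 1) / (p * (n_a + n_b - 2)) * t2
    return t2, f_stat, (p, n_a + n_b - p - 1)

# Synthetic feature sets for illustration only.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(114, 10))
same = rng.normal(0.0, 1.0, size=(76, 10))
shifted = rng.normal(1.0, 1.0, size=(76, 10))

_, f_same, dof = hotelling_t2(train, same)
_, f_shifted, _ = hotelling_t2(train, shifted)
print(dof, f_shifted > f_same)  # (10, 179) True
```

The p-value is the upper tail probability of the F distribution at `f_stat`, which could be obtained with, for example, `scipy.stats.f.sf(f_stat, *dof)`.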


Fig 6. User confirmation algorithm based on hypothesis testing.

A flow chart of the user confirmation algorithm based on hypothesis testing. The user confirmation block is reused in the user identification algorithm later in this manuscript. An example of the hypothesis test against a user-pair is illustrated inside the dotted box, linked to the user confirmation block by the red asterisk. Given a user i, the output of the user confirmation block is re-posed, based on a threshold, as an answer to the question “Are you indeed User i?”.

https://doi.org/10.1371/journal.pone.0301971.g006

When a test user, say ‘User i’, provides the input, pairwise Hotelling’s T² tests are performed between the test user’s data and the training data of the n − 1 user-pairs that include ‘User i’, where n is the number of users in the database. Consider one of those tests, shown inside the dotted box in Fig 6. Performing a hypothesis test against a user-pair, for example (1, 2), gives a pair of p-values, (p₁, p₂). The tests were performed at a confidence level of 99.9%, so a p-value of 0.001 or less was sufficient to reject the null hypothesis. At least one of the two p-values needs to be above 0.001 for the algorithm to accept the null hypothesis; the predicted user is then the user corresponding to the higher p-value. If both p-values are equal to or below 0.001, no prediction is made. After the test, the prediction is re-posed as an answer to the question “Is it User i? (Yes/No)”. The pipeline discussed so far becomes the ‘User Confirmation Block - HT’ for the hypothesis testing based algorithm. The output of this block is a scalar v equal to the count of predictions that say ‘Yes’. Here, a threshold of 50% of the predictions was used to define the minimum confidence of confirmation. This means that User i is confirmed when HT(i, i) accepts the null hypothesis and HT(i, j), for all j = 1, 2, …, n with j ≠ i, rejects it in at least 50% of the cases. Here, HT(i, j) stands for the hypothesis test between User i and User j.
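The decision rule of the confirmation block described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ implementation: the p-values are assumed to have been computed already by Hotelling’s T² tests, the function names and the dictionary layout are hypothetical, and the 50% threshold is applied to the number of user-pair tests actually run.

```python
ALPHA = 0.001  # rejection level matching the 99.9% confidence level in the text


def predict_from_pair(p_values, alpha=ALPHA):
    """Return the user with the higher p-value, or None if both tests reject.

    p_values is a hypothetical mapping {user_id: p_value} for one user-pair.
    """
    accepted = {user: p for user, p in p_values.items() if p > alpha}
    if not accepted:
        return None  # both p-values <= alpha: no prediction is made
    return max(accepted, key=accepted.get)


def confirmation_block_ht(claimed_user, pair_results, threshold=0.5):
    """Count the 'Yes' votes for claimed_user and compare with the threshold.

    pair_results holds one {user_id: p_value} mapping per user-pair that
    includes claimed_user (n - 1 mappings for n enrolled users).
    """
    votes = sum(predict_from_pair(p) == claimed_user for p in pair_results)
    confirmed = votes >= threshold * len(pair_results)
    return votes, confirmed


# Toy p-values for User 1 tested against the pairs (1, 2), (1, 3), (1, 4).
toy_results = [
    {1: 0.40, 2: 0.0002},    # only User 1 is accepted: predicts 1
    {1: 0.05, 3: 0.30},      # both accepted, User 3 is higher: predicts 3
    {1: 0.0001, 4: 0.0005},  # both rejected: no prediction
]
votes, confirmed = confirmation_block_ht(1, toy_results)
print(votes, confirmed)
```

In this toy case only one of the three tests votes for User 1, which is below half of the tests, so the claim is not confirmed.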

The equality-of-means test can be viewed from two perspectives: (a) testing the distribution of the test data against the distribution of each of the n training datasets; (b) testing the distribution of the test data against the distributions of the training data in pairs, as discussed so far. The former strategy produces n test results, and the algorithm faces one of three scenarios: (i) if only one test accepts the null hypothesis, the predicted user is the user corresponding to that test; (ii) if more than one test accepts the null hypothesis, the predicted user is the user corresponding to the test with the highest p-value. In either case, the user is confirmed if the predicted user matches the test user, and not confirmed otherwise; (iii) if all tests reject or no test rejects the null hypothesis, the user is not confirmed. Although procedure (a) is the computationally simpler formulation, procedure (b) is more relevant to our study since we are trying to build a multi-model approach for user identification. It was also noted that procedure (b) gave better confirmation results (for UCA.HT) than procedure (a).
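For comparison, procedure (a) can be sketched as follows. The sketch assumes the n p-values are already available and checks scenario (iii) before scenarios (i) and (ii). That ordering is an interpretation made here, since the text does not say which rule takes precedence when every test accepts. The function name and input layout are hypothetical.

```python
def confirm_procedure_a(test_user, p_values, alpha=0.001):
    """Procedure (a): one test per enrolled user.

    p_values is a hypothetical mapping {user_id: p_value} over all n users.
    """
    accepted = {user: p for user, p in p_values.items() if p > alpha}
    # Scenario (iii): every test rejects, or no test rejects.
    if len(accepted) == 0 or len(accepted) == len(p_values):
        return False
    # Scenarios (i) and (ii): the accepted test with the highest p-value
    # gives the predicted user, who must match the test user.
    predicted = max(accepted, key=accepted.get)
    return predicted == test_user


single_accept = confirm_procedure_a(2, {1: 0.0001, 2: 0.20, 3: 0.0003})
wrong_winner = confirm_procedure_a(2, {1: 0.50, 2: 0.20, 3: 0.0003})
all_reject = confirm_procedure_a(2, {1: 0.0001, 2: 0.0002, 3: 0.0003})
print(single_accept, wrong_winner, all_reject)
```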

Confirmation algorithm based on machine learning

Following the discussion in the subsection titled Building of model library, generating ⁿC₂ binary classifiers is necessary to handle the multiclass problem. The choice of a classifier depends on the specific characteristics of the dataset. A detailed discussion of the model-building procedure and the choice of a binary classifier can be found in Part 2 of S1 Text. Based on this analysis, we chose random forest as the appropriate binary classifier model for the model library. For the rest of this work, we employ random forest as our machine learning algorithm and report results from this tool for both user confirmation and user identification.

Once the model building was complete and the entire library was stored, the test data of a user claiming to be, say, ‘User i’ were given as input. The algorithm selects those models from the library that were built using User i’s training data and makes a prediction with each model, as depicted in the flow chart in Fig 7. The predictions made here are answers to the question “Is it User i? (Yes/No)”. The pipeline discussed so far becomes the ‘User confirmation block - ML’. The output of this block is a scalar v equal to the count of model predictions that say ‘Yes’. Here, a threshold of (again) 50% of the predictions was used to define the minimum confidence of confirmation. This means that the user is confirmed if the algorithm confirms the user in more than half of the classification trials, i.e., when v > (n/2), and is not confirmed otherwise.
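The model selection and voting in the ML confirmation block can be illustrated with a small, self-contained sketch. The trained random forest classifiers are replaced here by hypothetical stand-in models that assign a scalar sample to the nearer of two class centres. The names `make_pair_model`, `build_library` and `confirmation_block_ml`, and the toy centres, are assumptions made purely for illustration.

```python
from itertools import combinations


def make_pair_model(a, b, centres):
    """Stand-in for a trained binary classifier of the user-pair (a, b)."""
    def model(x):
        return a if abs(x - centres[a]) <= abs(x - centres[b]) else b
    return model


def build_library(centres):
    """One model per user-pair, i.e. nC2 models for n users."""
    users = sorted(centres)
    return {(a, b): make_pair_model(a, b, centres)
            for a, b in combinations(users, 2)}


def confirmation_block_ml(claimed_user, sample, library, n_users):
    """Run the n - 1 models that involve claimed_user and count 'Yes' votes."""
    models = [m for pair, m in library.items() if claimed_user in pair]
    votes = sum(m(sample) == claimed_user for m in models)
    return votes, votes > n_users / 2  # confirmed when v > n/2


centres = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}  # toy one-dimensional "users"
library = build_library(centres)
genuine = confirmation_block_ml(1, 0.1, library, n_users=len(centres))
impostor = confirmation_block_ml(1, 2.9, library, n_users=len(centres))
print(len(library), genuine, impostor)
```

With four toy users the library holds six pairwise models, of which three involve User 1. A sample close to User 1’s centre collects all three votes, while a sample far from it collects none.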


Fig 7. User confirmation algorithm based on machine learning.

A flow chart of the user confirmation algorithm based on machine learning. The user confirmation block is reused in the user identification algorithm later in this manuscript. Given a user i, the output of the user confirmation block is re-posed, based on a threshold, as an answer to the question “Are you indeed User i?”.

https://doi.org/10.1371/journal.pone.0301971.g007

User identification algorithm

This work is the first attempt of its kind to build a biometric system that works purely on human exhaled breath to identify the user without the user disclosing their identity. Even though the user confirmation system works exceptionally well, the grand challenge in this area of research is to test the performance of a user identification system. The confirmation algorithm tries to answer the question “Are you User i?”, while an identification algorithm answers the more general authentication question “Who is the User?”. In pursuit of this grand challenge, we have developed a user identification algorithm built on top of the approaches discussed in this manuscript. The machine learning based algorithm uses the same model library built earlier to perform the predictions. A block diagram of the algorithm is shown in Fig 8. The user identification algorithm incorporates the user confirmation block during the identification of a given user. When the data of a new test user, say User j, are given as input, the algorithm runs the user confirmation block by considering every user in the database as a trial user. This is effectively equivalent to running through all the ⁿC₂ models present in the library, but in batches of trial users, User i, where i = 1, 2, 3, …, n. The output of this pipeline is a vector V of size (1, n), with each element v_i being the result of the corresponding trial confirmation test. The identified user is the trial user corresponding to the maximum value in V. When more than one confirmation trial results in the maximum prediction value (two or more elements of V sharing the maximum value), the algorithm does not identify any user. The user identification algorithm is generic: any user confirmation algorithm (instance-based or model-based) can be used within it, and its output will be the vector V containing the counts of predictions.
This allows us to build a multi-model approach for user identification where multiple identification results are combined using a weighted sum. This is similar to a classical blackboard architecture, where results from multiple expert units are combined. We now present a brief discussion of this approach. Let V^HT and V^ML denote the outputs of the hypothesis test based and machine learning based user identification algorithms, respectively. A weighted sum of these two vectors gives a new vector V′ that combines the advantages of both algorithms, as shown in Eq 3:

V′ = w₁ V^HT + w₂ V^ML (3)

where w₁ and w₂ are the weights associated with the hypothesis test based algorithm and the machine learning based algorithm, respectively. Each weight lies in the range 0 ≤ w ≤ 1, and the weights must sum to 1. This approach can be generalised to a combination of multiple user identification algorithms. When we have r output vectors V₁, V₂, V₃, …, V_r from r algorithms, Eq 3 becomes Eq 4:

V′ = Σ_{k=1}^{r} w_k V_k, with Σ_{k=1}^{r} w_k = 1 (4)
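The weighted combination of vote vectors can be written as a short function. This is a minimal sketch assuming plain Python lists as vote vectors. The function name `fuse_votes` and the toy vote counts are hypothetical, while the weights 0.3 and 0.7 are the ones reported later in the results.

```python
def fuse_votes(vote_vectors, weights):
    """Weighted sum of r vote vectors of equal length."""
    if len(vote_vectors) != len(weights):
        raise ValueError("one weight is needed per vote vector")
    if any(w < 0 or w > 1 for w in weights):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    length = len(vote_vectors[0])
    if any(len(v) != length for v in vote_vectors):
        raise ValueError("vote vectors must have the same length")
    return [sum(w * v[k] for w, v in zip(weights, vote_vectors))
            for k in range(length)]


v_ht = [3, 0, 1, 0]  # toy vote counts from a hypothesis test based algorithm
v_ml = [2, 3, 1, 0]  # toy vote counts from a machine learning based algorithm
fused = fuse_votes([v_ht, v_ml], [0.3, 0.7])
print([round(x, 2) for x in fused])
```

Because the function accepts any number of vote vectors, the same call covers the two-algorithm case and the general case with r algorithms.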


Fig 8. A generic user identification algorithm.

Given a test user j, the algorithm performs n confirmation trials. One confirmation trial is equivalent to running the user confirmation block (either HT from Fig 6 or ML from Fig 7) for a trial user i. The identified user corresponds to the maximum prediction among the n confirmation trials. Note that when more than one confirmation trial results in the maximum prediction value, the algorithm does not identify a user.

https://doi.org/10.1371/journal.pone.0301971.g008
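The identification procedure summarised in Fig 8 can be sketched as a loop over trial users followed by an arg-max with a tie check. This is an illustrative sketch only: `confirmation_block` is a hypothetical callable returning the vote count for one trial user, the toy block simply looks up precomputed counts, and any additional threshold on the confidence of confirmation is left out.

```python
def identify_user(trial_users, votes):
    """Return the trial user with the unique maximum vote count, else None."""
    best = max(votes)
    winners = [user for user, v in zip(trial_users, votes) if v == best]
    return winners[0] if len(winners) == 1 else None


def run_identification(sample, trial_users, confirmation_block):
    """Run one confirmation trial per enrolled user and pick the winner."""
    votes = [confirmation_block(user, sample) for user in trial_users]
    return identify_user(trial_users, votes), votes


def toy_block(trial_user, sample):
    """Hypothetical stand-in: 'sample' is a table of precomputed vote counts."""
    return sample.get(trial_user, 0)


users = ["A", "B", "C", "D"]
identified, V = run_identification({"A": 1, "B": 3, "C": 2}, users, toy_block)
tied, _ = run_identification({"A": 3, "B": 3}, users, toy_block)
print(identified, V, tied)
```

Any confirmation block, instance-based or model-based, can be passed in as `confirmation_block`, which mirrors the generic design of the algorithm.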

Results and discussions

User confirmation system

Confirmation tests were performed for all n users available in the database. Each set of confirmation tests was repeated 66 times by shuffling the split between training and test data. The result of each trial is summarised by the number of confirmed users, c, and the number of unconfirmed users, u. To quantify the performance of the algorithms, we define a metric called the true confirmation rate (TCR), which is the ratio of the number of confirmed users to the total number of users, as shown in Eq 5:

TCR = c / (c + u) × 100% = c / n × 100% (5)

The confidence of confirmation (η) of a user confirmation algorithm is the percentage of predictions in favour of the claimed user during a confirmation test. It directly quantifies how confident the algorithm is while attempting to confirm a user i. It is defined as

η_i = v_i / (n − 1) × 100% (6)

where v_i is the number of favourable user predictions as seen in Figs 6 and 7, i.e., the total number of predictions that match the user the algorithm is attempting to confirm (out of the n − 1 pairwise tests involving that user), and n is the total number of users in the database. The value of η_i, for i = 1, 2, …, n, has to pass a threshold confidence of confirmation, η_t, for a user to be confirmed. A comparison of the histograms of η_i is shown in Fig 9 for one trial of n confirmation tests. The study revealed that the machine learning based algorithm performs better than the hypothesis testing based algorithm, which is consistent with the ability of a random forest classifier to capture the decision boundary better than its hypothesis testing based counterpart. For UCA.HT, the TCR was 50±9.6%, whereas for UCA.ML, the TCR was 97±2.5%. This implies that almost every user was able to pass the threshold of 50% in the machine learning based algorithm, i.e., the algorithm achieves a greater level of confidence when confirming a user with UCA.ML.
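The two confirmation metrics can be computed from the vote counts as sketched below. The sketch assumes that each confirmation test involves the n − 1 user-pairs containing the claimed user, so that η_i = 100·v_i/(n − 1), and that a user is confirmed when η_i reaches the threshold η_t. Both the denominator and the use of "greater than or equal" are assumptions made here, and the function names and toy counts are hypothetical.

```python
def confidence_of_confirmation(votes, n_users):
    """Percentage of the n - 1 pairwise predictions that favour the user."""
    return 100.0 * votes / (n_users - 1)


def true_confirmation_rate(etas, eta_t=50.0):
    """Percentage of users whose confidence reaches the threshold eta_t."""
    confirmed = sum(eta >= eta_t for eta in etas)
    return 100.0 * confirmed / len(etas)


n_users = 4
toy_votes = [3, 1, 2, 0]  # favourable predictions v_i for each of 4 users
etas = [confidence_of_confirmation(v, n_users) for v in toy_votes]
tcr = true_confirmation_rate(etas)
print([round(e, 1) for e in etas], tcr)
```

In this toy case two of the four users reach the 50% threshold, so the true confirmation rate is 50%.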


Fig 9. Comparison of the confidence of confirmation η_i.

Histograms of the confidence of confirmation η_i compared between (A) a machine learning based approach (random forest classifiers) and (B) a hypothesis testing based classification approach, for one trial of n confirmation tests. In the example shown here, the predictions from the ML classifiers give η_i values distributed between ≈38% and 100%, whereas the predictions from the HT based classifiers produce η_i values only close to 0% and 100%.

https://doi.org/10.1371/journal.pone.0301971.g009

We shall now investigate why the machine learning based classification algorithm performs better than hypothesis test based classification. In the case of hypothesis testing, the rejection of the null hypothesis depends on the confidence level chosen. The confidence level can be visualised as a demarcating hyper-surface between two multi-dimensional normal distributions. For simplicity, consider the decision boundaries captured by the random forest classifier and the hypothesis test based classifier in a chosen two-dimensional feature space. Fig 10 shows a visualisation in the (β, ω) plane for a randomly chosen user-pair. The blue and red markers are the training data points of the two user classes. The class regions are computed using a structured synthetic dataset in the feature space.


Fig 10. Comparison of the decision boundaries in (β, ω) plane.

Decision boundaries captured by (A) the random forest classifier and (B) the hypothesis testing based classifier for a randomly chosen user-pair. The scattered points are the training data points, with red and blue labels denoting their true classes. The line separating the two contour regions is the decision boundary. The accuracy of each model against the test data is displayed at the top right corner of the respective plot. The RF classifier captures a more complex decision boundary than the HT based classifier.

https://doi.org/10.1371/journal.pone.0301971.g010

To visualise the decision boundary of a hypothesis test based classifier, z-tests were performed in each dimension separately, for every data point of the synthetic dataset against one user’s training data. The tests were performed under the null hypothesis that the data point belongs to the distribution of the training data, at a confidence level of 99.9%. The overall null hypothesis is accepted only if the null hypotheses in both dimensions are accepted. Comparing the decision boundaries captured by the hypothesis test based algorithm and the random forest model for the same pair of users, one can observe that the random forest model is able to capture a more complex decision boundary between the two user classes. This allows the random forest classifier to achieve a test data accuracy of 90.9%, whereas the hypothesis testing based classifier achieves only 73.9%. Having established that the machine learning based algorithm is better than the hypothesis test based algorithm for user confirmation, we next investigate how the two algorithms perform for user identification.
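The per-dimension z-test used for this visualisation can be sketched as follows. The sketch assumes a two-sided test in which a single point is standardised with the sample mean and sample standard deviation of the training data in each dimension. The exact test construction in the original analysis may differ, and the function names and toy training points are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def z_critical(confidence=0.999):
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def accepts_point(point, training, confidence=0.999):
    """Accept H0 only if the z-test is passed in every dimension."""
    crit = z_critical(confidence)
    for d in range(len(point)):
        column = [row[d] for row in training]
        z = (point[d] - mean(column)) / stdev(column)
        if abs(z) > crit:
            return False  # rejected in this dimension, so rejected overall
    return True


# Toy training data for one user in a two-dimensional feature plane.
training = [(0.90, 0.20), (1.00, 0.25), (1.10, 0.30), (1.00, 0.22), (0.95, 0.28)]
inside = accepts_point((1.00, 0.26), training)
outside = accepts_point((1.00, 0.60), training)
print(round(z_critical(), 3), inside, outside)
```

Because each dimension is tested separately, the acceptance region of such a classifier is an axis-aligned rectangle around the training mean, which is one way to see why it cannot follow a more complex class boundary.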

User identification system

The identification algorithm discussed in Fig 8 shows that we obtain a vector V of favourable user predictions. Based on the values of vector V_j with j = 1, 2, 3, …n, we can obtain the following outcomes:

True positives (t)—Number of users who were identified correctly.
False positives (f)—Number of users who were identified incorrectly.
Not identified (h)—Number of users who the algorithm was unable to identify.

We shall define the following performance metrics to evaluate the user identification algorithm:

Precision (P) or Positive Predictive Value (PPV), which quantifies the percentage of users who were identified correctly among all the identified users: P = t / (t + f) × 100% (7)
This parameter quantifies the probability of a correct prediction given a judgement (identification) by the algorithm.
Accuracy (E), which quantifies the percentage of users who were identified correctly among all the n users: E = t / (t + f + h) × 100% = t / n × 100% (8)

The precision and accuracy values computed using Eqs 7 and 8 were 35±10.5% and 29±9.1%, respectively, for the hypothesis test based algorithm. The results reported in this section are in the format ‘μ_p±2σ_p’, where μ_p and σ_p are the mean and standard deviation of the performance metric, respectively. For the random forest based algorithm, we observed precision and accuracy values of 26±7.2% and 22±6.4%, respectively. These values were computed on the basis of the maximum votes received by a user among n confirmation trials, as described previously in Fig 8. When we combine the results from both algorithms using Eq 3 with w₁ = 0.3 and w₂ = 0.7, we get precision and accuracy values of 32±8.5% and 31±8.5%, respectively. Note that the values reported here are also influenced by the threshold η_t, which in this case was set to 55%. The parameters w₁, w₂, and η_t can be tuned to push the algorithm towards either extreme: (i) very liberal (lower precision, higher accuracy); (ii) very conservative (higher precision, lower accuracy). Taking the example of a particular trial with n = 94 and weights w₁ = 0.3 and w₂ = 0.7, η_t = 50% produces the outcomes (t, f, h) = (31, 58, 5), giving a precision of 34.8% and an accuracy of 33.0%. For the same weights, η_t = 96% produces the outcomes (t, f, h) = (18, 6, 70), giving a precision of 75.0% and an accuracy of 19.1%. The former setting allows many false positives by making a judgement on most instances, whereas the latter makes judgements stringently.
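The two worked examples above can be reproduced directly from the outcome counts, taking precision as t/(t + f) and accuracy as t/(t + f + h), both expressed as percentages. The function name is hypothetical, and returning `None` for the precision when no user is identified is a choice made for this sketch.

```python
def identification_metrics(t, f, h):
    """Return (precision, accuracy) in percent from the outcome counts."""
    identified = t + f
    precision = 100.0 * t / identified if identified else None
    accuracy = 100.0 * t / (t + f + h)
    return precision, accuracy


liberal = identification_metrics(31, 58, 5)        # threshold of 50%
conservative = identification_metrics(18, 6, 70)   # threshold of 96%
for label, (p, e) in [("liberal", liberal), ("conservative", conservative)]:
    print(f"{label}: precision = {p:.1f}%, accuracy = {e:.1f}%")
```

The counts reproduce the reported values: 34.8% and 33.0% for the liberal setting, and 75.0% and 19.1% for the conservative one.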

With the right set of hyperparameters (w₁, w₂, …, w_r in the general case of Eq 4, and η_t), a multi-model approach is expected to improve the robustness of the overall algorithm. If one classifier produces incorrect predictions for certain trials, other classifiers in the ensemble can compensate and provide correct predictions. The contribution of each algorithm is controlled by its weight. This robustness helps to improve the generalisation of the ensemble model. The following discussion is based on results produced by this combined algorithm. Recall that the highest voted user becomes the identified user. Based on the 66 shuffle trials, we have the following understanding of the user database: for 21.3% to 42.6% of the users, the true user is the highest voted user; for 39.4% to 57.4%, the true user is among the two highest voted users; and for 50.0% to 66.0%, the true user is among the three highest voted users. This is remarkable given that it is the first attempt in the literature to classify and uniquely identify individuals based solely on the fluid physics of exhaled breath. We believe that this is conclusive evidence that the fluid dynamic structure of exhaled breath contains uniquely identifiable information.

This algorithm holds tremendous potential for future use in the area of personalised medicine and also as a novel way to store biological data. This can be achieved by careful model selection and generalisation of classifier models. Advanced models such as deep neural networks could be used to enhance the multi-model approach discussed in this manuscript.

Physical insights: Understanding the defining features

To make a physics-based argument for the uniqueness of human exhalation, it is important to investigate the physical significance of the most important features that result in robust classification. These are the features or attributes that inherently differentiate the classes for a given training dataset. As we have seen in the subsection titled Feature extraction, the importance of the features was quantified for every random forest binary classifier in order to choose a reduced feature set. These features are investigated here to understand their physical meaning in the context of the problem at hand. The most important classifying features, in decreasing order of importance, are described below.

1. The singularity strength or Hölder exponent corresponding to the maximum (β) of the multifractal spectrum of the exhaled breath time series: This is a feature extracted using MFDFA. β describes the long-range correlation present in the time series. A low value indicates that the underlying process becomes correlated and loses fine structure, becoming more regular in appearance [30]. In our case, this would relate to the organised motion of vortical structures in the turbulent exhaled air flow. For some subjects the vorticity pattern might be more irregular than for others, which could be attributed to the extrathoracic morphology.
2. The sum over the absolute value of consecutive changes in the velocity time series: This feature represents the total magnitude of the absolute differences between successive data points. In the context of our study, a higher value indicates a greater overall change in velocity between consecutive data points, i.e., the velocity changes rapidly and frequently. In contrast, a low value indicates that the velocity is smooth and consistent. The feature provides a quantitative measure of how much the velocity fluctuates over successive time intervals, which in our case are 0.1 milliseconds long. The detection of distinctive patterns in these fluctuations can provide insights into the presence of vortical structures in exhaled breath flow, contributing to the uniqueness of these patterns for individual subjects and enabling their classification by the algorithm.
3. Third coefficient of the autoregressive AR(r) model with order parameter r = 10: The parameter r is the maximum lag of the autoregressive process. An AR model predicts future behaviour from past data. The importance of the third as well as the fourth (point 8) coefficients shows that there is some correlation between successive values in the time series for most of the users.
4. The number of peaks in the time series with a support (s) of at least 1: A peak of support s is defined as a sub-sequence of the time series in which a value occurs that is greater than its s neighbours to the left and to the right. When s is set to 1, this feature counts the peaks in the time series where a value is greater than its immediate neighbours. This feature can provide insights into the presence or intensity of localised fluctuations in the flow.
5. The number of different Continuous Wavelet Transform (CWT) peaks present in the signal for a smoothing width of 1: This feature was extracted from the time series by applying the CWT using a Ricker wavelet with width w = 1. This method simultaneously evaluates the signal in the temporal and frequency domains. In the context of our study, the identified CWT peaks represent distinctive features in the breath signature. Physically, these peaks may correspond to specific events or patterns that are characterised by rapid changes in both the time and frequency domains. For instance, a CWT peak could signify the presence of a sudden, localised change in the breath velocity with a particular frequency content. The number of distinct peaks across the considered width scales provides a quantitative measure of the complexity of the breath signature. It can be used to compare signals based on their peak characteristics.
6. The value of the partial autocorrelation function at a lag of 3: The partial autocorrelation is a statistical measure that quantifies the linear relationship between a time series variable and its lagged values, while controlling for the values at shorter lags. In the context of exhaled breath flow, the partial autocorrelation can provide insights into the temporal dependence and correlation structure of the breath velocity. This feature can therefore be useful in understanding the persistence or memory of the signal. It suggests that a strong linear relationship between the current flow state and its state 3 time steps earlier has been important for the classification of human subjects. In our analysis, a ‘time step’ is the sampling interval at the original sampling rate of 10 kHz. A lag of 3 time steps therefore corresponds to a duration of 0.3 milliseconds.
7. Width of the multifractal spectrum (ω) of the exhaled breath time series: ω describes the richness of the multifractality present in the time series, i.e., the wider the range of singularity strengths, the richer the structure of the signal. The spectral width can implicitly represent the intensity or level of turbulence present in the flow of exhaled breath. Turbulence is characterised by fluctuations in velocity at different scales. A wider range of turbulence scales is reflected in a wider spectral width, indicating a more turbulent flow. This might be attributed to factors such as extrathoracic constriction, or increased turbulence due to specific breath patterns or breath dynamics.
8. Fourth coefficient of the autoregressive AR(r) model with order parameter r = 10.
9. The number of different continuous wavelet transform (CWT) peaks present in the signal for a smoothing width of 5: This feature was extracted using the same technique as discussed in point 5, but with a width of w = 5. A larger smoothing width leads to a broader wavelet, which provides a smoother analysis that emphasises broader features and lower-frequency components of the signal. Conversely, the smaller smoothing width of w = 1 (point 5) results in a narrow wavelet, allowing a more detailed examination of rapid changes in the signal (sensitive to high-frequency components).
10. Kurtosis of the velocity time series calculated with the adjusted Fisher-Pearson standardized moment coefficient, g2: Kurtosis is a higher-order statistical attribute of velocity signals. The heaviness of the tails of the probability density functions of the normalised time series could be distinct for each user. This feature helps in assessing the degree of deviation from the Gaussian distribution and provides evidence of non-Gaussian (heavy-tailed) behaviour of the time series.
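Three of the simpler features in the list above can be computed from their definitions with plain Python, as sketched below. These functions follow the verbal definitions given here rather than any particular feature-extraction library, and the function names and toy series are hypothetical. The kurtosis uses the standard adjusted Fisher-Pearson estimator of excess kurtosis, which is assumed to be what the text refers to.

```python
def absolute_sum_of_changes(x):
    """Sum of |x[k+1] - x[k]| over the series."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def number_of_peaks(x, support=1):
    """Count values strictly greater than their `support` neighbours on each side."""
    count = 0
    for k in range(support, len(x) - support):
        left = x[k - support:k]
        right = x[k + 1:k + 1 + support]
        if all(x[k] > v for v in left) and all(x[k] > v for v in right):
            count += 1
    return count


def adjusted_kurtosis(x):
    """Adjusted Fisher-Pearson excess kurtosis (requires at least 4 points)."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    g2 = m4 / m2 ** 2 - 3.0  # biased excess kurtosis
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)


series = [0.0, 2.0, 1.0, 3.0, 0.5, 0.5, 4.0, 1.0]  # toy velocity samples
print(absolute_sum_of_changes(series))
print(number_of_peaks(series, support=1), number_of_peaks(series, support=2))
print(round(adjusted_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0]), 6))
```

A larger support makes the peak count stricter, because a value must then dominate more neighbours on both sides to be counted.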

Computational complexity of the algorithm

The run-time of an algorithm is an extremely important factor for a real-time biometric system. It was generally observed that the size of the input feature set affects the computational resources required to run an algorithm. The hypothesis test based algorithm was observed to make predictions faster than the machine learning based algorithm, because the former is an instance-based classifier. Since the user identification algorithm depends on the number of users, and in turn on the number of models in the model library, the identification time per user was expected to grow with the size of the library. The identification time was observed to vary linearly with the size of the library (of the form y = ax, with slope a ≈ 1), as seen in Fig 11. The error bars show the 95% confidence interval at every data point.


Fig 11. Dependence of user identification time on the size of model library.

Plot showing the linear relation of user identification time with the growth of model library. This is applicable to the ML based algorithms which include building of binary classifier models (also known as enrollment in the context of biometrics). The error bars show 95% confidence interval at every data point.

https://doi.org/10.1371/journal.pone.0301971.g011

One of the advantages of an algorithm that uses ⁿC₂ binary classifiers instead of a single multi-class classifier is that it is massively parallelisable. Parallelisation is possible as long as a sufficient number of cores is available for model loading and prediction, and it can reduce the computation time by several orders of magnitude.
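Because every pairwise model is evaluated independently, the predictions can be distributed over workers. The sketch below illustrates the idea with Python’s standard `concurrent.futures` module and hypothetical stand-in models (nearest class centre on a scalar sample). It is not the authors’ pipeline, and a process pool rather than a thread pool would normally be preferred for CPU-bound classifiers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import comb


def predict_pair(pair, sample, centres):
    """Stand-in for loading and evaluating the binary model of one user-pair."""
    a, b = pair
    return a if abs(sample - centres[a]) <= abs(sample - centres[b]) else b


def parallel_vote_vector(sample, centres, max_workers=4):
    """Evaluate all nC2 pairwise models concurrently and tally votes per user."""
    users = sorted(centres)
    pairs = list(combinations(users, 2))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        winners = list(pool.map(lambda p: predict_pair(p, sample, centres), pairs))
    return {user: winners.count(user) for user in users}


centres = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}  # toy one-dimensional "users"
votes = parallel_vote_vector(2.1, centres)
n_models = comb(94, 2)  # size of the model library for n = 94 users
print(votes, n_models)
```

For the 94 subjects in this study the library contains 4371 pairwise models, each of which can be evaluated without waiting for the others.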

Conclusion

We have provided evidence for the feasibility of a novel biometric system that works on the turbulence information present in human exhaled breath. The use of a hot-wire anemometer for data acquisition allowed us to build a compact working setup. The fast response time of a constant temperature hot-wire anemometer, in combination with real-time computation, may make the setup implementable as a biometric authentication system. Since the input of the exhaled breath-based biometric system is correlated with the internal morphology of the human body, it is difficult for an attacker to spoof-authenticate a user: doing so would require reconstructing an original time series and, subsequently, the binary classifier models that consolidate all the relevant features (biometric traits) of the true user. Preliminary studies carried out and presented in this work, based on time series data from 94 human subjects, have shown promising results. We recommend the machine learning approach discussed in this work as a procedure to build a working user confirmation system, as it produces good accuracy in confirming users. It achieved a true confirmation rate of over 97%, which is because of the ability of random forest models to capture complex decision boundaries between the classes. Although the dataset performs really well for a user confirmation algorithm, the real test of a biometric system lies in the user identification algorithm, where the test user’s identity is not revealed a priori. Building such an algorithm comes with more challenges and would require samples from a larger population to be evaluated. We recommend a multi-model approach for the user identification system, as discussed in this manuscript. The results from our study show that a user identification algorithm performs reasonably well, with maximum precision and accuracy of ≈40% each for optimum parameter settings. Between 39.4% and 57.4% of the users were correctly identified as at least the second highest voted users.

Our study reveals the possibility that a system built solely on the fluid dynamics of human exhaled breath could be a tool to understand the person-to-person variation in the turbulent signatures of exhaled breath. This uniqueness in the observed signature could potentially be correlated with the morphometric variation present in the extrathoracic airway. To comment on the intricate structures within the upper respiratory tract, we might need experimental proof on cadaver models, or simultaneous imaging of the upper tract along with the HWA data. Such a study would give us insights into how these structures exhibit considerable morphological diversity among individuals. While our study does not involve direct experimentation with throat morphology, it prompts consideration of how these morphological variations could contribute to the surprisingly unique turbulent signatures found in exhaled breath. Further investigation would give us a better understanding of the relationship between these morphological traits and the distinct fluid dynamic signatures. For example, it is possible that the turbulence information can be correlated with occlusion in the extrathoracic passage and its nature, which is a major source of deposition of aerosolised therapeutics. Such an understanding will help us delve deeper into the area of personalised medicine.

Supporting information

S1 Text. Supplementary materials for user authentication system based on human exhaled breath physics.

The supporting information for this research article includes: Part 1: A statistical description which describes the Multifractal Detrended Fluctuation Analysis (MFDFA) of human exhaled breath velocity time series; Part 2: Model library building procedure and model selection for the machine learning based algorithm.

https://doi.org/10.1371/journal.pone.0301971.s001

(PDF)

S1 Checklist. Human participants research checklist.

https://doi.org/10.1371/journal.pone.0301971.s002

(PDF)

Acknowledgments

The authors acknowledge the HPCE, the Synchrony, and the SENAI of the Indian Institute of Technology Madras for providing the required high performance computing resources. The authors thank the NCCRD, IIT Madras, for providing the hot-wire anemometer setup and the calibration facility to carry out the experiments. The authors are also thankful to all the participants who volunteered to give their exhaled breath data for this study.

References

1. Pedley TJ. Pulmonary Fluid Dynamics. Annual Review of Fluid Mechanics. 1977;9(1):229–274.
2. Chang HK, El Masry OA. A model study of flow dynamics in human central airways. Part I: Axial velocity profiles. Respiration Physiology. 1982;49(1):75–95. pmid:7146646
3. sen Wang C. Chapter 3 Airflow in the respiratory system. In: Inhaled Particles. vol. 5 of Interface Science and Technology. Elsevier; 2005. p. 31–54.
4. Finlay WH. The mechanics of inhaled pharmaceutical aerosols. San Diego, CA: Academic Press; 2001.
5. Dekker E. Transition between laminar and turbulent flow in human trachea. Journal of Applied Physiology. 1961;16(6):1060–1064. pmid:13884939
- View Article
- PubMed/NCBI
- Google Scholar
6. Chauhan J, Hu Y, Seneviratne S, Misra A, Seneviratne A, Lee Y. BreathPrint: Breathing Acoustics-Based User Authentication. In: Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services. MobiSys’17. New York, NY, USA: Association for Computing Machinery; 2017. p. 278–291.
7. Woodward JD, Webb KW, Newton EM, Bradley M, Rubenson D, Larson K, et al. In: A PRIMER ON BIOMETRIC TECHNOLOGY. 1st ed. RAND Corporation; 2001. p. 9–20.
8. Schaber CL, Katta N, Bollinger LB, Mwale M, Mlotha-Mitole R, Trehan I, et al. Breathprinting reveals malaria-associated biomarkers and mosquito attractants. The Journal of infectious diseases. 2018;217(10):1553–1560. pmid:29415208
- View Article
- PubMed/NCBI
- Google Scholar
9. Horváth I, Barnes PJ, Loukides S, Sterk PJ, Högman M, Olin AC, et al. A European Respiratory Society technical standard: exhaled biomarkers in lung disease. European Respiratory Journal. 2017;49(4).
- View Article
- Google Scholar
10. Rattray NJ, Hamrang Z, Trivedi DK, Goodacre R, Fowler SJ. Taking your breath away: metabolomics breathes life in to personalized medicine. Trends in biotechnology. 2014;32(10):538–548. pmid:25179940
- View Article
- PubMed/NCBI
- Google Scholar
11. Samara MA, Tang WW, Cikach F, Gul Z, Tranchito L, Paschke KM, et al. Single exhaled breath metabolomic analysis identifies unique breathprint in patients with acute decompensated heart failure. Journal of the American College of Cardiology. 2013;61(13):1463–1464. pmid:23500243
- View Article
- PubMed/NCBI
- Google Scholar
12. Guo D, Zhang D, Li N, Zhang L, Yang J. Diabetes identification and classification by means of a breath analysis system. In: International conference on medical biometrics. Springer; 2010. p. 52–63.
13. Lawal O, Ahmed WM, Nijsen TM, Goodacre R, Fowler SJ. Exhaled breath analysis: a review of ‘breath-taking’methods for off-line analysis. Metabolomics. 2017;13(10):1–16. pmid:28867989
- View Article
- PubMed/NCBI
- Google Scholar
14. Mashir A, Dweik RA. Exhaled breath analysis: the new interface between medicine and engineering. Advanced Powder Technology. 2009;20(5):420–425. pmid:20948990
- View Article
- PubMed/NCBI
- Google Scholar
15. Pereira J, Porto-Figueira P, Cavaco C, Taunk K, Rapole S, Dhakne R, et al. Breath analysis as a potential and non-invasive frontier in disease diagnosis: an overview. Metabolites. 2015;5(1):3–55. pmid:25584743
- View Article
- PubMed/NCBI
- Google Scholar
16. Das S, Pal M. Non-invasive monitoring of human health by exhaled breath analysis: A comprehensive review. Journal of The Electrochemical Society. 2020;167(3):037562.
- View Article
- Google Scholar
17. Lafortuna CL, Minetti AE, Mognoni P. Inspiratory flow pattern in humans. Journal of Applied Physiology. 1984;57(4):1111–1119. pmid:6501028
- View Article
- PubMed/NCBI
- Google Scholar
18. Painter R, Cuningham D. Analyses of human respiratory flow patterns. Respiration physiology. 1992;87(3):293–307. pmid:1604054
- View Article
- PubMed/NCBI
- Google Scholar
19. Godal A, Belenky D, Standaert T, Woodrum D, Grimsrud L, Hodson W. Application of the hot-wire anemometer to respiratory measurements in small animal. Journal of applied physiology. 1976;40(2):275–277. pmid:1249009
- View Article
- PubMed/NCBI
- Google Scholar
20. Lundsgaard JS, Grønlund J, Einer-Jensen N. Evaluation of a constant-temperature hot-wire anemometer for respiratory-gas-flow measurements. Med Biol Eng Comput. 1979;17(2):211–215. pmid:155766
- View Article
- PubMed/NCBI
- Google Scholar
21. Silva ISS, Freire RCS, Silva JF, Naviner JF, Sousa FR, Catunda SYC. Architectures of anemometers using the electric equivalence principle. In: IMTC/2002. Proceedings of the 19th IEEE Instrumentation and Measurement Technology Conference (IEEE Cat. No.00CH37276). vol. 1; 2002. p. 397–401 vol.1.
22. Araujo GA, Freire RC, Silva JF, Oliveira A, Jaguaribe E. Breathing flow measurement with constant temperature hot-wire anemometer for forced oscillations technique. In: Proceedings of the 21st IEEE Instrumentation and Measurement Technology Conference (IEEE Cat. No. 04CH37510). vol. 1. IEEE; 2004. p. 730–733.
23. Kandaswamy A, Kumar CS, Kiran TV. A virtual instrument for measurement of expiratory parameters. In: IMTC/2002. Proceedings of the 19th IEEE Instrumentation and Measurement Technology Conference (IEEE Cat. No.00CH37276). vol. 2; 2002. p. 1255–1258 vol.2.
24. Xu C, Nielsen P, Gong G, Liu L, Jensen R. Measuring the exhaled breath of a manikin and human subjects. Indoor Air. 2015;25(2):188–197. pmid:24837295
- View Article
- PubMed/NCBI
- Google Scholar
25. Plakk P, Liik P, Kingisepp PH. Hot-wire anemometer for spirography. Medical and Biological Engineering and Computing. 1998;36(1):17–21. pmid:9614743
- View Article
- PubMed/NCBI
- Google Scholar
26. Chauhan J, Seneviratne S, Hu Y, Misra A, Seneviratne A, Lee Y. Breathing-based authentication on resource-constrained iot devices using recurrent neural networks. Computer. 2018;51(5):60–67.
- View Article
- Google Scholar
27. Abdelnasser H, Harras KA, Youssef M. UbiBreathe: A ubiquitous non-invasive WiFi-based breathing estimator. In: Proceedings of the 16th ACM International Symposium on Mobile Ad Hoc Networking and Computing; 2015. p. 277–286.
28. Liu J, Chen Y, Dong Y, Wang Y, Zhao T, Yao YD. Continuous user verification via respiratory biometrics. In: IEEE INFOCOM 2020-IEEE Conference on Computer Communications. IEEE; 2020. p. 1–10.
29. Lu L, Liu L, Hussain MJ, Liu Y. I Sense You by Breath: Speaker Recognition via Breath Biometrics. IEEE Transactions on Dependable and Secure Computing. 2020;17(2):306–319.
- View Article
- Google Scholar
30. Kantelhardt JW, Zschiegner SA, Koscielny-Bunde E, Havlin S, Bunde A, Stanley HE. Multifractal detrended fluctuation analysis of nonstationary time series. Physica A: Statistical Mechanics and its Applications. 2002;316(1-4):87–114.
- View Article
- Google Scholar
31. Eke A, Herman P, Sanganahalli B, Hyder F, Mukli P, Nagy Z. Pitfalls in fractal time series analysis: fMRI BOLD as an exemplary case. Frontiers in Physiology. 2012;3. pmid:23227008
- View Article
- PubMed/NCBI
- Google Scholar
32. Shimizu Y, Barth M, Windischberger C, Moser E, Thurner S. Wavelet-based multifractal analysis of fMRI time series. NeuroImage. 2004;22(3):1195–1202. pmid:15219591
- View Article
- PubMed/NCBI
- Google Scholar
33. Zhang X, Zeng M, Meng Q. Multivariate multifractal detrended fluctuation analysis of 3D wind field signals. Physica A: Statistical Mechanics and its Applications. 2018;490:513–523.
- View Article
- Google Scholar
34. Christ M, Braun N, Neuffer J, Kempa-Liehr AW. Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh—A Python package). Neurocomputing. 2018;307:72–77.
- View Article
- Google Scholar
35. Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32.
- View Article
- Google Scholar
36. Fürnkranz J. Round Robin Classification. Journal of Machine Learning Research. 2002;2(4):721–747.
- View Article
- Google Scholar
37. Lorena AC, de Carvalho ACPLF, Gama JMP. A review on the combination of binary classifiers in multiclass problems. Artificial Intelligence Review. 2008;30(1-4):19–37.
- View Article
- Google Scholar
38. Galar M, Fernández A, Barrenechea E, Bustince H, Herrera F. An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes. Pattern Recognition. 2011;44(8):1761–1776.
- View Article
- Google Scholar
39. Lan G, Gao Z, Tong L, Liu T. Class binarization to neuroevolution for multiclass classification. Neural Computing and Applications. 2022;34(22):19845–19862.
- View Article
- Google Scholar
40. Hotelling H. The Generalization of Student’s Ratio. The Annals of Mathematical Statistics. 1931;2(3):360–378.
- View Article
- Google Scholar
41. Janczura J, Kowalek P, Loch-Olszewska H, Szwabiński J, Weron A. Classification of particle trajectories in living cells: Machine learning versus statistical testing hypothesis for fractional anomalous diffusion. Phys Rev E. 2020;102:032402. pmid:33076015
- View Article
- PubMed/NCBI
- Google Scholar
42. He Z, Sheng C, Liu Y, Zou Q. Instance-Based Classification Through Hypothesis Testing. IEEE Access. 2021;9:17485–17494.
- View Article
- Google Scholar
43. Li JJ, Tong X. Statistical Hypothesis Testing versus Machine Learning Binary Classification: Distinctions and Guidelines. Patterns. 2020;1(7):100115. pmid:33073257
- View Article
- PubMed/NCBI
- Google Scholar

