Monitoring the security of audio biomedical signals communications in wearable IoT healthcare

The COVID-19 pandemic has imposed new challenges on the healthcare industry, as hospital staff are exposed to a massive coronavirus load when registering new patients, taking temperatures, and providing care. The Ebola epidemic of 2014 is another example: during that outbreak, a hospital in New York decided to use an audio-based communication system to protect nurses. This idea quickly turned into an Internet of Things (IoT) healthcare solution that helps staff communicate with patients remotely. However, it has also grabbed the attention of criminals who use this medium as a cover for secret communication. The merging of signal processing and machine-learning techniques has led to the development of steganalyzers with very high efficiency, but since the statistical properties of normal audio files differ from those of purely speech audio files, current steganalysis practices are not efficient enough for this type of content. This research considers the Percent of Equal Adjacent Samples (PEAS) feature for speech steganalysis. This feature efficiently discriminates least significant bit (LSB) stego speech samples from clean ones with a single analysis dimension. A sensitivity of 99.82% was achieved for the steganalysis of 50% embedded stego instances using a classifier based on the Gaussian membership function.


Introduction
Cryptography is the science of making data unintelligible to unauthorized people; however, it does not conceal from adversaries the existence of the encrypted data or other critical information about it, such as the duration and frequency of communications, the message size, and the identities of the sender and recipient [1,2]. Considering the need for concealed communications, steganography was designed as another security method that can conceal secrets within the body of digital media, enhance privacy, prevent traffic analysis, and allow the transfer of secrets in an imperceptible manner. However, considering the applications of steganography, digital forensic scientists have designed a countermeasure, steganalysis, that allows them to supervise potential secret data exchanges that may be performed by hackers, terrorists, and lawbreakers [3,4].
Image, video, and audio are the digital media formats that have been widely used by steganographers as cover media [4,5]. Among these formats, audio files account for a substantial proportion of today's media transfer and therefore create significant opportunities for illegal data transfer without attracting attention. Nevertheless, digital forensic scientists have paid less attention to the audio format, as compared to images, when developing state-of-the-art audio steganalysis algorithms, and this domain requires modern steganalyzers that exploit the statistical characteristics of audio formats [6].
As shown in Fig. 1, steganalysis methods can be classified according to the signal characteristics.
Multiplicative, phase-encoding, and echo-embedding are the three targeted steganalysis subclasses. The multiplicative subclass may be described as s[n] = c[n](1 + m[n]), where c[n], m[n], and s[n] are the cover, secret message, and stego signals, respectively [7]. Normal steganalyzers cannot efficiently detect multiplicative steganography; hence, this class adds the logarithms of the absolute values of audio samples to the steganalysis discrimination factor [1]. Phase coding, the second targeted class, is based on the fact that relative phases between blocks are preserved, while those of consecutive blocks are changed [8].
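As an illustration, the multiplicative model s[n] = c[n](1 + m[n]) can be sketched in a few lines of Python. The scaling factor `alpha` and the sample values below are hypothetical and only serve to show the shape of the operation:

```python
def multiplicative_embed(cover, message, alpha=0.01):
    """Multiplicative embedding sketch: s[n] = c[n] * (1 + alpha * m[n]).

    `alpha` is a hypothetical strength factor; `message` is a +/-1 bit stream.
    """
    return [c * (1 + alpha * m) for c, m in zip(cover, message)]

cover = [0.5, -0.25, 0.8]   # hypothetical cover samples
bits = [1, -1, 1]           # secret message mapped to +/-1
stego = multiplicative_embed(cover, bits)
```

Because each stego sample is the cover sample scaled by a factor close to 1, the distortion is small, which is why this subclass needs the logarithm of absolute sample values as an extra discrimination factor.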
The third class, namely, echo embedding, holds a bank of kernels in which one selected kernel convolves with segmented parts of the cover audio file [9]. Positive-Negative (PN) and Forward-Backward (FB) are two sample echo embedding kernels.
The universal class can be divided into calibrated and noncalibrated branches. In the calibrated methods, the challenge is to find features that are formed mostly by the embedded message rather than the original signals [1]. The first subclass of calibrated methods is self-generated cover estimation. Hausdorff Distance (HD), Audio Quality Metrics (AQM), the generalized Gaussian distribution, and the Gaussian Mixture Model (GMM) are the main methods in this subclass [1]. The second subclass is constant referencing, in which a measure M(x, y) quantifies the discrepancies between signals x and y and is also extendable to whole signals [10]. Linear basis, the third subclass, deals with estimating the original cover space model [10] and works efficiently if the high-dimensional model of the cover space is known. Re-embedding, the final subclass, utilizes both the given and the re-embedded samples for steganalysis [10]. The common point among the aforementioned calibrated methods is that they all need the cover signals for calibration.
The methods in the noncalibrated branch extract the required steganalysis features from time or frequency domain signals [1]. The statistical models of histogram, linear prediction residue, Markov-process-based features, and chaotic-based features are the main time-domain features [11,12]. The main source of features for the frequency domain are the reversed-psychoacoustic model of human hearing, feature fusion, Mel-Frequency Cepstrum Coefficients (MFCC), and the MFCC derivative of audio signals [1].
Despite the development of various audio steganalysis algorithms, the amount of fundamental research in the field of speech steganalysis is very limited. More importantly, the current practices are not efficient enough to preserve the security of biomedical audio signal communications. This paper proposes a novel speech steganalysis algorithm within the class of noncalibrated universal algorithms. Compared to the state of the art, the proposed algorithm facilitates fast detection of stego signals due to its single analysis dimension, which requires lighter computation.
The rest of this paper is organized as follows: Section 2 reviews the background and related works for WAV, MP3, VoIP, and speech formats. The designed feature and structure of the algorithm are elaborated in Section 3. Section 4 describes the conducted experiments and compares the achievements of the work with the related researches. Finally, Section 5 presents the conclusion of the paper.

Background and literature
Applications of audio communications on the Internet have been increasing constantly, and the majority of Internet of Things (IoT) based health monitoring systems have added this feature to their services [13,14]. The popularization of this communication channel has turned it into a favorable channel for cyber adversaries. For instance, a wide range of algorithms and tools such as S-Tool, Hide4PGP, and MP3Stego exist that can embed secret messages in the audio files [15]. As a consequence, audio steganalysis algorithms have become a critical tool for preserving the security of the cyberworld.
Liu et al. [16] introduced a method based on the derivative-based Fourier spectrum and Mel-cepstrum for the steganalysis of WAV files. They reported the method as reliable if the ratio of data hiding is high. Afterwards, to enhance the sensitivity of their method, Liu et al. added Mel-cepstrum coefficients and second-order derivatives of the Markov transition features of audio signals to their algorithm. In another work, Tint and Mya [17] utilized MFCC, Zero Crossing Rate (ZCR), short-time energy features, and spectral flux in their designed steganalyzer. Evaluation results proved that the algorithm works efficiently in the detection of stego files made by Hide4PGP, S-tools4, and Stegowav. The virtual ear [18] is sensitive to variations in the high-frequency regions and is structured based on the Reversed-Mel scale.
In the field of MP3 steganalysis, the authors of [19] employed a recompression calibration method for detecting the stego files generated by MP3Stego. The algorithm performs well for high-ratio embedding. In another work, Yan et al. [20] designed a steganalysis algorithm for attacking files generated by MP3Stego. The algorithm calculates the Standard Deviation (SD) of the second-order differential sequence in the quantization phase.
Because of the massive application of VoIP technology in messengers, social networks, and various platforms, it is crucial to have specifically designed VoIP steganalyzers. To this end, Ren et al. [21] proposed an Adaptive Multi-Rate (AMR) steganalyzer to detect secret messages hidden based on [22,23] steganography algorithms. Their steganalyzer works based on the probability of the same pulse position and achieved more than 99% accuracy for an embedding ratio of 60%.
In [24], Ren et al. used the Markov model of the adjacent codebook for attacking Huffman codebook stego samples that were created based on Advanced Audio Coding (AAC). Under certain criteria, the algorithm reached 100% accuracy.
One of the earliest Machine-Learning (ML) steganalyzers was designed by Ozer et al. [25]. The model was a mix of SVM and Audio Quality Metrics (AQM) and was evaluated against six data-hiding techniques: four watermarking and two steganography techniques. Depending on the chosen data-hiding technique, the steganalyzer's accuracy fluctuates between 87% and 100%. Later, Kraetzer and Dittmann [26] utilized MFCCs as classification features and fed the extracted coefficients into a Radial Basis Function (RBF) SVM for detecting stego samples. In a more advanced work, the authors of [27] presented a Deep Belief Network (DBN) algorithm for detecting stego samples produced by StegHide, Hide4PGP, and FreqSteg. Moreover, the authors claimed that the algorithm could identify which one of the tools produced the stego sample.
In [28], the authors introduced an SVM-based VoIP speech steganalyzer that exploits pulse-position distribution characteristics even in low bit-rate VoIP speech. Their discrimination factors are a set of Markov transition probabilities of pulse position based on joint probability matrices for identifying pulse-to-pulse correlation, the probability distribution of pulse positions for the long-time distribution of features, and short-time invariance characteristics of speech signals. The algorithm was evaluated against a large set of G.729a-encoded speech samples.

Table 1
Notations used.

D: The imported data array of signals from the WAV file
D_i: The i-th sample in the imported data array from the WAV file
Sp: Speech samples array
Si: Silence samples array
Sp_i: The i-th sample in the speech samples array
Si_i: The i-th sample in the silence samples array
PEAS_Sp: Percent of Equal Adjacent Samples (PEAS) in the speech samples array
PEAS_Si: Percent of Equal Adjacent Samples (PEAS) in the silence samples array
τ: The threshold for speech and silence separation within the signal power domain
P: The array of power of signals
P_i: Power value of the i-th sample in the power array
P_max: Maximum power value in the power array
P_min: Minimum power value in the power array
P̂: Normalized power array
P̂_i: The i-th sample within the normalized power array
Average GMF: The average GMF values for speech and silence PEAS
m: Mean
m_Sp: Mean of speech-band training data
m_Si: Mean of silence-band training data
FN: False Negative

Proposed steganalysis method
The designed steganalysis algorithm is built based upon a novel feature named "Percent of Equal Adjacent Samples", and thus, hereinafter, the algorithm is referred to as PEAS. To formulate the PEAS process and workflow, and also to keep the language clear and consistent throughout the paper, Table 1 summarizes the used notations and abbreviations.

PEAS workflow
PEAS steganalysis has three main components: signal classification, feature extraction, and instance classification. The first component, i.e., signal classification, separates speech and silence signals using four subcomponents: (1) calculating signal power, (2) normalizing power, (3) applying threshold, and (4) signal classification. The second component, i.e., feature extraction, calculates the PEAS values of speech and silence signals. The final component, i.e., instance classification, is in charge of defining the steganalysis results: clean or stego. Fig. 2 shows the overall PEAS steganalysis workflow.

Signal classification
The signal classification component aims at classifying the input  signals into speech and silence classes. To this end, it calculates and then normalizes the signal power using the first two subcomponents. The third subcomponent applies the threshold value τ to separate speech and silence areas. Based on the location of the signals in the normalized power array and the detected speech and silence areas, the fourth subcomponent generates two separate arrays containing speech and silence signals. These steps are illustrated and labeled as A to E in Fig. 3.
The power of a signal is calculated based on Equation (1). However, since the power of a signal depends on the recording quality, two separate recordings of one voice may differ considerably. Normalizing power is a process by which this issue can be resolved as it converts the given signals into the same scale. As the signal powers of a given instance are organized within an array, a min-max feature scaling, as shown in Equation (2), was used to normalize the power array.
Once the power array is normalized, the signals can be classified into speech and silence bands according to the value of τ. In Fig. 3, this process is illustrated and labeled as D, E, and F. The efficient value of τ is defined in Section 3.6 based on the results of several experiments. In the next step, a mapping between the detected speech and silence bands and input samples constructs the content of the speech and silence arrays. Algorithm 1 shows the steps of this process.
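Under the assumption that Equation (1) computes per-sample power and Equation (2) is the standard min-max feature scaling, the signal classification component can be sketched as follows. The per-sample power definition and the function name are assumptions for illustration, not the paper's Algorithm 1 verbatim:

```python
def classify_signals(samples, tau):
    """Split samples into speech and silence bands by normalized power.

    Assumes per-sample power (squared amplitude) and min-max scaling;
    `tau` is the speech/silence threshold from the paper (0.15 performed best).
    """
    power = [float(s) ** 2 for s in samples]          # power array P (assumed form)
    p_min, p_max = min(power), max(power)
    norm = [(p - p_min) / (p_max - p_min) for p in power]  # min-max scaling
    speech = [s for s, p in zip(samples, norm) if p >= tau]   # speech band
    silence = [s for s, p in zip(samples, norm) if p < tau]   # silence band
    return speech, silence
```

The two returned arrays correspond to Sp and Si in Table 1 and feed the feature extraction component.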

Feature extraction
Feature extraction, i.e., the second component, receives two arrays containing speech and silence samples and calculates the PEAS values of the arrays.

PEAS feature
The PEAS feature is regarded as a discrimination factor between clean and stego samples. To calculate PEAS, all adjacent samples, from the first signal until the last one, are compared. If the values of the compared adjacent samples are equal, the counter of equal adjacent samples increases by 1. PEAS is the proportion of the value of the counter to the total number of the samples in the file. The formulas for calculating the PEAS values of speech and silence bands are presented in Equations (3) and (4), respectively. The concept of PEAS is illustrated in Fig. 4.
In addition, the pseudo-code given in Algorithm 2 shows the steps for calculating PEAS for speech and silence signals.
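A minimal sketch of the PEAS computation described above, counting equal adjacent pairs and dividing by the total number of samples as stated:

```python
def peas(samples):
    """Percent of Equal Adjacent Samples.

    Counts adjacent pairs with equal values and returns the counter as a
    percentage of the total number of samples, as described in the text.
    """
    if len(samples) < 2:
        return 0.0
    equal = sum(1 for a, b in zip(samples, samples[1:]) if a == b)
    return 100.0 * equal / len(samples)
```

The same function is applied once to the speech array and once to the silence array to obtain PEAS_Sp and PEAS_Si.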

Signal correlativity regression
Data embedding distorts the harmony among the signals of a recorded speech. Consequently, the PEAS values of stego signals differ from those of clean ones. In other words, the more data embedded, the greater the distortion of the natural harmony between adjacent samples. At low embedding ratios, the signal distortions are hardly perceptible; in contrast, at higher ratios, the distance increases, and stego signals are more accurately detectable. In PEAS speech steganalysis, the levels of distortion between adjacent samples are utilized as discrimination factors.

Instance classification
The third component, i.e., instance classification, classifies the analyzed sample as clean or stego. It receives the PEAS values from the feature extraction component and calculates their Gaussian membership degrees with respect to the ranges in the extracted clean and stego reference profiles. The outputs of the first subcomponent are five sets of Gaussian membership degrees obtained by comparing the PEAS values with five reference profiles. The second subcomponent calculates the average value for each pair of Gaussian membership degrees and passes them to the decision-making subcomponent. The final subcomponent selects the embedding ratio of the reference profile with the highest average membership degree.
Gaussian functions are statistical functions widely applied in mathematics and in signal and image processing; they describe the normal data distribution and help solve diffusion equations. The Gaussian membership function is a type of these functions, characterized by two parameters for classification tasks: the Mean (m) and the Standard Deviation (σ). This function maps each element to a value in the range of 0-1. This value, the so-called membership degree or membership value, quantifies the membership grade of each element in a fuzzy set.
To calculate the Gaussian membership degree for a given value, the two parameters m and σ are required. In this work, these parameters are taken from the corresponding training dataset. Since the given voice instances were examined in terms of speech and silence bands and with five embedding ratios, 10 sets of (m, σ) are required to construct a profile that covers the steganalysis of voice instances at all embedding ratios. The basic structure of the PEAS reference profile is given in Table 2. The extracted PEAS values of a given instance are compared with the profiles to determine whether it is stego. The Gaussian membership function is the chosen classifier in PEAS. The membership degree is calculated using Equation (5), where x is a variable holding the feature value, and G(x) is its membership degree. As the PEAS values for the speech and silence bands are the selected steganalysis features, by replacing the variable x in Equation (5) with PEAS_Sp and PEAS_Si, we obtain Equations (6) and (7), respectively.
Using Equations (6) and (7), two Gaussian membership degrees are produced. To determine a single value for decision making, the average of these membership values is computed by using Equation (8).
To achieve a final and unified equation, in Equation (8), G(x) is replaced with its equivalent in Equation (5). Equation (9) is the final equation, and it is a combination of the average Gaussian membership function and PEAS Sp and PEAS Si formulas. To obtain the final steganalysis result, five calculations are performed using Equation (9) with m and σ of PEAS Sp and PEAS Si in the five embedding ratios. The embedding ratio returning the highest average Gaussian membership degree defines whether the given sample is identified as clean or stego. Algorithm 3 presents the code for defining the steganalysis result using Equation (9).
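The decision procedure of Equations (5)-(9) can be sketched as follows. The profile values below are hypothetical placeholders for the trained (m, σ) pairs of Table 3; the actual profiles come from the training stage:

```python
import math

def gaussian_membership(x, m, sigma):
    """Equation (5) sketch: Gaussian membership degree of x, in [0, 1]."""
    return math.exp(-((x - m) ** 2) / (2 * sigma ** 2))

def classify(peas_sp, peas_si, profiles):
    """Return the embedding ratio whose profile yields the highest average
    Gaussian membership degree (Equations (8)-(9) sketch).

    `profiles` maps ratio -> (m_sp, sigma_sp, m_si, sigma_si); 0 means clean.
    """
    def avg_degree(params):
        m_sp, s_sp, m_si, s_si = params
        return 0.5 * (gaussian_membership(peas_sp, m_sp, s_sp)
                      + gaussian_membership(peas_si, m_si, s_si))
    return max(profiles, key=lambda ratio: avg_degree(profiles[ratio]))

# Hypothetical two-ratio profile: 0.0 = clean, 50.0 = 50% embedded.
profiles = {0.0: (1.0, 0.5, 2.0, 0.5), 50.0: (5.0, 0.5, 6.0, 0.5)}
```

A returned ratio of 0 identifies the instance as clean; any other ratio identifies it as stego at that embedding ratio.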

Database
In this research, the speech part of the Austeg dataset was used to train and evaluate the PEAS steganalysis algorithm. The speech part of this dataset consists of 18,000 noisy and noise-free speech instances in English, Farsi, and Chinese languages and three chunk lengths of 3, 6, and 10 s. The instances are in 44100 Hz, 16-bit mono WAV format. To produce the required stego instances, the Wavsteg tool in Python was used to embed the Austeg dataset instances by the LSB replacement algorithm in ratios of 12.5%, 25%, 37.5%, and 50%. The final dataset, including the stego copies, consisted of 90,000 instances in 0% (clean), 12.5%, 25%, 37.5%, and 50% embedding ratios.
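The LSB replacement used to produce the stego copies can be sketched generically as follows. This is an illustrative sketch of LSB replacement on integer PCM samples, not the Wavsteg tool's actual code:

```python
def lsb_embed(samples, bits):
    """Replace the least significant bit of each PCM sample with a message bit.

    Generic LSB-replacement sketch; embedding ratio is controlled by how many
    samples receive a bit (here, the first len(bits) samples).
    """
    stego = list(samples)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit   # clear LSB, then set it to the bit
    return stego
```

Embedding at, say, a 25% ratio means supplying a bit stream covering a quarter of the samples, which is what produces the graded distortion levels the PEAS feature detects.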

Training
As mentioned, 70% of the utterance instances in the database were used for training the classifier. Since the Gaussian membership function requires the parameters m and σ for the given classification tasks, these values were extracted from the speech and silence bands of clean and stego instances in the database. As shown in Table 3, the parameters are extracted from the samples at five embedding ratios: 0%, 12.5%, 25%, 37.5%, and 50%. Another important parameter in this method is the threshold for separating speech and silence signals. In the initial stage, the efficient value of the threshold is not clear, and it must be defined by performing multiple experiments over a range of thresholds. To this end, five sets of experiments were conducted corresponding to the threshold values 0.1, 0.15, 0.2, 0.25, and 0.3. Table 3 lists the sets of m and σ values obtained during the training process. According to the sensitivity and specificity values listed in Table 4, the best classifications are made when the threshold is set to 0.15; therefore, the sets of m and σ values corresponding to this threshold are shown in boldface in Table 3.

Table 3
Extracted values of m and σ for the speech and silence bands of sample voices in the training database, for five embedding ratios and five thresholds.
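Selecting the threshold with the highest average Sensitivity can be sketched as follows. The sensitivities for τ = 0.15 are the values this paper reports for the GMF classifier; the other rows are hypothetical placeholders:

```python
# Sensitivities (%) at embedding ratios 12.5%, 25%, 37.5%, 50% per threshold.
# Only the 0.15 row uses the paper's reported values; other rows are hypothetical.
results = {
    0.10: [70.0, 80.0, 90.0, 95.0],
    0.15: [78.36, 81.4, 93.74, 99.82],
    0.20: [75.0, 80.0, 91.0, 97.0],
}

# Pick the threshold whose sensitivities have the highest average.
best_tau = max(results, key=lambda t: sum(results[t]) / len(results[t]))
```

With these numbers, τ = 0.15 wins, matching the threshold chosen in the paper.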

Evaluation
The most important metric for evaluating a steganalysis algorithm is its capability to detect stego samples, and this capability is measured by the Sensitivity metric. In simple words, Sensitivity is the ratio of correctly detected stego samples to the total number of stego samples; it can be computed using Equation (10). In addition, a steganalysis algorithm should have a reliable performance in detecting clean samples to avoid false alarms. Specificity is the metric that measures this property, and it can be calculated using Equation (11).
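Assuming Equations (10) and (11) take the standard confusion-matrix forms implied by the definitions above, the two metrics reduce to:

```python
def sensitivity(tp, fn):
    """Equation (10), assumed standard form: TP / (TP + FN).

    Correctly detected stego samples over all stego samples.
    """
    return tp / (tp + fn)

def specificity(tn, fp):
    """Equation (11), assumed standard form: TN / (TN + FP).

    Correctly detected clean samples over all clean samples.
    """
    return tn / (tn + fp)
```

For example, detecting 9,982 of 10,000 stego instances gives the paper's 99.82% sensitivity figure.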

Performance evaluation
The proposed PEAS speech steganalysis was implemented using different variations of the Support Vector Machine (SVM) and a Gaussian membership function. The SVM was implemented using linear, polynomial (scale mode), and Gaussian (auto and scale modes) kernels.
Considering the implementation results, in terms of the sensitivities listed in Table 4, the SVM achieves very high detection rates, even for the lowest embedding ratio. However, when the specificity is considered, all SVM implementations perform very poorly. In practice, the majority of the given instances are clean, so the large number of false alarms generated by the method makes it an inapplicable solution.
In the steganalysis process, the main goal is to detect the stego instances. Therefore, Sensitivity is assigned greater priority than Specificity. To choose the most efficient classifier and an optimized threshold, the average of the sensitivities resulting from the classifiers is calculated for each threshold. For each classifier in Table 4, the row that returns the highest average Sensitivity is shown in boldface.
From the results listed in Table 4, the Specificities of the chosen SVM implementations in bold fluctuate between 8.4% and 35.59%. Considering this range, even the highest SVM-based Specificity value is not good enough for developing a reliable and applicable steganalyzer. When the Gaussian membership function is used as a classifier, it does not outperform the SVM-based implementations in terms of Sensitivity; however, in terms of Specificity, there is a huge gap between them. For a threshold of 0.15, the Gaussian membership function has a very reliable Sensitivity while also achieving a Specificity of 81.2%. Therefore, this classifier is chosen as the PEAS steganalysis detection engine, and the most efficient threshold is defined as 0.15.

Discussion
To measure in which aspects PEAS steganalysis outperforms the related works, the algorithms are compared in terms of specificity, sensitivity, and the number of analysis dimensions in Table 5. It is notable that some related works were not evaluated at exactly the same embedding ratios; thus, for some embedding ratios, they either have no sensitivity value or the actual ratio is slightly different.
When the algorithms are compared according to processing load, PEAS yields the best performance, as it has only one analysis dimension. The works in Refs. [12,27] involve a large number of dimensions, and the dimensions reported by Tian et al. [28], Miao et al. [29], and Jayasree and Amritha [30] are at least twice those of PEAS. Note that none of the compared works reported a specificity value. Referring to Table 4, if this criterion is ignored, an implementation of the PEAS algorithm based on the SVM polynomial kernel raises the average Sensitivity to 97.702%, with a starting value of 95.28% for an embedding ratio of 12.5%. However, there is a tradeoff with Specificity, which decreases to 8.4%; in other words, a large number of false alarms are produced.
According to the Sensitivity results reported in Table 5, PEAS outperforms [30] by more than 23% for the lowest embedding ratio of 12.5%. No other work has reported a sensitivity for this ratio.
For an embedding ratio of 25%, [29] reported a higher sensitivity, and for a ratio of 37.5%, none of the other studies listed in Table 5 has given any value. Similar to the lowest ratio, for 50% embedding, PEAS clearly outperforms all the related works with a sensitivity value of 99.82%. This value is even higher than the sensitivities reported in Refs. [12,27,28] for full-capacity embedding. In conclusion, the comparisons show that PEAS outperforms the related works for most of the embedding ratios.

Conclusion
This paper proposes a novel speech steganalysis algorithm based on PEAS for detecting hidden secrets in the speech signals of healthcare IoT communications. The algorithm can be expressed in four general steps: normalizing the power of the given signal, allocating the signals to speech or silence classes, extracting the PEAS feature from both speech and silence signals, and finally classifying the given speech instance as clean or stego using a Gaussian Membership Function (GMF) classifier. The algorithm was evaluated in terms of Specificity and Sensitivity. The results show a Specificity of 81.2% and Sensitivities of 78.36%, 81.4%, 93.74%, and 99.82% for embedding ratios of 12.5%, 25%, 37.5%, and 50%, respectively. In terms of sensitivity, PEAS steganalysis outperforms the compared related works at three embedding ratios (12.5%, 37.5%, and 50%) with a single analysis dimension. In contrast, the number of analysis dimensions in the compared works varies between 2 and 281.