Modulation and performance of synchronous demodulation for speech signal detection and dialect intelligibility

Abstract: Speech processing is one of the fundamental operations in computer science, and it is particularly difficult to process and distinguish speech in different Arabic dialects when background noise is present. Communication skills are crucial in any nation. For the typical person, making a phone call or leaving a voicemail takes no more than pushing a button, but for telecommunications experts the process is very different, because they understand how communication actually works. The terms detection and demodulation are both commonly used when addressing the full demodulation process; the procedures and circuits are substantially the same under both designations. As the name implies, demodulation is the opposite of modulation, which applies a signal, such as an audio signal, to a carrier. The demodulation process recovers the audio or other signal that was transmitted as amplitude variations on the carrier. In this study, a system for distinguishing speech signals was developed that uses modulation and demodulation to transmit speech and to separate it from a variety of interfering factors, the most significant of which are background noise and the wide variety of dialects, which poses a significant challenge in speech processing. The proposed system was applied to a dataset created from a group of voices in different dialects. The noise accompanying the voices was first removed, and the voices were then processed with modulation and demodulation to distinguish the dialect. The system proved effective in distinguishing dialects.


Introduction
Frequency modulation (FM) is a nonlinear method of encoding information on a carrier wave. It is used in many different applications, including telemetry, seismic prospecting, and interferometry, each with its own signal statistics, which are largely determined by the underlying generating process. Its most widespread application, though, is radio transmission, which is frequently used to carry audio signals that represent voices [1].
Many different types of distortion, noise, and other impairments can affect a communication channel. Once a critical threshold is crossed, these impairments sharply reduce FM demodulator performance. As a result, the intelligibility and quality of the detected speech decline dramatically; this occurrence is known as the "threshold effect" [2].
The acoustic speech signal reflects both the excitation (source) signal and the vocal tract transfer function (VTTF). In some applications, it is critical to estimate the VTTF precisely and to eliminate fluctuations caused by a changing fundamental frequency, or pitch. For instance, feature extraction for automatic speech recognition (ASR) frequently uses the VTTF [3]. Linear prediction coding (LPC) analysis estimates the VTTF with an all-pole model. However, background noise can interfere with LPC-based features. Similarly, a smoothed or fast Fourier transform spectrum is susceptible to background noise. In this study, we present a noise-resistant method for estimating the speech spectrum envelope, which carries information about the VTTF. The method resembles amplitude demodulation in the frequency domain.
In modulation, an audio signal varies the amplitude, frequency, or phase of a sinusoidal signal. In audio processing, modulation methods are often applied with extremely low-frequency sinusoids. In particular, changing the control settings of filters or delay lines can be regarded as amplitude modulation (AM) or phase modulation (PM) of the audio signal. Wah-wah, phaser, and tremolo are common examples of AM and PM audio effects [4]. To convey the potential of modulation techniques, we first explain basic methods for AM, single-sideband modulation, and PM and highlight their use in audio effects. As a number of examples will show, these modulators can be used to create more complex digital audio effects. A later section discusses several demodulators that isolate the input signal or its properties for further effect processing [5].
This study uses the word "modulation" differently from some other work. For instance, the "modulation spectrum" removes components with a high rate of change by applying low-pass filters to the time trajectory of the spectrum. In this article, a system was developed that uses modulation and demodulation to transmit speech signals and to separate the speech from interfering factors, the most significant of which are background noise and the variety of dialects, which represents a major challenge in speech processing. The article is structured as follows: related work in Section 2, modulation and demodulation overview in Section 3, the proposed approach in Section 4, results and discussion in Section 5, and finally, the conclusion in Section 6.

Related work
Decoding radio transmissions and enhancing speech are typically seen as two distinct problems. However, effective signal estimation algorithms are typically developed from the statistical characteristics of the measurement process together with a prior statistical or deterministic model of the signals to be reconstructed. This is frequently a challenging problem that lacks an analytical solution and can only be partly resolved using oversimplified models of noise and signal. On the other hand, any signal estimate may be thought of as a nonlinear mapping from the input data to the intended output. Given a method for approximating universal functions, such a mapping can be learned from a collection of training examples, that is, pairs of modulated baseband input signals and the desired audio output signals.
Zhu and Alwan [6] proposed a technique that resembles amplitude demodulation in the frequency domain and examined its use for ASR. Speech production can be viewed as AM in which the source (excitation) spectrum is the carrier and the VTTF is the modulating signal. From this perspective, amplitude demodulation can be used to retrieve the VTTF. Amplitude demodulation of the speech spectrum was performed with a nonlinear method that carries out envelope detection by using the harmonic amplitudes and ignoring the inter-harmonic dips. The approach was noise resistant because low-energy frequency regions were ignored. The detected envelope was reshaped using the same principle. The method was then used to build an ASR feature extraction module, which was shown to outperform Mel-frequency cepstral coefficients in the presence of additive noise. Peak isolation was also performed, which increased recognition accuracy even further [6].
Ngo et al. [7] suggested a data masking strategy for AM broadcasting systems. The data masking system operates in the AM domain and uses the cochlear delay (CD) digital acoustic watermarking approach that the authors had presented earlier. They examined the viability of sending additional inaudible messages in AM signals using a CD-based inaudible watermark technique. The suggested method modulates the carrier signal with a new duplex modulation, which adds the original and watermarked signals as the lower and upper sidebands, and then sends the modulated signal to the receivers. Special receivers in the proposed technique use dual demodulation to obtain both the original signals and the watermarked signals from the received signals, and the messages are then extracted from the watermarked signal and the original signal using the CD-based watermark. Their computer simulations showed that the suggested approach can send messages as watermarks in AM signals and that the messages can be successfully extracted from the observed AM signals. The results also showed that both the suggested technique and traditional AM radio systems could maintain good sound quality for the demodulated transmissions. This indicates that the suggested method has the ability to function as a daemon transmitter and also has low-level AM radio system compatibility. Emergency warning systems and heavily trafficked AM radio services can both employ the suggested method [7].
Elbaz and Zibulevsky [8] proposed a software-defined-radio receiver for FM demodulation that takes an end-to-end learning-based approach and exploits prior information about the spoken message in the demodulation process. The baseband version of the receiver's in-phase and quadrature components was used to detect and enhance speech. The new system was expected to outperform the existing one by providing high-performance detection under both acoustic disturbances and communication channel noise. The comparison with established techniques under low signal-to-noise ratio (SNR) conditions was evaluated in terms of mean square error and perceptual speech quality scores [8].
Orimoto et al. [9] offer a signal detection technique that de-noises real voice signals using Bayesian estimation and bone-conducted speech. More precisely, a new noise removal algorithm was derived theoretically from Bayes' theorem, based on observations of air-conducted speech contaminated by ambient background noise. In the suggested speech detection approach, the bone-conducted speech was employed to estimate the speech signal accurately. The efficacy of the suggested technique was demonstrated empirically by applying it to air-conducted and bone-conducted speech recorded in a real setting with background noise.
Tong et al. [10] carried out a multilingual investigation (Chinese and English). All of the data gathered demonstrated that the graphite speech sensor had adequate sensitivity to extract acoustic wave parameters. At the same time, the proposed cylindrical structure with microsurfaces decreased the likelihood of the graphene layer breaking randomly. Additionally, a neural network was trained for speech identification using voice data obtained from a microphone and from a graphene elastic sensor. The dataset blended with the vocal cord speech signals reached 75.9% recognition accuracy. The comparison demonstrated that the signals picked up by the sensor contained sufficient unique information to carry out the speech recognition tasks.

Modulation and demodulation overview
In this section, modulation and demodulation will be discussed along with their uses for differentiating, changing, and extracting speech free of background noise.

Speech signal modulation
In FM, a sinusoidal carrier is modulated with an information message $x_m(t)$ [11]:

$$y(t) = A_c \cos\left(2\pi f_c t + 2\pi f_\Delta \int_0^t x_m(\tau)\,d\tau\right),$$

where $x_c(t) = A_c \cos(2\pi f_c t)$ is the sinusoidal carrier, $f_c$ is the carrier's base frequency, $A_c$ is its amplitude, and $f_\Delta$ is the frequency deviation, which denotes the greatest shift away from the carrier's base frequency. Here, $x_m(t)$ is the data signal, which is often a voice signal.
Model of noise: A number of impairments can affect the signal during transmission and reception. On the receiving side, the demodulator's job is to reconstruct the original signal from the received signal as reliably as possible, overcoming the impairments introduced during transmission and reception. As noted previously, the message signal experiences a number of distortions [11], which cause two kinds of signal impairment:
A. Phase noise: Impairments resulting from external factors, such as audio distortions, are converted by the FM operation from their original additive audio form into phase noise [12].
B. Amplitude noise: Deterioration caused by the communication channel, such as convolution with the channel response, multipath, and additive noise arising from the propagation characteristics of the channel environment, results in additive amplitude noise, $r(t) = y(t) + n(t)$, where $n(t)$ is the amplitude noise [11].
In communication systems, each of the noise models mentioned above is usually assumed to follow a white Gaussian statistical model. Figure 1 shows a schematic of the communication system and its components for clarity.
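As a rough illustration of the FM signal model and the additive white Gaussian amplitude noise described above, the following sketch generates a frequency-modulated tone and adds noise at a chosen SNR. The function names, the rectangular-rule integration, and all numerical values (sampling rate, carrier, deviation, SNR) are assumptions made for this example and are not taken from the study; the code uses only the Python standard library.

```python
import math
import random


def fm_modulate(message, fs, fc, f_delta, a_c=1.0):
    """Return samples of y(t) = A_c cos(2*pi*f_c*t + 2*pi*f_delta * integral of x_m)."""
    samples = []
    integral = 0.0  # running rectangular-rule approximation of the integral of x_m
    for n, x in enumerate(message):
        integral += x / fs
        phase = 2.0 * math.pi * (fc * n / fs + f_delta * integral)
        samples.append(a_c * math.cos(phase))
    return samples


def add_white_gaussian_noise(signal, snr_db, rng):
    """Return r(t) = y(t) + n(t), with n(t) white Gaussian noise at the requested SNR."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


# Illustrative values only (hypothetical, not from the paper).
FS = 100_000.0     # sampling rate in Hz
FC = 20_000.0      # carrier frequency in Hz
F_DELTA = 5_000.0  # peak frequency deviation in Hz
message = [math.sin(2.0 * math.pi * 1_000.0 * n / FS) for n in range(2000)]

y = fm_modulate(message, FS, FC, F_DELTA)
r = add_white_gaussian_noise(y, snr_db=10.0, rng=random.Random(0))
```

The message is assumed to be normalised to the range [-1, 1], so that `F_DELTA` really is the largest excursion from the carrier frequency.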

Speech signal demodulation
Our objectives are to estimate the VTTF and to eliminate pitch-related information. This requires demodulation of the voice spectrum in the frequency domain [13]. Coherent demodulation, which is used in FM radio, for instance, requires recovery of the carrier signal. Incoherent demodulation, on the other hand, uses envelope detection with a rectifier and a low-pass filter. Figure 2 shows the incoherent demodulation process in the time domain after full-wave rectification of the modulated signal. We will use a similar approach but carry it out in the frequency domain, where the resulting spectrum is "harmonically demodulated" [14].
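The time-domain incoherent demodulation described above (full-wave rectification followed by low-pass filtering) can be sketched as follows. This is a minimal illustration under stated assumptions: the first-order smoothing filter, the test tone, and every numerical value are hypothetical choices for the example, and the frequency-domain variant used in the study is not reproduced here.

```python
import math


def full_wave_rectify(signal):
    """Full-wave rectifier: take the absolute value of every sample."""
    return [abs(s) for s in signal]


def one_pole_lowpass(signal, fs, cutoff_hz):
    """First-order IIR low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    out, state = [], 0.0
    for x in signal:
        state += alpha * (x - state)
        out.append(state)
    return out


def envelope_detect(signal, fs, cutoff_hz):
    """Incoherent demodulation: rectify, smooth, and rescale.

    The mean of |cos| is 2/pi, so multiplying by pi/2 restores the envelope scale.
    """
    smoothed = one_pole_lowpass(full_wave_rectify(signal), fs, cutoff_hz)
    return [0.5 * math.pi * v for v in smoothed]


# Illustrative AM test signal (hypothetical values).
FS = 48_000.0   # sampling rate in Hz
FC = 7_000.0    # carrier frequency in Hz
F_MSG = 100.0   # message frequency in Hz
DEPTH = 0.5     # modulation depth
am = [(1.0 + DEPTH * math.cos(2.0 * math.pi * F_MSG * n / FS))
      * math.cos(2.0 * math.pi * FC * n / FS) for n in range(2400)]

envelope = envelope_detect(am, FS, cutoff_hz=500.0)
```

After the initial filter transient, `envelope` follows `1 + DEPTH * cos(...)` with a small residual carrier ripple; a sharper low-pass filter would reduce that ripple at the cost of more computation.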
Different types of FM detectors and demodulators are available. In the past, when radios were composed of discrete devices, some types were more common. However, today, the phase-locked loop (PLL)-based detector and quadrature/coincidence detectors are the most extensively employed since they are the easiest to incorporate into integrated circuits and require few, if any, changes [15].
Normally, the intermediate-frequency (IF) stage is operated so that the IF amplifier is driven into limiting in order to improve the FM receiver's noise performance. This removes the amplitude variations that carry noise, leaving only the frequency variations.
The following list includes common FM demodulators used in walkie-talkies, portable radios, radio communication systems, broadcast receivers, etc. [16,17]:
A. Slope detection: This is a very basic type of FM demodulation that derives its demodulation from the receiver's selectivity. Because it is not very effective, it is employed only when the receiver lacks dedicated FM functionality. This method of FM detection has a great number of drawbacks, including the receiver's sensitivity to amplitude fluctuations and the complete nonlinearity of the radio's selectivity curve.
B. Ratio detector: This form of detector was extensively used when transistor radios employed discrete components. The ratio detector required a transformer with a third winding in order to provide an additional phase-shifted signal for the demodulation procedure. It also used two diodes, a few resistors, and capacitors. Although it worked effectively, the ratio FM detector was a costly type of detector because of its transformer, since all wound components are more expensive than resistors and capacitors. Following the development of integrated circuit technology, when other circuits could be employed, the ratio detector was rarely used. Nevertheless, it did well in its day.
D. PLL FM demodulator: A benefit of the PLL FM demodulator is that it can be readily integrated, which lowers the overall cost of the receiver chip and, ultimately, of the radio receiver. The PLL FM demodulator works because the PLL follows the instantaneous frequency of the incoming FM signal. The voltage-controlled oscillator inside the loop has to track the incoming signal's frequency in order to keep the loop locked. The tuning voltage of the voltage-controlled oscillator therefore fluctuates in accordance with the instantaneous frequency of the signal and provides the demodulated output of the audio or other modulation signal.
E. Quadrature detector: FM radio IFs now frequently employ the quadrature FM detector. It offers exceptional levels of performance and is simple to implement. The quadrature (coincidence) version of an FM demodulator can be built into an integrated circuit very easily and for essentially no extra cost. Because of this, it is a particularly appealing solution for contemporary receiver designs. A quadrature/coincidence detector is often included in integrated circuits that are intended to perform the functions of a complete receiver or an IF strip. As a result, FM demodulation may be added to the final receiver for almost no additional cost.
These FM demodulators have several uses. Depending on the application, such as broadcast reception, two-way radio communications, portable radios and walkie-talkies, or high-end communication receivers, designers can choose from a variety of FM demodulator types [13].
PLL-based FM detectors and quadrature detectors are the most widely used detectors. Although occasionally still in use, the Foster-Seeley and ratio FM detectors are mostly found only in older radios that employed discrete components.

The proposed approach

System dataset
FM stations require large frequency deviations to enable high-quality audio broadcasts. In accordance with US FM broadcasting regulations, the peak deviation is set to 75 kHz, while the output signal's sampling frequency is set to 240 kHz. The modulating audio signal's default sampling frequency is 48 kHz. For these reasons, the training set was created using Matlab FM modulation with the minimum requirements above. These system constraints determine the number of baseband samples that the modulator generates for each audio sample at its input (five in-phase and five quadrature). We presume that the conversion from IF to baseband is handled by additional digital or analog hardware, so that the FM pass-band signal does not have to be processed directly.
Synchronous detection, also known as heterodyning the signal down to baseband, is the conversion process that is often carried out in the analog front end. Converting the high-frequency signal to baseband reduces the demodulator's computational needs, because the signal can then be processed conveniently at a sampling rate lower than the original carrier frequency.
The continuous speech set was used to create the sound waveforms for our research. Each utterance in the dataset is represented by a 16-bit, 16 kHz waveform file. The speakers were male Arabic speakers who spoke several dialects. We sampled the in-phase and quadrature components of the baseband signal as the two features for the system input. The waveforms were resampled to 48 kHz in accordance with the Arabic standard specification.
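To make the sample-rate relationship above concrete, the sketch below produces five in-phase and five quadrature baseband samples per 48 kHz audio sample, using the 75 kHz peak deviation and 240 kHz baseband rate quoted in the text. Everything else is an assumption for illustration: the zero-order-hold upsampling, the function names, and the phase-difference (polar) discriminator used to check the round trip are not described in the study, whose data were generated with Matlab.

```python
import cmath
import math

AUDIO_FS = 48_000          # audio sampling rate in Hz (from the text)
BASEBAND_FS = 240_000      # baseband sampling rate in Hz (from the text)
PEAK_DEVIATION = 75_000.0  # peak frequency deviation in Hz (from the text)
SAMPLES_PER_AUDIO = BASEBAND_FS // AUDIO_FS  # five I/Q pairs per audio sample
PHASE_STEP = 2.0 * math.pi * PEAK_DEVIATION / BASEBAND_FS


def fm_baseband_modulate(audio):
    """Return a list of (I, Q) pairs; audio samples are assumed to lie in [-1, 1]."""
    iq, phase = [], 0.0
    for x in audio:
        for _ in range(SAMPLES_PER_AUDIO):  # zero-order hold (a simplification)
            phase += PHASE_STEP * x         # integrate frequency into phase
            iq.append((math.cos(phase), math.sin(phase)))
    return iq


def fm_baseband_demodulate(iq):
    """Recover the audio from the phase difference between consecutive I/Q samples."""
    steps = []
    previous = complex(1.0, 0.0)  # the modulator starts from phase 0
    for i, q in iq:
        current = complex(i, q)
        steps.append(cmath.phase(current * previous.conjugate()) / PHASE_STEP)
        previous = current
    # Average each group of five baseband samples back to one audio sample.
    return [sum(steps[k:k + SAMPLES_PER_AUDIO]) / SAMPLES_PER_AUDIO
            for k in range(0, len(steps), SAMPLES_PER_AUDIO)]


audio = [0.8 * math.sin(2.0 * math.pi * 1_000.0 * n / AUDIO_FS) for n in range(96)]
iq = fm_baseband_modulate(audio)
recovered = fm_baseband_demodulate(iq)
```

Because the largest phase step per baseband sample is about 1.96 rad, which is below π, the phase difference is unambiguous and the noiseless round trip returns the input audio.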

Speech signal modulation and demodulation
The ability to communicate is essential in any country. For the average person, speaking on the phone or leaving a voicemail needs only the push of a button, but for telecommunication engineers the situation is completely different, because they know how communication actually works. Modulation is the process by which information is carried on a separate carrier signal. When discussing the entire demodulation procedure, the terms detection and demodulation are both frequently used; the two names essentially refer to the same procedure and circuit.
Demodulation, as the name suggests, is the reverse of modulation, which involves applying a signal, such as an audio signal, to a carrier.
The demodulation procedure recovers the audio or other signal conveyed by amplitude changes on the carrier and delivers it as the output signal.
AM is frequently used in audio applications, so the demodulated output is usually an audio signal. It is used for broadcast entertainment, for two-way radio communications such as ground communications related to aviation, and frequently within walkie-talkies.
Although there are many other forms of modulation, we use AM in this work. The work proceeds in the following steps:
• Take the sound from the dataset, remove undesirable frequencies from the voice signal by filtering it, apply modulation, and create a demodulation filter.
• Pass the signal through the filter, apply the demodulation procedure, and apply a low-pass filter.
• Apply modulation, create a demodulation envelope detector, and apply the demodulation procedure.
• Apply a low-pass filter, plot time-domain graphs, and Fourier transform the signals for frequency analysis.
• Shift the origin of the frequency axis and plot frequency-domain graphs.
Figure 3 shows a flowchart of the proposed system.
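The steps listed above can be sketched end to end as follows. This is only an illustration under stated assumptions, not the Matlab implementation used in the study: a synthetic 500 Hz tone with a 6 kHz interferer stands in for a voice recording, the filters are Hamming-windowed sinc designs, demodulation is done by synchronous (product) detection rather than with an envelope detector, and the sampling rate, carrier, cutoff, and modulation depth are hypothetical values. Plotting is omitted; the centred spectrum is returned as plain lists.

```python
import cmath
import math


def lowpass_taps(cutoff_hz, fs, num_taps=101):
    """Hamming-windowed sinc low-pass FIR taps with unit gain at 0 Hz."""
    centre = (num_taps - 1) // 2
    fc = cutoff_hz / fs
    taps = []
    for n in range(num_taps):
        k = n - centre
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        taps.append(ideal * (0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))))
    total = sum(taps)
    return [t / total for t in taps]


def fir_filter(signal, taps):
    """Convolve with symmetric taps, compensating the filter delay (zero-padded edges)."""
    half = len(taps) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for j, tap in enumerate(taps):
            i = n + half - j
            if 0 <= i < len(signal):
                acc += tap * signal[i]
        out.append(acc)
    return out


def am_modulate(message, fs, fc, depth):
    """Return (1 + depth * x[n]) * cos(2*pi*fc*n/fs)."""
    return [(1.0 + depth * x) * math.cos(2.0 * math.pi * fc * n / fs)
            for n, x in enumerate(message)]


def synchronous_demodulate(received, fs, fc, depth, taps):
    """Mix with an in-phase local carrier, low-pass, then undo the offset and depth."""
    mixed = [2.0 * r * math.cos(2.0 * math.pi * fc * n / fs) for n, r in enumerate(received)]
    return [(b - 1.0) / depth for b in fir_filter(mixed, taps)]


def centred_spectrum(signal, fs):
    """Naive DFT magnitudes with the 0 Hz bin moved to the centre of the axis."""
    size = len(signal)
    mags = [abs(sum(x * cmath.exp(-2j * math.pi * k * n / size)
                    for n, x in enumerate(signal))) / size for k in range(size)]
    split = (size + 1) // 2
    freqs = [(k - size // 2) * fs / size for k in range(size)]
    return freqs, mags[split:] + mags[:split]


FS, FC, DEPTH = 48_000.0, 10_000.0, 0.5  # hypothetical values
clean = [math.sin(2.0 * math.pi * 500.0 * n / FS) for n in range(960)]
noisy = [c + 0.5 * math.sin(2.0 * math.pi * 6_000.0 * n / FS) for n, c in enumerate(clean)]

taps = lowpass_taps(2_000.0, FS)
filtered = fir_filter(noisy, taps)                     # remove unwanted frequencies
modulated = am_modulate(filtered, FS, FC, DEPTH)       # apply AM
recovered = synchronous_demodulate(modulated, FS, FC, DEPTH, taps)
freqs, mags = centred_spectrum(recovered[240:720], FS)  # frequency analysis
peak_hz = abs(freqs[mags.index(max(mags))])
```

Away from the first and last 50 samples, where the zero-padded filter edges distort the output, `recovered` closely follows `clean`, and the centred spectrum peaks at ±500 Hz. A real implementation would normally use an FFT library instead of the naive DFT shown here.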

Results and discussion
This section explains the most important results obtained. The dataset consisted of Arabic speech in several dialects, with varied speech speed and with or without background noise. The suggested system was implemented in Matlab, and code was written for three different dialect sounds.
The modulated signal varied with time according to the dialect and volume of the signal (Figure 4). In terms of modulation, interference between low-amplitude frequencies had the least impact on the final extracted signals. The high-amplitude frequencies carried the majority of the sound characteristics, and there was very little interference between them. As a consequence, the interference had no impact on our final results. These results are better than those reported by Francis [18].
Three sounds with different dialects were examined in this work. Figure 5 shows the measured magnitude in dB and the variation of the amplitude with frequency. The magnitude reached a high value of $2.5 \times 10^{-3}$ at 5 kHz and decreased at higher frequencies. When there was no phase shift during demodulation (φ = 0), as in sound (2), the demodulated signals were identical to the original signal (Figure 5). In sound (3), the demodulated (voice) signals are shifted by 10°, 30°, or more, which attenuates them and makes them weaker than the original signals. When the shift is 90°, the voice signals are severely weakened and there appears to be no output speech. As noted above, interference between low-amplitude frequencies had the least effect on the final extracted signals in terms of modulation. The majority of sound characteristics are found in the high-amplitude frequencies, and there was little overlap between them. Therefore, the interference had no effect on our final results.
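The attenuation with increasing phase shift is what the standard product-detector identity predicts. As a sketch, assume (these symbols are introduced here for illustration and do not appear in the original text) an AM signal $s(t) = A_c[1 + m\,x(t)]\cos(2\pi f_c t)$ with modulation depth $m$ and voice signal $x(t)$, and a local oscillator $2\cos(2\pi f_c t + \varphi)$ whose phase differs from the carrier by $\varphi$:

$$
\begin{aligned}
s(t)\cdot 2\cos(2\pi f_c t + \varphi) &= A_c[1 + m\,x(t)]\,\bigl[\cos\varphi + \cos(4\pi f_c t + \varphi)\bigr], \\
v(t) &= A_c[1 + m\,x(t)]\cos\varphi \quad \text{(after low-pass filtering)}.
\end{aligned}
$$

Under these assumptions the recovered signal is scaled by $\cos\varphi$: the scale factor is 1 for $\varphi = 0$, about 0.985 for $\varphi = 10^\circ$, about 0.866 for $\varphi = 30^\circ$, and 0 for $\varphi = 90^\circ$, which is consistent with the identical output at zero phase shift and the absence of output speech at 90° reported above.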

Conclusion
Speech processing is one of the main operations in computer science, and processing and distinguishing speech in different Arabic dialects is a particular challenge, both because of the difficulty of doing so when background noise is present and because of the abundance and diversity of Arabic dialects. The Arabic dialects were classified by building a system that relies on modulation and demodulation in two stages: the first extracts the sound characteristics, and the second distinguishes those characteristics according to the proposed system. Speech processing and accent recognition can be improved by introducing deep learning and artificial intelligence (AI) techniques such as neural networks, fuzzy logic, and other AI technologies. Future work will examine more dialects with different frequencies and determine the SNR.

Conflict of interest: The authors state no conflict of interest.

Data availability statement: Most datasets generated and analyzed in this study are within this manuscript. The other datasets are available on reasonable request from the corresponding author with the attached information.