A rough set theory and deep learning based predictive system for gender recognition using audio speech

Speech is one of the most natural media through which the gender of a speaker can be identified. Although related research has shown very good progress with machine learning, deep learning has recently opened a promising research area for addressing the deficiencies of gender discrimination using traditional machine learning techniques. In deep learning techniques, the speech features are learned automatically from the raw data and often have more discriminating power than human generated features. In some practical situations such as gender recognition, however, it is observed that a combination of both types of features sometimes provides comparatively better performance. In the proposed work, we first extract and select informative and precise acoustic features relevant to gender recognition using entropy based information theory and Rough Set Theory (RST). Next, the audio speech signals are fed directly into a deep neural network model consisting of a Convolution Neural Network (CNN) and a Gated Recurrent Unit network (GRUN) to extract features useful for gender recognition. The RST selects precise and informative features, the CNN extracts the locally encoded important features, and the GRUN reduces the vanishing gradient and exploding gradient problems. Finally, a hybrid gender recognition system is developed by combining both generated feature vectors. The developed model has been tested with five benchmark datasets and a simulated dataset to evaluate its performance, and it is observed that the combined feature vector provides a more effective gender recognition system, especially when transgender is considered as a gender type together with male and female.


Introduction
Gender recognition from speech and images has always remained a challenging task. It is a common requirement in many areas, including health care, forensic laboratories, and industry. Both speech and images are important data for identifying gender. Speech is a medium through which gender can be easily identified. It is a physiological signal that represents information at multiple levels, such as linguistic content (language, words, accent, etc.) and paralinguistic content (gender, age, emotion, etc.). Besides this, speech also carries important information about the acoustic nature of sound. As technology advances and voice-driven electronic devices and assistants (such as Google Assistant on mobile phones and Alexa) are in peak market demand, the market for speech and audio analytic tools keeps growing. Research is ongoing not only in linguistic areas [1], such as extracting messages and working on words, but also in paralinguistic areas such as automatic speaker identification [2] and emotion analysis [3], [4] from speech. This area has a wide range of applications, including the telecom industry. Gender has so far been treated as having only two types, male and female, with no place given to transgender. In the current era, transgender people are also gaining rights and are being recognized as a distinct gender. Plenty of work has been done on male and female gender identification from speech using machine learning [5], [6], [7], and the techniques are still being enhanced for better performance. Traditional supervised and unsupervised machine learning techniques generally use higher level features extracted by humans from the speech for categorization of speakers. These higher level features may not be sufficient for optimal categorization. The lower level features extracted using various deep neural networks [8], [9] are more effective for this purpose. In this paper, we have explored two different deep neural network models, namely, Convolution Neural Network (CNN) [10] and Gated Recurrent Unit Network (GRUN) [11], and have demonstrated that they perform better than the traditional machine learning approaches.

Motivation
Deep learning is a subset of machine learning in which successive layers of a network provide different interpretations of the input. The main benefit of a deep network is that it does not require higher level structured data for classification; rather, it passes the raw input data through its different layers. Each level in the hierarchy of layers defines a specific set of features, much as the human brain solves a problem hierarchically by passing a query on to related concepts. After processing the data within the different layers of the deep neural network, the system computes the appropriate identifiers for classification of the data. Gender recognition has many applications, such as improving the intelligence of a surveillance system, analyzing customers' demands for store management, allowing robots to perceive gender, and so on. Although many works have been introduced for gender identification using deep learning, transgender has not been considered as a gender in most of them. As transgender speakers are very difficult to distinguish from male and female speakers based on speech, the concepts of rough set theory and information theory are very helpful for distinguishing them from the other gender classes. This motivates us to propose a hybrid model integrating Rough Set Theory (RST) and deep neural networks, namely CNN and GRUN, to select the minute features from speech which can differentiate transgender, male and female speakers more effectively. Thus we extract features in different forms which are complementary to each other. The classification model is learned using this multi-view dataset to make full use of the hidden information. The RST selects precise and informative human extracted features, which are very important for distinguishing transgender from the others; the deep learning model, on the other hand, extracts the locally encoded important features with the help of the CNN and captures long-term dependencies with the help of the GRU.

Literature Survey
Gender recognition from speech has been a well known topic among researchers for decades, and plenty of research has been carried out on this problem. Over time, the techniques used to develop gender recognition systems have improved to enhance performance. Earlier, much research on gender recognition [12], [13], [14] was done using machine learning, data mining and pattern recognition. Bisio et al. [15] presented a gender driven emotion recognition system from audio signals allowing effective human-computer intelligent interaction. Their system consists of two subsystems, one for gender recognition and the other for emotion recognition. Gender recognition is based on pitch extraction from the audio signal, and a Support Vector Machine (SVM) based emotion classification model is developed. Zeng et al. [16] proposed a Gaussian Mixture Model (GMM) based approach for gender classification that applies the combined parameters of speech and relative spectral perceptual linear predictive coefficients to model the characteristics of male and female speech. The model provides very good accuracy even when considerably noisy speech is considered.
The works in [17], [18] proposed speaker, age and gender recognition systems using acoustic and prosodic level features. They also adopted the GMM approach to train the classification model. Yasmin et al. [19] proposed a new system for gender recognition using acoustic features from the speech signal. Their work adopted perceptual features like pitch, MFCC, tempo and other low level acoustic features. Ahmad et al. [20] proposed a technique for gender recognition using MFCC for telephonic voice applications and compared the performance of different well known existing classifiers, among which SVM was found to give the better result for classifying male and female. Harb et al. [21] also introduced a gender recognition system using audio signals with the help of first order spectrum statistics of 1 second windows. They used neural networks as classifiers. To improve performance, research is now moving into the area of deep learning. Levi et al. [22] proposed a simple convolution net architecture to estimate the age and gender of persons based on images. They claimed that the model is very effective even if the amount of learning data is limited. Alkhawaldeh et al. [23] described a gender classification model built with a one dimensional convolution neural network. They used features such as the Mel Spectrogram and Mel Frequency Cepstral Coefficients as the single dimension input sequence to the CNN for training the model. Kabil et al. [24] proposed a work for gender recognition from the raw speech signal using a convolution neural network. In their work, the audio data itself is supplied to the convolution model to train it for gender identification. Mansanet et al. [25] also classified male and female from images using a local deep neural network, which uses local features of the image to train the model for gender classification. Dehghan et al. [26] described the details of Sighthound's fully automated age, gender and emotion recognition system. The backbone of the system consists of several deep convolution neural networks that are not only computationally inexpensive, but also provide state-of-the-art results on several competitive benchmarks. Rajeev et al. [27] presented an algorithm for simultaneous face detection, landmark localization, pose estimation and gender recognition using deep convolution neural networks. The proposed method fuses the intermediate layers of a deep CNN using a separate CNN, followed by a multi-task learning algorithm that operates on the fused features. It exploits the synergy among the tasks, which boosts their individual performances. Wolfshaar et al. [28] applied deep convolution neural networks to gender classification by fine-tuning a pretrained neural network. They explored the performance of dropout support vector machines by training them on the deep features of the pretrained network as well as on the deep features of the fine-tuned network. Wang et al. [29] proposed a speech emotion and age/gender recognition system using deep neural networks. They used deep neural networks to encode each utterance into a fixed-length vector by pooling the activation of the last hidden layer over time. The feature encoding process is designed to train the utterance-level classifier, and a kernel extreme learning machine is further trained on the encoded vectors for better utterance-level classification. Markitantov et al. [30] presented a novel approach in the paralinguistic field of age and gender recognition from a speaker's voice based on deep neural networks. The training and testing of the proposed models were carried out on the German speech corpus aGender. They conducted experiments using different network topologies, including neural networks with fully-connected and convolution layers. Compared with existing traditional classification methods, their method provides better results for speaker age recognition than for speaker gender recognition. Sánchez-Hevia et al. [31] dealt with joint gender recognition and age group classification from speech for improving the functionalities of interactive voice response systems. Owing to the discriminative and representational capabilities of deep neural networks, they used them for feature extraction and selection in speech processing problems. They presented various neural network architectures and compared them using Mozilla's 'Common Voice' dataset, an open source speech corpus. Gupta et al. [32] proposed a stacked machine learning technique for gender recognition through voice using the acoustic parameters of voice samples. The performance of their work is compared with some traditional and useful existing classifiers to demonstrate the effectiveness of their models. Ertam et al. [33] proposed an effective gender recognition system based on deeper LSTM networks using an audio dataset. Initially, they selected the 10 most effective features and subsequently applied a double-layer LSTM architecture. Based on the performance, the authors claim that their model is an effective and fast approach for gender recognition. To the best of the authors' knowledge, most of the gender recognition systems developed by researchers are capable of classifying only male and female voices using audio signals.
As discussed, although much effort has been dedicated to improve the performance of gender recognition, the noted algorithms suffer from the following limitations and challenges.
• The earlier works handle only two gender types, male and female. In the presence of transgender, it is very difficult to recognize all genders separately. We attempt to overcome this limitation by devising a novel feature selection algorithm using information theory and rough set theory, which helps to select only informative and precise features from the audio speech.
• Researchers have used either CNN or RNN for gender recognition. The combination of these two models is sometimes beneficial. The CNN is responsible for capturing the locally encoded important features, and the GRU is used to consider long-term dependencies among the features. Constructing a hybrid deep model for gender recognition is thus one of the challenging tasks.
• The previous works use either human extracted features or machine extracted features, but not both, for gender recognition, which may not train the model properly, especially when transgender comes into the picture. This challenge and limitation is handled by developing a hybrid gender recognition model integrating information theory, rough set theory, and deep neural networks.
The proposed work explores how transgender speakers can be distinguished from male and female speakers. A hybrid deep neural network model, together with the concepts of RST and information theory, is framed for this purpose in the paper.

Contribution
Speech is produced by humans through a natural biological mechanism in which the lungs discharge air that is converted to speech as it passes through the vocal cords and organs including the tongue, teeth, and lips. In general, a speech and voice recognition system can be used for gender identification. Gender recognition is a technique to identify the gender category of a speaker by processing speech signals based on extracted acoustic features such as duration, intensity, frequency and filtering. Many machine learning techniques are now available for gender recognition. However, transgender is a distinct gender which is very difficult to identify using face recognition and speech recognition systems. Traditional machine learning techniques are not capable of accurately classifying audio data that contain all three genders, i.e., male, female, and transgender. Deep learning based gender recognition systems provide comparatively better performance than traditional machine learning techniques by training the model on a huge volume of audio data. It has been observed that different deep neural networks provide different performance, and no single model always provides the best result for all audio data. At the same time, machine extracted features are not self sufficient for gender recognition, as many imprecise and ambiguous features exist in the dataset due to the mixture of transgender voices with male and female voices. In this paper, we propose a novel hybridized deep neural network model combining CNN and GRU together with RST to develop a gender recognition system. We consider human extracted acoustic features and apply them to a proposed RST based feature selection algorithm to filter out the redundant and irrelevant features and select only the informative and precise features of audio speech. The CNN is used to capture the locally encoded important features, and the GRU is used to consider long-term dependencies among the features. Thus we extract features in different forms which are complementary to each other. The classification model is learned using this multi-view dataset to make full use of the hidden information. It is observed that the method not only recognizes male and female speakers accurately, but also recognizes transgender speakers equally well based on audio speech. The main contributions of this paper are summarized in the following steps and depicted in Figure 1.
1. A possible set of acoustic features is extracted from audio speech, and a novel feature selection algorithm is devised using the concepts of information theory and rough set theory to select only the informative, precise, and unambiguous features relevant for gender recognition.
2. The deep neural network architectures of CNN and GRU are explored for developing the gender recognition system. A hybrid deep neural model combining CNN and GRU together with RST is framed considering the selected acoustic features. The selected acoustic features are combined with the features extracted by the deep neural model, and the resultant feature vector is fed into the model for gender recognition.
3. All the developed models (i.e., CNN, CNN and GRUN, CNN and GRUN and RST) are evaluated experimentally using both sample and benchmark audio datasets. It is observed that the RST based hybrid deep neural network model has the higher capability to recognize the genders, especially when transgender is present in the dataset. The method is also compared with some popular gender recognition algorithms to demonstrate the effectiveness of the proposed model.
The remaining part of the paper is structured as follows. Section 2 describes the various human extracted features of the audio signal, and Section 3 describes the informative optimal feature subset selection technique based on information theory and RST. Section 4 describes the proposed hybrid model combining the CNN and GRUN architecture together with RST, developed for recognizing genders based on audio speech. Section 5 provides the experimental setup and empirical result analysis, and finally, Section 6 gives brief conclusions and the future scope of the paper.

Description of human extracted features
Gender information is a distinctive and highly important property of speech, and determining it from a speech signal is a substantial subject. Gender information is used for various purposes in many applications, including speech emotion recognition, human to machine interaction, sorting of telephone calls by gender categorization, automatic salutations, muting sounds for a gender, and so on. Gender identification can improve the prediction of other speaker traits such as age and emotion, either by jointly modeling gender with age or in a pipelined manner. Speaker verification systems also implicitly or explicitly use gender information. In general, identification of a speaker's gender is important for increasingly natural and personalized dialogue systems. The acoustic features of the speech signal are very helpful for gender recognition. The most common features utilized for voice gender recognition are frequency, pitch, mel-frequency cepstral coefficients (MFCCs), power spectrogram chroma (Chroma), and tempo features.
1. Frequency: The resonance structure of the vocal tract can be examined by drawing a smooth line above the spectrum, as shown in Figure 2. This envelope gives the macro-shape of the spectrum of a speech signal, which is often used to model speech signals. From Figure 2, it is observed that speech signals have many frequency features. We have considered only the frequencies $F_0$, $F_1$, and $F_2$, where $F_0$ is the fundamental frequency, and $F_1$ and $F_2$ are the first two formants, providing the two lowest resonances of the vocal tract. $F_0$, $F_1$, and $F_2$ are computed over windows of 20 ms with overlaps of 10 ms. The overlapping time is chosen because the speech signal generally remains stationary on this time scale. The statistical properties (mean, maximum, minimum, median and standard deviation) of all three frequencies are used as the extracted frequency based features of the speech signals. Thus, 15 frequency features are used in total. $F_0$ is computed by the auto-correlation method, and $F_1$ and $F_2$ are computed by solving for the roots of the Linear Predictive Coding (LPC) polynomial using PRAAT [34], an open-source toolkit for speech analysis. The frequencies are computed only over the vowel periods; for the consonants, they are taken as 0 and are not considered in the statistics.
2. Pitch: Pitch is termed the degree of shrillness and harshness of a voice. It is described as the fundamental frequency of the glottal pulse. Precisely, the quality of any tone is dictated by the rate of the vibrations through which it is generated. The main motive for using the pitch feature for gender recognition is that the average fundamental frequency (i.e., the reciprocal of the pitch period) for men is typically in the range of 100 Hz to 146 Hz, and that for women is 188 Hz to 221 Hz [35]. However, an overlap of the pitch values between male and female voices naturally exists, as shown in Figure 3.
We have estimated the pitch period of a speech sample using a sum of amplitude modulation-frequency modulation (AM-FM) formant models [36]. The AM components represent the envelope of the short-time speech signals, which only contains information within a certain bandwidth; this reduces the effects of noise and distortion in the speech signals. In the proposed work, 88 pitch based features have been extracted. The speech signal is distributed into 88 frequency bands, and the short-time mean-square power (STMSP) is calculated for each band. The average STMSP of each band, computed using equation (1), is considered as a single pitch feature obtained from that band.
$$P_s = \frac{1}{q} \sum_{f=1}^{q} x_f^2 \qquad (1)$$
As a result, a total of 88 pitch features are extracted from each speech signal. In equation (1), $P_s$ represents the STMSP of the $s$-th band, $q$ is the total number of samples of each band in the frequency domain, and $x_f$ is the sampled value of each band. Here, $s$ ranges from 1 to 88.

3. Cepstral Coefficients: In addition to frequency statistics and pitch features, we have explored the use of Mel-frequency cepstral coefficients (MFCCs) as features for gender detection. MFCCs are coefficients that collectively make up the mel-frequency cepstrum of a signal. In MFCC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal cepstrum. This frequency warping allows for a better representation of sound, which plays an important role in gender recognition. For calculating MFCCs, the speech signal is divided into a collection of frames of 20 ms duration. We have generated 26 MFCCs for each frame and computed the mean and standard deviation of each coefficient. Thus 52 features are computed for each speech signal.
4. Power spectrogram chroma: The chroma based feature is an interesting and powerful representation of audio in which the entire spectrum is projected into 12 bins representing 12 distinct semitones, or chroma. It is so powerful an acoustic feature that it can be used to study the characteristics of the tones of a speech. Chroma features can reflect the melodic and harmonic nature of a speech signal, and they indicate the intensity of different frames of a signal. Chroma traits reflect perceptual dissimilarities among different speech, which is why this feature is considered for gender recognition.
5. Tempo: Each speech has its own speed, which can be measured by its tempo feature. It is measured in beats per minute. Since the nature of a speech signal is clearly reflected by the tempo based feature, it has been extracted as a suitable one for gender recognition. The tempo feature is calculated with the help of the novelty curve of the given input speech signal. The novelty curve is a type of detection function whose peaks convey the note onsets. In order to extract these features, the novelty curve is broken into non-overlapping tempo windows, each of 20 ms duration. Then, from every window, the first 30 Fourier coefficients are calculated. Finally, the mean and standard deviation of each Fourier coefficient over all windows are computed, which provides a 60-dimensional tempo based feature vector of the audio signal.
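As a minimal sketch of how the frame-level measurements described in the list above might be condensed into fixed-length statistics (5 statistics per frequency track, 88 band powers, 52 MFCC statistics), consider the following Python fragment. The function names, the equal-width split of the spectrum into bands, and the toy inputs are assumptions made purely for illustration; the extraction of the frequency tracks, the MFCCs and the AM-FM decomposition itself is not reproduced here.

```python
from statistics import mean, median, pstdev


def band_mean_square_power(spectrum, n_bands=88):
    """Mean-square power of each band, in the spirit of equation (1).

    Assumption: the bands are contiguous, near-equal-width slices of a
    sequence of sampled spectral values; this is not the AM-FM model.
    """
    n = len(spectrum)
    if n < n_bands:
        raise ValueError("need at least one sample per band")
    powers = []
    for s in range(n_bands):
        band = spectrum[s * n // n_bands:(s + 1) * n // n_bands]
        powers.append(sum(x * x for x in band) / len(band))
    return powers


def frequency_statistics(track):
    """Mean, max, min, median and standard deviation of one frequency track.

    Frames stored as 0 (consonants) are excluded from the statistics.
    """
    voiced = [f for f in track if f > 0]
    if not voiced:
        return [0.0] * 5
    return [mean(voiced), max(voiced), min(voiced), median(voiced), pstdev(voiced)]


def mean_std_features(frames):
    """Per-coefficient mean and standard deviation over all frames.

    frames is a list of equal-length coefficient lists; the result has
    twice as many entries as there are coefficients per frame.
    """
    columns = list(zip(*frames))
    return [mean(c) for c in columns] + [pstdev(c) for c in columns]


# Hypothetical inputs, only to exercise the functions.
f0_track = [0.0, 120.0, 130.0, 0.0, 110.0]                 # Hz, 0 marks a consonant frame
mfcc_frames = [[float(i + j) for j in range(26)] for i in range(50)]
spectrum = [1.0] * 880

f0_stats = frequency_statistics(f0_track)                  # 5 of the 15 frequency features
mfcc_features = mean_std_features(mfcc_frames)             # 52 values
pitch_features = band_mean_square_power(spectrum)          # 88 values
```

Applying `frequency_statistics` to the three frequency tracks gives the 15 frequency features, and `mean_std_features` applied to 30 Fourier coefficients per tempo window would give the 60-dimensional tempo vector in the same way.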

Feature Selection using Rough Set Theory
After feature extraction, a dataset $DS = \{S, F, C\}$ is obtained, where $S = \{S_1, S_2, \ldots, S_n\}$ is the set of speech signals, $F = \{F_1, F_2, \ldots, F_m\}$ is the set of extracted features, and $C$ is the set of classes representing gender types (here, there are 3 classes: male, female and transgender). Not all of these higher level human extracted features may be important, and some may be redundant for gender recognition. A feature selection algorithm therefore provides us with a minimum set of informative features. To find the most informative subset of features, we have used information theory [38] and rough set theory [39] together. Information theory [38] is applied to rank all the features based on the ascending order of their entropy, and rough set theory [39] is used to apply the quick reduct [40] generation algorithm for feature selection. The traditional quick reduct algorithm is modified by incorporating the concept of information theory and the step-wise floating forward selection and backward removal concepts [41]. Before describing the proposed feature selection algorithm, we discuss the relevant concepts used for feature selection.

Information Theory
Information theory [38], introduced by Claude Shannon, quantifies information through entropy. Entropy is a key measure of information, usually expressed as the average number of bits needed to store or communicate one symbol in a message. Information gain calculates the reduction in entropy obtained by transforming a dataset in some way. Entropy measures the level of impurity in a group of samples: the higher the entropy, the greater the information content. If a feature, say $F_i$, takes $v$ discrete values $x_1, x_2, \ldots, x_v$, then the entropy of the feature $F_i$ is given by equation (2), where $p_j$ is the probability of occurrence of the discrete value $x_j$ of $F_i$ in the dataset.
$$H(F_i) = -\sum_{j=1}^{v} p_j \log_2 p_j \qquad (2)$$
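The entropy of a single discretized feature can be illustrated with a short Python sketch. It assumes that the feature values are already discrete and uses base-2 logarithms, so the result is expressed in bits; the function name `feature_entropy` is hypothetical.

```python
import math
from collections import Counter


def feature_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    counts = Counter(values)
    # p * log2(1 / p) is the same as -p * log2(p), summed over distinct values.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


constant = feature_entropy(["a", "a", "a", "a"])   # 0.0: a single value carries no information
uniform = feature_entropy(["a", "b", "c", "d"])    # 2.0: four equally frequent values
```

A feature that takes only one value has zero entropy, while a feature whose values are equally frequent reaches the maximum for that number of values.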
From equation (2), it is observed that if feature $F_i$ assumes only one value, then the entropy of this feature becomes zero, which implies that it is not a good feature for learning a classifier. Similarly, if all discrete values occur equally often in the dataset, then the entropy becomes maximum. Thus a higher entropy value implies a more important feature, and vice versa. In this section, we want to determine which feature in a given training feature set $F$ is most useful for discriminating between the classes (i.e., gender types) to be learned. Here, we have used the entropy of the features with respect to the class attribute (i.e., decision feature). In this slightly different usage, the calculation is referred to as the mutual information between each condition feature and the decision feature. Mutual information measures the statistical dependence between a condition feature and a decision feature, and is the name given to information gain when applied to feature selection. Information gain tells us how important a given feature is for classifying the samples of different classes. Let there be $k$ classes, $C_1, C_2, \ldots, C_k$, in a dataset (in our application of gender recognition, we have considered three different classes: male, female and transgender).
Let $p_i$ be the probability that an arbitrary speech in $S$ belongs to class $C_i$. Then $p_i$ is estimated by $p_i = s_i / n$, where $s_i$ is the number of speeches of class $C_i$ and $n$ is the total number of speeches in $S$. The amount of information needed to decide whether an arbitrary speech in $S$ belongs to any of the classes $C_i$ is defined by equation (3).
$$I(s_1, s_2, \ldots, s_k) = -\sum_{i=1}^{k} p_i \log_2 p_i \qquad (3)$$
Assume that using feature $F_x$ of $F$, the speech set $S$ is partitioned into sets $\{P_1, P_2, \ldots, P_v\}$, where $P_i$ contains $s_{ij}$ speeches of class $C_j$, for $j = 1, 2, \ldots, k$ and $i = 1, 2, \ldots, v$. Then the entropy, or the expected information needed to classify the speeches in all subsets $P_i$, is computed using equation (4), where $I(s_{i1}, s_{i2}, \ldots, s_{ik})$ is obtained by applying equation (3) to the subset $P_i$.
$$E(F_x) = \sum_{i=1}^{v} \frac{s_{i1} + s_{i2} + \cdots + s_{ik}}{n} \, I(s_{i1}, s_{i2}, \ldots, s_{ik}) \qquad (4)$$
By entropy theory, the encoding information gained by classifying the speeches using the feature $F_x$ is given by equation (5), for each feature $F_x$ in $F$.
$$g(F_x) = I(s_1, s_2, \ldots, s_k) - E(F_x) \qquad (5)$$
So, for any two features $F_x$ and $F_y$, $g(F_x) > g(F_y)$ implies that feature $F_x$ is more informative than $F_y$ for classification of speeches.
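The computation described by equations (3) to (5) can be sketched in Python as follows. This is an illustrative implementation under the assumption that the feature has already been discretized; the function names and the toy data are hypothetical and not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def class_information(labels):
    """Information (in bits) needed to classify a sample drawn from `labels`."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Gain of one discrete feature: class information minus expected information."""
    n = len(labels)
    # Partition the class labels by the value the feature takes on each sample.
    partitions = defaultdict(list)
    for value, label in zip(feature_values, labels):
        partitions[value].append(label)
    # Expected information: size-weighted information of every partition.
    expected = sum(len(p) / n * class_information(p) for p in partitions.values())
    return class_information(labels) - expected


labels = ["male", "male", "female", "female", "transgender", "transgender"]
separating = [0, 0, 1, 1, 2, 2]      # each value corresponds to exactly one class
uninformative = [0, 1, 0, 1, 0, 1]   # each value mixes all three classes equally

gain_separating = information_gain(separating, labels)        # about 1.585 bits (log2 of 3)
gain_uninformative = information_gain(uninformative, labels)  # 0.0 up to rounding
```

The feature whose values separate the three classes perfectly attains the full class information as its gain, while the feature that mixes the classes evenly gains nothing, which matches the ordering criterion stated above.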

Rough Set Theory
Rough Set Theory (RST) [39] is an important, purely mathematical concept which is frequently used in data mining and knowledge discovery. The dependency of one feature on another is easily determined using the indiscernibility relation, a preliminary but very powerful concept of RST. In this work, we are interested in finding the dependency between each feature $F_x$ in $F$ and the decision feature $C$ using the indiscernibility relation. The indiscernibility relation is an equivalence relation defined over a subset of features, which gives the equivalence classes of speeches such that the speeches in an equivalence class are indiscernible from each other based on the considered feature subset. All the extracted features in the feature set $F$ of the decision system $DS = \{S, F, C\}$ are real valued, which makes them unsuitable for discriminating the speeches using RST. So, the features are discretized using a popular modified Chi2 algorithm [42]. Thus $F$ becomes the condition feature set of discrete values. The indiscernibility relation $IND(P)$ is defined in (6), where $P \subseteq F$ and $F_x(S_i)$ denotes the value of feature $F_x$ for speech $S_i$.
$$IND(P) = \{(S_i, S_j) \in S \times S \mid \forall F_x \in P,\ F_x(S_i) = F_x(S_j)\} \qquad (6)$$
From the definition of the indiscernibility relation, it can easily be proved that it is an equivalence relation, which induces a partition into equivalence classes. Each equivalence class contains a subset of speeches of $S$ which are indiscernible from each other. Each speech $S_i$ in $S$ provides an equivalence class under the equivalence relation $IND(P)$, computed using equation (7).
$$[S_i]_P = \{S_j \in S \mid (S_i, S_j) \in IND(P)\} \qquad (7)$$
The equivalence class obtained using equation ( 7) is a set of speeches indistinguishable from each other with respect to the feature subset P ⊆ F .Similarly, any speech from remaining speech set S − [S i ] P is selected arbitrarily to compute its corresponding equivalence class.This process is repeated until each speech is placed in any one equivalence class.In our work, we have partitioned the speech set S based on the single feature 7), we get a set of equivalence classes, ∀i = 1, 2, • • • , m.One of the most important aspects of feature selection is the discovery of feature (or attribute) dependencies, that describe which features are strongly related to which other features.As the given system has the decision feature (i.e., class variable), so we measure the dependency of each of the condition feature in F on the decision feature C. Let us consider one condition feature F i of F and the decision feature C to measure the degree of dependency between them, for which following steps are applied: • The indiscernibility relation, defined in equation ( 6), induces a partition {[x] {Fi} } of equivalence classes for P = {F i } and partition {[x] C } for P = C.
• Let Q j is an equivalence class in partition {[x] C }.The lower approximation of the target set Q j is a set S{F i }(Q j ) of all speeches positively belong to the target set Q j and is defined by equation ( 8), where obviously • The positive region P OS C (F i ) is the region which contains all the speeches definitely belong to the equivalence classes of partition {[x] {Fi} } and is defined by equation (9).
It basically contains all the speeches obtained by taking the union of lower approximations, S{F i } with respect to feature • The dependency of target feature C on feature F i is the ratio of number of speeches in the positive region to the total number of speeches in the dataset.This dependency is denoted by γ {Fi} (C) and is defined by equation (10), where S is the set of all speeches in the dataset.
Obviously, this dependency value ranges from 0 to 1.The larger the value means more the feature C is dependent on feature F i .The feature on which C is more dependent is considered as more important feature for classification.
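The steps above can be sketched in a few lines of Python. This is only an illustrative sketch of the standard rough-set definitions (partition, lower approximation, positive region, dependency), not the authors' implementation; the toy records, the feature names `pitch` and `energy`, and the assumption that features are already discretised are all hypothetical.

```python
from collections import defaultdict

def partition(rows, features):
    """Group row indices into the equivalence classes of IND(features)."""
    classes = defaultdict(set)
    for idx, row in enumerate(rows):
        classes[tuple(row[f] for f in features)].add(idx)
    return list(classes.values())

def lower_approximation(blocks, target):
    """Union of the equivalence classes that lie entirely inside the target set."""
    lower = set()
    for block in blocks:
        if block <= target:
            lower |= block
    return lower

def dependency(rows, features, decision):
    """gamma_P(C): size of the positive region divided by the number of rows."""
    if not rows:
        return 0.0
    blocks = partition(rows, features)
    positive = set()
    for target in partition(rows, [decision]):
        positive |= lower_approximation(blocks, target)
    return len(positive) / len(rows)

# Hypothetical, already discretised toy data (not taken from the paper).
speeches = [
    {"pitch": "high", "energy": "low", "gender": "F"},
    {"pitch": "high", "energy": "high", "gender": "F"},
    {"pitch": "low", "energy": "low", "gender": "M"},
    {"pitch": "low", "energy": "high", "gender": "M"},
    {"pitch": "mid", "energy": "low", "gender": "M"},
    {"pitch": "mid", "energy": "high", "gender": "T"},
]

gamma_pitch = dependency(speeches, ["pitch"], "gender")
gamma_energy = dependency(speeches, ["energy"], "gender")
gamma_both = dependency(speeches, ["pitch", "energy"], "gender")
```

In this toy example `pitch` alone places four of the six speeches in the positive region, `energy` alone places none, and the two features together place all six, which illustrates why dependency is evaluated on feature subsets and not only on single features.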

Step-wise floating forward selection and backward removal
Feature selection [43] is the process of selecting a subset of the original features that still produces the same or almost the same analytical results. It tries to select a minimum set of features such that the probability distribution of the classes given the values of those features is as close as possible to the original distribution given the values of all features. However, selecting a minimum set of features is an NP-hard problem [44], so different heuristics are applied by different researchers. The objective of all the heuristics is to search the space of possible feature subsets for one that is optimal or near optimal with respect to some objective function(s). In the proposed work, we have used two objective functions, namely information gain based on information theory, and feature dependency based on rough set theory. The most common generic heuristic search algorithms are step-wise forward selection (SFS) and step-wise backward removal (SBR) of features. Both are iterative processes that either select or remove one feature in each iteration based on the objective functions. In the proposed algorithm, the process terminates when the value of the feature dependency based objective function fulfils a certain condition. Both processes are very time consuming, as only one feature is examined for selection or removal in an iteration. To make the search more efficient, bidirectional search (BDS) is used, where SFS and SBR are applied simultaneously: SFS starts from the empty feature set and SBR starts from the full feature set. The main limitation of these three algorithms is that a feature, once selected, cannot be removed later, and a feature, once removed, cannot be selected again. Such re-evaluation is required because, in the case of SFS, some previously selected features may become useless after the addition of other features, and similarly, in the case of SBR, the usefulness of previously removed features is never re-evaluated.
To overcome this limitation, we have used the concepts of step-wise floating forward selection (SFFS) and step-wise floating backward removal (SFBR). SFFS starts from the empty feature set and, after performing each SFS step, performs multiple SBR steps as long as the value of the objective function improves. Similarly, SFBR starts from the full feature set and, after each SBR step, performs multiple SFS steps as long as the value of the objective function improves. The proposed feature selection algorithm performs bidirectional search by repeatedly applying SFFS followed by SFBR, with objective functions defined using the concepts of information theory and RST. Thus, as discussed in the subsections of this section, the heuristics applied for feature selection are based on information theory, RST, and forward selection and backward removal techniques. The proposed feature selection algorithm is described by Figure 4, where Figure 10a gives the overall higher level details, and Figure 10b and Figure 10e give the detailed work flow of the SFFS and SFBR techniques. From the figures, it is noticed that, in SFFS, SFS is performed once and SBR is performed many times; similarly, in SFBR, SBR is performed once and SFS is performed many times. Initially, $R$ and $W$ are empty sets and are taken as input to the SFFS algorithm, which provides new values of $R$ and $W$. These new values of $R$ and $W$ are the input of the SFBR algorithm, which returns modified values that are again input to SFFS, and the same process is repeated. Finally, the process terminates when the attribute dependency of $C$ with respect to $F$ (i.e., $\gamma_F(C)$) is equal to the attribute dependency of $C$ with respect to $R$ (i.e., $\gamma_R(C)$), and the algorithm returns the informative and precise set $R$ of features.
As our dataset contains the target variable (i.e., gender type), we have proposed a supervised feature selection algorithm to select the features that are most dependent on the class or target variable. As the selected features have a dependency with the class variable, it is expected that classifiers built on the selected features would be more effective for predicting the class labels of the objects. In the paper, we have computed the information gain of each feature over the class variable to measure its dependency with the class variable. The feature with maximum information gain is considered the most informative feature, and vice versa. Next, we follow the bidirectional search, repeatedly using SFFS followed by SFBR. For selection, we consider the feature with the maximum information gain, and for removal, we consider the feature with the worst information gain.
To perform SFFS, we start with the empty feature set $R$ and select the feature with maximum information gain. If the feature dependency increases compared to the previous value, then we insert the feature into the set $R$; otherwise, we discard it and select the feature with the next highest information gain. After adding one feature, we repeatedly select an arbitrary feature from $R$; only if its removal from $R$ increases the feature dependency is it removed from $R$ and stored in set $W$. Otherwise, the same check is done for the next feature in $R$, and this continues until all features in $R$ are exhausted. This process gives the feature subset $R$ finally selected by one execution of SFFS.
Next, we perform SFBR, where we initially consider the whole feature set $F$ to select the worst feature with respect to the information gain. However, as the features in subset $R$ have already been selected by SFFS, they are not removed again, so the initial feature set considered for SFBR is $F = F - R$, from which the worst feature is selected. If the removal of this feature from $F$ increases the feature dependency compared to that of $F$, then it is removed and stored in $W$; otherwise, the next worst feature is selected from $F$ and the same process is repeated. Finally, when one feature has been removed from $F$ and stored in $W$, we repeatedly select an arbitrary feature from $W$; only if its addition to $R$ increases the feature dependency is it inserted into $R$ and removed from $W$. Otherwise, the same check is done for the next feature in $W$, and this continues until all features in $W$ are exhausted. This process gives the modified feature subset $R$. The process of performing SFFS followed by SFBR is repeated until the feature dependency of $C$ on $R$ is similar to that of $C$ on $F$. Thus, following this process, a feature that has been selected can later be removed, and a feature that has been removed can later be selected into the final feature subset $R$. The pseudo code of the proposed bidirectional floating forward selection and backward removal (BFFSBR) algorithm is described in Algorithm 1.
To overcome the vanishing gradient and exploding gradient problems of recurrent networks, we have used a variant of RNN, known as Gated Recurrent Unit network (GRUN).
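As a rough illustration of the feature selection procedure of this subsection, the following Python sketch ranks features by information gain, adds them while the rough-set dependency grows, and then tries floating removals. It is a simplified stand-in and not Algorithm 1: the function names, the toy data, and in particular the removal rule (a feature is dropped when the dependency does not decrease, instead of when it strictly increases) are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature, decision):
    """Entropy of the decision minus its expected entropy given the feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[decision])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[decision] for row in rows]) - remainder

def dependency(rows, features, decision):
    """Fraction of rows lying in equivalence classes with a single decision value."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in features)].append(row[decision])
    return sum(len(g) for g in groups.values() if len(set(g)) == 1) / len(rows)

def select_features(rows, features, decision):
    """Simplified floating forward selection; returns (R, W)."""
    target = dependency(rows, features, decision)           # gamma_F(C)
    ranked = sorted(features, reverse=True,
                    key=lambda f: information_gain(rows, f, decision))
    selected, removed = [], []                               # R and W
    while dependency(rows, selected, decision) < target:
        gamma = dependency(rows, selected, decision)
        # Forward step: best-ranked feature that raises the dependency.
        for f in ranked:
            if f not in selected and dependency(rows, selected + [f], decision) > gamma:
                selected.append(f)
                if f in removed:
                    removed.remove(f)
                break
        else:
            break  # no single feature helps; stop (a limitation of this sketch)
        # Floating backward steps (assumed rule: drop if dependency is not lowered).
        gamma = dependency(rows, selected, decision)
        for f in list(selected):
            rest = [g for g in selected if g != f]
            if dependency(rows, rest, decision) >= gamma:
                selected = rest
                removed.append(f)
    return selected, removed

# Hypothetical discretised toy data; "channel" is deliberately uninformative.
speeches = [
    {"pitch": "high", "energy": "low", "channel": "mono", "gender": "F"},
    {"pitch": "high", "energy": "high", "channel": "mono", "gender": "F"},
    {"pitch": "low", "energy": "low", "channel": "mono", "gender": "M"},
    {"pitch": "low", "energy": "high", "channel": "mono", "gender": "M"},
    {"pitch": "mid", "energy": "low", "channel": "mono", "gender": "M"},
    {"pitch": "mid", "energy": "high", "channel": "mono", "gender": "T"},
]
all_features = ["pitch", "energy", "channel"]
R, W = select_features(speeches, all_features, "gender")
```

On this toy data the loop stops as soon as the dependency of the selected subset equals that of the full feature set, which mirrors the termination condition $\gamma_R(C) = \gamma_F(C)$ stated in the text.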

Spectrogram image of audio signal
Before the multi-layer deep neural architecture based gender recognition system is generated from the audio signal, the audio signal must be preprocessed: it is analysed and converted into a spectrogram, i.e., a 2-D image. We have used the Fourier Transform to convert a continuous audio signal from the time domain to the frequency domain. The Fourier Transform gives not only the frequencies but also the magnitude of each frequency present in the signal. In a spectrogram representation of an audio signal, one axis represents time and the other axis represents frequency, while the colors represent the amplitude of the observed frequency at a particular time. To create the spectrogram, we break the audio signal into frames of smaller size and perform the Discrete Fourier Transform (DFT) on each frame to get its frequency content. We consider all the frames of the signal in order, i.e., frame-1 first, then frame-2, and so on, so the frame number represents time. The frames overlap so that all the frequencies are captured. In the spectrogram calculation, we have used a frame duration of 25 ms, as a human generally cannot speak more than one phoneme in this time, and allowed an overlap of 40% between two consecutive frames. As our signal is sampled at 16 kHz (i.e., 16,000 samples per second), each frame contains $16000 \times 25 \times 0.001 = 400$ samples. As there is an overlap of 40%, each frame shares $400 \times \frac{40}{100} = 160$ samples with the next frame, i.e., every pair of consecutive frames of the signal overlaps by 160 samples. As the frames overlap each other, spectral resolution is important. To achieve better resolution, we have used the Hanning window, which is, in general, effective in 95% of cases. It has good frequency resolution and reduced spectral leakage. Next, the samples of each frame are multiplied by the Hanning window and passed to the Fourier Transform function. The output of the Fourier Transform algorithm is a list of complex numbers of size $\frac{400}{2} = 200$, which represents the amplitudes of different frequencies within the frame. Thus, we get a list of 200 amplitudes of frequency bins, which represent frequencies from 0 Hz to 8 kHz, as our sampling rate is 16 kHz. The absolute values of those complex-valued amplitudes are calculated and normalized for each frame. We apply the same algorithm to each frame of the signal and finally get the spectrogram of the audio signal in the form of a 2-D matrix, where rows and columns represent frame number and frequency bin, respectively, and the values in the cells of the matrix represent the strength of the frequencies. So, we can treat the signal, once transformed to a spectrogram, as an image and easily apply it in our deep neural architecture for gender recognition.
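The framing arithmetic and the per-frame transform described above can be sketched as follows. This is a minimal pure-Python sketch for illustration only: a naive DFT is used so that the example has no dependencies (an FFT routine would be the practical choice), per-frame peak normalization is an assumption because the text does not specify the normalization, and the 1 kHz test tone is hypothetical.

```python
import cmath
import math

SAMPLE_RATE = 16_000                        # samples per second
FRAME_LEN = SAMPLE_RATE * 25 // 1000        # 25 ms frame -> 400 samples
OVERLAP = FRAME_LEN * 40 // 100             # 40% overlap -> 160 samples
HOP = FRAME_LEN - OVERLAP                   # step between frame starts -> 240
N_BINS = FRAME_LEN // 2                     # 200 bins covering 0 Hz to 8 kHz

def hann(n):
    """Symmetric Hanning window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_spectrum(frame, window):
    """Windowed DFT magnitudes of one frame, scaled so the largest bin is 1."""
    n = len(frame)
    tapered = [s * w for s, w in zip(frame, window)]
    mags = []
    for k in range(N_BINS):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(tapered))
        mags.append(abs(acc))
    peak = max(mags) or 1.0                 # avoid division by zero on silence
    return [m / peak for m in mags]

def spectrogram(signal):
    """Rows are frames (time), columns are frequency bins."""
    window = hann(FRAME_LEN)
    return [frame_spectrum(signal[start:start + FRAME_LEN], window)
            for start in range(0, len(signal) - FRAME_LEN + 1, HOP)]

# Hypothetical input: 0.1 s of a 1 kHz tone.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(1600)]
spec = spectrogram(tone)
strongest_bin = max(range(N_BINS), key=spec[0].__getitem__)
```

Each bin spans 16000 / 400 = 40 Hz, so a 1 kHz tone is expected to peak at bin 25 of every frame.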

Convolution Neural Network
The main objective of designing a CNN is to use the concept of convolution, which generates filtered feature maps. CNN models are trained over many layers, where each audio signal input passes through a series of convolution layers with filters (i.e., kernels) and pooling layers to extract the informative features of the audio signal. A deep neural network performs two different functions, namely feature engineering and classification [45], [46], [47], [48]. The feature engineering process [45], [46] automatically extracts useful and nonlinear features from the raw data using convolution and pooling layers by optimizing the weights between the layers. In classification [47], [48], the useful features are transformed into a vector and fed into a fully connected layer, followed by an activation function such as the softmax function, to classify the speech signals into different groups based on the gender type. The softmax function transforms the output into probabilistic values between 0 and 1. The CNN sequence for gender recognition is shown in Fig. 5 and the functionalities of the model are discussed below.
• Convolution Layer: This is the first layer, to which the audio speech is fed. Convolution maintains the relationship among the extracted features of the signal with the help of small square windows of the input data. Thus, we feed the spectrogram (i.e., the image form of the audio signal) as input to convolution layer-I, and a filter or kernel of size 5 × 5 slides over the input image to produce the convolved feature matrix. The convolved feature is reduced in dimension compared to the spectrogram because valid padding is applied. The activation function used in this layer is the Rectified Linear Unit (ReLU), which introduces non-linearity. ReLU is widely used in CNNs because it is cheap to compute and keeps only non-negative activations.
• Pooling Layer: Similar to the convolution layer, the pooling layer is responsible for reducing the dimension of the convolved features, which reduces the time needed by the model to process the data. The main advantage of using this layer is that it extracts dominant features which are rotationally and positionally invariant, which helps to train the model effectively. The most frequently used pooling operations are max pooling and average pooling. Max pooling selects the maximum pixel value from the region of the image covered by the kernel, and average pooling returns the average value of all the pixels in that region. There is no hard and fast rule about which pooling operation performs better, as it depends on both the data and the application. Average pooling smooths the input image, so sharp and dark features cannot be identified properly; it performs dimension reduction only as a noise suppressing mechanism. Max pooling, on the other hand, selects the bright pixels from the image and thus discards noisy activations, performing de-noising along with dimension reduction. Hence, for our data, max pooling is expected to perform better than average pooling. The energy of the audio signal changes frequently with respect to pixel value, so sharp features need to be identified, and that is why the max pooling mechanism is chosen for the proposed method.
• CNN Layer: The convolution layer and the pooling layer together form one CNN layer. Depending on the complexity of the images, the number of CNN layers may be increased to capture low-level details of the image. However, increasing the number of CNN layers increases the computational complexity of the model, so there must be a trade-off between them. In the proposed method, we have used two CNN layers, where each convolution layer uses a kernel of size 5 × 5 with valid padding, and each max pooling layer uses a kernel of size 2 × 2 with valid padding. Valid padding means that no padding is applied, so the filter is placed only at positions where it lies fully inside the input.
After the audio signal has passed through the two CNN layers, the model has learned the features of the signal.
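The effect of valid padding on the feature-map size through the two CNN layers can be checked with a short calculation. The sketch below is illustrative only: the 66 × 200 input size (about one second of audio under the framing described earlier) is hypothetical, a convolution stride of 1 is assumed, and the pooling stride is assumed to equal the pooling kernel size, which is the Keras default but is not stated in the text.

```python
def valid_output_size(size, kernel, stride=1):
    """Output length along one axis when no padding is applied."""
    return (size - kernel) // stride + 1

def cnn_layer_shape(height, width, conv_kernel=5, pool_kernel=2):
    """Shape after one CNN layer: 5x5 valid convolution, then 2x2 max pooling."""
    h = valid_output_size(height, conv_kernel)
    w = valid_output_size(width, conv_kernel)
    # Assumption: pooling stride equals the pooling kernel size.
    h = valid_output_size(h, pool_kernel, stride=pool_kernel)
    w = valid_output_size(w, pool_kernel, stride=pool_kernel)
    return h, w

# Hypothetical spectrogram: 66 frames by 200 frequency bins.
after_layer_1 = cnn_layer_shape(66, 200)
after_layer_2 = cnn_layer_shape(*after_layer_1)
flattened_length_per_filter = after_layer_2[0] * after_layer_2[1]
```

Under these assumptions the map shrinks from 66 × 200 to 31 × 98 after the first CNN layer and to 13 × 47 after the second, per filter, before flattening.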

Recurrent Neural Network
A recurrent neural network (RNN) is a kind of artificial neural network where connections between nodes form a directed graph along a temporal sequence. RNNs can use their internal state to process variable length sequences of inputs, which is why they are generally applied to gender recognition from speech. RNNs have additional stored states, and the storage can be under direct control of the neural network. The storage can also be replaced by another network to incorporate time delays or feedback loops. Such controlled states are referred to as gated states or gated memory, and are part of gated recurrent units (GRUs), which are capable of reducing the vanishing gradient and exploding gradient problems from which the vanilla RNN [49] suffers. In our proposed work, the output of the CNN is flattened and subsequently fed into the Gated Recurrent Unit Network (GRUN) as input. At the current time instant $t$, let $x_t$ be the input sequence, $y_t$ be the output of the GRU, and $s_{t-1}$ be the internal state of the GRU at the previous time instant. Equation (11) computes the update gate output $z_t$ at time instant $t$ using the sigmoid function $\sigma()$, where $W_z$ and $U_z$ are the weights. The update gate determines how much information from the previous time instant $(t - 1)$ is carried forward for further processing.
Similarly, the reset gate output $r_t$ at time instant $t$ is computed using Equation (12), where $W_r$ and $U_r$ are the weights. The reset gate decides how much of the previous information needs to be forgotten.
Finally, the output of the GRU is computed using Equations (13), (14), and (15), where $h_t$ is the memory content and $s_t$ is the internal state at time instant $t$. The product operation in these equations multiplies two vectors element-wise, and $\sigma()$ and $\tanh()$ are the sigmoid and hyperbolic tangent activation functions, respectively. The functionality of the GRU network is shown in Fig. 6.
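For orientation, one commonly used formulation of the GRU gates is given below in the notation of this subsection. It is a sketch of the standard textbook form and may differ in detail from Equations (11) to (15): the weight names $W_h$ and $U_h$ are assumed here, bias terms are omitted, and some references swap the roles of $z_t$ and $1 - z_t$ in the state update.

```latex
\begin{align*}
z_t &= \sigma\left(W_z x_t + U_z s_{t-1}\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r x_t + U_r s_{t-1}\right) && \text{(reset gate)} \\
h_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot s_{t-1}\right)\right) && \text{(memory content)} \\
s_t &= \left(1 - z_t\right) \odot s_{t-1} + z_t \odot h_t && \text{(internal state)}
\end{align*}
```

Here $\odot$ denotes the element-wise product. When $z_t$ is close to 0 the previous state $s_{t-1}$ is copied forward almost unchanged, which is the mechanism that helps gradients survive over long sequences.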

Gender Recognition
The proposed gender recognition system is developed in three different ways: (i) using CNN, (ii) using a combination of CNN and GRUN (i.e., CNN + GRUN), and (iii) using the combined feature selection algorithm, CNN, and GRUN (i.e., BFFSBR + CNN + GRUN). After the audio signal has passed through the two CNN layers, the CNN model has learned the features of the signal. The outputs of the second CNN layer are flattened into the output vector $y_t$ of the CNN model. In the CNN + GRUN model, the flattened output of the CNN is fed into the GRU network, which provides the output vector $y_t$, as shown in Fig. 6. In the BFFSBR + CNN + GRUN model, the output vector obtained from the GRU is merged with the feature vector obtained using the BFFSBR feature selection algorithm, and the combined feature vector is used as the output vector $y_t$ of the model. In all three models, the output $y_t$ is fed into a fully connected feed forward neural network that has an intermediate layer of 64 neurons with sigmoid activation and a final dense layer of 3 neurons (one for each gender type) with softmax activation for classification. After a sufficient number of epochs (we have used 500 epochs), the model is able to distinguish between dominating and certain low-level features and, finally, passes them through a fully connected layer, where a dropout probability of 0.2 and the Adam optimizer are used, to classify them using the softmax classification technique. Dropout is a regularization technique used to reduce over-fitting of the model. The output $O^c_j$ of the $j$-th neuron of the current dense layer is computed using Equation (16), where $O^p_i$ is the output of the $i$-th neuron of the previous layer, $W_{ij}$ is the weight between the $i$-th neuron of the previous layer and the $j$-th neuron of the current layer, and $b^c_j$ is the bias attached to the $j$-th neuron of the current layer.
Finally, the softmax operation on the output layer is defined using Equation (17), where $O^f_j$ is the output of the $j$-th neuron of the final layer.
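The two operations just described can be written out in the standard form below. This is a sketch consistent with the symbol definitions above and not a reproduction of Equations (16) and (17); the activation symbol $f$ (the sigmoid for the intermediate layer) and the summation index $k$ are notational assumptions.

```latex
\begin{align*}
O^c_j &= f\left(\sum_{i} W_{ij}\, O^p_i + b^c_j\right), \\
\operatorname{softmax}\left(O^f_j\right) &= \frac{\exp\left(O^f_j\right)}{\sum_{k=1}^{3} \exp\left(O^f_k\right)}, \qquad j = 1, 2, 3.
\end{align*}
```

The three softmax outputs are positive and sum to 1, so each can be read as the predicted probability of one gender type.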
The proposed gender recognition model is summarized as a block diagram in Fig. 7. Higher level, human extracted features of the input audio signal are selected using the rough set theory and information theory based feature selection algorithm (BFFSBR), which yields the output feature vector $F_3$ for each audio file. The input audio signal is also applied directly to the CNN to extract the output feature vector $F_1$. In the CNN based gender recognition system, $F_1$ is taken as the output vector $y_t$ and used for gender recognition. In the CNN + GRUN based gender recognition system, the output of the CNN is applied to the GRUN, which provides the output feature vector $F_2$, and $F_2$ is used as the output vector $y_t$. In the BFFSBR + CNN + GRUN based gender recognition system, the output feature vector $F_2$ of the GRUN and $F_3$ of BFFSBR are merged and taken as the output vector $y_t$. Thus, we have developed three different gender recognition models, as shown in Fig. 7.

Experimental Results and Discussions
The experiments are carried out on the Google Colab virtual platform, which provides an Nvidia K80/T4 GPU, 12 GB of RAM, and 358 GB of hard disk. The Python language is used to implement the proposed methodologies, and Keras with the TensorFlow backend is used for the deep learning implementation. We train the neural network for 500 epochs with the backpropagation algorithm. During training, we employ a dropout rate of 0.2. Extensive experiments are done to evaluate the proposed gender recognition systems using six different datasets. The details of the datasets and the performance comparisons are given in the following subsections. Some experiments are done using the WEKA tool [50], and the other experiments, including the proposed BFFSBR feature selection method, are carried out in Python.

DataSet Collection
In the proposed work, the following six speech datasets have been taken into account.
1. The simulated speech dataset, $DS_1$, is collected from different websites on the internet and recorded with some speakers. The dataset contains 1500 audio files having 600 female, 500 males, and 400 transgender speeches. The duration of each file is 1 minute 30 seconds. We have collected speeches in two different languages, Hindi and English. Out of 600 male speech files, 400 are in English and the remaining 200 are in Hindi. Out of 500 female speech files, 350 are in English and the remaining 150 are in Hindi, and of the 400 transgender speech files, 250 are in English and the remaining 150 are in Hindi. We have found 26 transgender speakers, of whom 16 are chosen to provide English speeches and 10 provide Hindi speeches. Similarly, 50 male speakers and 50 female speakers are selected to record both the English and Hindi speeches.
2. The dataset, $DS_2$, is the Multi-lingual Indian Language dataset (generated by the Indian Institute of Technology, Kharagpur), used in Reddy et al. [51]. The dataset was collected by recording speech from broadcast television channels using DISH-TV. It contains a total of 28 well known Indian languages [52]. For a few languages, TV broadcast channels were not available, and broadcast radio channels were used to capture the speech. This speech corpus contains news, interviews, and live shows. There are 10 speakers for each language and the duration of each audio speech is about 5-10 minutes. We have labeled each language with a gender type based on the independent opinion of five different experts.
3. The dataset, $DS_3$, a benchmark dataset, is the collection of the Ryerson Audio-Visual Dataset for Emotional Speech and Song [53]. The dataset contains 1440 audio files of eight types of emotion. The speech files of the database are recorded by 24 professional actors, of whom 12 are male and 12 are female, with 60 trials per actor. A North American accent has been used in the speech files.
4. The dataset, $DS_4$, consists of 90,000 raw audio waveforms collected from the audio domain of the VoxForge database. VoxForge is an open-source speech recognition corpus which consists of recorded samples submitted by users using their own microphones. The dataset consists of six languages [52], each of 1500 samples.
5. The dataset, $DS_5$, is the benchmark Ryerson Audio-Visual Dataset for Emotional Speech and Song (RAVDESS) [53]. The database contains 1440 audio files of eight types of emotion. The speech files are recorded by 24 professional actors, of whom 12 are male and 12 are female, with 60 trials per actor. A North American accent has been used in the speech files. The speech has been spoken with different emotions, namely calm, happy, sad, angry, fearful, surprise, disgust, and neutral. All these expressions have been performed with two levels of energy, normal and strong. We have used this dataset to verify that our gender recognition model performs well for such versatile data too.
6. The dataset, $DS_6$, is the VocalSet [54].

Evaluation of proposed BFFSBR feature selection method
The proposed RST and information theory based feature selection algorithm (BFFSBR) has been compared with some recently published feature selection algorithms using all six datasets. The comparison is based on the number of features selected and the accuracy of the classifiers used. The feature selection algorithms used are (i) Rough-spanning tree based feature selection algorithm (RMST) [43], (ii) Classification of vocal and non-vocal segments in audio clips using genetic algorithm based feature selection (GAFS) [55], (iii) Relevant feature selection and ensemble classifier design using bi-objective genetic algorithm (RFSA) [56], (iv) Acoustic feature selection for automatic emotion recognition from speech (AFSS) [57], (v) Exploring boundary region of rough set theory for feature selection (RSFS) [52], and (vi) Speech-Based Emotion Recognition: Feature Selection by Self-Adaptive Multi-Criteria Genetic Algorithm (SFGA) [58]. To measure the accuracy of the classifiers on the reduced feature sets, we have considered eight different classifiers, namely Support vector machine (SVM), K-nearest neighbors (KNN), Decision tree (DT), Neural network (NN), Random forest (RF), Naïve Bayes (NB), Adaboost (BST), and Sequential minimal optimization (SMO). SVM is used with the RBF kernel, the K value of KNN is set to the square root of the sample size of the data, and 10-fold cross validation is used to measure the performance of the classifiers. The classification accuracy is measured using the WEKA tool [50] and the results are listed in Table 1 for all the datasets. The best result obtained for each dataset is marked in bold face in the table. From the table, it is observed that, in all cases, the proposed BFFSBR feature selection method selects the minimum number of features, while RMST and RFSA also select the minimum number of features for $DS_2$ and $DS_3$, respectively. However, selecting the minimum number of features is not the only criterion of a good feature selection algorithm. Thus, we have measured the accuracy of eight different classifiers on the reduced datasets obtained by the feature selection algorithms. For $DS_1$, the RSFS algorithm performs better than the proposed method with respect to the accuracy of the decision tree (DT) classifier only. Similarly, for $DS_2$, the RSFS algorithm performs better than the proposed method with respect to the accuracy of the SVM and NB classifiers. For $DS_5$, the proposed method performs better than all other feature selection algorithms with respect to the accuracy of all eight classifiers, and for all other datasets, the proposed method is dominated by one or two feature selection algorithms with respect to one or two classification accuracies. Thus, overall, the proposed method compares favourably with the other methods in terms of both the number of features selected and the accuracy of the classifiers, which shows the efficacy of the method.
Wilcoxon's rank sum test [59], a non-parametric statistical test for independent samples, is performed at a significance level of 5% (i.e., a p-value threshold of 0.05) to evaluate whether the results obtained by the proposed BFFSBR algorithm differ from those of the other feature selection algorithms in a statistically significant manner. A difference between two algorithms is regarded as statistically significant, at a confidence level of at least 95%, when the accuracies of the classifiers on the corresponding reduced datasets differ with a p-value less than or equal to 0.05. If the p-value between the best algorithm and another algorithm is less than or equal to 0.05, then a ‡ symbol is used for the second one to indicate that the difference is statistically significant; otherwise the two performances are considered equivalent, i.e., the difference between the corresponding algorithms is not statistically significant, and we use a ≈ symbol, as shown in Table 1. Thus, the experimental results show that the proposed BFFSBR algorithm is comparatively better than the other feature selection algorithms.

Evaluation of proposed Gender recognition system
The proposed gender recognition system is evaluated using the 10-fold cross validation method by computing performance validation indices, namely Accuracy (A), Precision (P), Recall (R), and F-measure (F), which are defined in Equations (18) to (21), respectively, where TP, TN, FP, and FN denote true positive, true negative, false positive, and false negative, respectively.
True positive (TP) is the set of objects of a dataset which are actually of the positive class and which the classifier also predicts as the positive class. True negative (TN) is the set of objects which are actually of the negative class and which the classifier also predicts as the negative class. False positive (FP) is the set of objects which are actually of the negative class but which the classifier wrongly predicts as the positive class, and false negative (FN) is the set of objects which are actually of the positive class but which the classifier wrongly predicts as the negative class. First, we have computed these performance metrics for our three proposed gender recognition models, namely CNN, CNN + GRUN, and BFFSBR + CNN + GRUN, using all six datasets. Of all the datasets, only $DS_1$ contains three different classes, Male (M), Female (F), and Transgender (T), whereas all other datasets contain two classes, M and F. For these three models, the performance metrics are computed as follows:
• All the datasets except $DS_1$ have two class labels, M and F. To compute the values of TP, TN, FP, and FN, first M is considered as the positive class and F as the negative class, and all four values are computed. Based on these values, the four performance metrics are calculated using Equations (18) to (21). Next, F is considered as the positive class and M as the negative class, all four values are computed again, and the metrics are calculated in the same way. Finally, the average values are taken as the Accuracy (A), Precision (P), Recall (R), and F-Measure (F) of the model for the dataset.
• Dataset $DS_1$ contains three class labels, namely M, F, and T. In this case, M is first considered as the positive class and the other two as the negative class, the values of TP, TN, FP, and FN are computed, and the performance metrics A, P, R, and F are calculated accordingly. Similarly, the metric values are computed with F as the positive class and the rest as the negative class, and with T as the positive class and the rest as the negative class. Finally, the average of the three values provides the resultant Accuracy (A), Precision (P), Recall (R), and F-Measure (F) of the model for the dataset $DS_1$.
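The one-versus-rest averaging described in the two cases above can be sketched as follows. The code uses the standard definitions of accuracy, precision, recall, and F-measure in terms of TP, TN, FP, and FN, which is assumed to match Equations (18) to (21); the function names and the six example predictions are hypothetical.

```python
def one_vs_rest_counts(actual, predicted, positive):
    """TP, TN, FP, FN when `positive` is the positive class and all others negative."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f_measure

def averaged_metrics(actual, predicted, classes):
    """Average of the per-class one-versus-rest metrics, as described in the text."""
    per_class = [metrics(*one_vs_rest_counts(actual, predicted, c)) for c in classes]
    return tuple(sum(values) / len(classes) for values in zip(*per_class))

# Hypothetical predictions for six speeches of a three-class dataset.
actual = ["M", "M", "F", "F", "T", "T"]
predicted = ["M", "F", "F", "F", "T", "M"]
A, P, R, F = averaged_metrics(actual, predicted, ["M", "F", "T"])
```

For a two-class dataset the same function is called with `["M", "F"]`, which reproduces the first case of averaging over M-positive and F-positive evaluations.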
The computed performance metrics of our three proposed gender recognition models for all six datasets are tabulated, and they are also visualized in Figure 9. It is observed that the accuracy, precision, and F-Measure of the proposed hybrid model (i.e., BFFSBR + CNN + GRUN) are the highest for all datasets; only the recall of CNN + GRUN is slightly higher than that of BFFSBR + CNN + GRUN for $DS_2$. Thus, we consider the deep hybrid model with rough set theory and information theory (i.e., BFFSBR + CNN + GRUN) as the best proposed model for gender recognition.

Comparison of proposed model with other related models
Based on the results given in Table 1 and Table 2, it is observed that the proposed three deep neural network based models work better than the traditional classifiers used in Table 1, simply because these classifiers are applied on the datasets described by the human extracted higher label features.The proposed models used the concepts of deep neural networks where machine itself extracts the features from the raw speech signals.Among the three proposed models, the model BF F SBR + CN N + GRU N which combines both human extracted and machine extracted features performs better than the other two models,   3.For better visualization, the lists of values are represented by bar chart as shown in Figure 9. From the table and figure it is observed that, the proposed BF F SBR + CN N + GRU N model outperforms others in terms of accuracy and F-Measure in all datasets except dataset DS 3 where [30] shows the best F-Measure.The proposed method provides the highest precision values for all datasets except DS 2 , and highest recall values for all datasets except DS 3 , and DS 4 .Considering all four performance metrics, it is observed that the next two best models followed the proposed model are Markitantov et al. [30] and Ertam et al. [33], which is also true in terms of wilcoxon Ranksum test.It is also observed that, the proposed method provides the highest values of all four performance measure metrics in case of dataset DS 1 , which contains the transgender speeches.As earlier mentioned that this dataset may contain more imprecise and uncertain information, so the proposed rough set and information theory based hybrid deep neural network model performs perfectly for gender recognition, which is the main objective of this paper.
The models are also compared by analysing the Receiver Operating Characteristic (ROC) curves generated for all datasets. The ROC curve also lets us measure the performance of a model by computing the Area Under the Curve (AUC): the larger the AUC value, the better the model. To construct the ROC curve, we compute the True Positive Rate (TPR) and the False Positive Rate (FPR) from equations (22) and (23), respectively. The FPR is plotted along the X-axis and the TPR along the Y-axis, as shown in Fig. 10. From equations (22) and (23), it is clear that both the TPR and FPR values range between 0 and 1, so the ROC curve lies within a square of unit area. The ROC curve of an ideal model joins the points (0,0) to (0,1) and (0,1) to (1,1), which gives an AUC of 1 square unit. The 45° diagonal line connecting (0,0) to (1,1) is the ROC curve corresponding to random chance. Thus a curve lying below the diagonal indicates a model worse than chance, and a curve lying above it indicates a better one. We have applied all seven gender recognition models to all six datasets, and the resulting ROC curves are shown in Fig. 10 for datasets DS1 to DS6, respectively. From the figure, it is observed that the proposed BFFSBR+CNN+GRUN model gives a better result than the other models with respect to the AUC value.
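The ROC construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: binary labels (1 for the positive class, 0 otherwise), a real-valued score per sample, TPR = |TP| / (|TP| + |FN|) as in equation (22), the standard definition FPR = |FP| / (|FP| + |TN|) assumed for equation (23), and the trapezoidal rule for the AUC. The function names and the example data are hypothetical.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR) obtained by sweeping the decision threshold.

    labels: 1 for the positive class, 0 for the negative class.
    scores: higher score means "more likely positive".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    # Visit samples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    print("AUC =", round(auc(roc_points(labels, scores)), 4))
```

The curve always starts at (0,0) and ends at (1,1); a perfect ranking passes through (0,1), matching the ideal curve described in the text.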

Conclusion
The proposed gender recognition system is developed based on multi-view feature selection concepts. The human-extracted features are evaluated using rough set theory and information theory to select only the informative, precise, and unambiguous features by removing uncertainty and ambiguity from the dataset. On the other hand, machine-extracted features are generated using CNN and GRUN based deep neural networks. Finally, the two sets of features are combined and applied in the gender recognition system. Thus we extract features in different forms that are complementary to each other. The classification model is learned from this multi-view dataset to make full use of the hidden information. The method is applied to six different audio speech datasets and obtains very promising results for gender recognition. The dataset DS1 is a sampled dataset generated by us by collecting audio speech of three different genders, where transgender is also considered. It is observed from the experimental results that the proposed method works more effectively for DS1, from which we may conclude that RST based human-extracted features play an important role in gender recognition. Generally, the voice of a transgender speaker is difficult to distinguish accurately from a male or female voice, and this uncertainty is tackled by the RST. Different architectures of deep neural networks may be applied together with fuzzy set theory and rough set theory for the same purpose, which is the future scope of this paper.

Figure 1: The work flow diagram of the proposed gender recognition system

Figure 2: The resonance structure of the spectrum of a speech signal

Figure 3: The overlapping of pitch values between male and female

Figure 4: The flowchart of the proposed feature selection algorithm

Figure 6: Gender Prediction using CNN and GRU Network

Figure 7: The proposed Gender Recognition System

[...] dataset, which contains recordings from 20 different singers (11 male and 9 female) performing a variety of vocal techniques. The dataset has 10.1 hours of recordings of professional singers performing different types of vocal techniques. It consists of a diverse set of voices using different vocal techniques sung on different scales. The dataset diversifies our range and variety of songs. Thus the proposed gender recognition model is also applied to singing speech data.

Figure 8: Performance comparison of proposed gender recognition systems. Panels (a) to (f) show the performance metrics for DS1 to DS6, respectively.

$$\mathrm{TPR} = \frac{|TP|}{|FN| + |TP|} \qquad (22)$$

Figure 9: Performance comparison of proposed model with related models. Panels (a) to (f) show the performance metrics for DS1 to DS6, respectively.

Figure 10: ROC curve based comparison of different gender recognition systems

Deep Neural architecture for Gender Recognition
An audio recording can be represented in two different forms: (i) the audio signal, which is amplitude versus time, and (ii) the spectrogram, which is frequency content versus time. The amplitudes are not very informative, as they give only the loudness of the audio recording. The frequency domain provides a better understanding of the audio signal, since it shows the different frequencies present in the signal. In this work, we have used the spectrogram form of the audio signal to extract the useful features. We have fed the spectrogram image of the audio signal into our proposed multi-layer deep neural architecture based model of gender recognition. Recently, among various machine learning techniques, deep learning models have gained popularity for classification of objects. We have employed a deep learning model that combines a Convolution Neural Network (CNN) and a Recurrent Neural Network (RNN). The CNN is used for extracting the locally encoded important features to capture the non-linearity of the data, and the RNN is used to provide a memory for capturing the long-term dependencies. The simple RNN model mainly suffers from the vanishing gradient problem, where the gradient becomes extremely small, and the exploding gradient problem, where the gradient becomes extremely large.
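The conversion from the amplitude-versus-time form to the frequency-versus-time form can be illustrated with a small short-time Fourier transform. The sketch below is dependency-free and only illustrative: the frame length, hop size, Hann window, and the function name `spectrogram` are assumptions, not the settings used in the paper, and in practice an FFT routine from a numerical library would replace the naive DFT loop.

```python
import cmath
import math


def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram: one row per time frame, one column per frequency bin.

    Illustrative settings only (assumed, not from the paper).
    """
    # Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies of a real signal
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(n_bins):
            # Naive DFT of the windowed frame at frequency bin k.
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            spectrum.append(abs(coeff))
        frames.append(spectrum)
    return frames


if __name__ == "__main__":
    sample_rate = 6400  # Hz, hypothetical
    tone_hz = 800       # a pure tone, for illustration only
    signal = [math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(256)]
    spec = spectrogram(signal)
    peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
    print("frames:", len(spec), "bins:", len(spec[0]))
    print("peak frequency (Hz):", peak_bin * sample_rate / 64)
```

Each row of the result can be viewed as one column of the spectrogram image; stacking the rows over time gives the two-dimensional input that a convolutional layer can process.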

From Table 2, it is observed that both the accuracy and F-Measure of the proposed BFFSBR + CNN + GRUN model are the highest for all six datasets. In some cases, such as the datasets DS2 and DS4, the CNN + GRUN model also provides good performance. For DS2, it gives the best Recall value, and for DS4, it provides the best Accuracy and Recall values. Although the proposed BFFSBR + CNN + GRUN performs better than the other two proposed models, the values of the performance metrics are quite close to each other, so a statistical test is done using Wilcoxon's rank sum test [59] by considering accuracy (A) and F-Measure (F) separately. As in Table 1, the two symbols ‡ and ≈ are used to indicate that a model (i.e., CNN or CNN + GRUN) is statistically different from or equivalent to the BFFSBR + CNN + GRUN model, respectively. From the observed symbols, we can say that the model BFFSBR + CNN + GRUN is superior to, and statistically significantly different from, the other two proposed models. This result demonstrates that by combining some human-extracted features of audio signals with deep neural networks, it is possible to develop a more accurate and effective gender recognition system.
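A rank-sum comparison of two models can be sketched as follows. This is a minimal illustration that assumes each model has been evaluated over several independent runs, uses the normal approximation to the Wilcoxon rank-sum statistic without tie correction, and takes 0.05 as the significance level; the function name `rank_sum_test` and the accuracy values are hypothetical and are not results from the paper.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Returns (z, p). Tied values receive their average rank; no tie
    correction is applied to the variance.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        # Positions i..j-1 hold ranks i+1..j; ties share their mean rank.
        average_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = average_rank
        i = j

    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return z, p


if __name__ == "__main__":
    # Hypothetical per-run accuracies of two models, for illustration only.
    hybrid = [0.961, 0.958, 0.964, 0.959, 0.962, 0.960]
    baseline = [0.951, 0.949, 0.955, 0.950, 0.953, 0.948]
    z, p = rank_sum_test(hybrid, baseline)
    verdict = "different" if p < 0.05 else "equivalent"
    print(f"z = {z:.3f}, p = {p:.4f}: models are statistically {verdict}")
```

With this convention, a "different" verdict would correspond to the ‡ symbol and an "equivalent" verdict to the ≈ symbol. Library routines such as `scipy.stats.ranksums` compute the same statistic.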

Table 2: Evaluation of proposed models based on some performance metrics