Deep Learning For Audio

Speech is the most efficient and convenient way of communication. The learning capabilities of deep learning architectures can be used to develop a sound classification system that overcomes the efficiency issues of traditional systems. We propose to develop a model that classifies audio by speaker.


I. INTRODUCTION
A waveform is a general way to represent an audio signal. Signal processing is the art and science of changing the data obtained from a time series for analysis or enhancement purposes. Audio signals are three-dimensional signals that represent time, amplitude and frequency. Audio wave sampling: sampling is a method of converting an analogue audio signal into a digital signal. While sampling a sound wave, the computer takes measurements of the sound wave at a regular interval called the sampling interval. Each measurement is then saved as a number in binary format. Speech recognition can be done at a sampling rate of 16 kHz (16,000 measurements per second).
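As a quick illustration, an audio clip can be loaded and resampled at this rate with librosa, the Python package used later in this project; the file name here is hypothetical.

```python
import librosa

# Load a (hypothetical) recording, resampled to 16 kHz mono.
signal, sr = librosa.load("speech.wav", sr=16000)
print(sr)                # 16000 measurements per second
print(len(signal) / sr)  # duration of the clip in seconds
```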
As the neural networks have to do the speech classification, it is very important to feed the network inputs with relevant data. Appropriate preprocessing is necessary to ensure that the input to the neural networks is characteristic of every word, while having a small spread among samples of the same word. Noise and differences in the amplitude of the signal can distort the integrity of a word, while timing variations can cause a large spread among samples of the same word. All these problems are dealt with through signal processing. For example, Alexa is a smart speaker that recognises the voice of a particular speaker and responds to voice commands; similarly, there is Siri on the iPhone. We can use this sound classification model for biometric authentication purposes in service centres.

II. LITERATURE SURVEY
We have to transform the data into features, which can then be fed into the algorithm. The most important features used to build our audio classification model in this project are described below.

RMSE (Root Mean Square Energy)
The energy of a signal corresponds to the overall signal amplitude and is approximately associated with how intense the signal is.
The energy of the signal is $\sum_n |x(n)|^2$; the RMS energy is the root mean square of this formulation, $\sqrt{\tfrac{1}{N}\sum_n |x(n)|^2}$.
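A minimal numpy sketch of this formula; the frame-wise librosa equivalent is noted in a comment.

```python
import numpy as np

def rms_energy(x):
    """Root mean square of the signal energy sum_n |x(n)|^2."""
    return np.sqrt(np.mean(np.abs(x) ** 2))

# Frame-wise equivalent in librosa:
#   import librosa
#   rms = librosa.feature.rms(y=signal)
```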

Zero Crossing Rate
By looking at different speech and audio waveforms, we can see that, depending on their content, they vary a lot in smoothness. For example, voiced speech sounds are smoother than unvoiced ones. To calculate the zero-crossing rate of a signal, we compare the sign of each pair of consecutive samples. In other words, for a signal of length N, we need O(N) operations. Such calculations are also extremely simple to implement, which makes the zero crossing rate an attractive measure for low-complexity applications.
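A sketch of this sign comparison in numpy; it touches each of the N samples once, matching the O(N) cost noted above.

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.sign(x)
    return np.mean(signs[:-1] != signs[1:])
```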

Spectral Centroid
Spectral features are frequency-based features, obtained by converting the time signal into the frequency domain. In the context of speech recognition or speaker classification, frequency information can be represented in the form of spectral centroids.
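One common definition of the spectral centroid is the magnitude-weighted mean of the frequencies present in the signal; a numpy sketch under that definition:

```python
import numpy as np

def spectral_centroid(x, sr):
    """Magnitude-weighted mean frequency of the signal."""
    magnitudes = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return np.sum(freqs * magnitudes) / np.sum(magnitudes)
```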

Spectral Roll-off
The spectral roll-off is the frequency below which a given percentage of the total spectral energy lies, e.g. 85%.
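A numpy sketch of this definition, scanning the cumulative spectral energy for the 85% point:

```python
import numpy as np

def spectral_rolloff(x, sr, percent=0.85):
    """Frequency below which `percent` of the total spectral energy lies."""
    energy = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    cumulative = np.cumsum(energy)
    return freqs[np.searchsorted(cumulative, percent * cumulative[-1])]
```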

Chroma
Chroma refers to the 12 different pitch classes. It is a 12-element feature vector indicating how much energy of each pitch class is present in the signal.

MFCC (Mel-Frequency Cepstral Coefficients)
The idea of MFCC is to compress information about the vocal tract, i.e. the smoothed spectrum, into a small number of coefficients. The mel-frequency cepstral coefficients (MFCC) of a signal are a small set of features, usually about 10 to 20, which describe the overall shape of the spectral envelope.
Using Fourier transforms, we split the complex sound wave into simple sound waves and sum their energies.
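A hedged librosa sketch of extracting MFCCs; the file name is hypothetical, and n_mfcc=13 is an illustrative choice within the 10-to-20 range mentioned above.

```python
import librosa

signal, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
# Shape (13, number_of_frames): each column summarises the
# spectral envelope of one short frame of audio.
print(mfcc.shape)
```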

III. MODEL
We are extracting the audio features from the audio files. Then we are preprocessing the data. Then, we are applying the neural networks and predicting the speaker. Finally, we are calculating how accurately our model is predicting the speaker.
In this, we have five steps. Step 1: Extract the features from the audio files. Step 2: Preprocess and transform the dataset. Step 3: Apply the neural network. Step 4: Predict the speaker. Step 5: Evaluate how accurately the model predicts the speaker.
We have already collected the audio samples and converted them into CSV files to extract information from them.

A. Audio feature extraction
Librosa is a Python package used to analyse music and audio. It provides the building blocks required for the development of music information retrieval systems. We use Librosa's built-in functions such as mfcc, chroma, etc., which generate features such as the MFCCs from the audio time series. All the extracted features are kept in the CSV file.
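A sketch of this extraction step; the file list, feature set, and CSV layout are assumptions about how the project's pipeline might look.

```python
import csv
import numpy as np
import librosa

def extract_row(path):
    # Average each frame-wise feature over time to get one number per feature.
    y, sr = librosa.load(path, sr=16000)
    return [
        np.mean(librosa.feature.rms(y=y)),
        np.mean(librosa.feature.zero_crossing_rate(y)),
        np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)),
        np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr)),
        np.mean(librosa.feature.chroma_stft(y=y, sr=sr)),
        *np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20), axis=1),
    ]

with open("features.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "rms", "zcr", "centroid", "rolloff", "chroma"]
                    + [f"mfcc{i}" for i in range(20)])
    for path in ["john_01.wav"]:  # hypothetical file list
        writer.writerow([path] + extract_row(path))
```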

B. Preprocess and transform the dataset
The filenameArray contains all the audio files. A separate array is produced for the speakers. Each speaker is assigned a number; for example, if the speaker's name begins with "j", the speaker is assigned a "0". We add each speaker's audio file to its assigned number from the list. We then remove the filename, number, chroma stft and other unnecessary columns.
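A minimal sketch of this labelling rule; the file names and the mapping for letters other than "j" are assumptions.

```python
filenameArray = ["john_01.wav", "jane_02.wav", "mary_01.wav"]  # hypothetical

speakers = []
for name in filenameArray:
    # Names beginning with "j" get label 0, as described above;
    # other speakers get their own numbers (assumed rule).
    speakers.append(0 if name.startswith("j") else 1)
```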
We have a total of 1500 recordings in the repository. We split the dataset into training and validation sets using sklearn.model_selection.train_test_split: 70% of the dataset is used for training the model and the remaining 30% for validation, giving 1050 recordings in the training data and 450 in the validation data. Data Normalization: Data normalization removes many irregularities that can hinder data analysis, and typically structures the information inside the data so that it can be visualised and analysed. We use sklearn's StandardScaler to normalize the data: it transforms each feature to have a mean of 0 and a standard deviation of 1. In short, the data is standardised. Standardization works even when features take negative values and shapes the data into a standard normal distribution, which makes classification easier. Each value in the dataset has the mean subtracted from it and is then divided by the standard deviation.
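A sketch of the split and standardization described, assuming X holds the feature matrix from the CSV and y the speaker labels:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 70/30 split: 1050 training and 450 validation recordings.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=42)

scaler = StandardScaler()                # (value - mean) / standard deviation
X_train = scaler.fit_transform(X_train)  # fit the statistics on training data only
X_val = scaler.transform(X_val)          # reuse them for the validation set
```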

C. Apply the Neural Network
CNN: The Convolutional Neural Network (CNN) is the most popular neural network model for image classification problems. Instead of a fully connected network with a weight from every input, a CNN has just enough weights to look at a small patch of the input. The main advantage of the CNN over its predecessors is that it automatically detects the important features without any human supervision.
There are 13 layers in total; we use dense layers and dropout layers, where the dense layers are trained with back-propagation. Dropout makes the model forget part of what it has learned at each step, which prevents the overfitting that might otherwise happen. Softmax is used as the output classifier since we have multiple classes: when data passes through the model, it produces a probability distribution over the speakers. Once the model is ready, we pass it the data. The optimizer is Adam and the loss function is sparse categorical cross-entropy. In model.fit we pass the data to the model with 50 epochs and a batch size of 128.
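A minimal Keras sketch of this setup. The paper's exact 13-layer topology is not given, so the depth and widths here are illustrative; num_speakers is an assumption, and X_train/y_train come from the preprocessing sketch above.

```python
from tensorflow import keras
from tensorflow.keras import layers

num_speakers = 5  # assumed number of classes

model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),
    # Softmax outputs a probability distribution over the speakers.
    layers.Dense(num_speakers, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=50, batch_size=128)
```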
LSTM: Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the field of deep learning.
These networks are well-suited to classifying, processing and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series.
First, we take an embedding matrix of size 5000 with a dropout of 0.2. The dropout is 0.2 for all the models, to create a uniform standard among the models for comparison; it can be changed if we want to increase a model's accuracy.
Here, the max pooling size is 2, so each pooling window covers two samples, halving the sequence length. We use an LSTM layer, and in the dense layer we use softmax for classification. We train with early stopping to avoid overfitting, with epochs=50 and batch size=128.
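A hedged Keras sketch of this model (embedding size 5000, dropout 0.2, pool size 2 as stated). It assumes the inputs were encoded as integer indices, since a Keras Embedding layer expects token indices; other sizes, and the num_speakers/X_train names from earlier sketches, are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Embedding(input_dim=5000, output_dim=64),  # embedding matrix of size 5000
    layers.Dropout(0.2),                              # uniform 0.2 dropout across models
    layers.MaxPooling1D(pool_size=2),                 # pool size 2, halves the sequence
    layers.LSTM(64),
    layers.Dense(num_speakers, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=50, batch_size=128, callbacks=[early_stop])
```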
MLP: Multilayer perceptrons are often applied to supervised learning problems. They train on a set of input-output pairs and learn to model the correlation (or dependencies) between those inputs and outputs. Training involves adjusting the parameters, i.e. the weights and biases, of the model in order to minimize error. We use two layers: a dense layer with relu activation, followed by a softmax layer for classification. The optimizer is Adam and the loss is sparse categorical cross-entropy. We train with early stopping to avoid overfitting, with epochs=50 and batch size=128.
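A minimal Keras sketch of this two-layer MLP; the hidden width is an assumption, and num_speakers/X_train carry over from the earlier sketches.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(100, activation="relu"),              # dense layer with relu
    layers.Dense(num_speakers, activation="softmax"),  # softmax for classification
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=50, batch_size=128, callbacks=[early_stop])
```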

RNN:
A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. An RNN carries information forward through time, which makes it useful for time-series prediction, since it can remember previous inputs. This memory mechanism is realised here with long short-term memory (LSTM) cells.
The first layer is an embedding layer, similar to the LSTM model, but here we first take the embedding matrix and multiply the input data by it.
Second, we use an LSTM layer; the difference between this LSTM inside the RNN model and the standalone LSTM model is the recurrent dropout applied during back-propagation. After that we use a relu activation and a dropout layer, with softmax for classification, 50 epochs and a batch size of 128.
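A hedged Keras sketch of this variant; recurrent_dropout implements the recurrent dropout noted above, the widths are illustrative, and num_speakers/X_train carry over from the earlier sketches.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Embedding(input_dim=5000, output_dim=64),
    layers.LSTM(64, recurrent_dropout=0.2),  # dropout on the recurrent connections
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(num_speakers, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=50, batch_size=128)
```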

E. Prediction
We have plotted the graph of training vs. testing loss. The prediction and ground truth are calculated, where the x-axis is the prediction and the y-axis is the ground truth, which refers to the correct classification of the training set in supervised learning. This is used in statistical models to prove or disprove research hypotheses.
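A sketch of the loss plot described, assuming matplotlib and the history object returned by a Keras model.fit call:

```python
import matplotlib.pyplot as plt

history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=50, batch_size=128)
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```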

CNN Model
The accuracy obtained from the CNN model is 97.11%. The prediction and ground truth for the CNN model are calculated and shown in the table below.

MLP Model
The accuracy obtained from the MLP model is 99.11%. The prediction and ground truth for the MLP model are calculated and shown in the table below. The MLP has more accurate prediction and ground-truth values when compared to the other models.

RNN Model
The accuracy obtained from the RNN model is 82.89%. The prediction and ground truth for the RNN model are calculated and shown in the table below.

IV. CONCLUSION
In this project, the highest accuracy for identifying a particular speaker is 99.11%, achieved using the multilayer perceptron, which is a class of feedforward artificial neural network.
The learning capabilities of the deep learning architecture proved more accurate at the sound classification task than traditional systems.