An Enhanced ANN-HMM based classification of video recordings with the aid of audio-visual feature extraction

INTRODUCTION: As the Internet has become an essential part of life, its use has increased exponentially. Rising Internet bandwidth has made video data transmission a more popular and modern form of information exchange. Classifying video data files currently requires human effort. To reduce the clutter of video data on the Internet, a suitable automatic video classification method is required. OBJECTIVES: In this work, we tried to find a successful model for video classification. METHODS: To build the model, we use different schemes of visual and audio data analysis, applied to a selection of music, traffic and sports videos. The model is based on Hidden Markov Model (HMM) and Artificial Neural Network (ANN) classifiers. To gather the final results, we developed an “enhanced ANN-HMM based” model. RESULTS: Our approach attained an average success rate of 90% across all three classification classes. CONCLUSION: The aim of this work is to categorize and caption videos automatically. We propose an enhanced HMM-ANN based classification of video recordings with the aid of audio-visual feature extraction.


Introduction
The sheer number of Internet users has increased exponentially in the last few years. With an increasing number of users and modes of communication, users are spending a vast amount of time on videos. It is quite difficult, however, for humans to categorize and caption so many videos. Additionally, people need to quickly recognize what kinds of videos they are going to watch, which requires advanced techniques for video classification and captioning. In some cases, we also need to block certain content. For example, parents do not want their kids to watch violent or abusive content on the Internet; for this purpose, we need an advanced video classification technique that can identify and block such unwanted content. In this work, we have explored methods and architectures to process videos.
Automatic categorization and captioning help users to have a better experience when they are watching videos. The purpose of this work is to understand how to categorize and caption videos automatically. In this work, an enhanced video classification technique is proposed which is based on the Hidden Markov Model (HMM) and artificial neural network (ANN) techniques, referred to as "ANN-HMM based classification of video recordings with the aid of audio-visual feature extraction".

BASICS OF ANN-HMM BASED MODEL

Artificial neural network
Pooja Mehta, Sahil Kaswan and Jaspreet Kaur
The applications of the ANN are simple and range from classification problems to speech recognition. The nodes are used as the processing elements, and the connections between the different nodes have numerical values (weights). Each node in the network takes several inputs and calculates a single output based on the inputs and the weights. The output is given to another neuron. The layers that lie between the input and output layers are known as hidden layers. The hidden layers act as individual feature detectors and help in recognizing increasingly complex patterns. These algorithms give suitable solutions for problems that are generally characterized by high dimensionality and by noisy, imperfect sensor data. The main advantage of the neural network is that the model of the system can be built from the available data.
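The node computation described above can be illustrated with a minimal sketch. The sigmoid activation, the function names and the toy weights below are assumptions chosen for illustration; the source does not state which activation or network size was used.

```python
import math

def sigmoid(z):
    """Logistic activation (an assumed choice), mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias=0.0):
    """Single node: weighted sum of the inputs, then one output value."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

def layer_output(inputs, weight_rows, biases):
    """One layer: every node sees all inputs and emits a single output."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]

def forward(inputs, layers):
    """Pass the input through the hidden layers to the output layer, in order."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer_output(activations, weight_rows, biases)
    return activations

# Hypothetical 2-input network: one hidden layer (2 nodes) and one output node.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0]),  # hidden layer weights and biases
    ([[1.0, -1.0]], [0.0]),                   # output layer weights and bias
]
result = forward([1.0, 1.0], layers)
```

Each hidden node here acts as a small feature detector, and the output node combines their responses into one value.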

Hidden Markov Model
A hidden Markov model is defined as a doubly stochastic process with an underlying stochastic process that is not observable, but can only be observed through another set of stochastic processes that produce the sequence of observed symbols. The mechanism and elements of the HMM are as follows. There is a finite number of states, say N, in the model, where the signal in each state has some quantifiable properties. At each clock time t, a new state is entered based on a transition probability that depends on the previous state. An observation output symbol is produced according to a probability distribution that depends on the current state. There are N such probability distributions, which represent the random variables.
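The generative mechanism described above can be sketched as a short sampling routine. This is only an illustration under assumed values: the two-state model, the symbols "a" and "b", and the names `initial`, `transition` and `emission` are hypothetical and do not come from the experiments in this work.

```python
import random

def sample_hmm(initial, transition, emission, symbols, length, rng):
    """Generate (hidden states, observed symbols) from a discrete HMM."""
    n_states = len(initial)
    states, observations = [], []
    state = rng.choices(range(n_states), weights=initial)[0]
    for _ in range(length):
        states.append(state)
        # The observed symbol depends only on the current hidden state.
        observations.append(rng.choices(symbols, weights=emission[state])[0])
        # At the next clock tick, the new state depends only on the current one.
        state = rng.choices(range(n_states), weights=transition[state])[0]
    return states, observations

# Hypothetical 2-state model emitting the symbols "a" and "b".
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.2, 0.8]]
states, observations = sample_hmm(
    initial, transition, emission, ["a", "b"], 10, random.Random(0)
)
```

Only `observations` would be visible to an observer; `states` is the hidden sequence that a classifier has to infer.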
An HMM is defined by a set of states $S = \{s_1, s_2, \ldots, s_{N_s}\}$ (analogous to the three possible weather conditions described above) and a set of parameters $\Theta = \{\pi, A, B\}$. The prior probabilities $\pi_i = P(q_1 = s_i)$ are the probabilities of $s_i$ being the first state of a state sequence; they are collected in a vector $\pi$. (The prior probabilities were assumed equiprobable in the last example, $\pi_i = 1/N_s$.) The transition probabilities are the probabilities of switching from state $s_i$ to state $s_j$: $a_{i,j} = P(q_{n+1} = s_j \mid q_n = s_i)$. They are arranged in the matrix $A$.
Two types of neural networks are considered for speech signals, such as the time delay network, in which a sequence of inputs is processed at every step. The time delay neural network represents an effort to train a static multi-layer perceptron for time sequence processing, which helps in changing the temporal sequence into a spatial sequence over corresponding units. In this method, at every step the inputs are shifted one position to the right along unit delay lines, each of which links a set of input units to its right-hand neighbour, and the next input pattern is then fed into the leftmost position. The input layer is settled as in the recurrent neural networks.
The recurrent neural networks provide a powerful extension of feed-forward models by inserting connections between arbitrary pairs of units. The recurrent connections are responsible for the evolution over time of the internal state of the network.
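The delay-line shifting used by the time delay network can be sketched with a fixed-length buffer. The function name `delay_line_windows`, the number of taps and the zero initial contents are assumptions for illustration only.

```python
from collections import deque

def delay_line_windows(sequence, taps):
    """Yield the contents of a tapped delay line after each new input."""
    # The line starts out filled with zeros (an assumed initial condition).
    line = deque([0.0] * taps, maxlen=taps)
    for value in sequence:
        # The new input enters on the left; older values shift one unit to
        # the right, and the oldest value drops off the right-hand end.
        line.appendleft(value)
        yield list(line)

windows = list(delay_line_windows([1.0, 2.0, 3.0, 4.0], taps=3))
```

Each yielded window lays a short stretch of the time sequence out spatially, so a static multi-layer perceptron can take it as an ordinary input vector.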

PROPOSED ALGORITHM AND ITS WORKFLOW
In this research, we propose a video classification method that uses both audio and visual features of the input video data. The framework starts with the collection of sample videos, which pass through a preprocessing stage in which the video and audio information is separated from the input video. The video data is converted into a set of frames, followed by segmentation and extraction of critical features from both the audio and the video data. Later, a feature-level fusion process concatenates the audio and video features, and classification is achieved through a machine learning algorithm. The block diagram of the proposed system is shown in Figure 3.
The workflow of the proposed system is illustrated in Figure 4. Firstly, the video is retrieved from the online website and processed through the audio-video separation stage, where the visual images and the audio file are extracted from the video. The video is then converted into individual frames and processed through the key feature extraction stage, which is used to distinguish between the individual images, followed by a dimensionality reduction technique. Later, a Grey-Level Co-occurrence Matrix (GLCM) based feature extraction technique is employed for extracting the useful features from the video, and a zero forcing equalizer with Mel-Frequency Cepstral Coefficients (MFCC) is used for extracting the useful features from the audio. These features are combined in the fusion stage and then processed through the ANN-HMM stage to assess the performance of the system. The individual algorithms are described in depth in the following stages.
In this research, the input data is processed through the keyframe extraction stage, which is done on the basis of feature information. Color is a frequently used and essential feature for analyzing the input video. Features are stated as a function of one or more measurements, each of which specifies a quantifiable property of the object along with significant characteristics of the video.
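The feature-level fusion step can be sketched as a plain concatenation of the two feature vectors. The min-max scaling applied before concatenation is an added assumption (the source only states that the features are concatenated), and the numeric values are hypothetical stand-ins for MFCC and GLCM features.

```python
def normalize(features):
    """Min-max scale a vector to [0, 1]; an assumed step, not from the source."""
    lo, hi = min(features), max(features)
    if hi == lo:
        return [0.0 for _ in features]
    return [(f - lo) / (hi - lo) for f in features]

def fuse(audio_features, video_features):
    """Feature-level fusion: concatenate the audio and video vectors."""
    return normalize(audio_features) + normalize(video_features)

# Hypothetical values standing in for MFCC (audio) and GLCM (video) features.
audio = [12.0, -3.0, 4.5]
video = [0.2, 0.8]
fused = fuse(audio, video)
```

The fused vector keeps the audio features first and the video features second, so the classifier always sees each feature at a fixed position.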
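One common way to select key frames from color information is to compare histograms of consecutive frames. The sketch below assumes this approach; the source does not specify the exact criterion, and the bin count, threshold and toy frames are hypothetical.

```python
def color_histogram(frame, bins=4):
    """Normalized histogram of 0-255 intensity values in a flat frame."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame)
    return [c / total for c in counts]

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (between 0 and 2)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def select_keyframes(frames, threshold=0.5):
    """Keep a frame when its histogram differs enough from the last key frame."""
    keyframes = []
    last_hist = None
    for index, frame in enumerate(frames):
        hist = color_histogram(frame)
        if last_hist is None or histogram_distance(hist, last_hist) > threshold:
            keyframes.append(index)
            last_hist = hist
    return keyframes

# Three toy "frames" given as flat lists of pixel intensities.
frames = [[10, 20, 30, 40], [12, 22, 28, 41], [200, 210, 220, 230]]
keys = select_keyframes(frames)
```

In this toy input the second frame is nearly identical to the first and is skipped, while the third frame is much brighter and is kept as a new key frame.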

RESULTS AND DISCUSSION
The input videos considered are music, sports and traffic videos. Each input video is divided into 750 frames, and features are extracted from those images using the feature extraction technique. The software used is MATLAB, and the result parameters are as follows:

Confusion matrix
In machine learning, and specifically in statistical classification, the confusion matrix (also called an error matrix) helps in visualizing the performance of a classifier.
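To make the link between the confusion matrix and the reported metrics concrete, the following sketch builds a matrix for the three video classes and derives accuracy and sensitivity from it. The labels and predictions are made-up toy data, not the results of this work, and the function names are hypothetical.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of all samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def sensitivity(matrix, i):
    """Recall for class i: true positives over all actual members of class i."""
    return matrix[i][i] / sum(matrix[i])

classes = ["music", "sports", "traffic"]
actual = ["music", "music", "sports", "sports", "traffic", "traffic"]
predicted = ["music", "music", "sports", "traffic", "traffic", "traffic"]
cm = confusion_matrix(actual, predicted, classes)
```

Here one sports video is misclassified as traffic, so the sports row has an off-diagonal count and its sensitivity drops, while the music class keeps a sensitivity of 1.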

HMM output
Figure 7 shows the Hidden Markov model graph, in which the emission and transition parameters are compared. As the transition rate increases, the emission rate increases gradually, shows a sudden increase at some point, and remains constant after a few transitions. Table 1 shows the complete values obtained after the simulation process and gives the accuracy achieved. In this table, considering Music Video-2, Sports Video-2 and Traffic Video-2, we achieved 96.5777% accuracy. So we conclude that 96.5777% of the data is correctly classified, and the sensitivity is 100%.

CONCLUSION
With an increasing number of users and modes of communication, users are spending a large amount of time on videos. However, it is quite difficult for human beings to categorize and caption so many videos, so advanced techniques for video classification and captioning are required. In a few cases, we also need to block certain content. For example, parents do not want their kids to watch violent or abusive content on the Internet, and for this purpose we need an advanced video classification technique that can find and block such unwanted content. In this work, we have explored methods and architectures to understand videos. Automatic categorization and captioning help users to have a better experience when watching videos. The aim of this work is to know how to categorize and caption videos automatically. We have proposed an enhanced HMM-ANN based classification of video recordings with the aid of audio-visual feature extraction. The results, once evaluated and analysed, show better accuracy when compared with traditional methods. The accuracy of 97% obtained by the ANN-HMM algorithm shows that the classifier is suitable for classifying the type of video based on the categories used.
The emission probabilities characterize the likelihood of a certain observation $x$ if the model is in state $s_i$. Depending on the kind of observation $x$, we have:
For discrete observations, $x_n \in \{v_1, \ldots, v_K\}$: $b_{i,k} = P(x_n = v_k \mid q_n = s_i)$, the probabilities of observing $v_k$ if the current state is $q_n = s_i$. The numbers $b_{i,k}$ can be arranged in a matrix $B$. (For the weather model, there are $K = 2$ attainable observations $v_1$ and $v_2$.)
For continuous observations, e.g., $x_n \in \mathbb{R}^D$: a set of functions $b_i(x_n) = p(x_n \mid q_n = s_i)$ represents the probability densities (probability density functions, pdfs) over the observation space for the system being in state $s_i$. They are assembled in the vector $B(x)$ of functions. Emission pdfs are usually parametrized, e.g., by mixtures of Gaussians.
The implementation of the video recording classification with the help of the audio-visual feature extraction process is explained here. The audio feature extraction is performed by the zero forcing equalizer integrated with Mel-Frequency Cepstral Coefficients (MFCC). The audio features extracted using this method are the coefficient, Delta and Delta data. Three types of video signal are considered, namely music, traffic and sports, and the extraction is performed for all the signals. The video signal that is separated from the input video is converted into frames, and the histogram-based segmentation technique is applied. The feature extraction technique used for the video signals is the Grey-Level Co-occurrence Matrix (GLCM). The extracted features include energy, variance, sum entropy, homogeneity, correlation, contrast, sum variance, etc., and the extraction is performed for the remaining two video signals also. The audio feature signals are trained for n predefined inputs. After extraction of both the audio and the video features, the two are fused using the feature concatenation technique, and the classification technique is then applied. The classification model is trained and generates the references; the features of the test input are then compared using a similarity measure to determine whether the test input belongs to the class of the trained model. The Artificial Neural Network classifier is used for the fused data, and the output of the ANN is given to the Hidden Markov Model. The flow diagram of the model is shown below.
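The GLCM features named above can be illustrated with a small pure-Python sketch. This is not the MATLAB implementation used in this work: the single horizontal pixel offset, the non-symmetric matrix, the homogeneity variant based on 1 + |i - j|, and the toy 4x4 image are all assumptions made for illustration.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized grey-level co-occurrence matrix for one pixel offset."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                # Count how often level image[r][c] has image[r2][c2] as neighbour.
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(sum(row) for row in counts)
    return [[n / total for n in row] for row in counts]

def glcm_features(p):
    """Energy, contrast and homogeneity of a normalized GLCM."""
    cells = [(i, j, p[i][j]) for i in range(len(p)) for j in range(len(p))]
    return {
        # Sum of squared entries (also called angular second moment).
        "energy": sum(v * v for _, _, v in cells),
        # Weighted by squared grey-level difference between neighbours.
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        # Assumed variant: weights shrink with the absolute level difference.
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
    }

# Toy 4x4 image with 4 grey levels (0 to 3).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
features = glcm_features(glcm(image, levels=4))
```

The resulting dictionary is one per-frame video feature vector; the remaining features listed above (variance, sum entropy, correlation, sum variance) would be computed from the same normalized matrix.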

Figure 1. ANN-HMM model
The ANN-HMM model is the best-performing classification technique for the video recordings. The HMM is developed using a spoken broadcast digit corpus compiled by Bellcore. The HMM has 9 states in total. For every digit, 120 exemplars are used in training and 38 exemplars are used for trials. The neural network

Figure 3. Block diagram of the proposed system
The histogram-based segmentation process is used to enhance the quality of the image and to obtain its digitized form.

Figure 4. Workflow of the proposed system

Figure 5. Confusion matrix for the proposed model

Table 1. Performance Evaluation Table
If we consider Music Video-3, Sports Video-3 and Traffic Video-3, we achieve 88.5777% accuracy. So we can conclude that 88.5777% of the data is correctly classified. The sensitivity is 87.6000. Thus, the results show that the maximum accuracy obtained across the different types of video is 97%.