Wavelet Statistical Feature Based Malware Class Recognition and Classification using Supervised Learning Classifier

Malware is a set of malicious instructions that can harm a system or gain unauthorized access to private data over the internet. The number of malware types grows daily, and predicting and catching them at access time is a challenging task for antivirus vendors. This paper aims to design an automated analysis system for malware classes based on features extracted by the Discrete Wavelet Transform (DWT) with a four-level decomposition of the malware image. The proposed system works in three stages: pre-processing, feature extraction and classification. In pre-processing, the input image is normalized to 256x256 and denoised with a wavelet, which enhances the image. In feature extraction, the DWT decomposes the image into four levels. For classification, support vector machine (SVM) classifiers discriminate the malware classes using statistical features extracted from the level-4 DWT decomposition with the Daubechies (db4), Coiflet (coif5) and Bi-orthogonal (bior2.8) wavelets. Among these wavelet features, the db4 features classify the malware class type effectively, with high accuracies of 91.05% and 92.53% on the two datasets, respectively. The proposed method was evaluated on two datasets, and the results are promising.

Keywords: Classification, Discrete Wavelet Transform, Feature Extraction, Malware Class, Texture and Pattern.


INTRODUCTION
The analysis of textons has played a major role in classification, and pattern classification techniques and their applications in image processing are growing steadily. Image processing and pattern classification represent state-of-the-art developments in the field. Texture pattern recognition is the task of assigning input feature vectors to classes based on features selected from those vectors. There are two types of classification: supervised and unsupervised. Pattern recognition has applications in computer vision, SAR image classification, speech classification and texture classification. Texture classification plays a major role in many applications, such as medical image analysis and pattern classification. Supervised classification methods are used for face recognition, OCR, and object detection and classification. Unsupervised classification methods are used for finding hidden structures, segmentation and clustering. Wavelet transforms have become one of the most important and powerful tools for signal processing and representation. Nowadays they are used in image processing, data compression and signal processing, and different applications use different wavelets. In this paper we present an overview of wavelet transforms in image processing. The objective of this paper is to give comparative results for the wavelet transform across its wavelet families.
Malware [1] is software that performs unwanted functions, such as a virus, worm or Trojan horse. The functionalities of malware include execution and infection, self-replication that infects another host, privilege escalation, manipulation that damages the host, and concealment that hides it from detection. To visualize malware as an image, the binary is read as a vector of 8-bit unsigned integers that is organized into a 2D array. This array can be viewed as a grayscale image with values in the range [0, 255]; the width of the image is fixed and the height varies with the file size. The internet plays a very important role in daily life, which also motivates unauthorized access. Internet usage is growing day by day, which in turn increases the amount of malware distributed, especially for economic profit. According to a Symantec report, millions of malware variants are observed every day, which makes identifying a zero-day attack an exigent task. Malware is a term used to refer to a variety of forms of hostile or intrusive software, including computer viruses, worms and other malicious programs. It can take the form of executable code, script content and other software [2]. Malware analysis is of two types: static analysis and dynamic analysis. Static analysis includes identifying the signatures of malware.
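The byte-to-image conversion described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `bytes_to_grayscale`, the default width of 256 and the decision to drop an incomplete final row are all assumptions made for the example.

```python
def bytes_to_grayscale(data: bytes, width: int = 256) -> list[list[int]]:
    """Reshape raw bytes into rows of `width` pixels with values in [0, 255]."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Assumption: bytes that do not fill a complete final row are dropped.
    height = len(data) // width
    return [list(data[row * width:(row + 1) * width]) for row in range(height)]


# Stand-in for the contents of a malware binary (hypothetical data).
sample = bytes(range(256)) * 3 + b"\x01\x02"
image = bytes_to_grayscale(sample, width=256)
print(len(image), "rows x", len(image[0]), "columns")
```

Because the width is fixed, a larger file simply yields more rows, which matches the variable-height behaviour described in the text.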
Malware is a term used for malicious data that gets installed on a machine and performs unwanted tasks such as stealing passwords and data. Malware visualization [4] is a field that focuses on representing malware in the form of visual features that can deliver more information about a particular malware. Graphical visualization helps to gain more information about malware. The ever-increasing volume of new malware produced every day makes this a challenging task [2]. The number of new signatures released every year has increased exponentially [3]: Symantec reported 169,323 new signatures in 2008, 2,895,802 in 2009, and a corpus of over 286 million in 2010. At a broader level, all malicious data stored on a drive can be represented as a binary string of zeros and ones. This binary string is reshaped into a matrix and represented as a grayscale image, which is why all malicious data can be described as a grayscale image. The description of images has been well studied in the field of computer vision. GIST descriptors [5,6] are used especially for scene classification based on texture, and for object identification and classification.
These descriptors are passed to a classification algorithm for training and testing on malware images using SVM [7]. File fragments have been used as grayscale images for the identification of malware [8]. The behavior of malware has been analyzed [9], and effective entropy-based features [10] have been used for classification with entropy graphs. Distance learning techniques are used with structural information for classification done on automatic

Related Work
Texture plays a very important role in many research areas, including image processing, pattern recognition, medical image analysis and computer vision. Texture analysis aims to find a distinctive way of representing the primary characteristics of textures in a simpler but unique form, so that they can be used for robust, accurate classification and segmentation of objects. Statistical texture features play a significant role in image analysis. Only a few architectures implement onboard textural feature extraction. Statistical texture features are formulated from the gray levels of the malware image. The motivation of this work is that effective texture features of malware images can be extracted by considering the spatial relationship of pixels in a gray level co-occurrence matrix, also called the gray level spatial dependence matrix. From it, a number of texture features are computed, namely contrast, correlation, energy, mean, standard deviation, entropy, RMS and homogeneity.
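As an illustration of how co-occurrence features are obtained, the sketch below builds a normalized gray level co-occurrence matrix for one pixel offset and derives four of the listed features from it. The function names, the single horizontal offset and the tiny 4-level test image are assumptions for the example; the paper does not state which offsets or quantization it uses.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for pixel pairs offset by (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    if total == 0:
        raise ValueError("image is too small for the requested offset")
    return [[n / total for n in row] for row in counts]


def glcm_features(p):
    """Contrast, energy, homogeneity and entropy of a normalized GLCM."""
    idx = range(len(p))
    return {
        "contrast": sum(p[i][j] * (i - j) ** 2 for i in idx for j in idx),
        "energy": sum(p[i][j] ** 2 for i in idx for j in idx),
        "homogeneity": sum(p[i][j] / (1 + abs(i - j)) for i in idx for j in idx),
        "entropy": -sum(v * math.log2(v) for row in p for v in row if v > 0),
    }


# Hypothetical 4x4 image quantized to 4 gray levels.
tiny = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 2, 2, 2],
        [2, 2, 3, 3]]
p = glcm(tiny, levels=4)
texture = glcm_features(p)
print(texture)
```

A real malware image would typically be quantized to fewer gray levels before building the matrix, since a full 256x256 co-occurrence matrix is sparse for small images.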

MATERIALS
The proposed work is analyzed using two standard datasets, Malheur and Malimg. One dataset consists of 24 malware families with 3131 variants, and the other consists of 25 malware types. The details are listed in Table 1. There are 3131 malware images and 1245 malware images of the different malware families listed below.

METHODOLOGY
In the proposed methodology, wavelet low-pass and high-pass filters are applied to the malware image and effective features are extracted for classification. Classification consists of a training phase and a testing phase, with effective features selected for the training images from the database. Fig. 1 illustrates the methodology for detecting malware variants.

Pre-processing
First, the dataset is divided into training and testing data. In the training stage, images are collected at random from each malware family, between 20 and 25 images per family, and the classifier is trained on the feature vectors extracted from them; in total, 666 of the 3131 images in the dataset are used for training. In the testing stage, the complete dataset of every malware family is tested with the SVM multi-class classifier. In the pre-processing stage, the image is loaded and common operations such as normalization, filtering and sub-block averaging are applied. The resulting filtered image is sent to the next stage. The scale and translation parameters are given by $S = 2^{-m}$ and $T = n2^{-m}$, where $m$ and $n$ are integers. Thus, the family of wavelets is defined in equation 1.
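Equation 1 itself is not reproduced in this text. As a hedged reconstruction, assuming the usual scaled and translated mother wavelet $\psi$, substituting the dyadic scale and translation parameters gives the standard discrete wavelet family:

```latex
\psi_{S,T}(t) = \frac{1}{\sqrt{S}}\,\psi\!\left(\frac{t - T}{S}\right),
\qquad S = 2^{-m},\quad T = n\,2^{-m}
\;\Longrightarrow\;
\psi_{m,n}(t) = 2^{m/2}\,\psi\!\left(2^{m} t - n\right),
\qquad m, n \in \mathbb{Z}.
```

The substitution works because $(t - n2^{-m})/2^{-m} = 2^{m}t - n$ and $1/\sqrt{2^{-m}} = 2^{m/2}$. The symbol $\psi$ and the normalization factor are assumptions taken from the standard definition, not from the original equation.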

Supervised Classification
SVM is a supervised learning classifier that seeks an optimal hyperplane to separate two or more classes of samples from the dataset. Kernel functions map the input data into a higher-dimensional space with the aim of obtaining a better distribution of the data; three kernels are used: rbf, linear and distributed. An optimal separating hyperplane in the high-dimensional feature space can then be easily found in
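To make the role of the kernel concrete, the sketch below evaluates the linear and rbf kernels on a few feature vectors and builds the Gram matrix an SVM solver would work with. The function names, the value `gamma=0.5` and the sample vectors are hypothetical; this shows only the kernel computation, not the hyperplane optimization.

```python
import math


def linear_kernel(x, y):
    """Inner product in the original feature space."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel: an inner product in an implicit high-dimensional space."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(samples, kernel):
    """Pairwise kernel values for all samples."""
    return [[kernel(a, b) for b in samples] for a in samples]


# Hypothetical two-dimensional feature vectors.
features = [[0.0, 1.0], [1.0, 1.0], [3.0, 0.0]]
K_lin = gram_matrix(features, linear_kernel)
K_rbf = gram_matrix(features, rbf_kernel)
print(K_lin)
print(K_rbf)
```

The rbf kernel of a vector with itself is always 1 and decays with distance, which is what lets nearby samples influence the decision boundary more than distant ones.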

Result Analysis Of Malware Recognition
The experimental analysis is done on both malware datasets, which consist of 24 malware classes and 9 Trojan classes. The results are analysed using wavelet-based statistical features for malware classification and recognition. The wavelet families are applied with the discrete wavelet transform (DWT).

Fig. 1: Texture Similarities of Malware Classes (a) Trojan Class (b) Worm Class with different variants of Malware

... time signal x[n], the decomposition is given by: ... (4)

In the case of images, the DWT is applied to each dimension separately. At the first level, the image X is decomposed into xA, xH, xV and xD, the approximation, horizontal, vertical and diagonal components respectively. The xA component contains the low-frequency content and the remaining components contain the high-frequency content [29]. Hence, X = xA + {xH + xV + xD}. The DWT is then applied to xA for the second, third and fourth levels of decomposition. The wavelet transform thus provides a hierarchical framework for interpreting the image information. The basis functions of the wavelet transform are localized versions of the mother wavelet. In the statistical feature extraction (SFE) stage, wavelet filters such as the Discrete Wavelet Transform are applied; the 11 extracted statistical features are then assembled into a feature vector and normalized for classification. The SFE features are contrast, correlation, energy, homogeneity, mean, standard deviation, entropy, RMS, variance, smoothness, kurtosis and skewness.
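The four-level decomposition and the statistical features can be sketched as follows. For brevity this example uses the Haar wavelet, not the db4, coif5 or bior2.8 wavelets used in the paper, and it computes only a subset of the listed features. The function names and the feature formulas (population variance, smoothness as 1 - 1/(1 + variance)) are assumptions, since the paper does not define them.

```python
import math


def haar_step(values):
    """One Haar analysis step on an even-length sequence: (low, high)."""
    s = math.sqrt(2.0)
    low = [(values[i] + values[i + 1]) / s for i in range(0, len(values), 2)]
    high = [(values[i] - values[i + 1]) / s for i in range(0, len(values), 2)]
    return low, high


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def filter_columns(block):
    """Apply the Haar step down every column of a block."""
    col_pairs = [haar_step(col) for col in transpose(block)]
    low = transpose([lo for lo, _ in col_pairs])
    high = transpose([hi for _, hi in col_pairs])
    return low, high


def dwt2(image):
    """Single-level 2D DWT: rows first, then columns. Returns xA, xH, xV, xD."""
    row_pairs = [haar_step(row) for row in image]
    lows = [lo for lo, _ in row_pairs]
    highs = [hi for _, hi in row_pairs]
    xA, xH = filter_columns(lows)
    xV, xD = filter_columns(highs)
    return xA, xH, xV, xD


def wavedec2(image, levels=4):
    """Repeatedly decompose the approximation band, as described in the text."""
    approx, details = image, []
    for _ in range(levels):
        approx, xH, xV, xD = dwt2(approx)
        details.append((xH, xV, xD))
    return approx, details


def statistical_features(band):
    """First-order statistics of one sub-band (a subset of the listed features)."""
    values = [v for row in band for v in row]
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    third = sum((v - mean) ** 3 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "std": std,
        "variance": variance,
        "rms": math.sqrt(sum(v * v for v in values) / n),
        "smoothness": 1 - 1 / (1 + variance),
        "skewness": third / std ** 3 if std else 0.0,
        "kurtosis": fourth / std ** 4 if std else 0.0,
    }


# Hypothetical 32x32 grayscale image; each side must be divisible by 2**levels.
image = [[(r * 32 + c) % 256 for c in range(32)] for r in range(32)]
approx, details = wavedec2(image, levels=4)
band_features = statistical_features(approx)
print(band_features)
```

With a 256x256 input, four levels leave a 16x16 approximation band, so the resulting feature vector is small regardless of the original file size.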

Table 4: Comparison of TPR for malware dataset with wavelet family
Columns: Dataset | Training data | Testing data | Method | TPR | Accuracy
The classification error rate is very low compared with existing work on the classification of malware. The contributions of this paper are as follows. The DWT is used to extract effective wavelet-based statistical features by applying wavelet families such as db4, bior2.8, sym4 and coif5. In future work, we will develop a model that classifies and detects particular Trojan malware families more accurately, using genetic algorithm and AdaBoost techniques for classification.

This research work is funded and supported by UGC under the Rajiv Gandhi National Fellowship (RGNF), UGC Letter No: F1-17.1/2014-15/RGNF-2014-15-SC-kAR-69608, February, 2015.