Stock Price Prediction using Artificial Neural Model: An Application of Big Data

In recent times, stock price prediction has become an area of profound interest in the financial market. To predict stock prices, the authors propose a technique that first calculates sentiment scores with a Naïve Bayes classifier and then applies a neural network to both the sentiment scores and the historical stock dataset. They also address the issue of data cleaning using the Hive ecosystem, which is used for the pre-processing part, while a neural network model with inputs from sentiment analysis and historical data is used to predict the prices. The experiments show that the accuracy exceeds 90% in most cases, and they also suggest that the model is more accurate when it is trained on recent data. The combination of sentiment analysis and neural networks is used to establish a statistical relationship between the historical numerical records of a particular stock and the sentiment factors that can affect its price.


Introduction
The importance of data has increased immensely in the last decade. Data is being produced at a meteoric pace, and this rapid growth has given rise to many problems in the storage, analysis and processing of data. These problems are collectively termed Big Data. Over time, people have not only addressed the storage issue but have also started analyzing the trends and patterns that exist in the data and using those patterns to predict future trends. Every day, millions of people buy and sell stocks of various companies, so terabytes or even petabytes of data are generated by different exchanges. Financial organizations and retail traders can extract a great amount of information from this data to help them in their trading decisions. The financial market is largely based on the daily trading of stocks, so machine learning techniques can be used to create models that predict stock prices in advance. Over the last few years there have been many ups and downs in the stock market, as many factors can affect a share market. Due to this dynamic nature, it is very difficult to predict a stock price. To address this issue, a system is needed that can detect patterns in stock prices as they are influenced by the political, economic and natural environment, and that can also take into account people's sentiment about a particular company. In this paper, the authors present a combination of sentiment analysis and historical stock prices. One of the major concerns before predicting stock prices is the use of reliable historical stock data; thus, pre-processing of the data is also a must.
In paper [1], the authors proposed a system in which stock market data is extracted by applying keywords to Twitter data and stored in the Hadoop Distributed File System (HDFS) using Flume or Hadoop commands. The data is then pre-processed by removing slang words and other unnecessary elements from each tweet. In another paper [2], the authors used Bombay Stock Exchange (BSE) data, pre-processed the BSE datasets with Hadoop MapReduce techniques and predicted stock values.
In paper [3], the authors presented a system that predicts stock market movement based on historical stock prices and market sentiment analysis. They used Standard and Poor's 500 (S&P 500) data from January 2008 to April 2010, obtained from Yahoo! Finance. They then used Naïve Bayes classification for sentiment analysis, and stock movements were predicted using Support Vector Machines (SVM), logistic regression and neural network techniques. In paper [4], the stock prices of three companies listed on the Indian National Stock Exchange (NSE) were predicted using the SVM technique.
In paper [5], the authors considered two strategies, series and parallel, in financial time series forecasting and scrutinized their performance. They concluded that using Auto Regressive Integrated Moving Average (ARIMA) along with a multilevel perceptron produces much more accurate results. In paper [6], the authors predicted stock values using sentiment analysis of news headlines, with three approaches: KNN, SVM and Naïve Bayes classification. In paper [7], the authors used clustering and multiple regression to forecast the stock price. In paper [8], the authors developed a Natural Language Processing (NLP) model that uses online news to forecast future stock values. In paper [9], to overcome the difficulties of time series forecasting, the authors used an outlier data mining technique for stock forecasting and concluded that their approach is better for predicting the long-term behavior of a stock trend. In paper [10], the authors used decision tree classifiers, specifically ID3 and C4.5, to suggest better times for buying and selling stocks. In paper [11], the authors compared the performance of four machine learning algorithms, namely SVM, Random Forest (RF), Naïve Bayes (NB) and Artificial Neural Network (ANN), in predicting future values for the Reliance and Infosys datasets. In paper [12], the authors proposed a polynomial neural network for stock market forecasting; they also used the concept of partial descriptions together with the original features. Researchers have also applied machine learning and deep learning technologies for prediction in a number of other applications, such as the health sector, the crime sector and image analysis [13][14][15][16][17].
A plethora of research has already been done to predict stock prices or market trends by considering either the numerical historical stock prices or the textual sentiment data, but most researchers did not consider them together. Moreover, the data for sentiment analysis is usually taken from Twitter, which is less reliable than news headlines. So, in this paper, the authors consider both historical stock prices and sentiment analysis of the news, which is novel in itself. Moreover, Big Data technologies are presented for handling large data: the Big Data ecosystem is used to clean the data before it is used for prediction.

Proposed Methodology
The proposed methodology includes sentiment analysis of a news dataset as well as historical stock prices. A news dataset is considered for sentiment analysis because, unlike other resources, news headlines are mostly based on statistical facts and actual events. Big data does not always mean HDFS or MapReduce. The Hive ecosystem has been used to clean the dataset, as it can interact directly with HDFS (Hadoop Distributed File System). The architecture of the proposed methodology is shown in Figure 1. The stock market data is cleaned by HIVE and passed to the ANN, whereas sentiment analysis is performed on the news dataset and the generated sentiment score is passed to the Multilevel Perceptron Network.

Data Cleaning With HIVE
As most of the data available from Twitter and the stock market is in unorganized form, data cleaning is needed to get insight into it. Hive is a good solution for cleaning data for three reasons: it is built upon Hadoop MapReduce, a framework for distributed and parallel processing of large data, and both its architecture and its query language set it apart [18][19].
Using Hive Query Language (HQL), users can perform multiple queries on the same input data. The most interesting part of HIVE is that the HQL compiler translates a statement into a directed acyclic graph of MapReduce jobs, so the query is divided into smaller MapReduce jobs. Data with too many null values may produce undesirable results, so it is really important to clean the data. Figure 2 is a snapshot of NSE data containing many null values. The schema of the table is created in Hive according to the data retrieved from the data sources. The major columns are date, opening price, closing price, volume and adjusted closing price. The data is collected in CSV format, which makes it easier to store in Hadoop as well as to load into a Hive table. Figure 3 shows the data after loading it into a Hive table; this data contains many null values. HQL is therefore used to exclude the records that have null values.
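The effect of this null-filtering step can be illustrated outside Hive. The following is a minimal Python sketch, not the authors' HQL: the column names, the sample rows and the helper `drop_null_rows` are assumptions made for illustration. It keeps only the records in which every field is present, which is the behaviour an `IS NOT NULL` filter on each column would give.

```python
import csv
import io

# Hypothetical sample in the CSV layout described above; all values are made up.
RAW_CSV = """date,open,close,volume,adj_close
2016-01-04,100.0,102.0,5000,101.0
2016-01-05,null,101.0,4000,100.0
2016-01-06,101.0,,4500,99.0
2016-01-07,99.0,98.0,6000,97.0
"""

# Strings treated as "null" in this sketch (an assumption about the raw export).
NULL_MARKERS = {"", "null", "NULL", "NaN"}


def drop_null_rows(csv_text):
    """Return the rows (as dicts) in which no field is missing or null-like."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if all(value is not None and value.strip() not in NULL_MARKERS
               for value in row.values())
    ]


clean_rows = drop_null_rows(RAW_CSV)
print([row["date"] for row in clean_rows])
```

In Hive itself the same filter runs as MapReduce jobs over data in HDFS; the sketch only shows which records survive.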

Predicting adjusted closing prices with sentiment analysis and Artificial Neural Network
A Multilevel Perceptron ANN has been used to predict future trends. News headlines are classified using an NB classifier into two classes, positive and negative, and a sentiment score is created from that classification. This sentiment score is used as an input to the ANN along with five other attributes: date, opening price, highest value on that day, lowest value on that day and volume of shares traded on that day. Date is an important attribute because it helps in establishing a statistical relationship with the other attributes and makes it possible to extract patterns between dates and closing prices; stock prices are time series data, so they change noticeably as time passes. Moreover, the date also helps in capturing price patterns before and after the weekend, and likewise around public holidays, which may affect a particular stock's price. The authors' goal is to forecast the closing price of a stock on a particular day by giving the previous day's data as input.
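The pairing of one day's inputs with the next day's closing price can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the helper `make_next_day_pairs`, the dictionary keys, the sample values and the encoding of the date as an ordinal number are all hypothetical, since the text does not say how the date is encoded.

```python
from datetime import date

# The six input attributes named above; the key names are assumptions.
FEATURE_KEYS = ["day", "open", "high", "low", "volume", "sentiment"]

# Hypothetical rows in chronological order; every value is made up.
rows = [
    {"day": date(2016, 1, 4).toordinal(), "open": 100.0, "high": 103.0,
     "low": 99.0, "volume": 5000, "sentiment": 1, "adj_close": 102.0},
    {"day": date(2016, 1, 5).toordinal(), "open": 102.0, "high": 104.0,
     "low": 100.0, "volume": 4000, "sentiment": 0, "adj_close": 101.0},
    {"day": date(2016, 1, 6).toordinal(), "open": 101.0, "high": 105.0,
     "low": 100.0, "volume": 4500, "sentiment": 1, "adj_close": 104.0},
]


def make_next_day_pairs(records, feature_keys, target_key):
    """Pair each day's feature vector with the following day's target value."""
    inputs, targets = [], []
    for today, tomorrow in zip(records, records[1:]):
        inputs.append([today[key] for key in feature_keys])
        targets.append(tomorrow[target_key])
    return inputs, targets


X, y = make_next_day_pairs(rows, FEATURE_KEYS, "adj_close")
print(len(X), y)
```

The last day has no following day in the data, so it contributes no training pair.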
In this paper, the authors demonstrate the complete method for Apple stock. Apple was chosen because it is a consumer-facing company with many end users worldwide, so the authors assume that a decent portion of the daily news about Apple relates directly to the problems its consumers face and to their sentiments about the company's products. Historical stock prices as well as news are taken for the period 2013 to 2016 from www.nasdaq.com, and this dataset is divided into two parts: three-quarters of the data is used as training data and one-quarter as test data. One more case is considered in which only the data of the year 2016 is used, and it is likewise divided into training and testing datasets.
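A three-quarter/one-quarter split of this kind can be sketched in a few lines of Python. The sketch assumes the split is chronological (the earliest records train the model and the latest ones test it), which is consistent with 2016 serving as the test year but is not stated explicitly; the function name `chronological_split` is hypothetical.

```python
def chronological_split(records):
    """Split time-ordered records into the first 3/4 (train) and last 1/4 (test)."""
    cut = (len(records) * 3) // 4
    return records[:cut], records[cut:]


# Hypothetical stand-in for four years of daily records, oldest first.
daily_records = list(range(8))
train, test = chronological_split(daily_records)
print(train, test)
```

No shuffling is applied, so the test set never contains data older than the training set.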

Sentiment Analysis
Sentiment analysis plays a very important role in the stock market, as prices depend largely on external factors such as political or geographical factors. The given news is therefore divided into two classes, positive (POS) and negative (NEG), using a Naïve Bayes classifier. The positive class is assigned a score of 1 and the negative class a score of 0. This score is then used together with the other data attributes, such as opening price, high, low and volume. The model is represented in Figure 5.

Figure 5. Flow of Data from sentiment analysis to final stock data table
Textual Data Classification using Naïve Bayes Classifier
The classifier operates word by word and finds the probability of each unique word [20]. Here Apple's stock price is considered, and news headlines regarding Apple stock are collected; a small chunk of them is presented in Table 1. For example, the headline "Great Design a good iPhone X" is labelled POS. In the first step, unique words are identified, such as People, loved, the, iPhoneX, hated, a, great, good, poor and design. The next step is to convert these words into a feature matrix, which records how many times each unique word occurs. Table 2 represents the feature matrix for the POS and NEG classes.
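The word-count approach described above can be sketched with a small multinomial Naïve Bayes classifier in plain Python. This is an illustrative sketch, not the authors' implementation: only the headline "great design a good iphonex" comes from the text, the other three training headlines are hypothetical, and the add-one (Laplace) smoothing is an assumption. The per-class word counts play the role of the per-class feature matrices, and `sentiment_score` maps POS to 1 and NEG to 0.

```python
import math
from collections import Counter

# Manually labelled training headlines; all but the second are hypothetical.
TRAINING = [
    ("people loved the iphonex", "POS"),
    ("great design a good iphonex", "POS"),
    ("people hated the iphonex", "NEG"),
    ("a poor design", "NEG"),
]


def train(examples):
    """Count words per class (the per-class feature matrices) and headlines per class."""
    word_counts = {"POS": Counter(), "NEG": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = set(word_counts["POS"]) | set(word_counts["NEG"])
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability for the headline."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for word in text.split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing (an assumption) avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


def sentiment_score(text, model):
    """Map the predicted class to the score used as a network input."""
    return 1 if classify(text, model) == "POS" else 0


model = train(TRAINING)
print(sentiment_score("good design", model), sentiment_score("poor iphonex", model))
```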
Further, two subsets of the feature matrix, one for the positively classified headlines and another for the negatively classified headlines, are represented in Tables 3 and 4, respectively.
The output is a set of continuous values and, like every prediction-based application of a neural network, the model has one output node. Another important element is the error signal, which is the difference between the desired output and the actual output. The weights are updated during the training period so that the error is minimized after some iterations. The maximum number of iterations can be fixed, so that training stops after that many iterations. In this model, date, opening price, high of the day, low of the day, volume and sentiment score are the input variables, along with a bias value. The hidden layer, together with an output layer that gives a single value as output, makes this system a multilevel perceptron regression. Internally, the network regresses the different independent input variables onto the single dependent variable, using equation 4 to compute the loss function [21]. The formula is a squared error function, and α > 0 is a hyperparameter that controls the magnitude of the penalty. Optimization is done by gradient descent, an algorithm for finding a minimum of a function: the gradient ∇Loss_W of the loss with respect to the weights is computed, and the weights are updated by equation 5, where i is the iteration step and ε is the learning rate, with a value larger than 0.
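Equations 4 and 5 are referred to above but are not reproduced in this text. A form consistent with the description (a squared-error term, a penalty on the weights scaled by α, and a gradient step of size ε) is sketched below. The factor 1/2, the choice of an L2 penalty and the symbols ŷ (network output), y (target) and W (weights) are assumptions, not necessarily the authors' exact notation.

```latex
\begin{align}
\mathrm{Loss}(\hat{y}, y, W) &= \tfrac{1}{2}\,\lVert \hat{y} - y \rVert_2^2
  + \tfrac{\alpha}{2}\,\lVert W \rVert_2^2 \tag{4} \\
W^{i+1} &= W^{i} - \varepsilon\, \nabla \mathrm{Loss}_{W}^{i} \tag{5}
\end{align}
```

With α > 0 the second term grows with the size of the weights, so larger values of α push the optimizer toward smaller weights.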
Finally, the learning process stops when it reaches the maximum number of iterations or when the error loss falls below a predefined threshold value. After an instance of the perceptron model is created with this configuration, the model is trained. Two parameters are provided as input: the input data, which is the training dataset, and the target values, which are the closing prices of the next day. The model is thus fitted on the training data of a particular day with the next day's adjusted closing price as the target value.
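The update rule and the two stopping conditions can be illustrated with a deliberately simplified model. The Python sketch below fits a single linear unit with an identity output (no hidden layer) by gradient descent on a squared-error loss with an L2 penalty; it is not the authors' multilevel perceptron, and the function name `train_linear`, the averaging of the loss over the samples, the default hyperparameter values and the toy data are all assumptions.

```python
def train_linear(inputs, targets, alpha=0.0001, learning_rate=0.01,
                 max_iter=1000, tolerance=1e-6):
    """Gradient descent on mean squared error / 2 plus (alpha / 2) * ||w||^2.

    Training stops when the loss drops below `tolerance` or after `max_iter`
    iterations, whichever comes first.
    """
    n_samples = len(inputs)
    n_features = len(inputs[0])
    weights = [0.0] * n_features
    bias = 0.0
    loss = float("inf")
    iteration = 0
    for iteration in range(1, max_iter + 1):
        predictions = [sum(w * x for w, x in zip(weights, row)) + bias
                       for row in inputs]
        errors = [p - t for p, t in zip(predictions, targets)]
        loss = (0.5 * sum(e * e for e in errors) / n_samples
                + 0.5 * alpha * sum(w * w for w in weights))
        if loss < tolerance:
            break  # error is below the predefined threshold
        # Gradient of the loss with respect to each weight and the bias.
        grad_w = [sum(e * row[j] for e, row in zip(errors, inputs)) / n_samples
                  + alpha * weights[j] for j in range(n_features)]
        grad_b = sum(errors) / n_samples
        # Update step: new weight = old weight - learning rate * gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias, loss, iteration


# Toy data following y = 2x + 1, used only to exercise the loop.
xs = [[0.0], [0.25], [0.5], [0.75], [1.0]]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]
weights, bias, loss, steps = train_linear(xs, ys, alpha=0.0, learning_rate=0.5,
                                          max_iter=5000, tolerance=1e-10)
print(weights, bias, steps)
```

A full multilevel perceptron adds hidden layers and back propagation of the gradient through them, but the update step and the stopping logic have the same shape.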

Results
In this study, stock price data from 2013 to 2016 has been considered, and this dataset is divided into two parts: three-quarters of the dataset is used as training data and the remaining quarter as test data. In the first case, the test dataset covers the period from 04-01-2016 to 30-12-2016 and two hidden layers are used: the first hidden layer consists of 7 neurons and the second has 9 neurons. The authors have also considered another case in which the model is trained on a one-year dataset, that of 2016. For this case, one hidden layer containing 9 neurons is used. Two models with datasets of different sizes are considered in order to compare the results, and the difference is quite noticeable. The results suggest that stock price prediction is more effective for the short term. The reason is the time series nature of the data. In the first case, the perceptron network is trained on the data of three years (2013, 2014 and 2015) and predicts the values for the fourth year, 2016. The network therefore has no knowledge of the price patterns of 2016, as it is only familiar with the price patterns of 2013, 2014 and 2015. Thus, it has a higher error than in the second case, where the model is trained specifically on the 2016 dataset and then predicts the value for a given instance in 2016. These results suggest that stock price predictions are more effective over a shorter period of time.

Conclusion
In this paper, the authors have focused on the preprocessing of stock market data with the help of HIVE, part of the Hadoop ecosystem. HIVE can work extremely fast, as it divides a query into several MapReduce jobs that can be executed in parallel, and it can store large amounts of data easily with the help of HDFS. Thus, for data of this size, HIVE is much more effective than a conventional relational database system. Further, the stock price prediction is based on the sentiment analysis of news headlines and on historical stock data. Two different scenarios have been considered for this model, one with training on a longer period of data (three years) and another on a shorter period (one year). The experiments show that the accuracy reaches 91% in the first case and 98% in the second, which indicates that the stock price prediction model is more effective for a shorter period of data.
In future work, the words that actually affect the stock price can also be extracted and used to find relevant news data. In that way the sentiment analysis can be improved further. Solutions of this kind should lead to a better prediction system for the stock market.

Figure 1. Architecture of the proposed methodology

Figure 2. Data retrieved from the data source containing many null values

Figure 3. Data after the loading operation in the HIVE table, showing NULL values

Figure 4. Data in the output table without null values

EAI Endorsed Transactions on Scalable Information Systems, Online First

Figure 7 is the scatter plot of the predicted stock prices generated by the Perceptron network, plotted against the actual values of the test data from 04-01-2016 to 30-12-2016. In this figure, the black data points are the actual values and the red ones are the predicted values.

Figure 8 is the scatter plot of the predicted stock prices generated by the Perceptron network, plotted against the actual values of the test data from 03-10-2016 to 30-12-2016.

Figure 8. Values of AAPL stock generated by the Perceptron network, plotted against the actual values from 03-10-2016 to 30-12-2016

Table 1. Training data example for sentiment analysis
Basically, this is training data in which the authors have manually classified news into the POS and NEG classes.

Table 3. Feature matrix of only positive classes

Table 4. Feature matrix of only negative classes

Feature matrix columns: Headline, People, Loved, The, iPhoneX, Hated, A, Great, Poor, Design, Good, Class
Malav Shastri, Sudipta Roy and Mamta Mittal

Figure 6. Snapshot of results of most informative features and accuracy (at the end)

In the stock data, the authors have added one more column, the sentiment score, after the adjusted closing price. This final table is given in Table 8.
Multilevel Perceptron: A multilevel perceptron uses different loss functions for classification and regression. The authors trained it using back propagation, and the identity function is used as the activation function in the output layer.

Table 8. Final table to be used as input in prediction algorithm

Table 9 represents the last 15 predicted values and actual values of the period 04-01-2016 to 30-12-2016, out of a total of 252 predicted values.

Table 9. Last 15 actual and predicted values of the period 04-01-2016 to 30-12-2016
The error in this model is measured as the Mean Absolute Percentage Error (MAPE), which is a measure of the accuracy of a prediction model. This error is 8.2148 in this case; thus, this model is 91.8% accurate.

Table 10
2016 is taken as test data. The MAPE value in this case is 1.5830, so this model is 98.42% accurate, which is considerably higher than the previous model.
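The relationship between the reported MAPE values and the accuracy figures can be checked with a short Python sketch. The function names are hypothetical, and reading accuracy as 100 minus MAPE is inferred from the reported pairs (8.2148 with 91.8%, and 1.5830 with 98.42%), not from a formula stated in the text.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent; actual values must be non-zero."""
    return 100.0 * sum(abs((a - p) / a)
                       for a, p in zip(actual, predicted)) / len(actual)


def accuracy_from_mape(mape_value):
    """Accuracy in percent, read as 100 minus the MAPE."""
    return 100.0 - mape_value


# Made-up prices, used only to exercise the formula.
example_error = mape([100.0, 200.0], [110.0, 180.0])
print(round(example_error, 4))

# The two MAPE values reported above.
print(round(accuracy_from_mape(8.2148), 1), round(accuracy_from_mape(1.5830), 2))
```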