Stock Market Index Data and indicators for Day Trading as a Binary Classification problem

Classification is the attribution of labels to records according to a criterion automatically learned from a training set of labeled records. This task is needed in a huge number of practical applications, and consequently it has been studied intensively and several classification algorithms are available today. In finance, a stock market index is a measurement of the value of a section of the stock market. It is often used to describe the aggregate trend of a market. One basic financial issue is forecasting this trend. Clearly, such a stochastic value is very difficult to predict. However, technical analysis is a security analysis methodology developed to forecast the direction of prices through the study of past market data. Day trading consists of buying and selling financial instruments within the same trading day. In this case, one interesting problem is the automatic identification of favorable days for trading. We model this problem as a binary classification problem, and we provide datasets containing daily index values, the corresponding values of a selection of technical indicators, and the class label, which is 1 if the subsequent time period is favorable for day trading and 0 otherwise. These datasets can be used to test the behavior of different approaches in solving the day trading problem.


Specifications
When necessary, data are filtered to remove missing or inaccurate values.

Experimental features
The data provided consist of some series of daily prices (opening, closing, maximum, minimum), several series of technical indicators, and a class label. They can be used to apply day trading algorithms or classification algorithms.

Data accessibility
Data is within this article.

Value of the data
The datasets are taken from real-world major financial markets and they are very recent: they range from 20th April 2010 to 12th July 2016.
The datasets contain a vast selection of financial indicators regarded as highly trend indicative by technical analysis.
The datasets are filtered and cleaned to remove erroneous and missing values. These datasets can be used as benchmarks by researchers willing to test trading algorithms on recent real-world data.
These datasets can also be used as benchmarks to test classification strategies on publicly available difficult data.

Data
We provide daily time series for two major indices belonging to two different stock markets. The first one is the Standard & Poor's 500 (S&P 500), which is an American stock market index based on the market capitalizations of 500 large companies having common stock listed on the NYSE or NASDAQ. This is one of the most commonly followed equity indices, and many consider it one of the best representations of the U.S. stock market. The second is the Financial Times Stock Exchange Milano Indice di Borsa (FTSE MIB), which is the primary benchmark Index for the Italian equity markets. It consists of the 40 most-traded stock classes on the exchange, and captures approximately 80% of the domestic market capitalization. For these indices, for each trading day ranging from 20th April 2010 to 12th July 2016, we provide the opening price, closing price, maximum, minimum, and a number of indicators regarded as highly trend indicative by technical analysis (see, e.g., [1][2][3] and references therein), as described in more detail in the next section. Each data record corresponds to one day.
We also provide a binary classification for each day: the class is 1 if the subsequent time period is favorable for day trading and 0 otherwise. Data are filtered to check and to correct missing or inaccurate data. Indicators which are computed using the $n$ past observations are available only from the $(n+1)$-th record of the dataset. The class is not available for the last record. These missing data are encoded as 'N'. No other missing data appear in the dataset. Data cleaning is indeed an important issue for similar data (see, e.g., [4] for references on this widespread problem). The data provided can be used to test the effectiveness of technical analysis in predicting the trend, or to test the accuracy of classification algorithms.
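As an illustration of the 'N' encoding described above, the following minimal sketch reads a small CSV fragment and maps 'N' to a missing value. The column names (`date`, `close`, `mom5`, `class`), the comma delimiter, and the numbers are hypothetical stand-ins, not taken from the actual files.

```python
import csv
import io

def parse_cell(text):
    """Return None for the 'N' missing-value marker, otherwise a float."""
    return None if text.strip() == "N" else float(text)

# Toy stand-in for one of the CSV files; header names and values are made up.
sample = io.StringIO(
    "date,close,mom5,class\n"
    "2010-04-20,100.0,N,1\n"
    "2010-04-21,101.0,N,N\n"
)

rows = []
for record in csv.DictReader(sample):
    # Keep the date as text and convert every other field to float or None.
    rows.append({
        key: (value if key == "date" else parse_cell(value))
        for key, value in record.items()
    })
```

Keeping missing entries as `None` rather than 0 avoids silently treating an unavailable indicator as a real observation.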

Experimental design, materials and methods
Each data record refers to one single trading day. Such a time period is indicated by a subscript $t \in \{1, \dots, m\}$. Each data record is identified with the date and it contains the following values: $o_t$ denotes the opening price of the index; $c_t$ denotes its closing price; $\mathrm{max}_t$ denotes its maximum price; $\mathrm{min}_t$ denotes its minimum price. By using the above values, for each trading day we compute: the return $r_t = (c_t - c_{t-1})/c_{t-1}$ of the index; the percentage variation of the closing price $\delta_t = 100\,(c_t - c_{t-1})/c_{t-1}$.
After this, we compute the indicators described below. For each of them, the current value is denoted by a subscript $t$, the previous by $t-1$, etc.
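The return and the percentage variation of the closing price defined above can be sketched as follows. The function names are illustrative, and the first record is left undefined because it has no previous closing price.

```python
def returns(close):
    """r_t = (c_t - c_{t-1}) / c_{t-1}; None for the first record."""
    out = [None]
    for prev, cur in zip(close, close[1:]):
        out.append((cur - prev) / prev)
    return out

def pct_variation(close):
    """delta_t = 100 * r_t; None wherever the return is undefined."""
    return [None if r is None else 100.0 * r for r in returns(close)]
```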

Momentum
Momentum is conventionally regarded as the basic trend-following indicator. It shows trend by remaining positive while an uptrend is sustained, or negative while a downtrend is sustained. The momentum $M_t(n)$ of the current time period $t$ is computed as the difference between the current closing price $c_t$ and the closing price of $n$ days ago $c_{t-n}$:
$$M_t(n) = c_t - c_{t-n}$$
In our case we use $n = 5$. Range: momentum can take any real value, either positive or negative. Positive values of momentum denote that the index trend is increasing, and vice versa.
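A minimal sketch of the momentum computation, assuming closing prices are held in a plain list. The first $n$ entries are left as `None`, in line with indicators being available only from the $(n+1)$-th record.

```python
def momentum(close, n=5):
    """M_t(n) = c_t - c_{t-n}; None while fewer than n past closes exist."""
    return [
        None if t < n else close[t] - close[t - n]
        for t in range(len(close))
    ]
```

For example, with closes `[10, 11, 12, 13, 14, 15, 13]` and `n=5`, the last two values are 15 - 10 and 13 - 11.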

EMA
Moving averages are widely used for the analysis of time series. A simple moving average (SMA) is the unweighted mean of the previous $n$ data of the historical price data, most often the closing price. A weighted moving average (WMA) has multiplying factors to give different weights to the different prices. Usually, recent prices receive more importance than older prices. In particular, an exponential moving average (EMA) applies weighting factors which decrease exponentially in the past, without ever reaching zero. $\mathrm{EMA}_t(n)$ is computed using the current closing price $c_t$ and the EMA of the previous day $\mathrm{EMA}_{t-1}(n)$.
In our case we use $n = 12$ and also $n = 26$.
Range: EMA has the same range as the price of the asset; in general it can take any real positive value.
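The recursive structure of the EMA can be sketched as follows. The smoothing factor $2/(n+1)$ and the seeding with the first observation are common conventions assumed here; the text above does not state which ones were used to build the datasets, so values may differ from those in the files.

```python
def ema(values, n):
    """Exponential moving average with assumed smoothing factor 2 / (n + 1).

    The first output equals the first input (assumed seed); each later value
    combines the current observation with the previous EMA.
    """
    alpha = 2.0 / (n + 1)
    out = []
    for v in values:
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out
```

With `n=3` the factor is 0.5, so `ema([1.0, 2.0, 3.0], 3)` weights each new price equally with the previous average.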

MACD
Moving Average Convergence/Divergence (MACD) is an oscillator that should reveal changes in the strength, direction, momentum, and duration of a trend in a stock's price. The simplest version of MACD is the difference between two moving averages, one over a shorter period n and one over a longer period m.
Further insight can be obtained by using a third moving average of the $\mathrm{MACD}(n, m)$ itself over a period $s$, called "signal line" $\mathrm{SL}(s)$. When $\mathrm{MACD}(n, m)$ increases and crosses the signal line, it is a bullish signal; when it decreases and crosses the signal line, it is a bearish signal.
$$\mathrm{MACD}_t(n, m, s) = \left(\mathrm{EMA}_t(n) - \mathrm{EMA}_t(m)\right) - \mathrm{SL}_t(s)$$
In our case we use $n = 12$, $m = 26$ and $s = 9$. Range: MACD can take any real value, either positive or negative. Positive values denote that the index trend is increasing, and vice versa.
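A minimal sketch of the three-parameter MACD, taking the signal line to be an EMA over $s$ periods of the difference between the two EMAs. The smoothing factor $2/(n+1)$ and the first-value seed are assumptions, as is the choice of an exponential (rather than simple) average for the signal line.

```python
def ema(values, n):
    """EMA with assumed smoothing factor 2 / (n + 1), seeded with values[0]."""
    alpha = 2.0 / (n + 1)
    out = []
    for v in values:
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out

def macd(close, n=12, m=26, s=9):
    """(EMA(n) - EMA(m)) minus its own EMA over s periods (the signal line)."""
    diff = [fast - slow for fast, slow in zip(ema(close, n), ema(close, m))]
    signal = ema(diff, s)
    return [d - sg for d, sg in zip(diff, signal)]
```

On a steadily rising series the shorter EMA tracks prices more closely than the longer one, so the output stays positive after the first record.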

ROI
Return on Investment (ROI) is one way of considering profits in relation to capital invested. Usually it is the ratio between return and invested capital. In our case, we use the average return over the last $n$ days, denoted by $\mathrm{aver}\{r_t, r_{t-1}, \dots, r_{t-n+1}\}$, and the current closing value.
In our case we use $n = 10$, 20 and 30. Range: ROI can take any real value, either positive or negative. Positive values denote income, negative ones denote loss.

RSI
Relative Strength Index (RSI) is a momentum oscillator that compares the magnitude of recent gains and losses over a specified time period to measure speed and change of price movements of a security. By defining the upward change as $u_t = c_t - c_{t-1}$ if $c_t > c_{t-1}$ and 0 otherwise, and the downward change as $d_t = c_{t-1} - c_t$ if $c_t < c_{t-1}$ and 0 otherwise, the relative strength $\mathrm{RS}(n)$ is the average of the last $n$ upward changes divided by the average of the last $n$ downward changes.
$$\mathrm{RS}_t(n) = \frac{\mathrm{aver}\{u_t, u_{t-1}, \dots, u_{t-n+1}\}}{\mathrm{aver}\{d_t, d_{t-1}, \dots, d_{t-n+1}\}}$$
Then, RSI is computed as follows.
$$\mathrm{RSI}_t(n) = 100 - \frac{100}{1 + \mathrm{RS}_t(n)}$$
RSI is considered a signal of overbought when above 70 and a signal of oversold when below 30. In our case we use $n = 10$, 14 and 30. Range: RSI oscillates between 0 and 100. It is close to 0 when the corresponding upward changes are close to 0, and close to 100 when the corresponding downward changes are close to 0.
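The RSI definition above, with plain averages of the last $n$ upward and downward changes, can be sketched as follows. Returning 100 when the window contains no downward change is an assumption made here to avoid a division by zero; the text does not say how that case is handled.

```python
def rsi(close, n=14):
    """RSI from simple averages of the last n upward and downward changes."""
    out = [None] * len(close)
    for t in range(n, len(close)):
        ups = 0.0
        downs = 0.0
        for k in range(t - n + 1, t + 1):
            change = close[k] - close[k - 1]
            if change > 0:
                ups += change
            else:
                downs -= change
        if downs == 0:
            out[t] = 100.0  # assumed convention when there are no losses
        else:
            # The ratio of sums equals the ratio of averages over n values.
            out[t] = 100.0 - 100.0 / (1.0 + ups / downs)
    return out
```

The first $n$ entries stay `None` because the $n$-th change needs the closing price of $n$ days before.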

STOCHRSI
Stochastic oscillators attempt to predict price turning points by comparing the closing price of a security to its price range. This concept can be applied to the RSI itself, obtaining the Stochastic RSI (SRSI). By computing the RSI range from its minimum in the last $n$ periods $\min\{\mathrm{RSI}_t, \mathrm{RSI}_{t-1}, \dots, \mathrm{RSI}_{t-n+1}\}$ and its maximum in the last $n$ periods $\max\{\mathrm{RSI}_t, \mathrm{RSI}_{t-1}, \dots, \mathrm{RSI}_{t-n+1}\}$, the SRSI is defined as follows.
$$\mathrm{SRSI}_t(n) = \frac{\mathrm{RSI}_t(n) - \min\{\mathrm{RSI}_t, \mathrm{RSI}_{t-1}, \dots, \mathrm{RSI}_{t-n+1}\}}{\max\{\mathrm{RSI}_t, \mathrm{RSI}_{t-1}, \dots, \mathrm{RSI}_{t-n+1}\} - \min\{\mathrm{RSI}_t, \mathrm{RSI}_{t-1}, \dots, \mathrm{RSI}_{t-n+1}\}}$$
In our case we use $n = 10$, 14 and 30. Range: its range is between 0 and 1.
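A minimal sketch of the SRSI, taking an already computed RSI series as input (with `None` where the RSI is unavailable). Leaving the result undefined when the RSI is constant over the window is an assumption, since the denominator is zero in that case.

```python
def stoch_rsi(rsi_values, n=14):
    """Position of the current RSI within its range over the last n periods."""
    out = [None] * len(rsi_values)
    for t in range(n - 1, len(rsi_values)):
        window = rsi_values[t - n + 1 : t + 1]
        if any(v is None for v in window):
            continue  # not enough RSI history yet
        lowest, highest = min(window), max(window)
        if highest > lowest:
            out[t] = (rsi_values[t] - lowest) / (highest - lowest)
    return out
```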

ATR
Average True Range (ATR) measures the degree of price volatility. The range of a price is simply defined as $\mathrm{max}_t - \mathrm{min}_t$; the True Range (TR) extends it to yesterday's closing price if it was outside of today's range:
$$\mathrm{TR}_t = \max\{\mathrm{max}_t, c_{t-1}\} - \min\{\mathrm{min}_t, c_{t-1}\}$$
Now, by denoting with $\mathrm{EMA}_t(n, X)$ the exponential moving average of a generic $X$ over the last $n$ periods, we have that ATR is the exponential moving average of the TR:
$$\mathrm{ATR}_t(n) = \mathrm{EMA}_t(n, \mathrm{TR})$$
In our case we use $n = 14$.
Range: it can take any positive value.
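A minimal sketch of the ATR, using the true range $\max\{\mathrm{max}_t, c_{t-1}\} - \min\{\mathrm{min}_t, c_{t-1}\}$ and an exponential moving average of it. The smoothing factor $2/(n+1)$ and the first-value seed are assumptions; other conventions (for instance a factor of $1/n$) are also in common use and would give different numbers.

```python
def true_range(high, low, close):
    """TR_t for t >= 1: today's range extended to yesterday's close."""
    return [
        max(high[t], close[t - 1]) - min(low[t], close[t - 1])
        for t in range(1, len(close))
    ]

def atr(high, low, close, n=14):
    """EMA of the true range; None for the first record (no previous close)."""
    alpha = 2.0 / (n + 1)  # assumed smoothing factor
    out = []
    for tr in true_range(high, low, close):
        out.append(tr if not out else alpha * tr + (1 - alpha) * out[-1])
    return [None] + out
```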

ADX
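Average Directional Index (ADX) does not indicate trend direction or momentum, only trend strength. It is computed using the positive directional indicator ($+\mathrm{DI}$), the negative directional indicator ($-\mathrm{DI}$), and the Average True Range (ATR).
By defining the upmove as $up_t = \mathrm{max}_t - \mathrm{max}_{t-1}$ and the downmove as $dw_t = \mathrm{min}_{t-1} - \mathrm{min}_t$, if $up_t > dw_t$ and $up_t > 0$ then ${+\mathrm{DM}}_t = up_t$, otherwise ${+\mathrm{DM}}_t = 0$; if $dw_t > up_t$ and $dw_t > 0$ then ${-\mathrm{DM}}_t = dw_t$, otherwise ${-\mathrm{DM}}_t = 0$. Now, recalling that $\mathrm{EMA}_t(n, X)$ denotes the exponential moving average of $X$ over the last $n$ periods, we compute
$${+\mathrm{DI}}_t(n) = 100\,\frac{\mathrm{EMA}_t(n, {+\mathrm{DM}})}{\mathrm{ATR}_t(n)}$$
$${-\mathrm{DI}}_t(n) = 100\,\frac{\mathrm{EMA}_t(n, {-\mathrm{DM}})}{\mathrm{ATR}_t(n)}$$
ADX is finally computed as follows, with $\mathrm{abs}(\cdot)$ denoting the absolute value:
$$\mathrm{ADX}_t(n) = 100\,\mathrm{EMA}_t\!\left(n, \frac{\mathrm{abs}\big({+\mathrm{DI}}_t(n) - {-\mathrm{DI}}_t(n)\big)}{{+\mathrm{DI}}_t(n) + {-\mathrm{DI}}_t(n)}\right)$$
In our case we use $n = 14$.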
Range: it ranges between 0 and 100. Generally, ADX values below 20 indicate trend weakness, and values above 40 indicate trend strength.
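The directional movement steps above can be sketched as follows. The final ADX step uses the usual formulation (100 times the EMA of the absolute difference of the two directional indicators divided by their sum), which is an assumption here, as are the smoothing factor $2/(n+1)$, the first-value seed, and the treatment of zero denominators.

```python
def ema(values, n):
    """EMA with assumed smoothing factor 2 / (n + 1), seeded with values[0]."""
    alpha = 2.0 / (n + 1)
    out = []
    for v in values:
        out.append(v if not out else alpha * v + (1 - alpha) * out[-1])
    return out

def adx(high, low, close, n=14):
    """ADX for records 1..m-1 (the first record has no previous day)."""
    plus_dm, minus_dm, tr = [], [], []
    for t in range(1, len(close)):
        up = high[t] - high[t - 1]
        dw = low[t - 1] - low[t]
        plus_dm.append(up if up > dw and up > 0 else 0.0)
        minus_dm.append(dw if dw > up and dw > 0 else 0.0)
        tr.append(max(high[t], close[t - 1]) - min(low[t], close[t - 1]))
    atr = ema(tr, n)
    # Directional indicators; 0 is used when ATR is 0 (assumed convention).
    plus_di = [100.0 * p / a if a else 0.0 for p, a in zip(ema(plus_dm, n), atr)]
    minus_di = [100.0 * q / a if a else 0.0 for q, a in zip(ema(minus_dm, n), atr)]
    dx = [
        abs(p - q) / (p + q) if (p + q) else 0.0
        for p, q in zip(plus_di, minus_di)
    ]
    return [100.0 * v for v in ema(dx, n)]
```

In a market where every day makes a higher high and a higher low, only the positive directional movement is non-zero and the sketch returns values at the top of the 0 to 100 range.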

Williams %R
Williams %R is an oscillator that analyzes whether a stock or commodity market is trading near the high or the low, or somewhere in between, of its recent trading range.
$$\%R_t(n) = -100\,\frac{\max\{\mathrm{max}_t, \mathrm{max}_{t-1}, \dots, \mathrm{max}_{t-n+1}\} - c_t}{\max\{\mathrm{max}_t, \mathrm{max}_{t-1}, \dots, \mathrm{max}_{t-n+1}\} - \min\{\mathrm{min}_t, \mathrm{min}_{t-1}, \dots, \mathrm{min}_{t-n+1}\}}$$
In our case we use $n = 14$.
Range: it ranges between $-100$ and 0. A value of $-100$ means that the close today was the lowest low of the past $n$ days, and 0 means that today's close was the highest high of the past $n$ days.
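A minimal sketch of Williams %R over a rolling window of $n$ days. Leaving the value undefined when the highest high equals the lowest low is an assumption made to avoid a division by zero.

```python
def williams_r(high, low, close, n=14):
    """-100 * (highest high - close) / (highest high - lowest low) over n days."""
    out = [None] * len(close)
    for t in range(n - 1, len(close)):
        highest = max(high[t - n + 1 : t + 1])
        lowest = min(low[t - n + 1 : t + 1])
        if highest > lowest:
            out[t] = -100.0 * (highest - close[t]) / (highest - lowest)
    return out
```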

CCI
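Commodity Channel Index (CCI) is used to identify cyclical trends not only in commodities, but also equities and currencies. Define the Typical Price (TP) as follows.
$$\mathrm{TP}_t = \frac{\mathrm{max}_t + \mathrm{min}_t + c_t}{3}$$
By computing the simple average over the last $n$ periods of the typical price, $\mathrm{SMA}_t(n, \mathrm{TP})$, and its standard deviation, $\sigma_t(n, \mathrm{TP})$, CCI is defined as follows.
$$\mathrm{CCI}_t(n) = \frac{\mathrm{TP}_t - \mathrm{SMA}_t(n, \mathrm{TP})}{0.015\,\sigma_t(n, \mathrm{TP})}$$
In our case we use $n = 20$. Range: the CCI fluctuates above and below zero. The constant 0.015 should ensure that approximately 70-80% of CCI values lie between $-100$ and $+100$.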
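A minimal sketch of the CCI, assuming the typical price is the mean of the day's maximum, minimum and closing prices (the usual definition) and that the dispersion is the population standard deviation of the typical price over the window, as the text indicates. Many implementations use the mean absolute deviation instead, so results may differ from other sources.

```python
import statistics

def cci(high, low, close, n=20):
    """(TP - mean of TP over n days) / (0.015 * std of TP over n days)."""
    tp = [(h + l + c) / 3.0 for h, l, c in zip(high, low, close)]
    out = [None] * len(tp)
    for t in range(n - 1, len(tp)):
        window = tp[t - n + 1 : t + 1]
        sd = statistics.pstdev(window)  # population standard deviation (assumed)
        if sd > 0:
            out[t] = (tp[t] - statistics.mean(window)) / (0.015 * sd)
    return out
```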

UO
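Ultimate Oscillator (UO) uses buying or selling "pressure", represented by where the daily closing price falls within the daily true range. The Buying Pressure (BP) and the True Range (TR) are computed as follows.
$$\mathrm{BP}_t = c_t - \min\{\mathrm{min}_t, c_{t-1}\}$$
$$\mathrm{TR}_t = \max\{\mathrm{max}_t, c_{t-1}\} - \min\{\mathrm{min}_t, c_{t-1}\}$$
Then, the total buying pressure over the past $n$ days is computed as follows.
$$\mathrm{avg}_t(n) = \frac{\sum_{i=0}^{n-1} \mathrm{BP}_{t-i}}{\sum_{i=0}^{n-1} \mathrm{TR}_{t-i}}$$
Such a total buying pressure is computed for short, intermediate and long time intervals, and the UO is:
$$\mathrm{UO}_t(n, m, s) = 100\,\frac{4\,\mathrm{avg}_t(n) + 2\,\mathrm{avg}_t(m) + \mathrm{avg}_t(s)}{4 + 2 + 1}$$
In our case we use $n = 7$, $m = 14$ and $s = 28$. Range: it ranges between 0 and 100.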
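A minimal sketch of the Ultimate Oscillator built from the BP and TR definitions above. The ratio of summed BP to summed TR over each window and the weights 4, 2 and 1 follow the usual formulation of this oscillator and are assumptions here; windows whose total true range is zero are left undefined.

```python
def ultimate_oscillator(high, low, close, n=7, m=14, s=28):
    """Weighted mix of buying-pressure ratios over short, medium, long windows."""
    bp, tr = [], []
    for t in range(1, len(close)):
        floor = min(low[t], close[t - 1])
        bp.append(close[t] - floor)
        tr.append(max(high[t], close[t - 1]) - floor)

    def ratio(i, k):
        """Sum of BP over sum of TR for the k entries ending at index i."""
        total_tr = sum(tr[i - k + 1 : i + 1])
        return None if total_tr == 0 else sum(bp[i - k + 1 : i + 1]) / total_tr

    out = [None] * len(close)
    for i in range(s - 1, len(bp)):
        parts = [ratio(i, n), ratio(i, m), ratio(i, s)]
        if None not in parts:
            # bp[i] and tr[i] belong to record i + 1 of the price series.
            out[i + 1] = 100.0 * (4 * parts[0] + 2 * parts[1] + parts[2]) / 7.0
    return out
```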

Class
The problem of data classification is the attribution of labels to records according to a criterion automatically learned from a training set, that is, a set of records that already have a class. Classification is a very important data mining task (see also [5]), and many classification algorithms are available today (e.g., [6]). We assign the class to each record, so that any portion of the dataset can be used as a training set. After this learning phase, the classification algorithm will be able to predict the class for the rest of the records. The accuracy of such a prediction can be computed by comparing it with the real class, which is the one given in the dataset.
The class that we assign to each daily record is 1 if the subsequent day is favorable for intra-day trading and 0 otherwise. Favorable for intra-day trading means that the increase between the opening price and the closing price of the same day is large enough for obtaining a profit by buying at the opening price and selling at the closing price. A threshold must be selected to define "large enough"; we select the value 0.3%, which should provide a reasonable opportunity for profit. Therefore, the class is defined as follows.
$$\mathrm{class}_t = \begin{cases} 1 & \text{if } (c_{t+1} - o_{t+1})/o_{t+1} > 0.003 \\ 0 & \text{otherwise} \end{cases}$$
Its prediction would allow intra-day trading on the following day, as described above, or it could possibly be used to define inter-day trading strategies.
Note that the class of a given day is clearly not computable from the data available up to that day. However, we assigned it for the whole dataset by simply looking, for each day, at the following day.
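The labeling rule described above can be sketched as follows. Measuring the increase relative to the opening price and using a strict inequality against the 0.3% threshold are assumptions; the last record receives no label because it has no following day.

```python
def day_trading_class(open_, close, threshold=0.003):
    """Label day t with 1 if day t+1 gains more than `threshold` open-to-close."""
    labels = []
    for t in range(len(close) - 1):
        gain = (close[t + 1] - open_[t + 1]) / open_[t + 1]
        labels.append(1 if gain > threshold else 0)
    labels.append(None)  # class not available for the last record
    return labels
```

Because each label looks one day ahead, any train/test split used for evaluation should respect time order so that the classifier never sees data from the days it is asked to predict.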
According to technical analysis, there should be some kind of relation between the above described indicators at day $t$ and the market evolution at day $t+1$, that determines the class of day $t$ (see for example [7]). The classification algorithm aims at discovering such a relation by predicting the class using the above described indicators. For an analysis of the existence of profit opportunities with respect to the market index, see also [8].
The datasets are provided in CSV format, which can be opened with MS Excel or as a text file.

Transparency document. Supplementary material
Transparency data associated with this paper can be found in the online version at http://dx.doi.org/10.1016/j.dib.2016.12.044.