Elsevier

Expert Systems with Applications

Volume 42, Issue 3, 15 February 2015, Pages 1325-1339

Developing an approach to evaluate stocks by forecasting effective features with data mining methods

https://doi.org/10.1016/j.eswa.2014.09.026

Highlights

  • A comprehensive study on likely effective features in risk and return prediction.

  • Stock risk and return prediction with different classification methods.

  • Hybrid FS algorithm on the basis of filter and function-based clustering.

Abstract

In this research, a novel approach is developed to predict stock return and risk. In this three-stage method, all features that may affect stock risk and return are first identified through a comprehensive investigation. In the next stage, risk and return are predicted by applying data mining techniques to the given features. Finally, we develop a hybrid algorithm, on the basis of filter and function-based clustering, to select the important features in risk and return prediction, after which risk and return are re-predicted. The results show that the proposed hybrid model is a proper tool for effective feature selection and that these features are good indicators for the prediction of risk and return. To illustrate the approach, as well as to train and test the models, we apply it to Tehran Stock Exchange (TSE) data from 2002 to 2011.

Introduction

One of the most important concerns of market practitioners is future information about the companies that offer stocks. A reliable prediction of a company’s financial status allows the investor to invest more confidently and gain more profit (Huang, 2012a, Huang, 2012b). Different studies address share gain and return prediction, for example, time series stock price prediction models (Araújo & Ferreira, 2013), buy–hold–sell prediction models (Wu et al., 2014, Zhang et al., 2014), index prediction models with ANFIS (Svalina, Galzina, Lujić, & Šimunović, 2013) or MARS and SVR (Kao, Chiu, Lu, & Chang, 2013), and profit gaining (Ng, Liang, Li, Yeung, & Chan, 2014). However, unlike return, risk has rarely been considered for prediction, although customers usually balance their return against a proper level of risk, so both risk and return are clearly important factors in financial decision making (Barak et al., 2013, Tsai et al., 2011). Without risk evaluation, the portfolio efficient frontier does not make sense. Thus, this paper implements the forecasting of both risk and return of stocks, which has a tremendous effect on price setting. Also, up-down prediction of stock movement, as in (Patel et al., 2014, Yu et al., 2014, Zhang et al., 2014), cannot give a precise view of a stock’s future or of investors’ gains, whereas classifying the amount of risk and return into different categories, as in our method, gives more specific and clear knowledge.

Therefore, in this study, the simultaneous prediction of risk and return classes with different classification algorithms is investigated.

To predict risk and return variables accurately, the effective factors need to be identified. In fact, one of the key issues of stock prediction design lies in how to select representative features for prediction (Zhang, Hu, Xie, Wang, et al., 2014).

Most studies in this area focus on technical features, financial ratios or macroeconomic indicators. For example, Tsai and Hsiao (2010) studied 8 financial ratios and 16 macroeconomic indicators as the main features to predict stock return by back propagation in the Taiwan stock market. Cheng, Chen, and Lin (2010) conducted a comprehensive study on macroeconomic and technical features and studied 8 financial ratios and 10 macroeconomic indicators to investigate their effect on return variation in the Taiwan stock market. By applying a probabilistic back propagation algorithm, rough sets and the C4.5 tree, they achieved 76% accuracy. de Oliveira, Nobre, and Zárate (2013) used 15 technical indicators and 11 fundamental indexes to predict stock movement in Petrobras with artificial neural networks and obtained 87.50% for direct prediction. Tsai et al. (2011) considered 19 financial ratios and 11 macroeconomic indicators in the Taiwan stock market, combining the logistic regression algorithm, MLP back propagation and the CART tree to investigate their effect (negative or positive) on the stock return, and achieved 66.67% accuracy based on bagging and voting algorithms. In the majority of studies, as mentioned, the focus is mostly on financial ratios, macroeconomic indicators, and technical indicators chosen on the basis of experts’ ideas to predict returns. This paper, by contrast, presents a systematic and efficient methodology for comprehensively searching for potential representative features of the stock market in 3 categories (financial ratios, profit and loss reports, and stock pricing models) rather than arbitrarily choosing likely effective features.

Furthermore, many studies have claimed and verified that feature selection (FS) is the key process in stock prediction modeling (Tsai & Hsiao, 2010). Zhang, Hu, Xie, Wang, et al. (2014) use a causal feature selection (CFS) algorithm to find effective features in the Shanghai stock exchanges. Their model is built on a causality-based feature selection algorithm. They assert that CFS represents direct influences between various stock features, while correlation-based algorithms cannot distinguish direct influences from indirect ones. Wu et al. (2014) use textual and technical features to improve the prediction accuracy of the stock market. They use the SVR algorithm and a trend segmentation method to forecast trends and generate trading signals, respectively. Their feature selection algorithm is stepwise regression analysis. Although there are a variety of studies in the area of feature selection, almost all of them use a single feature selection model.

In this research, a novel hybrid feature selection algorithm on the basis of filter and function-based clustering methods is applied to select the important features. What makes our proposed approach different from previous ones is that we combine 9 different feature selection algorithms with a function-based clustering algorithm. The hybrid model of this paper enjoys the power and advantages of correlation-based algorithms like Chi-square and One-R, in addition to the power of classification-error-based, interval-based, and information-based algorithms like SVM, Relief-f, and the Gini index/gain ratio algorithms, respectively. The effectiveness of our model is illustrated by predicting both risk and return of stocks and then analyzing the results with and without applying our hybrid feature selection algorithm.
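The idea of combining several filter scores and then clustering the features can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes only three filters instead of nine, uses a small deterministic 2-means as a stand-in for the function-based clustering, and the feature names (`roe`, `eps`, `noise_a`, `noise_b`), the score values, and the function names are all hypothetical.

```python
from statistics import mean

def min_max(scores):
    """Rescale one filter's scores to [0, 1] so that filters are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {f: (s - lo) / span for f, s in scores.items()}

def two_means(vectors, iterations=20):
    """Small deterministic 2-means over per-feature score vectors."""
    names = sorted(vectors)
    # Seed the centroids with the weakest and strongest feature by mean score.
    low_seed = min(names, key=lambda f: mean(vectors[f]))
    high_seed = max(names, key=lambda f: mean(vectors[f]))
    centroids = [list(vectors[low_seed]), list(vectors[high_seed])]
    assign = {}
    for _ in range(iterations):
        new_assign = {}
        for f in names:
            dists = [sum((a - b) ** 2 for a, b in zip(vectors[f], c))
                     for c in centroids]
            new_assign[f] = dists.index(min(dists))
        if new_assign == assign:
            break  # assignments are stable
        assign = new_assign
        for k in (0, 1):
            members = [vectors[f] for f in names if assign[f] == k]
            if members:
                centroids[k] = [mean(col) for col in zip(*members)]
    return assign, centroids

def hybrid_select(filter_scores):
    """Keep the features in the cluster whose centroid has the higher mean score."""
    normalized = [min_max(s) for s in filter_scores.values()]
    features = sorted(next(iter(filter_scores.values())))
    vectors = {f: [n[f] for n in normalized] for f in features}
    assign, centroids = two_means(vectors)
    best = max((0, 1), key=lambda k: mean(centroids[k]))
    return [f for f in features if assign[f] == best]

# Hypothetical filter outputs for four made-up features (illustration only).
scores = {
    "chi_square": {"roe": 9.0, "eps": 8.0, "noise_a": 1.0, "noise_b": 0.5},
    "relief_f": {"roe": 0.8, "eps": 0.9, "noise_a": 0.1, "noise_b": 0.2},
    "gain_ratio": {"roe": 0.6, "eps": 0.5, "noise_a": 0.05, "noise_b": 0.0},
}
selected = hybrid_select(scores)
print(selected)
```

Clustering the full score vectors, rather than thresholding a single averaged score, keeps a feature that several filters agree on together with its peers, which is one plausible reading of why a hybrid of filters could be more robust than any single one.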

To sum up, in the first stage of the paper, a complete list of features likely to affect stock risk and return is identified. After developing an appropriate database, different classification algorithms are used in the second stage to predict risk and return. We also scrutinize their results on our database from a feature-oriented viewpoint. Finally, in the third stage, a novel hybrid feature selection algorithm on the basis of filter and function-based clustering methods is applied to select the important features that affect the prediction of risk and return.

The contributions of the paper are summarized as follows:

  • A comprehensive and systematic study to identify the likely effective features in risk and return prediction.

  • Stock risks as well as return prediction with different classification methods.

  • Designing a hybrid feature selection algorithm on the basis of filter and function-based clustering.

  • Finally, each algorithm is analyzed from a feature-oriented viewpoint. The results indicate the factors that cause the strengths and weaknesses of each algorithm. As a result, the nature of each feature is provided according to the amount of interference variable in its prediction.

The rest of the article is organized as follows. In Section 2, the proposed three-stage model is presented. In Section 3, to illustrate the approach, we implement it with real data from the Tehran Stock Exchange (TSE). The results are analyzed, and the predictions with and without the important effective features are compared. Then, in Section 4, a discussion of real return and risk prediction with important features is presented. Finally, conclusions and future research directions are provided in Section 5.

Section snippets

Proposed model

Our proposed algorithm, which consists of three stages, is shown in Fig. 1. In the first stage, a database is developed and the data are pre-processed. In the next stage, non-systematic risk as well as real return is predicted with classification algorithms. In the third stage, a hybrid feature selection algorithm is presented, and risk and return are re-predicted based on the selected features.
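The predict, select, re-predict flow can be sketched on toy data. This is only an illustration under stated assumptions: a 1-nearest-neighbour classifier stands in for the classification algorithms, the selected feature subset is given by hand rather than computed, and every row below is made up (column 0 is an informative feature, column 1 a misleading one). It says nothing about the accuracies reported in the paper.

```python
def nearest_neighbour(train, x, keep):
    """Label of the training row closest to x, using only the columns in keep."""
    def dist(row):
        return sum((row[i] - x[i]) ** 2 for i in keep)
    label, _ = min(train, key=lambda pair: dist(pair[1]))
    return label

def accuracy(train, test, keep):
    """Share of test rows whose predicted class matches the true class."""
    hits = sum(nearest_neighbour(train, x, keep) == y for y, x in test)
    return hits / len(test)

# Hypothetical (class, [informative, misleading]) rows, for illustration only.
train = [
    ("low", [1.0, 0.0]), ("low", [1.5, 1.0]),
    ("high", [3.0, 9.0]), ("high", [3.5, 10.0]),
]
test = [
    ("low", [1.2, 9.5]), ("high", [3.2, 0.5]),
    ("low", [1.1, 0.5]), ("high", [3.4, 9.5]),
]

all_features = [0, 1]   # stage 2: predict with every feature
selected = [0]          # stage 3: assumed output of a feature selection step

acc_all = accuracy(train, test, all_features)
acc_selected = accuracy(train, test, selected)
print(acc_all, acc_selected)
```

Comparing the same classifier on the full and the reduced feature set mirrors the with/without comparison described above; on real data the reduced set can raise or lower accuracy depending on the classifier.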

Experimental results and analysis

In this study, a database including 44 input features and 2 goal features is gathered from TSE data from 2003 to 2012. The resulting database has 1963 records for 400 companies.

According to a group of experts, 5 intervals were introduced for the real return: very high, with a range higher than 9.3; high, with a range of 4–9.3; average, with a range of 1.14–4; low, with a range of −1.3 to 1.14; and very low, with a range lower than −1.3. Risk is also classified in 3 intervals: high in range of higher than
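The five real-return classes can be expressed as a small lookup. This is a minimal sketch: the cut points come from the text, but the text does not say which class a value exactly on a boundary belongs to, so assigning boundary values to the lower class is an assumption, as are the names `RETURN_CUTS`, `RETURN_LABELS`, and `return_class`. The risk intervals are not reproduced because they are not fully given here.

```python
from bisect import bisect_left

# Cut points from the text; boundary handling (value on a cut point goes to
# the lower class) is an assumption, not something the source specifies.
RETURN_CUTS = [-1.3, 1.14, 4.0, 9.3]
RETURN_LABELS = ["very low", "low", "average", "high", "very high"]

def return_class(real_return: float) -> str:
    """Map a real return value to one of the five expert-defined classes."""
    return RETURN_LABELS[bisect_left(RETURN_CUTS, real_return)]

for value in (-2.0, 0.0, 2.0, 5.0, 10.0):
    print(value, return_class(value))
```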

The real return results in prediction with selected features

For trees with a denser structure, such as “BF Tree”, “LAD Tree”, and “FT Tree”, accuracy improves if all features that were effective in the first prediction are selected by the proposed hybrid model. Otherwise, accuracy may drop, as with the “CART” and “Rep” trees. The selected features have different effects on forecasting accuracy. Some trees with a large structure, such as J48 Graph and J48 Tree, get lower accuracy, while others, such as ID3 Numerical, get higher accuracy. Higher accuracy

Conclusions

In this study, an approach for the simultaneous prediction of risk and real return was developed by applying data mining techniques as well as a fundamental data set. To do this, the features that can potentially affect risk and return were first investigated through a comprehensive study. Then, after developing an appropriate database, the database was preprocessed. To predict the real return and risk, 20 and 15 different prediction algorithms were applied, respectively.

References (64)

  • E.F. Fama et al.

    Size, value, and momentum in international stock returns

    Journal of Financial Economics

    (2012)
  • T.-P. Hong et al.

    Mining rules from an incomplete dataset with a high missing rate

    Expert Systems with Applications

    (2011)
  • C.-F. Huang

    A hybrid stock selection model using genetic algorithms and support vector regression

    Applied Soft Computing

    (2012)
  • C.F. Huang

    A hybrid stock selection model using genetic algorithms and support vector regression

    Applied Soft Computing

    (2012)
  • C.-J. Huang et al.

    Application of wrapper approach and composite classifier to the stock trend prediction

    Expert Systems with Applications

    (2008)
  • L.-J. Kao et al.

    A hybrid approach by integrating wavelet-based feature extraction with MARS and SVR for stock index forecasting

    Decision Support Systems

    (2013)
  • R.K. Lai et al.

    Evolving and clustering fuzzy decision tree for financial time series data forecasting

    Expert Systems with Applications

    (2009)
  • W.-S. Lee et al.

    Combined MCDM techniques for exploring stock selection based on Gordon model

    Expert Systems with Applications

    (2009)
  • N. Levin et al.

    Predictive modeling using segmentation

    Journal of Interactive Marketing

    (2001)
  • J. Lewellen

    Predicting returns with financial ratios

    Journal of Financial Economics

    (2004)
  • B. Li

    Sign eigenanalysis and its applications to optimization problems and robust statistics

    Computational Statistics & Data Analysis

    (2006)
  • H.-Y. Lin

    Feature selection based on cluster and variability analyses for ordinal multi-class classification problems

    Knowledge-Based Systems

    (2013)
  • G. Sadka et al.

    Predictability and the earnings–returns relation

    Journal of Financial Economics

    (2009)
  • I. Svalina et al.

    An adaptive network-based fuzzy inference system (ANFIS) for the forecasting: The case of close price indices

    Expert Systems with Applications

    (2013)
  • C.-F. Tsai et al.

    Combining multiple feature selection methods for stock prediction: Union, intersection, and multi-intersection approaches

    Decision Support Systems

    (2010)
  • C.-F. Tsai et al.

    Predicting stock returns by classifier ensembles

    Applied Soft Computing

    (2011)
  • C.-F. Tsai et al.

    Determinants of intangible assets value: The data mining approach

    Knowledge-Based Systems

    (2012)
  • J.-L. Wang et al.

    Stock market trading rule discovery using two-layer bias decision tree

    Expert Systems with Applications

    (2006)
  • J.-L. Wu et al.

    An intelligent stock trading system using comprehensive features

    Applied Soft Computing

    (2014)
  • H. Yu et al.

    A SVM stock selection model within PCA

    Procedia Computer Science

    (2014)
  • R. Bauer et al.

    Empirical evidence on corporate governance in Europe: The effect on stock returns, firm value and performance

    Journal of Asset Management

    (2004)
  • L. Bernstein et al.

    Analysis of financial statements

    (1999)