Pattern Recognition Letters

Volume 31, Issue 2, 15 January 2010, Pages 151-162

Selecting features of linear-chain conditional random fields via greedy stage-wise algorithms

https://doi.org/10.1016/j.patrec.2009.09.025

Abstract

This paper presents two embedded feature selection algorithms for linear-chain CRFs, named GFSA_LCRF and PGFSA_LCRF. GFSA_LCRF iteratively selects the feature whose incorporation into the CRF most improves the conditional log-likelihood at each step. For time efficiency, only the weight of the new feature is optimized to maximize the log-likelihood, instead of all feature weights in the CRF. The process is iterated until incorporating new features into the CRF no longer improves its log-likelihood noticeably. PGFSA_LCRF adopts pseudo-likelihood as the evaluation criterion for iteratively selecting features, which improves the speed of GFSA_LCRF. Furthermore, at certain iterations it scans all candidate features and forms a small feature set containing the most promising features; this small set is then used by subsequent iterations to further improve the speed. Experiments on two real-world problems show that CRFs with significantly fewer features selected by our algorithms achieve competitive performance while requiring significantly less testing time.

Introduction

Conditional random fields (CRFs) (Lafferty et al., 2001) are a probabilistic framework for labeling and segmenting relational data. A CRF is an undirected graphical model that encodes a conditional probability distribution and is trained to maximize the conditional probability of the outputs given the inputs. CRFs have attracted considerable interest in many fields (McCallum, 2003, McCallum et al., 2003, Taskar et al., 2003, Kumar and Hebert, 2003, Zhu et al., 2005, Sarawagi and Cohen, 2005, Sutton and McCallum, 2007a) and have been applied to many tasks with great success (Sha and Pereira, 2003, Settles, 2004, Quattoni et al., 2005, Culotta et al., 2005, Lee et al., 2005).

Compared with generative models, CRFs have the advantage of flexibly incorporating arbitrary, overlapping and agglomerative observation features. Systems therefore often construct a large number of features for CRFs, based on feature templates over observations which assume that a fixed-size neighborhood of the current observation is relevant to predicting the output value. However, are more features always better? On one hand, more features provide more information about the data. On the other hand, too many features easily cause overfitting as well as storage and computation problems. Therefore, this paper explores the issue of selecting features for CRFs.
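
As an illustration of such templates, the sketch below expands a few hypothetical observation templates (current word, lowercased word, previous and next words) at one position of a sentence. The template set and naming scheme are illustrative assumptions, not the templates used in the paper.

# Minimal sketch of feature-template expansion for one token position.
# In a CRF, each observation feature produced here would further be
# conjoined with a label (or label pair) to form an actual model feature.
def expand_templates(words, t):
    feats = []
    feats.append("w[0]=" + words[t])                   # current word
    feats.append("lower(w[0])=" + words[t].lower())    # lowercased word
    if t > 0:
        feats.append("w[-1]=" + words[t - 1])          # previous word
    if t < len(words) - 1:
        feats.append("w[+1]=" + words[t + 1])          # next word
    return feats

# Example: expand_templates(["Rockwell", "said", "the", "agreement"], 1)
# yields ["w[0]=said", "lower(w[0])=said", "w[-1]=Rockwell", "w[+1]=the"]

Applying such templates to every position of every training sentence is what produces the very large candidate feature sets that motivate feature selection.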

Feature selection methods can be divided into three categories: filter methods, wrapper methods and embedded methods (Blum, 1997, Das, 2001, Guyon and Elisseeff, 2003, Yu and Liu, 2004). Filter methods (Yu and Liu, 2003) evaluate the goodness of features based on statistical characteristics of the training data, e.g. the relevance of features or attributes to the class concept, independently of any learning algorithm. The selection serves as a pre-processing step to the learning algorithm, which then learns a hypothesis or predictor to classify new, unlabeled data. These methods have the advantage of speed. Wrapper methods (Kohavi and John, 1997) predetermine a learning algorithm and wrap feature selection around the learning process: for each new subset of features they learn a new predictor and use its performance to score the subset. The selected subset should therefore suit the predetermined learning algorithm and yield better performance. However, these methods are often criticized for being slow and sometimes computationally infeasible. Embedded methods (Das, 2001, Breiman et al., 1984) lie between filter and wrapper methods: they are usually specific to a learning algorithm and perform feature selection during the learning process.

Features in CRFs are not the same as features or attributes in traditional classification problems; they are conjunctions of attributes and labels. It is therefore improper to evaluate the goodness of CRF features using only criteria designed for traditional feature selection, and new criteria would have to be developed before filter methods could be applied. Moreover, wrapper methods cost too much time to be practical. In this paper, we present two embedded feature selection methods for CRFs, named GFSA_LCRF and PGFSA_LCRF. GFSA_LCRF iteratively selects the feature that will improve the log-likelihood of the CRF most at each step. In detail, it starts with a CRF without features. It then greedily selects the feature whose insertion into the CRF maximally improves the log-likelihood of the training data. Note that, when computing the gain in log-likelihood from incorporating a feature, it optimizes only the coefficient of this feature instead of tuning all coefficients in the CRF, for time efficiency. The selected feature is then inserted into the CRF with its optimal coefficient. These steps are repeated until no obvious improvement (controlled by a threshold) of the log-likelihood of the training data is obtained by incorporating a new feature. Finally, all the features that have been incorporated into the CRF are treated as a new feature set for learning new models; a sketch of this loop is given below. The running time of GFSA_LCRF is proportional to the square of the number of all features, and it involves a very time-consuming process of computing a combinational polynomial. Thus, when there are a great number of features, GFSA_LCRF is impractical. To speed up GFSA_LCRF, PGFSA_LCRF adopts pseudo-likelihood instead of likelihood as the evaluation metric for iteratively selecting features. Moreover, every iter iterations it scans all candidate features and forms a small feature set consisting only of promising features, and at the other iterations it scans only the features in this small set, further improving the speed. Experiments on a small-scale problem, FAQs segmentation, show that CRFs with features selected by GFSA_LCRF perform best on most tasks. Experiments on a large-scale problem, noun phrase segmentation, show that CRFs with features selected by PGFSA_LCRF perform better and have significantly shorter testing time than those with all features.
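
The following is a minimal sketch of this greedy stage-wise loop, assuming a generic conditional log-likelihood function is available. The coarse one-dimensional grid search over the single new weight, and all names, are illustrative stand-ins rather than the authors' implementation.

from typing import Callable, Dict, Iterable, Tuple

def optimize_single_weight(loglik: Callable[[Dict[str, float]], float],
                           weights: Dict[str, float],
                           feature: str,
                           grid=(-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0)) -> Tuple[float, float]:
    # Optimize only the new feature's weight (coarse grid search here),
    # keeping all previously selected weights fixed.
    best_w, best_ll = 0.0, float("-inf")
    for w in grid:
        trial = dict(weights)
        trial[feature] = w
        ll = loglik(trial)
        if ll > best_ll:
            best_w, best_ll = w, ll
    return best_w, best_ll

def greedy_select(loglik: Callable[[Dict[str, float]], float],
                  candidates: Iterable[str],
                  threshold: float = 1e-3) -> Dict[str, float]:
    # Start from an empty CRF and repeatedly add the candidate feature whose
    # single-weight optimization yields the largest log-likelihood gain,
    # stopping when the best gain falls below the threshold.
    selected: Dict[str, float] = {}
    current_ll = loglik(selected)
    remaining = set(candidates)
    while remaining:
        scored = [(f,) + optimize_single_weight(loglik, selected, f) for f in remaining]
        feature, weight, new_ll = max(scored, key=lambda t: t[2])
        if new_ll - current_ll < threshold:
            break
        selected[feature] = weight
        current_ll = new_ll
        remaining.remove(feature)
    return selected

In the actual algorithm, the single new weight would be found by a proper one-dimensional optimization of the gain (the combinational polynomial mentioned above); the grid search here only stands in for that step.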

The rest of the paper is organized as follows. After an introduction to CRFs in Section 2, related work is presented in Section 3. In Section 4, we present two greedy stage-wise feature selection algorithms for linear-chain CRFs. Afterward, experimental results on two real-world problems are reported in Section 5. Finally, Section 6 concludes the paper and details future work.

Section snippets

Conditional random fields

In this section, we first briefly review conditional random fields (CRFs) with emphasis on linear-chain CRFs, then describe training methods for linear-chain CRFs, and finally present labeling methods for linear-chain CRFs.
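
For reference, a linear-chain CRF defines the conditional distribution over a label sequence y given an observation sequence x in the standard form below; this is the generic textbook formulation, reproduced here for convenience rather than taken verbatim from the paper:

p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, x, t) \right)

where each f_k is a feature defined on a label pair (or a single label) and the observations, and \lambda_k is its weight; training maximizes the conditional log-likelihood of the labeled data with respect to \lambda.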

Related work

In real-world applications, solving the task of labeling or segmenting relational data with CRFs requires two steps. First, proper features for the model should be extracted from the raw data. Then the model with these features is learned, after which the learned model can be used to label or segment relational data. However, researchers have mainly laid emphasis on enhancing the model (McCallum et al., 2003, Taskar et al., 2003, Sarawagi and Cohen, 2005, Sutton and McCallum, 2007b) and improving

The greedy stage-wise feature selection algorithm for linear-chain CRFs

In this section, we first present an embedded feature selection algorithm, the greedy feature selection algorithm for linear-chain CRFs (GFSA_LCRF), and describe the process of computing a combinational polynomial in GFSA_LCRF; we then improve GFSA_LCRF for large-scale problems, presenting the pseudo-likelihood greedy feature selection algorithm for linear-chain CRFs (PGFSA_LCRF).
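
As background for PGFSA_LCRF, a common form of the pseudo-likelihood objective for chain-structured models replaces the full conditional likelihood with a sum of per-position log-conditionals given the observed neighboring labels; the paper's exact definition may differ in detail:

\ell_{PL}(\lambda) = \sum_{i} \sum_{t} \log p_\lambda\!\left( y_t^{(i)} \mid y_{t-1}^{(i)}, y_{t+1}^{(i)}, x^{(i)} \right)

Because each term conditions on the true neighboring labels, evaluating this objective avoids summing over all label sequences to compute the partition function Z(x), which is what makes pseudo-likelihood-based gain evaluation cheaper than the exact likelihood gain.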

Experiments

In this section, we evaluate the proposed feature selection algorithms for CRFs on two real-world problems: FAQs segmentation, a small-scale problem, and noun phrase segmentation, a large-scale problem. All experiments were performed with our C++ implementation of first-order CRFs on a personal computer with a 1.86 GHz Core(TM) 2 processor, 2 GB of memory and the Windows XP operating system.

Conclusions and future work

This paper studies feature selection for linear-chain CRFs, assuming that sufficient features of the data have been extracted from hand-crafted feature templates, and presents two embedded feature selection algorithms for CRFs. The two methods, GFSA_LCRF and PGFSA_LCRF, are founded on the principle of iteratively selecting the feature that would maximally improve the conditional likelihood or the pseudo-likelihood, respectively, if added to the existing CRF. GFSA_LCRF is better suited to small-scale problems,

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful comments that greatly improved this paper. This work was supported by the National High Technology Research and Development Program (863 Program) of China (Grant No. 2006AA01Z107), the National Natural Science Foundation of China (No. 60702062) and the Key Project of Ministry of Education of China (No. 108115).

References (30)

  • Blum, A.L., 1997. Selection of relevant features and examples in machine learning. Artif. Intell.
  • Kohavi, R., et al., 1997. Wrappers for feature subset selection. Artif. Intell.
  • Berger, A.L., et al., 1996. A maximum entropy approach to natural language processing. Comput. Linguist.
  • Breiman, L., et al., 1984. Classification and Regression Trees.
  • Culotta, A., Kulp, D., et al., 2005. Gene Prediction with Conditional Random Fields. Technical Report, University of...
  • Das, S., 2001. Filters, wrappers and a boosting-based hybrid for feature selection. In: Proc. 18th Internat. Conf. on...
  • Dietterich, T.G., 2002. Machine learning for sequential data: A review. In: Proc. Joint IAPR Internat. Workshop on...
  • Dietterich, T.G., Ashenfelter, A., et al., 2004. Training conditional random fields via gradient tree boosting. In:...
  • Freitag, D., McCallum, A., 2000. Information extraction with HMM structure learned by stochastic optimization. In:...
  • Guyon, I., et al., 2003. An introduction to variable and feature selection. J. Machine Learn. Res.
  • Kumar, S., Hebert, M., 2003. Discriminative random fields: A discriminative framework for contextual. In: Proc. 9th...
  • Lafferty, J.D., McCallum, A., et al., 2001. Conditional random fields: Probabilistic models for segmenting and labeling...
  • Lee, C., et al., 2005. Segmenting brain tumors with conditional random fields and support vector machines. Lect. Notes Comput. Sci.
  • McCallum, A., 2003. Efficiently inducing features of conditional random fields. In: Proc. 19th Conf. on Uncertainty in...
  • McCallum, A., Freitag, D., et al., 2000. Maximum entropy Markov models for information extraction and segmentation. In:...