Selecting features of linear-chain conditional random fields via greedy stage-wise algorithms
Introduction
Conditional random fields (CRFs) (Lafferty et al., 2001) are a probabilistic framework for labeling and segmenting relational data. A CRF is an undirected graphical model encoding a conditional distribution, trained to maximize the conditional probability of the outputs given the inputs. CRFs have attracted considerable research interest in many fields (McCallum, 2003, McCallum et al., 2003, Taskar et al., 2003, Kumar and Hebert, 2003, Zhu et al., 2005, Sarawagi and Cohen, 2005, Sutton and McCallum, 2007a) and have been applied to many tasks with great success (Sha and Pereira, 2003, Settles, 2004, Quattoni et al., 2005, Culotta et al., 2005, Lee et al., 2005).
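For a linear-chain CRF over an observation sequence x and a label sequence y, the conditional distribution takes the standard form of Lafferty et al. (2001), sketched here for reference:

```latex
p(\mathbf{y}\mid\mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}
  \exp\Bigl(\sum_{t=1}^{T}\sum_{k}\lambda_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t)\Bigr),
\qquad
Z(\mathbf{x}) \;=\; \sum_{\mathbf{y}'}
  \exp\Bigl(\sum_{t=1}^{T}\sum_{k}\lambda_k\, f_k(y'_{t-1}, y'_t, \mathbf{x}, t)\Bigr)
```

where the f_k are feature functions over a label pair and the observation sequence, and the lambda_k are their coefficients; it is these f_k that the present paper selects among.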
Compared with generative models, CRFs have the advantage of flexibly incorporating arbitrary, overlapping and agglomerative observation features. Systems therefore often construct a large number of features for CRFs, based on observation feature templates which assume that a fixed-size neighborhood of the current observation is relevant to predicting the output value. But are more features always better? On one hand, more features provide more information about the data. On the other hand, too many features can easily cause overfitting as well as storage and computation problems. This paper therefore explores the issue of selecting features for CRFs.
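A toy sketch of how such feature templates blow up into a large candidate set: every observation test in a fixed-size window is conjoined with every label (plus label-bigram transition features). All names and the tiny example sentence here are hypothetical, not from the paper.

```python
def window_features(tokens, t, width=2):
    """Observation tests from a +/-`width` window around position t."""
    feats = []
    for offset in range(-width, width + 1):
        i = t + offset
        word = tokens[i] if 0 <= i < len(tokens) else "<PAD>"
        feats.append(f"w[{offset}]={word}")
    return feats

def candidate_features(sentence, labels):
    """Conjoin every observation test with every label; add label bigrams."""
    cands = set()
    for t in range(len(sentence)):
        for obs in window_features(sentence, t):
            for y in labels:
                cands.add((obs, y))            # state feature
    for y1 in labels:
        for y2 in labels:
            cands.add(("bigram", y1, y2))      # transition feature
    return cands

cands = candidate_features(["the", "cat", "sat"], ["B-NP", "I-NP", "O"])
```

Even this three-token sentence with three labels yields dozens of candidates; on a real corpus the count easily reaches hundreds of thousands, which is what motivates feature selection.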
Feature selection methods can be divided into three categories: filter methods, wrapper methods and embedded methods (Blum, 1997, Das, 2001, Guyon and Elisseeff, 2003, Yu and Liu, 2004). Filter methods (Yu and Liu, 2003) evaluate the goodness of features based on statistical characteristics of the training data, e.g. the relevance of features or attributes to the class concept, independently of any learning algorithm. The process serves as a pre-processing step to the learning algorithm, which then learns a hypothesis or predictor to classify new, unlabeled data. These methods have the advantage of speed. Wrapper methods (Kohavi and John, 1997) predetermine a learning algorithm and wrap feature selection around the learning process. For each new subset of features, they learn a new predictor and use its performance to score the subset. The selected subset should thus suit the predetermined learning algorithm well. However, these methods are often criticized as slow and sometimes computationally infeasible. Finally, embedded methods (Das, 2001, Breiman et al., 1984) lie between filter and wrapper methods: they are usually specific to a learning algorithm and perform feature selection during the learning process.
Features in CRFs are not the same as features or attributes in traditional classification problems: they are actually conjunctions of attributes and labels. It is therefore improper to evaluate the goodness of CRF features using criteria designed for traditional feature selection, and new criteria for filter methods must be devised. Moreover, wrapper methods cost too much time to be practical. In this paper, we present two embedded feature selection methods for CRFs, named GFSA_LCRF and PGFSA_LCRF. GFSA_LCRF iteratively selects the feature that most improves the log-likelihood of the CRF. In detail, it starts with a CRF containing no features. At each iteration, it greedily selects the feature whose insertion into the CRF maximally improves the log-likelihood of the training data. Note that, when calculating the gain in log-likelihood from incorporating a feature into the CRF, it optimizes only the coefficient of that feature instead of tuning all coefficients in the CRF, for time efficiency. It then inserts the selected feature, with its optimal coefficient, into the CRF. These steps are repeated until incorporating a new feature yields no obvious improvement (controlled by a threshold) in the log-likelihood of the training data. Finally, all the features that have been incorporated into the CRF are treated as a new feature set for learning new models. The running time of GFSA_LCRF is proportional to the square of the number of candidate features, and it involves a very time-consuming computation of a combinational polynomial. Thus, when there are a great number of features, GFSA_LCRF is impractical. To speed it up, PGFSA_LCRF adopts pseudo-likelihood instead of likelihood as the evaluation metric for iteratively selecting features.
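The greedy stage-wise loop described above can be sketched on a toy zeroth-order model (length-1 "chains", so the CRF degenerates to a conditional log-linear classifier; all data and names here are hypothetical, not from the paper). The key point it illustrates is that each candidate's gain is scored by optimizing only that feature's own coefficient, via a one-dimensional search on the concave log-likelihood slice:

```python
import math

# Hypothetical toy data: each x is one token, y a binary label.
DATA = [("cat", 1), ("dog", 1), ("the", 0), ("a", 0), ("cat", 1), ("run", 0)]
LABELS = (0, 1)

def make_feature(word, label):
    """A CRF-style feature: conjunction of an observation test and a label."""
    return lambda x, y: 1.0 if (x == word and y == label) else 0.0

CANDIDATES = {f"{w}&{l}": make_feature(w, l)
              for w in {"cat", "dog", "the", "a", "run"} for l in LABELS}

def log_likelihood(active):
    """Conditional log-likelihood of DATA under weights `active` (name -> weight)."""
    ll = 0.0
    for x, y in DATA:
        scores = {lab: sum(w * CANDIDATES[n](x, lab) for n, w in active.items())
                  for lab in LABELS}
        log_z = math.log(sum(math.exp(s) for s in scores.values()))
        ll += scores[y] - log_z
    return ll

def gain_of(active, name):
    """Best gain from adding feature `name`, tuning ONLY its own coefficient."""
    def ll_with(w):
        return log_likelihood({**active, name: w})
    lo, hi = -10.0, 10.0
    for _ in range(60):             # ternary search: the 1-D slice of the
        m1 = lo + (hi - lo) / 3     # log-likelihood is concave, so this
        m2 = hi - (hi - lo) / 3     # converges to its maximum
        if ll_with(m1) < ll_with(m2):
            lo = m1
        else:
            hi = m2
    best_w = (lo + hi) / 2
    return ll_with(best_w) - log_likelihood(active), best_w

def greedy_select(threshold=1e-3):
    """GFSA-style loop: repeatedly add the feature with the largest gain."""
    active, remaining = {}, set(CANDIDATES)
    while remaining:
        gains = {n: gain_of(active, n) for n in remaining}
        best = max(gains, key=lambda n: gains[n][0])
        gain, w = gains[best]
        if gain < threshold:        # stop: no obvious improvement
            break
        active[best] = w
        remaining.discard(best)
    return active

selected = greedy_select()
```

Because every iteration rescans all remaining candidates, the loop performs O(K^2) gain evaluations for K candidates, matching the quadratic running time noted above; in a real linear-chain CRF each gain evaluation additionally requires inference over whole sequences, which is what PGFSA_LCRF's pseudo-likelihood avoids.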
Moreover, every iter iterations it scans all candidate features and forms a small feature set consisting only of the most promising features; at the other iterations it scans only the features in this small set, further improving speed. Experiments on a small-scale problem, FAQs segmentation, show that CRFs with features selected by GFSA_LCRF perform best on most tasks. Experiments on a large-scale problem, noun phrase segmentation, show that CRFs with features selected by PGFSA_LCRF perform better and have significantly shorter testing time than those with all features.
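The pseudo-likelihood objective that PGFSA_LCRF substitutes for the full likelihood can be sketched in the standard Besag-style form for a linear chain (an assumed formulation; the snippet above does not spell it out):

```latex
\mathrm{PL}(\Lambda) \;=\; \sum_{t=1}^{T} \log p\bigl(y_t \mid y_{t-1}, y_{t+1}, \mathbf{x}\bigr),
\qquad
p\bigl(y_t \mid y_{t-1}, y_{t+1}, \mathbf{x}\bigr) \;\propto\;
\exp\Bigl(\sum_{k}\lambda_k \bigl[f_k(y_{t-1}, y_t, \mathbf{x}, t) + f_k(y_t, y_{t+1}, \mathbf{x}, t+1)\bigr]\Bigr)
```

Each factor normalizes over the single label y_t with its true neighbors fixed, so evaluating a candidate feature's gain needs no forward-backward pass over the whole chain; this is the source of the speedup.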
The rest of the paper is organized as follows. After an introduction to CRFs is given in Section 2, related work is presented in Section 3. In Section 4, we present two greedy stage-wise feature selection algorithms for linear-chain CRFs. Afterward experimental results on two real-world problems are reported in Section 5. Finally, Section 6 concludes the paper and details future work.
Section snippets
Conditional random fields
In this section, we first briefly review conditional random fields (CRFs) with emphasis on linear-chain CRFs, then describe training methods for linear-chain CRFs, and finally present labeling methods for linear-chain CRFs.
Related work
In real-world applications, solving the task of labeling or segmenting relational data with CRFs requires two steps. First, proper features for the model are extracted from the raw data; then the model with these features is learned. Afterward, the learned model can be used to label or segment relational data. However, researchers have mainly laid emphasis on enhancing the model (McCallum et al., 2003, Taskar et al., 2003, Sarawagi and Cohen, 2005, Sutton and McCallum, 2007b) and improving
The greedy stage-wise feature selection algorithm for linear-chain CRFs
In this section, we first present an embedded feature selection algorithm – greedy feature selection algorithm for linear-chain CRFs (GFSA_LCRF) and describe the process of computing a combinational polynomial in GFSA_LCRF, and then we improve GFSA_LCRF for large-scale problems, presenting the algorithm – pseudo-likelihood greedy feature selection algorithm for linear-chain CRFs (PGFSA_LCRF).
Experiments
In this section, we evaluate the proposed feature selection algorithms for CRFs on two real-world problems: FAQs segmentation, a small-scale problem, and noun phrase segmentation, a large-scale problem. All experiments were performed with our C++ implementation of first-order CRFs on a personal computer with a 1.86 GHz Core(TM) 2 processor, 2 GB of memory and the Windows XP operating system.
Conclusions and future work
This paper studies feature selection for linear-chain CRFs, assuming that sufficient features have been extracted from the data using hand-crafted feature templates, and presents two embedded feature selection algorithms for CRFs. The two methods, GFSA_LCRF and PGFSA_LCRF, are founded on the principle of iteratively selecting the feature that would maximally improve the conditional likelihood or pseudo-likelihood, respectively, if added to the existing CRF. GFSA_LCRF is better suited to small-scale problems,
Acknowledgements
The authors would like to thank the anonymous reviewers for their helpful comments that greatly improved this paper. This work was supported by the National High Technology Research and Development Program (863 Program) of China (Grant No. 2006AA01Z107), the National Natural Science Foundation of China (No. 60702062) and the Key Project of Ministry of Education of China (No. 108115).
References (30)
- Selection of relevant features and examples in machine learning. Artif. Intell. (1997).
- Wrappers for feature subset selection. Artif. Intell. (1997).
- A maximum entropy approach to natural language processing. Comput. Linguist. (1996).
- Classification and Regression Trees (1984).
- Culotta, A., Kulp, D., et al., 2005. Gene Prediction with Conditional Random Fields. Technical Report, University of...
- Das, S., 2001. Filters, wrappers and a boosting-based hybrid for feature selection. In: Proc. 18th Internat. Conf. on...
- Dietterich, T.G., 2002. Machine learning for sequential data: A review. In: Proc. Joint IAPR Internat. Workshop on...
- Dietterich, T.G., Ashenfelter, A., et al., 2004. Training conditional random fields via gradient tree boosting. In:...
- Freitag, D., McCallum, A., 2000. Information extraction with HMM structure learned by stochastic optimization. In:...
- An introduction to variable and feature selection. J. Machine Learn. Res. (2003).
- Segmenting brain tumors with conditional random fields and support vector machines. Lect. Notes Comput. Sci.