Exponentiated gradient algorithms for log-linear structured prediction

ABSTRACT
Conditional log-linear models are a commonly used method for structured prediction, so efficient learning of their parameters is an important problem. This paper describes an exponentiated gradient (EG) algorithm for training such models. EG is applied to the convex dual of the maximum-likelihood objective, which yields both sequential and parallel update algorithms; in the sequential algorithm, parameters are updated in an online fashion. We provide a convergence proof for both algorithms. Our analysis also simplifies previous results on EG for max-margin models and leads to a tighter bound on convergence rates. Experiments on a large-scale parsing task show that the proposed algorithm converges much faster than conjugate-gradient and L-BFGS approaches, in terms of both the optimization objective and test error.
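To make the core idea concrete, below is a minimal sketch of a generic EG step of the kind the abstract describes: a multiplicative update on a block of dual variables followed by renormalization. It assumes the dual variables for one training example form a distribution over that example's candidate outputs and that the gradient of the dual objective with respect to that block is available; the function and variable names (`eg_update`, `alpha`, `grad`, `eta`) are illustrative, not the paper's notation, and this is not the paper's exact derivation.

```python
import numpy as np

def eg_update(alpha, grad, eta):
    """One generic EG step on a single example's dual variables.

    alpha : (k,) nonnegative array summing to 1, a distribution over
            the example's k candidate outputs (hypothetical setup)
    grad  : (k,) gradient of the dual objective w.r.t. alpha
    eta   : learning rate
    """
    # Multiplicative update alpha_y * exp(-eta * grad_y), done in the
    # log domain for numerical stability.
    log_alpha = np.log(alpha) - eta * grad
    log_alpha -= log_alpha.max()  # shift to avoid overflow in exp
    new_alpha = np.exp(log_alpha)
    # Renormalize so the updated block stays on the probability simplex.
    return new_alpha / new_alpha.sum()
```

Under this reading, the sequential (online) variant applies such a step to one example's block of dual variables at a time while holding the others fixed, whereas the parallel variant applies the update to all blocks simultaneously.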