Extending Logistic Regression Models with Factorization Machines

Pijnenburg, Mark; Kowalczyk, Wojtek

doi:10.1007/978-3-319-60438-1_32

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10352)

Included in the following conference series:

International Symposium on Methodologies for Intelligent Systems

Abstract

Including categorical variables with many levels in a logistic regression model easily leads to a sparse design matrix. This can result in a large, ill-conditioned optimization problem that causes overfitting, extreme coefficient values and long run times. Inspired by recent developments in matrix factorization, we propose four new strategies for overcoming this problem. Each strategy uses a Factorization Machine that transforms the categorical variables with many levels into a few numeric variables, which are subsequently used in the logistic regression model. The application of Factorization Machines also allows interactions between the categorical variables with many levels to be included, often substantially increasing model accuracy. The four strategies have been tested on four data sets, demonstrating the superiority of our approach over other methods of handling categorical variables with many levels. In particular, our approach has been successfully used for developing high-quality risk models at the Netherlands Tax and Customs Administration.
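
As an illustration of the general idea only, the sketch below one-hot encodes two high-cardinality categorical variables into a sparse design matrix and then collapses each row into a single numeric value using the standard second-order Factorization Machine score. It is not a reconstruction of any of the four strategies, which the abstract does not detail. The variable names, level counts and latent dimension `k` are assumptions chosen for the example, and the parameters `w0`, `w` and `V` are random placeholders rather than fitted values.

```python
import numpy as np

def one_hot(codes, n_levels):
    """Return a dense one-hot matrix for integer category codes."""
    out = np.zeros((len(codes), n_levels))
    out[np.arange(len(codes)), codes] = 1.0
    return out

def fm_score(X, w0, w, V):
    """Second-order Factorization Machine score for each row of X.

    The pairwise term uses the identity
      sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ],
    which avoids an explicit loop over all feature pairs.
    """
    linear = X @ w
    xv = X @ V                    # shape (n_rows, k)
    x2v2 = (X ** 2) @ (V ** 2)    # shape (n_rows, k)
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2, axis=1)
    return w0 + linear + pairwise

# Hypothetical data: two categorical variables with 50 and 40 levels.
rng = np.random.default_rng(0)
n_rows, levels_a, levels_b, k = 6, 50, 40, 3
a = rng.integers(0, levels_a, size=n_rows)
b = rng.integers(0, levels_b, size=n_rows)

# Sparse design matrix: 90 columns, only two non-zero entries per row.
X = np.hstack([one_hot(a, levels_a), one_hot(b, levels_b)])

# Placeholder (untrained) parameters; in practice these would be learned.
w0 = 0.1
w = rng.normal(scale=0.1, size=X.shape[1])
V = rng.normal(scale=0.1, size=(X.shape[1], k))

# One numeric value per row, usable as an ordinary numeric predictor.
fm_feature = fm_score(X, w0, w, V)
```

With one-hot inputs, each row's score reduces to the bias, the two level weights, and the inner product of the two levels' latent vectors, so the 90 sparse columns are replaced by a single dense column that also carries the interaction between the two variables.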

Author information

Authors and Affiliations

Leiden Institute of Advanced Computer Science, Leiden, The Netherlands
Mark Pijnenburg & Wojtek Kowalczyk
Netherlands Tax and Customs Administration, Utrecht, The Netherlands
Mark Pijnenburg


Corresponding author

Correspondence to Mark Pijnenburg.

Editor information

Editors and Affiliations

Warsaw University of Technology, Warsaw, Poland
Marzena Kryszkiewicz
University of Bari Aldo Moro, Bari, Italy
Annalisa Appice
Institute of Informatics, University of Warsaw, Warsaw, Poland
Dominik Ślęzak
Faculty of Electronics & Information, Warsaw University of Technology, Warsaw, Poland
Henryk Rybinski
Institute of Mathematics, Warsaw University, Warsaw, Poland
Andrzej Skowron
Department of Computer Science, University of North Carolina at Charlotte, North Carolina, USA
Zbigniew W. Raś

About this paper

Cite this paper

Pijnenburg, M., Kowalczyk, W. (2017). Extending Logistic Regression Models with Factorization Machines. In: Kryszkiewicz, M., Appice, A., Ślęzak, D., Rybinski, H., Skowron, A., Raś, Z. (eds) Foundations of Intelligent Systems. ISMIS 2017. Lecture Notes in Computer Science, vol 10352. Springer, Cham. https://doi.org/10.1007/978-3-319-60438-1_32

DOI: https://doi.org/10.1007/978-3-319-60438-1_32
Published: 14 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60437-4
Online ISBN: 978-3-319-60438-1
eBook Packages: Computer Science, Computer Science (R0)
