From Footprint to Evidence: An Exploratory Study of Mining Social Data for Credit Scoring

Authors:
Guangming Guo (University of Science and Technology of China, Anhui, China; ORCID 0000-0003-1373-5527)
Feida Zhu (Singapore Management University, Singapore)
Enhong Chen (University of Science and Technology of China, Anhui, China)
Qi Liu (University of Science and Technology of China, Anhui, China)
Le Wu (Hefei University of Technology, Anhui, China)
Chu Guan (University of Science and Technology of China, Anhui, China)

ACM Transactions on the Web, Volume 10, Issue 4, Article No. 22, pp. 1–38. https://doi.org/10.1145/2996465

Published: 15 December 2016

Abstract

As online social networks such as Twitter and Weibo grow in popularity, user footprints are accumulating rapidly on the social web. At the same time, the question of how to leverage large-scale user-generated social media data for personal credit scoring has drawn the attention of both researchers and practitioners, and it has become a topic of great importance and growing interest in the P2P lending industry. Compared with traditional financial data, however, heterogeneous social data presents both opportunities and challenges for personal credit scoring. In this article, we seek a deep understanding of how to learn users’ credit labels from social data comprehensively and efficiently. In particular, we explore the social-data-based credit scoring problem in the microblogging setting, chosen for its open, simple, and real-time nature. To identify credit-related evidence hidden in social data, we conduct an analytical and empirical study on a large-scale dataset from Weibo, the largest and most popular tweet-style website in China. Summarizing results from the existing credit scoring literature, we first propose three social-data-based credit scoring principles as guidelines for in-depth exploration. In addition, we glean six credit-related insights from empirical observations of the testbed dataset. Based on these principles and insights, we extract prediction features mainly from three categories of users’ social data: demographics, tweets, and networks. To harness this broad range of features, we put forward a two-tier ensemble learning framework enhanced with stacking and boosting. Quantitative investigation of the extracted features shows that online social media data has good potential for discriminating good-credit users from bad-credit ones. Furthermore, we perform experiments on the real-world Weibo dataset, which consists of more than 7.3 million tweets and 200,000 users whose credit labels are known through our third-party partner. Experimental results show that (i) our approach achieves an AUC of roughly 0.625 with all the proposed social features as input, and (ii) our learning algorithm can outperform traditional credit scoring methods by as much as 17% for social-data-based personal credit scoring.
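To make the AUC figure in the abstract concrete, the following is a minimal sketch of how the area under the ROC curve can be computed from labels and scores. It uses the standard pairwise (Mann-Whitney) formulation: the AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `auc`, the convention that label 1 means a good-credit user, and the toy labels and scores are all assumptions made for illustration; none of them come from the article or its dataset.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: iterable of 0/1 values (assumed here: 1 = good credit, 0 = bad credit)
    scores: iterable of model scores, higher meaning "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, invented for this example only.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.6, 0.4, 0.2, 0.8, 0.5, 0.3, 0.1]
toy_auc = auc(toy_labels, toy_scores)
print(toy_auc)  # 10 of the 16 positive/negative pairs are ordered correctly: 0.625
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a value near 0.625 indicates a modest but real ability to rank good-credit users above bad-credit ones. The quadratic loop is chosen for readability; a sort-based rank computation would be preferable on a dataset of the size the abstract describes.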

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction
  2. Information systems applications
    1. Data mining

Published in
ACM Transactions on the Web, Volume 10, Issue 4
December 2016
169 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/3017848
Editors: Brian D. Davison (Lehigh University, USA), Marianne Winslett (University of Illinois at Urbana-Champaign)
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 15 December 2016
- Accepted: 1 September 2016
- Revised: 1 June 2016
- Received: 1 December 2015
Author Tags
P2P lending
Personal credit scoring
consumer finance
features
social data
user profiling
Qualifiers
- research-article
- Research
- Refereed