Learning from noisy label proportions for classifying online social data

Ardehaly, Ehsan Mohammady; Culotta, Aron

doi:10.1007/s13278-017-0478-6

Original Article
Published: 27 November 2017

Volume 8, article number 2, (2018)

Social Network Analysis and Mining

Abstract

Inferring latent attributes (e.g., demographics) of social media users is important to improve the accuracy and validity of social media analysis methods. While most existing approaches use either heuristics or supervised classification, recent work has shown that accurate classification models can be trained using supervision from population statistics. These learning with label proportion (LLP) models are fit on bags of instances and then applied to individual accounts. However, it is well known that many social media sites such as Twitter are not a representative sample of the population; thus, there are many sources of noise in these label proportions (e.g., sampling bias). This can in turn degrade the quality of the resulting model. In this paper, we investigate classification algorithms that use population statistical constraints such as demographics, names, and social network followers to fit classifiers to predict individual user attributes. We propose LLP methods that explicitly model the noise inherent in these label proportions. On several real and synthetic datasets, we find that combining these enhancements together can significantly reduce averaged classification error by 7%, resulting in methods that are robust to noise in the provided label proportions.

Notes

http://www.quantcast.com/.
http://www.census.gov/geo/reference/centersofpop.html.
http://www.ssa.gov/oact/babynames/.
http://www.ssa.gov/oact/NOTES/as120/LifeTables_Tbl_7.html.
http://www.quantcast.com/measure/.
We swap the training and testing sets of the original dataset because the training set had fewer instances than the testing set.
https://en.wikipedia.org/wiki/United_States_House_of_Representatives_elections,_2014.
https://en.wikipedia.org/wiki/United_States_Senate_elections,_2014.

Acknowledgements

We thank the anonymous reviewers for helpful feedback. This research was funded in part by the National Science Foundation under Grants #IIS-1526674 and #IIS-1618244. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

Author information

Authors and Affiliations

Department of Computer Science, Illinois Institute of Technology, Chicago, IL, 60616, USA
Ehsan Mohammady Ardehaly & Aron Culotta

Corresponding author

Correspondence to Aron Culotta.

Appendix: Partial derivatives of LR cost function

We use the derivative of the logistic function, i.e.,

$$\begin{aligned} \frac{\partial }{\partial \theta } \sigma (f) = \sigma (f)(1 - \sigma (f)) \frac{\partial }{\partial \theta } f \end{aligned}$$

(23)

to compute the derivative of the hypothesis:

$$\begin{aligned} \frac{\partial }{\partial \theta } h_{u, i} = h_{u, i}(1 - h_{u, i}) X_{u, i} \end{aligned}$$

(24)

Now we can compute the partial derivative of the cost function:

$$\begin{aligned} \begin{aligned} \frac{\partial }{\partial \theta } J(\varTheta )&= -\,\sum _i \left( \tilde{y_i} \frac{\partial }{\partial \theta } \log \bar{h_i} + (1 - \tilde{y_i}) \frac{\partial }{\partial \theta } \log (1 - \bar{h_i})\right) + \lambda \theta \\
&= -\,\sum _i \left( \frac{\tilde{y_i}}{\bar{h_i}} \frac{\partial \bar{h_i}}{\partial \theta } - \frac{1 - \tilde{y_i}}{1 - \bar{h_i}} \frac{\partial \bar{h_i}}{\partial \theta }\right) + \lambda \theta \\
&= \sum _i \frac{\bar{h_i} - \tilde{y_i}}{\bar{h_i}(1 - \bar{h_i})} \frac{\partial \bar{h_i}}{\partial \theta } + \lambda \theta \\
&= \sum _i \frac{\bar{h_i} - \tilde{y_i}}{\bar{h_i}(1 - \bar{h_i})} \frac{\partial }{\partial \theta } \frac{1}{|T_i|} \sum _{u \in T_i} h_{u, i} + \lambda \theta \\
&= \sum _i \frac{\bar{h_i} - \tilde{y_i}}{\bar{h_i}(1 - \bar{h_i})|T_i|} \sum _{u \in T_i} \frac{\partial }{\partial \theta } h_{u, i} + \lambda \theta \\
&= \sum _i \frac{\bar{h_i} - \tilde{y_i}}{\bar{h_i}(1 - \bar{h_i})|T_i|} \sum _{u \in T_i} h_{u, i}(1 - h_{u, i}) X_{u, i} + \lambda \theta \\
&= \sum _{u,i} \frac{(\bar{h_i} - \tilde{y_i})h_{u, i}(1 - h_{u, i})}{\bar{h_i}(1 - \bar{h_i})|T_i|} X_{u, i} + \lambda \theta \\
&= \sum _{u,i} e_{u,i} X_{u, i} + \lambda \theta \end{aligned} \end{aligned}$$

(25)

where $e_{u,i}$ is defined in Eq. 10. The partial derivatives with respect to the other variables in the LRBF model are computed similarly.
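
The derivation in Eq. 25 can be checked numerically. The sketch below is illustrative only and is not the authors' implementation. It assumes the hypothesis is $h_{u,i} = \sigma(\theta \cdot X_{u,i})$ (consistent with Eq. 24), that $\bar{h_i}$ is the mean of $h_{u,i}$ over bag $T_i$, and that the regularizer is $\frac{\lambda}{2}\Vert \theta \Vert^2$ (inferred from the $\lambda \theta$ term in the gradient). The per-instance weight $e_{u,i}$ is read off the second-to-last line of Eq. 25. The function names and the synthetic bags are hypothetical; only the Python standard library is used.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cost_and_grad(theta, bags, targets, lam):
    """Bag-level cross-entropy cost and its gradient, following Eq. 25.

    bags[i] is a list of feature vectors X_{u,i}; targets[i] is the
    label proportion y~_i supplied for bag i.
    """
    cost = 0.5 * lam * dot(theta, theta)      # assumed L2 regularizer
    grad = [lam * t for t in theta]           # its derivative, lambda * theta
    for rows, y in zip(bags, targets):
        h = [sigmoid(dot(theta, x)) for x in rows]   # h_{u,i}
        h_bar = sum(h) / len(h)                      # mean prediction of the bag
        cost -= y * math.log(h_bar) + (1.0 - y) * math.log(1.0 - h_bar)
        scale = (h_bar - y) / (h_bar * (1.0 - h_bar) * len(h))
        for h_u, x in zip(h, rows):
            e = scale * h_u * (1.0 - h_u)            # e_{u,i} from Eq. 25
            for k, x_k in enumerate(x):
                grad[k] += e * x_k
    return cost, grad


def numerical_grad(theta, bags, targets, lam, eps=1e-6):
    """Central finite differences, used only to validate the analytic gradient."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        c_up, _ = cost_and_grad(up, bags, targets, lam)
        c_down, _ = cost_and_grad(down, bags, targets, lam)
        grad.append((c_up - c_down) / (2.0 * eps))
    return grad


# Synthetic data: three bags of 3-dimensional instances (made up for the check).
random.seed(0)
bags = [[[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n)]
        for n in (5, 8, 4)]
targets = [0.3, 0.6, 0.5]
theta = [0.1, -0.2, 0.05]

_, analytic = cost_and_grad(theta, bags, targets, lam=0.1)
numeric = numerical_grad(theta, bags, targets, lam=0.1)
max_abs_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_abs_diff)
```

If the analytic and finite-difference gradients agree to within numerical tolerance, the closed form $\sum_{u,i} e_{u,i} X_{u,i} + \lambda \theta$ is consistent with the assumed cost; a gradient-based optimizer could then use `cost_and_grad` directly.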

About this article

Cite this article

Ardehaly, E.M., Culotta, A. Learning from noisy label proportions for classifying online social data. Soc. Netw. Anal. Min. 8, 2 (2018). https://doi.org/10.1007/s13278-017-0478-6

Received: 07 March 2017
Revised: 19 September 2017
Accepted: 07 November 2017
Published: 27 November 2017
DOI: https://doi.org/10.1007/s13278-017-0478-6
