Learning from noisy label proportions for classifying online social data

  • Original Article
  • Social Network Analysis and Mining

Abstract

Inferring latent attributes (e.g., demographics) of social media users is important to improve the accuracy and validity of social media analysis methods. While most existing approaches use either heuristics or supervised classification, recent work has shown that accurate classification models can be trained using supervision from population statistics. These learning with label proportion (LLP) models are fit on bags of instances and then applied to individual accounts. However, it is well known that many social media sites such as Twitter are not a representative sample of the population; thus, there are many sources of noise in these label proportions (e.g., sampling bias). This can in turn degrade the quality of the resulting model. In this paper, we investigate classification algorithms that use population statistical constraints such as demographics, names, and social network followers to fit classifiers to predict individual user attributes. We propose LLP methods that explicitly model the noise inherent in these label proportions. On several real and synthetic datasets, we find that combining these enhancements together can significantly reduce averaged classification error by 7%, resulting in methods that are robust to noise in the provided label proportions.

Notes

  1. http://www.quantcast.com/.

  2. http://www.census.gov/geo/reference/centersofpop.html.

  3. http://www.ssa.gov/oact/babynames/.

  4. http://www.ssa.gov/oact/NOTES/as120/LifeTables_Tbl_7.html.

  5. http://www.quantcast.com/measure/.

  6. We swap the training and testing sets of the original dataset because the training set had fewer instances than the testing set.

  7. https://en.wikipedia.org/wiki/United_States_House_of_Representatives_elections,_2014.

  8. https://en.wikipedia.org/wiki/United_States_Senate_elections,_2014.

Acknowledgements

We thank the anonymous reviewers for helpful feedback. This research was funded in part by the National Science Foundation under Grants #IIS-1526674 and #IIS-1618244. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

Author information

Correspondence to Aron Culotta.

Appendix: Partial derivatives of LR cost function

We use the derivative of the logistic function, i.e.,

$$\begin{aligned} \frac{\partial }{\partial \theta } \sigma (f) = \sigma (f)(1 - \sigma (f)) \frac{\partial }{\partial \theta } f \end{aligned}$$
(23)

to compute the derivative of the hypothesis as:

$$\begin{aligned} \frac{\partial }{\partial \theta } h_{u, i} = h_{u, i}(1 - h_{u, i}) X_{u, i} \end{aligned}$$
(24)

Now we can compute the partial derivative of the cost function:

$$\begin{aligned} \begin{aligned} \frac{\partial }{\partial \theta } J(\varTheta )&= -\,\sum _i \left( \tilde{y_i} \frac{\partial }{\partial \theta } \log \bar{h_i} + (1 - \tilde{y_i}) \frac{\partial }{\partial \theta } \log (1 - \bar{h_i})\right) + \lambda \theta \\&= -\,\sum _i \left( \frac{\tilde{y_i}}{\bar{h_i}} \frac{\partial \bar{h_i}}{\partial \theta } - \frac{1 - \tilde{y_i}}{1 - \bar{h_i}} \frac{\partial \bar{h_i}}{\partial \theta }\right) + \lambda \theta \\&= \sum _i \frac{\bar{h_i} - \tilde{y_i}}{\bar{h_i}(1 - \bar{h_i})} \frac{\partial \bar{h_i}}{\partial \theta } + \lambda \theta \\&= \sum _i \frac{\bar{h_i} - \tilde{y_i}}{\bar{h_i}(1 - \bar{h_i})} \frac{\partial }{\partial \theta } \frac{1}{|T_i|} \sum _{u \in T_i} h_{u, i} + \lambda \theta \\&= \sum _i \frac{\bar{h_i} - \tilde{y_i}}{\bar{h_i}(1 - \bar{h_i})|T_i|} \sum _{u \in T_i} \frac{\partial }{\partial \theta } h_{u, i} + \lambda \theta \\&= \sum _i \frac{\bar{h_i} - \tilde{y_i}}{\bar{h_i}(1 - \bar{h_i})|T_i|} \sum _{u \in T_i} h_{u, i}(1 - h_{u, i}) X_{u, i} + \lambda \theta \\&= \sum _{u,i} \frac{(\bar{h_i} - \tilde{y_i})h_{u, i}(1 - h_{u, i})}{\bar{h_i}(1 - \bar{h_i})|T_i|} X_{u, i} + \lambda \theta \\&= \sum _{u,i} e_{u,i} X_{u, i} + \lambda \theta \end{aligned} \end{aligned}$$
(25)

where \(e_{u,i}\), defined in Eq. 10, is the coefficient of \(X_{u,i}\) in the preceding line. The partial derivatives with respect to the other variables in the LRBF model are computed similarly.
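The final line of Eq. 25 can be sanity-checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the cost is the bag-level cross-entropy between the label proportion \(\tilde{y_i}\) and the mean prediction \(\bar{h_i}\), plus an L2 penalty \(\frac{\lambda }{2}\Vert \theta \Vert ^2\) (inferred from the \(\lambda \theta \) term, since the cost itself is not shown in this appendix). The function names, the toy bags, and the value of `lam` are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def llp_cost_and_grad(theta, bags, proportions, lam):
    """Bag-level cross-entropy with an assumed (lam/2)*||theta||^2 penalty.

    bags[i] holds the feature vectors X_{u,i} of bag T_i and
    proportions[i] is the label proportion y~_i for that bag.
    Returns the cost and the analytic gradient from Eq. 25.
    """
    d = len(theta)
    cost = 0.5 * lam * sum(t * t for t in theta)
    grad = [lam * t for t in theta]
    for X, y_tilde in zip(bags, proportions):
        # Per-instance predictions h_{u,i} and their bag mean h-bar_i.
        h = [sigmoid(sum(t * x for t, x in zip(theta, x_u))) for x_u in X]
        h_bar = sum(h) / len(h)
        cost -= y_tilde * math.log(h_bar) + (1 - y_tilde) * math.log(1 - h_bar)
        # Shared factor (h-bar_i - y~_i) / (h-bar_i (1 - h-bar_i) |T_i|).
        scale = (h_bar - y_tilde) / (h_bar * (1 - h_bar) * len(h))
        for h_u, x_u in zip(h, X):
            e = scale * h_u * (1 - h_u)  # coefficient of X_{u,i}
            for k in range(d):
                grad[k] += e * x_u[k]
    return cost, grad


def numeric_grad(theta, bags, proportions, lam, eps=1e-6):
    """Central finite differences of the cost, one coordinate at a time."""
    g = []
    for k in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[k] += eps
        dn[k] -= eps
        c_up, _ = llp_cost_and_grad(up, bags, proportions, lam)
        c_dn, _ = llp_cost_and_grad(dn, bags, proportions, lam)
        g.append((c_up - c_dn) / (2 * eps))
    return g


# Toy data (hypothetical): 4 bags of 5 instances with 3 features each.
rng = random.Random(0)
theta = [rng.uniform(-1, 1) for _ in range(3)]
bags = [[[rng.gauss(0, 1) for _ in range(3)] for _ in range(5)] for _ in range(4)]
proportions = [0.2, 0.5, 0.7, 0.9]

_, analytic = llp_cost_and_grad(theta, bags, proportions, lam=0.1)
numeric = numeric_grad(theta, bags, proportions, lam=0.1)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest analytic vs. numeric gap:", max_diff)
```

If the derivation holds under these assumptions, the printed gap should be close to zero, limited only by finite-difference and floating-point error.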

Cite this article

Ardehaly, E.M., Culotta, A. Learning from noisy label proportions for classifying online social data. Soc. Netw. Anal. Min. 8, 2 (2018). https://doi.org/10.1007/s13278-017-0478-6

  • DOI: https://doi.org/10.1007/s13278-017-0478-6
