research-article

Exploratory Training: When Annotators Learn About Data

Published: 20 June 2023

Abstract

Data systems often present examples to users and solicit labels in order to learn a target model, a process known as active learning. However, because the underlying data is complex, users may not initially understand the target model well enough to label examples correctly. For example, a user who is training a model to detect noisy or abnormal values may not know exactly what typical, clean values in the data look like. Users may improve their knowledge of the data and the target model as they observe examples during training, and as they learn, they may revise their labeling strategies. Current systems assume that users always provide correct labels, with at most a small, fixed chance of annotation mistakes. If the user revises their beliefs during training, however, such mistakes become significant and non-stationary. As a result, current systems consume incorrect labels and may learn inaccurate models. In this paper, we build theoretical underpinnings and design algorithms for systems that collaborate with users to learn the target model accurately and efficiently. At the core of our proposal is a game-theoretic framework that models the joint learning of the user and the system and aims to reach a desirable, eventually stable state in which both share the same belief about the target model. We extensively evaluate our system using user studies over various real-world datasets and show that our algorithms produce accurate results with fewer interactions than existing methods.
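
To make the notion of non-stationary annotation mistakes concrete, the following is a minimal illustrative sketch and not the paper's algorithm. It assumes a hypothetical annotator who holds a Beta-distributed belief that some candidate rule holds in the data and who labels a rule-violating tuple as "dirty" only when that belief exceeds 0.5. The class `BetaBelief`, the functions `label_violation` and `simulate`, the prior Beta(1, 3), and the 0.5 threshold are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Hypothetical annotator belief that a candidate rule holds: Beta(alpha, beta)."""
    alpha: float = 1.0
    beta: float = 1.0

    def mean(self) -> float:
        # Posterior mean of the Beta distribution.
        return self.alpha / (self.alpha + self.beta)

    def update(self, supports_rule: bool) -> None:
        # Conjugate Beta-Bernoulli update: one pseudo-count per observed example.
        if supports_rule:
            self.alpha += 1.0
        else:
            self.beta += 1.0


def label_violation(belief: BetaBelief) -> str:
    """Label a rule-violating tuple: 'dirty' only if the annotator trusts the rule."""
    return "dirty" if belief.mean() > 0.5 else "clean"


def simulate(observations, true_label="dirty", prior=(1.0, 3.0)):
    """Return the labels given over time and whether each one was a mistake."""
    belief = BetaBelief(*prior)
    labels = []
    for supports_rule in observations:
        labels.append(label_violation(belief))  # label with the current belief
        belief.update(supports_rule)            # then learn from the example
    mistakes = [label != true_label for label in labels]
    return labels, mistakes


# Assumed scenario: the rule really holds, and every observed example supports it.
observations = [True] * 8
labels, mistakes = simulate(observations)

half = len(mistakes) // 2
early_error = sum(mistakes[:half]) / half
late_error = sum(mistakes[half:]) / (len(mistakes) - half)
print(labels)
print(early_error, late_error)
```

In this toy run the annotator's error rate is high in the first half of the session and drops to zero in the second half, so a system that assumes one fixed noise rate for the whole session would misjudge both the early and the late labels.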

Supplemental Material

Exploratory Training - When Annotators Learn About Data_final.mp4 (mp4, 132.2 MB)

Published in

Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 2
June 2023
2310 pages
EISSN: 2836-6573
DOI: 10.1145/3605748

Copyright © 2023 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States
