research-article

Exploratory Training: When Annotators Learn About Data

Authors:
Rajesh Shrestha, Oregon State University, Corvallis, OR, USA (ORCID: 0009-0005-4163-4071)
Omeed Habibelahian, Oregon State University, Corvallis, OR, USA (ORCID: 0009-0005-5093-0243)
Arash Termehchy, Oregon State University, Corvallis, OR, USA (ORCID: 0009-0007-2213-6303)
Paolo Papotti, Eurecom, Biot, France (ORCID: 0000-0003-0651-4128)

Proceedings of the ACM on Management of Data, Volume 1, Issue 2, Article No. 135, pp. 1–25. https://doi.org/10.1145/3589280

Published: 20 June 2023


Abstract

Data systems often present examples and solicit labels from users to learn a target model, i.e., active learning. However, due to the complexity of the underlying data, users may not initially have a perfect understanding of the effective model and may not know the correct labels. For example, a user who is training a model to detect noisy or abnormal values may not fully know the properties of typical, clean values in the data. Users may improve their knowledge of the data and the target model as they observe examples during training, and as they gradually learn, they may revise their labeling strategies. Current systems assume that users always provide correct labels, with at most a small, fixed chance of annotation mistakes. If users revise their beliefs during training, however, such mistakes become significant and non-stationary. Hence, current systems consume incorrect labels and may learn inaccurate models. In this paper, we build theoretical underpinnings and design algorithms to develop systems that collaborate with users to learn the target model accurately and efficiently. At the core of our proposal, a game-theoretic framework models the joint learning of user and system so that they reach a desirable stable state in which both share the same belief about the target model. We extensively evaluate our system using user studies over various real-world datasets and show that our algorithms lead to accurate results with fewer interactions than existing methods.
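The claim that annotation mistakes become non-stationary when the annotator is still learning can be illustrated with a toy simulation. The sketch below is not the paper's framework or algorithm. It assumes a hypothetical annotator who holds a Beta belief that a single candidate rule is valid, flags rule violations as errors only once that belief exceeds 0.5, and updates the belief after every example. The class `Annotator`, the prior Beta(1, 9), the threshold, and the example stream are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Annotator:
    """Toy annotator with a Beta(alpha, beta) belief that a candidate rule is valid."""
    alpha: float = 1.0
    beta: float = 9.0  # hypothetical prior: initially skeptical about the rule

    def belief(self) -> float:
        # Posterior mean of the Beta distribution.
        return self.alpha / (self.alpha + self.beta)

    def observe(self, satisfies_rule: bool) -> None:
        # Conjugate Beta-Bernoulli update from one observed example.
        if satisfies_rule:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def label(self, violates_rule: bool) -> bool:
        # Flag a violation as an error only if the annotator currently trusts the rule.
        return violates_rule and self.belief() > 0.5


def simulate(num_examples: int = 40) -> list[bool]:
    """Return, per example, whether the given label disagreed with the ground truth."""
    annotator = Annotator()
    mistakes = []
    for i in range(num_examples):
        violates_rule = (i % 5 == 4)  # hypothetical stream: every fifth value is dirty
        true_label = violates_rule    # assumption: the rule is valid, so violations are errors
        given_label = annotator.label(violates_rule)
        mistakes.append(given_label != true_label)
        # The annotator learns from the example only after labeling it.
        annotator.observe(satisfies_rule=not violates_rule)
    return mistakes


if __name__ == "__main__":
    mistakes = simulate()
    half = len(mistakes) // 2
    print("mistakes in first half: ", sum(mistakes[:half]))
    print("mistakes in second half:", sum(mistakes[half:]))
```

In this toy setting the mislabeled examples cluster at the start of the session and disappear once the annotator's belief crosses the threshold, so a learner that assumes a fixed error rate would misjudge both the early and the late labels.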

Supplemental Material

Exploratory Training - When Annotators Learn About Data_final.mp4 (mp4, 132.2 MB)

Index Terms

1. Human-centered computing
  1. Human computer interaction (HCI)
    1. HCI design and evaluation methods
      1. User models
2. Information systems
  1. Information systems applications
    1. Data mining
      1. Data cleaning
    2. Decision support systems
      1. Data analytics

Published in
Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 2
June 2023
2310 pages
EISSN: 2836-6573
DOI: 10.1145/3605748
Editor: Divyakant Agrawal, UC Santa Barbara, United States
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Bayesian model
active learning
data exploration
functional dependencies
human in loop
human learning
hypothesis-testing model