Abstract
Data systems often present examples to users and solicit labels in order to learn a target model, a process known as active learning. However, because the underlying data is complex, users may not initially understand the effective model well and may not know the correct labels. For example, a user who is training a model to detect noisy or abnormal values may not know precisely what typical, clean values in the data look like. Users may improve their knowledge of the data and the target model as they observe examples during training, and as they learn, they may revise their labeling strategies. Current systems assume that users always provide correct labels, with at most a small, fixed chance of annotation mistakes. If trainers revise their beliefs during training, however, such mistakes become significant and non-stationary. Hence, current systems consume incorrect labels and may learn inaccurate models. In this paper, we build theoretical underpinnings and design algorithms for systems that collaborate with users to learn the target model accurately and efficiently. At the core of our proposal, a game-theoretic framework models the joint learning of user and system so that they reach a desirable, eventually stable state in which both share the same belief about the target model. We extensively evaluate our system using user studies over various real-world datasets and show that our algorithms produce accurate results with fewer interactions than existing methods.
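To make the notion of non-stationary labeling concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a hypothetical annotator who holds a Beta-distributed belief about whether a candidate data-quality rule is valid, and who marks a rule-violating tuple as "dirty" only while the mean of that belief exceeds 0.5. The names `BetaBelief`, `label_violation`, and `simulate`, the prior, and the threshold are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class BetaBelief:
    """Hypothetical annotator belief that a candidate rule is valid."""
    alpha: float = 1.0  # pseudo-count of evidence supporting the rule
    beta: float = 1.0   # pseudo-count of evidence against the rule

    def update(self, supports: bool) -> None:
        # Conjugate Beta-Bernoulli update after observing one example.
        if supports:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def label_violation(belief: BetaBelief, threshold: float = 0.5) -> str:
    # Assumption: a tuple that violates the rule is called "dirty" only
    # if the annotator currently believes the rule (mean > threshold).
    return "dirty" if belief.mean > threshold else "clean"


def simulate(observations, prior=(1.0, 3.0)):
    """Label one rule-violating tuple per round, then observe evidence."""
    belief = BetaBelief(*prior)
    labels = []
    for supports in observations:
        labels.append(label_violation(belief))  # label with current belief
        belief.update(supports)                 # then learn from the example
    return labels, belief


# True = the observed example supports the rule, False = it contradicts it.
observations = [True, True, True, False, True, True]
labels, final_belief = simulate(observations)
print(labels)
print(final_belief.mean)
```

In this sketch the same kind of tuple receives different labels in different rounds because the annotator's belief moves, so the disagreement with the eventual labeling is not a fixed-rate noise process. A system that assumes stationary noise would treat the early and late labels as equally reliable.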
Index Terms
- Exploratory Training: When Annonators Learn About Data