Artifacts Available / v1.1

Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V

Published: 01 February 2023
Abstract

In multi-user environments where data science and analysis are collaborative, multiple versions of the same datasets are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of such changes has remained under-explored. In this work, we introduce Explain-Da-V, a framework that aims to explain the changes between two given dataset versions. Explain-Da-V expresses its explanations as data transformations. We further introduce a set of measures that evaluate the validity, generalizability, and explainability of these explanations. We empirically show, using an adapted existing benchmark and a newly created benchmark, that Explain-Da-V generates better explanations than existing data transformation synthesis methods.
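To make the idea of "explaining a change with a data transformation" concrete, the following is a minimal sketch and not the Explain-Da-V algorithm. It assumes two hypothetical versions of a small table whose rows are aligned by position, a hypothetical set of three candidate arithmetic transformations, and a simplified stand-in for validity (the fraction of rows that a candidate reproduces exactly). The abstract does not define the paper's measures, so all names and data here are illustrative assumptions.

```python
from itertools import permutations
import math

# Two hypothetical versions of the same dataset; version 2 adds "total".
# Assumption: rows in both versions are aligned by position.
v1 = [
    {"price": 10.0, "qty": 2.0},
    {"price": 4.0, "qty": 5.0},
    {"price": 7.5, "qty": 4.0},
]
v2 = [dict(row, total=row["price"] * row["qty"]) for row in v1]

# Hypothetical candidate transformations over ordered pairs of old columns.
CANDIDATES = {
    "{a} + {b}": lambda a, b: a + b,
    "{a} - {b}": lambda a, b: a - b,
    "{a} * {b}": lambda a, b: a * b,
}


def validity(fn, a, b, target, old, new):
    """Fraction of rows where fn(old[a], old[b]) reproduces new[target]."""
    hits = sum(
        math.isclose(fn(o[a], o[b]), n[target])
        for o, n in zip(old, new)
    )
    return hits / len(old)


def explain_new_column(old, new, target):
    """Return (expression, validity) of the best-scoring candidate."""
    columns = sorted(old[0])
    best = ("<no explanation>", 0.0)
    for a, b in permutations(columns, 2):
        for template, fn in CANDIDATES.items():
            score = validity(fn, a, b, target, old, new)
            if score > best[1]:  # keep the first candidate on ties
                best = (f"{target} = " + template.format(a=a, b=b), score)
    return best


explanation, score = explain_new_column(v1, v2, "total")
print(explanation, score)
```

In this toy setting the search returns a readable expression for the added column together with its score. A real system would need a much richer transformation space and would also have to weigh how well an explanation generalizes and how easy it is to understand, which this sketch does not attempt.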

