research-article

Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V

Authors:
Roee Shraga, Northeastern University, Boston, MA, USA
Renée J. Miller, Northeastern University, Boston, MA, USA

Proceedings of the VLDB Endowment, Volume 16, Issue 6, pp. 1587–1600. https://doi.org/10.14778/3583140.3583169

Published: 01 February 2023

Abstract

In multi-user environments in which data science and analysis is collaborative, multiple versions of the same datasets are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of such changes has remained under-explored. In this work, we introduce Explain-Da-V, a framework aiming to explain changes between two given dataset versions. Explain-Da-V generates explanations that use data transformations to explain changes. We further introduce a set of measures that evaluate the validity, generalizability, and explainability of these explanations. We empirically show, using an adapted existing benchmark and a newly created benchmark, that Explain-Da-V generates better explanations than existing data transformation synthesis methods.
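
The idea of explaining a change with a data transformation can be illustrated with a minimal sketch. It is not the Explain-Da-V implementation: the toy dataset, the candidate transformations, and the `fraction_reproduced` score are all hypothetical, and the score is only a simple stand-in for a validity-style check, not the paper's measure. The sketch also assumes the rows of both versions are aligned by position.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]

# Two hypothetical versions of a tiny dataset; the newer one adds a "bmi" column.
v1: List[Row] = [
    {"height_m": 1.60, "weight_kg": 64.0},
    {"height_m": 1.80, "weight_kg": 81.0},
    {"height_m": 2.00, "weight_kg": 100.0},
]
v2: List[Row] = [
    {**row, "bmi": row["weight_kg"] / row["height_m"] ** 2} for row in v1
]

# Hypothetical candidate explanations: each maps an old row to a value
# for the new column.
candidates: Dict[str, Callable[[Row], float]] = {
    "weight_kg / height_m": lambda r: r["weight_kg"] / r["height_m"],
    "weight_kg / height_m ** 2": lambda r: r["weight_kg"] / r["height_m"] ** 2,
    "weight_kg - height_m": lambda r: r["weight_kg"] - r["height_m"],
}


def fraction_reproduced(
    transform: Callable[[Row], float],
    old: List[Row],
    new: List[Row],
    target: str,
    tol: float = 1e-9,
) -> float:
    """Share of position-aligned rows whose new value the transform reproduces."""
    hits = sum(
        abs(transform(o) - n[target]) <= tol for o, n in zip(old, new)
    )
    return hits / len(old)


# Score every candidate and keep the one that reproduces the most rows.
scores = {
    name: fraction_reproduced(fn, v1, v2, "bmi")
    for name, fn in candidates.items()
}
best = max(scores, key=scores.get)
print(best, scores[best])
```

In this toy setting the candidate that reproduces every row is reported as the explanation of the new column. A real system would also have to search for candidates and weigh how general and how readable each one is, which this sketch does not attempt.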

Published in
Proceedings of the VLDB Endowment, Volume 16, Issue 6
February 2023
393 pages
ISSN: 2150-8097
Editors: Georgia Koutrika (Athena Research Center), Jun Yang (Duke University)
Publisher: VLDB Endowment

Badges
- Artifacts Available / v1.1