Abstract
In multi-user environments where data science and analysis are collaborative, multiple versions of the same dataset are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of the changes between versions remains under-explored. In this work, we introduce Explain-Da-V, a framework that aims to explain the changes between two given dataset versions. Explain-Da-V generates explanations expressed as data transformations. We further introduce a set of measures that evaluate the validity, generalizability, and explainability of these explanations. Using an adapted existing benchmark and a newly created benchmark, we show empirically that Explain-Da-V generates better explanations than existing data transformation synthesis methods.
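To make the idea of explaining a change with a data transformation concrete, the following minimal sketch checks whether a candidate transformation reproduces a column that appears only in the newer of two dataset versions. It is an illustration under stated assumptions, not the Explain-Da-V implementation: the toy data, the `candidate` function, and the simplified `validity` score (the fraction of rows reproduced) are all hypothetical and may differ from the measures defined in the paper.

```python
# Hypothetical illustration only; not the Explain-Da-V implementation.
# Two versions of a toy dataset: the new version adds a "height_m" column.
old_version = [
    {"name": "a", "height_cm": 180.0},
    {"name": "b", "height_cm": 165.0},
    {"name": "c", "height_cm": 172.0},
]
new_version = [
    {"name": "a", "height_cm": 180.0, "height_m": 1.80},
    {"name": "b", "height_cm": 165.0, "height_m": 1.65},
    {"name": "c", "height_cm": 172.0, "height_m": 1.72},
]


def candidate(row):
    """Candidate explanation (assumed): height_m = height_cm / 100."""
    return row["height_cm"] / 100


def validity(old_rows, new_rows, transform, target, tol=1e-9):
    """Simplified, assumed score: the fraction of rows for which applying
    the transform to the old row reproduces the target value in the new row.
    Rows are assumed to be aligned by position."""
    hits = sum(
        abs(transform(old) - new[target]) <= tol
        for old, new in zip(old_rows, new_rows)
    )
    return hits / len(old_rows)


score = validity(old_version, new_version, candidate, "height_m")
print(f"candidate reproduces {score:.0%} of the new column")
```

A transformation that reproduces every row of the new column would be a fully valid explanation in this simplified sense; whether it is also general and easy to understand is a separate question that this sketch does not address.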