Abstract
In this paper, we tackle the challenging problem of Shapley value computation in data markets in a novel setting: data assemblage tasks with binary utility functions among data owners. By modeling these scenarios as cooperative simple games, we leverage pivotal probabilities to transform the computation into a problem of counting beneficiaries. Moreover, we observe that the Shapley values can be computed from subsets of minimal syntheses within the inclusion-exclusion framework of combinatorics. Based on this insight, we develop a game decomposition approach that uses techniques from Boolean function decomposition into disjunctive normal form. A notable property of our method is that its time complexity depends only on the data owners participating in those minimal syntheses, rather than on all data owners. Extensive experiments on real data sets demonstrate a significant efficiency improvement in computing the Shapley values for data assemblage tasks modeled as simple games.
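To make the inclusion-exclusion idea concrete, the following is a minimal sketch and not the paper's algorithm. It assumes that a coalition has utility 1 exactly when it contains at least one minimal synthesis, and it relies on two standard facts: such a game is an inclusion-exclusion sum of unanimity games over unions of minimal syntheses, and a unanimity game on a set S gives each member of S a Shapley value of 1/|S|. The function names `shapley_inclusion_exclusion` and `shapley_brute_force` are hypothetical and chosen for illustration only.

```python
from fractions import Fraction
from itertools import combinations, permutations


def shapley_inclusion_exclusion(minimal_syntheses):
    """Shapley values of a simple game given by its minimal syntheses.

    Assumption: a coalition wins (utility 1) iff it contains at least one
    of the given sets. Only players that appear in some minimal synthesis
    are returned; every other player has Shapley value 0.
    """
    sets = [frozenset(m) for m in minimal_syntheses]
    players = sorted(set().union(*sets))
    values = {p: Fraction(0) for p in players}
    # Enumerate every nonempty family of minimal syntheses.
    for r in range(1, len(sets) + 1):
        sign = 1 if r % 2 == 1 else -1  # inclusion-exclusion sign
        for family in combinations(sets, r):
            union = frozenset().union(*family)
            # Unanimity game on `union`: each member receives 1/|union|.
            share = Fraction(sign, len(union))
            for p in union:
                values[p] += share
    return values


def shapley_brute_force(players, minimal_syntheses):
    """Reference check: fraction of orderings in which each player is pivotal."""
    sets = [frozenset(m) for m in minimal_syntheses]
    counts = {p: 0 for p in players}
    total = 0
    for order in permutations(players):
        total += 1
        coalition = set()
        for p in order:
            coalition.add(p)
            if any(m <= coalition for m in sets):
                counts[p] += 1  # p turns the coalition from losing to winning
                break
    return {p: Fraction(c, total) for p, c in counts.items()}


if __name__ == "__main__":
    syntheses = [{"a", "b"}, {"a", "c"}]
    print(shapley_inclusion_exclusion(syntheses))
    # Player "d" belongs to no minimal synthesis and should receive 0.
    print(shapley_brute_force(["a", "b", "c", "d"], syntheses))
```

The enumeration above is exponential in the number of minimal syntheses, so it only illustrates the property stated in the abstract: the work depends on the owners who appear in some minimal synthesis, not on the total number of owners.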
Index Terms
- Fast Shapley Value Computation in Data Assemblage Tasks as Cooperative Simple Games