ABSTRACT
As a decentralized training approach, federated learning enables multiple organizations to jointly train a model without exposing their private data. This work investigates vertical federated learning (VFL), which addresses scenarios where the collaborating organizations share the same set of users but hold different features, and only one party owns the labels. Although VFL performs well, practitioners often face uncertainty when preparing opaque internal and external features and samples for the VFL training phase. Moreover, to balance prediction accuracy against the resource cost of model inference, practitioners need to know which subset of prediction instances genuinely requires invoking the VFL model. To this end, we co-design the VFL modeling process by proposing an interactive real-time visualization system, VFLens, that helps practitioners with feature engineering, sample selection, and inference. A usage scenario, a quantitative experiment, and expert feedback suggest that VFLens helps practitioners improve VFL efficiency at lower cost and with greater confidence.
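The vertically partitioned setting the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: two hypothetical parties share some users but hold disjoint feature sets, and only party A owns the labels. In a real VFL deployment, sample alignment is performed with a private set intersection (PSI) protocol so that neither party reveals its full user list; a plain set intersection stands in for PSI here.

```python
# Minimal sketch of the VFL data setup: same users, different
# features, labels held by one party only.  All names and values
# are illustrative assumptions.

party_a = {  # user_id -> (features held by A, label)
    "u1": ([0.2, 1.5], 1),
    "u2": ([0.7, 0.3], 0),
    "u3": ([0.1, 0.9], 1),
}
party_b = {  # user_id -> features held by B (no labels)
    "u2": [3.1, 0.8],
    "u3": [1.2, 2.4],
    "u4": [0.5, 0.6],
}

# Sample alignment: only users present on both sides enter training.
# In practice this intersection is computed privately via PSI.
shared_ids = sorted(party_a.keys() & party_b.keys())

# Each aligned training row spans A's and B's feature slices;
# the labels never leave party A.
rows = [party_a[u][0] + party_b[u] for u in shared_ids]
labels = [party_a[u][1] for u in shared_ids]

print(shared_ids)  # ['u2', 'u3']
print(rows)        # [[0.7, 0.3, 3.1, 0.8], [0.1, 0.9, 1.2, 2.4]]
print(labels)      # [0, 1]
```

During training, each party computes over its own feature slice locally and exchanges only protected intermediate results, which is what makes the feature- and sample-preparation questions the abstract raises nontrivial: neither side can directly inspect the other's columns.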
VFLens: Co-design the Modeling Process for Efficient Vertical Federated Learning via Visualization