DOI: 10.1145/3375627.3375830
Research Article
Open Access

Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods

Published: 07 February 2020

ABSTRACT

As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanation techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution still remain biased, but the post hoc explanations of the scaffolded classifier look innocuous. Using extensive evaluation with multiple real-world datasets (including COMPAS), we demonstrate how extremely biased (racist) classifiers crafted by our framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations which do not reflect the underlying biases.
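
To make the attack concrete, below is a minimal, hypothetical Python sketch of how such a scaffold could be assembled; it is not the authors' released code. It assumes the adversary has a biased model, an innocuous model that relies only on an uncorrelated feature, and a sample of the perturbation-style inputs that explainers like LIME and SHAP generate. The names ScaffoldedClassifier, biased_model, and innocuous_model are illustrative.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

class ScaffoldedClassifier:
    """Sketch of the scaffolding idea: route in-distribution inputs to a
    biased model and perturbation-style (off-distribution) inputs to an
    innocuous model, so perturbation-based explainers mostly see the latter."""

    def __init__(self, biased_model, innocuous_model):
        self.biased_model = biased_model        # e.g. depends on a sensitive feature
        self.innocuous_model = innocuous_model  # e.g. depends on an uncorrelated feature
        self.ood_detector = RandomForestClassifier(n_estimators=100)

    def fit(self, X_real, X_perturbed):
        # Train a detector that separates genuine data points (label 1)
        # from explainer-style perturbed samples (label 0).
        X = np.vstack([X_real, X_perturbed])
        y = np.concatenate([np.ones(len(X_real)), np.zeros(len(X_perturbed))])
        self.ood_detector.fit(X, y)
        return self

    def predict(self, X):
        # Genuine inputs get the biased prediction; off-distribution inputs
        # get the innocuous one, which is what the explainer ends up explaining.
        is_real = self.ood_detector.predict(X).astype(bool)
        preds = np.empty(len(X), dtype=int)
        if is_real.any():
            preds[is_real] = self.biased_model.predict(X[is_real])
        if (~is_real).any():
            preds[~is_real] = self.innocuous_model.predict(X[~is_real])
        return preds

In this sketch, queries made by perturbation-based explainers fall largely off the data distribution and are answered by the innocuous model, while real inputs still receive the biased model's predictions, consistent with the behavior the abstract describes.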


Published in
AIES '20: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society
February 2020, 439 pages
ISBN: 9781450371100
DOI: 10.1145/3375627

          Copyright © 2020 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rate: 61 of 162 submissions, 38%
