ABSTRACT
As machine learning black boxes are increasingly deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanation techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution remain biased, but the post hoc explanations of the scaffolded classifier look innocuous. Through extensive evaluation on multiple real-world datasets (including COMPAS), we demonstrate how extremely biased (racist) classifiers crafted by our framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations that do not reflect the underlying biases.
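To make the scaffolding idea concrete, the sketch below shows one plausible realization, under the assumption that the scaffold distinguishes genuine inputs from the synthetic perturbations that LIME and SHAP generate when querying a model. All names here (`ScaffoldedClassifier`, `biased_model`, `innocuous_model`, `perturb_fn`) are hypothetical and illustrative; this is not the exact implementation from the paper.

```python
# Illustrative sketch of a scaffolded classifier. An out-of-distribution (OOD)
# detector routes perturbation-style queries to an innocuous model, while real
# inputs still receive the biased model's predictions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier


class ScaffoldedClassifier:
    def __init__(self, biased_model, innocuous_model, ood_detector=None):
        self.biased_model = biased_model        # e.g. depends on a sensitive feature
        self.innocuous_model = innocuous_model  # e.g. depends on uncorrelated features
        self.ood_detector = ood_detector or RandomForestClassifier()

    def fit_detector(self, X_real, perturb_fn, n_perturbed=10000):
        """Train the detector to separate real data (label 1) from synthetic
        perturbations resembling those produced by LIME/SHAP (label 0)."""
        X_perturbed = perturb_fn(X_real, n_perturbed)
        X = np.vstack([X_real, X_perturbed])
        y = np.concatenate([np.ones(len(X_real)), np.zeros(len(X_perturbed))])
        self.ood_detector.fit(X, y)
        return self

    def predict(self, X):
        X = np.asarray(X)
        in_dist = self.ood_detector.predict(X).astype(bool)
        preds = np.empty(len(X), dtype=int)
        if in_dist.any():
            # Real inputs: the biased model's predictions are returned unchanged.
            preds[in_dist] = self.biased_model.predict(X[in_dist])
        if (~in_dist).any():
            # Perturbed (off-distribution) queries: answered by the innocuous model.
            preds[~in_dist] = self.innocuous_model.predict(X[~in_dist])
        return preds
```

Because perturbation-based explainers query the model mostly on synthetic, off-manifold points, those queries are answered by the innocuous model, so the resulting feature attributions highlight innocuous features even though the scaffold's predictions on real data remain those of the biased classifier.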