Making Corgis Important for Honeycomb Classification: Adversarial Attacks on Concept-based Explainability Tools

Brown, Davis; Kvinge, Henry

arXiv:2110.07120 (cs)

[Submitted on 14 Oct 2021 (v1), last revised 26 Jul 2022 (this version, v2)]

Abstract: Methods for model explainability have become increasingly critical for testing the fairness and soundness of deep learning. Concept-based interpretability techniques, which use a small set of human-interpretable concept exemplars to measure the influence of a concept on a model's internal representation of its input, are an important thread in this line of research. In this work we show that these explainability methods can suffer from the same vulnerability to adversarial attacks as the models they are meant to analyze. We demonstrate this phenomenon on two well-known concept-based interpretability methods: TCAV and faceted feature visualization. We show that by carefully perturbing the examples of the concept under investigation, we can radically change the output of the interpretability method. The attacks that we propose can induce either positive interpretations (for example, that polka dots are an important concept for a model when classifying zebras) or negative interpretations (for example, that stripes are not an important factor in identifying images of a zebra). Our work highlights the fact that in safety-critical applications, there is a need for security around not only the machine learning pipeline but also the model interpretation process.
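
As a rough illustration of why a TCAV-style score depends so heavily on the concept examples, the following toy sketch may help. It is not the authors' attack. Every name and number in it is a hypothetical stand-in: the activations and gradients are synthetic, the concept activation vector (CAV) is a difference of means rather than the normal of a trained linear classifier, and the "attack" shifts concept activations directly by a large amount instead of applying small perturbations to the concept images.

```python
import numpy as np

def compute_cav(concept_acts, random_acts):
    # Toy CAV: unit vector from the random-example mean to the concept mean.
    # Assumption: TCAV proper uses the normal of a trained linear classifier;
    # the difference of means is a simplification for this sketch.
    direction = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def tcav_score(class_grads, cav):
    # Fraction of class inputs whose logit gradient has a positive
    # directional derivative along the CAV.
    return float(np.mean(class_grads @ cav > 0))

rng = np.random.default_rng(0)
dim = 8                      # hypothetical activation dimension
e0 = np.eye(dim)[0]          # direction the class logit is sensitive to

# Synthetic gradients of the class logit w.r.t. layer activations.
class_grads = e0 + 0.1 * rng.normal(size=(200, dim))

# Synthetic activations: random counterexamples and a concept whose
# activations initially point against the class gradient.
random_acts = rng.normal(size=(50, dim))
concept_acts = rng.normal(size=(50, dim)) - 3.0 * e0

clean_score = tcav_score(class_grads, compute_cav(concept_acts, random_acts))

# Stand-in for the attack: move the concept activations along e0 so that
# the resulting CAV aligns with the class gradient.
attacked_acts = concept_acts + 6.0 * e0
attacked_score = tcav_score(class_grads, compute_cav(attacked_acts, random_acts))

print(f"clean TCAV score: {clean_score:.2f}, attacked: {attacked_score:.2f}")
```

In this toy setting the score is determined entirely by the direction of the CAV, and the CAV is determined entirely by the concept examples, so whoever controls those examples controls the reported importance of the concept.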

Comments:	AdvML Frontiers 2022 @ ICML 2022 workshop
Subjects:	Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2110.07120 [cs.LG]
	(or arXiv:2110.07120v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2110.07120

Submission history

From: Davis Brown
[v1] Thu, 14 Oct 2021 02:12:33 UTC (13,876 KB)
[v2] Tue, 26 Jul 2022 13:23:56 UTC (14,354 KB)
