Searching for explanations of black-box classifiers in the space of queries

Tracking #: 3194-4408

Authors: 
Jason Liartis
Edmund Dervakos
Orfeas Menis-Mastromichalakis
Alexandros Chortaras
Giorgos Stamou

Responsible editor: 
Guest Editors Ontologies in XAI

Submission type: 
Full Paper
Abstract: 
Deep learning models have achieved impressive performance in various tasks, but they are usually opaque with regard to their complex inner operation, obscuring the reasons behind their decisions. This opacity raises ethical and legal concerns about the real-life use of such models, especially in critical domains such as medicine, and has led to the emergence of the eXplainable Artificial Intelligence (XAI) field of research, which aims to make the operation of opaque AI systems more comprehensible to humans. The problem of explaining a black-box classifier is often approached by feeding it data and observing its behaviour. In this work, we feed the classifier with data that are part of a knowledge graph and describe its behaviour with rules expressed in the terminology of the knowledge graph, which is understandable by humans. We first investigate the problem theoretically to provide guarantees for the extracted rules, and then study the relation between "explanation rules for a specific class" and "semantic queries collecting from the knowledge graph the instances that the black-box classifier assigns to this specific class". We thus approach the problem of extracting explanation rules as a semantic query reverse engineering problem. We develop algorithms for solving this inverse problem as a heuristic search in the space of semantic queries, evaluate the proposed algorithms on four simulated use cases, and discuss the results.

Decision/Status: 
Reject (Two Strikes)

Solicited Reviews:
Review #1
By Rafael Penaloza submitted on 07/Sep/2022
Suggestion:
Major Revision
Review Comment:

This is a revised version of a submission that I had previously reviewed. After reading the new version and the authors' comments, I am still not convinced by the approach. To be more precise, I believe that the authors move the issues behind black-box explainability to a different level, but without really addressing them.
An important element of the submitted framework is a symbolic labeling (by experts) of sub-symbolic data, which is supposed to allow understanding the reasons for a decision. These labels are not used by the black-box classifier, but only by the rule generator. As mentioned in my original review, the authors make a huge leap of faith in assuming that the features represented in the labelling have anything to do with the features that the classifier is using.
Using the authors' same COVID example, it could for instance happen that all samples of "sore throat" are short, and the classifier uses the length of the audio as an important feature to decide to seek medical advice. Then, the approach would produce a rule saying "sore throat" implies "seek medical advice", although this is not what the classifier is doing. Of course this is an artificial and stupid example, but it should get the point across.
The issue is greater if there are several combinations of features, as the reverse engineering process tries to find the relevant labels. The number of samples that fall into each combination gets smaller and smaller, and hence it becomes easier to derive that they form a rule.
The point is that there is no formal relationship between the classifier and the labels, and hence no guarantee (not even an approximate one) that these labels explain the answer in any way.
Without this benefit, the approach falls into a setting of symbolic explainability for which there are still challenges, but to which the approach does not add much.
The authors need to do a better job of explaining why the symbolic/sub-symbolic connection should work, and provide some guarantees for this approach to be useful in the context of XAI. Otherwise, it has the same drawbacks as any black-box explanation approach.

Review #2
Anonymous submitted on 30/Dec/2022
Suggestion:
Major Revision
Review Comment:

I thank the authors for the detailed response.

The authors' response clarifies several doubts I previously had about the paper. In particular, it is now clear to me what "understandable" and "intuitive" stand for. I still see as an issue the possible mismatch between what a black-box (an accurate one) really bases its decisions on and the annotations used to generate the rules. It can happen that they are completely unrelated. However, despite this, I believe that the proposed framework represents a useful tool and constitutes a valuable contribution to the XAI research area.

Regarding the experiments, most of my previous criticisms were based on a misunderstanding about whether the experimental results were obtained for single queries or for arbitrary unions of conjunctive queries. This has now been clarified. More importantly, the newly added experiment really makes a difference, and I find it more adequate than the already existing ones.

Overall, this new submission addresses the majority of my previous comments satisfactorily. Given that two of my main concerns have been positively clarified (as mentioned before), I am now inclined to propose that the paper be accepted. However, before that, there is still an issue that should be properly justified, namely the relation between the proposed algorithms and the ones that already exist in the literature for solving the "query reverse engineering problem" in the ontology-mediated query setting (OMQS). Some questions/comments in this regard follow:

- Algorithm KGRules-H, together with the use of the QLCS as the merging operation, will always find a query or a union of conjunctive queries (if one exists) that has as certain answers exactly the set of "positive instances" (a simplified, illustrative sketch of this merging behaviour is given after this list of questions). Since this would basically be a solution to the query reverse engineering problem, several questions naturally arise:

a) What are the differences between this algorithm and the ones that have been proposed to solve the query reverse engineering problem in the OMQS?

b) In case they are different, is KGRules-H+QLCS more or less efficient? Why use the proposed algorithm instead of any of the already existing ones?

- Similar questions are valid for the approximate variants of your algorithms (greedy variant, threshold variant), i.e., why not adapt existing exact algorithms in a similar way?

Having this comparison would help to clarify whether the new algorithms are really needed, or whether one could just take existing ones and integrate them into the proposed framework. I could not find any such comparison or related information in the new submission.
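To make the merging behaviour referred to in the first question above concrete, the following is a minimal, illustrative sketch, not the paper's actual KGRules-H implementation, written under the strong simplifying assumption that each exemplar's most specific query is just a set of concept atoms over the single answer variable, in which case the QLCS reduces to set intersection. All identifiers in the snippet are hypothetical.

```python
from itertools import combinations

# Simplifying assumption: a query is a frozenset of concept names over the
# single answer variable, so the QLCS of two queries is their intersection.
def qlcs(q1, q2):
    return q1 & q2

def covers(query, item):
    # The item is an answer of the query if all of the query's atoms
    # appear in the item's description.
    return query <= item

def merge_search(pos, neg):
    """Repeatedly merge queries of positive exemplars with the QLCS,
    keeping every merged query that still excludes all negative exemplars."""
    pos = [frozenset(d) for d in pos]
    neg = [frozenset(d) for d in neg]
    queries = set(pos)            # most specific queries of the positives
    correct = set(queries)
    while len(queries) > 1:
        # Heuristic: merge the pair whose QLCS loses the fewest atoms.
        a, b = max(combinations(queries, 2), key=lambda p: len(qlcs(*p)))
        merged = qlcs(a, b)
        queries -= {a, b}
        queries.add(merged)
        if not any(covers(merged, n) for n in neg):
            correct.add(merged)   # "correct" query: no negative is an answer
    return correct

# Toy usage: every positive exhibits 'dry_cough', no negative does.
pos = [{"dry_cough", "fever"}, {"dry_cough", "headache"}]
neg = [{"loose_cough", "fever"}]
print(merge_search(pos, neg))
```

Under this simplification, the union of the answer sets of the returned correct queries corresponds to the union of conjunctive queries mentioned above, whenever such a union exists.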

Additional comments, typos:

- The arguments given on page 17 to justify the existence of the QLCS seem correct to me. This is much better now than in the previous submission. Two additional comments:

i) It seems to me that the occurrence of TOP(x) in every query is not relevant for showing the existence of the QLCS. On page 17, it is claimed that because TOP(x) occurs in every query, the connected component of the graph product G1×G2 containing the pair of answer variables is not empty. I think it will be non-empty anyway, since the definition of query graph tells us that answer variables are nodes of the query graph. Hence, it does not seem accurate to argue that "... the inclusion of the top atom guarantees the existence of the QLCS...".

ii) I agree that q1,q2 <=_S q' implies q <=_S q' (p. 17, l.34). But this implication is not an immediate consequence; it requires some manipulation of homomorphisms. It would be good to either add a more detailed explanation or just write a formal proof (a possible sketch of the argument is given after the list of typos below).

- p.3, l.25: *terminolocial*

- p.3, l.51: ...variable name distinct from u_i... I think it should be z_i instead of u_i.

- p.16, l.51: should [63] occur before the full-stop?

- p.23, l.30: be be
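A possible way to spell out the argument from point ii) above, given here only as a sketch under assumed conventions: q is the product query of q1 and q2, \ell_V and \ell_E denote the node and edge labels of the query graphs, and q <=_S q' holds iff there is a homomorphism from q' to q mapping the answer variable to the answer variable.

```latex
Assume $q_i \le_S q'$ is witnessed by a homomorphism $h_i : q' \to q_i$ ($i = 1, 2$),
and define $h(z) = \langle h_1(z), h_2(z) \rangle$ for every variable $z$ of $q'$.
If $C(z)$ is an atom of $q'$, then $C \in \ell_V(h_1(z)) \cap \ell_V(h_2(z))$,
so $C(h(z))$ is an atom of the product query $q$; likewise, if $r(z, w)$ is an
atom of $q'$, then $r \in \ell_E(\langle h_1(z), h_1(w) \rangle) \cap
\ell_E(\langle h_2(z), h_2(w) \rangle)$, so $r(h(z), h(w))$ is an atom of $q$.
Moreover, $h$ maps the answer variable of $q'$ to $\langle x_1, x_2 \rangle$ and,
assuming $q'$ is connected (as explanation-rule bodies are), its image lies in
the connected component of $\langle x_1, x_2 \rangle$. Hence $h$ is a
homomorphism from $q'$ to $q$, which witnesses $q \le_S q'$.
```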


Comments

Reviewer 1
We thank the reviewer for their valuable comments. We took into consideration all the reviewers’ comments and made major revisions to our manuscript. Below we explain in detail how we addressed the reviewer’s comments.

Comment 1.1
“My main concern is with the motivation of the paper, and how it applies to the developed formalism. The authors argue that semantic labelings should be used instead of instance features to allow for understandable explanations, given that the names have a meaning to users reading the explanation. In the original example, the authors propose that the input of a diagnosis is a recording of a cough, which for training has been semantically labeled by an expert (as "dry cough", "loose cough", etc). These labels characterise the recordings, and are used for the explanation. I see two obvious problems with this (which are actually connected):

1. the interest of explaining black-box classifiers is not in understanding the train or test instances, but rather in understanding a new instance given. For those, the semantic annotations are not available---otherwise, we would not need the black-box classifier to begin with”

What the reviewer mentions as "the interest of explaining a black-box classifier" ("understanding a new instance given") is the aim of local explanations, while in this work we introduce a method to formalize and produce global explanations, which aim at interpreting a model's behaviour in general (in this work, via the explanation dataset) rather than on a specific instance.
In order to address this comment in our revised manuscript, we expanded the introduction (related work and motivation) to make clear the distinction between local and global explanations, and the positioning of our work in this context.

Comment 1.2
“2. there is no guarantee of any kind that the black-box classifier is using the "semantics" given by the annotations. Indeed, it could be that for the classifier other factors such as the length of the recording, the frequency of cough, or its strength are more important. We would then be explaining through irrelevant labels (and hence, mislead the users).”

In order to address this comment, we clarified in our manuscript (in the motivating example, and after definition 2) that even though we do not have a guarantee that the classifier is using the "semantics" of the explanation dataset, the explanations are still useful indicators of the behaviour of the classifier. Importantly, by appropriately choosing an explanation dataset, and the corresponding vocabulary and knowledge, an end-user has control over what information the explanations take into account. For instance, a system developer would use a different explanation dataset than a medical professional.

Comment 1.3
“Another issue is with the classifier, and the rules extracted. As the authors observe, it is difficult to find correct rules for a classification, and hence one would like to find some with exceptions. However, it is usually these exceptions that are interesting to explain (the difficult cases to understand, not the simple instances). For these, there is no notion of explanation, except about being exceptions to a rule. Very important details, such as bias, could be hidden in those exceptions that cannot be understood. It would be good if the authors could discuss this issue.”

In order to address this comment, we have extended the discussion regarding the exceptions in section 5.5. We plan to further study the exceptions in a systematic way in future work as we mention in the conclusion of the revised paper.

Comment 1.4
“Another less problematic, but still relevant issue regards presentation. I have recently observed this trend where at first sight, papers seem more general than they actually are, and important restrictions are hidden in some sentences in the middle of the paper. Although this should formally not be an issue (a paper should be read in detail), in reality can cause confusions to superficial readings. In this case, for example, the authors first define the general notion of a conjunctive query, then (in line 47 or page 3) say that "query" stands for "conjunctive query", and a few lines later (line 2 of page 4) say that "query" stands for "instance query". Someone who misses that last passage can easily think that it can handle conjunctive queries in general.”

We have eliminated the term "instance query" to avoid confusion. Instead, we have introduced the term sav query, which refers to queries with a single answer variable and no individual names in their body. We now use this term throughout the paper to make our assumptions clear.
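For illustration only (the concept and role names below are hypothetical, not taken from the paper), a sav query has a single answer variable and only variables in its body:

```latex
q(x) \leftarrow \mathrm{Patient}(x) \wedge \mathrm{hasSymptom}(x, y) \wedge \mathrm{DryCough}(y)
```

By contrast, a query such as q(x) ← hasSymptom(x, s1), whose body mentions the individual name s1, would not be a sav query.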

Comment 1.5
“Another case is at the end of page 4: after defining rules in general, it is assumed that all rules are "connected". This assumption is mentioned there, and nowhere else, but used throughout the paper.”

Given that rules in the paper are explanation rules, we have added in definition 2 that an explanation rule is by definition connected, so that the assumption made in the background section is no longer needed.

Reviewer 2
We thank the reviewer for their valuable comments. We took into consideration all the reviewers’ comments and made major revisions to our manuscript. Below we explain in detail how we addressed the reviewer’s comments.

Comment 2.1
“No external knowledge graphs are used; a knowledge base is built based on the exemplary datasets, using 'features' extracted from it; it looks (at least based on the examples) that a knowledge bases are built and are fully related to datasets, right?”

This is correct for the case of the Mushroom experiment, where the exact features of the classifier are used as semantic descriptions, and the MNIST experiment, where semantic descriptions are automatically extracted using the features (pixels). In the CLEVR-Hans experiment, the semantic descriptions are not extracted from the features but provided “externally” by the creators of the dataset. However, even in this case, there is no TBox, so the reviewer’s comment stands. In the revised version of the paper, we include another experiment (section 5.3) in which we also use external knowledge (WordNet, which is independent of the dataset and the features) as a TBox.

Comment 2.2
“The examples are a bit disappointing; they should provide some 'exact' results so a reader can derive her own opinion regarding the method's benefits (a bit more below). There is also the impression of over-complication in some places. And, the paper is long and not the easiest to follow. Understandably, you need to provide a number of concepts and notations – but it is a bit overwhelming – the idea behind your approach is somehow difficult to 'extract' (sorry, it was my impression). So, any simplification of the paper would provide the reader with an easier task to understand it and appreciate it. Still, I like the paper and many aspects of it very much and would like to see its simplified and more 'goal-oriented' version.”

To address this comment we added explanatory notes after complex terms and notions in order to simplify them. See answers to the next comments for further details.

Comment 2.3
“you use the term ‘semantic query’ – any particular reason for that?”

The term semantic query is used in the literature for queries over knowledge bases. An explanatory note has been added to the paper.

Comment 2.4
“what do you mean by “certain answers” – are these definite ones? Do you even need to mention that? Could some answer be uncertain?”

The formal definition of certain answers is provided in Section 2. For clarification, the definition of a (not necessarily certain) answer has been added. The entire paper has been checked to make sure that certain answers are always called by their full name.
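As a purely illustrative example of the distinction, with a hypothetical vocabulary rather than the paper's:

```latex
\mathcal{T} = \{\, \mathrm{Cough} \sqsubseteq \mathrm{DryCough} \sqcup \mathrm{LooseCough} \,\}, \qquad
\mathcal{A} = \{\, \mathrm{Cough}(a) \,\}
```

Here a is a certain answer of q1(x) ← Cough(x), since it is an answer in every model of the knowledge base, whereas it is not a certain answer of q2(x) ← DryCough(x): it is an answer of q2 in some models but not in all of them.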

Comment 2.5
“Page 4, line 14: \ell_E should be E × E → 2“

The domain of \ell_E is the set of edges (pairs of nodes), so the definition is correct. However, to be consistent with that and to avoid possible confusion, the two (pair) arguments of \ell_E have been replaced throughout the paper by a single argument that encloses the two pair components within angle brackets.

Comment 2.6
“Page 4, line 24: you use the term ‘substitution’ – a few words of explanation, please”

A formal definition of substitution has been added in the paragraph about conjunctive queries.

Comment 2.7
“Page 4, line 49: you have ‘D(x)’ – what D stands for? I assume it is different from C(x); does it mean class/category of a classification problem (sorry, I could not find any explanation in the text).”

It has been added that D is an element of CN (the set of concept names).

Comment 2.8
“Page 6, Fig 1: it is a nice figure but could include a bit more details that explain the approach, or you could include a figure representing a ‘flow’ of the process (just a suggestion).”

Another figure has been added to section 4, to represent the ‘flow’ of the process.

Comment 2.9
“Page 7, line 9: is Exemplar a subset of EN? (the first part of the formula); in general, it would be nice to see a bit more explanation of this important formula.”

Exemplar is defined in definition 1. A technical clarification of this formula has been added after the definition.

Comment 2.10
“Page 7, Example 1 (and following up): I really like the idea of ‘running example’ that illustrates your proposed approach, yet, not everything is very clear – for example, s1, …, s8 as a part of IN – are these symptoms? Also, it would be nice to mention how the explanation rules (top of page 8) were created (refer to Section 4, its particular parts).”

Yes, s1, s2, ..., s8 are symptoms. Because of the simplicity of the dataset, the rules were created by hand, in order to make the framework easier to understand.

Comment 2.11
“Page 8, lines 18-20: it seems to me that it is the essence of your approach – to make the paper more understandable for a reader, it would be nice to define what is a query reverse engineering problem in this particular context.”

In section 3 (Framework), we have introduced a short definition of the Query by Example framework which is commonly used in other works on query reverse engineering, closely related to ours.

Comment 2.12
“Page 10, line 29 (and page 11, line 48): TBox eliminated? Could you please provide a short explanation?”

In the introduction of section 4 (Computation of Explanations), a short explanation of TBox elimination has been added to the text.

Comment 2.13
“Page 12, line 46 (Section 4.2) is it easier to determine ‘dissimilarity’ than ‘similarity’? any comment here?”

Throughout the text, we have replaced the phrase "least dissimilar" with "most similar". In the algorithms, however, we have kept the term "dissimilarity", as that is what is actually computed.

Comment 2.14
“Section 5: Experiments
As much as I enjoy three different datasets used as case studies, I am disappointed you have not provided examples of generated rules/explanations. It would be very interesting to see how we (readers) can ‘evaluate’ these explanations. Among the three datasets, the first two seem (at least for me) quite suitable for showing how classifiers work. For example, I could suggest putting some samples of classified (correctly/or many not) data points and explaining why this happened. The third dataset – MNIST – is very popular, as you indicated, but somehow less attractive for the topic of your paper.”

For CLEVR-Hans and MNIST we have provided multiple examples of generated rules/explanations (Tables 4, 5, Figure 10, and section 5.3.3). Additionally, in the revised manuscript we conduct an experiment on a new dataset that includes external knowledge used as a TBox (see comment 2.1).

Reviewer 3
We thank the reviewer for their valuable comments. We took into consideration all the reviewers’ comments and made major revisions to our manuscript. Below we explain in detail how we addressed the reviewer’s comments.

Comment 3.1
“Given that the underlying reverse engineering problem have already been investigated in the literature, the properties of being *intuitive* and *understandable* seem to be a distinctive and relevant aspect of the paper's contribution. However, I could not find a formal or convincing argument that supports this in the paper. All this seems to be discussed in the paper at a very informal level. In particular [53] and [46] are cited in these discussions. On the one hand, this submission appears to be the extended version of [53] which is a Workshop paper. For this reason, I believe that any discussion in this regard appearing in [53] should actually be provided here. On the other hand, it is not clear what kind of manuscript [46] is. Has it been peer-reviewed? The bibliography entry gives no clue about it and what I could find in DBLP looks like a shorter version of this submission.”

We have clarified throughout the manuscript that the terms intuitive and understandable refer to the terminology used in the explanations, which we argue can potentially impact the intuitiveness and understandability of the explanations themselves. We have extended the discussion of these notions in sections 1 (Introduction) and 3 (Framework), including the discussion appearing in [53].
In the revised manuscript, we have included the main aspects (including proofs) of [46], so that they can be reviewed as part of this paper.

Comment 3.2
“The experiments are not designed to analyse this aspect of the provided solution, and there is no formalisation of what *intuitive* and *understandable* means. For example, one possibility could be to define reasonable properties, either of a syntactic or a semantic nature, that *intuitive* and *understandable* queries are required to satisfy. Once this is done, it provides a formal criteria that can be used to evaluate how good the queries produced by the provided algorithms are.”

In the revised version of the paper, we have clarified that our claims about understandability and intuitiveness refer to the vocabulary used in the explanation dataset. We plan to systematically measure understandability and intuitiveness in the future, as we mention in section 6 (Conclusions and Future Work).

Comment 3.3
“Experiment 1 yields 100% fidelity for the two datasets of greatest size. It seems unlikely that this can be the case, unless the Mushroom dataset is not really a good choice for benchmarking.”

This is a first experiment that allows us to see whether the explanation method can find the ground-truth rule that we know exists (as mentioned by the provider of the dataset). It also allows us to compare our work with other methods that are designed for tabular data; since the other methods cannot achieve 100% fidelity, this is an indication that they are not complete.

Comment 3.4
“Note that only 123 mutually different items are required for Merging to yield the "empty query" (or a query with only TOP(x)) in your algorithms. This means that there would be no query capturing the exact behaviour of the classifier. Notice that 123 is really small w.r.t. 1000 and 4000, which are the sizes of the biggest datasets.”

123 is the size of CN. Even though it is possible for 123 mutually different items to yield the empty query, this would be an adversarial example. The actual Mushroom dataset can be described by First-Order Logic rules (as mentioned by the creators of the dataset), which is why we are able to find rules even for such large datasets.

Comment 3.5
“The average length of a query is around 14, which is only a bit above the half of the number of categorical features considered in the experiment. Further, an MSQ representing an item is a conjunction of concept names and it can never have two concept names representing different values of the same categorical feature. Hence, it looks like having 100% fidelity implies that half the categorical features in the Mushroom dataset are irrelevant (already for the classifier) to separate positive and negative instances.”

As mentioned in the text, in this experiment we take a set of correct rules as explanations and the union of their certain answers, as also indicated by the "Nr. of rules" measurement. Thus, the average length of queries being less than the number of categorical features does not indicate that features are irrelevant, since different features might be present in different rules of the set. We have added further clarification at the end of section 3 (Framework) that it might often be beneficial to examine sets of rules, and also that in the Mushroom experiment we take a union of correct rules in order to compare with the other methods, which also return unions of rules.

Comment 3.6
“The knowledge base used in Experiment #2 is extremely simple. I do not really see how this illustrate one of the main claims of the paper. Namely, that really complex ontological knowledge is take into account to provide explanations. In this experiment, this knowledge can be easily integrated into the data. No real ontology-based logical entailment is needed.”

In the revised version we have included another experiment that uses external knowledge (WordNet) as a TBox. However, even in the case of simple knowledge, such as in the CLEVR-Hans experiment (Experiment #2), other rule-based explanation methods that are designed for tabular data do not seem able to provide explanations, even when the knowledge is "easily integrated into the data", so the experiment shows that even with "extremely simple" knowledge our approach has merits.

Comment 3.7
“1. The presentation of technical notions defining the syntax and semantics of Description Logics is very poor. This should be formally introduced, since the use of background logical knowledge is supposed to make the results in this paper distinctive from other existing approaches. Some aspects in this regard are:

- There is no example illustrating the syntax and semantics of a DL. In particular, nothing is said about how the interpretation of concept and role names extend to complex formulas. There is no formal definition of what a TBox and an ABox are, nor a formal definition for the satisfaction of axioms in a TBox/ABox.

- The second sentence in line 43 (page 3) is not accurate. The fact that most DL dialects are fragments of FOL has also to do with the the available set of constructors and their semantics.”

The formal presentation of Description Logics has been extended. The TBox and ABox are now explicitly defined, and two tables have been added: one for the syntax and semantics of some common DL concept and role constructors (the SROIQ dialect), and one for the TBox and ABox axioms and their satisfaction semantics.

With respect to FOL translatability, the paper states that most DL dialects can be seen as fragments of first-order logic. The statement of the definition of a DL dialect has been improved so as to make clear that a DL dialect is defined by the constructors it provides (their syntax and semantics), and hence this also determines FOL translatability.
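As an illustration of this kind of translatability (the axioms below are hypothetical examples, not taken from the paper), typical TBox inclusions correspond to first-order sentences as follows:

```latex
\mathrm{DryCough} \sqsubseteq \mathrm{Cough}
  \;\rightsquigarrow\; \forall x \,(\mathrm{DryCough}(x) \to \mathrm{Cough}(x))

\exists \mathrm{hasSymptom}.\mathrm{Fever} \sqsubseteq \mathrm{SeekAdvice}
  \;\rightsquigarrow\; \forall x \,\bigl(\exists y\, (\mathrm{hasSymptom}(x,y) \wedge \mathrm{Fever}(y)) \to \mathrm{SeekAdvice}(x)\bigr)
```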

Comment 3.8
“2. Considered queries and some related notions.
- The notion of substitution should be properly defined.”

A proper definition of substitution has been added in the background section.

Comment 3.9
“While it follows from the definition of QLCS that it is unique (modulo syntactical equivalent), this is the case if it actually exists. Whether it always exists or not is, however, not really shown in the paper. Note that, assuming that all queries contain TOP(x) as an atom only ensures that a common subsumer exists, but this need not be the minimal. A formal proof of this should be provided.”

We have moved this claim to section 4.3.1. The method we present for constructing a QLCS should make it apparent that the inclusion of the top atom guarantees the existence of the QLCS.
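For illustration, the following is a rough sketch of a product-of-query-graphs construction, not the paper's exact method; queries are given directly as node- and edge-labelled graphs, and all identifiers are hypothetical. It shows that the answer-variable pair is always a node of the product, and that when both queries contain the TOP atom its label is guaranteed to be non-empty.

```python
from itertools import product

def qlcs_product(q1, q2):
    """A query is (answer_var, node_labels, edge_labels): node_labels maps
    variables to sets of concept names, edge_labels maps ordered variable
    pairs to sets of role names. Returns the product query restricted to
    the connected component of the pair of answer variables."""
    (x1, nl1, el1), (x2, nl2, el2) = q1, q2
    nodes = {(u, v): nl1[u] & nl2[v] for u, v in product(nl1, nl2)}
    edges = {((u1, v1), (u2, v2)): el1[(u1, u2)] & el2[(v1, v2)]
             for (u1, u2) in el1 for (v1, v2) in el2
             if el1[(u1, u2)] & el2[(v1, v2)]}
    # Keep only the connected component containing the answer-variable pair.
    component, frontier = {(x1, x2)}, [(x1, x2)]
    while frontier:
        n = frontier.pop()
        for (a, b) in edges:
            for m in (a, b):
                if n in (a, b) and m not in component:
                    component.add(m)
                    frontier.append(m)
    return ((x1, x2),
            {n: nodes[n] for n in component},
            {e: r for e, r in edges.items()
             if e[0] in component and e[1] in component})

# Toy usage: both queries contain the TOP atom on the answer variable x,
# so the answer-variable pair carries at least {'TOP'} in the product.
q1 = ('x', {'x': {'TOP', 'Patient'}, 'y': {'Fever'}},    {('x', 'y'): {'hasSymptom'}})
q2 = ('x', {'x': {'TOP', 'Patient'}, 'z': {'Headache'}}, {('x', 'z'): {'hasSymptom'}})
print(qlcs_product(q1, q2))
```

Reading the returned component off as a query gives a common subsumer of the two inputs.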

Comment 3.10
“The set of nodes V of an ABox graph should not be IN, but the individual names occurring in the ABox. The set IN is the set of all available constants. This does not mean that all of them must be used in every ABox. This should also be made clear in other parts of the paper.”

The definition of the ABox graph has been adjusted so as to include only the individual names occurring in the ABox as nodes.

Comment 3.11
The notation has been improved, and an explanatory note has been added to clarify that the union and nominal constructors (one-of) are involved in this expression.

Comment 3.12
“Theorem 1 seems to be correct. However, the proof is missing and deferred to [46] (see previous comments about [46]).”

We have included the proof from [46] in the revised manuscript.

Comment 3.13
“Nothing is said about the cases where it is not possible to exactly capture the behaviour of the classifier. Can you provide examples and/or theoretical characterisations for this? This can help to understand the limitations of the framework, beyond the insights that could have been provided in the experiments. Perhaps, the existing work on the reverse query engineering problem can shed some light into this?”

We have extended the discussion of related work on the query reverse engineering problem in the introduction section. For cases where it is not possible to exactly capture the behaviour of the classifier, our framework provides exceptions; we have extended the discussion of exceptions in section 5.5.

Comment 3.14
“More details should be provided about why the algorithm computes a set of queries instead of just one. More precisely, what is the intention of giving a set of queries to the user? How are they meant to be used? In particular, the intermediate queries obtained in the computation.”

The last query computed by the algorithm is only guaranteed to be optimal when the QLCS is used and there is a correct query that captures the entire pos-set. In all other cases, there is no single best query, which is why we compute multiple ones. We have expanded the text in section 4 to analyse how the queries would be presented to a user.

Comment 3.15
“The GreedyCommonConjunct (Algorithm 4) and the threshold approximation are introduced as alternatives of using just QLCS, to avoid certain problems. However, there is no comparison between all these variants in the paper. Natural questions that arise from it are:

- Is the "exact query" property lost by using GreedyCommonConjunct instead of QLCS?”

“What would be the difference between queries produced by using GreedyCommonConjunct and those generated by using QLCS, in terms of the quality measures defined in the paper?”

We have expanded the text to clearly describe the advantages and limitations of each method for computing queries.

Comment 3.16
“How much using the threshold approximation affects the quality of the queries?”

We have clarified in section 4 that the effects of the threshold on the queries produced can vary significantly for different inputs and therefore different explanation datasets; thus, the threshold should be adjusted experimentally.

Typos
The typos indicated by the reviewer have been fixed.