Explainable Zero-shot Learning via Attentive Graph Convolutional Network and Knowledge Graphs

Tracking #: 2318-3531

Yuxia Geng
Jiaoyan Chen
Zhiquan Ye
Wei Zhang
Huajun Chen

Responsible editor: 
Dagmar Gromann

Submission type: 
Full Paper
Zero-shot learning (ZSL) which aims to deal with new classes that have never appeared in the training data (i.e., unseen classes) has attracted massive research interests recently. Transferring of deep features learned from training classes (i.e., seen classes) are often used, but most current methods are black-box models without any explanations, especially to people without artificial intelligence expertise. In this paper, we focus on explainable ZSL, and present a knowledge graph (KG) based framework that can explain the feature transferring in ZSL in a human understandable manner. The framework has two modules: an attentive ZSL learner and an explanation generator. The former utilizes an Attentive Graph Convolutional Network (AGCN) to match inter-class relationship with the transferability of deep features (i.e., map class knowledge from WordNet into classifier) and learn unseen classifiers so as to predict the samples of unseen classes, with impressive (important) seen classifiers detected, while the latter generates human-understandable explanations of the transferability with class knowledge that are enriched by external KGs, including a domain-specific Attribute Graph and DBpedia. We evaluate our method on two benchmarks for animal recognition. Augmented by class knowledge from KGs, our framework makes high quality explanations for ZSL transferability, and at the same time improves the recognition accuracy.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 24/Nov/2019
Minor Revision
Review Comment:

The authors propose a knowledge graph (KG)-based framework to explain feature transferability in Zero Shot Learning (ZSL). Specifically, they adopt WordNet and an Attentive Graph Convolutional Neural Network (AGCN) to model inter-class relationships for ZSL, known as an Attentive ZSL Learner (AZSL). A matching between the inter-class relationships and the transferability of CNN features from seen classes to unseen classes is learned. An explanation generator is then used to extract class knowledge from the domain-specific attribute graph and general external KGs (e.g., DBpedia) to provide common sense evidence for ZSL explanation. Experiments on two image classification benchmarks show that the explanation achieves high quality according to several metrics, as well as human assessment.


--The authors cover an interesting and important problem, and I think that the community could benefit from papers like these that are more focused on multimodal applications.
--Explainability is clearly an important area in AI, and combining ZSL with explainable AI is a novel proposition.
--Related work is fairly complete and well-written
--I am convinced by the experimental results of the authors and believe they illustrate the promise of the method.


--The writing could use improvement. There are minor typos even in the introduction, such as 'leaning' instead of 'learning' in lines 11-12 and 'evidences' in line 46. I suggest a good proofreading. Otherwise I am happy with the quality of the article.

Review #2
By Dagmar Gromann submitted on 02/Feb/2020
Minor Revision
Review Comment:

This article proposes an explainable zero-shot learning method that utilizes an Attentive Graph Convolutional Network to transfer inter-class relationships in the form of deep features of seen classes to unseen classes and an explanation generator that draws from existing knowledge graphs to fill slots of pre-defined natural language templates in order to explain common features of seen and unseen classes. The approach is validated by utilizing several baseline ZSL models on two standard datasets and human evaluators to judge the quality of produced explanations.

Overall evaluation:
The proposed approach is interesting, especially the aspect of explainability, and the evaluation is thorough, as is the use of baseline approaches and standard datasets. The zero-shot learning approach (marginally) outperforms the baselines, and the explanation generation method provides interesting novel results on the human annotation task. In addition, the number of different methods brought together for this central purpose is quite impressive. However, two aspects that require attention are the claim of novelty for the AZSL (see details below) and the quality of the language of the paper. Even though language should not be a reason for judging a paper, words are the medium for transferring meaning, and if they are used in a semantically incorrect way, it gets very difficult to understand the paper (see terminological issues for some examples). The general level of spelling and typos requires a thorough check by a first language speaker of English (some examples are provided under minor comments), and the terminology used requires a thorough check by a domain expert with a very high command of English.

Other than that there are some improvements to be made on the description of the approach to facilitate understanding (see comments below) and some missing details to be added (see below), all of which, however, do not require major re-implementations or changes to the approach per se. Thus, I suggest a very thorough revision of the paper itself taking the provided comments and questions into consideration.

Novelty of AZSL:
In the proposed model, the CNN is a very common pre-trained model, and the GCN seems to be very similar to the one described in [5]. In the related work there is a half-sentence on the difference, which, however, does not clarify this point for me. [5] uses a GCN and the hierarchical relations in WordNet just like the proposed paper. What is the difference exactly? Is it the attention layer? Does this alone qualify as a novel approach? Or is there something else that is novel that I am missing here?

Method description:
The description of the transfer process, at least to me, was very hard to parse and only half-clear after reading the section several times. Please clarify this process (in particular the training of unseen classifiers, how those are related to seen classifiers, and esp. Section 4.2.2) and make it very clear and explicit where existing approaches are re-used and which parts are truly novel in the proposed approach.

Missing details to facilitate reproducibility:
- availability of specifically generated Attribute Graph (publicly available?)
- version of DBpedia dump utilized
- tools utilized for NLP - NER, POS tagging as well as matching
- availability of code upon publication to foster reproducibility

Central terminological issues:
- transferability: it is the ability to transfer something; "positive feature transferability" and similar constructs are very strange; did you mean positive feature transfer?
- surrounding classes: could you explain what you mean by that exactly?
- "abstract text" is not the same as "text of an abstract"
- named noun => should be named entity esp. when referring to locations
- entities that are "friendly" for classes? (p. 16)

Questions to the authors:
- Which large corpus was used for training the embeddings that are utilized to initialize the GCN?
- How does the approach handle the situation when there is no "impressive seen classifier" available? In other words, what if there is no direct class in the "surrounding" of a given unseen class?
- p. 9 "We remove it from the match set" => How? Manually? How is this approach transferable to other domains? Would you always have to check all entity matches manually and remove the ones that are not correct?
- In the triple patterns, what is the experience in cases where r_1 and r_2 are different relations? Are the results still reliable? What is the experience with results of pattern 5 with the transitional entity t?
- Are all classifiers for the two datasets trained separately? Given the substantial overlap between the two datasets, this is important. Has it been ensured that none of the classes in the test set appear in the training set?
- Where in the paper is the dramatic improvement of ZSL claimed in the text quantitatively visible for ImageNet? And what does the +/- 0.008 for GCNZ for instance and in the whole line of DGP in Table 5 compare to? What is the baseline here?
- What is the first language of all the human annotators/raters in the experiment? Since the templates are generated in English and the intelligibility of explanations in English is judged, this might be of interest.
- Could you maybe add some more details on how this approach could be applied to KG construction or NLP?

Minor comments in order of appearance (page.line):
1.38: image recognition task => tasks
1.42: that human can => that a human can
2.8: text corpus => text corpora
2.10: prefer to more complex => omit "to"
2.12: learning and prediction for unseen => learning and predicting unseen
2.14: trust on => trust in
2.16ff: it's => very uncommon to use contractions in science => (several times in the paper) it is; (also no can't)
2.44 domain-specific Attribute Graph => a domain-specific Attribute Graph and also always in the paper "the or a Attribute Graph" (with the article)
2.46: there is no "evidences" => the plural of evidence is evidence
3.5: the future work => future work
general comment: please revise your use of articles throughout the paper and let a first language speaker of English check the paper
3.25: corpuses => corpora
3.31: really understand => understands
3.33: works devoting to => devoted (this problem of using "devoting" instead of "devoted" is recurring in the paper)
4.11: features learned deep neural networks => features learned by deep neural networks
4.25: from from => omit one
4.37: an unique => a unique
5.31: in the format of triple => in the format of triples
5.37: hierarchy relationship => hierarchical relationship
6.18: make illustration => illustrate (several times in the paper)
6.38: This deep network contain => contains
6.32: for each nodes => node
6.42: of first layer => of the first layer
7.33: different influence => a?
7.29: base => basis
7.35: for scoring so that predicting label for it => meaning?
8.27: a transaction is defined as an attribute is both owned by a set of classes => meaning?
8.22: line and sentence starts with an isolated ","
8 (in general): classsets => class sets
8.40: instance is a noun. Did you mean instantiate?
9.33: ambiguation => ambiguity
9.27: based on heuristically triple pattern => meaning?
10.19: "totally" is semantically different from "in total"
11.50: Awa => AwA (consistent spelling; should this not be AWA?)
12.12: 3,969
12.45: denoting as => denoted as
13.21: can dramatically improves => improve
13.19: learned in Section => described in Section
13.39: IMSC vs. IMCS??
14.19: satisfy => understand
16.37: later => latter
17 (general): refers the case => refers to the case

Review #3
By Michael Cochez submitted on 09/Feb/2020
Major Revision
Review Comment:

When a classification algorithm has to classify an example of a class it has never seen before, this is called zero-shot learning. This paper proposes using knowledge graphs as a background knowledge for such classification system. The idea is that this background knowledge can be used to transfer information known about classes of seen examples, to the class of the unseen example.

Overall, I do like the idea of the paper. It does address an important problem and is suitable for the special issue it was submitted to. However, in its current state I do recommend a major revision. The main reason for this is that some important details are either missing, or I was otherwise not able to understand them from the text.
Also, while reading the manuscript, several questions came up that are not answered in the text and that I think the article would benefit from discussing. I will provide a list of issues and questions below.
Note that I have not worked with zero-shot learning prior to reviewing this paper. So, there might certainly be aspects that I misunderstand and that are obvious to the authors of this work. However, as I would perhaps be a more typical reader of papers like this, I urge the authors to clarify the issues below nevertheless.
Finally, the paper would benefit from a very thorough language review. Small errors in grammar and formulation do affect the fluency of the paper.

Main issues

You create your own Attribute Graph. It is unclear why that is needed. Is DBpedia not sufficient? Are you unable to use it for some reason? How would your system work without that extra graph?
Also, would it be possible to gain insight into how much the coverage of the attributes affects the performance of the overall system?

You define the zero-shot learning problem very clearly. However, it should be noted that in any real system, it matters not only how the system reacts to unseen classes, but also how it reacts to seen ones.
My point here is that if you only test on unseen classes, you give the system an extra prior, and it will never confuse them with classes that have been seen before. So, your test cases should be a mix of examples from seen and unseen classes. Then, when reporting performance, these two have to be reported separately.
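
The protocol suggested here can be sketched as follows. This is only an illustration and not the authors' evaluation code: the helper name split_accuracy, the class names, and the toy labels are all hypothetical. Predictions are made over the union of seen and unseen classes, and accuracy is then computed separately for test examples whose true class is seen and for those whose true class is unseen.

```python
def split_accuracy(y_true, y_pred, seen_classes):
    """Return (seen_acc, unseen_acc), grouping examples by their TRUE class.

    Hypothetical helper for illustration only. Predictions may be any label
    from the union of seen and unseen classes.
    """
    seen_hits = seen_total = unseen_hits = unseen_total = 0
    for truth, pred in zip(y_true, y_pred):
        if truth in seen_classes:
            seen_total += 1
            seen_hits += int(truth == pred)
        else:
            unseen_total += 1
            unseen_hits += int(truth == pred)
    # Guard against an empty group so the function never divides by zero.
    seen_acc = seen_hits / seen_total if seen_total else float("nan")
    unseen_acc = unseen_hits / unseen_total if unseen_total else float("nan")
    return seen_acc, unseen_acc


# Toy data (made up): "horse" and "tiger" are seen classes, "zebra" is unseen.
seen = {"horse", "tiger"}
y_true = ["horse", "tiger", "zebra", "zebra"]
# The last zebra is confused with a seen class, which a test set
# containing only unseen candidate labels could never reveal.
y_pred = ["horse", "horse", "zebra", "horse"]

seen_acc, unseen_acc = split_accuracy(y_true, y_pred, seen)
print(f"seen accuracy: {seen_acc:.2f}, unseen accuracy: {unseen_acc:.2f}")
```

Reporting the two numbers side by side makes it visible when gains on unseen classes come at the cost of seen ones.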

Also from your definition, it seems like a reasonable idea to also use a held out set to train specific parameters of your system. Did you do this?

I do not completely understand the architecture from the current description. The main point I do not get is the connection from your GCN/graph attention part to the classifier. What is unclear is the exact output of the module: is it the classifier, as in parameters for a classifier, or a classification?
Then, I would understand this such that the system actually creates one (binary?) classifier for each of the nodes (classes) in the GCN. Then, how do you precisely select the classifier to use? Do you interpret weights as confidence levels?
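
To make the question concrete, here is a minimal sketch of one possible reading, in which the graph module outputs one weight vector per class and a sample is assigned to the class whose weight vector scores highest against its CNN feature. This reading is an assumption made for illustration and is not confirmed by the manuscript; the names predict and class_weights, the class names, and all numbers are hypothetical.

```python
def predict(feature, class_weights):
    """Score a feature vector against every per-class weight vector.

    Assumed reading (not confirmed by the manuscript): each graph node
    yields the parameters of a linear classifier, and the predicted class
    is the one with the highest dot-product score.
    """
    scores = {
        name: sum(w * x for w, x in zip(weights, feature))
        for name, weights in class_weights.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Made-up 3-dimensional "classifiers" and image feature.
class_weights = {
    "zebra": [0.9, 0.1, 0.0],
    "horse": [0.6, 0.0, 0.4],
    "tiger": [0.1, 0.8, 0.1],
}
feature = [1.0, 0.2, 0.0]

label, scores = predict(feature, class_weights)
print(label, scores)
```

Under this reading the outputs are raw scores, so treating them as confidence levels would require an additional normalisation step; whether the paper does this is exactly what the description should state.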

In Section 4.3.1, your rationale for using association rule mining is that "the searching space is often large for finding common attribute set" (sic). While this could be true in general, does this really apply to your case? Including some statistics on your particular datasets would be useful. Also, it is surprising you did not use the DBpedia ontology for this, as it does describe the domain and range of properties, which could have helped in this task.
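
For a given group of classes, the common attribute set is a plain set intersection, which is why statistics on the actual datasets would help in judging whether rule mining is needed. A minimal sketch with made-up data follows; the class names, the attributes, and the helper common_attributes are hypothetical and not taken from the paper's Attribute Graph.

```python
from itertools import combinations


def common_attributes(class_attrs, classes):
    """Return the attributes shared by every class in `classes`."""
    sets = [class_attrs[c] for c in classes]
    return set.intersection(*sets)


# Made-up attribute sets for three classes.
class_attrs = {
    "zebra": {"stripes", "hooves", "tail", "grazer"},
    "horse": {"hooves", "tail", "grazer", "mane"},
    "tiger": {"stripes", "tail", "claws"},
}

shared = common_attributes(class_attrs, ["zebra", "horse"])

# The kind of statistic asked for: size of the common set for every class pair.
pair_sizes = {
    (a, b): len(common_attributes(class_attrs, [a, b]))
    for a, b in combinations(sorted(class_attrs), 2)
}
print(sorted(shared), pair_sizes)
```

The number of candidate class groups grows quickly with group size, so reporting how many classes and attributes are involved would show whether exhaustive intersection is feasible.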

In your datasets, the seen and unseen classes are split such that there is only one hop (in WordNet) from a seen to an unseen class. This does mean that you can be pretty certain that there is a good coverage for a one hop away class. How does this affect your system? Would it still work if the needed information is two, or maybe five hops away?
Relatedly, it would be interesting to analyze the distance between the IMSC and the predicted class. I expect this to be one, most of the time (this is already hinted at in Table 7).
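
The distance analysis could be done with a breadth-first search over the class hierarchy. The sketch below uses a small made-up hierarchy rather than the actual WordNet graph, and the helper name hop_distance and all edges are hypothetical.

```python
from collections import deque


def hop_distance(edges, source, target):
    """Shortest number of edges between two nodes of an undirected graph.

    Returns None when no path exists.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    queue = deque([(source, 0)])
    visited = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in adjacency.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Made-up hierarchy for illustration (not real WordNet edges).
edges = [
    ("equine", "horse"),
    ("equine", "zebra"),
    ("ungulate", "equine"),
    ("ungulate", "camel"),
    ("camel", "dromedary"),
]

print(hop_distance(edges, "horse", "zebra"))
print(hop_distance(edges, "zebra", "dromedary"))
```

Collecting this distance for every (IMSC, predicted class) pair would give the histogram needed to check the expectation stated above.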

DBpedia and your own graph with attributes are only used for 'after the fact' explanation. It seems to me that it would be even better to include these graphs into the actual classification model, which now only gets access to a much more limited taxonomy. To give an example, a graph might have the information that a dromedary has only one hump, while a camel has two. This would help a lot in transferring knowledge and would be impossible to find from the WordNet tree.

Since the code for this submission does not seem to be available, it appears impossible to reproduce the results. Since the setup is pretty complex, the authors should provide a setup for easy reproduction.

Minor issues

In the abstract, you write "Transferring of deep features learned from training classes (i.e., seen classes) are often used, but most current methods are black-box models without any explanations, especially to people without artificial intelligence expertise". I am not sure what you mean here.
1. Other systems do provide explanations, but not accessible by non-specialists.
2. None of the systems provide explanations, and this mostly affects non-specialists.
Actually, you are not really stressing why these explanations are important.

p2l16 I think your argument about "but also disables the human-machine interaction which is important in machine learning model developing, configuration and debugging." is rather weak.

p2l32 "Moreover, its method is ad-hoc, only working for predefined class attributes". This seems a rather weak argument. If I predefine all attributes I can find in DBpedia and some other sources, I can just use this, right?
Moreover, if the attributes are embedded into a latent space, one could claim that only one attribute is really needed (ignoring the issue of explainability.)

p2l43b "*extensive* experiments are conducted to evaluate the generated explanation and the ZSL learner, using *two* .. benchmarks". "Extensive" seems to be at odds with using just two benchmarks here.

You often refer to "common sense knowledge" without defining that clearly. If we look at some of the knowledge bases which claim to contain common sense knowledge, then the sources used in the current work come nowhere near.

While DBpedia Spotlight has had its time, it is not really state-of-the-art any longer. Actually, in that whole Section 4.3.2 it is very unclear which parts are done automatically and which manually (if any at all).

In section 4.3.3, you generate text on only 10 random attributes if you have more than 10. This seems like an easy spot for improvement. It seems to make more sense to pick attributes that are clearly discriminative in comparison with other classes.
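
One simple way to do this is to rank a class's attributes by how few other classes share them and keep the top ones. The sketch below is an illustration of that idea only, not the authors' method; the helper discriminative_attributes, the class names, and the attribute data are hypothetical.

```python
def discriminative_attributes(class_attrs, target, k):
    """Return the k attributes of `target` shared with the fewest other classes.

    Ties are broken alphabetically so the result is deterministic.
    """
    others = [attrs for name, attrs in class_attrs.items() if name != target]

    def share_count(attr):
        # Number of other classes that also have this attribute.
        return sum(attr in attrs for attrs in others)

    ranked = sorted(class_attrs[target], key=lambda a: (share_count(a), a))
    return ranked[:k]


# Made-up attribute sets for three classes.
class_attrs = {
    "zebra": {"stripes", "hooves", "tail", "savanna"},
    "horse": {"hooves", "tail", "mane"},
    "tiger": {"stripes", "tail", "claws"},
}

top = discriminative_attributes(class_attrs, "zebra", k=2)
print(top)
```

With k set to 10, this would replace the random choice by a deterministic one that favours attributes setting the class apart from the others.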

p12l12 You claim a graph of size 3969. How is that made from only 950 classes?