Review Comment:
Overall, the authors have addressed most of the comments from the previous iteration of the paper. However, from my perspective there is still a major issue preventing acceptance.
The issue is that the evaluation is done only on unseen classes. I raised this issue in my previous review. The authors answered:
“It is known that there are usually two testing settings in ZSL. One is standard ZSL, which predicts the testing samples of unseen classes on unseen classes. The other is generalized ZSL, where the testing samples of seen and unseen classes are classified with candidate labels from both seen and unseen classes. In this paper, for investigating the feature transferability from seen classes to unseen classes, we focus on the standard ZSL setting to evaluate the prediction ability of unseen classifiers and generate explanations for these unseen classes. It is also worth considering how to deal with explainable ZSL in generalized ZSL setting in real-world applications. Maybe we can adopt a two-phase framework -- a coarse-grained phase to judge if a testing sample comes from seen classes or unseen classes, and a fine-grained phase to make final predictions, where traditional classifiers (e.g., softmax classifiers) are used to predict its label with candidates from seen class set if the sample is from seen classes predicted by coarse-grained phase, and ZSL classifiers are used to predict its label with candidates from unseen class set if it belongs to unseen classes. We can make further attempts for this in the future”
My view is that this is not something to just look at in the future. It is perfectly justified, and even essential, to have an experiment whose goal is to show transferability. However, I also see a strict need to evaluate with the seen classes in place. As far as I currently understand your work, there is no need for a two-stage process either: just treat the seen classes in the same fashion as your unseen ones (see the sketch below). This task will of course be harder, and that is exactly the point. I expect the results to be much worse than what you currently obtain. This, however, would still be an interesting outcome, because it would show that 1) you can transfer-learn, but 2) when both seen and unseen classes are present, things do not work as well. Besides, it would be very exciting if you could provide deeper insight into where the class confusions occur most often. Either the confusion is more or less uniform (the less interesting case), or it happens most often between seen and unseen classes, which would give us further insight.
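To make the requested experiment concrete, here is a minimal sketch of the evaluation I have in mind. All names and shapes here are my own assumptions (in particular the dot-product scoring, which is how I currently read your method, see my question further below): score every test sample against the predicted classifiers of all classes, seen and unseen alike, and then tally where the confusions fall.

```python
import numpy as np

def gzsl_confusion(feats, labels, classifiers, seen_mask):
    # feats:       (n_samples, d)  CNN features of ALL test samples
    # labels:      (n_samples,)    ground-truth class indices
    # classifiers: (n_classes, d)  one predicted classifier vector per class,
    #                              covering seen AND unseen classes
    # seen_mask:   (n_classes,)    True where the class was seen in training
    scores = feats @ classifiers.T          # (n_samples, n_classes)
    preds = scores.argmax(axis=1)           # no two-stage gating anywhere

    # 2x2 block summary: rows = true seen/unseen, cols = predicted seen/unseen
    blocks = np.zeros((2, 2), dtype=int)
    for y, p in zip(labels, preds):
        blocks[0 if seen_mask[y] else 1, 0 if seen_mask[p] else 1] += 1
    return preds, blocks
```

If the off-diagonal blocks dominate, the confusions happen mostly between seen and unseen classes, which is the insightful case; roughly uniform blocks would be the less interesting outcome.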
I do have some more minor issues below, but I see this experiment as a major missing piece of the paper. I considered recommending a major revision to make sure this issue is amended, but that would lead to an immediate reject. Hence, I decided to go for a minor revision and ask the authors to perform such an experiment for the next version of the paper.
A second issue that still needs more attention is describing exactly how the features flow between the models. I am still not getting the whole picture; it might have something to do with the phrasing. For example, I do not understand the sentence “With learned feature vectors of classes, we use the CNN classifiers of classes as the training supervision to map inter-class relationship into deep CNN features so that predicting a visual classifier for each class node”. Is it correct that the features coming out of your AGCN are never actually fed into the CNN, but are only used at the end to compute a dot product, which is then interpreted as the score?
The same confusion might be resolved if I understood what $f_i$ in formula 4 exactly is. Is it the output of a pre-trained CNN? If so, why do you call it “the classifier of seen class $i$”?
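To check my understanding, here is my current reading of the inference path as a sketch. All names are mine, and the interpretation of $f_i$ as the pre-trained CNN's last-layer weight vector for class $i$ is my assumption; please correct me if this is not what happens:

```python
import numpy as np

def classify(image_feature, agcn_outputs):
    # image_feature: (d,)            feature of one image from the fixed CNN
    # agcn_outputs:  (n_classes, d)  one predicted classifier per class node;
    #                for seen classes these are regressed towards f_i, which
    #                I take to be the CNN's own last-layer weights for class i
    scores = agcn_outputs @ image_feature   # dot products, read as class scores
    return scores.argmax()
```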
Minor issues
In equation 3, I am surprised to see that $\hat{v}_i$ is computed using attention over the neighbors, but without using the state of the node $v_i$ itself at all. Is that intentional? Why?
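Concretely, my reading of equation 3 is the left-hand form below, whereas including the node's own state would give the right-hand form (this is my reconstruction, not the paper's notation):

$$
\hat{v}_i = \sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, v_j
\qquad \text{vs.} \qquad
\hat{v}_i = \sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha_{ij}\, v_j
$$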
Now that it is mentioned, it caught my attention that you have an extremely large state in the nodes (2048 dimensions). What is the reason for that choice?
You write “our model is a regression model rather than a classification model, which usually works better.” Which of the two works better? For which case?
There are a couple of issues to which you gave more attention in your cover letter than in the paper. Perhaps you can also expand your explanations in the paper accordingly.