Review Comment:
The paper proposes an architecture for scene graph generation from images and evaluates the effectiveness of the method on downstream tasks such as scene graph-based image captioning and image generation.
The approach proposed by the authors decomposes the complex task into three main phases: scene graph extraction, scene graph enrichment, and the downstream task (caption/image generation). The authors propose an architecture obtained by combining several components, each used to solve one of these sub-tasks.
The authors claim that they propose a “neural-symbolic visual understanding and reasoning framework”. This description does not reflect the contribution presented in the rest of the paper, for two main reasons. First, neuro-symbolic architectures are intended to have a tight integration of the neural part and the symbolic part, where both parts affect one another. The architecture presented in the paper looks more like a pipeline, where the symbolic part, i.e., the enrichment with the information available in a (set of) knowledge graph(s), runs as a separate step, and its output is then taken as input by the system that performs the specific downstream task. Second, this cannot be considered a valid contribution to neuro-symbolic architectures, since the addition of knowledge is done in a separate task and only its result is provided as input to a neural network. In NeSy architectures, the knowledge is injected into the model itself, either in the form of a loss or in the form of some special construct.
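To make the last point concrete, here is a generic sketch of loss-based knowledge injection (not code from the paper under review): a symbolic rule such as “riding(x, y) implies Person(x)” can be compiled into a differentiable penalty added to the training loss, e.g. via a fuzzy implication in the style of van Krieken et al.:

```python
import numpy as np

def constraint_penalty(p_rel, p_subj_person):
    """Fuzzy penalty for the rule: riding(x, y) -> Person(x).
    With the Lukasiewicz implication, the rule is violated to the
    degree max(0, p_rel - p_subj_person); adding this term to the
    task loss pushes the model toward rule-consistent predictions."""
    return np.maximum(0.0, p_rel - p_subj_person)

# Consistent prediction: 'riding' confidence backed by 'Person' confidence.
print(constraint_penalty(0.9, 0.95))  # no violation, penalty 0.0
# Inconsistent: high 'riding' score but low 'Person' score is penalized (~0.7).
print(constraint_penalty(0.9, 0.2))
```

The key difference from a pipeline is that this penalty participates in backpropagation, so the symbolic knowledge shapes the learned weights rather than only post-processing the outputs.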
In the related work, the authors completely neglect to report and discuss the literature on neuro-symbolic architectures that have been used for image processing. See, for instance:
Hitzler, Pascal, et al. "Neuro-symbolic approaches in artificial intelligence." National Science Review 9.6 (2022): nwac035.
Hassan, Muhammad, et al. "Neuro-Symbolic Learning: Principles and Applications in Ophthalmology." arXiv preprint arXiv:2208.00374 (2022).
For a survey of the approaches, or specific papers on semantic image interpretation:
van Krieken, Emile, Erman Acar, and Frank van Harmelen. "Analyzing differentiable fuzzy logic operators." Artificial Intelligence 302 (2022): 103602.
Buffelli, Davide, and Efthymia Tsamoura. "Scalable Regularization of Scene Graph Generation Models using Symbolic Theories." arXiv preprint arXiv:2209.02749 (2022).
….
For scene graph generation, they use Faster R-CNN for object detection and labeling of object types. This model is extended with an additional LSTM-based model that is used to predict the relationships between pairs of objects. An evaluation of this architecture is provided in Figure 4. The evaluation of SGG performance is usually divided into two parts, namely object type detection and relation detection, since the second is a much more difficult task than the first. Furthermore, for object type classification the authors use state-of-the-art architectures, while for relation extraction they propose a specific method, which should be evaluated on its own.
In addition, the reported results appear to fall short of state-of-the-art performance. Assuming that the reported evaluation concerns only the relationships between objects, the baseline is below the state of the art, which is 26.1 / 33.5 / 38.4 (R@20, R@50, R@100) as described in
Structured Sparse R-CNN for Direct Scene Graph Generation Yao Teng, Limin Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 19437-19446
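For reference, the R@K figures above measure the fraction of ground-truth relation triples recovered among the top-K predicted triples; a minimal sketch of this metric (a generic illustration with toy data, not the paper's evaluation code) is:

```python
def recall_at_k(ranked_predictions, ground_truth, k):
    """Fraction of ground-truth (subject, predicate, object) triples
    found among the top-k predicted triples (ranked by confidence)."""
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triple in ground_truth if triple in top_k)
    return hits / len(ground_truth)

# Toy example: 3 ground-truth relations, predictions ranked by confidence.
gt = [("person", "riding", "surfboard"),
      ("person", "wearing", "wetsuit"),
      ("surfboard", "on", "wave")]
preds = [("person", "riding", "surfboard"),   # correct
         ("person", "on", "surfboard"),       # wrong predicate
         ("person", "wearing", "wetsuit"),    # correct
         ("wave", "near", "person")]          # not in ground truth
print(recall_at_k(preds, gt, 2))  # 1 of 3 GT triples in the top 2
print(recall_at_k(preds, gt, 4))  # 2 of 3 GT triples in the top 4
```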
The relationships between objects added by the post-processing step that injects background knowledge, however, outperform these metrics, so it looks like there is indeed a contribution of the semantics to the task. However, there is something that I do not completely understand, which I believe should be clarified: to evaluate the enriched graph, do you also enrich the graphs of the test set? Otherwise, how can you train the system end-to-end? If you do enrich the test-set graphs, then the comparison with the standard method is not fair, since you also change the test set.
Regarding the scene graph enrichment: from the examples shown in the paper, it seems that you add certain background relationships and not others. For instance, in the first line of Figure 5, why do you add the superclass (ISA) relationship on the node of type person (isa Human) and not the superclass of surfboard? Do you have some method to limit the number of additional nodes and relationships to be added to an SG?
Concerning the downstream tasks, the authors only provide examples of the positive effects of enrichment and do not sufficiently explain and discuss the problems that might arise from such a method. This discussion would be necessary to assess whether the methodology is applicable in a given context.
The scientific contribution of the paper is limited, so this should be considered an experimental paper. However, the positive experimental results have already been presented in a very similar conference paper:
Khan, Muhammad Jaleed, John G. Breslin, and Edward Curry. "Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning." European Semantic Web Conference. Springer, Cham, 2022. 93-112.
The only novelty seems to be the experimental evaluation of the downstream task of caption generation. Since I believe this is not sufficient for a top journal paper, I suggest rejecting the paper due to its limited scientific and experimental contribution.