NeuSyRE: Neuro-Symbolic Visual Understanding and Reasoning Framework based on Scene Graph Enrichment

Tracking #: 3419-4633

Muhammad Jaleed Khan
John Breslin
Edward Curry

Responsible editor: 
Guest Editors NeSy 2022

Submission type: 
Full Paper
Neuro-symbolic hybrid approaches are inevitable for seamless high-level understanding and reasoning about visual scenes. Scene Graph Generation (SGG) is a symbolic image representation approach based on deep neural networks (DNN) that involves predicting objects, their attributes, and pairwise visual relationships in images to create scene graphs, which are utilized in downstream visual reasoning. The crowdsourced training datasets used in SGG are highly imbalanced, which results in biased SGG results. The vast number of possible triplets makes it challenging to collect sufficient training samples for every visual concept or relationship. To address these challenges, we propose augmenting the typical data-driven SGG approach with common sense knowledge to enhance the expressiveness and autonomy of visual understanding and reasoning. We present a neuro-symbolic visual understanding and reasoning framework that employs a DNN-based pipeline for object detection and multi-modal pairwise relationship prediction for scene graph generation and leverages common sense knowledge in heterogenous knowledge graphs to enrich scene graphs for improved downstream reasoning. A comprehensive evaluation is performed on multiple standard datasets, including Visual Genome and Microsoft COCO, in which the proposed approach outperformed the state-of-the-art SGG methods in terms of relationship recall scores, i.e. Recall@K and mean Recall@K, as well as the state-of-the-art scene graph-based image captioning methods in terms of SPICE and CIDEr scores with comparable BLEU, ROGUE and METEOR scores. As a result of enrichment, the qualitative results showed improved expressiveness of scene graphs, resulting in more intuitive and meaningful caption generation and more realistic image generation using scene graphs. Our results validate the effectiveness of enriching scene graphs with common sense knowledge using heterogeneous knowledge graphs. 
This work provides a baseline for future research in knowledge-enhanced visual understanding and reasoning. The source code is available at

Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 25/Mar/2023
Review Comment:

I thank the authors for their efforts on the paper revision. After checking the author response and the revised paper, I consider that the authors have addressed most of my concerns.

Review #2
By Ivan Donadello submitted on 12/May/2023
Minor Revision
Review Comment:

The paper proposes a new loosely-coupled Neuro-Symbolic framework for scene graph generation. In particular, scene graphs generated with state-of-the-art (neural) methods are enriched with semantic (symbolic) information taken from Semantic Web resources in the form of embeddings. Such enrichments involve new nodes and arcs with respect to the original scene graphs. The semantically enriched scene graphs are of a better quality and improve the performance of the caption generation task on the COCO dataset.

The paper is original, well written, easy to understand, and can be useful to both the NeSy and the Computer Vision communities, as the proposed method improves the performance of scene graph generation on both the Visual Genome and COCO datasets and of caption generation on the COCO dataset. The provided GitHub link contains the source code, which partially allows the reproduction of the experiments.

My concerns regard different parts of the work:

1. The proposed method is a loosely-coupled Neuro-Symbolic framework: it adds background knowledge after the scene graph generation with Neural Networks (NNs), and this new knowledge does not affect the parameters of those NNs, as it would in standard NeSy works. This should be stressed more in the paper, as well as in the title and the abstract. As they stand, the title and the abstract suggest a tightly-coupled NeSy method, but this is not the case.
2. The paper would be greatly improved by also discussing the NeSy works with logic-based (ontology) priors. Here, the state of the art is limited to discussing knowledge graphs as priors in the form of embeddings. This is fine, but it can lead to errors in the scene graph generation. Indeed, at page 9, the second scene graph from the top states a wrong relationship. Such wrong relationships can be avoided by using hard constraints (e.g., coming from ontologies) as prior knowledge. Other errors in the scene graphs, due to these knowledge graph priors as embeddings, appear at page 17. Some works on hard constraints as priors to discuss could be:
1. Donadello, I., & Serafini, L. (2019, July). Compensating supervision incompleteness with prior knowledge in semantic image interpretation. In 2019 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). IEEE.
2. Díaz-Rodríguez, N., Lamas, A., Sanchez, J., Franchi, G., Donadello, I., Tabik, S., ... & Herrera, F. (2022). EXplainable Neural-Symbolic Learning (X-NeSyL) methodology to fuse deep learning representations with expert knowledge graphs: The MonuMAI cultural heritage use case. Information Fusion, 79, 58-83.
3. Donadello, I., & Serafini, L. (2015). Mixing Low-Level and Semantic Features for Image Interpretation: A Framework and a Simple Case Study. In Computer Vision – ECCV 2014 Workshops: Zurich, Switzerland, September 6-7 and 12, 2014, Proceedings, Part II (pp. 283-298). Springer International Publishing.
3. The GitHub repository could be improved by listing the steps necessary to run the experiments. Indeed, as is, the code does not allow the experiments to be fully reproduced: the README contains links to Python notebooks without any instructions on how to run them or on the exact sequence of methods to call. Moreover, the source code contains relative paths to directories in the operating system of the first author, so it is not possible to run the code as is.
4. The enriched scene graphs mix scene information and background knowledge. I am wondering whether this background knowledge will be ignored during the caption generation, as it would generate unnecessary text for the image understanding and would mislead the computation of the metrics for caption generation.
5. I would remove Section 4.2.6, as it is quite small and the evaluation is done manually on a sample of just four images. In this form, it is not a significant contribution of this paper.

Review #3
Anonymous submitted on 15/Jun/2023
Review Comment:

This is the second time that I have reviewed the paper. Unfortunately, the paper does not contain significant changes with respect to the first version. In the following, I summarize the reactions to my comments:

CONCERN: My first concern was that the proposed architecture is a pipeline of tasks and cannot be considered a contribution of a "new NeSy architecture".
REACTION: The authors added a comment at the end of the introduction claiming that the interdependence consists in the fact that the accuracy of the neural component affects the quality of the performance of the symbolic component.
COMMENT: This is just error propagation. Integration would mean that the symbolic component affects the neural network and would, at least in principle, be able to correct some of the errors made by the neural network.

CONCERN: Missing related works.
REACTION: The authors added a few sentences commenting on the papers I suggested for the related work.
COMMENT: This is ok.

CONCERN: Performance on relation detection.
REACTION: The authors added a bar plot that shows the results.
COMMENT: This is ok.

CONCERN: Comparing performance with the state of the art.
REACTION: The authors added Table 2.
COMMENT: Why did you decide to report the lower number from [38]? In any case, the proposed SGG cannot be said to outperform the state of the art; rather, it is in line with it.

At the end of the paper, the authors added some discussion about the limitations of the approach and future improvements.

My previous evaluation was the following: “The scientific contribution of the paper is limited. So this should be considered an experimental paper. However, the positive experimental results have been presented in a very similar conference paper:
KHAN, Muhammad Jaleed; BRESLIN, John G.; CURRY, Edward. Expressive Scene Graph Generation Using Commonsense Knowledge Infusion for Visual Understanding and Reasoning. In: European Semantic Web Conference. Springer, Cham, 2022. p. 93-112.
The only novelty seems to be the experimental evaluation of the downstream task of caption generation. Since I believe that this is not sufficient for a top journal paper, I suggest rejecting the paper due to its limitations in scientific and experimental contribution."

The core scientific contribution of the paper has not been touched; therefore, I maintain my previous evaluation that the paper's contribution is essentially limited.