Review Comment:
The paper proposes a new loosely-coupled Neuro-Symbolic framework for scene graph generation. In particular, scene graphs generated with state-of-the-art (neural) methods are enriched with semantic (symbolic) information taken from Semantic Web resources in the form of embeddings. Such enrichments involve new nodes and arcs with respect to the original scene graphs. The semantically enriched scene graphs are of better quality and improve the performance of the caption generation task on the COCO dataset.
The paper is original, well written, and easy to understand, and it can be useful to both the NeSy and the Computer Vision communities, as the proposed method improves the performance of scene graph generation on both the Visual Genome and COCO datasets and of caption generation on the COCO dataset. The provided GitHub link contains source code that only partially allows the experiments to be reproduced.
My concerns regard different parts of the work:
1. The proposed method is a loosely-coupled Neuro-Symbolic framework: it adds background knowledge after the scene graphs have been generated with Neural Networks (NNs), and, unlike in standard NeSy works, this new knowledge does not affect the parameters of those NNs. This should be stressed more in the paper, and in both the title and the abstract. As is, the title and the abstract suggest a strong NeSy method, but this is not the case.
2. The paper would be greatly improved by also discussing NeSy works with logic-based priors (e.g., ontologies). Currently, the state-of-the-art discussion is limited to knowledge graphs used as priors in the form of embeddings. This is fine, but it can lead to errors in the scene graph generation. Indeed, on page 9, the second scene graph from the top states . This wrong relationship can be avoided by using hard constraints (e.g., coming from ontologies) as prior knowledge. Other errors in the scene graphs, due to these knowledge graph priors as embeddings, are on page 17: , , . Some works on hard constraints as priors that could be discussed are:
   1. Donadello, I., & Serafini, L. (2019, July). Compensating supervision incompleteness with prior knowledge in semantic image interpretation. In 2019 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). IEEE.
   2. Díaz-Rodríguez, N., Lamas, A., Sanchez, J., Franchi, G., Donadello, I., Tabik, S., ... & Herrera, F. (2022). EXplainable Neural-Symbolic Learning (X-NeSyL) methodology to fuse deep learning representations with expert knowledge graphs: The MonuMAI cultural heritage use case. Information Fusion, 79, 58-83.
   3. Donadello, I., & Serafini, L. (2015). Mixing Low-Level and Semantic Features for Image Interpretation: A Framework and a Simple Case Study. In Computer Vision-ECCV 2014 Workshops: Zurich, Switzerland, September 6-7 and 12, 2014, Proceedings, Part II 13 (pp. 283-298). Springer International Publishing.
3. The GitHub repository could be improved by listing the steps needed to run the experiments. As is, the code does not allow the experiments to be fully reproduced: the README contains links to Python notebooks without any instructions on how to run them or on the exact sequence of methods to call. Moreover, the source code contains relative paths to directories in the first author's file system, so it is not possible to run the code as is.
4. The enriched scene graphs mix scene information, such as, and background knowledge, such as . I wonder whether this background knowledge is ignored during caption generation, since otherwise it would generate text that is unnecessary for image understanding and would distort the computation of the caption generation metrics.
5. I would remove Section 4.2.6, as it is quite short and the evaluation is done manually on a sample of just four images. In its current form, it is not a significant contribution of the paper.