NeuSIRE: Neural-Symbolic Image Representation and Enrichment for Visual Understanding and Reasoning

Tracking #: 3247-4461

This paper is currently under review
Muhammad Jaleed Khan
John Breslin
Edward Curry

Responsible editor: Guest Editors NeSy 2022

Submission type: Full Paper
The adoption of neural-symbolic hybrid approaches in visual intelligence is essential for progress toward seamless high-level understanding and reasoning about visual scenes. To this end, Scene Graph Generation (SGG) is a promising yet challenging task that involves predicting the objects, their attributes and their pairwise visual relationships in a visual scene to create a structured, symbolic scene representation, known as a scene graph, which is used in downstream visual reasoning to perform a desired task, such as image captioning, visual question answering, image retrieval, multimedia event processing or image synthesis. The crowdsourced training datasets used for this purpose are highly imbalanced, and it is nearly impossible to collect and collate training samples for every visual concept or visual relationship because of the huge number of possible combinations of objects and relationship predicates. Leveraging commonsense knowledge is a natural way to augment data-driven approaches with external knowledge and thereby enhance the expressiveness and autonomy of visual understanding and reasoning frameworks. In this paper, we propose a neural-symbolic visual understanding and reasoning framework based on commonsense knowledge enrichment. Deep neural network-based techniques are used for object detection and multi-modal pairwise relationship prediction to generate a scene graph of an image, followed by rule-based algorithms that refine and enrich the scene graph using commonsense knowledge. The commonsense knowledge is extracted from a heterogeneous knowledge graph in the form of related facts and background information about the scene graph elements. The enriched scene graphs are then leveraged in downstream visual reasoning pipelines. We performed a comprehensive evaluation of the proposed framework using common datasets and standard evaluation metrics.
As a result of commonsense knowledge enrichment, the relationship recall scores R@100 and mR@100 increased from 36.5 and 11.7 to 39.1 and 12.6, respectively, on the Visual Genome (VG) dataset, and similar gains were observed on the COCO dataset. The proposed framework outperformed state-of-the-art methods in terms of R@K and mR@K on the standard split of VG. We incorporated scene graph-based image captioning and image generation models as downstream tasks of SGG with knowledge enrichment. With the use of enriched scene graphs, the SPICE and CIDEr scores obtained by the image captioning model increased from 20.7 and 115.3 to 23.8 and 131.4, respectively; the proposed approach outperformed state-of-the-art scene graph-based image captioning techniques in terms of SPICE and CIDEr scores and achieved comparable performance in terms of BLEU, ROUGE and METEOR scores. The qualitative results of image generation showed that enriched scene graphs yield more realistic images in which the semantic concepts of the input scene graph can be observed more clearly. These encouraging results validate the effectiveness of knowledge enrichment in scene graphs using heterogeneous knowledge graphs. The source code is available at
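The refine-then-enrich step described in the abstract can be pictured as follows. This is a minimal, hypothetical sketch, not the authors' implementation: the triples, the plausibility table and the toy "knowledge graph" below are invented placeholders standing in for an SGG model's output and a real heterogeneous knowledge graph.

```python
# Hypothetical sketch of rule-based scene graph refinement and enrichment.
# All data below is illustrative; a real system would use SGG model output
# and facts retrieved from a heterogeneous knowledge graph.

# A scene graph as (subject, predicate, object) triples.
scene_graph = [
    ("man", "riding", "horse"),
    ("horse", "eating", "car"),   # implausible prediction, to be filtered out
    ("man", "wearing", "hat"),
]

# Toy plausibility table: predicates that commonsense allows per object pair.
plausible = {
    ("man", "horse"): {"riding", "near", "feeding"},
    ("man", "hat"): {"wearing", "holding"},
    ("horse", "car"): {"near"},
}

# Toy commonsense store: background facts per concept.
commonsense_kg = {
    "horse": [("horse", "is_a", "animal")],
    "hat": [("hat", "used_for", "covering head")],
}

def refine(graph):
    """Drop triples whose predicate is implausible for a known object pair."""
    return [
        (s, p, o) for (s, p, o) in graph
        if (s, o) not in plausible or p in plausible[(s, o)]
    ]

def enrich(graph):
    """Append related background facts for each concept in the graph."""
    enriched = list(graph)
    for s, _p, o in graph:
        for concept in (s, o):
            for fact in commonsense_kg.get(concept, []):
                if fact not in enriched:
                    enriched.append(fact)
    return enriched

enriched_graph = enrich(refine(scene_graph))
```

The enriched graph keeps the plausible relationships, drops the implausible one, and gains background facts ("horse is an animal") that downstream captioning or generation models can exploit.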
Full PDF Version: 
Under Review