On the use of semantic technologies for video analysis

Tracking #: 1789-3002

Pierluigi Ritrovato
Luca Greco
Mario Vento

Responsible editor: 
Guilin Qi

Submission type: 
Survey Article
The rapid proliferation of video recording devices has led to an explosion of content, driving ever-increasing interest in methods and tools for automatic video analysis and interpretation. Over the years, the availability of contextual knowledge has been shown to improve the performance of video analysis algorithms in several ways, although the formal representation of semantic content in a shareable and fusion-oriented manner is still an open problem. In this context, Semantic technologies have offered an interesting answer, opening a new perspective for so-called Knowledge Based Computer Vision (KBCV) by adding new functionality, improving accuracy, and facilitating data exchange between video analysis systems in an open, extensible manner. In this work, we survey the papers of the last fifteen years, starting from when the first applications of semantic technologies to video analysis appeared. The papers have been analyzed from different perspectives, leading to a taxonomy of the different approaches and of the adoption of the semantic web technology stack. As a result, some insights into current trends and future challenges are also provided.
Major Revision

Solicited Reviews:
Review #1
By Wankou Yang submitted on 05/Feb/2018
Major Revision
Review Comment:

This paper proposes a survey of many relevant papers concerning the application of Semantic Web technologies to video analysis. In the paper, the authors provide a taxonomy of SW technology adoption for video analysis, and the surveyed papers are analyzed according to this taxonomy. The paper has the following problem:
The authors provide a taxonomy of SW technology adoption: low-level, mid-level and high-level video analysis methods. The taxonomy appears to be based only on the different applications of SW technology, and the essential differences between the three categories, in terms of technology or other aspects, remain unclear. The authors should give a deeper analysis of SW technologies: what enables SW technology to solve complex problems, and is the taxonomy really reasonable? In addition, the differences between the categories in Table 1 should be discussed further; in fact, I found that the three categories involve the same technologies.

Review #2
By Lingling Zhang submitted on 05/Feb/2018
Major Revision
Review Comment:

This paper proposes a survey of many relevant papers concerning the application of Semantic Web technologies to video analysis. It is meaningful for further work in this field. However, the paper cannot be accepted in its current version because of the following problems:

(1) The authors choose 75 works published from 1991 to 2016, which are abundant and cover many perspectives on applying Semantic Web technologies to video analysis. However, are there any new works from 2017 and 2018? I think the authors should also refer to some recent works.

(2) In section 2, the authors introduce some related work on the Semantic Web. This part is a little chaotic. Is the part “From OWL rule support” closely related to your survey?

(3) In section 4, although the survey is relatively comprehensive, the descriptions are tedious (for example, the introduction of the work by Pantoja et al. [57]). This makes the paper difficult to read.

(4) In section 4, why do the authors not present the mid-level and high-level studies in separate sections?

(5) Many paragraphs are too long to read; please split them to make them more readable and more logical.

(6) There are many typos in this paper. They are listed below (please check carefully for similar errors throughout the paper):

Page 1:
so called => so-called
time consuming => time-consuming
stimated => estimated
“suspicious events”, =>“suspicious events,”

Page 2:
object oriented => object-oriented
However => However,
in literature => in the literature
knowledge based => knowledge-based
The selection process lead => The selection process leads
Resource Description Framework that use => Resource Description Framework that uses

sub properties => sub-properties
previous version => the previous version

“materialization". => different “ characters, please also change it to “materialization.”

the features attributes => the features of attributes

with other state of art approach => with other state of art approaches
a observed area => an observed area
contain specification => contain a specification
getting out of car => getting out of the/a car

100% accuracy . => 100% accuracy. (extra space)
rom the the i-LIDS => rom the i-LIDS (extra the)

the bounding box that enclose => the bounding box that encloses

One of the first ontology-based method => One of the first ontology-based methods

The experimentation results => The experimental results

using a 85 hours => using an 85 hours

composed by => composed of
12 different kind of events like => 12 different kinds of events like
into temporal, spatial or logic combination => into a temporal, spatial or logic combination
via web browser => via a web browser

for identify 9 different => for identifying 9 different
at enhance medical students => at enhancing medical students

where they adapt => where they adopt

Review #3
By Guilin Qi submitted on 05/Apr/2018
Major Revision
Review Comment:

This paper presents a survey on knowledge-based computer vision, especially on the problem of video analysis with prior knowledge. By giving a detailed introduction to video analysis using semantic web technologies, the authors intend to provide researchers in the computer vision community with useful information on how to exploit prior knowledge in semantic web form. The paper starts with a quick overview of the ontology languages of the semantic web and of work on symbolic reasoning in the semantic web field. Subsequent sections discuss many relevant works on applications of the semantic web and the challenges in the field of video analysis.

The paper provides a survey on the use of ontologies for video analysis. However, the paper is not well organized, the analysis of the relevant works lacks comparison, and so on. In the following, I provide some indications of how to revise the paper.

#1 In the first section, the authors classify video analysis into three levels (low-level, mid-level and high-level) according to the depth of the analysis of the video, but section 4 classifies the different works into four categories. So, which level should “Video retrieval applications” and “Multimedia visual content annotation” be assigned to? This is an inconsistency in the organization of the paper.

#2 This paper focuses only on the application of SW in video analysis. It would be interesting to look back toward the beginning and see which of the original ideas have blossomed in the computer vision field as a whole, rather than confining the survey to video analysis.

#3 The authors claim that “this is the first work focusing on semantic web technologies applied to video analysis problems”. However, there are related works that have already surveyed this field, namely surveillance analysis and multimedia retrieval [1,2]. The authors should therefore revise this claim.

#4 In section 2, the paper gives an overview of the semantic web. It would be helpful to readers if the authors provided some references to relevant papers on applying SW technologies to computer vision.

#5 This paper should summarize how reasoning and ontologies are used in video analysis, for example with a flowchart, which provides a highly visual and easily understood way of representing a system's flow of logic. For each method introduced, the authors should discuss in more detail how ontologies and reasoning are used in computer vision; it would be good to give some examples here.

#6 A question to the authors: what is the main difference between image retrieval and video retrieval? If the authors consider video retrieval to be a conceptual extension of image retrieval into the video domain, then I think many knowledge-based works from traditional image fields should be considered in this survey.

#7 For some of the reviewed papers, it is not clear whether the proposed method would perform better than the alternatives. That is, each class of relevant papers should come with a detailed comparison showing the motivation, the technology, the results and so on. A classification of usage types, aims, and purposes would be very helpful here.

#8 A fair comparison, under similar circumstances, with traditional methods (e.g., [7]) is virtually absent. Comparing with these traditional works would help verify the effectiveness of the knowledge-based methods.

#9 Some references in the references section are incomplete. Some works have not been mentioned in this paper, e.g., [3-9].

Overall, I think that the authors should revise the paper to provide a comprehensive overview of knowledge-based video analysis.

1. P. Kannan, P. Shanthi Bala, and G. Aghila. A comparative study of multimedia retrieval using ontology for semantic web. In Advances in Engineering, Science and Management (ICAESM), 2012 International Conference on, pages 400–405, March 2012.
2. Chris Poppe, Gaëtan Martens, Pieterjan De Potter, and Rik Van De Walle. Semantic web technologies for video surveillance metadata. Multimedia Tools Appl., 56(3):439–467, February 2012.
3. G.C. Stein, J. Rittscher, A. Hoogs, "Enabling video annotation using a semantic database extended with visual knowledge", Multimedia and Expo 2003. ICME '03. Proceedings. 2003 International Conference on, vol. 1, pp. I-161, 2003.
4. Cees G. M. Snoek, Bouke Huurnink, Laura Hollink, Maarten de Rijke, Guus Schreiber, Marcel Worring, "Adding Semantics to Detectors for Video Retrieval", Multimedia IEEE Transactions on, vol. 9, pp. 975-986, 2007, ISSN 1520-9210.
5. A. Hoogs, R. Collins, "Object Boundary Detection in Images using a Semantic Ontology", Computer Vision and Pattern Recognition Workshop 2006. CVPRW '06. Conference on, pp. 111-111, 2006.
6. D. Wang, D. Song, "Video Captioning with Semantic Information from the Knowledge Base", Big Knowledge (ICBK), 2017 IEEE International Conference on, pp. 224-229, 2017.
7. Marcus Rohrbach, "Attributes as semantic units between natural language and visual recognition", Visual Attributes, Springer, Cham, pp. 301-330, 2017.
8. Xiao-Yong Wei, Chong-Wah Ngo, Yu-Gang Jiang, "Selection of Concept Detectors for Video Search by Ontology-Enriched Semantic Spaces", Multimedia IEEE Transactions on, vol. 10, pp. 1085-1096, 2008, ISSN 1520-9210.
9. Y. Yu, H. Ko, J. Choi, et al., "End-to-end concept word detection for video captioning, retrieval, and question answering", Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 3261-3269, 2017.