Video Representation and Suspicious Event Detection using Semantic Technologies

Tracking #: 2342-3555

Ashish Singh Patel
Giovanni Merlino
Dario Bruneo
Antonio Puliafito
Om Prakash Vyas
Muneendra Ojha

Responsible editor: 
Armin Haller

Submission type: 
Full Paper
Abstract:
Due to the widespread deployment of surveillance systems and IoT applications, the amount of surveillance data is rising massively. Storing and analyzing video surveillance data is a significant challenge, requiring video interpretation and event detection along with the related context. Low-level features, including shape, texture, and color information, are extracted from the multimedia content and represented in symbolic form. In this work, a methodology is proposed that extracts the salient features and properties typical of the surveillance domain using machine learning techniques, and represents the information using a domain ontology tailored explicitly to the detection of certain activities. An ontology is developed to include concepts and properties applicable to the surveillance domain and its applications. The extracted features are represented as Linked Data using this ontology. The proposed approach is validated with an actual implementation and evaluated by recognizing suspicious activity in an open parking space. Suspicious activity detection is formalized through inference rules and SPARQL queries. Semantic Web technology has thus proven to be a remarkable toolchain for interpreting videos, opening novel possibilities for video scene representation and the detection of complex events without any human involvement. To the best of our knowledge, there is no existing method that can represent frame-level information of a video in a structured form and perform event detection, reducing storage and enhancing semantically aided retrieval of video data. A video dataset of six different, and unusual, suspicious activities has also been built, which can be useful for solving activity-recognition problems in other smart parking scenarios.
Major Revision

Solicited Reviews:
Review #1
By Leslie Sikos submitted on 17/Nov/2019
Major Revision
Review Comment:

The paper presents an interesting effort for utilizing ontology-based video scene descriptions and rule-based formalisms for reasoning over video scenes to detect suspicious activities, with a use case describing a car park.

The namespace of the proposed ontology is missing and the actual description logic used for the formal grounding of the ontology is not stated. This would be particularly important to understand the mathematical constructors utilized in the formalism, which, along with the computational properties of the proposed SWRL rules, would indicate overall reasoning complexity.

Without the ontology file, the proposed ontology cannot be evaluated at all, and cannot be implemented to verify the claims of the paper. It is not clear why sameObject is defined instead of using the standard owl:sameAs as part of the descriptions, which should be justified.
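For illustration, the standard identity assertion the reviewer refers to could be stated as follows (the individual names and the `ex:` namespace are placeholders, since the paper does not publish its namespace):

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.org/surveillance#> .  # placeholder namespace; not stated in the paper

# Asserting that two detections denote the same real-world object
ex:object1 owl:sameAs ex:object2 .
```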

While spatiotemporal object relations are declared (such as for overlapping objects), alignment with the Allen relations and de facto standard vocabularies that define these and related relations, such as OWL-Time and the SWRL Temporal Ontology, are not explained (they are not even mentioned). Related KOSes, such as the Video Structure Ontology and the Video Ontology, are not considered.
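As a sketch of what such an alignment could look like, Allen's "overlaps" relation between two activity intervals can be expressed directly with OWL-Time (the interval names and the `ex:` namespace are illustrative, not taken from the paper):

```turtle
@prefix time: <http://www.w3.org/2006/time#> .
@prefix ex:   <http://example.org/surveillance#> .  # placeholder namespace

# Two activity intervals related by Allen's "overlaps", via the OWL-Time vocabulary
ex:loiteringInterval a time:ProperInterval .
ex:departureInterval a time:ProperInterval .
ex:loiteringInterval time:intervalOverlaps ex:departureInterval .
```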

For comparing two identical (moving or stationary) objects, the same shape and size are set as prerequisites in the paper, even though there are feature descriptors that are scale- and/or rotation-invariant, such as the Histogram of Oriented Gradients (HOG), Rotation-Invariant Histogram of Oriented Gradients (RIHOG), and Ordinal Pyramid Coding (OPC), which would allow these to be different.

Isn’t “developing an ontology that represents an object along with its position in every frame” overly resource-intensive? How does the proposed approach perform in (near) real-time applications? Is a frame-by-frame semantic description actually optimal for car park videos?

In line 6 of figure 4, only a symbolic URL is used. Why isn’t the actual URL provided?

“Text from the video is extracted and then converted to Resource Description Framework (RDF) using semantic web technologies and NLP.” - What about semantic enrichment? How are namespaces defined for the terms used in the description?
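A minimal sketch of how namespaces might be declared for such RDF descriptions (the `sur:` prefix and the frame/object terms are hypothetical, since the paper does not provide them):

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix sur: <http://example.org/surveillance#> .  # hypothetical; the ontology namespace is not given in the paper

# Frame-level description of a detected object, with a typed timestamp
sur:frame42 a sur:Frame ;
    sur:containsObject sur:car17 ;
    sur:timestamp "2019-11-17T10:15:00"^^xsd:dateTime .
```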

There are several wording issues in the paper. For example, “to reach the human-level perception for various scenarios” is an exaggeration. “LSCOM, SROIQ which are not completely based on description logic.” is a stranded sentence that does not make sense. Some further issues include the following:

“and linked with domain knowledge to acquire human-level perception” → “and linked with domain knowledge to acquire interpretation capabilities with software agents”

“Complex events, which are rare in nature, are hard to train” → “Those complex events that are rare are hard to train”

“and require massive computational capabilities” → “and massive computational capability requirements”

“can be used to reason spatial and temporal reasoning” → “can be used to perform spatial and temporal reasoning”

There are grammatical errors and typos throughout the manuscript, which should be corrected, for example, the following:

In Table 2, “May be” should be “Potentially”

“in a video scenes” → “in video scenes”

“in generation of” → “in the generation of”

“a description logic based knowledge representations, can be used for” → “a description logic-based knowledge representation, which can be used for”

Review #2
Anonymous submitted on 09/Dec/2019
Minor Revision
Review Comment:

This article presents and evaluates a method for semantically representing suspicious events in videos using a newly developed ontology, Linked Data, and semantic rules. The paper is generally interesting and easy to follow, and the idea of using rules makes it possible to extend the approach to new activities relatively easily when domain knowledge is available. The paper has some issues relating to the size and homogeneity of the data used, and the evaluation method has some flaws. In particular, there is no comparison of the proposed approach with simple baselines, and it seems that the evaluation is partially performed on the data that was used for creating the rules. Additionally, even though the related-work section covers Semantic Web approaches well, it seems to ignore existing image-analysis work (even though the authors seem to be aware of such models, as they rely on the YOLO object recognition model). The paper also has multiple grammar issues and requires some proofreading, particularly in the early paragraphs.

I believe that some small additional work is required. In particular: 1) The literature review needs to be extended to non-semantic work; 2) The dataset should be more varied (or the authors should make their paper title/content/limitations clearer to reflect the domain); and 3) The model evaluation should be corrected/improved.

Detailed comments:
- Semantic extraction of high-level concepts and objects from videos and pictures already exists in particular since the release of ImageNet [1].
+/- Although an RDF representation of a video may be smaller than storing the images, storing the images is still necessary when events need to be verified. Therefore, I am not convinced about the storage argument in practice.
+/- The dataset appears to be very limited, making the results specific to the use case...
- Information is missing about the dataset. Are all the videos from the same parking lot? How does the approach generalise?
+ The related work in the Semantic web area is well described...
- ...However, there is no information about existing work in object recognition and image processing that is able to recognize and associate objects over time (e.g., [2]), even though the authors use YOLO.
+/- Section 3 should be introduced more clearly, it starts with definitions without any introduction.
- The parameters of the rules 'are obtained using several independent experiments'; what are these experiments? How do these measured values generalise?
+/- How were the activities selected (i.e., why did the authors select these particular types of activities)? Is there a reference for 'suspicious' activities in parking lots?
+/- The efficiency-analysis result is partially flawed, as it shows that the extraction/identification of objects in frames requires GPUs.
- The accuracy evaluation uses some of the data from which the rules were created. This is a major issue particularly for the settings that were decided for creating the rules.
- There is no comparison with baselines (How does the approach compare with simpler/non-semantic approaches?).

[1] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vision 115, 3 (December 2015)
[2] S. Bouindour, R. Hu and H. Snoussi, "Enhanced Convolutional Neural Network for Abnormal Event Detection in Video Streams," 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), Sardinia, Italy, 2019

Review #3
Anonymous submitted on 24/Dec/2019
Major Revision
Review Comment:

The main technical contribution is the design and development of a methodology for generating effective video content signatures and supporting suspicious event detection.
Although the basic idea is technically sound and novel, the research motivation and the technical solution are not presented in a comprehensive fashion. It would be great to provide details about the following:

1) What is the importance of the research problem explored by the study, and what impact can the study make in the real world?
2) Compared to existing approaches, what are the main pros and cons?

On top of these problems, a few issues need to be addressed properly before proceeding to the next round of review:

#Issue 1: In general, a journal publication should give a very detailed introduction to the related studies published in recent years. However, Section 2 of the article is not comprehensive enough to provide good coverage. Most troublesome is the significant gap in the coverage of the state of the art. Several important pieces of literature are not cited, for example:

Generating video descriptions with latent topic guidance, IEEE T-MM 2019
Which information sources are more effective and reliable in video search, ACM SIGIR 2016
Concept-based video retrieval, Foundations and Trends in Information Retrieval, 2009
Modality mixture projections for semantic video event detection, IEEE T-CSVT, 2009
...and many more.

#Issue 2: The basic design principles and many details of the proposed methodology are not available. For example, what are the key advantages and disadvantages of the proposed methodology? Further, it would be good to see more details about the algorithm and the parameter tuning for rule-based learning.

#Issue 3: The experimental study is not clear or comprehensive enough to allow a full assessment of the research quality and impact. 1) It is unclear what the main task(s) of the empirical study are and why the current evaluation metric is used. Further, a detailed introduction to the basic procedure of the experimental study and the related principles is also very important.

In sum, the major weakness of the paper in its current form is that its concerns and contributions seem mostly empirical and conceptual in nature, given that there has been extensive work in the related research domain. Many portions are not clear or need further amendment and revision. The authors are strongly encouraged to take all of the suggestions given above into account during the next round of revision, and I wish them good luck in this next phase of work.