Timeline Binder: A Caption-based Approach for Creating Semantic Video Timeline

Tracking #: 2377-3591

Muhammad Faran Majeed
Khalid Saeed
Alam Zeb
Muhammad Faisal Abrar

Responsible editor: 
Harald Sack

Submission type: 
Full Paper
Video is an important type of multimedia with characteristics such as rich content and little prior structure. Major applications such as surveillance, medicine, education, entertainment and sports rely heavily on video, and Internet-protocol video dominates the world's Internet traffic. A large number of videos are created and uploaded on a daily basis, and searching them for topics of interest is a cumbersome task. People are usually interested in small portions of a video rather than the whole video, so it is better to look for short but relevant segments. Our proposed system, Timeline Binder, is an attempt to tackle this problem. For Timeline Binder, we created our own dataset of online videos with subtitles. From the subtitle files, a concept vector is obtained. If the concept vector contains a match for a user's query tag, the timed segments of all videos related to the matched concept are combined into a single timeline, merging all related segments into a single video. In this research, we present performance measurements comparing Timeline Binder with other available applications, namely Google2SRT and NR8. The results show that Timeline Binder performs better than Google2SRT and NR8.
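The pipeline the abstract describes (parse subtitles, index concepts with their timed segments, match a query tag, and bind the matching segments into one timeline) can be sketched roughly as follows. This is an illustrative sketch only; the data shapes and function names are assumptions, not the authors' implementation.

```python
from collections import defaultdict

def build_concept_index(subtitle_cues):
    """Map each concept (word) to the timed segments in which it occurs.

    `subtitle_cues` is assumed to be a list of
    (video_id, start_seconds, end_seconds, text) tuples parsed from SRT files.
    """
    index = defaultdict(list)
    for video_id, start, end, text in subtitle_cues:
        for word in text.lower().split():
            index[word].append((video_id, start, end))
    return index

def bind_timeline(index, query_tag):
    """Return all segments matching the query tag as one ordered timeline."""
    return sorted(index.get(query_tag.lower(), []))
```

A player front-end would then play the returned `(video_id, start, end)` segments back to back as a single virtual video.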


Solicited Reviews:
Review #1
By Lyndon Nixon submitted on 10/Dec/2019
Review Comment:

This manuscript presents a Web-based application for automatically generating a linear timeline of (YouTube) video fragments which share a referenced concept. As the authors correctly note, given the scale of online video assets, such tools can indeed be very valuable for allowing users to quickly find parts of a larger video relevant to their search, or to gather an overview of different videos which reference the same search term. The main problem with this paper is that the approach is too simplistic for consideration in a journal, and the presented work has no relevance to the Semantic Web. The authors need more awareness of the state of the art in this area (semantic multimedia fragments are part of the work at http://www.eurecom.fr/ds/media-hyperlinking, for example, with the demo HyperTED now active again for fragment-based search across TED Talks videos).

Comments for the authors:
- fix typos in your manuscript before submission to any conference or journal. e.g. "real world" instead of "real word"
- the "user query tags" are unclear, can they do a free text search or is there a vocabulary of "tags" loaded into the Web application and they can only select from the tags that are present?
- "based on the Semantic Web" is patently untrue - you earlier mention XML as a Semantic Web technology, which it is not (it is simply a possible serialisation format for Semantic Web data, just like JSON); you do not use RDF(S)/OWL or any other knowledge representation or ontology (not even Linked Data).
- the Related Work section does not present academic or research publications in the field. Work goes back at least to Hausenblas, M. (2009), "Interlinking multimedia: how to apply linked data principles to multimedia fragments". See also Schandl, B., Haslhofer, B., Bürger, T., Langegger, A., & Halb, W. (2012), "Linked Data and multimedia: the state of affairs", Multimedia Tools and Applications, 59(2), 523-556, as well as Nixon, L., & Troncy, R. (2014), "Survey of semantic media annotation tools for the web: towards new media applications with linked media", European Semantic Web Conference, pp. 100-114.
- the Concept Vector is unclear; it sounds rather like (concept, video fragment reference) pairs, which is not a vector. The whole approach to representing the content of the video fragments is too simplistic for a journal; you yourselves acknowledge you would need at least NLP here to provide a better association of natural language terms to fragments, e.g. stemming/normalisation. For a journal, one would expect more advanced techniques to be in place (e.g. word embeddings, BERT).
- the list of explanations of technologies in Section 3 is unnecessary. One can assume a technical journal reader knows what is JavaScript etc. and references/footnotes can be used to point the interested reader to further information.
- the evaluation is too limited. Two metrics are considered, each in comparison to one other tool. However, considering the purpose of a Timeline Binder, an evaluation should focus on user satisfaction with the generated timeline, or speed of finding a relevant video fragment through search compared to using standard YouTube, for example.
- creating a timeline of video fragments needs additional considerations not mentioned in this paper, such as determining the best start/end time for a fragment (using the timing given in an SRT file will typically lead to very short fragments of a few seconds, too short for a viewer to get the context of the video, while expanding the selection will need some sort of 'break' determination, such as paragraph detection in the transcript text or audio pauses in the video)
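The reviewer's point about representation can be made concrete: a genuine vector representation would allow fragments to be compared numerically rather than only matched exactly. A deliberately minimal sketch, using term-frequency vectors and cosine similarity (the function names are invented here, and a real system would use embeddings as the reviewer suggests):

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector for a fragment's transcript text."""
    return Counter(text.lower().split())

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term-frequency vectors."""
    if not v1 or not v2:
        return 0.0
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(v1) * norm(v2))
```

With such a representation, a query could retrieve fragments *similar* to the query text, not only fragments containing the exact keyword.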
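The 'break' determination described above could be approximated, for instance, by merging consecutive SRT cues that are separated by only a short pause. The threshold and function name here are illustrative assumptions, not part of the paper:

```python
def merge_cues(cues, max_gap=2.0):
    """Merge consecutive SRT cues into fragments long enough to give context.

    `cues` is a list of (start, end) pairs in seconds, sorted by start time.
    Cues separated by at most `max_gap` seconds are joined; a longer pause is
    treated as a 'break' between fragments.
    """
    fragments = []
    cur_start, cur_end = cues[0]
    for start, end in cues[1:]:
        if start - cur_end <= max_gap:
            cur_end = end  # no audible pause: extend the current fragment
        else:
            fragments.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    fragments.append((cur_start, cur_end))
    return fragments
```

A more robust variant would combine this pause heuristic with paragraph detection in the transcript text, as the reviewer suggests.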

If the authors are interested in this area of research, I encourage them to look more at the past work in the area (MediaMixer project is a good starting point, with the HyperTED demo as well as VideoLecturesMashhup using video fragment's semantic annotation to support better topic-centered video browsing and viewing) and extend their current approach with semantics.

Review #2
By Henning Agt-Rickauer submitted on 07/Jan/2020
Review Comment:

The article presents an approach to video segment retrieval based on subtitle concept/keyword matching that combines the results into a single video. The system (called Timeline Binder) is based on online videos (e.g. YouTube), the extraction of existing subtitles (using existing tools), keyword extraction using dictionaries, and a keyword search. A web interface was developed to display datasets (YouTube URLs) and search results using a video player. System performance was measured by determining video access times.

First of all, the article was difficult to understand because of the many language mistakes. It requires a significant improvement in the English language and proofreading by a native speaker. Second, the article hardly fits the audience of the Semantic Web Journal, as the system does not use any Semantic Web technology at all (e.g. no Web Annotations for subtitles or for video tagging, no RDF for data storage or linking of concepts).

The structure of the paper is not obvious. A longer part of Section 1 (Introduction) describes very general web concepts that readers of the SWJ should be familiar with. The introduction does not contain a problem statement, which is instead described in Section 2 (Related Work) after reviewing several subtitle extraction/search tools. The comparison of the tools does not use any criteria, only pros and cons. Section 3 (Methodology) mainly describes the system design, which is also presented in Section 4 (Results and Discussion) together with algorithms. Only a small part of this section is devoted to evaluation.

I think the originality of the paper is low because the whole approach focuses entirely on subtitles and a combination of fairly simple tools: videos must already have subtitles; the retrieval of subtitles is done with Google2SRT; the keyword extraction is based only on stopword removal and recognition of compound words; and the search is based on a simple keyword match (it is said that a concept vector is created, but it is difficult to understand what it looks like, and it appears to be just a data structure for keywords and time segments). The resulting segments of different videos are simply concatenated. Is this really a desired search result: video segments without context and without a relevant order?
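The keyword-extraction step summarised above (stopword removal plus compound-word recognition) amounts to something like the following sketch. The stopword list and the compound handling are assumed for illustration, not taken from the paper:

```python
# Toy stopword list; a real system would use a full dictionary.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def extract_keywords(text, compounds=()):
    """Stopword removal plus simple compound-word recognition.

    `compounds` is an assumed list of known multi-word terms,
    e.g. ("machine learning",), which are joined into single tokens
    before stopword filtering.
    """
    lowered = text.lower()
    for c in compounds:
        lowered = lowered.replace(c, c.replace(" ", "_"))
    return [w for w in lowered.split() if w not in STOPWORDS]
```

Even this sketch shows why the approach is brittle: without stemming or normalisation, "dog" and "dogs" remain distinct keywords.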

It is impossible to assess the significance of the results. There is a GitHub repository with the source code, but descriptions are completely missing: no hints on installation, deployment, or use of the system. An online demo is not available. The authors write that they want to address video search within exabytes of data, but the system was tested on a set of 20 YouTube URLs (how were they selected?). The approach assumes the existence of subtitles, which is mostly not the case with arbitrary online videos, and therefore does not scale. The evaluation is meaningless because it only measures the time it takes to download subtitles and to create a video timeline, both of which depend heavily on the network connection. At the very least, it should be examined how well the retrieved results match the intent of the user. Most of the system description concentrates on the creation of datasets, while the actual retrieval/search part is described only superficially. The authors should also look at video search and retrieval systems that use technologies other than subtitles.

In summary, I recommend rejecting the paper because of the difficulty in understanding the approach, the missing relevance to the Semantic Web Community, the lack of technical depth, and the inability to reproduce the results.

Review #3
Anonymous submitted on 07/Jun/2020
Review Comment:

(1) originality
The paper is unbalanced. For instance, it makes very broad statements about the growth of video traffic (Sections 1.1, 2.1 and 5). Only on page 6 do we find the potential contribution of the research. However, the section on Related Work is very generic and only seems partly relevant to the work. The contribution (Section 2.1) is to generate a single video based on various semantically similar videos; the related work, however, focuses only on browser extensions for searching through video captions.

Also, there seems to be a disconnect between the scope of the problem ("poorly-filmed, long running and unedited content" on the web) and the solution: (i) the dataset consists of a selection of professional and highly edited content, and (ii) URLs of videos must be added manually to "a list" (4.1.1). The authors should have reflected critically on this matter. The paper also fails to mention the potential user group and connected user stories, which makes the application domain rather superficial.

(2) significance of the results

Use of Semantic Web technologies is mentioned in Section 1.1, albeit superficially, and is seemingly not considered part of the solution to the problem statement.

Details of the technical solution are lacking. For instance, 4.1.3 states: "Also, for each video a list of "start" and "end" time is obtained according to the timed segments corresponding to the concept." This raises the question of how the system determines the length of a segment, and of how this can be evaluated. The same goes for the evaluation, arguably the most critical part of the paper. It says "Figure 9 compares our developed system with NR8 in terms of timeline access time from already defined dataset", but it does not mention *how* the evaluation took place. It is also not clear whether the usability of the front-end was evaluated.

(3) quality of writing.
There are several repetitions, and some texts under Related Work are copied directly from the web pages of the respective software solutions. Statements like "it's obvious" (3.2.2) need to be substantiated.