Identifying Topics from Micropost Collections using Linked Open Data

Tracking #: 1961-3174

Authors: 
Ahmet Yildirim
Suzan Uskudarli

Responsible editor: 
Krzysztof Janowicz

Submission type: 
Full Paper
Abstract: 
The extensive use of social media for sharing and obtaining information has resulted in the development of topic detection models to facilitate the comprehension of the overwhelming amount of short and distributed posts. Probabilistic topic models, such as Latent Dirichlet Allocation, and matrix factorization based approaches, such as Latent Semantic Analysis and Non-negative Matrix Factorization, represent topics as sets of terms that are useful for many automated processes. However, the determination of what a topic is about is left as a further task. Alternatively, techniques that produce summaries are human comprehensible, but less suitable for automated processing. This work proposes an approach that utilizes Linked Open Data (LOD) resources to extract semantically represented topics from collections of microposts. The proposed approach utilizes entity linking to identify the elements of topics from microposts. The elements are related through co-occurrence graphs, which are processed to yield topics. The topics are represented using an ontology that is introduced for this purpose. A prototype of the approach is used to identify topics from 11 datasets consisting of more than one million posts collected from Twitter during various events, such as the 2016 US election debates and the death of Carrie Fisher. The characteristics of the approach with more than 5 thousand generated topics are described in detail. A human evaluation of topics from 30 randomly selected intervals resulted in a precision of 81.0% and F1 score of 93.3%. Furthermore, the topics are compared with topics generated from the same datasets with two different topic identification approaches. The potential of semantic topics to reveal information that is not otherwise easily observable is demonstrated with semantic queries of various complexities.
Tags: 
Reviewed

Decision/Status: 
Reject (Two Strikes)

Solicited Reviews:
Review #1
Anonymous submitted on 05/Oct/2018
Suggestion:
Major Revision
Review Comment:

The paper presents a methodology (S-BOUN-TI) for detecting topics from a collection of microposts by using Linked Open Data. This approach tags the microposts with relevant DBpedia concepts (using the TagMe tagger), builds their co-occurrence network, and runs a community detection algorithm to obtain a set of clusters that should represent the pertinent topics. The paper also introduces the Topico Ontology for representing topics in RDF.
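To make the pipeline summarized above concrete, here is a minimal sketch of its last two steps, under loud assumptions: the posts are taken to be already tagged with DBpedia-style identifiers (the tagging step is not reproduced), the `min_weight` threshold is hypothetical, and connected components are used as a simple stand-in for the community detection algorithm, which this review does not specify. It is not the authors' implementation.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_edges(tagged_posts, min_weight=2):
    """Count entity pairs tagged in the same post; keep pairs seen at least min_weight times."""
    counts = Counter()
    for entities in tagged_posts:
        # Sort so that each unordered pair always maps to the same key.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return {pair: w for pair, w in counts.items() if w >= min_weight}


def connected_components(edges):
    """Group entities joined by retained edges (stand-in for community detection)."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical tagged posts; the identifiers are illustrative only.
posts = [
    ["dbr:Donald_Trump", "dbr:Hillary_Clinton", "dbr:Debate"],
    ["dbr:Donald_Trump", "dbr:Hillary_Clinton"],
    ["dbr:Carrie_Fisher", "dbr:Star_Wars"],
    ["dbr:Carrie_Fisher", "dbr:Star_Wars", "dbr:Debate"],
]
topics = connected_components(cooccurrence_edges(posts))
```

With these toy posts, only the pairs that co-occur twice survive the threshold, so the sketch yields two clusters of two concepts each.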

I reviewed this paper some months ago and suggested a major revision. Given that the paper is presented as a new submission (and that I did not receive an answer to the previous review), I wrote a self-contained review that repeats the points from the previous review that are still valid.

The paper is well written and clear. The topic is significant and of potential impact.

The approach is not particularly novel, but it appears to be sound. The Topico ontology is interesting, but it seems to be used only as a way to represent the output, rather than for supporting the method.

The topics produced by the method appear to lack granularity. For example, many of the topics produced for the Trump-Clinton debate simply contain one or two concepts in addition to “Trump” and “Clinton”. I would appreciate some more examples of the kind of topics extracted and their granularity.

In my previous review I suggested a number of improvements. The authors addressed some of them, and I think that the current version of the paper is more solid and clearer. Unfortunately, the evaluation, which was the main reason why I suggested the major revision, does not contain additional information that would change my original judgment.

As I mentioned a few months ago, the evaluation does not seem to prove that the proposed method performs comparably to or better than the state of the art, and it seems to focus on evaluating the extraction of DBpedia concepts rather than the clusters. Indeed, since the concepts are linked to the posts with the state-of-the-art TagMe tagger, the novel contribution to be evaluated would be the creation of the clusters of concepts.

In particular, the paper compares the proposed approach (S-BOUN-TI) with two other methods: BOUN-TI and LDA. However, the outputs of BOUN-TI and LDA are quite different from the ones of S-BOUN-TI and the comparison does not yield particularly good evidence of the superiority of S-BOUN-TI. Indeed, the paper mentions that "The scores of S-BOUN-TI are somewhat lower. The nature of the topics as well as the annotation criteria are important to keep in mind while interpreting these results. [...]". And regarding LDA, it adds: "The topic representations of S-BOUN-TI and LDA are very different. [...] In order to get a rough idea regarding how similar topics produced by these methods, we compared their elements (in lower case form) using exact and substring matching."
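For readers unfamiliar with the comparison quoted above, the following is a rough sketch of what lower-cased exact and substring matching between the elements of two topics could look like. The function name, the input format (plain lists of strings), and the example elements are assumptions made for illustration; the paper's actual procedure may differ.

```python
def match_elements(topic_a, topic_b):
    """Compare two topics' elements in lower case.

    Returns (exact, substring): the set of elements shared verbatim, and the set of
    (a, b) pairs among the remaining elements where one string contains the other.
    """
    a = {element.lower() for element in topic_a}
    b = {element.lower() for element in topic_b}
    exact = a & b
    substring = {
        (x, y)
        for x in a - exact
        for y in b - exact
        if x in y or y in x
    }
    return exact, substring


# Hypothetical elements: one topic with entity labels, one with bare terms.
exact, substring = match_elements(
    ["Donald Trump", "Debate"],
    ["trump", "debate", "tax"],
)
```

Here "debate" matches exactly, "trump" matches "donald trump" as a substring, and "tax" matches nothing, which shows how coarse such a similarity signal is.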

While I agree that it is not straightforward to set up a valid evaluation, I can see some options that would yield more significant comparisons than the ones reported in the paper. A first possibility would be to use a metric for evaluating clusters, such as the Rand index, and compare the different methods against a manually created gold standard. A second option would be to conduct a user study with the aim of proving that your topic representation is perceived as more useful and understandable than current alternative approaches.
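As an illustration of the first option, the (unadjusted) Rand index can be computed by counting the pairs of items on which two clusterings agree. The sketch below assumes each concept is assigned to exactly one cluster in both the predicted output and the gold standard, which may not hold if topics overlap; the label lists are hypothetical.

```python
from itertools import combinations


def rand_index(labels_pred, labels_gold):
    """Fraction of item pairs on which two hard clusterings agree.

    A pair counts as an agreement when both clusterings put the two items in the
    same cluster, or both put them in different clusters.
    """
    if len(labels_pred) != len(labels_gold):
        raise ValueError("both labelings must cover the same items")
    if len(labels_pred) < 2:
        raise ValueError("at least two items are needed to form a pair")
    agreements = 0
    pairs = 0
    for i, j in combinations(range(len(labels_pred)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_gold = labels_gold[i] == labels_gold[j]
        agreements += same_pred == same_gold
        pairs += 1
    return agreements / pairs


# Hypothetical cluster labels for four concepts.
gold = [0, 0, 1, 1]
pred = [0, 0, 1, 2]
score = rand_index(pred, gold)
```

In this toy case the two labelings disagree on one of the six pairs, so the score is 5/6. A chance-corrected variant (the adjusted Rand index) is usually preferable when the number of clusters differs between methods.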

In short, my understanding is that the experiments reported in the paper fail to prove that S-BOUN-TI performs as well as or better than state-of-the-art methods, or that it produces a more comprehensible representation of the topics. I believe this may actually be the case, but it should be proved with a formal evaluation.

The paper has merits and I would encourage the authors to do some more work on the evaluation (producing new empirical evidence) and re-submit.

Minor remarks
Please introduce all acronyms, even if they are fairly well known (e.g., LDA, NMF, LSA).

Review #2
By Carlos Badenes-Olmedo submitted on 16/Oct/2018
Suggestion:
Accept
Review Comment:

The review of this new version of the article would have been much more agile if the notes with the authors' comments to the first review had been available.

Some points were identified as critical in the first review: (1) why a new ontology is required to represent topics, and (2) how to design the evaluations to compare this approach with algorithms based on probabilistic topic models.

(1): A review of existing ontologies handling events is shown. The creation of the Topico ontology is justified by considering that a topic does not always have temporal information, so it cannot be considered an event. Along these lines, it would be advisable to take a look at the Submissions:ParticipantRole design pattern [1] and the Nature Core Ontology [2].

(2): The evaluation focuses on the quality and utility of the generated topics. It measures (a) relevance by means of manual annotators, (b) usefulness by means of the use of support functions and (c) similarity to other approaches.
a) A larger number of evaluators (preferably not involved in the work) is recommended to give more strength to the inter-annotator agreement ratio, and confidence in the results obtained.
b) The comparison based on the use of supporting functions is not appropriate. All algorithms, including the one presented in this article, use them in their construction. Some do it before discovering the topics (e.g. S-BOUN-TI), and others would need to do it later (e.g. WLB).
c) Perhaps a comparison with BTM [3] would have been more suitable than LDA for short texts.

There are areas to be improved, mainly in the evaluations, but I believe that in general terms the work is suitable for publication.

[1] - http://ontologydesignpatterns.org
[2] - http://ns.nature.com/terms
[3] - Cheng, Xueqi, Xiaohui Yan, Yanyan Lan and Jiafeng Guo. “BTM: Topic Modeling over Short Texts.” IEEE Transactions on Knowledge and Data Engineering 26 (2014): 2928-2941.

Review #3
By Gengchen Mai submitted on 27/Nov/2018
Suggestion:
Major Revision
Review Comment:

This paper proposes S-BOUN-TI, a topic extraction algorithm. I appreciate all the efforts the authors have made to improve the quality of this paper. But here is my major concern:

In general, this is an algorithm paper. But the proposed topic extraction method, S-BOUN-TI, needs 7 parameters to extract the topics and publish them as Linked Data. In this revision, the authors added a couple of paragraphs in the appendix to discuss the effect of different parameters. But the discussion is only qualitative, which is not enough. For an algorithm paper, it is very important to try different parameter combinations and quantitatively evaluate the corresponding results to show how each parameter affects the performance of the model. This is the most important part of an algorithm paper and should not be relegated to the appendices.

I also understand why the authors did not do this in the first place. S-BOUN-TI has 7 parameters, which means a huge number of parameter combinations. Quantitatively evaluating each model with different parameter combinations means a lot of work. But the aim of an algorithm paper is to propose a method that is general and can be applied to other datasets. If the authors do not provide general guidance for future users on how to select the parameters, it will be difficult for them to use this algorithm.
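To give a sense of the scale involved, the sketch below enumerates a full grid over seven parameters and contrasts it with a cheaper one-at-a-time sensitivity sweep. The parameter names (`param_1` to `param_7`), the three candidate values per parameter, and the defaults are all hypothetical placeholders, since the actual parameters and their ranges are not listed in this review.

```python
from itertools import product


def parameter_grid(candidates):
    """Yield every combination of candidate values as a dict (full grid)."""
    names = sorted(candidates)
    for values in product(*(candidates[name] for name in names)):
        yield dict(zip(names, values))


def one_at_a_time(defaults, candidates):
    """Yield the default setting, then settings that change a single parameter."""
    yield dict(defaults)
    for name in sorted(candidates):
        for value in candidates[name]:
            if value != defaults[name]:
                yield {**defaults, name: value}


# Hypothetical: seven parameters with three candidate values each.
candidates = {f"param_{i}": [0.1, 0.5, 0.9] for i in range(1, 8)}
defaults = {name: 0.5 for name in candidates}

full_grid = list(parameter_grid(candidates))
sweep = list(one_at_a_time(defaults, candidates))
```

Under these assumptions the full grid contains 3^7 = 2187 settings, while the one-at-a-time sweep needs only 15 runs, at the cost of ignoring interactions between parameters.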

I suggest the authors put more effort into making their algorithm more solid rather than discussing the future applications of this method.