Semantic Abstraction for Generalization of Tweet Classification: An Evaluation on Incident-Related Tweets

Tracking #: 1023-2234

Authors: 
Axel Schulz
Christian Guckelsberger
Frederik Janssen

Responsible editor: 
Guest Editors Smart Cities 2014

Submission type: 
Full Paper
Abstract: 
Social media is a rich source of up-to-date information about events such as incidents. The sheer amount of available information makes machine learning approaches a necessity to process this information further. This learning problem is often concerned with regionally restricted datasets such as data from only one city. Because social media data such as tweets varies considerably across different cities, the training of efficient models requires labeling data from each city of interest, which is costly and time consuming. To avoid such an expensive labeling procedure, a generalizable model can be trained on data from one city and then applied to data from different cities. In this paper, we present Semantic Abstraction to improve the generalization of tweet classification. In particular, we derive features from Linked Open Data and include location and temporal mentions. A comprehensive evaluation on twenty datasets from ten different cities shows that Semantic Abstraction is indeed a valuable means for improving generalization. We show that this not only holds for a two-class problem where incident-related tweets are separated from non-related ones but also for a four-class problem where three different incident types and a not-incident related class are distinguished. To get a thorough understanding of the generalization problem itself, we closely examined rule-based models from our evaluation. We conclude that on the one hand, the quality of the model strongly depends on the class distribution. On the other hand, the rules learned on cities with an equal class distribution are in most cases much more intuitive than those induced from skewed distributions. We also found that most of the learned rules rely on the novel semantically abstracted features.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
By Frank Ostermann submitted on 02/Mar/2015
Suggestion:
Accept
Review Comment:

Having compared the new version with the old one, I find that the authors have substantially revised the paper and addressed all crucial issues. My only remaining remark concerns Figures 1 and 2: these are stacked bar charts, not histograms.

Review #2
By Alejandro Llaves submitted on 10/Mar/2015
Suggestion:
Accept
Review Comment:

The authors have addressed all the comments and suggestions from my previous review. Therefore, I think that the paper is ready to be accepted.

Review #3
By Fabrizio Orlandi submitted on 15/Jun/2015
Suggestion:
Accept
Review Comment:

*Originality:*

This paper is an extension of the work presented at the 5th workshop on Semantics for Smart Cities. Moreover, this is a revised version of the article following a first round of reviews.
The paper describes an interesting contribution and it is definitely relevant for the special issue.
The authors introduce an approach, called Semantic Abstraction, that makes use of Linked Data features for a generalised classification of social media text (tweets).
Linked Data based features such as DBpedia categories and entity types (together with other, more traditional features extracted from text) are evaluated on Twitter data collected from ten different large cities. The proposed scenario focuses on classifying incidents from tweets, and the proposed approach aims at: (i) improving the classification of datasets derived from only one city; (ii) enabling training and testing to be performed on datasets from different cities.

---
*Significance of the Results:*

According to the results of the experiments, the approach is promising and relevant, also considering its comparison with the state of the art. The evaluation is clear, sound and comprehensive. The datasets of the experiments conducted are provided by the authors for reproducibility.

The authors have taken most of the reviewers' comments into consideration and improved the paper considerably. I recommend accepting this paper for publication.

My main suggestion would be to include a short discussion on how to use different Linked Data sources (not only DBpedia but also domain specific ones) and additional Linked Data based features (not only categories and types), as suggested by one of the reviewers. The authors mention this only as future work but this is an interesting challenge and the authors' opinion on this would add value to the paper.

In this work, tweets are represented as a set of words, using unigrams and bigrams. What about trigrams and higher-order n-grams? Please add a short justification for this choice.

---
*Quality of the writing:*

The paper is well written, clear and properly structured. It has been improved according to the reviewers' feedback.
A few minor remarks:

* Section 1:

In: "Semantically low ("accident" and "car collision") or more abstract level ("I-90" ....)."
the assignment of examples to levels seems to be reversed.

"The first involve training and testing our models on data" -> "The first involves training and testing our models on data"

* Section 2:

"Listing 1: Extracted DBpedia..." (not DBPedia)

** Section 2.2:
The words "proper" and "common" could be emphasized using italic font.

* Section 3:
** Section 3.2:
"explanation marks" -> "exclamation marks"

* Tables 4 & 5: add "F-Measure" to the captions of the tables or to the tables themselves.

** Section 4.3:
"we investigated two crucial questions regarding the properties the different datasets." -> "we investigated two crucial questions regarding the properties of the different datasets."

* Table 8: Align the caption as in Table 9.

* References: refs. [25] and [26] are duplicates.