Semantic Abstraction for Generalization of Tweet Classification

Tracking #: 926-2137

Authors: 
Axel Schulz
Christian Guckelsberger
Frederik Janssen

Responsible editor: 
Guest Editors Smart Cities 2014

Submission type: 
Full Paper
Abstract: 
Social media is a rich source of up-to-date information about events such as incidents. The sheer amount of available information makes machine learning approaches a necessity to process this information further. This learning problem is often concerned with regionally restricted datasets such as data from only one city. Because social media data such as tweets varies considerably across different cities, the training of efficient models requires labeling data from each city of interest, which is costly and time consuming. To avoid such an expensive labeling procedure, a generalizable model can be trained on data from one city and then applied to data from different cities. In this paper, we present Semantic Abstraction to improve generalization of tweet classification. In particular, we derive features from Linked Open Data and include location and temporal mentions. A comprehensive evaluation on twenty datasets from ten different cities shows that Semantic Abstraction is indeed a valuable means for improving generalization. We show that this not only holds for a two-class problem where incident-related tweets are separated from non-related ones but also for a four-class problem where three different incident types and a neutral class are distinguished. To get a thorough understanding of the generalization problem itself, we closely examined rule-based models from our evaluation. We conclude that on the one hand, the quality of the model strongly depends on the class distribution. On the other hand, the rules learned on cities with an equal class distribution are in most cases much more intuitive than those induced from skewed distributions. We also found that most of the learned rules rely on the novel semantically abstracted features.
Tags: 
Reviewed

Decision/Status: 
Minor revision

Solicited Reviews:
Review #1
By John Breslin submitted on 11/Jan/2015
Suggestion:
Minor Revision
Review Comment:

Originality:

This paper describes how to use Linked Data features to generalize the analysis of tweets in one city using training data from that city or others. It expands on work presented in the 5th Workshop on Semantics for Smart Cities. The approach presented abstracts features using NER and location/temporal mentions, enriching these with LOD for more generalisable features (e.g. from road names like I80 to a road class like highway). The authors present the two-class and four-class variations based on different incident classifications.
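
To make the abstraction step more concrete for readers of this review, a minimal, hypothetical Python sketch of what replacing specific location mentions with more general classes could look like follows. It is not the authors' implementation; the lookup table and the "@class" placeholder tokens are assumptions standing in for an actual NER step plus a LOD (e.g. DBpedia) lookup.

    # Hypothetical sketch, not the authors' code: replace city-specific
    # location mentions with more general classes before feature extraction,
    # so that tokens such as "I80" generalize across datasets.

    # Toy lookup table standing in for an NER step plus a DBpedia/LOD query.
    ENTITY_CLASS = {
        "i80": "@highway",
        "i90": "@highway",
        "bay bridge": "@bridge",
    }

    def abstract_mentions(tweet: str) -> str:
        """Lower-case the tweet and substitute known mentions with their class."""
        text = tweet.lower()
        for mention, cls in ENTITY_CLASS.items():
            text = text.replace(mention, cls)
        return text

    print(abstract_mentions("Accident on I80 near the Bay Bridge"))
    # -> "accident on @highway near the @bridge"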

The scenario is entirely focused on detecting and classifying incidents from tweets; therefore, I believe that the title is too general and should be changed to include incidents and cities in it somehow. If the authors truly want general tweet classification, another large section would have to be written which would describe more about the requirements or aspects of features that would make them more generalisable to other types of tweets. But this would probably involve applying the methods to a new domain, and I think a better fix is to adjust the title to be more specific.

They should describe why the three incident types listed were chosen (volume, more commonly-known, more disruptive (shooting vs. traffic jam), etc.). Also, the authors do not describe what other LOD features could have been used (emergency services, local resources, etc.) for these incidents.

Significance of the Results:

The results show improvement over the baseline using selected feature sets, which should prove useful to other researchers and readers. It would also be helpful if the authors could share their datasets and workflows for reproducibility.

The hypothesis that LOD features will in general improve results was not shown to be entirely true, and more refinement of these features is required in future work. It would therefore be interesting to hear the authors' opinions on how the use of other LOD datasets, e.g. GeoNames, could have helped, or if some more formal incident-specific datasets (government or company data) could/should be used, or what kind of generalisable features are best suited for cross-dataset classification.

Quality of the writing:

The paper is fairly well written. It suffers from a number of poor definitions that should be addressed in a revision. Suggested fixes and other things that could do with clarification now follow, by section.

Abstract
--------

Improve generalization -> improve the generalization
of tweet classification -> of tweet classifications

Introduction
------------

time-consuming -> time consuming

Give a quick example of what you mean in terms of city-specific info in the first sentence of paragraph 2.

Explain the term tokens (it may be obvious but worth spelling out what you mean in relation to words, phrases, etc.).

I-90 should probably be I90 in the tweet to match the paragraph text and Figure

in form of Twitter -> in the form of Twitter

present out -> present our

These first involve -> The first involves

our conclusion -> our conclusions

Section 2
---------

requires to -> requires us to

The definitions of entities and mentions are confusing. Can something else be done here (e.g. a table of term, definition, example)? I think one issue is that some examples are given and others aren’t. So seeing an NE given as the example for an entity creates confusion with NEs. What about conceptual abstraction instead of entity?

defined as named entity -> defined as a named entity

helps coping -> helps us with coping

road blocked -> Road blocked (to match earlier text, or else change both to road)

type of named entities -> type of named entity

“URIs for these entities are often missing” … not clear why this is so?

“In our approach, both common and proper…” … state if using NER or DBpedia

tweet shown in Figure 1 -> you mean Listing 1 I think

“unable to detect some of the rather informal temporal expressions in the unstructured texts” … this seems to imply that the large docs it was designed for were (semi-)structured - is this so?

Section 3
---------

The authors have a nice selection of cities. It would be interesting to see if the crime rates / accident rates line up with the data observed!

In Table 1, explain the discrepancy between 1404 No Classes for 2-Class and 390 No Classes for 4.

“tweets were manually labelled” … how many exactly?

“resulted in twenty datasets … split in ten” … you may need to explain these numbers to make it easier for the reader; more later.

as frequent as other -> as frequently as other

tend report -> tend to report

Maybe give some examples of No Class tweets - these are ones that mention an incident keyword but don’t actually refer to an incident, right? Give an example for the reader.

For Table 1 and Figure 1/3, is it necessary to have the same data displayed on both or is either sufficient?

Could Figure 2 show dotted groupings around the baseline and semantic abstraction elements?

DBPedia -> DBpedia

Suggest to add heading to separate our third approach. Change “Semantic Abstraction using Location Mentions and Temporal Expressions” to “Semantic Abstraction using Location Mentions” and add “Semantic Abstraction using Temporal Expressions” just before ‘Third’. That will make it easier for reader when she/he sees “our three SA approaches”.

Looking at Table 2, it seems London has the lowest number of overlapping tokens. Any ideas why? It also seems that there is no higher overlap between US cities…

The definition of features needs to be clearer in 3.3. I am seeing a feature as a Type or Class, yet the text says “for both Types and Categories a large number of features is representative” - it sounds like features are aspects of Types or Classes rather than +TYPE and +CLASS being types of feature. Maybe check the way this is written.

The last paragraph or two in Section 3 are quite vague. Please re-read and consider making clearer.

For example, is it surprising regarding the discriminative categories since these are the main ones you’d associate with the three types chosen anyway?

When you say “some of the representative features are shared”, what exactly is ‘some’? How many?

“This could be an indicator that these” … what are these? Classes? Types?

Section 4
---------

Table 3, why was N=18 chosen, especially with text referring to items not in the Table? Explain the high number of web-related classes.

interest is how the learned -> interest is what the learned

As for SVMs -> Since for SVMs

RandomForest <-> Random Forest

I wasn’t really convinced by the selection of the five algorithms. I think more may be needed to justify this choice. How do these compare to all of the algorithms available in the software?

for each classifiers -> for each classifier

that semantic abstraction -> that Semantic Abstraction

allows to -> allows one to

as non-parametric -> as a non-parametric

Friedman, Nemenyi -> Friedman and Nemenyi

End of Table 4 caption (two classes) is missing

Explain where 500 raw samples comes from, again in simple calculations for the casual reader.

Reference Table 4 in 4.2 paragraph 3?

Figure 4 in the appendix and Table 4 -> Figure 5 in the appendix and Table 7 … please check all cross references as obviously there are issues here.

Again, do you really need both tables and all the Box-Whisker figures? Think about it at least…

The paragraph “Although we dominantly…” is unclear, consider rewording.

neglectable -> negligible

+ALL feature group … +ALL italics

“Surprisingly” … Why, explain?

The text says that SA doesn’t increase performance. LOD does and is a type of SA, so this is confusing - distinguish between aggregate and individual. You may want to explain this and also keep it distinct from the already-mentioned aggregation across the five algorithm types.

45000 raw- and … explain where numbers come from, how calculated

“differ significantly from the baseline” -> this could be good or bad, but I think you mean good because of lower ranking

“focus on the semantic features +LOC, +TIME, and the^H^H^H +LOD” explain why these ones

You will need to define “Majority Class” as it is kind of thrown in here. Also explain how the values are calculated from the class counts in Figure 1.
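
For context on this comment: "majority class" usually refers to the trivial baseline that always predicts the most frequent class, whose accuracy equals that class's share of the data. A hypothetical illustration follows; the class counts below are invented and not the values from the paper's Figure 1.

    # Hypothetical illustration of a majority-class baseline; the counts are
    # made up and not taken from the paper.
    counts = {"crash": 120, "fire": 40, "shooting": 20, "no class": 220}
    majority_accuracy = max(counts.values()) / sum(counts.values())
    print(majority_accuracy)  # 0.55: always predicting "no class" is right 55% of the time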

“ranked at the bottom” -> “ranked at or near the bottom”

“skewed class distribution” … what is connection to Majority Class?

“in general trucks are not involved more often in incidents” … I assume this is true, but if you make a statement like this you may need to cite proof from an objective source (insurance data, police data?)

equation 5. -> equation 5).

(Equation 6 -> (Equation 6)

cin-ditionally -> conditionally

“of the word “unit”” -> explain, e.g. is it related to distances, speeds, etc. (km)

Wildland Fire doesn’t sound too exotic but Former Member States does - explain the difference

It could be, could show -> sounds vague, “can”?

Section 5
---------

No comments. The authors outline related approaches used to help classify social media using semantic abstractions. These are nicely described and related to work in the paper, describing the main differences.

Section 6
---------

“social media text classification” +for city incidents

Review #2
By Frank Ostermann submitted on 13/Jan/2015
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper presents original research that is highly relevant to the journal and the special issue. It does so in a concise and well-structured manner. The language might benefit from a native speaker's input, but is clearly good enough for publication. My comments mainly concern some methodological issues and the significance of the results. If these are addressed, I recommend the paper for publication.
1. The authors assume that Tweets from different cities contain specific tokens, which reduce the performance of a classifier trained on data from one city when used to classify data from another. Hence the need for semantic abstraction. This is a plausible argument, but it would benefit from a more detailed discussion of what type of tokens would likely be different, i.e. which would most benefit from semantic abstraction, in order to evaluate early on whether the method used for the semantic abstraction is actually feasible.
2. On page 2, left column, last paragraph: The first statement is what you would like to show. The actual results are less clear-cut.
3. On page 2, right column: A toponym is not the mentioning of a place; it is simply the name of a place. Further, I would argue that homonymy in natural language and ambiguity in place names are conceptually different.
4. On page 3, right column: The distinction between "proper" and "common" location mentions seems misleading. I think what you mean are "specific" and "generic".
5. On page 4, right column: "Ground truth" is a very specific term that IMHO does not apply to your Tweet data set. Further, the notion of "huge regional distance" is vague and misleading here. I think you mean simply "geographic distance". However, what about (dis)similar characteristics for the cities involved? Is a 15 km radius appropriate for all of them? Finally, what about the possible impact of the time periods on your study? I am thinking of major events that involve generic activities but specific locations and result in high Tweet volumes.
6. On page 4/5: In order to be able to judge your analysis, you need to provide more information on the selection process for manual classification and the keywords used. Related to that, the last paragraph in section 3.1 confuses me – so what good is the use of keywords?
7. Table 1: Related to the comment above, is this the complete training set, i.e. were all of these Tweets manually labeled? What is your opinion on the remarkable difference in size, especially for large cities such as Chicago and San Francisco?
8. On page 5, right column: You converted from ASCII to UTF? Please clarify.
9. On page 6, left column: Are there any effects you expect from using the semantic abstraction approaches on original Tweets, as opposed to the baseline which is used on the preprocessed ones? Further, you claim to count how often the same feature occurred in one Tweet. Given the brevity of a Tweet, this does not seem to make much sense. Can you elaborate?
10. On page 7, left column: What about overlaps between unique tokens on the basis of actual geographic regions? E.g. NYC/Boston vs. London/Dublin vs. San Francisco/Seattle? Further, the point of the argument made in the LOD features paragraph is lost on me.
11. Table 3: Are they incident-related or not? Caption seems to contradict column here. Why use Top-N and not Random-N?
12. On page 9, left column: Why not use the best performance, instead of the average performance? You want to explore the potential of semantic abstraction, so why purposefully reduce (smoothen) the performance? Also, by aggregating (averaging) the performance over the classifiers, you cannot actually achieve goal b), because for that you would need to examine the individual classifier results.
13. Table 4: You mean model outputs, not samples?
14. On page 12, left column: When you report on the results, can you give examples of the actual impact or power of the respective approach, e.g. provide examples where it could (or has) led to improvement in the classification?
15. On page 12, right column: I am not an expert on machine learning literature, but I wonder whether your findings regarding the relationship between classification performance and class distribution in the training set haven't been addressed already, e.g. work by Gary Weiss? I find that hard to believe.
16. In section 5, may I suggest mentioning the work by Spinsanti & Ostermann from 2012 and 2013. It uses machine learning techniques for the analysis of social media with a specific focus on geography.
17. In general: Please highlight the significance of your results. At the moment, this is not clear; the impact seems to be rather small, to be honest. Even if that is so and the results fall short of your original expectations, you can still draw more useful conclusions from them.

Review #3
By Alejandro Llaves submitted on 19/Jan/2015
Suggestion:
Minor Revision
Review Comment:

This is an extended version of a workshop paper submitted to Semantics for Smarter Cities, a workshop held at ISWC 2014. The authors propose a method called Semantic Abstraction to enhance tweet classification by using Linked Open Data (LOD) and spatio-temporal indicators.

ORIGINALITY
The concept of Semantic Abstraction is helpful to establish relationships among tokens that are unique in datasets belonging to a specific region. The authors show in the paper that, although their method can be used to classify datasets derived from one city, it is more valuable when the classification training and testing is performed with datasets from multiple cities. The authors also claim that, unlike previous works (reference missing!), the features extracted from tweets are treated as numeric items in the evaluation and counted for each tweet. Finally, I miss a more elaborate comparison of Semantic Abstraction to other machine learning methods in the Related Work section.
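
As a reading aid, the following hypothetical Python sketch (not the authors' code) illustrates what treating features "as numeric items ... counted for each tweet" could mean: each token of an already-abstracted tweet becomes a feature whose value is its occurrence count.

    # Hypothetical sketch: build numeric features by counting how often each
    # (possibly abstracted) token occurs within a single tweet.
    from collections import Counter

    def feature_counts(abstracted_tweet: str) -> dict:
        """Count token occurrences; abstracted classes such as '@highway'
        are treated exactly like ordinary word features."""
        return dict(Counter(abstracted_tweet.split()))

    print(feature_counts("accident on @highway near @highway exit"))
    # -> {'accident': 1, 'on': 1, '@highway': 2, 'near': 1, 'exit': 1}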

SIGNIFICANCE OF THE RESULTS
This work is a step towards building more general tweet classifiers. The method of Semantic Abstraction uses DBpedia categories, which are supposed to be part of a shared conceptualization, to classify concepts from different regions (I am wondering if the same could be applied to non-English-speaking regions, provided enough datasets are available). The authors present a thorough evaluation that will surely set a solid basis for future work in this direction.

QUALITY OF WRITING
The paper is well written and easy to read. I collected some typos and minor remarks below, but in general terms the quality of writing is adequate for a journal publication.

STRUCTURE
1. Introduction
2. Named Entity and Temporal Expression Recognition on Unstructured Texts
2.1 Named Entity Recognition and Replacement using Linked Open Data
2.2 Location Mention Extraction and Replacement
2.3 Temporal Expression Recognition and Replacement
3. Generation and Statistics of the Data
3.1 Data Collection
3.2 Preprocessing and Feature Generation (Preprocessing, Baseline approach, Semantic Abstraction using LOD)
3.3 Analysis of Datasets (Tokens, LOD features)
4. Evaluation
4.1 Method
4.2 Experiment 1: Same City (Discussion)
4.3 Experiment 2: Different Cities (Individual classification performance, Quality of training set)
5. Related Work
6. Conclusion and Future Work
References
Appendix

OTHER REMARKS
- Section 3
- The title is ambiguous: Generation of what? Data or statistics? It could be rephrased as "Generation of Features and Statistics of the Data".
- Section 3.1:
- "We collected all available Tweets..." -> "tweets" does not start with capital letter in the previous mentions.
- "Another might be that people tend TO report..." -> Add "to"
- "However, this reflects the typical situation..." -> Remove "However"
- Section 3.2: "In contrast to previous works..." -> Cite which previous works.
- Table 3: the caption says "TOP-N incident-related (IR) types and categories...", but includes also Not IR. It could be rephrased as "Top-N types and categories..."
- Section 4
- Section 4.2: "Similarly, Figure 4 in the appendix..." -> There is no Figure 4 in the appendix.
- Section 4.3
- "the first six rules all contain abstracted location mentions in their body in CINJUNCTION with..." -> replace for "conjunction".
- There is a broken line when you put the example of the Early_American_Industrial_Centers (class "fire").
- References
- [34] The year of publication is missing.
- Appendix: page 18 is empty.
- Overall: Some references/links to tools or methods are missing: Friedman's test, Nemenyi's test, Tukey's test, Spearman Rank Correlation, OpenCalais API, Part-of-Speech, Sem4Tags tagger, etc.