A Supervised Machine Learning Approach for Events Extraction out of Arabic Tweets

Tracking #: 1545-2757

Mohammad ALSmadi
Omar Qawasmeh

Responsible editor: 
Guest Editors ML4KBG 2016

Submission type: 
Full Paper
Tweets hold a rich amount of information about daily events; however, they are noisy, personalized, and challenging for machines to understand. Therefore, this research proposes a state-of-the-art supervised machine learning approach for extracting events out of Arabic tweets. The proposed approach focuses on four research tasks: Task 1: Event Trigger Extraction, Task 2: Event Time Expression Extraction, Task 3: Event Type Identification, and Task 4: Temporal Resolution for ontology population. Event arguments extracted by these tasks are used to populate an event ontology designed for this purpose. This ontology in turn feeds a visualization tool (e.g. a calendar) representing live extracted events. The proposed approach was evaluated on a dataset of 2k Arabic tweets and the evaluation results were promising. The approach's performance was compared to an unsupervised rule-based approach from previous work using the same dataset. Results show that the proposed approach outperforms the unsupervised rule-based one in tasks T1: event trigger extraction (F-1 = 92.6 vs. F-1 = 78.7) and T2: event time expression extraction (F-1 = 92.8 vs. F-1 = 88.35), whereas it performs relatively worse in T3: event type identification (Accuracy = 80.1 vs. Accuracy = 95.9).

Solicited Reviews:
Review #1
Anonymous submitted on 16/Feb/2017
Major Revision
Review Comment:

The paper presents a supervised machine learning approach for extracting events from Tweets expressed in Arabic language.
The goal of the presented approach is to tackle the following four challenges: (i) the extraction of event triggers; (ii) the extraction of time expressions; (iii) the extraction of event types; and (iv) the resolution of temporal expressions for populating an event ontology.

The proposed idea is interesting, the paper is well written and the first part is easy to follow.
However, there are some shortcomings that significantly affect the overall judgment.

First: the Reviewer did not understand the connection between the Calendar application and the remaining part of the paper.
This part seems to be totally disjointed from the rest of the paper.
Indeed, no information about the effectiveness of Task 4 is reported.
Still concerning Task 4, no details about the developed ontology have been provided:
- What are the main concepts? Object properties? etc. A graph may help comprehension.
- Which methodology has been followed for developing the ontology?
These details must be provided.
Further details about the matching algorithm should be provided too.

Second: the evaluation surprised the Reviewer.
The accuracy obtained on Task 3 is quite low, considering that this Task should be easier than, for example, Task 1.
The effectiveness of the algorithm for detecting the event type should be improved before this paper can be considered for publication.
Moreover, more details about it have to be provided.

Third: further details about the learning algorithm should be provided.
At first impression, the paper seems to be a black-box pipeline of existing components.
What is the real contribution of the authors? It should be specified.

Minor issues.
- Section 4.1: the list of temporal keywords should be linked.
- Section 1:
----- "events our f Arabic" -> "events out of Arabic"
----- "findings form" -> "findings from"
- Section 2:
----- "addressing this filed" -> "addressing this field"
- Section 3:
----- "two main task" -> "two main tasks"
----- "Abuleil propose" -> "Abuleil proposes" (or proposed)
----- "the authors produced" -> "The authors produced"
- Section 4.2:
----- "extract extract" -> "extract"
- Section 7:
----- "Event are populated" -> "Events are populated"

Review #2
Anonymous submitted on 21/Mar/2017
Review Comment:

The paper investigates the use of supervised machine learning approaches to extract event information from Arabic tweets. The authors collected and annotated 2000 Arabic tweets, and used three machine learning algorithms to train and test a set of classifiers. The evaluation is based on the machine learning test results, as well as on a comparison to a lexicon-based approach previously developed by the authors. The extracted events are then fed into an ontology which is meant to feed into a visualisation tool (e.g. a calendar).

In general, a good topic, and a challenging one given the difficulty of processing Arabic text, and of extracting events in general. Unfortunately, in my humble opinion, the paper has several major weaknesses, which I will list below.

The part of the paper which is concerned with populating an ontology is very weak and somewhat detached from the event extraction work. The ontology population work was not evaluated, and the population process itself appears to be very simplistic (mainly based on processing temporal information). There is no novelty in this part of the work, given that related ontology population work was neither covered in the paper nor compared against. Also, feeding this information into a visualisation tool (a calendar is given as an example) was mentioned in the abstract and intro but not covered in the paper at all.

Comparing the supervised model to a lexicon based model is acceptable, but insufficient. Naturally, one would expect the supervised model to perform better than unsupervised approaches. It would have been better to also compare against one or more of the many supervised approaches mentioned in the related work section.

Many supervised approaches are mentioned in the related work section, which appear to use similar approaches and features. At the end of that section, the authors highlight how their work differs from other Arabic-focused work, but it would be necessary to also highlight differences to non-Arabic works, since the methods are very similar. It is unclear whether the difference lies mainly in the type of features used, in the event information being extracted, or elsewhere. This could be highlighted better.

Event type is taken to be Instant or Interval. Why those two? This is more about the type of the time of the event than of the event itself. In which case, Event time and Event type are both related to time. And why is knowing whether the event is instant or interval useful? The 6 event information items listed in the Introduction (a, b, ... f) need a better justification or grounding in related literature. At the moment they look rather unstructured (several are based on time, and Event location and Event target are both based on location).

Good coverage of related work is given. It would have been much better to use one or more of the datasets listed in table 1 in the evaluation in this paper, to provide a direct and strong comparison to the literature.

Section 4.1 describes the data collection process. It seems that the Twitter search was done using date-related words (days of the week, month names, etc.). This introduces a heavy bias in the collected data towards those time notions. The paper does not explain how event time and event type (instant/interval) were judged/annotated. If days of the week etc. were used as notions, then I suspect that this search method had a strong impact on the results, which was not discussed.

The manual annotation task was not well described. It is unknown what the inter-annotator agreement score was, or how many annotations were produced. It seems that the annotators (the authors of this paper) annotated against the 7 tags listed in Section 4.1, but we don't know how many annotations were produced for each of these tags, or how many of them were unique. For example, we don't know how many different Event Triggers were annotated.

The MADAMIRA and ANED tools were used to extract and disambiguate semantics. It is unclear how well these tools worked, how many unique semantics were extracted, and how many tweets produced these semantics.

The evaluation analysis does not help to understand the impact of each feature on the results. The authors rerun the analysis several times, each time dropping one feature and keeping all the others. Unfortunately this does not help to understand the impact of each individual feature. In other words, we don't know from the results which is the most important feature. If anything, the results seem to suggest that some of these features have almost no impact on the results.

In Table 5, how is the weighted average calculated? What were the weights, and why? It seems that the weights were chosen to strongly favour one side over another to improve the results, with no justification.
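For reference, the conventional "weighted average" in classification reports (e.g. scikit-learn's `average='weighted'`) weights each class's metric by its support, i.e. its number of true instances; if the authors used anything else, that choice needs justification. A minimal sketch, with hypothetical per-class F1 scores and supports (not figures from the paper):

```python
# Support-weighted average of per-class metrics (the scikit-learn
# "weighted" convention): each class contributes in proportion to
# its support (number of true instances of that class).
def weighted_average(scores, supports):
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

# Hypothetical per-class F1 scores and class supports (illustrative only).
f1_per_class = [0.95, 0.70]
supports = [900, 100]

print(weighted_average(f1_per_class, supports))  # 0.925
```

With a class imbalance like this, the weighted average sits close to the majority class's score, which is exactly why the choice of weights matters for the reported numbers.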

Tables 6 and 7 need much further explanation. Why is there no F for T3 in Table 6? And why no R and P for these tables? These are important to know to complete the comparison to the supervised results.

Table 9 suggests that dropping any of the features has little effect on the results, and that accuracy remains very high (in the 90s%) regardless of which feature is dropped (even when the uni-gram feature is dropped). This is very unexpected. It is normally the case that at least one of the features (e.g. uni-gram) would be the dominant one. How could this be explained? It would be better to run the analysis with each feature individually to see how they perform.
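The single-feature runs suggested here can be set up alongside the leave-one-out ablation with a small harness. A sketch, where `evaluate` is a placeholder standing in for training and scoring a classifier on a feature subset (the feature names and scores are illustrative assumptions, not the paper's):

```python
# Ablation harness: contrast leave-one-out ablation (drop each feature)
# with single-feature runs (keep only that feature).
FEATURES = ["unigram", "pos", "lexicon", "position"]

def evaluate(feature_subset):
    # Placeholder scorer: pretends "unigram" dominates, as one would
    # normally expect. A real run would train and test a classifier here.
    return 0.90 if "unigram" in feature_subset else 0.55

def ablation(features, evaluate):
    drop_one = {f: evaluate([g for g in features if g != f]) for f in features}
    only_one = {f: evaluate([f]) for f in features}
    return drop_one, only_one

drop_one, only_one = ablation(FEATURES, evaluate)
print(drop_one["unigram"], only_one["unigram"])  # 0.55 0.9
```

If the drop-one scores barely move while the only-one scores vary sharply, the features are redundant with each other rather than individually uninformative, which is one possible explanation for the pattern in Table 9.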

Other points:
In introduction, 6 event information items are mentioned (a, b, ... f). But 7 are listed in section 4.1.

The intro gives the impression that all 6 of these are being extracted in this paper. Later we learn that only the first 3 are! This is misleading.

Motivation should highlight how much Twitter is being used in Arabic, and why extracting events is useful. The two application examples given (Intelligent web, and Q&A) are very vague.

The paper talks about extracting "event arguments". However, no argumentation is being extracted per se, but only event information fragments.

In Related Work, authors should have also covered semi-supervised approaches briefly since they are related to this work, which builds a supervised model and compares it to an unsupervised one.

Stating the accuracy results of related work is ok, but these numbers are not meaningful given that the experiments were all different (applied to different datasets), and hence they are not comparable.

Why not compare against the lexicon by [5], which is also for Arabic?

Typos and such like:
1. "Arabic language is the one of the fastest .." delete "the"
2. "in order to solve" too strong.
3. "our f Arabic"
4. very unclear sentence "For instance, on the text level [5, 6, 7] and on the sentence level [8, 1] can be listed"
5. "ACE corpus 3, Four"
6. "withour POS"
7. "events. the authors"
8. placing figure 1 in related work section is very odd
9. "used to extract extract the features"

Review #3
By Mohamed Sherif submitted on 03/May/2017
Major Revision
Review Comment:

The paper introduces a supervised machine learning approach for extracting events out of Arabic tweets. Building on previous work by the same authors [1], the authors categorize event arguments into: event trigger, time, type, location, product and target.
In this paper, the authors try to solve three main tasks: (T1) event trigger extraction, (T2) event time expression extraction and (T3) event type identification.

After the introduction, the authors begin by motivating their work and stating the various challenges of extracting events from Twitter text. Thereafter, a wide view of related work is given, with a focus on supervised machine learning approaches for event extraction and the Arabic language.
In Section 4, the authors describe the proposed approach. First, the authors present the feature extraction phase based on the MADAMIRA tool. Next, the extracted features were used to train three main classifiers: Naïve Bayes, Support Vector Machine (SVM), and a decision tree, based on the WEKA 2.4 implementation.
In Section 5, the authors evaluate the proposed approach on a dataset of 2K manually annotated tweets. Also, the authors compare the proposed system with their previous work [1]. The authors then discuss the results in Section 6. In Section 7, the authors introduce the temporal resolution process using regular expression matching and lexicon-based matching. Finally, the paper is concluded and some future extensions are presented in Section 8.

The paper is written in good English and in my opinion the intensive evaluation and discussion section are the main strength of this paper.

The main weakness of this paper is the originality of the used techniques. In particular, the authors did not come up with any new technique, but rather apply three already existing supervised machine learning techniques. Also, they use already existing tools for preprocessing and feature extraction. Nevertheless, the importance of the paper comes from:
- providing an empirical analysis of the presented three supervised machine learning techniques in the domain of event extraction from Arabic tweets;
- providing an overall system capable of combining all aspects of event extraction in Arabic tweets.

Therefore, I suggest that the authors either re-submit the paper to the system track where such an engineering paper is more suitable, or add at least one novel algorithm to solve any of the aforementioned tasks.

Other comments:
- Provide a link to the developed system. For example, the Github of the project.
- Provide a link to the 2K manually annotated tweets.
- Give more details about how each of the three supervised machine learning algorithms/tools was initialized.
- Page 2: what is the difference between event location and event target?
- Page 4: “They computed four main features sets (Temporal features, Social features, Topical features, and Twitter-centric features) for their dataset and used them to train machine learning approaches to cluster tweets at any point of time.” This sentence is too long; better split it into two.
- Add F-Measure as a column to Table 1 as it already exists for all systems in the text.
- There is a mismatch between the first paragraph of Section 4 and Figure 1. For example, the temporal resolution step (from the text) is not in Figure 1, also the feeding step (from Figure 1) is not included in the paragraph.
- In Section 4.1, “authors using the framework of events arguments discussed earlier”: where?
- In Section 4.1, add an annotation example.
- Tables 6 and 7 are not clear to me; also, why is there no F-Measure in the last row of Table 6?
- Table 8: Why is there no accuracy for Tasks 1 and 2, and why no F-Measure for Task 3?
- Page 13: I believe that the reference [30] is misplaced, and it should be [1]

- Page 2: “our f” → “out of”
- Page 5: “withour” → “without”
- Page 9: “extract extract” → “extract”
- Figure 2: “if { DataExpression matches RegularExpressionFormat newData…” → “if DataExpression matches RegularExpressionFormat { newData…”
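The corrected control flow for Figure 2 (regex-based matching of a date expression first, then falling back to lexicon-based matching, as described for Section 7) can be sketched as follows. The pattern, lexicon entries, and function names here are illustrative assumptions, not the authors' actual implementation:

```python
import re

# Illustrative pattern for numeric date expressions such as "12/5/2017"
# (day/month/year); not the authors' actual regular expression format.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

# Tiny illustrative lexicon mapping temporal keywords to a coarse
# normalized label; a real lexicon would cover far more expressions.
TEMPORAL_LEXICON = {"today": "DATE:+0d", "tomorrow": "DATE:+1d"}

def resolve_temporal(expression):
    match = DATE_RE.search(expression)
    if match:  # regular-expression matching first
        day, month, year = match.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    # fall back to lexicon-based matching
    return TEMPORAL_LEXICON.get(expression.strip().lower())

print(resolve_temporal("12/5/2017"))  # 2017-05-12
print(resolve_temporal("tomorrow"))   # DATE:+1d
```

This also makes concrete the syntactic point in the typo above: the condition belongs in the `if` test, with the normalization step inside the block it guards.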