RDFising Events in the World: Raising News to the LOD Cloud

Tracking #: 737-1947

Authors: 
Marieke van Erp
Marco Rospocher
Piek Vossen
Itziar Aldabe
Aitor Soroa

Responsible editor: 
Guest Editors EKAW 2014: Schlobach & Janowicz

Submission type: 
Conference Style
Abstract: 
News articles report on events happening in the world. Up to now, information about these events has been difficult to capture in a structured form such as RDF due to the complexity of language. However, with natural language processing technology coming of age, it is starting to become feasible to tap into this wealth of information. In this paper, we present our pipeline for extracting event information from large sets of news articles in English and Spanish in order to (1) create structured data capturing who did what, when, and where, as well as opinions and speculations, and (2) link the extracted content to entities in the LOD cloud. Whilst our information extraction pipeline may not be perfect, the redundancy of the data smooths out recall, and the linking to ontologies and LOD sources enables filtering the data based on cleaner background knowledge. By tracking the provenance of the extracted information, so that users can always refer back to the original article, our pipeline produces rich datasets that contain both unstructured (raw text) and structured (RDF) content in an interlinked manner. We demonstrate this by presenting two datasets that we have produced, highlighting their structure, volume, and complexity. This illustrates how our platform can process daily streams of news and publish the reported events as a structured resource for researchers to use without having to process the data or set up state-of-the-art NLP technology themselves. The pipelines and produced datasets are open and available to the public.
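To make the kind of interlinked output described above concrete, here is a minimal sketch (a reader's illustration, not the authors' actual schema) that builds one such event description in Python with rdflib, using the Simple Event Model (SEM) for the who/what/when/where structure and a gaf:denotedBy link for provenance; the GAF namespace and all data URIs are assumptions for illustration.

    # A reader's sketch of the interlinked output described in the abstract
    # (not the authors' actual schema): one event with actor, place, time,
    # and a provenance link back to the article text.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, XSD

    SEM = Namespace("http://semanticweb.cs.vu.nl/2009/11/sem/")
    GAF = Namespace("http://example.org/gaf#")   # assumed prefix for gaf:denotedBy
    EX = Namespace("http://example.org/")        # invented data URIs

    g = Graph()
    g.bind("sem", SEM)

    event = EX["event/ford_plant_opening"]
    g.add((event, RDF.type, SEM.Event))
    g.add((event, SEM.hasActor,
           URIRef("http://dbpedia.org/resource/Ford_Motor_Company")))
    g.add((event, SEM.hasPlace, URIRef("http://dbpedia.org/resource/Detroit")))
    g.add((event, SEM.hasTime, Literal("2014-01-13", datatype=XSD.date)))
    # Provenance: character offsets of the mention in the source article.
    g.add((event, GAF.denotedBy, EX["news/86531#char=112,131"]))
    print(g.serialize(format="turtle"))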
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
[EKAW] reject

Solicited Reviews:
Review #1
Anonymous submitted on 25/Aug/2014
Suggestion:
[EKAW] reject
Review Comment:

Overall evaluation: -1 (weak reject)
Reviewer's confidence: 5 (expert)
Interest to the Knowledge Engineering and Knowledge Management Community: 4 (good)
Novelty: 3 (fair)
Technical quality: 4 (good)
Evaluation: 3 (fair)
Clarity and presentation: 4 (good)

Review

The paper presents a system for extracting events from unstructured text sources that was developed in the NewsReader project. The system relies on a standard NLP pipeline comprising tokenization, POS tagging, named entity recognition and disambiguation, time and date recognisers, and a (semantic) role labeller.
Two datasets that have been processed by the system are described in the paper (car-domain data obtained from LexisNexis and news data obtained from Wikinews). The results of event extraction on these datasets have been made freely available as an RDF dataset.
An interesting aspect of the presented approach is that it is multilingual, processing both English and Spanish, and that it is able to perform intra-document and cross-document event linking.

Overall, the research is interesting and the resulting datasets will surely be useful to others.

The main problem with the paper is that it lacks an empirical evaluation. It is not clear how accurate the system is at identifying events, nor how well it performs at event coreference.

Further, the actual method for extracting events is not described. The informed reader assumes that semantically role-labelled data is mapped to events, but how this is done is left open. For instance, given the current description, one could not reimplement the system, which in my view is not acceptable for a research paper. An example of how semantic role-labelled data is mapped to events, together with a short description, would have sufficed here. The algorithm for coreference detection/resolution is also explained only very vaguely. Thus, important details that are necessary for understanding and assessing the approach have been omitted.
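For concreteness, the kind of example I have in mind would look roughly like the following sketch (my own reconstruction in Python, not the authors' documented algorithm; the role-to-property mapping is an assumption):

    # Hypothetical sketch of mapping a PropBank-style SRL frame to an event
    # record; a reviewer's reconstruction, not the system's actual method.
    ROLE_MAP = {
        "A0": "sem:hasActor",    # prototypical agent
        "A1": "sem:hasPatient",  # assumed mapping for the prototypical patient
        "AM-LOC": "sem:hasPlace",
        "AM-TMP": "sem:hasTime",
    }

    def srl_frame_to_event(frame, event_id):
        """Turn one SRL frame into (subject, property, value) triples."""
        triples = [(event_id, "rdf:type", "sem:Event"),
                   (event_id, "rdfs:label", frame["predicate"])]
        for role, filler in frame["arguments"].items():
            prop = ROLE_MAP.get(role)
            if prop is not None:          # roles without a mapping are dropped
                triples.append((event_id, prop, filler))
        return triples

    frame = {"predicate": "acquire.01",
             "arguments": {"A0": "dbpedia:Ford_Motor_Company",
                           "A1": "dbpedia:Jaguar_Cars",
                           "AM-TMP": "1989"}}
    print(srl_frame_to_event(frame, "ex:event/1"))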

Review #2
Anonymous submitted on 26/Aug/2014
Suggestion:
[EKAW] conference only accept
Review Comment:

Overall evaluation: 0 (borderline)
Reviewer's confidence: 4 (high)
Interest to the Knowledge Engineering and Knowledge Management Community: 5 (excellent)
Novelty: 4 (good)
Technical quality: 5 (excellent)
Evaluation: 4 (good)
Clarity and presentation: 5 (excellent)

Review

The paper is well written and shows a good knowledge of the field. The system it presents builds on previously published results of the authors (LREC 2014), but also utilizes open-source software and publicly available resources.
The paper can be used as a walkthrough on how to extract knowledge from news and represent it as RDF triples. A good point is that it provides references to all the necessary standards and lists all the NLP components required for processing news text. However, although it employs a rich pipeline of NLP components and extracts entity and relation information, it still does not explain how this information is related to events.

The findings from the analysis of the automotive industry news are interesting, but this analysis is far from being an evaluation. What is needed here is a comparison (in time complexity and performance) with state-of-the-art systems that triplify entity information from text.

Another major weakness of the approach is the complexity of some NLP tasks and the time needed to run them. Some processing times are very long and do not allow the system to be used in a real-world setting. It would be interesting if the NLP pipeline were more flexible, so that users could omit a step (e.g. opinion mining) in order to accelerate the overall processing; a sketch of the idea follows. This would also allow users to process multilingual content by omitting the modules that are not available in a given language.
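As an illustration of the flexibility suggested above (a sketch, not the NewsReader implementation; all stage names and bodies are invented stubs):

    # Sketch of a stage-skippable pipeline; every stage here is a stub.
    def tokenize(doc):       doc["tokens"] = doc["text"].split(); return doc
    def pos_tag(doc):        doc["pos"] = ["NN"] * len(doc["tokens"]); return doc
    def ner(doc):            doc["entities"] = []; return doc
    def srl(doc):            doc["frames"] = []; return doc
    def opinion_mining(doc): doc["opinions"] = []; return doc

    STAGES = [tokenize, pos_tag, ner, srl, opinion_mining]

    def run_pipeline(text, skip=()):
        """Run all stages in order, silently skipping any named in `skip`."""
        doc = {"text": text}
        for stage in STAGES:
            if stage.__name__ in skip:
                continue              # e.g. drop opinion mining for speed, or
            doc = stage(doc)          # drop modules unavailable in a language
        return doc

    doc = run_pipeline("Ford opened a new plant.", skip={"opinion_mining"})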

Some more suggestions:
Figure 1 is difficult to read. I suggest reducing the size of the input and output boxes and increasing the size of the NLP components. It would be useful to have the input text that generates the output presented on pages 6 and 7; otherwise it is difficult to understand how the output has been generated.

On page 7, in the paragraph after the example output, it should say "Semantic Role Labeller" instead of "Semantic Role Labeling".
An example with real input and output would be more useful than the first paragraph of page 8. A visualization would help too.
There is no link to the KnowledgeStore storage system. Isn't it accessible to the public?
The labels in Figure 3 should be in a larger font.

Review #3
Anonymous submitted on 01/Sep/2014
Suggestion:
[EKAW] conference only accept
Review Comment:

Overall evaluation: 0 (borderline paper)
Reviewer's confidence: 5 (expert)
Interest to the Knowledge Engineering and Knowledge Management Community: 4 (good)
Novelty: 2 (poor)
Technical quality: 3 (fair)
Evaluation: 2 (poor)
Clarity and presentation: 3 (fair)

Review

This paper presents ongoing research on a system for event extraction from news articles.
The vision and results are at an intermediate stage of the related EU project NewsReader. Useful components appear to be under development, and examples of how news stakeholders can use them are provided.

My impression, despite the generally good approach taken, is that the work is still incomplete, both in its foundational aspects and in its evaluation. For this reason I deem it barely appropriate for a conference like EKAW, but definitely not yet ready for SWJ.

For the first point (foundations), I have two issues:

a) the authors do not significantly address the state of the art, missing basic work on event extraction (EE) techniques in NLP as well as in the SW (see below for a couple of references from the DERIVE workshop series), and especially missing specific work on EE from news articles (see below for some examples).

Some important references in EE from news:
http://piskorski.waw.pl/papers/p749.pdf (from JRC)

http://www.rn.inf.tu-dresden.de/uploads/Studentische_Arbeiten/Diplomarbe... (from Dresden)

Some relevant DERIVE work:
http://ceur-ws.org/Vol-779/derive2011_submission_1.pdf

http://ceur-ws.org/Vol-1123/paper3.pdf

b) the theoretical approach is not deep enough in terms of linking the knowledge extracted and the knowledge represented. In particular:
- the GAF and NAF annotation frameworks are essentially vocabularies for linking text and data: why not use existing ones, e.g. Lemon/Ontolex, NIF, Earmark, etc.? (see the NIF sketch after this list)
- the pipeline is reasonable enough, and I like a lot that the modules are available on GitHub; however, no details are given in the paper about the nature of the modules: how were they chosen? Do they perform better than existing ones? How are they integrated, beyond the generic Storm implementation? I am also puzzled by the time some of the components take to process one document on average. How big is an average document? If it is a typical news article, the time taken by the NER (12.9s) and the SRL (39.8s) is huge! How scalable can such a system be for large-scale, real-time news processing? (a back-of-envelope calculation follows this list)
- finally, the demo at http://ixa2.si.ehu.es/nrdemo/demo.php is a nice showcase of the modules' capabilities, but with respect to the state of the art, I have the impression that the system does little more than annotate text spans with NLP results. For example, in existing approaches such as LODifier and FRED (implemented and available at http://wit.istc.cnr.it/stlab-tools/fred), the annotations are organized in a connected RDF graph where each useful element extracted by NLP is given a semantics along the practices of LOD and the SW. In the case of NewsReader this seems to be completely missing. Not a big problem per se, but a paper for EKAW or a SW journal should probably be more aware of the issue.
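On the vocabulary point above, here is a minimal NIF-style anchoring of a text span in Python with rdflib, as an illustration of what reusing an existing framework could look like (my sketch; the data URIs are invented, and the NIF/ITS-RDF namespaces follow the published NIF 2.0 conventions as I know them):

    # Minimal NIF-style anchoring of an entity mention in its source text.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, XSD

    NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
    ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")
    g = Graph()

    ctx = URIRef("http://example.org/news/1#char=0,44")      # whole document
    mention = URIRef("http://example.org/news/1#char=0,4")   # the span "Ford"

    g.add((ctx, RDF.type, NIF.Context))
    g.add((ctx, NIF.isString,
           Literal("Ford opened a new assembly plant in Detroit.")))
    g.add((mention, RDF.type, NIF.OffsetBasedString))
    g.add((mention, NIF.referenceContext, ctx))
    g.add((mention, NIF.beginIndex, Literal(0, datatype=XSD.nonNegativeInteger)))
    g.add((mention, NIF.endIndex, Literal(4, datatype=XSD.nonNegativeInteger)))
    g.add((mention, NIF.anchorOf, Literal("Ford")))
    g.add((mention, ITSRDF.taIdentRef,
           URIRef("http://dbpedia.org/resource/Ford_Motor_Company")))
    print(g.serialize(format="turtle"))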
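On the processing-time point, a back-of-envelope calculation (assuming the reported figures are sequential, single-worker averages, which the paper does not state; the daily volume is invented):

    # Rough throughput check based on the reported per-document times.
    ner_s, srl_s = 12.9, 39.8                     # reported seconds per document
    per_doc = ner_s + srl_s                       # >= 52.7 s, ignoring other modules
    docs_per_worker_day = 86_400 / per_doc        # ~1,640 documents/day/worker
    daily_stream = 100_000                        # hypothetical daily news volume
    workers = daily_stream / docs_per_worker_day  # ~61 parallel workers needed
    print(round(docs_per_worker_day), round(workers))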

For the second point (evaluation), the authors propose two studies, on an automotive corpus and on Wikinews. They are nice, but in what sense are they an evaluation? For sure a lot of triples have been produced, but how can one assess that those triples constitute a good resource? The paper contains a couple of examples showing links between persons or the localization of facts reported in the news, but these are only episodes rather than assessments. In practice, you might want to create a gold sample of what you expect from extraction and linking, and evaluate how the components perform with respect to it; a minimal sketch of such a check follows.
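Concretely, the assessment I am suggesting could start as simply as comparing the system's triples against a small hand-built gold sample (all data below is invented):

    # Precision/recall/F1 of system triples against a gold sample.
    def prf(system, gold):
        tp = len(system & gold)
        p = tp / len(system) if system else 0.0
        r = tp / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    gold = {("ex:event/1", "sem:hasActor", "dbpedia:Ford_Motor_Company"),
            ("ex:event/1", "sem:hasPlace", "dbpedia:Detroit")}
    system = {("ex:event/1", "sem:hasActor", "dbpedia:Ford_Motor_Company"),
              ("ex:event/1", "sem:hasTime", "2014")}
    print(prf(system, gold))   # (0.5, 0.5, 0.5)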