Relation Extraction from the Web using Distant Supervision

Tracking #: 742-1952

Authors: 
Isabelle Augenstein

Responsible editor: 
Guest Editors EKAW 2014: Schlobach, Janowicz

Submission type: 
Conference Style
Abstract: 
Extracting information from Web pages requires the ability to work at Web scale in terms of the number of documents, the number of domains and domain complexity. Recent approaches have used existing knowledge bases to learn to extract information with promising results. In this paper we propose the use of distant supervision for relation extraction from the Web. Distant supervision is a method which uses background information from the Linking Open Data cloud to automatically label sentences with relations to create training data for relation classifiers. Although the method is promising, existing approaches are still not suitable for Web extraction as they suffer from three main issues: data sparsity, noise and lexical ambiguity. Our approach reduces the impact of data sparsity by making entity recognition tools more robust across domains, as well as extracting relations across sentence boundaries. We reduce the noise caused by lexical ambiguity by employing statistical methods to strategically select training data. Our experiments show that using a more robust entity recognition approach and expanding the scope of relation extraction results in about 8 times the number of extractions, and that strategically selecting training data can result in an error reduction of about 30%.
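As a concrete illustration of the labelling step described in the abstract, here is a minimal Python sketch of distant supervision. The function name, the naive substring matching, and the triple format are simplifying assumptions for illustration, not the paper's implementation:

    from collections import defaultdict

    def distant_label(sentences, kb_facts):
        # kb_facts: (subject, relation, object) triples from a knowledge
        # base such as the Linking Open Data cloud.
        # Index the known facts by entity pair for fast lookup.
        relations_by_pair = defaultdict(set)
        for subj, rel, obj in kb_facts:
            relations_by_pair[(subj, obj)].add(rel)

        training_data = []
        for sent in sentences:
            for (subj, obj), rels in relations_by_pair.items():
                # Distant supervision assumption: a sentence that mentions
                # both entities of a known fact expresses its relation(s).
                if subj in sent and obj in sent:
                    training_data.extend((sent, rel) for rel in rels)
        return training_data

The noise and ambiguity problems mentioned in the abstract arise precisely where this labelling assumption fails, e.g. when an entity pair participates in more than one relation.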

Decision/Status: 
[EKAW] combined track accept

Solicited Reviews:
Review #1
Anonymous submitted on 23/Aug/2014
Suggestion:
[EKAW] combined track accept
Review Comment:

Overall evaluation: 2 (accept)
Reviewer's confidence: 4 (high)
Interest to the Knowledge Engineering and Knowledge Management Community: 5 (excellent)
Novelty: 4 (good)
Technical quality: 5 (excellent)
Evaluation: 5 (excellent)
Clarity and presentation: 4 (good)

Review:

The paper presents an improved approach to relation extraction using distant supervision, i.e. using existing knowledge bases to automatically label entities in a text to create training data for learning a classifier. The paper points out deficiencies of current approaches and explains how and why the presented approach is intended to be better. The evaluation of the presented approach indeed shows significant improvement over an existing state-of-the-art system.

The paper is well written, the presented approach is well motivated and explained, and the evaluation of the approach is described and discussed in detail. The presented work is sufficiently related to existing approaches from the literature.

A few minor remarks:

Sec.3.1: "Ambiguity within an entity": It would increase clarity if the example given in the first paragraph were mapped to the generic discussion in the second paragraph, e.g. (Beatles, album, LetItBe) vs. (Beatles, track, LetItBe). Of course, the reader can figure this out on their own, but it is nicer if it is already in the text. Moreover, the statement that "Let it be" is a seed for the class "Musical Artist" causes some confusion at first, because "Let it be" is not a seed for the class "Musical Artist" but a seed for a property of it ("album", resp. "track").
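To make the ambiguity concrete: a seed filter along these lines (a hypothetical Python sketch, not the paper's code) would discard objects that appear under more than one relation of the same subject:

    from collections import defaultdict

    def unambiguous_seeds(kb_facts):
        # Collect, per (subject, object) pair, the distinct relations the
        # knowledge base asserts; e.g. (Beatles, album, "Let It Be") and
        # (Beatles, track, "Let It Be") yield two relations for one pair.
        relations_per_pair = defaultdict(set)
        for subj, rel, obj in kb_facts:
            relations_per_pair[(subj, obj)].add(rel)
        # Keep only seeds whose object is unambiguous for its subject.
        return [(s, r, o) for s, r, o in kb_facts
                if len(relations_per_pair[(s, o)]) == 1]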

There is a typo in the formula for defining A_0 in the second paragraph above Sec.3.2.

The search pattern mentioned in Sec.4.1 quotes the subject entity but should also quote the class name and relation name.

The relaxed setting (Sec.3.2) is a very pragmatic but still interesting approach. It increases the number of extractions but, of course, reduces precision. I wonder whether the relaxation is unnecessarily far-reaching. Would it not be better to apply the described relaxation only to sentences whose subject is a pronoun or a more general concept (e.g. "author") than the subject of the paragraph's first sentence (e.g. "E.A. Poe")? Sec.6 states that improving the relaxation via better coreference detection is subject to future work, but it would still be nice if the paper included a more detailed discussion of the possible ways of relaxing the original supervision paradigm and their respective advantages and difficulties; a sketch of the restricted variant follows below.
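To illustrate the restricted relaxation suggested above (a hypothetical sketch; the pronoun list and whitespace tokenization are my own assumptions):

    PRONOUNS = {"he", "she", "it", "they", "his", "her", "its", "their"}

    def relaxed_candidates(paragraph_sentences, subject):
        # Yield sentences usable for extraction about `subject`: any
        # sentence mentioning the subject explicitly, plus later sentences
        # whose first token is (heuristically) a pronoun presumed to
        # corefer with the subject of the paragraph.
        for i, sent in enumerate(paragraph_sentences):
            tokens = sent.split()
            if not tokens:
                continue
            if subject in sent:
                yield sent
            elif i > 0 and tokens[0].lower().strip(".,") in PRONOUNS:
                yield sent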

Review #2
Anonymous submitted on 24/Aug/2014
Suggestion:
[EKAW] conference only accept
Review Comment:

Overall evaluation: 2 (accept)
Reviewer's confidence: 3 (medium)
Interest to the Knowledge Engineering and Knowledge Management Community: 5 (excellent)
Novelty: 2 (poor)
Technical quality: 3 (fair)
Evaluation: 3 (fair)
Clarity and presentation: (no score given)

Review:

The paper proposes several improvements to relation extraction from the Web using LOD-based training data annotation (seed instances). The paper demonstrates a good understanding of the problem and the engineering quality of the realized solution. However, the contribution of the presented work to the current state of the art is minor, and the individual effects of particular steps (such as co-reference resolution and paragraph-length contexts) are not clearly evaluated. In some cases, the link between the limitations of existing systems discussed in the Introduction and the proposed solution is not clear, e.g., how does the system better deal with "unrecognized entities"?

Although the paper motivates the work by web-scale extraction, the set of classes and relations used in the evaluation is rather limited and homogeneous. It is not clear to what extent the proposed approach influences results in cases of a different nature, especially when the potential pool of seed instances is small. The effect of the proposed improvements is also not presented with respect to the scalability of real web-scale extraction; e.g., co-reference resolution by means of the Stanford CoreNLP tools is known to be slow and thus unusable for very large data.

Related work could mention the success of web-scale IE systems such as IBM Watson.

Review #3
Anonymous submitted on 02/Sep/2014
Suggestion:
[EKAW] conference only accept
Review Comment:

Overall evaluation: 2 (accept)
Reviewer's confidence: 3 (medium)
Interest to the Knowledge Engineering and Knowledge Management Community: 4 (good)
Novelty: 4 (good)
Technical quality: 4 (good)
Evaluation: 4 (good)
Clarity and presentation: 4 (good)

Review:

The paper describes an approach that uses distant supervision for identifying relations between entities, or between entities and the class they belong to, in text. The authors propose an approach that improves on related work by applying simple statistical models to filter unreliable data, leaving ambiguous data out of their seeds, and using more negative training data.

The paper is very well written and the approach and results provide a decent contribution to the field.

There are a few parts that could be improved:

In `related work': the difference between Open IE and distant supervision is not as big as the authors suggest: Open IE, for instance, uses distant supervision as a first step. Open IE systems also make use of machine learning techniques, so the claim that they are 'rule based' is incorrect. If the authors mean to address a specific part of the approach, this should be specified.

The work by Wu and Weld (among others [1]) on Kylin should be cited.

In `Evaluation': 'relative recall' is not quite as well established as 'recall': please explain the concept in one sentence.
How do the results of the reimplementation of Mintz et al.'s system compare to the original?
The difference in recall and positive hits, and its (probable) cause in the use of the additional NE class, could be explained better. On the one hand there are more hits; on the other hand there is less recall. At first glance it seems odd that a step that can add more candidate data would reduce recall. The issue at hand is, of course, that the systems work (1) with different data sets and (2) with different features, resulting in different sets of extractions. The fact that relative recall (and not recall) is reported is extremely relevant here, because additionally identified entities will not be found by approaches not using the additional NE identifier and thus cannot help improve recall. If my description above is correct, I managed to figure it out, but it should be spelled out.
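For reference, one common reading of relative recall (stated here as an assumption, since the paper should define its own variant): each system's correct extractions are measured against the pooled correct extractions of all compared systems, rather than against an exhaustive gold standard,

    \mathrm{RelativeRecall}(S_i) = \frac{|C_i|}{\left| \bigcup_{j=1}^{n} C_j \right|}

where C_j denotes the set of correct extractions produced by system S_j. Under this reading, anything found only by the new NE identifier enlarges the pool without being reachable by the baseline, which explains the counterintuitive recall figures discussed above.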

Minor comment:

Compliments for the careful writing, with no noticeable errors or typos.

In section 4.2, footnote 1 should immediately follow the comma which should immediately follow `API'.

[1] Wu, Fei, and Daniel S. Weld. "Autonomously Semantifying Wikipedia." Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM 2007). ACM, 2007.