Distantly Supervised Web Relation Extraction for Knowledge Base Population

Tracking #: 885-2095

Isabelle Augenstein
Diana Maynard
Fabio Ciravegna

Responsible editor: 
Guest Editors EKAW 2014 Schlobach Janowicz

Submission type: 
Full Paper
Abstract:
Extracting information from Web pages for populating knowledge bases requires methods which are suitable across domains, do not require manual effort to adapt to new domains, are able to deal with noise and integrate information extracted from different Web pages. Recent approaches have used existing knowledge bases to learn to extract information with promising results. In this paper we propose the use of distant supervision for relation extraction from the Web. Distant supervision is an unsupervised method which uses background information from the Linking Open Data cloud to automatically label sentences with relations to create training data for relation classifiers. Although the method is promising, existing approaches are still not suitable for Web extraction as they suffer from three main issues: data sparsity, noise and lexical ambiguity. Our approach reduces the impact of data sparsity by making entity recognition tools more robust across domains and extracting relations across sentence boundaries using unsupervised co-reference resolution methods. We reduce the noise caused by lexical ambiguity by employing statistical methods to strategically select training data. To combine information extracted from multiple sources for populating knowledge bases, we present and evaluate several information integration strategies and show that these benefit immensely from additional relation mentions extracted using co-reference resolution, increasing precision by 8%. We further show that strategically selecting training data can increase precision by a further 3%.
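For readers unfamiliar with the core idea of the abstract, the distant-supervision labelling step can be illustrated with a minimal sketch. This is a hypothetical toy example, not the authors' pipeline: the knowledge base triples, sentences, and the `label_sentences` function are all illustrative, and entity matching is reduced to plain substring checks (real systems use entity recognition and linking against a resource such as the Linked Open Data cloud).

```python
# Toy knowledge base of (subject, relation, object) triples.
# All names here are illustrative, not taken from the paper's data.
KB = {
    ("The Beatles", "recorded", "Let It Be"),
    ("Paul McCartney", "memberOf", "The Beatles"),
}

def label_sentences(sentences, kb):
    """Return (sentence, subject, relation, object) training examples
    for every sentence that mentions both entities of a KB triple.
    Matching is naive substring containment, for illustration only."""
    examples = []
    for sent in sentences:
        for subj, rel, obj in kb:
            if subj in sent and obj in sent:
                examples.append((sent, subj, rel, obj))
    return examples

corpus = [
    "Let It Be was recorded by The Beatles in 1969.",
    "Paul McCartney co-founded The Beatles in Liverpool.",
]

for example in label_sentences(corpus, KB):
    print(example)
```

Sentences labelled this way become (noisy) positive training data for a relation classifier, which is exactly where the sparsity and ambiguity issues discussed in the reviews below arise.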
Minor revision

Solicited Reviews:
Review #1
By Antske Fokkens submitted on 20/Jan/2015
Minor Revision
Review Comment:

The paper is based on previous work and that shows in the overall well-written text and clear presentation of the work. The methods are sound and the results are in line with the current state of the art (though they are difficult to compare due to differences in data and evaluation sets).

I recommend this sound work for publication, though some minor issues should be improved before it appears as a journal article.

I'd like to see a clearer explanation related to the domain independence of the approach. Even though the method itself is domain independent, it does require a knowledge base of a significant size from the domain. The domain-specific effort is thus still present, but on the knowledge base side rather than the text analysis side. This does not influence the overall impact of the work, but it is important to be clear about this. Determining the starting size a knowledge base should have in order for the approach to be successful would make for interesting future work.

Some minor comments to improve the presentation of the work:


1) Populating knowledge bases does not require methods that are suitable across domains; domain-specific methods work as well, despite their obvious inconveniences. I'd suggest reformulating this sentence.
2) It is a little confusing to read what the authors will do and then jump back to where related work lags behind: I recommend swapping the sentences 'In this paper.....' and 'Although....'

Section 1:

1) The part that introduces the challenges addressed in this paper could be formulated a bit more clearly. In particular, the sentence 'Although promising, ...': there is no need to mention both 'ignored issues' and 'limitations', and since the issues that are listed afterwards are not completely ignored by all, I'd recommend reformulating that part. Maybe state that your work improves on existing approaches by addressing four challenges, which are illustrated with the example 'Let It Be'?
2) The contributions listed at the end of the paper do not all improve the state of the art (the last two do not). I suggest calling them 'contributions of this research/paper' rather than improvements of the state of the art.

Section 2:

1) typo: Riedel et al. [25] argues -> Riedel et al. [25] argue

Section 3:

1) State at the beginning of the section that you will present the approaches you use and that they will be indicated in bold font (it is easy to guess that the bold abbreviations are method names, but the reader shouldn't have to guess).
2) reformulation: otherwise we want to discard it -> otherwise we discard it
3) typo: one of which does not use attempt resolve -> one of which does not attempt to resolve
4) typo: there is a space between the full stop and footnote 1

Section 4:

1) typos: there are spaces before several footnotes
2) typos: 'Features marked with (*) are only used in the normal setting, but for NoSub setting(Section 3.2)
-> Features marked with (*) are only used in the normal setting, but not for the NoSub setting (Section 3.2)

Section 5:

1) when reading this, one wonders about Mintz et al.'s results. They cannot be compared directly and they only present slightly comparable results in a graph, so I understand you don't compare your results, but maybe briefly explain this in a footnote.
2) typo: textbfinformation -> \textbf{information} (but why is this in bold font?)

Section 6:

1) typo: 'To populate knowledge bases, we test different information integration strategies, which differ in performance by5We'

The first sentence is not finished...

Review #2
By Ulrich Reimer submitted on 22/Jan/2015
Minor Revision
Review Comment:

The paper is an extended version of an already accepted paper to EKAW 2014. The paper is well written, presents new and relevant results and clearly extends the existing state-of-the-art.
Therefore, the paper should be accepted for publication in the Semantic Web Journal.

There are several minor improvements to be made:

page 2, column 2: The sentence starting in line 6 ("In the example above, ...") states that the second sentence of the example does not contain two named entities but a pronoun. This does not fit with the example. In fact, the statement holds true for the relative clause of the first sentence.

page 4, column 2: The definition given at the bottom of the Unam paragraph which reads: | { r | l \in L_0 ... } |
The variable r does not occur in the logical expression - something is wrong.

page 8, column 1, line 2: "book" is mentioned as a noun phrase candidate, which contradicts the previous statement that "for non-greedy matching all subsequences starting with the first word" are considered.

page 10, column 1 (also Table 3): It should be explained what the baseline model is. I first thought it is the reimplemented Mintz model, but Mintz is mentioned in Table 3 as a separate model. So what is the baseline model?

Some of the verbalizations of algorithms are hard to understand. The authors should especially try to reformulate the following passages to make them better understandable:
page 5, bottom third of column 2 (talking about co-occurrence counts, co-reference counts, etc.); page 10, column 1: the explanation of how the top line for recall is computed

The first paragraph of Sec.5 should describe the experimental set-up in a bit more detail - the current description is very dense and can easily be misunderstood.

Several typos (please check!), e.g.: last sentence of Sec. 3.2: "or" --> "of"; first line below Table 6; middle of column 2, page 11: "[34] also it for" --> "[34] also use it for"