KnowMore - Knowledge Base Augmentation with Structured Web Markup

Tracking #: 1710-2922

Ran Yu
Ujwal Gadiraju
Besnik Fetahu
Oliver Lehmberg
Dominique Ritze
Stefan Dietze

Responsible editor: 
Guest Editors ML4KBG 2016

Submission type: 
Full Paper

Abstract:
Knowledge bases are in widespread use for aiding tasks such as information extraction and information retrieval, where Web search is a prominent example. However, knowledge bases are inherently incomplete, particularly with respect to tail entities and properties. On the other hand, embedded entity markup based on Microdata, RDFa, and Microformats has become prevalent on the Web and constitutes an unprecedented source of data with significant potential to aid the task of knowledge base augmentation (KBA). RDF statements extracted from markup are fundamentally different from traditional knowledge graphs: entity descriptions are flat, facts are highly redundant and of varied quality, and explicit links are missing despite a vast amount of coreferences. Therefore, data fusion is required in order to facilitate the use of markup data for KBA. We present a novel data fusion approach which addresses these issues through a combination of entity matching and fusion techniques geared towards the specific challenges associated with Web markup. To ensure precise and non-redundant results, we follow a supervised learning approach based on a set of features considering aspects such as the quality and relevance of entities, facts, and their sources. We perform a thorough evaluation on a subset of the Web Data Commons dataset and show significant potential for augmenting existing knowledge bases. A comparison with existing data fusion baselines demonstrates the superior performance of our approach when applied to Web markup data.

Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 05/Oct/2017
Minor Revision
Review Comment:

First of all, I would like to thank the authors for the changes they made in response to my review. I appreciate in particular the clarified contribution and the added Section 9.1 on the potential of the proposed method.

Regarding the review criteria of originality, significance of the results and quality of writing, I now believe that the originality of the proposed end-to-end pipeline is good, and that the results are significant, in so far as they have the potential to considerably enrich existing knowledge bases.

Major concerns remain, however, about the quality of writing:
1. Unless stated, it appears the paper has not been properly proofread for language and presentation issues. This is not just a surface issue, as the language currently used makes understanding difficult in several places; it is especially surprising given that the paper has 6 authors.
2. While I see improvements, I still find the problem definition (Section 3.2) convoluted
3. The evaluation in several parts only spells out numbers, but does not provide the reader with insights or lessons learned

I think the abovementioned concerns require a major effort, and I was at the edge between suggesting a major or a minor revision. I chose a minor revision because I think the concerns can be addressed without requiring changes to the methodology or new experiments, but I suggest really making a major effort to improve the writing.


1. Language and Presentation
- Intro: "to aid knowledge base augmentation", "to enable the exploitation of markup data", "for supporting KBA tasks" - Why these reservations? Are you not *doing* KB augmentation, not *exploiting* markup data?
- Intro: "the KBA problem" - is the term "problem" needed here? I am not aware of an established KBA problem (like "the travelling salesman problem"), so if needed please introduce the term, though I'd rather simply drop "problem" and refer to it by the established process name: "KBA".
- Figure 1 has an unreadable fontsize
- 3.3: "the KBA problem defined above, consists of two steps" -> remove comma
- 3.3: "The first step aims KnowMore_{match} at" - fix word order
- 3.3: "with respect to KB." - add article
- Table 1: "Notation" -> "Term"?
- Table 1: Some terms are named well "F_class, F_ded", others are illegible (F, F'). Why not give sensible names to all?
- 4.3: Similarity for dates: What is "a conflict"? Unsatisfiability or one disagreement? As it stands it is not clear whether values other than 0 and 1 could occur at all.
- 5.1: Features: The author reply mentions that the text appears after Table 4 as further explanation, but in the present version, the text appears first. Also, as it stands, the typesetting of the features is quite unfortunate: at first glance they appear to be part of references [1, 3]. How about using "t_1, ..., t_3" instead?
- "we follow the intuition that, if" - remove comma
- "the example facts #2 and #3, would be valid" - remove comma
- "S im(f_i,f_{KB})" - fix math typesetting
- "this does neither .... while improving ..." - fix language
- "for each type book and movie from Wikipedia" - fix language
- "preliminary analysis of the completeness" - relict of the old version
- "available ^{12}" - fix spacing
- Enumeration in 6.2.2: There seems to be a problem with types, the label of the first item seems to be a step in the pipeline, while the others sound like metrics. Also, what about aligning the labels with the names of the steps as defined in Table 1?
- 6.2.2: "R - the percentage" -> "Recall R - the percentage"
- 6.3: "S V M" - fix typesetting
- 7.1: "While the step is a precondition, we provide evaluation results" - "As it is an important precondition"?
- 7.4: "entities, existing" - remove comma
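
On the date-similarity question in the 4.3 item above, the ambiguity can be made concrete. Below are two plausible readings, written purely as an illustration (the function names and the conflict rule are assumptions, not the authors' definitions): a strict reading under which only 0 and 1 can occur, and a graded reading under which intermediate values occur.

```python
from datetime import date

def date_sim_strict(d1: date, d2: date) -> float:
    """Strict reading: any single disagreement is "a conflict",
    so the similarity is binary (only 0.0 or 1.0 can occur)."""
    return 1.0 if d1 == d2 else 0.0

def date_sim_graded(d1: date, d2: date) -> float:
    """Graded reading: partial credit per agreeing component
    (year, month, day), so values strictly between 0 and 1 occur."""
    parts = [(d1.year, d2.year), (d1.month, d2.month), (d1.day, d2.day)]
    return sum(a == b for a, b in parts) / len(parts)
```

Which of these (if either) the paper intends is exactly what a clearer definition should pin down.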

2. Problem Definition
- The addition of the example is nice, but it appears still too late. Please put an example of an entity description immediately when you define it.
- Use of symbol "q" for entities is confusing. The symbol bears no similarity with the term "entity" and in the field is often used for queries. Why not use "e"?
- What is "E"?
- What is the "type" of Definition 1? "Problem definition"? "Task"?
- As it is, Def 1 can be satisfied by the empty set. You likely want not just "a set" but "the maximal set"?
- The problem definition is possibly overly complicated by the fact that there are actually two problems hidden in it, as also described in the approach section: 1) Find the entity descriptions that refer to the same entity, and 2) Find novel facts. Maybe the problem definition would benefit from being split in two, especially as you are addressing the problems independently, and appear not to use joint inference?

3. Missing Discussion/Lessons learned
- What do we learn from Naive Bayes outperforming the other methods? In response to my previous comment on this you added more experiments showing THAT it outperforms the others, my point was however rather to understand WHY that happens and what we can learn from this for similar scenarios.
- Section 4.2: Still hard to understand due to the language. What really is happening here, on which basis are entities returned by the Lucene queries? How conservative/credulous is the string similarity inside? Understanding this is crucial to understand how many false negatives this step may introduce, so I'd consider a discussion imperative.
- 5.1: "We have experimented with several different approachs" - Why? And why the ones you chose?
- 7.1: Text only spells out the numbers in Table 7. Please rather provide 1-2 sentences explanation as to why the main results are as they are (i.e., why is the best configuration best, or why are the baselines worse?)
- 7.2: The same: "The baseline fails to recall a large amount of correct facts" - Why? What does your technique do better? What can future researchers learn from your method?
- 7.4: Coverage gain: I think the metric is reasonable, but as it is nonstandard please explain briefly before using it why you consider it useful/interesting for the present problem.

- Is Naive Bayes really a state-of-the-art classifier?
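
On the Section 4.2 point above about the blocking step: the conservative/credulous trade-off the reviewer asks about typically comes down to a similarity threshold. A minimal sketch, assuming nothing about the paper's actual implementation (Python's difflib stands in for whatever string similarity the Lucene-based retrieval uses; all names are hypothetical):

```python
from difflib import SequenceMatcher

def candidate_entities(query, labels, threshold):
    """Hypothetical blocking step: keep knowledge-base entity labels
    whose string similarity to the query meets the threshold.
    A high threshold is conservative (few candidates, more potential
    false negatives); a low one is credulous (many candidates)."""
    def sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return [label for label in labels if sim(query, label) >= threshold]
```

For example, with threshold 0.8 the query "Brideshead Revisited" keeps only near-exact labels; lowering the threshold widens the candidate pool. This is precisely why the threshold (and the similarity measure itself) determines how many false negatives the step can introduce.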

Review #2
Anonymous submitted on 16/Oct/2017
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

In their revised version, I consider that the authors have significantly improved the presentation of their contributions. In particular, they have taken into account my comments on their first version in a satisfactory manner.

Review #3
By Aleksander Smywinski-Pohl submitted on 23/Oct/2017
Minor Revision
Review Comment:

Compared to the previous review, most of my suggestions have been incorporated into the text. Yet there are two things that could be improved without much further work:
1. The definitions of the similarity metrics - the authors once again refer to the source code, which I believe is a bad practice. I don't expect many definitions, yet they are crucial for reproducing the results.
2. Usage of percentage points rather than percents. If the precision jumped from 60% to 80%, it improved by 33%, not 20%! However, it improved by 20 percentage points.
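
The distinction in point 2 is simple arithmetic, and worth stating explicitly:

```python
def relative_improvement_pct(old, new):
    """Relative change, as a percentage of the old value."""
    return (new - old) / old * 100

def absolute_improvement_pp(old, new):
    """Absolute change, in percentage points."""
    return new - old

# Precision rising from 60% to 80%:
#   relative: (80 - 60) / 60 * 100 = 33.3...%  (it improved *by a third*)
#   absolute: 80 - 60 = 20 percentage points
```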

Minor issues:
p. 5 "A fact is correct, i.e. consistent with the real world regarding query entity q" - query entity or just entity?
p. 6 "For instance, considering the query “Brideshead Revisited” (of type Book), as part of the the blocking step" - doubled "the"
p. 6 "Applying these heuristics *improves* the performance of the subsequent step by providing a wider and *improved* pool of candidates." - doubled word, the second occurrence does not provide any explanation in that context.
p. 8 "A fact f is considered to be novel with respect to the KBA task, if it fulfills the conditions: i) *not duplicate* with other facts selected from our source markup" - does not duplicate
p. 9 "corpus M, ii) *not duplicate* with any facts existing in the KB. Each of these two conditions corresponds to a diversification step." - does not duplicate
p. 10 "– Entity matching. Precision P - the percentage of entity descriptions ei ∈ E that were correctly matched to eq, *R* - the percentage of ei ∈ E0 that were correctly matched to KB, and the F1 score." - missing word "recall" before "R".
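
For reference, the entity-matching metrics named in the item above have standard set-based definitions; a minimal sketch (the variable names are illustrative, not the paper's notation):

```python
def precision_recall_f1(predicted, gold):
    """predicted: entity descriptions the system matched to the
    query entity; gold: the descriptions that should have been
    matched. Returns (P, R, F1) with the usual definitions."""
    tp = len(set(predicted) & set(gold))          # true positives
    p = tp / len(predicted) if predicted else 0.0  # precision
    r = tp / len(gold) if gold else 0.0            # recall
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0   # harmonic mean
    return p, r, f1
```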