Review Comment:
Summary:
This paper is about inferring semantic information about tables found on Wikipedia. The semantic information is given in the form of a so-called "semantic description", i.e., a directed graph (actually a tree) in which each node is a Wikidata class, an entity, a literal, or a table column, and each edge is an ontology property representing a relation between two nodes. The approach followed in this paper has two main steps:
Step 1) Build an initial graph of all possible relationships between table cells and table context (a data graph), and then group/consolidate relationships if they are of the same ontology property (to construct the candidate graph).
Step 2) Identify the correct relationships from the candidate graph using Probabilistic Soft Logic (PSL) reasoning. For that step, a set of PSL inference rules is defined to identify and discard mismatches between the available evidence (e.g., rows in a table indicating the existence of a relationship between two columns) and ontological constraints (e.g., domain or range restrictions). Each rule is assigned a weight, and at the end of this reasoning process a Steiner tree algorithm decides which edges (already filtered to be above a certain confidence threshold) to retain.
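For concreteness, this is how I read the second step (a minimal, purely illustrative sketch using networkx; the function and variable names are mine, not the authors'):

    # Illustrative only: candidate edges carry confidences produced by PSL
    # inference; low-confidence edges are dropped, and an (approximate) Steiner
    # tree over the column nodes selects the final semantic description.
    # The graph is undirected here purely for simplicity.
    import networkx as nx
    from networkx.algorithms.approximation import steiner_tree

    def select_description(candidate_edges, column_nodes, threshold=0.5):
        """candidate_edges: iterable of (u, v, confidence) triples."""
        g = nx.Graph()
        for u, v, conf in candidate_edges:
            if conf >= threshold:                    # keep confident edges only
                g.add_edge(u, v, weight=1.0 - conf)  # lower weight = more confident
        terminals = [c for c in column_nodes if c in g]
        return steiner_tree(g, terminals, weight="weight")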
The evaluation is performed using two datasets:
- A new, publicly available dataset, called 250WT, created by the authors from 250 Wikipedia tables.
- A dataset used in the SemTab 2020 Challenge with 22,000 tables and target annotations to Wikidata.
The evaluation involves two tasks, which are variations of the SemTab challenge tasks CTA (annotating a table column with a KG class) and CPA (annotating a set of table columns with an n-ary relationship). The baseline methods are three top-performing methods in the SemTab 2020 challenge, including the 2020 winner, MTab.
The results show that GRAMS, i.e., the method proposed in this paper, outperforms the baselines on the new 250WT dataset, while achieving competitive results on the SemTab dataset.
My impression from the paper is positive, as it introduces a novel way of performing a very important task and it is, in general, well-written. I also liked that the authors created a new dataset, which they made publicly available along with the source code, and that they offer insights about things that worked and things that did not work that well (e.g., the contribution of some PSL rules is insignificant or even negative, but this is well explained and justified).
Overall, I believe that the paper is contributing to the state-of-the-art and, with some improvements, would be a good fit for this journal.
Originality:
The idea of using PSL and Steiner trees for this problem is a novel and very interesting approach.
Significance of the results:
The experimental results show the strengths and weaknesses of the approach, and are of great interest to the community.
Nonetheless, some parts of the evaluation are too vague and informal to be considered as they stand, and should be rewritten and/or performed again in a clear setting (with respect to evaluation methodology and measures). See the detailed comments that follow.
Quality of writing:
The paper is well-written and structured, but there are certain things that should be improved.
Strong points:
- Interesting idea and approach
- Source code and data are available on GitHub
- Competitive results to SOTA methods
- Feature Analysis is really insightful and useful (Section 4.6)
Weak points:
- Lack of formality makes some parts confusing/vague
- Not clear what the context is exactly and how it is extracted
- The evaluation needs clarifications and perhaps new tests
- Presentation is good overall, but should be improved in some parts
More details on the weak points and suggestions for improving them:
- The lack of a mathematical problem formulation is one of the main weak points of this paper. The introduction in Section 3 is an informal description of the problem, leaving some things open to different interpretations. E.g., what is the surrounding context of T? What exactly is the definition of semantic descriptions? There is a textual, informal definition in the intro stating that it is a directed graph, but again, how is this graph formally defined? I would advise the authors to add a "Preliminaries" (sub-)section to include this kind of information.
A similar issue arises in the algorithms, where many parts are described informally in text instead of actual pseudocode. Examples include lines 4, 6, 22, 26, and 27 in Algorithm 3.
- The authors mention the exploitation of the context quite a few times, but it is not clear what exactly the context is (a formal definition is missing) or how it contributes to the results. Perhaps Section 4.5 was meant to serve this purpose, but it does not, as described next.
- Section 4.5 is supposed to evaluate the role of the context. How this is done, and why and how the evaluation is carried out, are not at all clear. The authors define the table subject as "the subject of a table is what the table is primarily about". What does "primarily" mean? What is the table subject in terms of an ontology? The authors state that "it is often expressed in the semantic description via ontology classes [...] or relationships with other literals [...]". Is it always an ontology class / a relationship with literals, or could it be something else, too? Also, the case of using relationships with other literals to describe a table (even in the example given) is not straightforward at all. I would suggest rewriting this section from scratch; it should not remain as it currently stands.
Moreover, how was the ground truth of 250WT created for that task? How was GRAMS evaluated? The authors mention that the baselines were excluded "because they do not predict the table subjects". That is true, but with some effort the authors could have run the CTA task with those systems (from the SemTab Challenge) on the "subject column", which some of the systems detect. Even one system would be very helpful, even though it is clear that this is not an entirely fair comparison. Finally, in 1 table out of the 18 in the dataset for which GRAMS detected a subject, the authors mention that it was not exactly the same as in the ground truth, but semantically similar. Could this methodology/measure be generalized beyond specific cases and systems (GRAMS), e.g., as in the evaluation measures employed in the SemTab challenge that address generalization with some penalty? (See the illustrative sketch below.)
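To make this last suggestion concrete, something along the following lines could work (a rough sketch; the hierarchy helper and the decay factors are my own assumptions, only meant to illustrate a penalty-based credit in the spirit of the SemTab measures):

    # Illustrative only: give full credit for the exact gold subject/class and
    # discounted credit for classes related to it in the ontology hierarchy.
    def approximate_credit(predicted, gold, related_with_depth,
                           ancestor_decay=0.8, descendant_decay=0.7):
        """related_with_depth(gold): hypothetical helper yielding
        (cls, depth, kind) triples with kind in {'ancestor', 'descendant'}."""
        if predicted == gold:
            return 1.0
        for cls, depth, kind in related_with_depth(gold):
            if cls == predicted:
                decay = ancestor_decay if kind == "ancestor" else descendant_decay
                return decay ** depth
        return 0.0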
- Another problem related to the evaluation is that the authors ignore rules with probability below 0.5, but without a proper experimental justification. Perhaps this threshold is not the best choice, or the threshold could be learned rather than fixed (a sketch of what I mean follows).
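A minimal sketch of the kind of tuning I have in mind (evaluate_f1 is a hypothetical helper that runs the full pipeline with a given cut-off and scores it on a held-out split):

    # Illustrative only: sweep candidate thresholds on a development split and
    # keep the one that maximises F1, instead of fixing the cut-off at 0.5.
    def tune_threshold(dev_tables, evaluate_f1, candidates=None):
        if candidates is None:
            candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
        scores = {t: evaluate_f1(dev_tables, threshold=t) for t in candidates}
        return max(scores, key=scores.get)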
- Finally, there is a rather ad hoc claim that "In many cases, a single path is sufficient to describe the relationship between two nodes. Including extra paths, which tends to be incorrect, can decrease the precision of our prediction." How is this claim supported?
- Presentation-related issues, roughly in order of importance:
-- I would like to see a small (at the very least) architecture section, where things like using "a local key-value database to store Wikidata" are made clear much earlier than page 14. Perhaps between Sections 2 and 3, or after the "Preliminaries" (if the authors introduce one). E.g., I was wondering how the authors "fuzzy match" literals (page 5, line 6), which becomes somewhat clearer only when they say that they use a local key-value store. Until then, I was assuming that they use a Wikidata service. It is still not clear, though, which fuzzy matching algorithm/measure they use in particular (BM25?) and which key-value store (an Elasticsearch index?). These should be clarified in the text, even if someone could check the code to find the answers.
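For instance, one concrete reading of "fuzzy match", which the text could confirm or rule out, is a simple character-level similarity against the labels retrieved from the local store (purely illustrative, not the authors' implementation):

    from difflib import SequenceMatcher

    def fuzzy_match(cell_value, candidate_labels, min_ratio=0.9):
        """Return the candidate label most similar to the cell value,
        or None if nothing clears the (assumed) similarity threshold."""
        best, best_ratio = None, 0.0
        for label in candidate_labels:
            ratio = SequenceMatcher(None, cell_value.lower(), label.lower()).ratio()
            if ratio > best_ratio:
                best, best_ratio = label, ratio
        return best if best_ratio >= min_ratio else None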
-- I am a bit confused about the weighting scheme of PSL rules. The authors mention in Section 3.2 that "If a rule in PSL does not have weight, it will be considered as a hard constraint". Then, one page later, they mention that "rules that act as constraints should have very high weights (rule 11 has w = 100)". I have two questions about those statements:
1) Which one is it? Weight = 100, or no weight, for hard constraints?
2) Why is only rule 11 considered a hard constraint, and not also rules 8, 9 and 10? (The snippet below illustrates the distinction I am asking about.)
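To illustrate: in the PSL rule language, a soft rule carries a numeric weight, whereas a hard constraint is written without a weight and terminated by a period. The predicates below are made up for illustration and are not the paper's actual rules:

    # A strongly weighted soft rule (can still be violated, at a cost):
    soft_rule = "100: CandidateRel(C1, C2, P) & ViolatesRange(C2, P) -> !CorrectRel(C1, C2, P) ^2"
    # A hard constraint (no weight, must hold):
    hard_constraint = "CandidateRel(C1, C2, P) & ViolatesRange(C2, P) -> !CorrectRel(C1, C2, P) ."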
-- Page 3, lines 23-25: "Following this process, we may discover in the second row that Thomas Prinzhorn was a second president, and he left the office in 2002, while there may be no suggestions from data in the first row as Wilhelm Brauneder does not link to any Qnode."
It's not at all straightforward to understand how following the same process leads to this conclusion.
(also I guess that "Wilhelm" is the same as "Willi" mentioned in the table, and that "Gruene" in Figure 1 is the same as "Gruenen", mentioned in Figure 2a, but please confirm/clarify)
-- DBpedia does not only use infoboxes, as stated in the first paragraph; it also uses other extraction mechanisms, such as regular expressions over Wikipedia articles. See an overview of such extractors in Table 1 of http://dx.doi.org/10.3233/SW-140134
-- References 23-28 are never mentioned in the text. They are all relevant, and should also be mentioned/compared against in the related work.
-- Page 5, line 20: I think it's line 11, not 13 in the parenthesis, but I could be wrong.
-- For consistency, I would put "Output" after "Input" in Algorithm 1.
-- Perhaps a separator between rules 5 and 6 would help to visually distinguish the two groups of rules.
Typos:
-- Page 9, line 45: "may contains" --> "contain"
-- Page 9, line 49: "Lines 8 to 10 finds" --> "find"