A Graph-based Approach for Inferring Semantic Descriptions of Wikipedia Tables

Tracking #: 3195-4409

Authors: 
Binh Vu
Craig A. Knoblock

Responsible editor: 
Agnieszka Lawrynowicz

Submission type: 
Full Paper
Abstract: 
Millions of high-quality tables are available in Wikipedia. These tables cover many domains and contain useful information. To make use of these tables for data discovery or data integration, we need precise descriptions of the concepts and relationships in the data, known as semantic descriptions. However, creating semantic descriptions is a complex process requiring considerable manual effort and can be error-prone. This paper presents a novel probabilistic approach for automatically building semantic descriptions of Wikipedia tables. Our approach leverages hyperlinks in Wikipedia tables and existing knowledge in Wikidata to construct a graph of possible relationships in a table and its context, and then it uses collective inference to distinguish genuine and spurious relationships to form the final semantic description. In contrast to existing methods, our solution can handle tables that require complex semantic descriptions of n-ary relations (e.g., the population of a country in a particular year) or implicit contextual values to describe the data accurately. In our empirical evaluation, our approach outperforms state-of-the-art systems on a large set of Wikipedia tables by as much as 12.6% and 4.8% average F1 scores on relationship and concept prediction tasks, respectively.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 04/Aug/2022
Suggestion:
Accept
Review Comment:

The authors introduce a novel graph-based approach for generating semantic descriptions, i.e., types and relationships between table attributes, for tabular data (Wikipedia tables) with a knowledge graph (Wikidata). The paper addresses a novel task, semantic modeling of n-ary tables, on which there have been few studies. The approach first constructs a data graph of all possible relationships between table cells and then uses the data graph to construct a candidate graph of column relationships. The authors propose a probabilistic model (PSL) to detect incorrect relationships in the candidate graph.

This paper extends their previously published paper [1] as follows:
- It adds more examples and explanations of the method.
- It improves the probabilistic model by adding more rules in PSL.
- It conducts an insightful analysis and compares against more baselines in the experiments. The results reported in Table 3 differ from those in the authors' published paper; however, this time the experimental results were aggregated over 10 different runs. Overall, their results are consistently higher than the other baselines on the 250WT dataset (n-ary tables), while achieving comparable performance on SemTab2020 (a synthetic dataset).
- The authors provide data resources and runnable scripts to replicate their experiments under "Long-term stable URL for resources."
Overall, the paper is well written; the method is solid and clearly explained.
Therefore, I recommend acceptance of this paper.

Minor questions and comments:
- Evaluation metrics: Page 11, line 33: I could not find n3, n1, n4, or n2 in Fig. 5.
- Table 3: What value of maxhop was used in the experiments in Table 3?
- Some references do not contain enough information. Google Scholar sometimes provides the missing citation information.
For example:
D. Ritze and C. Bizer, Matching web tables to dbpedia-a feature utility study, context 42(41) (2017), 19–31.

[1] Binh Vu et al. A Graph-Based Approach for Inferring Semantic Descriptions of Wikipedia Tables, ISWC 2021

Review #2
By Vasilis Efthymiou submitted on 16/Aug/2022
Suggestion:
Major Revision
Review Comment:

Summary:
This is a paper about inferring semantic information about tables found on Wikipedia. The semantic information is given in the form of a so-called "semantic description", i.e., a directed graph (a tree actually) where each node is a Wikidata class, an entity, a literal, or a table column, and each edge is an ontology property representing a relation between two nodes. The approach followed in this paper has two main steps:
Step 1) Build an initial graph of all possible relationships between table cells and table context (a data graph), and then group/consolidate relationships if they are of the same ontology property (to construct the candidate graph).
Step 2) Identify the correct relationships from the candidate graph, using Probabilistic Soft Logic (PSL) reasoning. For that step, a set of PSL inference rules is defined to identify and discard mismatches between the current evidence (e.g., rows in a table indicating the existence of a relationship between two columns) and ontological constraints (e.g., domain or range restrictions). Each rule is assigned a weight, and at the end of this reasoning process a Steiner tree algorithm decides which edges (already filtered to be above a certain confidence threshold) to retain.
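To make my reading of Step 2 concrete, here is a minimal sketch of the final edge selection (my own illustration, not the authors' code; the 0.5 threshold, the edge format, and the use of networkx's approximate Steiner tree are assumptions on my part):

    # Sketch: keep confident candidate edges, then connect the column nodes
    # with an approximate Steiner tree (illustrative, not the paper's code).
    import networkx as nx
    from networkx.algorithms.approximation import steiner_tree

    def select_description(candidate_edges, column_nodes, threshold=0.5):
        # candidate_edges: iterable of (node_u, node_v, ontology_property, confidence)
        g = nx.Graph()
        for u, v, prop, conf in candidate_edges:
            if conf >= threshold:  # discard low-confidence relationships
                g.add_edge(u, v, weight=1.0 - conf, property=prop)  # lower cost for likelier edges
        terminals = [c for c in column_nodes if c in g]
        return steiner_tree(g, terminals, weight="weight")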

The evaluation is performed using two datasets:
- A new, publicly available dataset, called 250WT, created by the authors from 250 Wikipedia tables.
- A dataset used in the SemTab 2020 Challenge with 22,000 tables and target annotations to Wikidata.

The evaluation involves two tasks, which are variations of the SemTab challenge tasks CTA (annotating a table column with a KG class) and CPA (annotating a set of table columns with an n-ary relationship). The baseline methods are three top-performing methods in the SemTab 2020 challenge, including the 2020 winner, MTab.

The results show that GRAMS, i.e., the method proposed in this paper, outperforms the baselines on the new 250WT dataset, while achieving competitive results on the SemTab dataset.

My impression of the paper is positive, as it introduces a novel way of performing a very important task and it is, in general, well-written. I also liked that the authors created a new dataset, which they made publicly available along with the source code, and that they offer insights about things that worked and things that did not work that well (e.g., the contributions of some PSL rules are insignificant or even negative, but this is well explained and justified).
Overall, I believe that the paper is contributing to the state-of-the-art and, with some improvements, would be a good fit for this journal.

Originality:
The idea of using PSL and Steiner trees for this problem is a novel and very interesting approach.

Significance of the results:
The experimental results show the strengths and weaknesses of the approach, and are of great interest to the community.
Nonetheless, some parts of the evaluation are too vague and informal to be considered and should be re-written and/or performed again, in a clear setting (with respect to evaluation methodology and measures). See the detailed comments that follow.

Quality of writing:
The paper is well-written and structured, but there are certain things that should be improved.

Strong points:
- Interesting idea and approach
- Source code and data are available on GitHub
- Competitive results to SOTA methods
- Feature Analysis is really insightful and useful (Section 4.6)

Weak points:
- Lack of formality makes some parts confusing/vague
- Not clear what the context is exactly and how it is extracted
- The evaluation needs clarifications and perhaps new tests
- Presentation is good overall, but should be improved in some parts

More details on the weak points and suggestions for improving them:

- The lack of a mathematical problem formulation is one of the main weak points of this paper. The introduction in Section 3 is an informal description of the problem, leaving some things open to different interpretations. E.g., what is the surrounding context of T? What exactly is the definition of a semantic description? There is a textual, informal definition in the intro stating that it is a directed graph, but again, how is this graph formally defined? I would advise the authors to add a "Preliminaries" (sub-)section to include this kind of information.

A similar issue arises in the algorithms, where many parts are described informally in text instead of actually using pseudocode. Examples include lines 4, 6, 22, 26, and 27 in Algorithm 3.

- The authors mention the exploitation of the context quite a few times, but it is not clear what exactly the context is (a formal definition is needed) or how it contributes to the results. Perhaps Section 4.5 was meant to serve this purpose, but it does not, as described next.

- Section 4.5 is supposed to be evaluating the role of the context. How this is done, and why and how the evaluation is carried out, are not at all clear. The authors define the table subject as "the subject of a table is what the table is primarily about". What does "primarily" mean? What is the table subject in terms of an ontology? The authors state that "it is often expressed in the semantic description via ontology classes [...] or relationships with other literals [...]". Is it always an ontology class / relationship with literals, or could it be something else, too? Also, the case of using relationships with other literals to describe a table (even the example given) is not straightforward at all. I would suggest re-writing this section from scratch; it should not remain as it is currently.

Moreover, how was the ground truth of 250WT created for that task? How was GRAMS evaluated? The authors mention that the baselines were excluded "because they do not predict the table subjects". That is true, but with some effort, the authors could have run the CTA task of those systems (from the SemTab Challenge) on the "subject column", which some of the systems detect. Even one system would be very helpful, even if it is clear that this is not an entirely fair comparison. Finally, for 1 of the 18 tables in the dataset for which GRAMS detected a subject, the authors mention that it was not exactly the same as in the ground truth, but semantically similar. Could this methodology/measure be generalized beyond specific cases and systems (GRAMS), e.g., as in the evaluation measures employed in the SemTab challenge that address generalization with some penalty?

- Another problem related to the evaluation is the fact that the authors ignore rules with probability below 0.5, but without a proper experimental justification. Perhaps this threshold is not the best, or the threshold could be learned instead of fixed.

- Finally, there is a rather ad hoc claim that "In many cases, a single path is sufficient to describe the relationship between two nodes. Including extra paths, which tends to be incorrect, can decrease the precision of our prediction." How is this claim supported?

- Presentation-related issues, roughly in order of importance:

-- I would like to see a small (at least) architecture section, where things like using "a local key-value database to store Wikidata" are made clear much earlier than page 14. Perhaps between Sections 2 and 3, or after the "Preliminaries" (if the authors introduce that). E.g., I was wondering how the authors "fuzzy match" literals (page 5, line 6), which becomes somewhat clearer only when they say that they use a local key-value store. Until then, I was assuming that they use a Wikidata service. It is still not clear, though, which fuzzy matching algorithm/measure they use in particular (BM25?), and which key-value store (an Elasticsearch index?). Those should be clarified in the text, even if someone could check the code to find those answers.

-- I am a bit confused about the weighting scheme of PSL rules. The authors mention in Section 3.2 that "If a rule in PSL does not have weight, it will be considered as a hard constraint". Then, one page later, they mention that "rules that act as constraints should have very high weights (rule 11 has w = 100)". I have two questions about those statements:
1) Which one is it? Weight = 100, or no weight, for hard constraints?
2) Why only rule 11 is considered as a hard constraint and not also rules 8, 9 and 10?
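For reference, my understanding of standard PSL rule syntax (the predicate names below are hypothetical and not taken from the paper) is that a hard constraint is written as an unweighted rule terminated with a period, whereas a very high weight such as 100 only approximates a hard constraint and can still be violated if enough other evidence outweighs it:

    # Illustrative PSL rule strings only; the predicates are made up for this example.
    SOFT_RULE = "10.0: CandRel(C1, C2, P) & TypeMatch(C1, P) -> CorrectRel(C1, C2, P) ^2"
    HIGH_WEIGHT_RULE = "100.0: Disjoint(T1, T2) & HasType(C, T1) -> !HasType(C, T2) ^2"
    HARD_CONSTRAINT = "CorrectRel(C1, C2, +P) <= 1 ."  # unweighted, terminated with '.'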

-- page 3, lines 23-25: "Following this process, we may discover in the second row that Thomas Prinzhorn was a second president, and he left the office in 2002, while there may be no suggestions from data in the first row as Wilhelm Brauneder does not link to any Qnode."
It's not at all straightforward to understand how following the same process leads to this conclusion.
(also I guess that "Wilhelm" is the same as "Willi" mentioned in the table, and that "Gruene" in Figure 1 is the same as "Gruenen", mentioned in Figure 2a, but please confirm/clarify)

-- DBpedia does not only use infoboxes, as stated in the first paragraph. It also uses other extraction features, such as regular expressions applied to Wikipedia articles. See an overview of such extractors in Table 1 of http://dx.doi.org/10.3233/SW-140134

-- References 23-28 are never mentioned in the text. They are all relevant, and should be also mentioned/compared in the related works.

-- Page 5, line 20: I think it's line 11, not 13 in the parenthesis, but I could be wrong.

-- For consistency, I would put "Output" after "Input" in Algorithm 1.

-- Perhaps a separator between rules 5 and 6 would help to visually distinguish the two groups of rules.

Typos:
-- Page 9, line 45: "may contains" --> "contain"
-- Page 9, line 49: "Lines 8 to 10 finds" --> "find"

Review #3
Anonymous submitted on 07/Oct/2022
Suggestion:
Reject
Review Comment:

General comments:
- The paper is an extended version of a previously published manuscript, namely [1]. The authors mentioned this in the cover letter, but this fact was not clearly stated in the paper, let alone any clarification of what is new in the current version.
- Some authors listed in [1] have been dropped in the current paper. Is there a clear reason for this?
- About 70% of the content of the current paper is an exact copy of the content of the published version, and thus my review covers only the new contributions.

The paper presents an approach to automatically build semantic descriptions of Wikipedia tables. The possible relationships between table columns and their context values are represented as a candidate graph. Then, Probabilistic Soft Logic (PSL) is used to detect and remove incorrect relationships. The topic of the paper is relevant and has several use cases, such as knowledge graph curation and question answering.

The new contributions compared to the previous version are limited to:
Figures & illustrations:
- Extending Section 3.2.2 by adding Figure 4 and pseudocode (listed as Algorithm 3) that describes the Steiner tree algorithm.
- Extending Figure 2 and adding Figures 3 and 6.

Experiments:
- Adding a third baseline to the experiments, namely MTab.
- Sections 4.5 and 4.6: two new experiments, namely table subject detection, and feature analysis with post-processing evaluation.

First, the added figures and illustrations definitely help to understand the approach, but they are still not a major contribution.
Second, regarding the experiments:
- I wonder why the authors did not compare to MTab in the original publication, as it was the top winner of SemTab2020, and instead compared to BBW, which was among the top three but not the winner. So, as a matter of fact, we already knew that MTab was the best system of SemTab2020.
- The reported numbers for the baselines in Table 3 seem to differ from the original paper. Is there any explanation?
- If the PSL probabilistic model was improved, why is there no comparison to show the significance of the improvement?
- The insights reported in Section 4.5 about table subject detection were based on only 76 tables and showed a success rate of 23%.
- Section 4.6 provides a nice analysis and helps in understanding the behavior of the suggested approach.

To me, the new contributions do not go significantly beyond the previously published content, and thus I recommend rejecting the paper.

[1] https://link.springer.com/chapter/10.1007/978-3-030-88361-4_18