Tab2KG: Semantic Table Interpretation with Lightweight Semantic Profiles

Tracking #: 2891-4105

Simon Gottschalk
Elena Demidova

Responsible editor: 
Guest Editors DeepL4KGs 2021

Submission type: 
Full Paper

Abstract:
Tabular data plays an essential role in many data analytics and machine learning tasks. Typically, tabular data does not possess any machine-readable semantics. In this context, semantic table interpretation is crucial for making data analytics workflows more robust and explainable. This article proposes Tab2KG - a novel method that targets the interpretation of tables with previously unseen data and automatically infers their semantics to transform them into semantic data graphs. We introduce original lightweight semantic profiles that enrich a domain ontology's concepts and relations and represent domain and table characteristics. We propose a one-shot learning approach that relies on these profiles to map a tabular dataset containing previously unseen instances to a domain ontology. In contrast to the existing semantic table interpretation approaches, Tab2KG relies on the semantic profiles only and does not require any instance lookup. This property makes Tab2KG particularly suitable in the data analytics context, in which data tables typically contain new instances. Our experimental evaluation on several real-world datasets from different application domains demonstrates that Tab2KG outperforms state-of-the-art semantic table interpretation baselines.
Minor Revision

Solicited Reviews:
Review #1
By Benno Kruit submitted on 10/Nov/2021
Minor Revision
Review Comment:

Thank you for your extensive reply to the points made in the first review, and changes made to the manuscript. My questions have been sufficiently answered, and many of my concerns have further been discussed in the text and tables of the paper.
In particular, the assumptions underpinning your approach have been made more explicit in the paper. The result of this is that it is more clear that this approach is not a general-purpose solution for table interpretation, but rather an interesting method for specific scenarios, namely when additional data arrives for domains that are already well-described in an application-specific ontology. The assumption that the profile is representative is very strong, but still interesting to explore. The experimental design is now better motivated.

The main change to the paper involves the distinction between pairwise and set-based scenarios. This reflects different assumptions on the amount of data available for building domain profiles and leads to interesting distinctions: performance in the set-based scenarios is typically worse than in the pairwise case. However, what this would mean in practice is described only very abstractly. Is one of the two scenarios more realistic? In which real-world cases are pairs of tables not available, so that one would have to rely on the worse performance of the set-based case?

Smaller points:
- In the hyperparameter grid search and ablation tables, a best performance of 0.95 is reported, but it is unclear which scenario this is, and why it is so much higher than the evaluation results in subsequent tables. This should be better explained.
- In Table 1, you make a distinction between different topical domains for ST and SE, but report numbers for all domains together in subsequent tables. This should be made clearer.

Review #2
Anonymous submitted on 12/Nov/2021
Review Comment:

This second version of the paper is more robust. The authors have improved and clarified the experiments, added a deep ablation study, and numerous insights. The paper is well written and the problem statement and running examples sections describe immediately the goal of the system. I still think that some implementation details (e.g., representations of semantic profiles in RDF or the mapping in RML) could be avoided as they don't add much value to the paper. I found the section describing the limits of this approach very clear and this is important in order to continue improving these techniques for semantic table understanding. This contribution might represent an interesting piece of a more complete semantic table interpretation system that combines table lookup and semantic profiles matching. I hope the authors will continue in this direction. I suggest the paper be accepted.

Review #3
Anonymous submitted on 22/Nov/2021
Major Revision
Review Comment:

reviewer #4

This work proposes a solution for creating knowledge graphs from tables based on data profiling techniques. In particular, the data profiles comprise domain profiles and table profiles, which are provided as feature vectors and represented as semantic data. Domain profiles are patterns of ontology relations (only datatype relations in this work) together with their statistical characteristics, such as value distributions of the data in a sample of the domain KG. Table profiles comprise the columns of a table and the statistical characteristics associated with each column. The table interpretation approach, named Tab2KG, maps table columns to ontology relations and transforms the table into a data graph.
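The profile-based matching summarized above can be illustrated with a small sketch. To be clear, this is not the authors' implementation: the concrete features (min, max, mean, distinct-value ratio), the inverse-distance similarity, and the `ex:` relation names are all assumptions chosen for illustration.

```python
import math

def column_profile(values):
    # Toy profile: min, max, mean, and distinct-value ratio stand in
    # for the statistical characteristics the paper describes.
    return [min(values), max(values), sum(values) / len(values),
            len(set(values)) / len(values)]

def profile_similarity(u, v):
    # Inverse-distance similarity; the actual scoring function in the
    # paper may differ.
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + dist)

# Hypothetical domain profiles for two datatype relations, as they
# might be precomputed from a sample of the domain knowledge graph.
DOMAIN_PROFILES = {
    "ex:temperature": [-10.0, 40.0, 15.0, 0.9],
    "ex:year": [1900.0, 2021.0, 1990.0, 0.5],
}

def map_column(values):
    """Map a table column to the datatype relation with the most similar profile."""
    p = column_profile(values)
    return max(DOMAIN_PROFILES,
               key=lambda r: profile_similarity(p, DOMAIN_PROFILES[r]))

print(map_column([12.5, 17.0, 21.3, 9.8]))   # -> ex:temperature
print(map_column([1999, 2004, 2004, 2018]))  # -> ex:year
```

The point of the sketch is that no instance lookup is needed: only the precomputed profile vectors of the domain relations are compared against the profile of the incoming column.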

#Originality and contribution
I think that the work is original and very interesting. I have a big concern about the lightweight domain KG. In the evaluation section, the authors mention "DBpedia as a crossdomain knowledge graph", and this seems to contradict what is stated in the introduction with respect to the state-of-the-art approaches: "In the context of DAW, the input data typically represents new instances (e.g., sensor observations, current road traffic events, . . . ), and substantial overlap between the tabular data values and entities within existing knowledge graphs cannot be expected". What if a domain KG is not available? What is meant by a sample? How do we measure whether the data in the domain KG are representative?

#Presentation of the work
Reading the introduction, I got a slightly different understanding from what is then explained in the other sections. Remove redundant information and keep it short: we already have an explanation in Section 1 of what this work is doing, then a second explanation in Section 2 on the running example, then a detailed and formal description in the problem statement, and then Section 4 with the details on profiles. I would suggest keeping a concise description in the introduction and perhaps merging the problem statement and the running example into a single section; this would also reduce the number of sections.

#Other comments
*Definition 6: data type profiles -> which are the statistical characteristics, i.e., the features, associated with the literal relations? From the definition and the examples in the paper, this is not clear. I was expecting to see some numbers.

*Section 5.4. I can understand that the mapping is normalized to the range [0,1], but I don't understand how this function measures the similarity: "Given a column profile and a data type relation profile, the mapping function returns a similarity score in the range". Can you provide a formula for how this is effectively measured?
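
For reference, two standard ways to obtain a score guaranteed to lie in [0,1] from a column profile and a relation profile are squashing a distance or squashing a learned score. This is a sketch of what such a formula could look like, not necessarily the function the authors use.

```python
import math

def inverse_distance_similarity(p_col, p_rel):
    # 1 / (1 + d) maps any non-negative distance d into (0, 1],
    # with identical profiles scoring exactly 1.
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(p_col, p_rel)))
    return 1.0 / (1.0 + d)

def sigmoid_similarity(raw_score):
    # If the score comes from a learned model (e.g. a Siamese-style
    # comparison of the two profiles), a sigmoid on its raw output
    # likewise guarantees the (0, 1) range.
    return 1.0 / (1.0 + math.exp(-raw_score))

print(inverse_distance_similarity([0.2, 0.5, 0.1], [0.3, 0.4, 0.1]))
```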

*The knowledge graph set was split into a training set (90%) and a test set (10%). Is this split used for all the other datasets? What happens if we keep 80% for training and 20% for testing?

*missing verb: In the case of a data table profile, these attributes the columns.
*check the correctness of the verb "assign" + "to" or "with" - seas:rank 2 (check spaces)
* check spaces in triples e.g., rdf:type