Knowledge-Graph-Based Semantic Labeling: Balancing Coverage and Specificity

Tracking #: 2237-3450

Authors: 
Ahmad Alobaid
Oscar Corcho

Responsible editor: 
Freddy Lecue

Submission type: 
Full Paper
Abstract: 
Many data are published on the Web using tabular data formats (e.g., spreadsheets). This is especially the case for data made available in open data portals, notably by public institutions. One of the main challenges for their effective (re)use is their generalized lack of semantics: column names are not usually standardized, and their meaning and content are not always clear. Recently, knowledge graphs have started to be widely adopted by some data and service providers as a means to publish large amounts of structured data. They use graph-based formats (e.g., RDF, graph databases) and often make references to lightweight ontologies. There is a common understanding that the reuse of such tabular data may be improved by annotating them with the types used by the data available in knowledge graphs. In this paper, we present a novel approach to automatically type tabular data columns with ontology classes referred to by existing knowledge graphs, for those columns whose cells represent resources (and not just property values). In contrast with existing proposals in the state of the art, our approach does not require the use of external linguistic resources or annotated data sources for training, nor the building of a model of the knowledge graph beforehand. In this work, we show that semantic annotation of entity columns can achieve good results compared to the state of the art, using the knowledge graph as a training set without any context information, external resources, or a human in the loop.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Inma Hernandez submitted on 27/Mar/2020
Suggestion:
Reject
Review Comment:

This manuscript proposes an approach for semantic labelling of web tables based on an external source, in which named entities are looked up in DBpedia in order to extract possible class names. Then, the most suitable class for each entity is selected by applying a proposed formula based on the concepts of specificity and coverage.

The idea, though not entirely novel, is interesting and worth exploring, and the manuscript is generally well written. However, the present article describes ongoing work that still needs considerable research effort to be completed, as well as an improved experimental evaluation.

Throughout the article, the authors describe several loose ends that need to be addressed before being able to successfully apply this proposal to real-world scenarios. Some examples are:
- The automatic identification of the entity column: a simple heuristic is proposed, based on the results by Cafarella et al. However, presence in the first column is only the second-heaviest feature; the strongest feature should be explored, or at least a justification must be given as to why it is not used in this proposal.
- The improvement of the search strategy: it is clear from the results that the naïve exact-search approach is not powerful enough.
- The knowledge graph must have a SPARQL endpoint that uses RDF, and that is not the case for many well-known KGs.
- The selection of an optimal value for alpha.
- What happens if the scores of the candidate classes are too low? Is there some kind of threshold to determine when it is not possible to find a suitable class? The authors mention that "the other top types in the list are picked if the first one is rejected by the user". Does that mean that this is a semi-supervised approach?
- The proposal does not work well when the entity name is split between several columns, which makes it necessary to manually pre-process the datasets.

Regarding the proposal:

- One of the main strengths of this proposal is its fully automated nature, i.e., the absence of a human in the loop. However, the solution provided by the authors for many of the aforementioned loose ends is precisely to ask for human intervention (which is a clear contradiction).
- Also, the authors compare (theoretically) their proposal with semi-automated techniques. It is not clear to me if they are referring to semi-supervised learning techniques that involve human interaction, or if they include supervised learning techniques in this category. In my opinion, the difference between supervised learning techniques that learn a model from an annotated dataset and the current proposal is that the annotated dataset is, in this case, DBpedia. Therefore, I believe that the authors should have included other supervised techniques in their experimental evaluation.
- The use of DBpedia as the information source limits this approach to named entities, and that is only one of its limitations. The authors should include a subsection with a discussion of the limitations of this technique.
- The authors focus their work on the semantization of the entity column, disregarding the others. It has been shown that introducing entity property features into the classification process leads to an improvement in classification performance (TAPON [1]). Therefore, that is something that I believe should be explored in this work.
- In my opinion, this proposal would be much more interesting with an integration perspective, i.e., if the information from several KGs was used to determine the suitable class for each entity.

Regarding the experimental evaluation:

- I think that comparing this technique with just a single baseline is not enough, and other proposals should be included in the experimentation, such as [3]. Also, I would suggest using exclusively well-known or reference datasets that have been used by other proposals, so that a comparison can be made. The Olympics dataset is a custom-made dataset not used before, which makes it less interesting. Finally, since the authors mention that working with tables in which the entity is not in the first column is not a problem, the experiments should have included such a dataset, to support that claim.
- The authors should report on the execution times, as well as on the function computation time, to illustrate the impact of issuing the DBpedia queries for each entity in the dataset.
- I do not think that having perfect precision and recall on the Olympics dataset is a strong point of the experimentation. It is not common to obtain such results unless the technique is somehow overfit to the dataset.
- In the experiments with T2Dv1 and T2Dv2, the differences between TADA and T2K are quite small; therefore, in order to draw conclusions, suitable statistical tests should be performed and reported (see the sketch after this list).
- The authors should clarify how the values in the first row of Table 2 were obtained, since T2K is only tested against T2Dv2, according to the provided reference.
- The incompleteness of knowledge graphs is mentioned as one of the reasons for lower performance measures. This is not a trivial matter: real-world knowledge graphs are known to be incomplete and noisy [2], which makes me wonder whether this proposal is suitable for other knowledge graphs besides DBpedia, as the authors claim.
- The authors say that less common entities can be more easily misclassified, and "This could be settled using other kinds of insights like properties in the tables. This also can be limited if the properties in the table do not exist in the knowledge graph". It has been proven that using the entity properties in an iterative fashion for semantic labelling does provide improved classification results, even if these properties are not named entities and are therefore not included in DBpedia [1]. The authors should explore this as well.
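
To make the statistical-test suggestion concrete, here is a minimal sketch of one suitable option for this paired setting, McNemar's test on per-column correctness (the boolean encoding and function names are my own illustration, not anything from the paper):

```python
# One possible significance test for the TADA vs. T2K comparison:
# McNemar's test on paired per-column correctness. The encoding (one
# boolean per evaluated column, same columns for both systems) is
# illustrative only.
from statsmodels.stats.contingency_tables import mcnemar

def compare_systems(correct_a, correct_b):
    """correct_a, correct_b: lists of booleans, one per evaluated column."""
    both    = sum(a and b for a, b in zip(correct_a, correct_b))
    a_only  = sum(a and not b for a, b in zip(correct_a, correct_b))
    b_only  = sum(b and not a for a, b in zip(correct_a, correct_b))
    neither = sum(not a and not b for a, b in zip(correct_a, correct_b))
    # Only the discordant counts (a_only, b_only) drive the statistic.
    return mcnemar([[both, a_only], [b_only, neither]], exact=True).pvalue
```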

Other minor remarks:
- The authors use footnotes profusely (e.g., page 2). Note that footnotes make the reader go back and forth and can make them lose their reading flow, which is why they are discouraged in many journals. I would recommend using them exclusively to provide links to implementations and tools, or information that is only loosely related to the text, but not for essential clarifications. As an example, references 37 and 38 could perfectly well be footnotes instead of citations.
- Figure 1 is well intended but not quite as effective as it could be, since it lacks some level of detail. I would recommend using it to describe an actual example of the application of the formula. For example, we see the sample hierarchy of classes "Person", "Athlete", "Basketball player" (the latter having the highest score), but it would be interesting to see what happens when several distinct class hierarchies could apply to a given popular name (e.g., John Smith). It would be useful to illustrate the values of formula 1 as well. Also, I don't think it is accurate to say that Figure 2 "provides an alternate view" of the approach. For me, Figure 2 is a high-level description of the workflow, and I would recommend presenting it earlier, before the current Figure 1, which instantiates this workflow in a specific example.

[1] Ayala, D., Hernández, I., Ruiz, D., & Toro, M. (2019). TAPON: A two-phase machine learning approach for semantic labelling. Knowledge-Based Systems, 163, 931-943.
[2] Paulheim, H. (2017). Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8(3), 489-508.
[3] Zhang, Z. (2017). Effective and efficient semantic table interpretation using tableminer+. Semantic Web, 8(6), 921-957.

Review #2
Anonymous submitted on 25/May/2020
Suggestion:
Reject
Review Comment:

## originality

The paper addresses the problem of assigning semantic types to columns in tables. The key contribution is an unsupervised algorithm to assign types from any KG, as long as it offers a SPARQL endpoint to a) retrieve entities based on labels, b) find the rdf:type of entities, c) find super-classes, and d) count the number of instances that belong to any class. These are all reasonable assumptions.
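
For concreteness, the four capabilities could be exercised as follows against the public DBpedia endpoint (a minimal sketch using SPARQLWrapper; the query shapes are illustrative, not necessarily the paper's exact queries):

```python
# A minimal sketch of the four assumed SPARQL capabilities, against the
# public DBpedia endpoint. Query shapes are illustrative only.
from SPARQLWrapper import SPARQLWrapper, JSON

PREFIXES = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
"""
endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setReturnFormat(JSON)

def run(query):
    endpoint.setQuery(PREFIXES + query)
    return endpoint.query().convert()["results"]["bindings"]

# a) retrieve candidate entities by (exact) label match
def entities_for_label(label, lang="en"):
    return run(f'SELECT ?e WHERE {{ ?e rdfs:label "{label}"@{lang} . }}')

# b) find the rdf:type(s) of a candidate entity
def types_of(entity_uri):
    return run(f"SELECT ?t WHERE {{ <{entity_uri}> rdf:type ?t . }}")

# c) find the direct super-classes of a class
def superclasses_of(class_uri):
    return run(f"SELECT ?s WHERE {{ <{class_uri}> rdfs:subClassOf ?s . }}")

# d) count the instances of a class
def instance_count(class_uri):
    rows = run(f"SELECT (COUNT(?e) AS ?n) WHERE {{ ?e rdf:type <{class_uri}> . }}")
    return int(rows[0]["n"]["value"])
```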

The paper makes several assumptions about the tables:
1- The column that contains the main entities of a table is the first column
2- The main entity of a table is defined in a single column (e.g., no first name and last name columns)
3- Candidates from the KG can be obtained by exact match on labels of entities in the KG
4- Columns contain a single entity class
5- The tables are vertical, with the first row containing the headers (or no headers are present)

The assumptions simplify the problem significantly, so much so that the original motivation of the paper (to semantically type large corpora of tables) cannot be achieved. While this is a problem, it is not a significant issue for the research problem, except for assumption 3: in most tables the cell values do not match the entities in the KG exactly, so some kind of approximate matching is necessary. This changes the problem significantly, because approximate matching generates a much larger number of candidates, vastly increasing ambiguity and problem difficulty.
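
To illustrate why assumption 3 matters, compare an exact label lookup with even a crude approximate one (a hypothetical substring filter in standard SPARQL 1.1; real systems would use a full-text index, widening the candidate set further):

```python
# Hypothetical contrast between exact and approximate candidate lookup.
# The substring filter is only one crude form of approximate matching.
def exact_lookup(label, lang="en"):
    return f'SELECT ?e WHERE {{ ?e rdfs:label "{label}"@{lang} . }}'

def approximate_lookup(label, lang="en"):
    return f'''SELECT ?e WHERE {{
        ?e rdfs:label ?l .
        FILTER (LANG(?l) = "{lang}" &&
                CONTAINS(LCASE(STR(?l)), "{label.lower()}"))
    }}'''
```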

The proposed algorithm is very simple: it is a counting algorithm that balances coverage and specificity. Given a type T, coverage is the number of cells that have candidates that are instances of T, and specificity is a measure of how specific the type is in the KG. The paper offers several measures of specificity. The scoring function is a linear combination of coverage and specificity, weighted using a parameter called alpha.
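
In code, this scheme fits in a few lines (a sketch; the instance-count-based specificity is only one of the measures the paper considers, and the normalization and names are mine):

```python
import math

def coverage(type_uri, cell_candidate_types):
    """Fraction of cells with at least one candidate entity of the given type.
    cell_candidate_types: one set of type URIs per cell in the entity column."""
    hits = sum(1 for types in cell_candidate_types if type_uri in types)
    return hits / len(cell_candidate_types)

def specificity(instance_count, total_instances):
    """One possible specificity measure: rarer classes score higher.
    The paper proposes several alternatives; this is just one option."""
    return 1.0 - math.log(instance_count + 1) / math.log(total_instances + 1)

def score(type_uri, cell_candidate_types, instance_count, total_instances, alpha):
    """Linear combination of coverage and specificity, weighted by alpha."""
    return (alpha * coverage(type_uri, cell_candidate_types)
            + (1 - alpha) * specificity(instance_count, total_instances))
```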

The authors offer no method to pick alpha, and in their experiments they mention that different values of alpha are appropriate for different tables. The evaluation has methodological problems: the corpus should have been divided into a development corpus used to estimate the best alpha and a test corpus evaluated with that single fixed alpha.
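
The fix is a standard protocol, along the lines of the following sketch (the 50/50 split, the alpha grid, and the function names are my assumptions):

```python
import random

def pick_alpha(tables, annotate, f1, grid=None, seed=0):
    """Estimate alpha on a development split, then report F1 on a held-out
    test split with that single fixed alpha. `annotate(table, alpha)` and
    `f1(prediction, table)` stand in for the system and its evaluation."""
    grid = grid if grid is not None else [i / 10 for i in range(11)]
    tables = list(tables)
    random.Random(seed).shuffle(tables)
    half = len(tables) // 2
    dev, test = tables[:half], tables[half:]
    best_alpha = max(grid, key=lambda a: sum(f1(annotate(t, a), t) for t in dev))
    test_f1 = sum(f1(annotate(t, best_alpha), t) for t in test) / len(test)
    return best_alpha, test_f1
```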

The related work section is incomplete, as the ISWC 2019 SemTab challenge included several systems that are not mentioned in this paper:
https://www.aicrowd.com/challenges/iswc-2019-column-type-annotation-cta-...
In addition, round 1 of the ISWC challenge included one of the datasets used in this paper (T2Dv2), and two of the systems that participated in the challenge obtained an F1 score of 1.0 (higher than the scores reported in this paper).

The paper should address some of the subtle aspects of the problem, as they are related to the relatively simple metric used to select a semantic type:
1- Imperfections in the KG: many tables do contain a well-defined semantic type, but often the KG is missing the most specific semantic types for all instances, or may contain incorrect semantic types. The paper should address sensitivity to this problem.
2- Many tables contain well-defined semantic types, but the KG model is conceptually different. For example, a table can contain both musical groups and solo artists (e.g., winners of awards), but the KG models groups and people as different classes. In this case there is no single low-level class that characterizes the column. How can this type of problem be addressed?

## significance of the results

The results in this paper are not significant because the evaluation only considers a very simple baseline. The ISWC challenge included many different approaches, and it is necessary to compare with them.

There is no discussion of how to set the alpha hyperparameter, nor of the sensitivity to alpha across the different datasets.

A discussion section should provide insight into the types of tables where the algorithm performs poorly. A hint is given regarding prevalent classes, but this analysis should be systematic.

There is no systematic discussion of execution time, broken down by the different subtasks.

## quality of writing

The writing is OK, although in places it is a bit informal, e.g., "took around 7 hours".


## comments

When you say

"most of such data are still being published at most with 2 or 3 stars,that is, using
spreadsheets (CSVs or Excel files)"

I guess it would be more precise to write

"most of such data are still being published at most with 2 or 3 stars,that is, using
spreadsheets (EXCEL OR CSVs files)"

if you want others to think that Excel=2* and CSV=3*
-- and please remember that only old specifications of Excel are actually non-open...