One-shot HDT-based Semantic Labeling of Entity Columns in Tabular Data

Tracking #: 2474-3688

Authors: 
Ahmad Alobaid
Oscar Corcho
Wouter Beek

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
Abstract: 
A lot of data are shared across organisations and on the Web in the form of tables (e.g., CSV). One way to facilitate the exploitation of such data, and to allow their content to be understood, is to apply semantic labeling techniques, which assign ontology classes to tables (or parts of them) and properties to their columns. As a result of the semantic labeling process, such data can then be exposed as virtual or materialised RDF (e.g., by using mappings), and hence queried with SPARQL. We propose a one-shot semantic labeling approach to learn the classes to which the resources represented in a tabular data source belong, as well as the properties of entity columns. In comparison to some of our previous approaches, this approach exploits the fact that the knowledge base used as an input source is only available in the RDF HDT binary format. We evaluate our approach with the T2Dv2 dataset. The results show that our approach achieves results that are competitive with state-of-the-art approaches, without the need for a full-fledged query language (e.g., SPARQL) or for profiling of knowledge bases.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Miel Vander Sande submitted on 04/Jun/2020
Suggestion:
Major Revision
Review Comment:

This paper introduces a technique for automatically adding semantic labels to tabular data (e.g., CSV files) using the searchable compressed HDT format.
The approach performs subject-column (i.e., the column that represents the main entity of a row) semantic labeling, scoring candidate classes according to their coverage and specificity; and property-column (i.e., columns that represent entities connected to the main entity through properties) semantic labeling with three prediction methods: restrictive (only existing relations between subject and property entities are considered), permissive (uses the classes and properties at the schema level), and heuristic (a combination of both in which the restrictive outcome has a higher confidence).
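For concreteness, here is a toy Python sketch of the kind of coverage/specificity class scoring the summary describes. Every name and both formulas are assumptions made purely for illustration; they are not the paper's actual equations, which (as noted below) remain under-explained.

```python
from collections import Counter

def score_candidate_classes(cell_candidate_types, instance_count, parent, alpha=0.5):
    """Toy coverage/specificity scoring for a subject column; NOT the
    paper's exact formulas. cell_candidate_types: one collection of
    candidate classes per cell; instance_count: class -> number of KB
    instances; parent: class -> superclass (absent for roots)."""
    n_cells = len(cell_candidate_types)
    hits = Counter()
    for types in cell_candidate_types:
        for t in set(types):
            hits[t] += 1

    scores = {}
    for t, n in hits.items():
        coverage = n / n_cells  # fraction of cells the class can explain
        if t in parent and instance_count.get(parent[t], 0) > 0:
            # A narrow class has far fewer instances than its parent.
            specificity = 1.0 - instance_count.get(t, 0) / instance_count[parent[t]]
        else:
            specificity = 0.0  # root classes get no specificity bonus
        scores[t] = alpha * coverage + (1 - alpha) * specificity
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Under such a scheme, a column whose cells are all typed both dbo:Person and dbo:Scientist would prefer dbo:Scientist: coverage ties, so the narrower class wins on specificity.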

Unfortunately, the paper has many major flaws, and I'm not sure if they can all be fixed with a major revision. Some seem to be inherent to the approach.
I summarise them as follows:

- the approach is quite naive, using exact string matching (which becomes painfully clear when "using title case" is proposed as a way to improve the results) and recursive queries. Recent work using (un)supervised ML techniques seems much more appropriate for this task, but is completely disregarded. Would training a model for semantic labeling from HDT not have been more interesting?
- there is no clear methodology that sets goals, research questions, hypotheses, or a validation of those. Therefore, it's easy to succeed and impossible to fail.
- in the introduction, the approach is oversold as a solution for tabular data such as spreadsheets with any HDT file, but the paper only provides (flawed) evidence for labeling tabular data in HTML pages (Web tables) with DBpedia. The paper should be reworked to omit such false claims.
- the main contribution, the use of HDT, is irrelevant to the approach (it could be any TP-based index) and is basically a self-inflicted limitation. It does not seem to have any impact on the speed or quality of the system, which is also impossible to tell from the evaluation. SPARQL can also be used over HDT, but this is not considered, not even for the sake of comparison.
- the reporting of related work is incomplete and the comparison against it is poor. How does it compare, for example, to the Steiner tree approach in Karma? Why is there no comparison against ML-based approaches?
- the math that supports the scoring is under-explained and not very well motivated. How did you construct these metrics and what did you base them on?

Some more detailed comments, per page and per line:

p2 left

l15: subject & property columns are a very limited scope given the variety of CSVs out there.
l27: HDT is just another index; how is this a contribution? Both contributions are in fact the same. They could be reworked.
l35: “The majority of the entities in tabular data exist in the HDT file.” Typically you’d have either very specific or very general-purpose KBs. What is the entity ratio between the CSV and the HDT?
l38: with “natural language” do you mean the language or the style of the language?

p2 right
l17: The part on "Searching in HDT" is verbose and redundant.
l27: needs rephrasing
l29: this sentence is missing some commas
l33: T2K was not introduced before.

p3 left

l46: "The second column also contain entities name." -> "The second column also contains the names of entities."

p3 right

l43: dbr:Unite_States -> dbr:United_States

p4 left
l3: how do you determine what is a subject column and what is a property column?
l26: using exact matches won't give you much value in the real world. Even if the language is the same, the entries will be messy. Why was the free-text search R-index that comes with HDT not explored? The same holds for the disambiguation: even if the subject column returns results, the remainder of the row might not. The whole approach starts off quite naive (see the sketch after these comments).
l45: Listing 2 is unclear. Is entity-uri a variable?
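To make the l26 point concrete, a minimal sketch of a more forgiving cell-to-label lookup than exact matching. difflib is only a stand-in here; the R-index full-text search mentioned above, or a proper inverted index, would be the scalable choice. Function names and the cutoff are assumptions, not from the paper.

```python
import difflib
import re

def normalize(label: str) -> str:
    """Lower-case a label and strip punctuation and redundant whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", label)
    return re.sub(r"\s+", " ", cleaned).strip().lower()

def fuzzy_label_lookup(cell_value, kb_labels, cutoff=0.9):
    """Return KB labels whose normalized form is close to the cell value.
    A toy alternative to exact matching, not the paper's method."""
    index = {normalize(label): label for label in kb_labels}
    close = difflib.get_close_matches(normalize(cell_value), index.keys(),
                                      n=5, cutoff=cutoff)
    return [index[c] for c in close]
```

For instance, fuzzy_label_lookup("united states", ["United States", "United Kingdom"]) matches "United States" despite the casing, where exact matching would fail.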

p4 right
l28: Listing 4 and the paragraph below it introduce a huge dependency on the presence of RDFS information. This is not an issue per se, but it does limit the generalisation of the approach. A list of assumptions should be stated clearly at the beginning of the paper. With respect to RDFS, it would have been interesting to see how it could be replaced or complemented by a SHACL/ShEx schema.
l44: how is the "hash of classes" constructed?

p5 left
l8: the title of 4.3 is a bit weird and redundant; I suggest removing it and making the content part of 4.4.
l35: What is the rationale behind this formula? What makes you believe it will work properly? These are things that should be well explained and motivated in a scientific paper.

l44: Does the approach assume that classes in a cell always have a shared root? The coverage score certainly hints in that direction, but I can't really tell from the text.

p5 right
l2: The caption of Table 1 refers to an unknown chapter.
l23: what previous work?
l25-27: I don't really get what is explained here. Wouldn't rating exact matches of cell values to any attribute as high quality severely increase the chance of wrongly disambiguating an entity?

p6 left
l22: "measuring the number of instances..." is not really what is shown in (5), which is a ratio.
l36: Eq. 6 is not properly explained, let alone the rationale behind it. Is this a recursive function? Does the dot at the end mean you're doing a permutation?
l50: Does this mean one or more classes are applied to all entities in the entire column? What if the subjects have mixed types?

p8 left
l33: The examples and the dependence on ancestors suggest that this approach is tailored to DBpedia. The experimental results based on T2Dv2 won't be able to show otherwise, because T2Dv2 only gives you insights for DBpedia. Furthermore, T2Dv2 only contains Web tables, which are much cleaner and smaller than the CSVs that go around. Hence, this should be made explicit in the title and introduction.

p8 right
l18: Why do you filter out properties that do not represent entities? You can't possibly correctly measure the recall of your approach this way.

p9 left
l6: no need to repeat the well-known measures precision, recall, and F1-score (which is, by the way, pretty confusing since you introduce an f-score of your own).
l40: what are the characteristics of the tested tables? How many rows? How many columns?
l42: again, don't ignore parts of the gold standard, like tables that have no entities in the HDT. Knowing that your approach does (not) produce false positives is also valuable.

p9 right
l30: without knowing the T2K Match runtimes, there is no way to interpret these timing results. Is this good, bad, the same? 2 hours and 46 minutes seems long, which defeats the purpose of HDT: speed.

p10 left
l32: how was the T2K Match experiment set up? Did they also alter the gold standard before the experiment? Was the comparison fair? There is no way of knowing from what is reported.

p11 left
l1-11: the authors describe related work but do not position it relative to their own work. Why and how are these approaches different? What problems do they solve that you don't?
l15: From the related work, it seems like T2K Match is the only possible work to compare against. You can use SPARQL on HDT by using a framework like Jena or Comunica. Why was this not part of the evaluation? In fact, after reading the evaluation, it seems like HDT is completely irrelevant to your approach. Any TP-based index would do.
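For context on the l15 comment: running SPARQL directly over an HDT file takes only a few lines. A minimal sketch, assuming the third-party rdflib-hdt package and a local dbpedia.hdt file (both are illustrative choices; hdt-java with Jena, or Comunica, would work just as well):

```python
from rdflib import Graph
from rdflib_hdt import HDTStore  # pip install rdflib-hdt

# Expose the memory-mapped HDT file as an ordinary rdflib store.
g = Graph(store=HDTStore("dbpedia.hdt"))  # file path is an assumption

# Full SPARQL now runs over the HDT triples, no endpoint required.
query = """
SELECT ?type (COUNT(?s) AS ?n)
WHERE { ?s a ?type }
GROUP BY ?type
ORDER BY DESC(?n)
LIMIT 10
"""
for row in g.query(query):
    print(row.type, row.n)
```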

p12 left
l4-5: "competitive results in comparison to state-of-the-art approaches": comparing to a single system is not enough to claim that. There are many more works outthere (e.g., the ML ones mentioned in related work, or for example the system Karma) that are also part of SOTA.

Review #2
Anonymous submitted on 15/Jun/2020
Suggestion:
Major Revision
Review Comment:

REVIEW FOR
ONE-SHOT HDT-BASED SEMANTIC LABELING OF ENTITY COLUMNS IN TABULAR DATA

The paper discusses an approach for *semantic labeling* over tabular data such as CSV, TSV, and XLSX.
In a way, the flat nature of tables is transformed into a linked data representation.
Particularly, the paper tackles the problem of aligning table columns to classes and properties of knowledge bases (KBs).
The main contribution of the paper as claimed by the authors is the usage of HDT in assisting the semantic labeling approach.
The approach is evaluated over the T2Dv2 dataset, consisting of semantic correspondences between 770+ web tables and DBpedia.

I organize my feedback based on where issues occur. Please find it below.

ABSTRACT
- The abstract could be improved by adding some motivation as to why HDT is worth exploiting in semantic labeling.
- ".. our approach achieves competitive results .." -> This could be further elaborated as to what aspects are achieved (e.g., wrt to runtime, accuracy, etc).

INTRODUCTION
- The introduction could emphasize the novelty and usefulness of using HDT for semantic labeling.
- The term "entity column" could be better introduced and exemplified here in Introduction as it appears in the title anyway.
- It would be interesting to see a diagram of a general architecture of semantic labeling and how HDT may fit in in that architecture (+ how HDT could enhance semantic labeling).
- ".. we observed several drawbacks in them: the reliance on external sources of knowledge .." -> Isn't using HDT more or less the same thing (that is, relying on an external source)?
- ".. the bottle neck in these systems was in querying the
SPARQL endpoints.. " -> How severe is the bottleneck?
- ".. it is limited in the kind of supported queries, which do
not cover the full range of SPARQL expressivity .." -> I would be interested to know: Which SPARQL constructs are necessary for semantic labeling task and which are optional but good to have? Then, what constructs are missing when using HDT?
- The introduction could use more illustrative examples wrt. the problem being tackled.
- Pg. 2: The assumptions of the approach, how reasonable are they? What if one of them is not met, any fallback plan? This is crucial to assess the robustness of the proposed approach.

HDT
- Perhaps, it could be clarified further how HDT works, not just what HDT is.
- More elaborate discussion as to what are the pros and cons of using HDT in general (and in its potential use for semantic labeling) could be added. Examples: How long does it take to set up a KB using the HDT format for the first time? How long is the HDT index creation? How does one maintain KB updates in HDT? How mature are HDT tools?
- "Searching in HDT supports a subset of SPARQL triple .." -> In assessing how this could affect the performance of semantic labeling, one should have a precise definition as to what semantic labeling is, how is the general architecture, and what (querying feature) is needed for the semantic labeling to work well.
- ".. as a requirement for T2K to run on their machines .." -> Could introduce T2K first.
- In general, this HDT section could explain HDT in more detail. Also, hints as to how HDT could be a game-changer in semantic labeling could be given when explaining how HDT works.
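As background for the "subset of SPARQL" bullet above: out of the box, an HDT index answers single triple patterns with wildcards, and nothing more; joins, filters, and aggregation have to be built on top. A minimal sketch, assuming the rdflib-hdt bindings and a hypothetical dbpedia.hdt file:

```python
from rdflib import URIRef
from rdflib_hdt import HDTDocument  # pip install rdflib-hdt

doc = HDTDocument("dbpedia.hdt")  # hypothetical file

# One triple pattern per call; None acts as a wildcard.
subject = URIRef("http://dbpedia.org/resource/Richard_Feynman")
triples, cardinality = doc.search((subject, None, None))
print("estimated matches:", cardinality)
for s, p, o in triples:
    print(p, o)
```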

EXAMPLE
- DBpedia could be briefly introduced first.
- Fig. 1 could be improved. At the moment, the separation between the HDT file and the scientist table is not so clear, and there is too much empty space. Also, why this ordering (HDT file first, followed by the table), when the explanation mentions the table first?
- ".. we assume that both Richard Feynman and Bertrand Russell have the
class dbo:Scientist as the rdf:type .." -> Why aren't these triples shown in the HDT file, there does not seem to be a space-issue?
- The explanation in Sec. 3 is somewhat limited. It is still unclear what the purpose of the example is, what process is going on here, what the role of the HDT is, and how HDT could help the mapping/labeling process (and how different that is compared to, say, using full SPARQL).

SUBJECT-COLUMN SEMANTIC LABELING
- ".. we present our one-shot semantic labeling .." -> Could provide more detail as to what distinguishes one-shot vs multiple shots in semantic labeling task?
- The approach relies on entity linking for each cell in the subject column, which itself actually is not a trivial task in general (due to name variations, long tail problems, etc). Any clarification on this?
- "Two cases are considered for utilizing the
in-table context. First .." -> What is the motivation and intuition for the consideration?
- Notation issue: For each listing, there is the ? as a placeholder for a variable. Perhaps a standard notation for variables could be used (say, as given in https://www.w3.org/TR/sparql11-query/#rVARNAME)?
- In general, one might expect the explanation for each step using the (running) example given in Sec. Example. This could greatly improve the readability.
- Wrt. Sec. 4.2: It seems that the completeness assumption is quite strong: needs both entity URIs to exist and typing information to exist. How to handle long tail entities which do not exist (both for entity URIs and typing information)?
- Sec. 4.3 could be further motivated.
- Sec. 4.4 is the most crucial here. I would expect a better motivation for the balancing between coverage and specificity (see the illustrative formula after this list). Also, how does one decide the value of \alpha?
- "Notation for scoring functions in Chapter ??" -> What chapter?
- "Another aspect that we did not use in the previous
work .." -> Which work?
- The explanation of dp could be further clarified.
- "The higher the number of candidate entities or the number of types assigned to an entity, the lower the score of Ic is for the type t." -> I am not sure if I could understand this. Regarding the number of types vs the type t, the issue here is that, the authors try to compute the number of types for a given, known, type t? That needs to be clarified.
- Any equation in the section could be improved by providing an example.
- There seems to be a recursion in Eq. 3 wrt. L_c. What is the base case?
- Sec. 4.4.2: ".. the specificity of a class/type t (how narrow a class is) by measuring the number of instances of a type t and the number of instances of its parent pr(t)" -> In computing this, do you also consider subClassOf inference w.r.t. the underlying ontology? It might be the case that the subClassOf semantics is not enforced by the KG (that is, the stored KG may contain an entity of a type A, but not of type B, despite A being a subclass of B). Or am I missing something? (See the illustrative formula after this list.)
- Regarding this section, and the next section (Property-Column Semantic Labeling), I would expect more emphasis on the novelty aspect.
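To make the two scoring questions flagged above concrete, here is one plausible reading, assumed purely for illustration; it is not taken from the paper, whose exact formulas remain under-explained:

\[
\mathit{score}(t) = \alpha \cdot \mathit{coverage}(t) + (1 - \alpha) \cdot \mathit{specificity}(t),
\qquad
\mathit{specificity}(t) = 1 - \frac{|I(t)|}{|I(\mathit{pr}(t))|}
\]

where \(I(t)\) is the set of stored instances of type \(t\) and \(\mathit{pr}(t)\) its parent class. Under this reading, the subClassOf caveat above is immediate: if the KG does not materialise rdfs:subClassOf inferences, an entity can be typed \(A\) without being typed its superclass \(B\), so \(|I(t)|\) may exceed \(|I(\mathit{pr}(t))|\) and the specificity term falls outside \([0,1]\).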

PROPERTY-COLUMN SEMANTIC LABELING
- Again, the use of a (running) example could enhance the readability.
- "To compensate that, multiple simple patterns are applied
sequentially .." -> This should be elaborated further.
- The permissive technique could benefit from sampling for a potential speed-up (i.e., it does not really have to consider all entities in the class; see the sketch after this list).
- As the algorithms are given (Alg. 1 and 2), could they be contrasted better? One could also rely on an example.
- I was wondering whether the column names could help in determining the right property? Or does your approach concentrate on using the statistics of the relations between entities?
- Regarding this section and the previous section, what is the theoretical runtime of the algorithms?
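A minimal sketch of the sampling idea suggested above for the permissive technique; the helper name and the sample size are assumptions, not from the paper:

```python
import random

def sample_class_instances(instances, max_sample=200, seed=42):
    """Uniformly sample a class's instances before probing their
    properties, trading a little recall for a large speed-up over
    scanning every instance of the class. Hypothetical helper."""
    instances = list(instances)
    if len(instances) <= max_sample:
        return instances
    return random.Random(seed).sample(instances, max_sample)
```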

EVALUATION
- "In this section, we evaluate the performance of our
semantic labeling approach to label subject columns
and property columns. We measure the performance
using precision, recall, and F1 score." -> Could add an explanation/motivation regarding how this affects the end-to-end performance of semantic labeling.
- "We performed our experiments on the T2Dv2 dataset .." -> Perhaps, provide more background on the dataset. What are the general characteristics of these 237 tables? Any assumptions? Any peculiarities? Any example? Any explanation regarding what entities are stored in the table, are they long tail?
- The version of DBpedia is from 2016. In general, I was wondering what would happen if there are tables with 'newer' entities?
- Wrt. semantic labeling of subject columns, I was wondering how your approach could handle aliases? The current approach seems to be very rigid.
- "We gathered all annotated properties in the gold standard of the 234 (there are 3 files that have missing classes). The application generates all properties (except for the subject .." -> What application?
- "Next, the application take these filtered properties and
discard properties which are not found in the HDT for
the given class .." -> How much is the missing rate?
- Eqs. 8, 9, and 10 are shown too prominently despite being common equations.
- ".. system. For the disambiguity penalty, we use dp=2 .." -> Why two? How often do disambiguities occur?
- ".. experiment for applying semantic labeling to subject columns with exact-match took around 118 seconds" -> Is this for all or individual tables? Could be clarified also: average matching time per table.
- "We also compare the performance of our approach with T2K Match .." -> More context is required regarding what T2K match generaly does (differences to the proposed approach).
- "The restrictive technique took around 5 seconds, while the
experiment with permissive technique took 9974 seconds
( 2 hours and 46 minutes) .." -> Again, for all tables? Any insight how long is it on average to handle 1 table or 1 cell?
- It seems that more extensive experiments could be added. One might add runtime and accuracy comparison to several non-HDT approaches, and also add more datasets (e.g., https://www.cs.ox.ac.uk/isg/challenges/sem-tab/).

RELATED WORK
- The discussion on labeling techniques could give more insights wrt. pros and cons in using ML techniques vs. non-ML techniques.
- The second paragraph of Sec. 7.3 is too long and could be written more clearly.
- "in comparison to the previous work [11], we present the key differences below:" -> What are the benefits in imposing the differences? The improvements wrt the key differences could be validated using experiments.

OVERALL
Overall, I believe that the paper may shed some light on the practical aspects of semantic labeling over tables. I suggest that the authors substantially improve the paper based on the above feedback in order to realize that, in particular regarding the aspects of motivation, depth, novelty, and writing structure.

Typos:
- Pg. 1 and other parts: Data is or data are? As long as the usage is consistent, I think either is fine.
- Pg. 3: dbr:Unite_States -> dbr:United_States
- There are some others. In general, the paper could be checked again for typos.