Review Comment:
REVIEW FOR
ONE-SHOT HDT-BASED SEMANTIC LABELING OF ENTITY COLUMNS IN TABULAR DATA
The paper discusses an approach for *semantic labeling* over tabular data such as CSV, TSV, and XLSX.
In effect, the flat structure of tables is transformed into a linked-data representation.
Particularly, the paper tackles the problem of aligning table columns to classes and properties of knowledge bases (KBs).
The main contribution of the paper as claimed by the authors is the usage of HDT in assisting the semantic labeling approach.
The approach is evaluated over the T2Dv2 dataset, consisting of semantic correspondences between 770+ web tables and DBpedia.
I organize my feedback based on where issues occur. Please find it below.
ABSTRACT
- The abstract can be improved by adding some motivation as to why HDT is worth exploiting for semantic labeling.
- ".. our approach achieves competitive results .." -> This could be further elaborated as to what aspects are achieved (e.g., wrt to runtime, accuracy, etc).
INTRODUCTION
- The introduction may emphasize the novelty and usefulness of using HDT for semantic labeling.
- The term "entity column" could be better introduced and exemplified here in Introduction as it appears in the title anyway.
- It would be interesting to see a diagram of a general architecture of semantic labeling and how HDT fits into that architecture (+ how HDT could enhance semantic labeling).
- ".. we observed several drawbacks in them: the reliance on external sources of knowledge .." -> Isn't using HDT more or less the same thing (that is, relying on an external source)?
- ".. the bottle neck in these systems was in querying the
SPARQL endpoints.. " -> How severe is the bottleneck?
- ".. it is limited in the kind of supported queries, which do
not cover the full range of SPARQL expressivity .." -> I would be interested to know: Which SPARQL constructs are necessary for semantic labeling task and which are optional but good to have? Then, what constructs are missing when using HDT?
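  To make the question concrete, here is a minimal sketch, assuming the pyHDT Python bindings (pip install hdt); the file name is hypothetical. HDT natively answers single triple patterns, and anything beyond that must be composed client-side:

      from hdt import HDTDocument

      doc = HDTDocument("dbpedia.hdt")  # hypothetical HDT dump

      # Natively supported: a single triple pattern; empty strings act as wildcards.
      triples, cardinality = doc.search_triples(
          "http://dbpedia.org/resource/Richard_Feynman",
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
          "")
      for s, p, o in triples:
          print(o)

      # Not natively supported: joins, FILTER, OPTIONAL, UNION, aggregation
      # (COUNT/GROUP BY), and property paths. These must be emulated by chaining
      # pattern lookups in application code, which is exactly where the
      # expressivity question above matters.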
- The introduction could use more illustrative examples wrt. the problem being tackled.
- Pg. 2: How reasonable are the assumptions of the approach? What if one of them is not met; is there any fallback plan? This is crucial for assessing the robustness of the proposed approach.
HDT
- Perhaps, it could be clarified further how HDT works, not just what HDT is.
- More elaborate discussion as to what are the pros and cons of using HDT in general (and in its potential use for semantic labeling) could be added. Examples: How long does it take to set up a KB using the HDT format for the first time? How long is the HDT index creation? How does one maintain KB updates in HDT? How mature are HDT tools?
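  As a concrete illustration of what answering these questions could look like, a minimal sketch assuming the pyHDT bindings (pip install hdt); the file path is hypothetical:

      import time
      from hdt import HDTDocument

      start = time.time()
      # Opening an .hdt file for querying requires a sidecar index
      # (.hdt.index.v1-1); if it is missing, it is generated on first load,
      # which typically dominates the one-time setup cost.
      doc = HDTDocument("dbpedia.hdt")
      print("load (+ index generation if needed): %.1fs" % (time.time() - start))
      print("total triples:", doc.total_triples)

  Reporting such numbers (conversion time with rdf2hdt, index-generation time, resulting file sizes) would answer most of the questions above.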
- "Searching in HDT supports a subset of SPARQL triple .." -> In assessing how this could affect the performance of semantic labeling, one should have a precise definition as to what semantic labeling is, how is the general architecture, and what (querying feature) is needed for the semantic labeling to work well.
- ".. as a requirement for T2K to run on their machines .." -> Could introduce T2K first.
- In general, this HDT section could explain HDT in more detail. Also, hints as to how HDT could be a game-changer in semantic labeling could be given when explaining how HDT works.
EXAMPLE
- DBpedia could be briefly introduced first.
- Fig. 1 could be improved. At the moment, the separation between the HDT file and the scientist table is not so clear and there is too much empty space. Also, why is the HDT file shown first, followed by the table, when the explanation mentions the table first?
- ".. we assume that both Richard Feynman and Bertrand Russell have the
class dbo:Scientist as the rdf:type .." -> Why aren't these triples shown in the HDT file, there does not seem to be a space-issue?
- The explanation in Sec. 3 is somewhat limited. It is still unclear what the purpose of the example is, what process is going on, what the role of the HDT is, and how HDT could help the mapping/labeling process (and how this differs from, say, using full SPARQL).
SUBJECT-COLUMN SEMANTIC LABELING
- ".. we present our one-shot semantic labeling .." -> Could provide more detail as to what distinguishes one-shot vs multiple shots in semantic labeling task?
- The approach relies on entity linking for each cell in the subject column, which itself actually is not a trivial task in general (due to name variations, long tail problems, etc). Any clarification on this?
- "Two cases are considered for utilizing the
in-table context. First .." -> What is the motivation and intuition for the consideration?
- Notation issue: In each listing, "?" is used as a placeholder for a variable. Perhaps standard SPARQL variable notation could be used instead, i.e., named variables such as ?x or ?type (see https://www.w3.org/TR/sparql11-query/#rVARNAME).
- In general, one might expect the explanation for each step using the (running) example given in Sec. Example. This could greatly improve the readability.
- Wrt. Sec. 4.2: The completeness assumption seems quite strong: both the entity URIs and their typing information need to exist. How are long-tail entities handled, for which the entity URI and/or the typing information is missing?
- Sec. 4.3 could be further motivated.
- Sec. 4.4 is the most crucial here. I would expect a better motivation for the balancing between coverage and specificity (an illustration of what such a balance could look like is given below). Also, how does one decide the value of \alpha?
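  As a hedged illustration (not necessarily the paper's formulation): if coverage(t) and specificity(t) are both normalized to [0,1], a natural combination would be score(t) = \alpha * coverage(t) + (1 - \alpha) * specificity(t), with \alpha chosen on a held-out validation split. The paper should state explicitly whether it uses such a linear form and how \alpha was selected.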
- "Notation for scoring functions in Chapter ??" -> What chapter?
- "Another aspect that we did not use in the previous
work .." -> Which work?
- The explanation of dp could be further clarified.
- "The higher the number of candidate entities or the number of types assigned to an entity, the lower the score of Ic is for the type t." -> I am not sure if I could understand this. Regarding the number of types vs the type t, the issue here is that, the authors try to compute the number of types for a given, known, type t? That needs to be clarified.
- Any equation in the section could be improved by providing an example.
- There seems to be a recursion in Eq. 3 wrt. L_c. What is the base case?
- Sec. 4.4.2: ".. the specificity of a class/type t (how narrow a class is) by measuring the number of instances of a type t and the number of instances of its parent pr(t)" -> In computing this, do you also consider subClassOf inference w.r.t. the underlying ontology? The subClassOf semantics might not be enforced by the KG: the stored KG may contain an entity of type A but not of type B, despite A being a subclass of B. Or am I missing something? (A sketch of the difference is given below.)
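  A minimal sketch of the difference using rdflib; the file and class IRIs are placeholders:

      from rdflib import Graph

      g = Graph()
      g.parse("kb_sample.nt", format="nt")  # hypothetical extract of the KG

      PREFIXES = """
          PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
          PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      """
      cls = "<http://dbpedia.org/ontology/Person>"

      # Instances explicitly typed with the class (what plain triple-pattern
      # counting over HDT would give).
      direct = g.query(PREFIXES +
          ("SELECT (COUNT(DISTINCT ?e) AS ?n) WHERE { ?e rdf:type %s . }" % cls))

      # Instances under rdfs:subClassOf closure (requires SPARQL 1.1 property
      # paths, i.e., more than HDT's native triple-pattern access).
      inferred = g.query(PREFIXES +
          ("SELECT (COUNT(DISTINCT ?e) AS ?n) WHERE { ?e rdf:type/rdfs:subClassOf* %s . }" % cls))

      print(list(direct), list(inferred))

  If the two counts differ substantially, the reported specificity values will depend on which of them is used.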
- Regarding this section, and the next section (Property-Column Semantic Labeling), I would expect more emphasis on the novelty aspect.
PROPERTY-COLUMN SEMANTIC LABELING
- Again, the use of a (running) example could enhance the readability.
- "To compensate that, multiple simple patterns are applied
sequentially .." -> This should be elaborated further.
- The permissive technique could benefit from sampling for a potential speed-up (i.e., it does not have to consider all entities in the class); a sketch of the idea is given below.
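  A minimal sketch of the sampling idea; properties_of() is a hypothetical stand-in for the paper's per-entity property lookup (e.g., one HDT triple-pattern search per entity):

      import random

      def sampled_property_frequencies(entities, properties_of, k=1000, seed=42):
          # Estimate how often each property occurs among the instances of a
          # class from a random sample instead of all instances, trading a
          # little estimation accuracy for a potentially large speed-up.
          random.seed(seed)
          sample = random.sample(entities, min(k, len(entities)))
          counts = {}
          for entity in sample:
              for prop in properties_of(entity):
                  counts[prop] = counts.get(prop, 0) + 1
          return {p: c / len(sample) for p, c in counts.items()}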
- As the algorithms are given (Alg. 1 and 2), could they be contrasted better? One could also rely on an example.
- I was wondering if the column names could help in determining the right property? Or does your approach concentrate on using the statistics of the relations between entities?
- Regarding this section and the previous section, what is the theoretical runtime of the algorithms?
EVALUATION
- "In this section, we evaluate the performance of our
semantic labeling approach to label subject columns
and property columns. We measure the performance
using precision, recall, and F1 score." -> Could add an explanation/motivation regarding how this affects the end-to-end performance of semantic labeling.
- "We performed our experiments on the T2Dv2 dataset .." -> Perhaps, provide more background on the dataset. What are the general characteristics of these 237 tables? Any assumptions? Any peculiarities? Any example? Any explanation regarding what entities are stored in the table, are they long tail?
- The version of DBpedia is from 2016. In general, I was wondering what would happen if there are tables with 'newer' entities?
- Wrt. semantic labeling of subject columns, I was wondering how your approach could handle aliases? The current approach seems to be very rigid.
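  One concrete possibility (a sketch, not necessarily compatible with the paper's pipeline): besides exact label matches, candidate lookup could follow alias edges such as dbo:wikiPageRedirects, which is a single triple pattern and therefore cheap in HDT:

      from hdt import HDTDocument

      doc = HDTDocument("dbpedia.hdt")  # hypothetical dump

      def alias_candidates(entity_iri):
          # Resources whose redirect points to entity_iri are aliases of it.
          triples, _ = doc.search_triples(
              "", "http://dbpedia.org/ontology/wikiPageRedirects", entity_iri)
          return [s for s, _, _ in triples]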
- "We gathered all annotated properties in the gold standard of the 234 (there are 3 files that have missing classes). The application generates all properties (except for the subject .." -> What application?
- "Next, the application take these filtered properties and
discard properties which are not found in the HDT for
the given class .." -> How much is the missing rate?
- Eq. 8, 9, and 10 are shown too prominently despite being common equations.
- ".. system. For the disambiguity penalty, we use dp=2 .." -> Why two? How often do disambiguities occur?
- ".. experiment for applying semantic labeling to subject columns with exact-match took around 118 seconds" -> Is this for all or individual tables? Could be clarified also: average matching time per table.
- "We also compare the performance of our approach with T2K Match .." -> More context is required regarding what T2K match generaly does (differences to the proposed approach).
- "The restrictive technique took around 5 seconds, while the
experiment with permissive technique took 9974 seconds
( 2 hours and 46 minutes) .." -> Again, for all tables? Any insight how long is it on average to handle 1 table or 1 cell?
- It seems that more extensive experiments could be added. One might add runtime and accuracy comparisons to several non-HDT approaches, and also add more datasets (e.g., https://www.cs.ox.ac.uk/isg/challenges/sem-tab/).
RELATED WORK
- The discussion on labeling techniques could give more insights wrt. pros and cons in using ML techniques vs. non-ML techniques.
- The second paragraph of Sec. 7.3 is too long and could be restructured and tightened.
- "in comparison to the previous work [11], we present the key differences below:" -> What are the benefits in imposing the differences? The improvements wrt the key differences could be validated using experiments.
OVERALL
Overall, I believe that the paper may shed some light on the practical aspects of semantic labeling over tables. I would suggest that the authors substantially improve the paper based on the above feedback, in particular regarding motivation, depth, novelty, and writing structure.
Typos:
- Pg. 1 and other parts: Data is or data are? As long as the usage is consistent, I think either is fine.
- Pg. 3: dbr:Unite_States -> dbr:United_States
- There are some others. Generally, the paper should be checked again for typos.