Review Comment:
This manuscript proposes an external-source-based approach for the semantic labelling of web tables, in which named entities are looked up on DBPedia in order to extract possible class names. The most suitable class for each entity is then selected by applying a proposed formula based on the concepts of specificity and coverage.
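To make the specificity/coverage idea concrete for readers of this review: the manuscript's exact formula is not reproduced here, so the Python sketch below combines the two notions with a hypothetical linear trade-off controlled by alpha (the parameter whose optimal value is discussed below). Function and variable names are mine, not the authors'.

```python
from collections import Counter

def rank_candidate_classes(entity_types, depth, alpha=0.5):
    """Rank candidate classes for a column of entities.

    entity_types: list of sets, one per entity, holding the DBPedia
                  classes returned for that entity.
    depth:        dict mapping a class to its depth in the ontology
                  (a hypothetical proxy for specificity).
    alpha:        hypothetical trade-off between coverage and
                  specificity; the manuscript leaves its optimal
                  value as an open question.
    """
    n = len(entity_types)
    counts = Counter(c for types in entity_types for c in types)
    max_depth = max(depth.values()) or 1
    scores = {}
    for cls, cnt in counts.items():
        coverage = cnt / n                           # fraction of entities typed with cls
        specificity = depth.get(cls, 0) / max_depth  # deeper classes are more specific
        scores[cls] = alpha * coverage + (1 - alpha) * specificity
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

On a toy column where two of three entities are typed "Athlete" and all three "Person", this trade-off prefers the more specific "Athlete" despite its lower coverage, which is the behaviour the manuscript's Figure 1 appears to illustrate.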
The idea, though not entirely novel, is interesting and worth exploring, and the manuscript is generally well written. However, the present article describes ongoing work that still requires substantial research effort and an improved experimental evaluation before it is complete.
Throughout the article, the authors describe several loose ends that need to be addressed before this proposal can be successfully applied to real-world scenarios. Some examples are:
- The automatic identification of the entity column: a simple heuristic is proposed, based on the results by Cafarella et al. However, presence in the first column is only the second-strongest feature; the strongest feature should be explored, or at least a justification must be given as to why it is not used in this proposal.
- The search strategy needs improvement: it is clear from the results that the naïve exact-search approach is not powerful enough.
- The approach requires the knowledge graph to expose a SPARQL endpoint over RDF, which is not the case for many well-known KGs.
- The selection of an optimal value for alpha.
- What happens if the scores of the candidate classes are too low? Is there some kind of threshold to determine when it is not possible to find a suitable class? The authors mention that "the other top types in the list are picked if the first one is rejected by the user". Does that mean that this is a semi-supervised approach?
- The proposal does not work well when the entity name is split across several columns, which makes it necessary to manually pre-process the datasets.
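As a concrete illustration of the SPARQL-endpoint and exact-search concerns above, a type lookup against DBPedia's public endpoint typically takes a form like the following (a stdlib-only sketch; the helper names are hypothetical and this is not the authors' code):

```python
import urllib.parse
import urllib.request

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

def build_type_query(entity_label):
    """Build a SPARQL query fetching DBPedia ontology classes for an entity.

    Exact-match lookup via rdfs:label, mirroring the naive exact-search
    strategy criticised above; a real system would need fuzzier matching.
    """
    return f"""
SELECT DISTINCT ?type WHERE {{
    ?entity rdfs:label "{entity_label}"@en ;
            rdf:type ?type .
    FILTER(STRSTARTS(STR(?type), "http://dbpedia.org/ontology/"))
}}"""

def query_types(entity_label):
    """Issue the query against the public endpoint (requires network access)."""
    params = urllib.parse.urlencode(
        {"query": build_type_query(entity_label), "format": "application/json"})
    with urllib.request.urlopen(f"{DBPEDIA_ENDPOINT}?{params}") as resp:
        return resp.read()
```

Issuing one such HTTP round trip per entity is exactly why the execution times requested below matter.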
Regarding the proposal:
- One of the main strengths of this proposal is its fully automated nature, i.e., the absence of a human in the loop. However, the solution the authors provide for many of the aforementioned loose ends is precisely to ask for human intervention, which is a clear contradiction.
- Also, the authors compare (theoretically) their proposal with semi-automated techniques. It is not clear to me whether they are referring to semi-supervised learning techniques that involve human interaction, or whether they include supervised learning techniques in this category. In my opinion, the difference between supervised learning techniques that learn a model from an annotated dataset and the current proposal is that the annotated dataset is, in this case, DBPedia. Therefore, I believe the authors should have included other supervised techniques in their experimental evaluation.
- The use of DBPedia as the information source limits this approach to named entities, and that is only one of its limitations. The authors should include a subsection with a discussion of the limitations of this technique.
- The authors focus their work on the semantization of the entity column, disregarding the others. It has been shown that introducing entity-property features into the classification process improves classification performance (TAPON [1]). Therefore, I believe this is something that should be explored in this work.
- In my opinion, this proposal would be much more interesting with an integration perspective, i.e., if the information from several KGs was used to determine the suitable class for each entity.
Regarding the experimental evaluation:
- I think that comparing this technique with just a single baseline is not enough; other proposals, such as [3], should be included in the experimentation. Also, I would suggest using exclusively well-known or reference datasets that have been used by other proposals, so that a comparison can be made. The Olympics dataset is a custom-made dataset not used before, which makes it less interesting. Finally, since the authors claim that working with tables in which the entity is not in the first column is not a problem, the experiments should have included such a dataset to support this claim.
- The authors should report end-to-end execution times as well as the formula computation time, to illustrate the impact of issuing the DBPedia queries for each entity in the dataset.
- I do not think that obtaining perfect precision and recall on the Olympics dataset is a strong point of the experimentation. Such results are uncommon unless the technique is somehow overfit to the dataset.
- In the experiments with T2Dv1 and T2Dv2, the differences between TADA and T2K are quite small; therefore, in order to draw conclusions, suitable statistical tests should be performed (and reported).
- The authors should clarify how the values in the first row of Table 2 were obtained, since T2K is only tested against T2Dv2, according to the provided reference.
- The incompleteness of knowledge graphs is mentioned as one of the reasons for the lower performance measures. This is not a trivial matter; real-world knowledge graphs are known to be incomplete and noisy [2], which makes me wonder whether this proposal is suitable for knowledge graphs other than DBPedia, as the authors claim.
- The authors say that less common entities can be more easily misclassified, and that "This could be settled using other kinds of insights like properties in the tables. This also can be limited if the properties in the table do not exist in the knowledge graph". It has been shown that using entity properties in an iterative fashion for semantic labelling does provide improved classification results, even when those properties are not named entities themselves and are therefore not included in DBPedia [1]. The authors should explore this as well.
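As an illustration of the kind of statistical test suggested above, a paired sign test over per-table scores can be computed with the standard library alone. This is a minimal sketch on made-up inputs; a Wilcoxon signed-rank test would be the stronger choice in the actual paper.

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided exact paired sign test: does method A systematically beat B?

    scores_a, scores_b: paired per-table scores (e.g. per-table F1 for
    TADA and T2K). Ties are discarded, and the exact binomial p-value
    under the null hypothesis p = 0.5 is returned.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0  # no informative pairs: no evidence either way
    wins = sum(d > 0 for d in diffs)
    k = min(wins, n - wins)
    # two-sided tail probability of a Binomial(n, 0.5)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)
```

With only a handful of tables and small score differences, this kind of test will usually fail to reach significance, which is precisely why it should be reported before claiming that TADA outperforms T2K.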
Other minor remarks:
- The authors use footnotes profusely (e.g., page 2). Note that footnotes make readers go back and forth and can break their reading flow, which is why they are discouraged in many journals. I would recommend using them exclusively for links to implementations and tools, or for information that is only loosely related to the text, but not for essential clarifications. As an example, references 37 and 38 could well be footnotes instead of citations.
- Figure 1 is well intended but not as effective as it could be, since it lacks some level of detail. I would recommend using it to describe an actual example of the application of the formula. For example, we see the sample class hierarchy "Person", "Athlete", "Basketball player" (which has the highest score), but it would be interesting to see what happens when several distinct class hierarchies could apply to a given popular name (e.g., John Smith). It would be useful to illustrate the values of Formula 1 as well. Also, I do not think it is accurate to say that Figure 2 "provides an alternate view" of the approach. To me, Figure 2 is a high-level description of the workflow, and I would recommend presenting it earlier, before the current Figure 1, which instantiates this workflow in a specific example.
[1] Ayala, D., Hernández, I., Ruiz, D., & Toro, M. (2019). TAPON: A two-phase machine learning approach for semantic labelling. Knowledge-Based Systems, 163, 931-943.
[2] Paulheim, H. (2017). Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic web, 8(3), 489-508.
[3] Zhang, Z. (2017). Effective and efficient semantic table interpretation using tableminer+. Semantic Web, 8(6), 921-957.
Comments
tiny improvement
When you say "most of such data are still being published at most with 2 or 3 stars, that is, using spreadsheets (CSVs or Excel files)", I guess it would be more precise to write "most of such data are still being published at most with 2 or 3 stars, that is, using spreadsheets (EXCEL or CSVs files)", if you want readers to infer that Excel = 2 stars and CSV = 3 stars
-- and please remember that only old specifications of Excel are actually non-open...