Review Comment:
1. Introduction
MTab4DBpedia is a semantic annotation approach for tabular data (Semantic Table Interpretation, or STI). MTab4DBpedia covers all tasks of STI, i.e., Cell-Entity Annotation (CEA), Column-Type Annotation (CTA), and Column Relation-Property Annotation (CPA).
The described system obtained the best performance in all three matching tasks of the SemTab 2019 challenge. The technique combines joint probability signals from different table elements with majority voting to deal with data noise, schema heterogeneity, and ambiguity.
The system is inspired by the probabilistic graphical model approach of Limaye et al. and by the signal propagation in the T2K system of Ritze et al.
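To make the majority-voting idea concrete for readers of this review, a minimal sketch follows (hypothetical names; this is an illustration, not the authors' implementation):

```python
from collections import Counter

def majority_vote(candidate_types):
    """Pick the column type most frequently voted for by the cells.

    candidate_types: a list of type labels, one per cell annotation.
    Ties are broken by first occurrence, as Counter.most_common does.
    """
    counts = Counter(candidate_types)
    label, _ = counts.most_common(1)[0]
    return label

# e.g., noisy per-cell type predictions for one column
votes = ["dbo:City", "dbo:City", "dbo:Place", "dbo:City"]
print(majority_vote(votes))  # -> dbo:City
```

Even this trivial voting scheme shows why aggregation over many cells can smooth out per-cell noise.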
2. Definitions and assumptions
2.1 Problem definitions
The objectives of the STI and the related tasks are well defined.
2.2 Assumptions
The assumptions are well described and reasonable if we consider the objectives of the SemTab 2019.
However, they are too restrictive for using the technique outside the challenge. This issue is discussed further below.
3. MTab4DBpedia approach
3.1 Framework
The MTab4DBpedia approach is composed of a seven-step pipeline. Step 1 pre-processes the table data S by decoding textual data, predicting languages, data types, and entity types, and performing entity lookup. Step 2 estimates entity candidates for each cell. Step 3 estimates type candidates for columns. Step 4 estimates the relation candidates between pairs of columns. Step 5 re-estimates entity candidates by aggregating confidence from Steps 2, 3, and 4. Steps 6 and 7 re-estimate type and relation candidates using the results of Step 5.
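The seven-step pipeline can be summarized as the following skeleton (all function names are hypothetical placeholders for this review; the real system's interfaces differ):

```python
def mtab_pipeline(table, preprocess, est_entities, est_types, est_relations,
                  refine_entities, refine_types, refine_relations):
    """Skeleton of the seven-step pipeline described in Section 3.1."""
    signals = preprocess(table)                # Step 1: pre-processing
    ents = est_entities(signals)               # Step 2: entity candidates
    typs = est_types(signals, ents)            # Step 3: type candidates
    rels = est_relations(signals, ents)        # Step 4: relation candidates
    ents = refine_entities(ents, typs, rels)   # Step 5: re-estimate entities
    typs = refine_types(typs, ents)            # Step 6: re-estimate types
    rels = refine_relations(rels, ents)        # Step 7: re-estimate relations
    return ents, typs, rels
```

The skeleton makes the feedback structure explicit: Steps 2 to 4 run on the pre-processed signals, and Steps 5 to 7 re-use each other's outputs.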
3.2 Step 1: Pre-processing
– Text Decoding: considering the nature of the input data, we find the choice reasonable.
– Language Prediction: the authors should clarify how language identification affects the lookup task, to justify this step (as on page 11, right column, line 44).
– Data Type Prediction: what are the 13 data types used?
– Entity Type Prediction: in this case too, the authors should clarify which the 18 entity types are. (Typo, page 4, left column, line 39: remove the space before the footnote number.)
– Entity Lookup: in this step, a threshold is defined to limit the lookup results. We ask the authors to justify setting the threshold at 100.
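One empirical way to justify such a cut-off is to measure recall at k on a labeled sample and pick the smallest k where recall saturates. A sketch (hypothetical data structures, not the authors' code):

```python
def recall_at_k(lookups, gold, k):
    """Fraction of cells whose gold entity appears in the top-k lookup results.

    lookups: dict mapping cell id -> ranked list of candidate entities.
    gold:    dict mapping cell id -> correct entity.
    """
    hits = sum(1 for cell, results in lookups.items()
               if gold[cell] in results[:k])
    return hits / len(lookups)
```

Plotting recall_at_k for increasing k on the SemTab data would show whether 100 is a reasonable saturation point.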
3.3. Step 2: Entity Candidate Estimation
The approach obtains a set of candidate entities from four different services (i.e., DBpedia Lookup, the DBpedia endpoint, Wikidata lookup, and Wikipedia lookup). This is a sensible choice, but it implies a strong dependence of the approach on these services. It would be desirable for the authors to consider building their own lookup service.
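The merging of candidates from several services could be sketched as follows; the rank-based scoring is an assumption of this review, used only to illustrate the idea of pooling ranked lists:

```python
def merge_candidates(service_results):
    """Union the candidate entities returned by several lookup services,
    scoring each entity by its rank in every list that contains it.
    """
    scores = {}
    for results in service_results.values():
        for rank, entity in enumerate(results):
            # earlier ranks contribute more; 1/(rank+1) is an assumption
            scores[entity] = scores.get(entity, 0.0) + 1.0 / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

results = {
    "dbpedia_lookup": ["dbr:Paris", "dbr:Paris,_Texas"],
    "wikidata_lookup": ["dbr:Paris"],
    "wikipedia_lookup": ["dbr:Paris,_Texas", "dbr:Paris"],
}
print(merge_candidates(results)[0])  # -> dbr:Paris
```

A local index service could expose the same interface, which is why we consider the dependence on external services an implementation issue rather than a conceptual one.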
3.4. Step 3: Type Candidate Estimation
(typo page 5, left column, line 37: remove the space before the footnote number.)
3.4.1. Numerical Column
The EmbNum+ approach is interesting. However, it is necessary to expand its description using real examples, possibly extracted from the SemTab 2019 datasets. In this step too, the authors should clarify how the threshold alpha has been defined; we suggest justifying it empirically. (Typo, page 5, left column, line 46: remove the space before the footnote number.)
3.4.2 Entity Column
(typo page 5, right column, from line 42: correct punctuation in the bulleted list.)
The section is clear and well described. Again, it is necessary to justify the threshold beta.
3.5. Step 4: Relation Candidate Estimation
3.5.1. Entity - Entity columns
The Section is clear and well described.
3.5.2. Entity - Non-Entity columns
The section is clear and well described, but the authors should clarify the wording "w5, w6 are learnable parameters".
3.6. Step 5: Entity candidate Re-Estimation
It is necessary to clarify the meaning of the parameters w7, w8, w9, w10.
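Presumably these parameters weight a linear combination of confidence signals, along these lines (a sketch under that assumption, not the paper's actual formulation):

```python
def aggregate_confidence(signals, weights):
    """Weighted linear combination of confidence signals, as in the
    re-estimation steps; weights such as w7..w10 could then be tuned
    on held-out data rather than fixed by hand.
    """
    assert len(signals) == len(weights)
    return sum(w * s for w, s in zip(weights, signals))
```

Stating whether the weights are hand-set or learned, and on what data, would answer our question directly.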
3.7. Step 6, 7: Re-Estimate Types and Relations
The description of the process is acceptable, but we suggest including a concrete example to guide the reader's understanding.
4. Evaluation
4.1. Benchmark Datasets
(typo page 7, right column, from line 39: i.e.)
4.4. Experimental Results
A comparative analysis with the other approaches of the challenge would be useful.
4.5.1. CEA: Entity Matching
The authors should also provide their version of the CEA GT (EDCEA_GT); it could be an excellent resource.
4.5.3. CPA: Relation Matching
The authors should also provide their version of the CPA GT (DECPA_GT); it could be an excellent resource.
5. Related Work
5.1. SemTab 2019 systems
We suggest better identifying the strengths of the other approaches presented during the challenge.
5.2. Other Tabular Data Annotation Tasks
Semantic Table Interpretation is a long-studied problem; the first works date back to 2007. Therefore, it would be useful to extend this section to contextualize the proposed approach with respect to the other works in the state of the art. See the general comments.
6. Conclusion
6.1. Limitations
As the authors indicate, since MTab4DBpedia is built on top of lookup services, the upper bound of its accuracy strongly depends on the lookup results. We consider this a significant limitation of the proposed approach. Therefore, the authors should specify how the approach can be adapted when external services cannot be used (e.g., by building a local index service).
GENERAL COMMENTS
The paper clearly describes the proposed approach. The approach is characterized by a fair degree of innovation and has been validated in an international challenge, which certifies its quality.
However, the discussion remains too tied to what was necessary within the competition, and thus loses some generality. Indeed, the challenge addresses the main issues of the Semantic Table Interpretation task, but it makes assumptions that differ from real scenarios.
In particular, an approach should be able to identify the subject column and the header.
Another weak element of MTab4DBpedia is its close dependence on external services, as noted above. The authors may also describe the changes made to MTab4 for participation in SemTab 2020.
There are many formulas, perhaps even too many; a small worked example would undoubtedly help give the formulas a concrete interpretation.
We suggest that the authors insert the repository link containing the implementation of the proposed approach.