Review Comment:
Overall, I like the idea of categorizing different types of numeric columns. I think it makes absolute sense as a pre-processing and classification step for your semantic labeling. However, I am not completely convinced by the way you describe the types of numbers and your experiments: the definitions and heuristics are not always clear to me, and the testing and evaluation are not convincing, e.g., three of your defined types of numbers do not have any matching columns in your experiment. Therefore, I suggest a major revision of the paper; please find more arguments in my detailed review below.
abstract:
- “150 Million tabular datasets can be found on the Google Crawl of the Web”: can you provide a reference for this? Where did you get this number from?
intro:
- “facilitating the population of these datasets” -> I’m not sure if this is really needed, or even the goal, at the platforms that you mention, such as data.gov, kaggle, etc. Also, it’s not the goal of your work.
- “wants to use it for his knowledge” -> “her” or "his or her"
- such as, … -> no comma
- “Recently, the problem of assigning semantic labels to numerical columns in a dataset started to get traction as many of the previous works focused mainly on textual columns [...]” -> This sentence does not make much sense: first numeric columns get traction, then they get no attention.
- “Multiple approaches tried to solve this problem” -> references are missing for the respective approaches that you describe.
- “different types of numerical column” -> columns
- “following the hypothesis: that” -> either remove the “:” or “that”
background:
- “Nominal where numbers are just like text, so they don’t have any significant meaning beside differentiating them from other numbers (or texts)” -> reformulate. What do you mean by “just like text”? Text clearly has a meaning. Do you understand “text” as “textual labels” here?
- “Ordinal kind is when the order matters” -> again reformulate
- “but with the numbers having the order” -> Which order? *an* order?
- “It has all the above properties.” -> Which above properties? Do you refer to Interval numbers?
- “It is to measure the how many occurrences or number of objects” -> reformulate
state-of-the-art:
Regarding your state-of-the-art section, I was missing two things: on the one hand, that the task of completing knowledge bases using web tables (e.g. [a]), the task of table retrieval (e.g. [c]), and the task of table type classification [b] are related to (or a precondition for) your approach; and on the other hand, that there is some additional recent work [b] worth including.
In general, you have to rework your list of references: some references are incomplete ([9], [23], [31]: source is missing; [32]: URL is missing; [2]: unicode issues). Some references are just links to Wikipedia which you use as examples in 4.1 ([25-29]); you could add these links as footnotes, or describe them in some better way (maybe as an example table or figure?), but I would not include them in the references.
problem statement:
- “values of specific type” -> a specific type
section 4:
- “that we use in out work” -> our
- It looks a bit odd that you describe the types of numbers in section 2 and then again in section 4. Maybe you can find a better way to structure this, e.g., keep the discussion in section 2 as it is, and give a more structured (formal) definition of the types in section 4.
- “Below we explain how do we detect” -> reformulate
- What are hierarchical nominal numbers? Can you give an example? You do not define or introduce this type. Also, I don’t understand the intuition behind the heuristic.
- You use numbers in sports as an example of sequential nominal numbers. Isn’t that rather an example of categorical numbers? E.g., in football, the numbers 1 and 10 could also be considered a category, and given a dataset of multiple players/teams, we would have many duplicates of these numbers. Also, there might be no number 2, 3, 4, etc., so they are not necessarily sequential.
- “We put ratio and interval together because just looking at a zero” -> reformulate. What do you mean by a “real” zero? Again, the definition of the ratio and interval types is not fully clear to me, and therefore neither is the heuristic.
- “treat them similarly” -> similar
- “Often have the difference between [...]” -> reformulate
- “counts tends to” -> tend
In general, the heuristics in section 4 sometimes look ad hoc and tailored, and often the algorithm is not clear to me. I would like to see clear definitions of the different types and better descriptions of the algorithms in section 4. Maybe you can restructure and align these.
section 5:
- 5.1: How many properties do you extract?
- 5.2: Did you evaluate the number type for the properties? Why didn’t you assign them manually? You could have used such a manual assignment (even just for a sample set of properties) for testing/evaluating your heuristics in section 4.
- 5.2.3: “can be a difficult or even expensive” -> reformulate
- “distinguish simple counts from the rest as it has” -> as they have
- “Counts usually vary and using” -> “vary and use”, reformulate the whole sentence
section 6:
- I suggest going into more detail about the labeling approach (from your EKAW paper) to make the article self-contained.
section 7:
- 7.3: Given the number of sequential, hierarchical, categorical, and ordinal columns, I wonder how well your heuristics work. Since there are (only) 124 columns, you could have manually checked whether there are really no hierarchical or categorical columns.
- In 7.3 you report “areaOfCatchment” as an error: this is not due to your labeling algorithm, but due to the numeric types you assigned to the properties. As I mentioned before, I think these types of the properties should rather be given (i.e. manually assigned), or tested in some way. Otherwise you cannot rely on your knowledge graph.
conclusion:
- “We show a typology taking into account the task of semantic labeling.” -> reformulate
- “more tests with upcoming benchmarks” -> what do you mean here? Do you want to create new/better benchmarks?
The quality of the writing should be improved: there are lots of spelling mistakes and unusual phrasing. I tried to collect them, but I did not list them all. Please do a thorough proofreading.
Running examples, figures, etc. would improve the quality and readability of the paper.
[a] Y. Oulabi and C. Bizer, “Extending cross-domain knowledge bases with long tail entities using web table data,” in Advances in Database Technology - 22nd International Conference on Extending Database Technology, EDBT 2019, Lisbon, Portugal, March 26-29, 2019, pp. 385–396, 2019.
[b] Z. Zhang, “Effective and efficient semantic table interpretation using tableminer+,” Semantic Web, vol. 8, no. 6, pp. 921–957, 2017.
[c] S. Zhang and K. Balog, “Ad hoc table retrieval using semantic similarity,” in Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pp. 1553–1562, 2018.
[d] K. Nishida, K. Sadamitsu, R. Higashinaka, and Y. Matsuo, “Understanding the semantic structures of tables with a hybrid deep neural network architecture,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pp. 168–174, 2017.