Typology-based Semantic Labeling of Numeric Tabular Data

Tracking #: 2172-3385

Ahmad Alobaid
Emilia Kacprzak
Oscar Corcho

Responsible editor: 
Guest Editors EKAW 2018

Submission type: 
Full Paper
More than 150 Million tabular datasets can be found on the Google Crawl of the Web. Semantic labeling of these datasets may help in their understanding and exploration. However, many challenges need to be addressed to do this automatically. With numbers, it can be even harder due to the possible difference in measurement accuracy, rounding errors, and even the frequency of their appearance (if treated as literals). Multiple approaches have been proposed in the literature to tackle the problem of semantic labeling of numeric values in existing tabular datasets, but they also suffer from several shortcomings: closely coupled with entity-linking, rely on table context, need to profile the knowledge graph and the prerequisite of manual training of the model. Above all, they all treat different kinds of numeric values evenly. In this paper, we tackle these problems and validate our hypothesis: whether treating different kinds of numeric columns differently yields a better solution.
Decision: Major Revision

Solicited Reviews:
Review #1
By Joana Malaverri submitted on 01/May/2019
Major Revision
Review Comment:

(1) Originality:
This work is an extension of the paper "Fuzzy Semantic Labeling of Semi-Structured Numerical Data Sources", which describes an approach to label the numerical columns of tabular data based on the application of the fuzzy c-means technique. In this extension, the authors introduce a typology of numeric values, based on other work, as a way to improve the labeling of the numerical columns. The work is original in relation to previous work and the state of the art: in general, prior work has focused mainly on textual columns. In particular, the authors present a model for constructing features based on different numerical types. These features serve as input for the classification and labeling of each numeric column. Despite this contribution, it is unclear whether the authors were able to improve the labeling of the numerical columns compared to their previous work. If the authors succeeded in their goal, it should be made explicit for which types they achieved this improvement. I consider that a better organization and discussion of Section 7 (Evaluation) can help to close this gap. In addition, it is not clear what part of the typology of numerical values is original to this work. This point needs to be clarified.
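For readers unfamiliar with the technique the review refers to, fuzzy c-means can be sketched in a few lines. This is a generic one-dimensional illustration with assumed parameters (the cluster count c, fuzzifier m, and the deterministic initialisation are choices made here), not the authors' implementation:

```python
def fuzzy_c_means(points, c=2, m=2.0, iters=50):
    """Minimal 1-D fuzzy c-means sketch (c >= 2 clusters, fuzzifier m).
    Returns (centers, memberships); memberships[i][j] is the degree to
    which points[i] belongs to cluster j, and each row sums to 1."""
    lo, hi = min(points), max(points)
    # Deterministic initialisation: centers evenly spaced over the range.
    centers = [lo + (hi - lo) * j / (c - 1) for j in range(c)]
    for _ in range(iters):
        # Membership update: inverse-distance weighting between centers.
        u = []
        for x in points:
            d = [abs(x - v) or 1e-12 for v in centers]  # avoid div by zero
            u.append([1.0 / sum((d[j] / d[k]) ** (2 / (m - 1))
                                for k in range(c)) for j in range(c)])
        # Center update: membership-weighted mean of the points.
        centers = [sum(u[i][j] ** m * points[i] for i in range(len(points)))
                   / sum(u[i][j] ** m for i in range(len(points)))
                   for j in range(c)]
    return centers, u

centers, u = fuzzy_c_means([1.0, 1.1, 0.9, 10.0, 10.2, 9.8])
print(sorted(centers))  # two centers, roughly 1.0 and 10.0
```

The soft memberships (rather than hard assignments) are what makes the technique "fuzzy": a value between two clusters belongs partially to both.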

(2) Significance of the results:
The main problem for me is that the authors do not make clear whether their results are good enough. They start their work by posing a hypothesis: treating types of numeric columns differently yields better results than treating them uniformly. However, it is not possible to tell whether the authors have been able to prove their hypothesis.
What improvement does the solution proposed in this work offer over the previous work? It would be interesting to see a comparison of the new results against those obtained previously.

(3) Quality of writing:
I consider this a promising work that can be useful for future approaches; however, the description of the different sections needs to be improved, in particular Sections 4, 5, 6 and 7. In addition to reviewing the organization of each section, it is necessary to check the wording of the text. The text is hard to read and lacks clarity and organization. Below I list some points that the authors can consider in order to improve the text:

Types of Numbers: It is necessary to discuss the numeric types the authors cited. For example, what are the advantages and/or disadvantages of the different numeric types described in this section? What criteria did the authors use to select some of the types?

*Section 4: Typology of Numerical Columns:*
- What are the numeric sub-types the authors refer to? Are the same used for the high level types: nominal, ordinal, interval, and ratio?
- In addition, the authors should mention what types they are using from each state-of-the-art work and if there are some proposal related to numeric type (or sub-types) made by them.

*Section 4.1. Nominal:*
- The classification the authors describe for this category, 1) sequential; 2) hierarchical; 3) categorical; 4) random: is it based on other work? If so, it would be important to cite the reference. If it is original, the authors should present this classification as part of the contribution of the paper.
- Hierarchical: the authors could give an example to improve the understanding of the detection of the hierarchical values. The same for Categorical type. Some examples would help to understand the different numeric types of each category being analyzed.
- The paragraph “A special case for having only one unique value is not considered categorical. We simply ignore that collection as extra knowledge would be needed to understand the meaning of this number”: does it refer to a single value in the column or to a digit in the value? Could you provide some examples?
- Random: could you give an example that contains Random types?
- Ordinal: the same as Random.

* The main difficulty in this section is to understand how the authors can detect and classify a type considering the different numerical values that a column can contain.

- Table 1 is not referenced in the text and lacks an appropriate description of its content.

*Section 7.3 Results and discussion:*
- At least for precision and recall, the authors should provide the formulas they are using to compute them.
- The authors present different tables (e.g., 6 and 7) containing the different scores obtained by their approach. However, it would be interesting to show a table with the numeric columns used as input and the labels that were produced as output for each column.
- What improvement did the authors obtain with this approach over the previous work?
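As a note on the metrics requested above: one standard reading of precision and recall over predicted vs. gold labels is shown below (the exact definition the authors use may differ; the property names in the example are hypothetical illustrations):

```python
def precision_recall(predicted, gold):
    """Precision = TP / |predicted|, recall = TP / |gold|, where a true
    positive is a predicted label that also appears in the gold standard."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical DBpedia-style properties as labels (illustration only):
print(precision_recall({"dbo:populationTotal", "dbo:areaTotal"},
                       {"dbo:populationTotal", "dbo:elevation"}))
# -> (0.5, 0.5): one shared label out of two predicted and two gold
```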

Review #2
Anonymous submitted on 13/Jun/2019
Major Revision
Review Comment:

Overall, I like the idea that you categorize different types of numeric columns. I think it absolutely makes sense as a pre-processing and classification step for your semantic labeling. However, I am not completely convinced by the way you describe the types of numbers and by your experiments: the definitions and heuristics are not always clear to me, and the testing and evaluation are not convincing; e.g., three of your defined types of numbers do not have any matching columns in your experiment. Therefore, I suggest a major revision of the paper; please find more arguments in my detailed review below.

- “150 Million tabular datasets can be found on the Google Crawl of the Web”: can you provide a reference for this? Where did you get this number from?

- “facilitating the population of these datasets” -> I’m not sure this is really needed or the goal on the platforms that you mention, such as data.gov, Kaggle, etc. Also, it’s not the goal of your work.
- “wants to use it for his knowledge” -> “her” or "his or her"
- such as, … -> no comma
- “Recently, the problem of assigning semantic labels to numerical columns in a dataset started to get traction as many of the previous works focused mainly on textual columns [...]” -> This sentence does not make much sense: first numeric columns get traction, then no attention…
- “Multiple approaches tried to solve this problem” -> references are missing for the respective approaches that you describe.
- “different types of numerical column” -> columns
- “following the hypothesis: that” -> either remove the “:” or “that”

- “Nominal where numbers are just like text, so they don’t have any significant meaning beside differentiating them from other numbers (or texts) -> reformulate.. What do you mean with “just like text”? Text clearly has a meaning.. Do you understand “text” as “textual labels” here?
- “Ordinal kind is when the order matters” -> again reformulate
- “but with the numbers having the order” -> Which order? *an* order?
- “It has all the above properties.” -> Which above properties? Do you refer to Interval numbers?
- “It is to measure the how many occurrences or number of objects” -> reformulate

Regarding your state-of-the-art section, I was missing that, on the one hand, the task of completing knowledge bases using web tables (e.g. [a]), the task of table retrieval (e.g. [c]), and the task of table type classification (e.g. [d]) are related to (or a precondition for) your approach; and on the other hand, that there is some additional recent work [b] worth including.

In general, you have to rework your list of references: some references are incomplete ([9], [23], [31]: source is missing; [32]: URL is missing; [2]: unicode issues). Some references are just links to Wikipedia which you use as examples in 4.1 ([25]-[29]); you could add these links as footnotes, or somehow describe them better (maybe as an example table or figure?), but I would not include them in the references.

problem statement:
- “values of specific type” -> a specific type

section 4:
- “that we use in out work” -> our
- It looks a bit odd that you describe the types of numbers in Section 2 and again in Section 4; maybe you can find a better way to structure this. E.g., keep the discussion in Section 2 as it is, and give a more structured (formal) definition of the types in Section 4.
- “Below we explain how do we detect” -> reformulate
- What are hierarchical nominal numbers? Can you give an example? You do not define or introduce this type. Also, I don’t understand the intuition behind the heuristic.
- You use numbers in sports as an example of sequential nominal numbers. Isn’t that rather an example of categorical? E.g., in football, numbers 1 and 10 could also be considered categories, and given a dataset of multiple players/teams, we would have many duplicates of these numbers. Also, there might be no number 2, 3, 4, etc., so they are not necessarily sequential.
- “We put ratio and interval together because just looking at a zero” -> reformulate. What do you mean by a “real” zero? Again, the definition of the ratio and interval types is not fully clear to me, and therefore neither is the heuristic.
- “treat them similarly” -> similar
- “Often have the difference between [...]” -> reformulate
- “counts tends to” -> tend

In general, the heuristics in Section 4 sometimes look ad hoc and tailored, and the algorithm is often not clear to me. I would like to see clear definitions of the different types and better descriptions of the algorithms in Section 4; maybe you can re-structure and align this.

section 5:
- 5.1: How many properties do you extract?
- 5.2: Did you evaluate the number type for the properties? Why didn’t you assign them manually? You could have used such a manual assignment (even just for a sample set of properties) for testing/evaluating your heuristics in section 4.
- 5.2.3: “can be a difficult or even expensive” -> reformulate
- “distinguish simple counts from the rest as it has” -> as they have
- “Counts usually vary and using” -> “vary and use”, reformulate the whole sentence

section 6:
- I would suggest going into more detail about the labeling approach (from your EKAW paper) to make the article self-contained.

section 7:
- 7.3: Given the number of sequential, hierarchical, categorical, and ordinal columns, I wonder how well your heuristics work. Since there are (only) 124 columns, you could have manually checked whether there really is no hierarchical or categorical column.
- In 7.3 you report “areaOfCatchment” as an error: this is not due to your labeling algorithm, but to your numeric types of the properties. As I mentioned before, I think these types of the properties should rather be given (i.e. manually assigned), or somehow tested. Otherwise you cannot rely on your knowledge graph.

- “We show a typology taking into account the task of semantic labeling.” -> reformulate
- “more tests with upcoming benchmarks” -> what do you mean here? Do you want to create new/better benchmarks?

The quality of writing should be improved: lots of spelling mistakes and untypical phrasing. I tried to collect them, but I did not list them all. Please do a thorough proofreading.
Running examples, figures, etc. would improve the quality and readability of the paper.

[a] Y. Oulabi and C. Bizer, “Extending cross-domain knowledge bases with long tail entities using web table data,” in Advances in Database Technology - 22nd International Conference on Extending Database Technology, EDBT 2019, Lisbon, Portugal, March 26-29, 2019, pp. 385–396, 2019.

[b] Z. Zhang, “Effective and efficient semantic table interpretation using TableMiner+,” Semantic Web, vol. 8, no. 6, pp. 921–957, 2017.

[c] S. Zhang and K. Balog, “Ad hoc table retrieval using semantic similarity,” in Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pp. 1553–1562, 2018.

[d] K. Nishida, K. Sadamitsu, R. Higashinaka, and Y. Matsuo, “Understanding the semantic structures of tables with a hybrid deep neural network architecture,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., pp. 168–174, 2017.

Review #3
By Ilaria Tiddi submitted on 17/Jun/2019
Major Revision
Review Comment:

The paper focuses on the problem of semantic labelling of numerical columns in tabular data. The work is an extension of the EKAW paper, where the authors use fuzzy clustering for automatically identifying the type of columns (of any kind) in tables. The authors first introduce a categorisation of numerical values (sequential, hierarchical, categorical, ordinal, ratio and interval) and then use a set of heuristics to identify these types; finally, the label is defined using fuzzy clustering and DBpedia. The experiments compare with the previous work and are focused on improving the accuracy in the semantic labeling task through identification of the different types of numerical data.

I like the work as it tackles an interesting problem, but I am seriously concerned about the overall result.

(1) originality is OK, in the sense that there is a clear novelty w.r.t. the previous work

(2) significance of the results
Results are convincing, but I am not sure they are enough for a journal paper. Using only 1 dataset and 1 KG, and comparing solely with their own previous work, seems a bit ad hoc. What would happen with a different knowledge base, for instance? Or with a domain-specific dataset? Could a different typology (e.g. the one of Section 2) be compared?

(3) quality of writing
The writing requires significant revision; I started marking down typos and misformulations, but there were way too many to report. The authors should consider proof-reading.

I sometimes find the narrative confusing, and the authors seem to be jumping steps every now and then. Some decisions and intuitions should be clarified too. For example:
- the approach "does not perform well when the data points are scattered further away from the center", and this suggests that the numerical types in tabular data need to be further specified? There seems to be a missing step here. How did the authors come to this conclusion?
- at which step does the knowledge graph come into play, and what is the rationale behind it? This should be clearly stated both in the introduction and in section 3, and an example could help in understanding its role in Section 5, too.
- Why are the existing typologies (Section 2) not enough, and the authors need to define their own (section 4)?
- Section 3.1 seems too quick, i.e. it is not clear how we jump to the last paragraph, nor how it connects to the problem statement of 3.2.
- what is the intuition behind the detection order? Frequency of the type of numerical data? Complexity?
- In general, I would use the picture of Fig 1 to walk the reader through the pipeline (= extend 3.2)
- the clustering description of Section 6 could be extended; I do not see any problem in reusing part of the previous work
- details about the manual evaluation (how many people, how much time etc) should be provided

One general improvement I would suggest is to use a running example through the description of the process, as this improves clarity and prevents the reader from getting lost. The authors use some examples every now and then, but they are rather scattered (football vs. military…); what I am suggesting is to use one example that illustrates each step in the pipeline.

Additionally, problems with some tables:
>> Table 1 on page 7 is not referenced?
>> the order of the tables is not respected (e.g. Table 7 appears before Table 6); make sure you order them properly in LaTeX
>> Table 3 seems quite useless (it can be described in text, or at least use a real example?)

Minor :
>> I would put footnote 8 in the text, as it is not clear that they are sequential
>> what does "understanding space" in section 3 mean exactly?
>> "detect them is to see whether the set of numbers (what we want to examine) is equal to list of numbers 1 until the size of the list" >> rephrase, and this really needs an example
>> "For the categorical kind, we use the unique num-ber of numbers (the number of categories) followed bythe percentages for each category ordered ascendingly" : this really needs an example
>> page 6, right, r 31: "We follow the method we published in our previous work [8]." Is the data extraction the method you refer to? Then it should be stated more clearly
>> footnote 14 should go in text and be better explained
>> I do not get footnote 17
>> Conclusion : "Further in-depth inspection and even simulation can be employed to extend this work and more tests with upcoming benchmarks" what does that mean?
>> there is no consistent use of US / Brit English ("labeling" or "labelling"? "summarized" vs. "summarised"); also, "do/does not" should be used in their expanded (non-contracted) forms
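The sequential-detection check quoted in the minor comments above (whether a set of numbers equals the run 1..n) can be illustrated with a short sketch; the function name and the 1-based start are assumptions for illustration, not the paper's exact heuristic:

```python
def is_sequential(values):
    """A column is treated as 'sequential' if its values, once sorted,
    are exactly the run 1, 2, ..., n with no gaps and no duplicates."""
    nums = sorted(values)
    return nums == list(range(1, len(nums) + 1))

print(is_sequential([3, 1, 2, 4]))  # True: sorts to 1..4
print(is_sequential([1, 2, 4]))     # False: 3 is missing
```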