Start Small, Build Complete: Effective and Efficient Semantic Table Interpretation using TableMiner

Tracking #: 668-1878

Authors: 
Ziqi Zhang

Responsible editor: 
Isabel Cruz

Submission type: 
Full Paper
Abstract: 
This article introduces TableMiner, the first Semantic Table Interpretation method able to annotate Web tables using an incremental, bootstrapping learning approach seeded by automatically selected ‘partial’ content from tables. The basic principle is to create initial, partial annotations of a table using some, as opposed to all, content in the table. The partial outcome then serves as ‘stepping stones’ to guide interpretation of the remaining content in the table, followed by a process of iteratively refining results on the entire table to create the final, optimal annotations. To construct feature representations, TableMiner uses various types of contextual information both inside and outside tables, including pre-defined semantic markups (e.g., RDFa/Microdata annotations) within some webpages that, to the best of the author’s knowledge, have never been used in Natural Language Processing tasks. Evaluated on the largest collection of datasets known to date, including four datasets with a total of more than 15,000 tables, TableMiner consistently outperforms four baselines under all experimental settings. On the two most representative datasets, covering multiple domains and various table schemata, it achieves an improvement in F1 of between 1 and 42 percentage points depending on the specific task. Compared against the state of the art, it also obtains higher results on similar datasets. The bootstrapping learning strategy seeded by partial table content also makes TableMiner very efficient: empirically, it reduces the data to be processed by up to 66%, or a 29% saving in CPU time, when compared against classic methods that ‘exhaustively’ process the entire table content to build features for interpretation.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 21/Nov/2014
Suggestion:
Major Revision
Review Comment:

The work "Start Small, Build Complete: Effective and Efficient Semantic Table Interpretation using TableMiner" proposes TableMiner, a framework to semantically annotate (Web-)tables with entities (for table cells), concepts (for table columns) and binary relations (between table columns). Instead of using the whole table content TableMiner initially creates partial annotations for table cells and columns. These partial annotations serve as 'stepping stones' for the remaining annotations. In the following the approach iteratively refines all table annotations to create a final reliable set of table annotations. Basically, the author has already presented the work at the International Semantic Web Conference 2014 under the name "Towards effective and efficient table interpretation". Hence, the work's abstract seems to be a bit misleading since it's advertizing with introducing TableMiner. The claims in THIS paper should be scoped carefully in the abstract.

1. Introduction
The author provides an articulate introduction which presents the broader area and the problems of table interpretation. Additionally, the advantages of TableMiner are briefly presented, and they are clear and meaningful.

2. Notions and Definitions
Section 2 provides a very short overview of the basic notations used throughout the work. However, the "Notions and Definitions" section is directly followed by the Related Work section, which seems a little artificial since the related work uses some of the notation proposed in Section 2.

3. Related Work
The author provides a comprehensive overview of related fields of research, including converting legacy structured data to linked data, NLP and IE, as well as semantic table interpretation. The section contains all relevant works in this research area and provides a good overview of previous work.

4. Overview and Notations
To be honest, I was a little bit confused by another chapter containing notations. Is it possible to merge both notation sections? For instance:

2. Notations
3. Overview of the approach
...
n-1. Related Work

The current Section 4 should be kept for acquainting the reader with the overall algorithmic process. Basically, Figure 2 shows the algorithm's overall procedure, which is clear and intuitive. However, reading Sections 5-10 showed that some more work has to be done in this section in terms of explaining the algorithm's processes. For instance, the author may provide a flow chart which clarifies the temporal sequence of the "learning" and "update" phases.

5. I-Inf (Incremental Inference)
This algorithm is an iterative process where, in each turn, a single data item is processed and updates the overall state until convergence. The section is short and provides the reader with the necessary principles for the following sections.
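To make the principle concrete for readers, the pattern described here could be sketched as a generic loop of the following form; the state representation, update function and convergence test below are placeholders, not the paper's actual I-Inf definitions.

    def i_inf(items, update_state, converged, initial_state=None):
        """Generic incremental-inference loop: process one data item per
        iteration and stop as soon as the state stops changing 'enough'."""
        state = {} if initial_state is None else initial_state
        for item in items:
            previous = dict(state)
            state = update_state(state, item)   # process a single data item
            if converged(previous, state):      # placeholder convergence test
                break
        return state

    # Example: accumulate candidate votes, stopping once one more item no
    # longer changes which candidate is in the lead.
    def add_vote(state, item):
        new = dict(state)
        new[item] = new.get(item, 0) + 1
        return new

    def leader_stable(prev, curr):
        top = lambda s: max(s, key=s.get) if s else None
        return bool(prev) and top(prev) == top(curr)

    counts = i_inf(["A", "A", "B", "A"], add_vote, leader_stable)  # stops early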

6. Subject Column Detection
Subject column detection identifies the column that most likely represents the main entities within a table. The author presents several well-defined features to detect such subject columns. In equation 8, the feature weights are manually determined. Have you tried to learn the feature weights? For instance, the explanation of why you chose 0.5 for the feature "uc" is poor. Basically, I understand why feature x is more important than feature y, but it is not clear how you determined the feature weights.
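To make the point about hand-set weights concrete, equation 8 essentially combines feature values through fixed weights, roughly in the spirit of the sketch below. Apart from the weight 0.5 for "uc" mentioned above, the feature names and weights here are placeholders, not the paper's actual equation 8.

    # Placeholder weights; only the 0.5 for "uc" is taken from the comment above.
    FEATURE_WEIGHTS = {
        "uc": 0.5,    # the weight whose choice is questioned above
        "ws": 1.0,    # placeholder feature and weight
        "emc": -1.0,  # placeholder feature and weight (penalised)
    }

    def subject_column_score(features):
        """features: dict mapping feature name -> normalised value for one column."""
        return sum(w * features.get(name, 0.0)
                   for name, w in FEATURE_WEIGHTS.items())

    # The alternative the review asks about: fit these weights from labelled
    # tables (e.g. with logistic regression) instead of setting them by hand.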

7-9. NE-Column interpretation - building blocks, the LEARNING phase, and the UPDATE phase
These sections are very difficult to understand, for two major reasons:

1. There is a huge number of introduced symbols. Even though all symbols are correctly introduced, the text is extremely hard to follow. Several symbols are abbreviations of terms, so the reader has to keep all of them in mind. The number and form of these symbols could certainly be simplified.
2. As mentioned in the comments on Section 4 above, I was not sure whether I completely understood the whole algorithmic process. These sections lack an overview depicting the whole process in greater detail (Figure 2 is not enough).

Additionally, in your related work section you explain that, in contrast to the work "Towards effective and efficient table interpretation" presented at the International Semantic Web Conference 2014, this paper significantly refines the column classification and named entity disambiguation methods by changing mathematical formulas etc. You did not describe these changes in the respective sections. Hence, I was not sure which equations and methods are new and which are inherited.

10. Relation enumeration and labeling literal-columns
Section 10 describes the binary relation labeling which is clear and reasonable.

11. Evaluation
The evaluation chapter comprises data sets, evaluation metrics, baseline algorithms and results. First, the author describes the re-annotation of the old Limaye table dataset as well as the IMDB and MusicBrainz data sets. The baseline algorithms constitute very simple but effective algorithms against which TableMiner is compared. Unfortunately, the results attained by TableMiner cannot be directly compared with the results of Venetis et al. and Mulwad et al. due to unpublished data set changes. This should be explicitly underlined in the respective sections again. In Table 11 you report a 90.9% F1 measure for the named entity disambiguation task of the approach by Mulwad et al. That is not correct. According to [1], 90.9% is attained by the entity ranker, which produces likelihoods that strings should be linked to entities. The F1 values for the named entity disambiguation task are quite poor (75.9% on the Wiki-Links data set). Overall, it is clear that TableMiner outperforms the baseline approaches. It is notable that the accuracy improvements are partially marginal (depending on the underlying data sets).

Overall, the author provides a very promising framework for table interpretation that incorporates a new state-of-the-art approach. The evaluation section contains a range of convincing evaluations with respect to annotation accuracy and annotation performance. I suggest that the author improve the algorithm sections by clarifying the algorithmic process with flow charts (or similar). Some other (minor) issues are noted in the per-section comments above.

The work contains only few spelling mistakes.

[1] Mulwad, Varish, Tim Finin, and Anupam Joshi. "Semantic message passing for generating Linked Data from tables." The Semantic Web–ISWC 2013. Springer Berlin Heidelberg, 2013. 363-378.

Review #2
By Venkat Raghavan Ganesh submitted on 26/Nov/2014
Suggestion:
Minor Revision
Review Comment:

The paper introduces a method named "TableMiner" that annotates tables using a bootstrapping approach. The features used for these annotations are derived from a knowledge base (Freebase). The method has been evaluated on a large number of well-structured tables (such as those from Wikipedia) and compared against other table interpretation methods. Although there exists some similarity between this work and the work presented by Mulwad et al. (2013), TableMiner differentiates itself by introducing methods to identify subject columns and by using less data for convergence, which is shown to improve efficiency. The paper is fairly well written. However, all notations should be reviewed for readability. Detailed review below:

==============
There seems to exist some similarity to the overall method described by Mulwad et al. (2013). In their work, they present a framework that represents the table data as a Markov network. For meaningful inference between the classes and row data, they parameterize the graph and represent it as a factor graph. They use factor nodes that compute the affinity of a predicted agreement. Measures like PMI and ranking scores are used by the factor nodes for inference. One factor node determines the agreement between a column header and a row cell value, and the other determines the relation between a pair of columns. TableMiner also uses a similar factor graph representation and tries to update the scores until they converge. The difference lies in the scoring mechanism and in the additional "learning" phase, which identifies the subject column. The features used for subject-column detection are chosen intuitively. The usage of external information, such as that from a web search engine, and its corresponding scoring function results in good subject column prediction (as per the reported results), which is necessary for relationship identification. The UPDATE phase seems to override the LEARNING phase to some extent. Because the UPDATE phase tries to disambiguate until convergence, choosing a random candidate may also, intuitively, result in similar output. Proper explanation is required as to how the output of the LEARNING phase would help in improving the accuracy of the UPDATE phase. The evaluation is very comprehensive and the results justify the claims made. However, it would be nice to clarify TableMiner's accuracy on a large number of heterogeneous tables, for example, tables that only have 15-20 rows but originate from various domains/sources. In these cases, computation would be almost negligible and accuracy in identifying relationships would be the goal.
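To illustrate the kind of agreement factor being discussed (agreement between a column's class label and the entity annotations of its cells), a rough toy example follows; the scoring is invented for illustration and is neither Mulwad et al.'s actual factor definition nor TableMiner's.

    def column_cell_agreement(column_class, cell_entity_types):
        """Fraction of annotated cells whose candidate entity types include
        the class proposed for the column (a toy 'agreement' factor)."""
        if not cell_entity_types:
            return 0.0
        hits = sum(1 for types in cell_entity_types if column_class in types)
        return hits / len(cell_entity_types)

    # e.g. column_cell_agreement("Mountain",
    #          [{"Mountain", "Peak"}, {"Mountain"}, {"Lake"}])  # -> 0.66...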

Presentation: The paper is fairly well written. However, some equations need to be much clearer, and the use of notations has to be consistent and meaningful. For example, using "PG" to denote a list of web pages decreases readability. On page 8, Section 5, the notations need some clarity. The missing spaces around the mathematical relation symbols (<=, >=) should be corrected.

Review #3
By Mike Cafarella submitted on 08/Dec/2014
Suggestion:
Reject
Review Comment:

The authors of this paper propose TableMiner, a mechanism for semantically annotating an extracted data table (which likely comes from the Web but, strictly speaking, does not have to). Extracting table annotations is an active area of research, and can be useful in applications such as improving open linked databases, improving structured components of web search, and as an aid to data integration applications.

This is an interesting area where it is possible to make contributions. The authors do a good job of reviewing related work, and have assembled an admirable array of table datasets to evaluate. However, there are three serious problems with the article as written, one of which may be hard to address.

1) The authors decide that runtime is a crucial performance criterion of table annotation systems. This is the fundamental motive for their algorithms that attempt to iteratively sample from target tables, scoring candidate extractions' semantic labels row-by-row.

I think this is a serious mistake. Almost all work to date has focused on markup quality, for good reason. The runtime overhead is entirely dependent on the features being used. But even in the Freebase-style case discussed here, there is likely no heavy computation going on; rather, it is likely that the observed Freebase lookup times are driven primarily by network latency. If the authors had a local copy of the knowledge base data (which is possible for some KBs), then the overhead here would likely be minimal. (Note that the experiments evaluate runtime entirely with a single thread, maximizing the problems with latency.) It is a stretch for me to believe that this is a serious problem: it can be fixed by simply downloading a dataset and building a local index.
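The point can be pictured with a small sketch: cache (or serve from a local index) the knowledge-base lookups so that wall-clock time is no longer dominated by network round-trips. The function names below are placeholders, not part of any real API.

    import functools

    def remote_kb_lookup(query):
        """Placeholder for a network round-trip to a remote KB API (e.g. Freebase)."""
        return []  # stub

    @functools.lru_cache(maxsize=None)
    def kb_lookup(query):
        # With a local KB dump/index, or simply a cache like this one, repeated
        # lookups stop paying network latency, which is the cost this review
        # attributes most of the observed runtime to.
        return tuple(remote_kb_lookup(query))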

2) The authors don't actually compare against baselines from the literature. Rather, they compare against some simple --- and somewhat artificial --- baselines. As a result, it is hard to say whether their approximation results are actually as good as they should be.

3) There is a real lack of useful examples in the first half of the paper. The authors have a few one-line examples, but the paper could really benefit from a longer running example that helps to illustrate every extraction target.

I think #2 could be addressed with a revision of the paper, but #1 seems pretty core to the paper. As a result, I think the authors should either rethink the goals here, or should make a more serious effort to argue that performance is a serious evaluation metric for such a system.

Detailed comments:

The abstract could really use a stronger example.

The use of "concept" to mean "type" should be better described in the intro. Also: use examples!

The link between your stated goals and the iterative method that is introduced at the end of page 2 is unclear. It eventually becomes clear why you're doing this, but iterative methods can serve multiple ends: minimizing human labels, for example. Your main goal is performance, but that doesn't come through right away.

"the largest collection of datasets for this task known to date". What is "this task"?

"most representative datasets" is unclear. Representative of what?

The evaluation criteria in the middle paragraph of left-column page 3 come from nowhere. I thought the whole point was increased performance, not accuracy. You have not told us that these other goals are things you're also trying to address.

There is a lot of related work from Kaushik Chakrabarti, on the InfoGather system. It should be cited and discussed.

The discussion on "Completeness" on page 7 could use a stronger example, especially on the point about "relations between non-subject columns".

I can't parse this important-seeming sentence from Section 4.1, "A candidate label for an NE-column…" This section could also use examples.

Algorithm 1 I-Inf is unnecessarily high level. It communicates the "meta-algorithm" but it is so high level that I can't understand if it's doing something useful.

Section 6 is unclear about what is the authors' own work and what is due to [30].

Section 8: "it has been noted that…" by whom? By the authors, or does there need to be a citation here?

The "List of peaks named Bear Mountain" example is too sketchily described. The authors should provide the concrete example in the paper and discuss it more carefully.

In the experiments, the authors make a confusing statement about the bootstrapping process and how it relates to interdependence among components. The paragraph begins, "In all baselines…" I am not sure whether they are trying to say something new here or not.