Review Comment:
The paper describes the XI pipeline: an Information Extraction-style pipeline for integrating heterogeneous data using a Knowledge Graph, consisting of Named Entity Recognition (NER), Entity Linking (EL), Type Ranking (TR), Coreference Resolution (CR), and Relation Extraction (RE) steps. The TR step is highlighted, in particular, as something distinguished from typical IE pipelines. The paper then discusses three different use-cases in which this pipeline has been employed, involving research articles, social media, and heterogeneous logs in a cloud environment. Based on these use-cases, the author summarises some of the main lessons learnt from his experience of working on/with the XI pipeline for integrating data, highlighting the importance of: keeping a human in the loop, working with fine-grained types for entities, the quality of the initial Knowledge Graph used, and the ability to customise the pipeline for different scenarios.
In general, the paper is very easy to read and very clear. I think the discussion of the use-cases and the lessons learnt is where the real value of the paper lies, and should be of particular interest to those working on KG-mediated data integration.
I recommend acceptance, and leave the following as suggestions to the author:
* The paper starts by mentioning semi-structured formats such as JSON, but the pipeline consists of components that have traditionally been applied to text. Hence it would be of interest to hear in more detail, for example, how a JSON or XML file might be processed: what adaptations are required, what entity mentions would be extracted, how easy/hard is working with such data versus text (in lessons learnt), etc.
* One of the major distinctions of this pipeline is its use in various practical use-cases, but from the text the pipeline still feels perhaps a bit "abstract" in how it is described. Maybe a running example would be useful to give a more concrete feel for the pipeline?
* I'm curious whether the pipeline considers feeding the output of the RE phase back into the KG (thus enriching/extending the KG). I would also be curious about the author's experience on this; was this tried in the use-cases?
## MINOR COMMENTS ##
- a series of systems we [designed]
- is supposed to be given a priori -> should be given a priori [at least to my ear, the first form suggests "... but it’s not", like the bus is *supposed to be here* ... but it's not]
- sate of are -> state of the art
- Fix encoding problems in the references for certain characters