Leveraging Knowledge Graphs for Big Data Integration

Tracking #: 2216-3429

Authors: 
Philippe Cudre-Mauroux

Responsible editor: 
Guest Editor 10-years SWJ

Submission type: 
Other
Abstract: 
This article gives an overview of our recent efforts to integrate heterogeneous data using Knowledge Graphs. I introduce a pipeline consisting of five key steps to integrate semi-structured or unstructured content. I discuss some of the key applications of this pipeline through three use-cases, and present the lessons we learnt along the way while designing and building data integration systems.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
By Aidan Hogan submitted on 23/Jun/2019
Suggestion:
Accept
Review Comment:

The paper describes the XI pipeline: an Information Extraction-style pipeline for integrating heterogeneous data using a Knowledge Graph, consisting of Named Entity Recognition (NER), Entity Linking (EL), Type Ranking (TR), Coreference Resolution (CR), and Relation Extraction (RE) steps. The TR step is highlighted, in particular, as something distinguished from typical IE pipelines. The paper then discusses three different use-cases in which this pipeline has been employed, involving research articles, social media, and heterogeneous logs in a cloud environment. Based on these use-cases, the author summarises some of the main lessons learnt from his experience of working on/with the XI pipeline for integrating data, highlighting the importance of: keeping a human in the loop, working with fine-grained types for entities, the quality of the initial Knowledge Graph used, and the ability to customise the pipeline for different scenarios.

In general, the paper is very easy to read and very clear. I think the discussion of the use-cases and the lessons learnt is where the real value of the paper lies, and should be of particular interest to those working on KG-mediated data integration.

I recommend acceptance, and leave the following as suggestions to the author:

* The paper starts by mentioning semi-structured formats such as JSON, but the pipeline consists of components that have traditionally been applied to text. Hence it would be of interest to hear in more detail, for example, how a JSON or XML file might be processed: what adaptations are required, what entity mentions would be extracted, how easy/hard is working with such data versus text (in lessons learnt), etc.

* One of the major distinctions of this pipeline is its use in various practical use-cases, but from the text the pipeline still feels perhaps a bit "abstract" in how it is described. Maybe a running example would be useful to give a more concrete feel for the pipeline?

* I'm curious about whether the pipeline considers feeding the output of the RE phase back into the KG (thus enriching/extending the KG). I would also be curious about the author's experience on this; was this tried in the use-cases?

Minor comments:
- a series of systems we [designed]
- is supposed to be given a priori -> should be given a priori [at least to my ear, the first form suggests "... but it’s not", like the bus is *supposed to be here* ... but it's not]
- sate of are -> state of the art
- Fix encoding problems in the references for certain characters

Review #2
By Tania Tudorache submitted on 14/Jul/2019
Suggestion:
Minor Revision
Review Comment:

The paper briefly describes the XI Pipeline, which helps integrate various data sources into a knowledge graph (KG). The paper also briefly presents three use cases of the XI Pipeline, and concludes with challenges and an outlook for KG research.

The topic is very timely and is clearly presented. There are a few minor typos, so the paper should be proofread before publication.

A few suggestions for further improving the paper:

- The title is very generic, and the reader might expect a more generic approach to data integration into KGs. I suggest making the title and abstract more specific and fitting to the content of the paper, for example, adding the "XI Pipeline" in the title.

- To further benefit the reader, it is recommended to add a related work section, not to compare the current approach, but to give an idea of the current landscape for doing data integration into KGs.

- If there is a full paper describing the XI Pipeline, please add a reference.

Minor editorial things:

- Please have the paper proof-read.

- desinged -> designed
- "and as end-to-end techniques to integrate structured data abound" -> does not follow in the sentence, please rephrase
- Section 2.1 Introduce the NER abbreviation (it is used in the figure, but not introduced)
- page 2, line 30, 1st col: period is missing
- "pretty distinctive" - not clear what it means (e.g., it's novel; not part of previous work?), please rephrase
- page 3, line 27, 1st col: add commas for the enumeration

Review #3
By Peter Haase submitted on 28/Jul/2019
Suggestion:
Minor Revision
Review Comment:

The author describes his recent and current work on knowledge graph-driven data integration.
At the core of this work is the XI Pipeline, with a five-step process to integrate semi-structured and unstructured data using knowledge graphs.
Concrete use cases in three areas are presented and lessons learnt are discussed.
Overall, the article is well written and provides a good overview of the author’s work.
For a vision paper for the 10-year SWJ special issue (which I understand the submission to be), there is relatively little focus on challenges, what is next, or visionary elements generally.
Perhaps there is room for adding some remarks in this direction.

Also, some aspects might be made a bit clearer:
- The title of the article prominently mentions Big Data; however, the term is not further used or explained in the article. How does this work address “Big data integration” as opposed to just “Data integration”? Neither in the presentation of the pipeline nor in the use cases did I find any indication of what would be specific to Big Data.
- For the XI Pipeline, it would be good to explain what kind of artefact this pipeline is. Is it an implemented, usable, integrated piece of software, a set of individual components, or more of a framework/process with a blueprint for how to do data integration?

Minor comments:
- desinged -> designed (pg 1)
- sate of the are -> state of the art (pg 2)