UnifiedViews: An ETL Tool for RDF Data Management

Tracking #: 1265-2477

Authors: 
Tomas Knap
Peter Hanecak
Jakub Klimek
Christian Mader
Martin Necasky
Bert Van Nuffelen
Petr Skoda

Responsible editor: 
Krzysztof Janowicz

Submission type: 
Tool/System Report
Abstract: 
We present UnifiedViews, an Extract-Transform-Load (ETL) framework that allows users to define, execute, monitor, debug, schedule, and share data processing tasks, which may employ custom plugins (data processing units) created by users. UnifiedViews differs from other ETL frameworks by natively supporting management of RDF data. In this paper, we (1) introduce UnifiedViews’ basic concepts and features, (2) demonstrate the maturity of the tool by presenting exemplary projects where UnifiedViews is successfully deployed, and (3) outline research projects and directions in which UnifiedViews is exploited. Based on our practical experience with the tool, we found that UnifiedViews contributes to simplifying the task for data providers to establish and maintain Linked Data publication processes.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Daniel Garijo submitted on 08/Jan/2016
Suggestion:
Minor Revision
Review Comment:

This paper describes an extract-transform-load framework called UnifiedViews. The framework is open source, and the authors show its usefulness through several example projects that have adopted it.

The paper is well written and easy to follow. I have found just a couple of typos, which I highlight at the end of my review. All the links I have verified work correctly, and the resources/datasets/visualizations described in the paper show the hard work behind each of the cases. Looking at the evolution of the project on GitHub, one can see that the authors are constantly working on fixing issues in the tool, that it has many contributors, and that it will be maintained in the future for upcoming projects. Even though the tool cannot be considered original, as many other ETL frameworks exist, I think that the paper shows how the tool is useful for its purpose and that different types of users are able to adopt it. Therefore, I think that the paper should be accepted in the SWJ journal, after addressing some minor remarks that I detail below.

- One of the requirements for a "Reports on tools and system" submission is that "These reports should be brief and pointed, indicating clearly the capabilities of the described tool or system". Although the paper provides the details needed to understand the tool, it is by no means brief. Maybe the authors could shorten the descriptions in section 4, which are sometimes a bit repetitive.

- In the related work, the authors do not include OpenRefine as a framework to transform, link and reconcile data in CSVs to RDF. I am surprised by this, because the tool is well known for CSV to RDF transformations. GeoKettle is another framework that is used to transform tabular data to RDF in the geographic domain.

- The limitations and future work are not presented in the paper. Some of them are mentioned in the lessons learnt, but it would be nice to see them collected in one section at the end.

- (Minor issue) I don't recommend citing Wikipedia as a source for the definition of a term.

Typos:
Introduction: "if there were graphical visualizations of the prepared tasks, which shows"->which show.
customizable ,,building blocks`` -> the comma on the bottom
Related work: "Closure functions"-> Clojure functions?

Review #2
Anonymous submitted on 30/Jan/2016
Suggestion:
Major Revision
Review Comment:

The paper presents a tool for extracting, transforming, and loading data (RDF data), dubbed UnifiedViews. Using the tool, it is possible to define, execute, monitor, schedule, and share pipelines for converting raw data to linked data. The main components (backend and frontend) are introduced. The major part of the paper presents projects where UnifiedViews is deployed.

The paper was submitted as 'Tools and Systems Report'. It is reviewed considering two dimensions: (1) Quality, importance, and impact of the described tool or system; and (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool. In this case, the points listed below should be addressed before the paper can be accepted.

UnifiedViews is not well presented. The aspects enumerated in the conclusion section (ability, flexibility, simplicity, …) are not highlighted in the paper. A dedicated section could walk through an example of defining, executing, debugging, monitoring, scheduling, and sharing a pipeline, explaining those aspects along the way. In addition, the example could illustrate: “(1) the robustness of the pipeline execution engine, (2) THE usability of the graphical user interface, and (3) THE simplicity of new DPUs’ creation”.

The major part of the paper (Section 4, with ~7 pages) is devoted to presenting projects that use UnifiedViews. More information is needed. This could be achieved by summarizing the section and using a table (each row with project information plus a pipeline description, e.g., number of DPUs and their types, execution periodicity, among others).

Other details:
- Pay attention to formatting the paper according to the IOS Press templates.
- What are the keywords?
- A formal definition of ETL framework is not given.
- What are the general ETL framework features?
- Which features are implemented in UnifiedViews?
- Which features are not implemented in UnifiedViews?
- The user interface of UnifiedViews is not depicted.
- The readability of Figure 1 should be improved. DPUs could be organized vertically, a Quality Assessor DPU should be added, and the legend for DPU types could be presented. In the figure, about the data flows' labels, are they always the same (output → input)?
- '(see Figure 2)' is used three times.
- Is it “...Clojure Language … write Closure functions.” or “...Clojure Language … write Clojure functions.”?
- Section 'Related Work' should be before section 'Conclusion'.
- It seems that something is missing between introduction and conclusion. Some statements from 'Introduction Section' could be revisited when discussing the conclusions.
- Section 'UnifiedViews in Research Projects' could be used as future work in section 'Conclusion'.
- Maybe, section 'Conclusion' could be renamed to 'Discussion and Future Works'.
- Rewrite the long sentences.

Examples of minor details (some of them):
- Standardize the use of commas (for example, before the term 'which').
- Is it 'data analysis' or 'data analyses'?
- Before an enumeration starting with (1), it is recommended to use ':'. For example: '… involves the activities of: (1) getting the data from…'

Review #3
Anonymous submitted on 08/Mar/2016
Suggestion:
Minor Revision
Review Comment:

UnifiedViews is an open source ETL framework which focuses on RDF data. The paper describes the problem of data wrangling with respect to RDF data and how UnifiedViews addresses it. The Related Work section is extensive. Several real-world deployments are described and analyzed in a structured way.

Quality and impact are shown in section 4 with 7 concrete use cases where UnifiedViews was deployed by the authors. This section is insightful and describes not only the deployments but also the challenges and lessons learnt, which is important for future work.

Regarding importance: The introduction and the amount of related work indicate the importance of the problem. Still, it remains unclear to what extent other, independent users have adopted the system. The paper would benefit from an estimate of other known installations.

The paper is clearly structured and well written. The capabilities are only briefly described, mainly in section 2.2. This description should be more extensive.
* The section does not describe the task of "sharing" DPUs (over different pipelines? or different installations?).
* What exactly does "maintenance" include?
* How are errors in one DPU handled?
* How are the different backends coordinated?
* Are there other transformations available? Which?
* Which are the concrete external systems or APIs to which UnifiedViews can load data?
Although the paper lists many libraries and frameworks used in the implementation, it is just as important to know the capabilities in more detail.
Also, Figure 2 is missing one component: the REST API.

The limitations as well as future work of the tool are well described in Sections 3, 4, and 5.

Overall the paper is clearly structured, well written, presents motivation, shows quality (maturity), and importance (to some extent).
We suggest a minor revision to improve the clarity of Section 2, esp. subsection 2.2, to answer questions such as the ones outlined above.

Minor details:

In general, citations should be preferred to footnotes, e.g., footnote 10 (DBpedia)

Some URLs redirect to new tools with new names, e.g., footnote 9 Cr-batch redirects to LD-Fusion

page 8: "provides supports" -> provides

There is a comma missing between footnotes 47 and 48

Inconsistent formatting of bibliography: entries 3 and 4 are both W3C recommendations, so they should be formatted the same way; one says "Technical Report", the other doesn't