High-Level ETL for Semantic Data Warehouses

Tracking #: 2663-3877

Rudra Pratap Deb Nath
Oscar Romero
Torben Bach Pedersen
Katja Hose

Responsible editor: 
Philippe Cudre-Mauroux

Submission type: 
Full Paper

The popularity of the Semantic Web (SW) encourages organizations to organize and publish semantic data using the RDF model. This growth poses new requirements to Business Intelligence (BI) technologies to enable On-Line Analytical Processing (OLAP)-like analysis over semantic data. The incorporation of semantic data into a Data Warehouse (DW) is not supported by traditional Extract-Transform-Load (ETL) tools because they do not consider semantic issues in the integration process. In this paper, we propose a layer-based integration process and a set of high-level RDF-based ETL constructs required to define, map, extract, process, transform, integrate, update, and load (multidimensional) semantic data. Unlike other ETL tools, we automate the ETL data flows by creating metadata at the schema level, which relieves ETL developers from the burden of manual mapping at the ETL operation level. We create a prototype, named Semantic ETL Construct (SETLCONSTRUCT), based on the innovative ETL constructs proposed here. To evaluate SETLCONSTRUCT, we use it to create a multidimensional semantic DW by integrating a Danish Business dataset and an EU Subsidy dataset, and compare it with the previous programmable framework SETLPROG in terms of productivity, development time, and performance. The evaluation shows that 1) SETLCONSTRUCT requires 92% fewer Typed Characters (NOTC) than SETLPROG, and SETLAUTO (the extension of SETLCONSTRUCT that generates the ETL execution flow automatically) further reduces the Number of Used Concepts (NOUC) by another 25%; 2) using SETLCONSTRUCT, the development time is almost cut in half compared to SETLPROG, and is cut by another 27% using SETLAUTO; 3) SETLCONSTRUCT is scalable and has performance similar to SETLPROG.

Solicited Reviews:
Review #1
Anonymous submitted on 26/Jan/2021
Review Comment:

The authors addressed all my concerns from the original review and also added more empirical evaluations by two ETL specialists.

There seems to be a slight problem with the word “\scon”, which appears twice in Table 6. It should be “SETL_CONSTRUCT”.

Review #2
Anonymous submitted on 02/Feb/2021
Review Comment:

This is a revision of a previously submitted (and reviewed) paper on ETL workflows. The paper's previous version was positively evaluated, and the new version is similarly good enough for publication. However, a few issues remain, which I analyse below in detail. Some of the proposed improvements are reiterations of the comments I had in the previous version.

The definition of RUP is still confusing. It was clear to me given the clarifications in the rebuttal, but the reader will not have access to that, and I was not able to understand the role of RUP before reading the rebuttal. F_R is clear, it is a relationship among levels. Then, RUP is defined as a member of this set. So RUP_{L_i}^{L_j} is ONE pair; but apparently this is not the intention of the authors. The intention is to give a name to each such pair (from F_R). Therefore, if I understand well, RUP is a (naming) function that relates each pair (L_i,L_j) \in F_R to some name (e.g., sdw:payMonth, sdw:payYear etc.). Thus, it should be defined like this.
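The reviewer's suggested reformulation could be written as follows (a minimal sketch; the symbol N for the set of roll-up property names, e.g., sdw:payMonth, is assumed here and is not the paper's notation):

```latex
% RUP as a naming function over the roll-up relation F_R,
% rather than as a single member of F_R:
\[
  RUP \colon F_R \to N, \qquad (L_i, L_j) \mapsto RUP_{L_i}^{L_j} \in N
\]
% i.e., each pair (L_i, L_j) \in F_R is assigned the property
% RUP_{L_i}^{L_j} (e.g., sdw:payMonth) that rolls up level L_i to L_j.
```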

With regards to Definition 3, although I do accept the argument that the correctness of the process is beyond the scope of the current paper, I would appreciate having such a statement before or after Definition 3, along with the acknowledgement that the proposed process is not fail-safe and could be replaced by another (but it is not within the scope of the paper to solve this problem).

Definition 4: I'm confused with the use of e_{t_i} and e_{s_i}. What are these? I assume the authors meant to write c_t and c_s respectively...?? If not, please explain what these symbols stand for.

Page 20-21 (UpdateLevel): as the authors explain, there are three different ways to make the update (called "update types"). Where is this specified? I mean, shouldn't the update type be a parameter of "UpdateLevel"?
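The reviewer's suggestion could be sketched as a hypothetical operation signature in which the update type is an explicit parameter (all names here, UpdateType, update_level, and the dictionary layout, are illustrative assumptions, not the paper's actual API; the three types are read in the spirit of slowly-changing-dimension handling):

```python
from enum import Enum


class UpdateType(Enum):
    """Hypothetical enumeration of the three update types; names are
    illustrative only (read as slowly-changing-dimension styles)."""
    TYPE1 = 1  # overwrite the old level member in place
    TYPE2 = 2  # version the level member, keeping its history
    TYPE3 = 3  # keep the previous value in an additional property


def update_level(level, members, update_type=UpdateType.TYPE1):
    """Hypothetical UpdateLevel signature with the update type as a
    parameter, as the review suggests; the body is a placeholder that
    merely records which update strategy was requested."""
    return {
        "level": level,
        "updateType": update_type.name,
        "members": list(members),
    }


# With an explicit parameter, the chosen strategy is visible at the
# call site instead of being implicit in the operation's definition:
result = update_level("sdw:Recipient", ["member1"], UpdateType.TYPE2)
```

Making the update type a parameter would let one UpdateLevel construct cover all three behaviours while keeping each ETL flow's choice explicit.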

Just before Section 6.3: "the resulting inference is asserted in the form of triples, in the same spirit as how SPARQL deals with inference". I'm not sure I understand this sentence, as, to my knowledge, SPARQL does NOT deal with inference. Unless the authors mean something else that I missed.

Minor comments and typos:

- "datatabase"

- Definition 1: I suggest you use bullets (enumerate or itemize) to help the reader grasp the definition (which is quite complex).

- "this operations"

- "another concept-mapping as,"

- Page 22, right column: I suggest describing Algorithm 2 in its own paragraph. Currently it is described partly with Algorithm 1, partly with Algorithm 3, and partly in its own paragraph. Perhaps break the paragraph at line 31 and reorganise the text below...?

- "reduce use 92% fewer"

- Table 4: "Mapping Generartion"

- Page 41, line 37: ";;"

Review #3
By Patrick Schneider submitted on 16/Feb/2021
Review Comment:

Overall, this publication is an appealing contribution in the overlapping fields of Semantic Web and Business Intelligence technologies, where a two-layered approach for a semantified ETL process was introduced and evaluated.

I am now recommending acceptance of this version for the SWJ, as the authors incorporated and addressed all of my concerns. In particular, I appreciate that they added a qualitative evaluation of their approach, in which two ETL experts were interviewed, leading to further insight into their work.