Enhancing Virtual Ontology Based Access over Tabular Data with Morph-CSV

Tracking #: 2459-3673

David Chaves-Fraga
Edna Ruckhaus
Freddy Priyatna
Maria-Esther Vidal
Oscar Corcho

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
Ontology-Based Data Access (OBDA) has traditionally focused on providing a unified view of heterogeneous datasets (e.g., relational databases, CSV and JSON files), either by materializing integrated data into RDF or by performing on-the-fly querying via SPARQL query translation. In the specific case of tabular datasets represented as several CSV or Excel files, query translation approaches have been applied by considering each source as a single table that can be loaded into a relational database management system (RDBMS). Nevertheless, constraints over these tables are not represented (e.g., referential integrity among sources, datatypes, or data integrity); thus, neither consistency among attributes nor indexes over tables are enforced. As a consequence, efficiency of the SPARQL-to-SQL translation process may be affected, as well as the completeness of the answers produced during the evaluation of the generated SQL query. Our work is focused on applying implicit constraints on the OBDA query translation process over tabular data. We propose Morph-CSV, a framework for querying tabular data that exploits information from typical OBDA inputs (e.g., mappings, queries) to enforce constraints and can be used together with any SPARQL-to-SQL OBDA engine. Morph-CSV relies on both a constraint component and a set of constraint operators. For a given set of constraints, the operators are applied to each type of constraint with the aim of enhancing query completeness and performance. We evaluate Morph-CSV in several domains: e-commerce with the BSBM benchmark; transportation with a benchmark using GTFS dataset from the Madrid subway; and biology with a use case extracted from the Bio2RDF project. We compare and report the performance of two SPARQL-to-SQL OBDA engines, without and with the incorporation of Morph-CSV. 
The observed results suggest that Morph-CSV is able to speed up the total query execution time by up to two orders of magnitude, while still producing all the query answers.
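
To make the idea in the abstract concrete, the following is a minimal sketch of what enforcing constraints means when CSV files are loaded into an RDBMS: explicit datatypes, primary and foreign keys, and an index on the join column. It is not Morph-CSV's implementation. The file contents, table names, and column names are hypothetical, and SQLite is used only because it ships with Python.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for two GTFS-like files.
ROUTES_CSV = "route_id,route_name\nr1,Line 1\nr2,Line 2\n"
TRIPS_CSV = "trip_id,route_id,duration_min\nt1,r1,30\nt2,r2,45\nt3,r1,25\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def load_with_constraints(conn):
    """Load both files with explicit datatypes, keys, and an index."""
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE routes ("
        " route_id TEXT PRIMARY KEY,"
        " route_name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE trips ("
        " trip_id TEXT PRIMARY KEY,"
        " route_id TEXT NOT NULL REFERENCES routes(route_id),"
        " duration_min INTEGER NOT NULL)"
    )
    # Index on the join column, so joins do not need a full scan of trips.
    conn.execute("CREATE INDEX idx_trips_route ON trips(route_id)")
    conn.executemany(
        "INSERT INTO routes VALUES (:route_id, :route_name)",
        read_rows(ROUTES_CSV),
    )
    conn.executemany(
        "INSERT INTO trips VALUES (:trip_id, :route_id, :duration_min)",
        read_rows(TRIPS_CSV),
    )


def trips_per_route(conn):
    """Count trips and sum durations per route via a join on route_id."""
    query = (
        "SELECT r.route_name, COUNT(*), SUM(t.duration_min)"
        " FROM trips t JOIN routes r ON t.route_id = r.route_id"
        " GROUP BY r.route_name ORDER BY r.route_name"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
load_with_constraints(conn)
RESULT = trips_per_route(conn)
print(RESULT)
```

Without the declared `INTEGER` type, `duration_min` would be stored as text, and without the foreign key a trip pointing to a non-existent route would be accepted silently. These are the kinds of effects on performance and answer completeness that the abstract refers to.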

Major Revision

Solicited Reviews:
Review #1
By Pieter Heyvaert submitted on 05/May/2020
Minor Revision
Review Comment:

The work presented in this article is interesting and very relevant for the Semantic Web community and provides a solution to improve the completeness and performance of virtual OBDA systems. Below are my comments per section.

# 1. Introduction
What are the operators mentioned in the goals?

# 2. Motivating example
The performance issue is only explained in the caption of the figure and not in the text. It helps the reader to also explain it in the text.

# 3. Ontology Based Data Access Over Tabular Data
How do the properties of 3.2 align with the challenges of 3.1?

# 4. The Morph-CSV Framework
Can you add an example of what is explained in 4.2? Without one, it is pretty hard to understand.

For #answers(eval(Q, theta++(VTD))) >= #answers(eval(Q, theta(VTD))), the first term is missing a closing bracket.

For #time(eval(Q, theta++(VTD))) <= #time(eval(Q, theta(VTD))), the equation says <= but the text says <.
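
For reference, the two conditions discussed in the comments above can be typeset as follows. The symbols are taken from the plain-text notation in this review. Rendering `theta++` as a superscript and using the non-strict inequality for the time condition are assumptions, since the review notes that the paper's equation and text disagree on that point.

```latex
\begin{align}
\#\mathit{answers}\bigl(\mathit{eval}(Q, \theta^{++}(\mathit{VTD}))\bigr)
  &\geq \#\mathit{answers}\bigl(\mathit{eval}(Q, \theta(\mathit{VTD}))\bigr) \\
\#\mathit{time}\bigl(\mathit{eval}(Q, \theta^{++}(\mathit{VTD}))\bigr)
  &\leq \#\mathit{time}\bigl(\mathit{eval}(Q, \theta(\mathit{VTD}))\bigr)
\end{align}
```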

Figure 3 uses "Tabular Dataset" during Source Selection, but how are updates of that dataset processed? Section 3.1 mentions that for other systems this leads to performance issues.

For Figure 4b it would be good to mention that YARRRML is used, because not everybody might know that and they might expect something like [R2]RML.

On page 10, line 26: what is the actual problem here? That is not clear from the example.

# 5. Evaluation

Nice that there are research questions, but it would be good to have corresponding hypotheses that are checked during the evaluation.

What is meant by research question 3? I don't understand the "different levels of data heterogeneity".

Why are Ontop and Morph-RDB chosen for the evaluation?

Regarding the metrics, I don't see anything about the actual query results. Are all queries correctly answered? This should be discussed in the text.

How is the baseline determined? What tool is used to generate the baseline?

On page 14, line 46, what is meant by the "difference between the two approaches is not very relevant and is maintained across the datasets"?

It was only upon reaching the end of page 14 that I understood that Morph-CSV cannot be used on its own (right?). This should be made clearer in the text. I see it in the abstract, but not really in the text.

In Figure 11c what are the red parts?

It might be good to merge Figures 15 and 16 so that the results could be better compared.

# 7. Related Work

It is not clear what the approach of [38] is lacking.

If you have questions or remarks don't hesitate to contact me!

Review #2
Anonymous submitted on 31/May/2020
Minor Revision
Review Comment:

This paper describes an approach for accessing heterogeneous data sources through the OBDA paradigm. The authors specifically consider a novel aspect, that is, tabular data with the inclusion of constraints. This allows ensuring consistency among attributes. The proposal described in this paper, called Morph-CSV, defines a set of constraint operators that allow providing more complete answers while at the same time possibly guaranteeing more efficiency.

-Section 2: the motivating example, although taken from real data (if I understood correctly), seems a bit simple. Setting up a whole infrastructure for mere string matching seems a bit of overkill. In this particular case, an agreement on the field names could easily have been reached. I suggest including a more complex example.

-Section 3: it provides a very good background on OBDA, with particular emphasis on OBDA in the context of tabular data. However, it is not completely clear which challenges are specific to tabular data. The authors should put more emphasis on the challenges related to tabular data (along the same lines as what has been done for the Heterogeneity challenge).

-Section 4: it is devoted to describing the assumptions and the mapping to the specific case of OBDA over tabular data. As for the problem statement (4.3), it is not immediately clear why the constraint on the cardinality of the solution should hold. Please clarify. The remainder of this section describes the steps of the framework.
-Section 5: this section describes an extensive experimental evaluation devoted to investigating aspects related to the combination of constraints, their impact when the dataset increases in size, and the running time.

This is a very nice paper, clearly written and with an extensive experimental evaluation. I think it can have a real-world impact due to the massive amount of tabular data. I would like to read a bit more about how authors intend to extend this framework to a distributed scenario (as pointed out in the Future Work).

-Section 5:
- line 40: "int the"

Review #3
Anonymous submitted on 18/Sep/2020
Major Revision
Review Comment:

This paper describes Morph-CSV, a framework for querying tabular data that exploits information from typical OBDA inputs (e.g., mappings, queries) to enforce constraints and can be used together with any SPARQL-to-SQL OBDA engine. Morph-CSV is evaluated against other OBDA approaches using benchmarks such as BSBM, GTFS and Bio2RDF.

The paper is well written and the problem and solution are well explained. I have the following comments and questions, which need to be addressed.

- The paper states, "We address the limitations of current OBDA query translation techniques over tabular data.", but this is not evaluated in the experiments. To support the claim, the translated queries in the presence of Morph-CSV should be compared with the same queries translated in the absence of Morph-CSV, in terms of these limitations and their effect.

- Another claim is that of query completeness. In the experimental evaluation, all the engines return the same number of results for all the queries that the engine runs successfully. So it would be interesting to see cases (queries) where Morph-CSV returns complete results while the other approaches do not.

- Another question here is what criteria are followed to determine query completeness.

- It would be interesting to see a comparison of Morph-CSV (+ Ontop or Morph-RDB) with Apache Drill (+ Ontop or Morph-RDB) and q (+ Ontop or Morph-RDB) in terms of performance and query completeness. q also supports automatic detection of column types, so the comparison would be interesting to see.

- The evaluation needs to be more granular to give more insight. It would be interesting to see more detailed metrics in the experimental evaluation, e.g., splitting the total query execution time into the time taken by the different steps involved in query execution: load time, index creation time, constraint extraction time, transformation function time, and normalization time. These metrics may not be available for other approaches, but they would give the exact contribution of each step to the query execution of Morph-CSV. Also, how many indexes are created for each query, if any, and of which type?
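
As an illustration of the per-step breakdown requested above, the following is a minimal sketch of how such timings could be collected. The step names follow the ones listed in this comment. The step bodies are placeholders, and `timed` and `run_pipeline` are hypothetical helpers, not part of Morph-CSV or of any engine.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(step, timings):
    """Add the wall-clock time of the enclosed block to timings[step]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[step] = timings.get(step, 0.0) + elapsed


def run_pipeline(steps):
    """Run (name, callable) pairs in order and return seconds per step."""
    timings = {}
    for name, func in steps:
        with timed(name, timings):
            func()
    # The total is the sum of the measured steps, so the shares add up.
    timings["total"] = sum(timings.values())
    return timings


def placeholder_work():
    """Stand-in for real work; replace with the actual step."""
    sum(range(10_000))


# Hypothetical step list mirroring the metrics named in the review.
STEPS = [
    ("constraint_extraction", placeholder_work),
    ("load", placeholder_work),
    ("transformation_functions", placeholder_work),
    ("normalization", placeholder_work),
    ("index_creation", placeholder_work),
    ("query_execution", placeholder_work),
]

TIMINGS = run_pipeline(STEPS)
for name, seconds in TIMINGS.items():
    print(f"{name}: {seconds:.6f} s")
```

Reporting each step next to the total makes it visible which steps dominate, for instance whether index creation pays off for a given query.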

- One approach could be to perform loading, constraint application, and normalization before running queries. What is your reason for not comparing your approach with this one? Result freshness can be a problem with this approach, but it can be tackled to a certain extent by running these steps at a fixed time interval, perhaps based on the data update frequency.

Caching is very important in such on-the-fly processing. Do Morph-CSV or the other approaches support any caching mechanism? If so, a comparison with cold and warm caches should be included.