Scalable Long-term Preservation of Relational Data through SPARQL queries

Tracking #: 384-1464

Authors: 
Silvia Stefanova
Tore Risch

Responsible editor: 
Christoph Schlieder

Submission type: 
Full Paper
Abstract: 
We present an approach for scalable long-term archival of relational databases as RDF triples, implemented in the SAQ (Semantic Archive and Query) system. In SAQ, an RDF view of a relational database, called the RD-view, is automatically generated. The RD-view can be queried by arbitrary SPARQL queries. Long-term preservation as RDF of selected parts of a database is specified in an extended SPARQL dialect, A-SPARQL, as an archival query. A-SPARQL provides flexible selection of the data to be archived in terms of SPARQL-like queries to the RD-view, which produces a data archive file containing the RDF triples representing the relational data content to be preserved. It also generates a schema archive file where sufficient meta-data are saved to allow the archived database to be fully reconstructed. An archival query usually selects both properties and their values for sets of subjects, which makes the property p in some triple patterns unknown. We call such queries where properties are unknown unbound-property queries. To achieve scalable data preservation and recreation, we propose query transformation strategies suitable for optimizing unbound-property queries. These query rewriting strategies were implemented and evaluated in a new benchmark for archival queries called ABench. ABench is defined as a set of typical A-SPARQL queries archiving selected parts of databases generated by the Berlin benchmark data generator. In the experiments, the SAQ optimization strategies were evaluated by measuring the performance of the SPARQL queries selecting triples for the archival queries in ABench. The performance of the same SPARQL queries on related systems was also measured. The results showed that the proposed optimizations substantially improve the query execution time for archival queries.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 27/Jan/2013
Suggestion:
Reject
Review Comment:

The thesis of this paper is that long-term preservation of relational data can be achieved by executing SPARQL queries over the relational database.

This paper presents 1) A-SPARQL, an extension to SPARQL for archiving data, 2) a benchmark, ABench, for archival queries, 3) optimizations for archival queries on top of relational databases (which are basically optimizations for SPARQL queries which have unbounded predicates) and 4) an evaluation of their system compared to D2RQ and Virtuoso RDF Views.

This paper attempts to unify two pieces of research, A) long-term preservation of relational data using semantic web technologies (herein Research A) and B) SPARQL execution of unbounded predicates on relational databases (herein Research B), in an unsuccessful manner. I consider Research A to be new to the audience of the journal; it lacks motivation, background, and related work, and leaves many open questions. Research B could stand by itself; however, its connection to Research A is weak.

I recommend rejecting this current version of the paper and giving the authors an opportunity either to present a paper just on Research B, or to expand the paper on Research A and make a tighter connection with Research B.

Personally, I would be interested in seeing a paper that unifies Research A and B. I see A-SPARQL as nice syntactic sugar for SPARQL CONSTRUCT queries, applicable not only to long-term preservation of relational data but also to general ETL of relational data to RDF.

I'll present comments which I hope the authors will find useful in making this paper a successful publication.

== Comments on Research A ==

1) Background on Long-term preservation of data: There is no introduction to "long-term preservation of data". Personally, I was not aware of this research area (which seems very interesting) and I had to look it up myself. Not a single paper on the topic is cited. I would suspect that the audience of this journal would need some introduction to this area.

2) Why semantic web technologies: The authors only cite [11] to motivate why semantic web technologies are useful for long-term preservation of data. However, this is not expanded on at all. It may be obvious to the authors and to researchers in the field of long-term preservation of data, but the following questions immediately jump to mind:
a) How is XML used for long-term preservation of data? What are the pros and cons?
b) If the work is focused on relational databases, why aren't CSV dumps sufficient?
c) If the objective is to reload data back into a relational database, what happens with the indexes and constraints (not nulls, checks)? These do not seem to be encoded anywhere. Without this, reconstructing the database can result in slow execution due to the lack of indexes, or in data susceptible to violations of integrity constraints.
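
To make (c) concrete, a schema archive would presumably need to carry constraint metadata as triples along the following lines (the vocabulary here is entirely hypothetical; I found nothing like it in the paper):

saq:product_pk rdf:type saq:PrimaryKeyConstraint .
saq:product_pk saq:onTable saq:product .
saq:product_pk saq:onColumn saq:product_id .
saq:product_label_nn rdf:type saq:NotNullConstraint .
saq:product_label_nn saq:onColumn saq:product_label .

Without something of this kind in the schema archive file, reconstruction cannot restore keys, indexes or checks.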

3) A-SPARQL Syntax:

The syntax of A-SPARQL is briefly explained, but not well defined. The lack of a well defined syntax leads to the following questions based on the examples in the paper:

a) If I understand correctly, there is only one RD-view, so why is an RD-view URI needed? Wouldn't it always be the same? In all the examples, it's always . When would it not be the same? How is that URI generated?
b) Following the description of the syntax, the following query seems to be well-formed:

ARCHIVE AS 'data1.nt', 'schema1.nt'
FROM

assuming everything after the FROM is optional. How is that query different from:

ARCHIVE AS 'data1.nt', 'schema1.nt'
FROM
TRIPLES {?s ?p ?o}

c) After the keyword TRIPLES comes an "archived triple pattern". However, Query A8 has a set of triple patterns. I assume that instead of an "archived triple pattern", it is a basic graph pattern, which can have one or more triple patterns.
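
For example, under the basic graph pattern reading, a query of the following shape should be well-formed (my own sketch, modeled on my understanding of A8; the file names are placeholders and <rd-view> stands for the RD-view URI):

ARCHIVE AS 'data8.nt', 'schema8.nt'
FROM <rd-view>
TRIPLES { ?s rdf:type saq:product .
          ?s saq:product_label ?label }

Stating explicitly that TRIPLES takes a basic graph pattern would remove this ambiguity.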

d) It's not clear if CLASSES, PROPERTIES and TRIPLES can be combined in the same query. It may be my misunderstanding of how the syntax is described. However, it is my understanding that terms enclosed in brackets are optional, and that terms separated by a vertical bar '|' indicate that a choice needs to be made. Based on this assumption, it seems that CLASSES, PROPERTIES and TRIPLES may be combined. Please clarify.

4) A-SPARQL Semantics:

a) A-SPARQL seems to be a syntactic sugar for CONSTRUCT queries. The authors state it themselves: "Archival queries are straight-forward to translate into CONSTRUCT queries.". The translation rules are presented in Appendix 1. This translation is crucial and one of the important contributions of the paper. Why is it stuck in the appendix? Please put it in the body of the paper.
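
Even a minimal instance of the translation would help the reader here. As far as I can tell from Appendix 1, a TRIPLES-only archival query such as

ARCHIVE AS 'data1.nt', 'schema1.nt'
FROM <rd-view>
TRIPLES { ?s ?p ?o }

should translate into

CONSTRUCT { ?s ?p ?o }
FROM <rd-view>
WHERE { ?s ?p ?o }

where <rd-view> again stands for the RD-view URI. Please confirm whether this reading is correct.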

b) Additionally, I recommend taking a closer look at the translation rules because they are a bit confusing. Assuming that CLASSES, PROPERTIES and TRIPLES may be combined, then looking at 1.a, it seems that steps i and ii are supposed to be applied consecutively. Is this true? I see that steps 1.a.i and 1.b.i apply to Q2 and 1.a.ii and 1.b.ii apply to Q3. But are steps i and ii mutually exclusive? What happens if you have the following query:

ARCHIVE AS 'data1.nt', 'schema1.nt'
FROM
CLASSES saq:product ;
PROPERTIES saq:product_label

Following those rules, I understand that the generated SPARQL query would be:

CONSTRUCT
{
  ?subject ?property ?value .
  ?subject1 saq:product_label ?value1
}
FROM
WHERE
{
  {
    ?subject ?property ?value .
    saq:product rdf:type rdfs:Class .
    ?subject rdf:type saq:product .
  }
  UNION
  {
    ?subject1 saq:product_label ?value1 .
  }
}

If so, this seems redundant. The same question arises if TRIPLES are involved.

If CLASSES, PROPERTIES and TRIPLES are not supposed to be combined, then please disregard this comment. Nevertheless, this confusion comes from the lack of a well defined syntax.
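
If the combination is allowed, I would expect a non-redundant translation to drop a PROPERTIES branch whenever a CLASSES branch subsumes it. Assuming saq:product_label occurs only on instances of saq:product, the expected query would then simply be:

CONSTRUCT { ?subject ?property ?value }
FROM <rd-view>
WHERE
{
  saq:product rdf:type rdfs:Class .
  ?subject rdf:type saq:product .
  ?subject ?property ?value .
}

This is only my guess at the intended semantics; precisely stated translation rules would settle it.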

c) I would recommend to formally present the semantics either by 1) using rules/datalog syntax to represent the translation or 2) defining its own semantics following the approach of the semantics of SPARQL by Perez et al (Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. 2009. Semantics and complexity of SPARQL. ACM Trans. Database Syst.) and comparing the expressivity of A-SPARQL with SPARQL CONSTRUCT. This way, there would be no room for ambiguity.

5) The authors state that they use the W3C Direct Mapping. However, recreating a relational database from an RDF graph (data and schema) that was generated by the W3C Direct Mapping has not been well studied. I'm not saying that it is not possible (most cases seem straightforward), but there are still unknown cases. For example, the W3C Direct Mapping specification states (last sentence of the last paragraph of section 2.1): "The direct mapping does not generate triples for NULL values. Note that it is not known how to relate the behavior of the obtained RDF graph with the standard SQL semantics of the NULL values of the source RDB." The work of Sequeda et al. (Juan F. Sequeda, Marcelo Arenas, and Daniel P. Miranker. 2012. On directly mapping relational databases to RDF and OWL. In WWW) introduced an augmented direct mapping which was proved to be information and query preserving, even for databases that have NULL values. If the mapping that is used in this work is not proved to be information preserving, then there is really no guarantee of long-term preservation of the data, because data may be lost in the mapping.

Nevertheless, in the middle of the paper, I realized that the authors don't actually use the W3C Direct Mapping. In section 4.1, they state: "we define a unique RDFS class for each relational table, except for link tables representing set-valued properties as many-to-many relationships. In addition, RDF schema properties are defined for each column in a table." Furthermore, mapping tables map relational schema elements to RDFS concepts. This suggests to me that the mapping being used is not the W3C Direct Mapping and is very similar (if not equal) to the augmented direct mapping of Sequeda et al. Therefore, I suggest that the authors clarify what exact direct mapping they are using. If it turns out to be the same direct mapping as Sequeda et al.'s, then it is guaranteed to be information preserving (which is the whole point of this research). If their direct mapping is different, then I expect to see a proof of the information preservation of their direct mapping.
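
To illustrate the NULL problem with a minimal example of my own (Turtle-style, with an approximate subject URI): given a table product(id, label) containing the row (42, NULL), a direct mapping generates something like

<Product/id=42> rdf:type saq:product .
<Product/id=42> saq:product_id 42 .

and, per the specification, no triple at all for the NULL label. From the data archive alone one cannot distinguish a NULL value from a column that never existed, so the schema archive must carry enough information to restore the NULL on reconstruction, and this needs to be shown.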

In conclusion, there needs to be a thorough explanation of 1) what long-term preservation of data is, 2) its importance (this may be obvious, but it is still needed), 3) current approaches to this problem and their drawbacks, and 4) why semantic web technologies are promising for solving it. Additionally, the issues with the syntax and semantics of A-SPARQL need to be addressed.

== Comments on Research B ==

1) The D-View is defined as a union of sub-views: column view, foreign key view, etc. However, none of the sub-views are defined. How is C_T.A(s,p,v) generated?
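
My guess is that each column view is populated by a datalog-style rule roughly like the following (my own reconstruction, not given in the paper):

C_T.A(s, p, v) :- T(k, ..., a), s = uri(T, k), p = uri(T.A), v = a, a <> NULL.

where uri(T, k) builds the subject URI from the table name and primary key value, and uri(T.A) builds the property URI from the column name. If this is what is intended, please state it; if not, the actual definitions are needed.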

2) Following my question (1), the sub-views, even though not defined, seem very similar to the datalog rules of the W3C Direct Mapping (Appendix B) and Sequeda et al.'s Augmented Direct Mapping. What is the relationship?

3) What is the difference between the OR for T.A and the OR for F, etc.? What special properties does the OR have for each sub-view? Why are they different?

4) A union of views is also the implementation approach taken by Ultrawrap. How does your approach differ? I assume that the difference is that Ultrawrap actually creates SQL views while SAQ never creates them explicitly. Further discussion and comparison would be useful.

5) The authors state: "The D-view is usually very large, containing many disjunctions. Naive processing of such a view in the RDB is slow." This is not an accurate claim. Ultrawrap experimentally shows that query execution of bounded predicate queries on a union of views on top of a commercial relational database is comparable to that of their semantically equivalent SQL queries. An accurate claim would be that "query execution of unbounded predicate queries is slow".

6) The benchmark queries sometimes include the triple: ?class rdf:type rdfs:Class. I believe that this triple is needed for SAQ because it accesses their mapping table, which maps schema elements to RDFS elements. Did the benchmark queries for D2RQ and Virtuoso include that triple? If so, this may be a cause of the slow performance. In that case, what happens when that triple is not included? What happens to SAQ?
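
For clarity, my reading is that the benchmark queries have roughly this shape (my illustration, not the actual benchmark text):

SELECT *
WHERE { ?class rdf:type rdfs:Class .
        ?s rdf:type ?class .
        ?s ?p ?v }

whereas the variant without the first triple pattern would be the natural formulation for D2RQ and Virtuoso. Reporting numbers for both variants on all three systems would settle this question.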

In conclusion, the optimizations presented for unbounded predicate SPARQL queries on top of directly mapped relational databases are conspicuous and needed in order to achieve better query performance. However, details on the setup of the views are missing, which makes it very hard to reproduce these experiments.

== Comments on connecting Research A and B ==

1) Research B stands by itself. SPARQL query execution on top of relational databases is an important topic, and optimizing unbounded predicate queries is necessary. However, I don't see why archival SPARQL queries on a relational database *need* to be translated to unbound-predicate SPARQL CONSTRUCT queries. Actually, for a relational database, the GCT transformation could be applied to the translation of the A-SPARQL query into the SPARQL CONSTRUCT query, and then the SPARQL query would not have unbound predicates. This is possible because the mapping already has knowledge of all of the schema elements. If table Product has n attributes, and table Product is mapped to the saq:Product class, it is straightforward to know that the query also needs to cover those n attributes. However, if we want to use A-SPARQL on top of an RDF database that is not mapped to a relational database, then an unbound-predicate query is necessary, because all the predicates that can be attached to saq:Product are unknown until all the data is considered (the flexibility of RDF allows this). If this paper were to focus only on Research B, then these questions would not come up. But by presenting the paper with a connection between A and B, this question needs to be addressed.
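
To make this concrete, suppose the mapping records that instances of saq:Product carry only the (hypothetical) properties saq:product_label and saq:product_price. Then instead of executing the unbound-predicate query

CONSTRUCT { ?s ?p ?v }
FROM <rd-view>
WHERE { ?s rdf:type saq:Product . ?s ?p ?v }

the translator could emit a bound-predicate query at compile time, along the lines of

CONSTRUCT { ?s rdf:type saq:Product .
            ?s saq:product_label ?v1 .
            ?s saq:product_price ?v2 }
FROM <rd-view>
WHERE { ?s rdf:type saq:Product .
        OPTIONAL { ?s saq:product_label ?v1 }
        OPTIONAL { ?s saq:product_price ?v2 } }

(a CONSTRUCT template silently skips triples with unbound variables, so the OPTIONALs reproduce the original result). This is only a sketch, but it shows that the unbound-predicate formulation is not forced by the relational setting.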

2) It seems that the work on Research B was done first and long-term preservation of data became a use case to apply Research B. I may be wrong, but this is the impression I get from the weak connection between A and B. Therefore, the paper suffers from "a solution looking for a problem".

3) The benchmark queries seem to have been tailored to show the application of each optimization. In a way this is fine, but it also raises the question: do these types of queries come up in the real world? What are the types of queries that real-world users would write in A-SPARQL? What is the motivation for these queries?

== Overall Minor Comments ==

- Formatting errors all over the place.

- The paper cites [3] for the direct mapping. This is a citation of a W3C Use Case Working Draft. The correct citation should be:
M. Arenas, A. Bertails, E. Prud’hommeaux, and J. Sequeda. A Direct Mapping of Relational Data to RDF. W3C Recommendation, 27 September 2012, http://www.w3.org/TR/rdb-direct-mapping/.

- Why is [22] cited when making reference to Datalog? I would suggest citing instead the Foundations of Databases book by Abiteboul, Hull and Vianu.

Review #2
By Günther Görz submitted on 28/Jan/2013
Suggestion:
Accept
Review Comment:

Summary and Main Contributions

The authors address a very important problem, scalable long-term archival of relational databases as RDF triples. For this purpose, they implemented the SAQ (Semantic Archive and Query) system, which they motivate by a striking example, define theoretically, describe in implementation detail, and finally evaluate on a new benchmark, with a thorough discussion. In SAQ, an RDF view of an RDB is automatically generated, and long-term preservation as RDF of selected parts of it is specified in an extended SPARQL dialect. A data archive as well as a schema archive are generated. The authors show that their modular approach is clearly superior to other approaches, which implement compilers that translate SPARQL directly into SQL.

Strong points of the paper

The organisation and presentation of the paper are very good; its linguistic expression is clear and easy to follow. The original contribution of the paper is very well motivated and the argumentation is supported by convincing examples. Related work is covered in detail. With its new benchmark, derived from the well-known Berlin benchmark, the authors illustrate the superior performance of their approach.

Possible improvements

None. I think the paper is excellent and should be published "as-is".

Review #3
By Christoph Schlieder submitted on 06/Feb/2013
Suggestion:
Major Revision
Review Comment:

Approaches for querying RDF views of relational data are definitively of interest to the Semantic Web community and more specifically to the SWJ. The authors describe a flexible approach which permits selecting parts of a relational database for archiving purposes by querying an RDF view of the data using an extension of SPARQL. The main technical contribution consists in query transformations that help optimize unbound-property queries. The article presents a convincing solution to the unbound-property query problem. A benchmark study provides evidence for the gains obtained by applying the optimizations. The approach has the potential to contribute to the preservation of relational data, a topic of growing importance for digital preservation research. This said, I also have some concerns which lead me to the conclusion to accept the article for publication under reserve of major modifications.

My main concern regards the contribution to digital preservation. Reading the article from a long-term preservation perspective, I was disappointed to find that, despite the title, the issue of preservation of relational data is not really addressed. Storing an RDF view of a database cannot be considered a long-term archiving strategy per se. No requirements from a preservation workflow – at whatever level of abstraction – are mentioned to motivate the approach. This gap in the description definitively needs to be closed. Some suggestions follow below.

Another concern regards the presentation of the core technical contribution. The article extends a paper that the authors presented at the ISWC workshop on Scalable Semantic Web Knowledge Base Systems (SSWS-2011). The workshop paper describes the GCT transformation algorithm, which constitutes a crucial step in the optimizations described in the present article. It is perfectly acceptable to (re)publish such a result in the SWJ since sufficient additional material is provided. In trying to minimize the overlap with the workshop paper, the authors seem to have shifted the focus towards ABench, the new extended set of benchmark queries, and their empirical evaluation. As a consequence, they moved the GCT algorithm – as well as the translation rules for A-SPARQL queries – to the appendix, where they are obviously misplaced.

Both concerns are related and could be remedied by integrating the technical contributions into the main text and, more importantly, by highlighting the contributions of the approach to the long-term preservation of relational data.

The following remarks and questions are meant to assist the authors with linking their results to research on long-term preservation of (relational) data.
(1) Provide arguments why one could not just archive the result of the RDB2RDF mapping. This would probably be the first question raised by digital preservation researchers. Unfortunately, digital preservation research is to a large extent document-centric and has not paid as much attention to relational data as it deserves (see, however, remark 7). As an entry point into the literature, consider:

Giaretta, D. (2011). Advanced Digital Preservation, Springer.
Borghoff, U. et al. (2010). Long-Term Preservation of Digital Documents, Springer.

(2) Your approach seems to be much more consistent with a publication workflow than with most archiving workflows. Data selection is the rule in publishing but it is the exception in archiving. In publishing, data is selected to match the information demands of a known group of users. In archiving, one tries to avoid data selection wherever possible because future users cannot undo the selection and it is almost impossible to guess what type of information a user might need 40 years from now.

(3) There are, however, cases where data selection is necessary. Examples include compliance with data privacy legislation or intellectual property rights. Certain data related to individuals may not be archived for an indefinite time span and data with unclear IP status is often excluded from archiving. Additionally, some application areas call for selection strategies:

Jobst, M. (ed.) (2011). Preservation in Digital Cartography, Springer
Masanès, J. (ed.) (2006). Web Archiving, Springer

(4) You should identify what archiving workflow your approach is designed for. Can it be described in a standard way, such as in the OAIS reference model or the DCC curation life cycle model? What examples of selection strategies occur in your workflow? How do the queries from the benchmark set relate to these selection strategies?

Reference Model for an Open Archival Information System (OAIS) (2012). Recommended Practice CCSDS 650.0-M-2, Magenta Book, Consultative Committee for Space Data Systems.
Higgins, S. (2008). The DCC Curation Lifecycle Model, International Journal of Digital Curation, Vol. 3, No. 1, pp. 134-140.

(5) How does your approach compare to approaches for preservation of relational data not based on RDF? One example:

Ramalho, J., Ferreira, M., Faria, L. & Castro, R. (2007). Relational database preservation through XML modeling, Proceedings Extreme Markup Languages, Montreal August 7-10, 2007, http://conferences.idealliance.org/extreme