Review Comment:
In this work, the authors are interested in designing a read-write Linked Data interface for a legacy information system that relies on a relational database. More precisely, their goal is to set up a read-write OSLC web service together with a read-only SPARQL endpoint, where the latter provides the rich querying capabilities that the former lacks.
In the first part of this paper, they experimented in two steps with existing technologies for achieving this objective. First, they proposed to use a triplestore as a SPARQL endpoint and to set up an ETL process, based on D2RQ, for populating it with the database content. This process relies on an RDB-to-RDF mapping which is first bootstrapped from the database schema and then manually edited. Next, they reused the mapping to bootstrap an OSLC model with ORM (Hibernate) annotations, and they manually enriched this model before using it to generate a ready-to-run OSLC service. As the authors found, this solution has an important limitation: by interacting directly with the database, it bypasses the part of the business logic that resides not in the database but in the application controller of the legacy system.
To address this issue, they made the radical choice of ignoring the database layer and focusing instead on the application controller. This choice led the authors to propose an architecture where the OSLC web service is designed manually (that is, using the standard procedure), and where a novel component, called the Lyo store, is introduced between the application controller and the triplestore. This component is in charge of keeping the triplestore in sync with the changes communicated by the application controller. In this architecture, developers are required to implement several application-specific classes at both the OSLC web service and the Lyo store levels. This architecture has been tested on three legacy systems.
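To make my reading of this architecture explicit, here is a minimal sketch of the synchronization idea as I understand it: every change passes through the controller, which applies its business rules and then pushes the new state of the resource to the store. This is not the authors' implementation and not the Eclipse Lyo API; `ResourceStore`, `Controller`, the `ex:` identifiers and the "title is required" rule are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class ResourceStore:
    """Toy stand-in for a triplestore, holding one set of triples per resource."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def put(self, iri, properties):
        # Replace the whole description so that removed properties do not linger.
        self._graphs[iri] = {(iri, p, o) for p, o in properties.items()}

    def delete(self, iri):
        self._graphs.pop(iri, None)

    def triples(self):
        return {t for graph in self._graphs.values() for t in graph}


class Controller:
    """Hypothetical application controller: validates, persists, then notifies the store."""

    def __init__(self, store):
        self._store = store
        self._records = {}  # stands in for the legacy persistence layer

    def save(self, iri, properties):
        # Example of a rule that lives in the controller, not in the database schema.
        if not properties.get("title"):
            raise ValueError("title is required")
        self._records[iri] = dict(properties)
        self._store.put(iri, properties)

    def remove(self, iri):
        self._records.pop(iri, None)
        self._store.delete(iri)


store = ResourceStore()
controller = Controller(store)
controller.save("ex:req/1", {"title": "Brake latency", "status": "draft"})
controller.save("ex:req/1", {"title": "Brake latency"})  # status dropped
snapshot = store.triples()
```

In this sketch the store only ever sees states that the controller has accepted, which is the property the database-level approach lacks. The price is that each legacy system needs its own hand-written notification code.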
Main comments:
* In my view, the central question of this work is how to deal with the business logic that resides in the application controller but not in the database layer. I found the authors' choice to discard the database layer too radical and disappointing, since it leads to an architecture less interesting than the one proposed in the experimentation section. Indeed, having an RDB-to-RDF mapping can be very valuable, since, as suggested by the authors, it could be used for partially generating a web service model and for other reasons I will elaborate on later. Therefore, it would have been interesting to study how to model the controller-specific business logic and to understand how it complements the business logic that can be extracted by analyzing the database schema. Said differently, I would have expected more modelling in a declarative fashion (which is, in my view, what the Semantic Web is actually all about) rather than less (which is what discarding the mapping achieves). Instead, the description of this controller-specific business logic remains shallow.
* During the experimentation, the authors mentioned that, equipped with an RDB-to-RDF mapping, D2RQ supports virtual RDF graphs (i.e. rewriting SPARQL queries into SQL queries). However, this approach was discarded in favor of the ETL approach (where the RDF graph is materialized and stored in a triplestore) without justification, although this is an important design choice. The virtual graph option would allow both the web service and the SPARQL endpoint to query the same database without having to deal with the maintenance burden induced by the ETL approach. Also, since the performance characteristics may differ significantly between a triplestore and a SPARQL-to-SQL system (see e.g. [1]), it would have been interesting to evaluate these two alternatives so as to understand which one performs better in their setting.
* The literature related to RDB-to-RDF management mentioned in this paper is outdated: it predates R2RML (published in 2012), whereas the domain is still active and papers are regularly accepted in top Semantic Web conferences and journals (see, e.g., [1,2,3]). Important trends in this domain, such as the Ontology-Based Data Access (OBDA) approach, seem to be largely ignored. It would have been better to consider the R2RML and RDB Direct Mapping W3C standards (which have both existed for five years) rather than prior proposals, the native D2RQ mapping language and the D2RQ bootstrapping feature. Note that the latest version of D2RQ is already five years old and does not support R2RML, while all the recent systems do.
* Designing an RDB-to-RDF mapping that produces a meaningful and valuable RDF graph indeed remains a challenging task which requires a significant amount of human curation, even after using semi-automatic tools. It would be interesting to provide more details about this step, for instance by showing how much the final mapping differs from the bootstrapped one.
* Evidence about the performance of the incremental ETL component would have been appreciated. Too little information is provided about this component. In particular, it would be important to understand what precise properties it offers, and how it relates to industrial frameworks such as Kafka.
* Has the proposed architecture been applied to the first legacy system, SesammTool? How does this architecture compare, in terms of integration effort, with the previous architecture?
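The design choice raised in the second comment above (materialized versus virtual RDF graph) can be illustrated with a minimal sketch. It uses Python's standard `sqlite3` module and a toy mapping; the `requirement` table, the `http://example.org/` namespace and the function names are assumptions made for illustration, and the code stands in for neither D2RQ nor any R2RML processor. The point is only that a materialized copy becomes stale after a database update unless the ETL process is re-run, whereas a query rewritten to SQL at query time always reflects the current database state.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace

# Hypothetical mapping: table name -> class IRI and column-to-predicate map.
MAPPING = {
    "requirement": {
        "class": BASE + "Requirement",
        "columns": {"title": BASE + "title", "status": BASE + "status"},
    }
}


def subject_iri(table, pk):
    return f"{BASE}{table}/{pk}"


def materialize(conn):
    """ETL style: export every mapped cell into a set of triples."""
    triples = set()
    for table, m in MAPPING.items():
        cols = list(m["columns"])
        rows = conn.execute(f"SELECT id, {', '.join(cols)} FROM {table}")
        for row in rows:
            s = subject_iri(table, row[0])
            triples.add((s, "rdf:type", m["class"]))
            for col, value in zip(cols, row[1:]):
                if value is not None:
                    triples.add((s, m["columns"][col], value))
    return triples


def virtual_objects(conn, predicate):
    """Virtual style: rewrite the pattern (?s, predicate, ?o) into SQL at query time."""
    results = set()
    for table, m in MAPPING.items():
        for col, pred in m["columns"].items():
            if pred == predicate:
                sql = f"SELECT id, {col} FROM {table} WHERE {col} IS NOT NULL"
                for pk, value in conn.execute(sql):
                    results.add((subject_iri(table, pk), value))
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requirement (id INTEGER PRIMARY KEY, title TEXT, status TEXT)")
conn.execute("INSERT INTO requirement VALUES (1, 'Brake latency', 'draft')")

store = materialize(conn)  # the ETL process runs once
conn.execute("UPDATE requirement SET status = 'approved' WHERE id = 1")

# The materialized copy still holds the old status; the rewritten query sees the new one.
stale = {(s, o) for s, p, o in store if p == BASE + "status"}
fresh = virtual_objects(conn, BASE + "status")
```

Here `stale` still contains the status `draft` while `fresh` contains `approved`. Which option is preferable in the authors' setting depends on update frequency and query load, which is why an evaluation of both would have been informative.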
To conclude, I do think that combining a read-write OSLC web service and a read-only SPARQL endpoint on top of a legacy information system is a good idea that deserves to be studied, but this should be done with significantly more precision than in this paper. For instance, by looking carefully at what the authors have reported as a negative result, it would probably be possible to arrive at more promising outcomes. A detailed study of the interaction between the RDB-to-RDF mapping and the OSLC service model would also be very interesting. I also encourage the authors to evaluate the OBDA approach as a possible SPARQL endpoint solution. In terms of contribution, I agree with the authors that the experimentation is the main contribution of this paper, since it touches on several interesting questions. However, in its current form, I do not consider this paper to be suitable for publication.
Other minor comments:
- The update service box is present in Figure 7 but not in Figure 6 while it appears outside of the Lyo store.
- The last two paragraphs of Section 4.1 are unclear.
[1] Calvanese, Diego, et al. "Ontop: Answering SPARQL queries over relational databases." Semantic Web Journal (2017): 471-487.
[2] Jiménez-Ruiz, Ernesto, et al. "BootOX: Practical mapping of RDBs to OWL 2." International Semantic Web Conference, 2015.
[3] Sequeda, Juan F., Marcelo Arenas, and Daniel P. Miranker. "OBDA: query rewriting or materialization? In practice, both!" International Semantic Web Conference, 2014.