Morph-KGC^star: Declarative Generation of RDF-star Datasets from Heterogeneous Data

Tracking #: 3238-4452

Authors: 
Julián Arenas-Guerrero
Ana Iglesias-Molina
David Chaves-Fraga
Daniel Garijo
Oscar Corcho
Anastasia Dimou

Responsible editor: 
Guest Editors Tools Systems 2022

Submission type: 
Tool/System Report
Abstract: 
RDF-star has been proposed as an extension of RDF to annotate statements with triples. Libraries and graph stores have started adopting RDF-star, but the generation of RDF-star data remains largely unexplored. To allow generating RDF-star from heterogeneous data, RML-star was proposed as an extension of RML. However, no implementation has been developed so far that implements the RML-star specification. In this work, we present Morph-KGC^star, which extends the Morph-KGC materialization engine to generate RDF-star datasets. We validate Morph-KGC^star by running test cases derived from the N-Triples-star syntax tests and we apply it to two real-world use cases from the biomedical and open science domains. We compare the performance of our approach against other RDF-star generation methods (SPARQL-Anything), showing that Morph-KGC^star scales better for large input datasets, but it is slower when processing multiple smaller files.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Pierre-Antoine Champin submitted on 17/Oct/2022
Suggestion:
Major Revision
Review Comment:

This paper presents RML-star, an extension of RML (RDF Mapping Language) to support the new features introduced by RDF-star in RDF. The paper also presents Morph-KGC^star, an implementation of RML-star, and compares it with other implementations of mapping languages supporting RDF-star.

The paper is well written and overall convincing. There are however a few aspects of RML-star that, in my opinion, require a major revision.

# Major revisions

- p4 l49: the subclass relationship between rml:NonAssertedTriplesMap and rr:TriplesMap does not follow the Liskov Substitution Principle, so it is a modelling mistake. More specifically, a TriplesMap can not be substituted by a NonAssertedTriplesMap in a mapping while preserving the behaviour of that mapping (some triples will be lost).
A correct way of modelling this would be to create a common superclass of TriplesMap and NonAssertedTriplesMap (e.g. AbstractTriplesMap) and update all properties whose domain is TriplesMap, by changing their domain to AbstractTriplesMap.

- p4 Section 3.2
The section introduces rml:quotedTriplesMap and puts no restriction on its values. I suggest that the TriplesMap pointed to by rml:quotedTriplesMap must be restricted to generating exactly one triple (i.e. have only one rr:predicateObjectMap, and no rr:class). A mapping not complying with this restriction should be rejected as invalid.
Otherwise, it is unclear what the algorithm should do with multiple output triples of these maps. The current implementation has an undefined behavior: it only quotes the triple generated by the "first" predicateObjectMap in mapping.ttl, although the order in Turtle is not significant.
This is a bug in the specification of RML-star and in the implementation that should be fixed before publication.
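
The restriction suggested above could be enforced with a check along the following lines. This is only a minimal sketch: the `TriplesMap` class and the `check_quoted_triples_map` function are hypothetical names over a simplified in-memory representation, not Morph-KGC^star's actual data model, and the sketch assumes each rr:predicateObjectMap carries a single predicate and a single object.

```python
from dataclasses import dataclass, field


@dataclass
class TriplesMap:
    """Hypothetical, simplified view of a triples map (not the tool's internal model)."""
    name: str
    # One (predicate, object) pair per rr:predicateObjectMap (simplifying assumption).
    predicate_object_maps: list = field(default_factory=list)
    # Values of rr:class declared on the subject map.
    classes: list = field(default_factory=list)


def check_quoted_triples_map(tm):
    """Raise ValueError unless the map yields exactly one triple per input record."""
    if tm.classes:
        raise ValueError(
            f"{tm.name}: rr:class adds an rdf:type triple, so the quoted triple is ambiguous"
        )
    if len(tm.predicate_object_maps) != 1:
        raise ValueError(
            f"{tm.name}: expected exactly 1 rr:predicateObjectMap, "
            f"found {len(tm.predicate_object_maps)}"
        )


ok = TriplesMap("InnerOk", [(":confidence", "score")])
bad = TriplesMap("InnerBad", [(":p1", "a"), (":p2", "b")])

check_quoted_triples_map(ok)  # passes silently
try:
    check_quoted_triples_map(bad)
except ValueError as err:
    print(err)  # the mapping is rejected instead of silently picking a "first" map
```

Rejecting the mapping up front avoids depending on the order of statements in mapping.ttl, which Turtle does not guarantee.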

# Minor revisions

- the original proposal by Hartig is spelled "RDF*"; the specification by the RDF-DEV Community Group is spelled "RDF-star". This difference was deliberately introduced by the CG because the two proposals differ in some points (not only because the latter is more search-engine friendly...). The paper conflates the two. This should be updated.

- p2 l35: (a) "of an RDF triple" should read "of another RDF-star triple" → it is not accurate that an RDF triple ("no star") can contain other triples, only RDF-star triples can. (b) "and can be recursive" is confusing: it seems to imply that an RDF-star triple can contain itself, which is explicitly forbidden by the specification. By applying suggestion (a), the recursive construction of RDF-star triples should appear, and this "can be recursive" could be simply removed.

- p3 l39: inaccurate reference to the 2004 RDF primer. Standard reification was actually introduced in the first Recommendation in 1999.

- p7 l3: there should be an explicit description of the input parameters of the algorithm ('nestLevel' is explicit enough, but 'm' and 'M' are not)

- p8 l28: after stating that "RML is a subset of RML-star", you state that RML mappings need to be converted to RML-star, which seems contradictory. I assume that this "conversion" consists in replacing original terms in the rr: namespace with their extended version in the rml: namespace (e.g. subjectMap)? In any case, this should be explained to remove this apparent contradiction.
NB: I assume that rml:subjectMap is declared as a super-property of rr:subjectMap, and similarly for other "overloaded" properties, which would justify the claim that RML-star is a superset of RML. This conversion is merely a materialization of some inferences.

- p10 l23: the generated RDF-star is brittle, and probably an anti-pattern. If the same triple was generated with different techniques, the metadata would get mingled (which confidence is associated to which technique?).
The community group published a blog post dedicated to this issue:
https://www.w3.org/community/rdf-dev/2022/01/26/provenance-in-rdf-star/
NB: standard reification and singleton properties do not suffer from the same issue, because they natively support multiple *occurrences* of the same triples.
→ It would therefore be fairer to compare the generation of "robust" RDF-star (introducing "prediction" nodes, similar to the "claim" nodes in the blog post) to reification and singleton properties.
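
The mingling problem can be illustrated with a small sketch that models a graph as a plain Python set of statements. All data, property names (`:technique`, `:confidence`, `:predicts`) and the `confidence_of` helper are hypothetical and chosen only for illustration; they are not taken from the paper or from the blog post.

```python
# Hypothetical data: the same triple predicted by two different techniques.
triple = (":drugA", ":treats", ":diseaseB")
predictions = [("nlp", 0.9), ("rules", 0.6)]

# (1) Annotate the quoted triple directly. A graph is a set of statements,
# so all four annotations share one subject and their pairing is lost.
direct = set()
for technique, confidence in predictions:
    direct.add((triple, ":technique", technique))
    direct.add((triple, ":confidence", confidence))

# (2) Give each prediction its own node that points at the quoted triple.
robust = set()
for i, (technique, confidence) in enumerate(predictions):
    node = f"_:prediction{i}"
    robust.add((node, ":predicts", triple))
    robust.add((node, ":technique", technique))
    robust.add((node, ":confidence", confidence))


def confidence_of(graph, technique):
    """Return the confidences found on the subjects that carry the given technique."""
    subjects = {s for s, p, o in graph if p == ":technique" and o == technique}
    return sorted(o for s, p, o in graph if p == ":confidence" and s in subjects)


print(confidence_of(direct, "nlp"))  # [0.6, 0.9]: ambiguous
print(confidence_of(robust, "nlp"))  # [0.9]: unambiguous
```

The second form produces more statements per prediction, which is why it is the fairer baseline when comparing output size and generation time against reification and singleton properties.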

# Other non-blocking remarks

I can imagine situations where a quoted triple must be asserted or not *depending* on its properties (e.g. end-date, or confidence, as in the SemMedDB use-case). Would this be possible in RML-star? If yes, would it require redundant triples maps (one rr:TriplesMap and one rml:NonAssertedTriplesMap)?

# Typos

- p2, l19: "comlause and Apaon" (?) should probably be "comparison"

- p12, Table 1: the generation time for SoMEF / Singleton is much lower than the others, contradicting what is in the text. I suspect some digits are missing.

- p12, Table 2: the generation time for SoMEF / Multiple files / SPARQL-generate is really high, contradicting what is in the text. I suspect it is 630s, not 630179s (you seem to be mixing the use of "," as a thousands-delimiter and as a decimal point)
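
The ambiguity behind the last remark can be made concrete with a short sketch. It assumes, purely for illustration, that the table cell is printed as "630,179"; the `parse_seconds` helper is hypothetical.

```python
def parse_seconds(text, comma_is_decimal):
    """Parse a printed duration under one of the two comma conventions."""
    if comma_is_decimal:
        # "," is the decimal point and "." (if any) groups thousands.
        return float(text.replace(".", "").replace(",", "."))
    # "," groups thousands and "." (if any) is the decimal point.
    return float(text.replace(",", ""))


cell = "630,179"  # assumed rendering of the table cell
print(parse_seconds(cell, comma_is_decimal=True))   # 630.179
print(parse_seconds(cell, comma_is_decimal=False))  # 630179.0
```

The two readings differ by three orders of magnitude, so the tables should use a single convention throughout.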

Review #2
By Sebastián Ferrada submitted on 13/Feb/2023
Suggestion:
Minor Revision
Review Comment:

This paper introduces Morph-KGCstar, a materialization engine to generate RDF-star graphs from heterogeneous data sources. The system is implemented in Python, built upon pandas, and is distributed as a PyPI library as well as a GitHub repository with some documentation and examples of usage. The system is described in a detailed manner, providing relevant examples and figures.

Morph-KGCstar is shown to comply with the RML-star specification, by showing that it passes the defined unit tests. It is also clear from the text that arbitrarily deep statement quotation is possible to manage, in accordance with the RDF-star spec.

The authors compare Morph-KGCstar with two reification approaches and show that Morph-KGCstar produces fewer triples than the alternatives, being the fastest for one of the datasets used and the second fastest on the other one.

A comparison to SPARQL-Anything is also provided, which is the only other way to generate RDF-star graphs from heterogeneous data sources. In this comparison, the authors conclude that Morph-KGCstar can easily manage larger files; however, several small files are better handled by SPARQL-Anything, which is something the authors intend to address in the future.

I have only a few small comments that the authors can easily incorporate in their final version:

1) Several times in the paper, the authors say that RDF-star is a way to annotate statements, or a means to provide reification capabilities.
This is correct, of course, but it sounds as if it limits RDF-star to quoted triples in the subject position, which is not the case: as shown in Algorithm 1, Morph-KGCstar can manage any of the nesting cases valid for RDF-star. Thus, I would suggest that the authors use the current informal definition of the RDF-star spec, that RDF-star is "an extension to RDF to make statements about statements", or that the authors refer to this (annotating statements) as one of the use cases for RDF-star.

2) Algorithm 1 could be explained with a bit more detail. The parameters that the procedure receives, except the nesting level, are not discussed, therefore I don't know what m.OM or m.SM mean.

3) In Section 5.2.1 the authors claim that RML-star is the fastest approach. This is only correct for the case of SemMedDB, according to the data presented in Table 1. For the case of SoMEF, RML-star is the second fastest, being one order of magnitude slower than the singleton properties.

4) I understand that Listings 4, 5 and 7 are independent/parallel examples of the different reification strategies; however, I would prefer that the sets of triples produced were semantically and practically equivalent among them, which they currently are not. To fix this, you would need to assert the quoted triples.

5) In line 35, page 2, the authors say that an RDF-star triple can be placed in the subject or object of an RDF triple, which is true, but an RDF-star triple can also be placed in the subject or object of another RDF-star triple, which is more accurate and captures the recursive nature of RDF-star.

6) There are some small presentation issues (e.g., line 19 of page 2: "a comlause and Apaon", or line 39 of page 3, where citation [19] appears twice in the same sentence), so I would recommend that the authors use a spelling and grammar checker.

Other than that, I think this paper presents a valuable contribution to our community, and I'd be happy to recommend for it to be accepted, considering the minor revisions above.

Review #3
By Kai Eckert submitted on 06/Mar/2023
Suggestion:
Major Revision
Review Comment:

The paper presents Morph-KGC-star (think about the superscript naming, it makes it hard to refer to the software consistently), one of the first implementations of a mapping framework to generate RDF-star data and the first using RML (with the earlier proposed extension RML-star). The formerly proposed RML-star extension is sufficiently briefly described. The implementation description is based on the algorithm to create the RDF-star triples with a textual explanation, plus more general implementation details (Python, uses Pandas and Oxigraph under the hood). Two examples from real-world cases are presented to illustrate the generation of RDF-star data, as well as some experiments and considerations on the performance of Morph-KGC-star, compared to other approaches.

(1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided).

The recent specification of RDF-star provides a new mechanism for statements about statements and could be the basis for many RDF applications. With increasing support in standard triple stores, it is ready to be used. The more tools and frameworks in the RDF ecosystem support RDF-star, the better will be its acceptance. Therefore, this paper and the implementation of Morph-KGC-star is clearly relevant. This is especially true for all applications where RML as a mapping language is already used.

(2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

The paper is well written and more or less easy to follow, with the exception of, unfortunately, the core of the paper, the description of the algorithm in Section 4.1. Page 6 is essentially a huge text blob and Algorithm 1 is hardly understandable as, for example, the parameters like m, M, m.S and so on are not explained and not used in the text. I would suggest to (a) improve the structure of the text and clearly refer to the names and parameters in the algorithm listing, and (b) think about a running example that makes the very abstract description easier to understand.

Minor points:
- the triples map should be triple map (as for instance subject map, which is also not in plural).
- p.3, l. 35: I would not call the reification approaches popular. From my experience, both are actually the opposite and only used because of a lack of alternatives.
- p.6, l. 32: superfluous comma after both

Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess

(A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data,

Yes

(B) whether the provided resources appear to be complete for replication of experiments, and if not, why,

Partly. It is a Zip archive of the GitHub repository that contains the source code of the implementation. The pip installation works. Unfortunately, the links in GitHub to the documentation lead to a 404 and I could not create a suitable config.ini. I suggest creating a self-contained example that works out of the box and adding the documentation to the zip file in Zenodo.

(C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and

Yes, Zenodo

(4) whether the provided data artifacts are complete.

No data, just code