Managing FAIR Knowledge Graphs as Polyglot Data End Points: A Benchmark based on the rdf2pg Framework and Plant Biology Data

Tracking #: 3702-4916

Authors: 
Marco Brandizi
Carlos Bobed
Luca Garulli
Arné de Klerk
Keywan Hassani-Pak

Responsible editor: 
Guest Editors KG Construction 2024

Submission type: 
Full Paper
Abstract: 
Linked data and labelled property graphs (LPG) are two data management approaches with complementary strengths and weaknesses, making their integration beneficial for sharing datasets and supporting software ecosystems. In this paper, we introduce rdf2pg, an extensible framework for mapping RDF data to semantically equivalent LPG formats and databases. Utilising this framework, we perform a comparative analysis of three popular graph databases - Virtuoso, Neo4j, and ArcadeDB - and the well-known graph query languages SPARQL, Cypher, and Gremlin. Our qualitative and quantitative assessments underline the strengths and limitations of these graph database technologies. Additionally, we highlight the potential of rdf2pg as a versatile tool for enabling polyglot access to knowledge graphs, aligning with established standards of linked data and the semantic web.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Pierre-Antoine Champin submitted on 22/Jul/2024
Suggestion:
Major Revision
Review Comment:

This paper defends the idea that RDF and Labeled Property Graphs (LPGs), and their respective query languages, are complementary models for knowledge graphs, and that a good approach is to provide users with a polyglot view of the same data. The authors propose a benchmark providing 1) SPARQL mapping rules for converting plant biology RDF datasets into LPGs, and 2) semantically equivalent queries in SPARQL, Cypher and Gremlin, against the different views of this dataset. They show that each query language has its pros and cons in terms of ease of use and execution time.

The paper is well written enough and easy to read. The empirical results are interesting and compelling, but the paper generally lacks theoretical perspective and scientific rigor, as explained below.

A first scientific problem is the authors' claim of having performed a qualitative study (§3.2) of the different query languages. There is no evidence in the paper that this so-called study involved anything other than the authors' subjective judgement of how easy or hard a given query is to write. If I am wrong, then the authors should present the experimental setting in detail (how many participants? who were they? what was their task? ...). If I am right, then the comparison that the authors make (p16) between their own study and a thorough user study such as [66] is really not appropriate.

A second scientific problem is in the performance benchmark (§3.3). The authors write (p10) that they "wrote semantically aligned versions [of the queries] in the three tested languages." I am wondering how it was checked that those queries were actually aligned. Since the authors limit the output to the first 100 results, I guess that different DB engines return different results (or maybe even the same engine on different runs), so correctness cannot be checked by simply comparing the results...
Also, the authors hint later that for Gremlin the semantics might be subtly different. This kind of thing deserves to be in the paper, rather than only in the GitHub repo.
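To make this concern concrete: one minimal way of making the truncated result sets comparable across engines would be to impose a total order before the LIMIT. The following is only my own illustrative sketch (the prefix and property names are invented, not taken from the paper's queries):

# Hypothetical sketch: without ORDER BY, LIMIT 100 returns an engine-dependent subset,
# so comparing the returned rows proves little; a total order over the projected
# variables makes the truncated results directly comparable across engines.
PREFIX ex: <http://www.example.org/>

SELECT ?gene ?publication
WHERE {
  ?gene a ex:Gene ;
        ex:citedIn ?publication .
}
ORDER BY ?gene ?publication
LIMIT 100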

In the theoretical considerations (§3.4), the authors write that they "might show a set of conditions sufficient to make [their] transformation information-preserving", then give a few examples of what *might* be done. This kind of vague statement has no place in a section dedicated to the contribution of the paper -- it should at the very least be moved to the perspectives and future work. Then the authors claim that "reversibility [...] is only interesting for the subset of data that are actually converted"; not only does this claim lack backing, but it is not obvious at all that this subset can easily be characterized for any arbitrary mapping.

To summarize, I find the proposed approach very interesting, and the empirical results encouraging, but this work needs a more thorough and rigorous scientific framing.

Other remarks:
* p1 "[LPGs] are the basis of graph databases": I'm not sure what that means... Graph databases arguably existed before LPGs, so I don't see them as the "basis".
* p3 "must be applied" should be "had to be applied" to maintain grammatical tense consistency
* p3 "once we have" should be "once we had", again because of tense consistency
* p3 "entities [in RDF] are described by means of resolvable URI" → no, this is *not* a requirement of RDF. True, this is good practice for Linked Data, and therefore I agree with your further statement that "in most cases, they are web URLs that employ HTTM". But the use of "resolvable" in this sentence is too strong.
* p3 and following: you mention URIs, but RDF has been using IRIs since RDF 1.1 (2014)
* p4 "in node and edges" should be "in nodes and edges" (plural "nodes")
* p4 reference to Figure 5 should rather be Figure 5-c
* p11 "the fact systems" should be "the fact that systems"
* Figure 7 and following: You should explain, even briefly, what "biopax", "ara" and "cereals" are. Currently, the reader can only infer it from the occurrence of these labels in Figure 6
* p15, Related Works: you might want to cite Julian Bruyat's work on PREC as another example of an approach using custom mappings (although his work focuses on LPG to RDF conversion): https://hal.science/hal-03407785v1

* reference 32 should rather be a reference to the Community Group Final Report, which is more stable. It is available at https://www.w3.org/2021/12/rdf-star.html

Review #2
By Souripriya Das submitted on 22/Jul/2024
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

The work focuses on converting well-known RDF data into two other forms, namely LPG and GraphML, and then carrying out and reporting a comparative study of load performance and of the performance of carefully selected queries, belonging to five different categories, on three different setups: SPARQL on RDF data in Virtuoso, Cypher on LPG data in Neo4j, and Gremlin on GraphML data in ArcadeDB. I found this a commendable effort that provides useful information to the research community and to practitioners interested in maximizing the utilization of knowledge graph data available as RDF.

I found that the content, although quite detailed, requires some improvements. Also, it was not clear to me if the datasets that the paper talks about are easily accessible for interested readers to download and experiment with. Please see my suggestions below.

A. Question: Are the three datasets used in the performance study (based on the labels in Figure 7: biopax, ara, and cereals) available in a form that can be directly used by others for experimentation?

B. Suggestions for improving the readability:
B.1) For LPGs, the term "edge" is more popular than the term "relation" (and avoids the confusion that sometimes arises with the latter). It may be easier for readers if you use the term "edge" instead of "relation" throughout the paper.

B.2) Similarly, the term "property" is more popular than "attribute" in the context of LPGs. You could use "propertyless" and "propertied" as adjectives to describe edges without any properties and edges with properties, respectively.

B.3) For Figure 2, here are some suggestions for improving the clarity (an illustrative sketch of the suggested renaming follows this list):
a) Using the Figure 1 labels a, b, and c (used for the RDF to LPG transitions) to label the five individual boxes in Figure 2 is very confusing. I think the labeling of the Figure 2 boxes should be independent, to avoid any confusion. Any correspondence between the Fig. 1 transition labels and the Fig. 2 labels should be explained in the text instead. See below for the suggested re-labeling of the Figure 2 boxes.
b) For parts "a) b)" on the left column, use "?nodeIRI" instead of just "?iri" as the variable name.
c) For part "c)" on both the left and the right columns, use "?edgeIRI" instead of just "?iri" as the variable name.
d) Re-label the top left box as: "node creation".
e) Re-label the middle box on the left as: "creating label(s) for nodes".
f) Consider combining the top box and the middle box on the left side into a single query and label it as: "A) creating nodes and associated labels".
g) Re-label part "c)" on the left as "B) creation of propertied edges from sets of RDF triples representing reification".
h) Re-label part "c)" on the right as "C) creation of propertyless edges from individual RDF triples that are not part of any reification".
i) Re-label part "a) b)" on the right as "D) creation of properties associated with nodes and edges"
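As an illustration of the renaming suggested in b) and of the combined box suggested in f), a node-and-label selection template might look roughly as follows. This is only my own sketch, using the standard rdf:type / rdfs:label vocabulary, and not the paper's actual mapping query:

# Illustrative sketch only (not the actual rdf2pg template): selects each node IRI
# together with the values that would become its LPG label(s).
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?nodeIRI ?label
WHERE {
  ?nodeIRI rdf:type ?class .
  ?class rdfs:label ?label .
}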

B.4) I did not find Figure 4 to be of much use. It may be more useful for the readers to be able to see, without having to go to another site, the core query patterns in some of the queries that were explicitly mentioned in the text such as join, joinRel, joinRef, 2union, existAg, lngSmf.

C. Consider the following suggested changes: typos, grammar issues, missing info, etc.

C.1) In Section 1.1:
a) it is not clear what "NTT Data" is – are we talking about a business organization?
b) in the paragraph before Section 2: "hereby" -> "here"

C.2) In Section 2:
a) right at the beginning: "As [is] well known"
b) towards the end of the first paragraph: "triples as subjects [or objects] of other triples"
c) in paragraph subtitled "rdf2pg, architecture and approach":
- "part [replace "part" with "a subset"] of the triples"
- "another part [replace "part" with "(distinct) subset"] of the triples"
- "from either straight [replace "straight" with "individual"] RDF triples"
- "which yield LPG relations with [replace "with" with "without"] essentially any attribute [replace "attribute" with "attributes"]"
- Figures 1 and 2 should be placed as close to this section as possible.
- Figure 1 bottom portion is missing the following triples (a sketch showing how these fit into the standard reification pattern follows this list):
ex:pmed_23236473 rdf:type <...Publication> .
<...Publication> rdfs:subClassOf skos:Concept .
ex:cit_TOB1_23236473 rdf:type rdf:Statement .
- In Figure 1 top portion, consider using '=' instead of ':' to avoid IRI-like appearance. For example, iri = http://.../TOB1, instead of iri: http://.../TOB1.
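To make the role of the reification triples above more concrete, here is how ex:cit_TOB1_23236473 fits into the standard rdf:Statement pattern that a propertied-edge template (as re-labeled in B.3.g) would presumably match. This is my own sketch, not the paper's actual query:

# Illustrative sketch only: the standard reification pattern that a propertied-edge
# selection would match; ex:cit_TOB1_23236473 (typed rdf:Statement) binds ?edgeIRI,
# and any further triples on ?edgeIRI become properties of the resulting LPG edge.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?edgeIRI ?source ?relation ?target
WHERE {
  ?edgeIRI rdf:type      rdf:Statement ;
           rdf:subject   ?source ;
           rdf:predicate ?relation ;
           rdf:object    ?target .
}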

C.3) In Section 3:
a) In the last paragraph of Section 3.2, should we have "prot1/xref/prot2", instead of "protein1/xref/protein2", so that it is a "prefix" of "prot1/xref/prot2/xref/prot3"? (See the property-path sketch after this list.)
b) In the first paragraph of Section 3.3, please mention the names of the three datasets and, for each, its size. Based on the labels used in Figure 7, I can see that the names are biopax, ara, and cereals. I am guessing that their sizes are 2, 21, and 97 million triples, respectively. This information needs to be clearly specified in the text. Also, how can a reader get access to these datasets for experimentation?
c) In Section 3.3 sentence: "All of these hereby [remove "hereby"] results are".
d) In the last paragraph of Section 3.3: "very varying" sounds odd.
e) In Section 3.4, the second paragraph ending with "from these queries to" should be combined with the third paragraph that starts with "GraphML is linear with respect to".
f) In Section 3.4: "our RDF/PG transformations, eg [replace with "e.g."], we might have instances".
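Regarding point a) above, the query names seem to describe chains of cross-references between proteins. A minimal SPARQL property-path sketch of the two-hop pattern, of which the one-hop pattern is a prefix, could be the following (the prefix and predicate are invented, not taken from the paper):

# Illustrative sketch only: the two-hop cross-reference chain suggested by the name
# prot1/xref/prot2/xref/prot3; dropping one /ex:xref step yields the one-hop variant.
PREFIX ex: <http://www.example.org/>

SELECT ?prot1 ?prot3
WHERE {
  ?prot1 ex:xref/ex:xref ?prot3 .
}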

C.4) In Section 4:
a) In the first paragraph: "We have shown that labelled [property] graph databases".
b) In the first paragraph: too many uses of "still" in consecutive sentences => "... still important ... still play".
c) In Section 4.1 first paragraph: "Based on this initial work, we have seen [that] the rdf2neo approach [is] suitable for the generalisation and extensions described hereby [replace "hereby" with "here"] and to be used to manage [replace "to be used to manage" with "for use in managing"] the enterprise ETL use case described above, where we have similar needs to align mixed data models and technologies.".
d) In Section 4.1: "Another limit [replace "limit" with "limitation"] of our approach".

Review #3
By Franck Michel submitted on 05/Aug/2024
Suggestion:
Reject
Review Comment:

This paper presents rdf2pg, a tool that leverages and extends previous work (rdf2neo) to carry out the mapping of RDF data to property graph DBs. Using this tool, the paper then reports a comparative study of Virtuoso, Neo4j and ArcadeDB.

My main concern is that it is very unclear what the exact contribution of the paper is: is it rdf2pg, is it a benchmark consisting of datasets and test queries, or is it the comparison of three graph databases (Virtuoso, Neo4j, ArcadeDB)?
In all three cases, the descriptions seem too partial. The paper consists essentially of a technical/engineering contribution, certainly interesting, but one that does not fit a venue like the SWJ. The paper would probably be better suited to a workshop or to the resource track of a conference.
As I describe below, the real theoretical grounding of the paper actually only comes in section 3.4, leaving far more room to the technical details.

Sections 1 and 2 give lengthy context descriptions: e.g. section 1 describes the two use cases in detail, although these details are hardly used in the rest of the paper; in section 2, reminders about the SW are definitely not needed in the SWJ.
This space should be used to provide a much stronger grounding of the work: the mapping approach of rdf2pg is described here very succinctly, and there is no discussion of the expressiveness of the mapping or of its possible completeness w.r.t. the target query languages.
Besides, the translation from RDF to LPGs relies on an abstract LPG model. However, this abstract model is not described; instead, the paper provides a GitHub link to bare Java classes.
In section 3, the qualitative considerations about SPARQL/Cypher/Gremlin are interesting as an experience report, yet they remain very empirical. They should also tackle the questions of expressiveness, the possible gaps between the three languages, etc.

The paper also lacks "self-containedness", which is what we can expect from a journal article: it very often points to the GitHub repository (either source files or wiki), for instance to mention test queries (e.g. joinReif, existAg) that are simply named but hardly described. Besides, what was the method used to come up with those queries? The paper simply indicates that these queries were derived from real use cases and the Berlin Benchmark, but with what methodology?

Reading further, I find answers to some of these concerns in 3.4; however, these answers seem to be provided in a previous publication [13]. As a journal article, it would be valuable to partially copy/rephrase/extend that referenced article. At the very least, the theoretical considerations of 3.4 should come much earlier in the paper, before digging into the technicalities of the approach, and leaving less space for these technicalities.

Misc.:
Figure 1 is in low resolution, which makes for a poor rendering.

Review #4
Anonymous submitted on 02/Sep/2024
Suggestion:
Reject
Review Comment:

The article illustrates techniques and methods used for transforming an RDF dataset into a property graph database. The system uses a set of SPARQL query templates to map the RDF data to vertices, properties, and labelled arcs. These are customised to match the actual RDF schema of the input. The article reports on the experience of developers in using the query languages of the two KG families, namely SPARQL, Gremlin and Cypher. In addition, data in the two formats (RDF and GraphML) is produced, which can be used to benchmark the performance of RDF vs PG database engines.

The main problem with this article is the lack of focus on its contribution and innovation with respect to the state of the art. The narrative jumps between (a) management of FAIR data; (b) transformation tools; and (c) a benchmark to evaluate RDF/PG. None of the three topics is explored in sufficient detail.

The title is somewhat misleading: the article is not about the management of RDF/PG end points, which would supposedly include issues such as keeping the two views in sync under incremental updates.

Tools for transforming data to and from RDF abound. The article should make a comparative analysis of existing tools for KG transformation, to demonstrate how the use case of rdf2pg is not handled by existing alternative methods and/or how the features of the proposed tool cover requirements that are not satisfied by other solutions. The motivational use case section is too anecdotal; a more rigorous, systematic analysis of tools and methods is required.
Crucially, it is not clear how rdf2pg extends/supersedes rdf2neo, apart from the changed name. A clearly stated set of requirements would help make this point.

The benchmark may be a useful resource, but then the article should focus on clearly stating why this benchmark is needed and compare the findings with the relevant literature (e.g. [1,2]). The experiments are too limited in this respect and should cover a broader range of RDF and PG implementations, similarly to what is available in the literature.

The results section should clearly state the hypotheses/requirements that the experimentation aims to evaluate. If that is the performance comparison, see my comments above. Also, these results are neither surprising nor do they add anything new to what is already known regarding the performance of RDF/PG systems (they are more or less the same, but differ significantly on certain types of queries).
As for the qualitative analysis, it is rather anecdotal compared to the robust methodological framework adopted by Warren (cited).

The many references to external material make the reading difficult; this material should be integrated (e.g. in an appendix) to make the article self-contained.

Figure 3 is not very readable; what visual formalism is used?

There is extensive discussion across all the sections, which is rather anecdotal. There is nothing wrong with reporting personal views, but they should be confronted with the existing literature or associated with some sort of evidence.

[1] Ravat, Franck, Jiefu Song, Olivier Teste, and Cassia Trojahn. "Efficient querying of multidimensional RDF data with aggregates: Comparing NoSQL, RDF and relational data stores." International Journal of Information Management 54 (2020): 102089.

[2] Kovács, Tibor, Gábor Simon, and Gergely Mezei. "Benchmarking graph database backends—What works well with Wikidata?" Acta Cybernetica 24, no. 1 (2019): 43-60.