Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.
The work converts well-known RDF data into two other forms, namely LPG and GraphML, and then reports a comparative study of load performance and of the performance of carefully selected queries from five different categories on three different setups: SPARQL on RDF data in Virtuoso, Cypher on LPG data in Neo4j, and Gremlin on GraphML data in ArcadeDB. I find this a commendable effort that provides useful information to the research community and to practitioners interested in maximizing the utilization of knowledge graph data available as RDF.
The content, although quite detailed, needs some improvement. It was also not clear to me whether the datasets discussed in the paper are easily accessible for interested readers to download and experiment with. Please see my suggestions below.
A. Question: Are the three datasets used in the performance study (biopax, ara, and cereals, judging by the labels in Figure 7) available in a form that others can use directly for experimentation?
B. Suggestions for improving the readability:
B.1) For LPGs, the term "edge" is more popular than (and avoids any confusion that sometimes arises with use of) the term "relation". It may be easier for readers if you use the term "edge" instead of "relation" throughout the paper.
B.2) Similarly, the term "property" is more popular than "attribute" in the context of LPGs. You could use "propertyless" and "propertied" as adjectives to describe edges without any properties and edges with properties, respectively.
B.3) For Figure 2, here are some suggestions for improving the clarity:
a) Reusing the Figure 1 labels a, b, and c (which mark the RDF to LPG transitions) to label the five individual boxes in Figure 2 is very confusing. I think the Figure 2 boxes should be labeled independently to avoid any confusion. Any correspondence between the Fig. 1 transition labels and the Fig. 2 labels should be explained in the text instead. See below for suggested re-labeling of Figure 2 boxes.
b) For parts "a) b)" on the left column, use "?nodeIRI" instead of just "?iri" as the variable name.
c) For part "c)" on both the left and the right columns, use "?edgeIRI" instead of just "?iri" as the variable name.
d) Re-label the top left box as: "node creation".
e) Re-label the middle box on the left as: "creating label(s) for nodes".
f) Consider combining the top box and the middle box on the left side into a single query and label it as: "A) creating nodes and associated labels".
g) Re-label part "c)" on the left as "B) creation of propertied edges from sets of RDF triples representing reification".
h) Re-label part "c)" on the right as "C) creation of propertyless edges from individual RDF triples that are not part of any reification".
i) Re-label part "a) b)" on the right as "D) creation of properties associated with nodes and edges".
B.4) I did not find Figure 4 to be of much use. It may be more useful for the readers to be able to see, without having to go to another site, the core query patterns in some of the queries that were explicitly mentioned in the text such as join, joinRel, joinRef, 2union, existAg, lngSmf.
C. Consider the following suggested changes: typos, grammar issues, missing info, etc.
C.1) In Section 1.1:
a) it is not clear what "NTT Data" is – are we talking about a business organization?
b) in the paragraph before Section 2: "hereby" -> "here"
C.2) In Section 2:
a) right at the beginning: "As [is] well known"
b) towards the end of the first paragraph: "triples as subjects [or objects] of other triples"
c) in paragraph subtitled "rdf2pg, architecture and approach":
- "part [replace "part" with "a subset"] of the triples"
- "another part [replace "part" with "(distinct) subset"] of the triples"
- "from either straight [replace "straight" with "individual"] RDF triples"
- "which yield LPG relations with [replace "with" with "without"] essentially any attribute [replace "attribute" with "attributes"]"
- Figures 1 and 2 should be placed as close to this section as possible.
- Figure 1 bottom portion is missing the following triples:
ex:pmed_23236473 rdf:type <...Publication> .
<...Publication> rdfs:subClassOf skos:Concept .
ex:cit_TOB1_23236473 rdf:type rdf:Statement .
- In Figure 1 top portion, consider using '=' instead of ':' to avoid IRI-like appearance. For example, iri = http://.../TOB1, instead of iri: http://.../TOB1.
C.3) In Section 3:
a) In the last paragraph of Section 3.2, should we have "prot1/xref/prot2", instead of "protein1/xref/protein2", so that it is a "prefix" of "prot1/xref/prot2/xref/prot3"?
b) In the first paragraph of Section 3.3, please mention the names of the three datasets and, for each, its size. Based on the labels used in Figure 7, I can see that the names are biopax, ara, and cereals. I am guessing that their sizes are 2, 21, and 97 million triples, respectively. This information needs to be clearly specified in the text. Also, how can a reader get access to these datasets for experimentation?
c) In Section 3.3 sentence: "All of these hereby [remove "hereby"] results are".
d) In the last paragraph of Section 3.3: "very varying" sounds odd.
e) In Section 3.4, the second paragraph ending with "from these queries to" should be combined with the third paragraph that starts with "GraphML is linear with respect to".
f) In Section 3.4: "our RDF/PG transformations, eg [replace with "e.g."], we might have instances".
C.4) In Section 4:
a) In the first paragraph: "We have shown that labelled [property] graph databases".
b) In the first paragraph: too many uses of "still" in consecutive sentences => "... still important ... still play".
c) In Section 4.1 first paragraph: "Based on this initial work, we have seen [that] the rdf2neo approach [is] suitable for the generalisation and extensions described hereby [replace "hereby" with "here"] and to be used to manage [replace "to be used to manage" with "for use in managing"] the enterprise ETL use case described above, where we have similar needs to align mixed data models and technologies.".
d) In Section 4.1: "Another limit [replace "limit" with "limitation"] of our approach".