Review Comment:
This paper is a study of graph partitioning techniques applied to RDF graphs.
There is an introductory part where RDF and graph partitioning techniques are presented.
Then a benchmark is presented where several techniques are compared on three platforms on two RDF datasets.
The idea of the benchmark is to use federated query engines such as FedX to evaluate federated queries on several SPARQL endpoints where the different partitions are spread.
The benchmark produces data presented in tables and figures.
The paper is well written and is interesting.
It represents an impressive amount of thorough work, but it suffers from several problems.
The main problem is that, in the end, we learn only that some partitions may be better than others for three specific platforms on two datasets. We do not really gain much insight.
The authors do not really explain what depends on the federated engine, what depends on the RDF dataset and what depends on the partition. In the end, we do not know what to conclude.
The RDF presentation does not take named graphs into account. This may have an impact on graph partitioning.
There is an error in the definitions about RDF.
The tables are not very clear: units are missing and the log scales are somewhat misleading.
The SPARQL queries in the benchmark do not cover the whole of SPARQL.
Many features are missing: property paths, subqueries, MINUS, named graph patterns, FROM, FROM NAMED, EXISTS, aggregates.
We do not know the size of the datasets, there are no example queries, and there is no URL where query examples can be found.
A threshold of 180 seconds is set, after which a query is said to time out. Why 180 s?
How far beyond 180 s do these queries need to complete: 181 s? 1000 s? 10000 s?
Do they time out without partitioning?
When a timeout occurs, the authors use the value of 180 s to compute the average time, but maybe the query would take 1 hour?
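To illustrate the concern about averaging capped values, here is a minimal sketch. The runtimes are hypothetical and chosen by me, not taken from the paper; only the 180 s threshold comes from the text. It shows that two workloads whose slowest query differs by orders of magnitude obtain the same capped average, so the reported average is only a lower bound.

```python
from statistics import mean

TIMEOUT_S = 180.0  # threshold stated in the paper


def capped_mean(runtimes_s, timeout_s=TIMEOUT_S):
    """Average after replacing every runtime above the timeout by the timeout."""
    return mean(min(t, timeout_s) for t in runtimes_s)


# Hypothetical runtimes (seconds): the workloads differ only in the slow query.
workload_a = [10.0, 20.0, 181.0]
workload_b = [10.0, 20.0, 3600.0]

print(capped_mean(workload_a))  # 70.0
print(capped_mean(workload_b))  # 70.0

# The uncapped averages are very different, which the capped figure hides.
print(mean(workload_a), mean(workload_b))
```

Reporting the number of timed-out queries next to each average, or using the median, would make this effect visible.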
In the end, the paper is difficult to read and we do not learn that much about the effect of RDF partitioning.
I would suggest a qualitative study to understand why some partitions are better than others, and in which cases.
Detail
There is a problem with Definition 2:
"E includedIn {(s, o)|(s, p, o) in G} is the set of edges between the
vertices and l(s, o) = p if (s, p, o) in G is the edge labeling function of G."
->
In RDF, there may be several edges relating s and o, with different predicates: s p o, s q o, s r o.
Hence the labeling function l(s, o) should return a set of labels: {p, q, r}.
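A small sketch of the issue, using hypothetical triples of my own: with a single-valued labeling, as in the quoted definition, only one predicate per (s, o) pair survives, whereas a set-valued labeling keeps all of them.

```python
from collections import defaultdict

# Hypothetical triples: the same (s, o) pair is linked by three predicates.
triples = [("s", "p", "o"), ("s", "q", "o"), ("s", "r", "o"), ("s", "p", "o2")]

# Single-valued labeling l(s, o) = p: later triples overwrite earlier ones.
single_label = {}
for s, p, o in triples:
    single_label[(s, o)] = p

# Set-valued labeling: every predicate between s and o is kept.
label_set = defaultdict(set)
for s, p, o in triples:
    label_set[(s, o)].add(p)

print(single_label[("s", "o")])       # r
print(sorted(label_set[("s", "o")]))  # ['p', 'q', 'r']
```

An alternative would be to define G as a labeled multigraph, with one edge per triple.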
Named graphs are missing from the Preliminaries section on RDF.
It is not clear what "Rank score" and "Partitioning Imbalance" mean.
One sentence of explanation for each would be welcome.
"divided into n numbers of partitions of each size"
->
What does "of each size" mean?
"later two will go"
->
next ten triples will go
"If there is a significant access of a group of rows together, then Horizontal partitioning may make sense."
->
"group of rows" is undefined.
3.3. Predicate-Based Partitioning
"The idea is to group all the triples with same predicate and assign them into one partition based on a hash value computed on their **subjects** modulo."
->
Is the hash computed on their subjects or on their predicates?
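The distinction matters, as the following sketch suggests. It is my own illustration, not the authors' implementation: the triples, the helper names and the use of CRC32 as the hash are all assumptions. Hashing on the predicate keeps all triples with the same predicate in one partition, while hashing on the subject gives no such guarantee.

```python
import zlib


def partition_of(term, n):
    """Stable hash of an RDF term modulo the number of partitions (CRC32 is an assumption)."""
    return zlib.crc32(term.encode("utf-8")) % n


def partition_triples(triples, n, key):
    """Assign each triple to the partition given by the hash of key(triple)."""
    parts = [[] for _ in range(n)]
    for triple in triples:
        parts[partition_of(key(triple), n)].append(triple)
    return parts


def partitions_per_predicate(parts):
    """Map each predicate to the set of partition indices that hold it."""
    seen = {}
    for i, part in enumerate(parts):
        for _, p, _ in part:
            seen.setdefault(p, set()).add(i)
    return seen


# Hypothetical triples.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:carol", "foaf:knows", "ex:alice"),
    ("ex:alice", "foaf:name", '"Alice"'),
]

by_predicate = partition_triples(triples, 4, key=lambda t: t[1])
by_subject = partition_triples(triples, 4, key=lambda t: t[0])

# With a predicate hash, every predicate lands in exactly one partition.
print(partitions_per_predicate(by_predicate))
# With a subject hash, the same predicate may be spread over several partitions.
print(partitions_per_predicate(by_subject))
```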
3.5. k-way Partitioning
This section is not clear at all; in particular, the coarsening and uncoarsening phases are undefined.
"TCV-Min Partitioning: The total communication volume"
->
Define what communication volume is in one sentence.
Min-Edgecut Partitioning: explain why the method obtains the *minimum* number of connected edges.
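For reference, here is a sketch of the two objectives as they are commonly defined in the graph partitioning literature (for example, for METIS with unit vertex sizes). I am assuming these are the definitions the authors intend; the paper should state them. The edge cut counts edges whose endpoints lie in different partitions, and the total communication volume sums, over all vertices, the number of distinct foreign partitions that contain a neighbour. The graph and the assignment below are hypothetical.

```python
def edge_cut(edges, part):
    """Number of edges whose endpoints are assigned to different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])


def total_communication_volume(edges, part):
    """Sum over vertices of the number of distinct foreign partitions among their neighbours."""
    external = {v: set() for v in part}
    for u, v in edges:
        if part[u] != part[v]:
            external[u].add(part[v])
            external[v].add(part[u])
    return sum(len(foreign) for foreign in external.values())


# Hypothetical star graph: hub "a" in partition 0, three leaves in partition 1.
edges = [("a", "b"), ("a", "c"), ("a", "d")]
part = {"a": 0, "b": 1, "c": 1, "d": 1}

print(edge_cut(edges, part))                    # 3
print(total_communication_volume(edges, part))  # 4
```

The two values differ because the hub contributes three cut edges but only one foreign partition, which is why minimizing one objective does not necessarily minimize the other.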
Table 1
Results: indicate "number of results"
For BGP, TP, JV, etc., what do the numbers mean and what are their units?
What is the difference between BGP and TP?
Figure 6
Due to the log scale, similar times have different heights.
You compute the average time across different datasets (SWDF and DBpedia); what does that mean?
Figure 9
What does it mean exactly to have 1000, 2000, 3000 sources selected?
"Consequently, the general graph partitioning techniques may not lead to better performance when **implied** to RDF graphs."
->
applied
What do you mean by "general graph partitioning techniques"?
In the bibliography, restore uppercase for names and acronyms (rdf -> RDF, sparql -> SPARQL).