DIAERESIS: RDF Data Partitioning and Query Processing on SPARK

Tracking #: 3284-4498

Authors: 
Georgia Troullinou
Giannis Agathangelos
Haridimos Kondylakis
Kostas Stefanidis
Dimitris Plexousakis

Responsible editor: 
Aidan Hogan

Submission type: 
Full Paper
Abstract: 
The explosion of the web and the abundance of linked data demand effective and efficient methods for storage, management, and querying. Apache Spark is one of the most widely used engines for big data processing, with more and more systems adopting it for efficient query answering. Existing approaches exploiting Spark for querying RDF data, adopt partitioning techniques for reducing the data that need to be accessed in order to improve efficiency. However, simplistic data partitioning fails, on one hand, to minimize data access and on the other hand to group data usually queried together. This is translated into limited improvement in terms of efficiency in query answering. In this paper, we present DIAERESIS, a novel platform that accepts as input an RDF dataset and effectively partitions it, minimizing data access and improving query answering efficiency. To achieve this, DIAERESIS first identifies the top-k most important schema nodes, i.e., the most important classes, as centroids and distributes the other schema nodes to the centroid they mostly depend on. Then, it allocates the corresponding instance nodes to the schema nodes they are instantiated under. Our algorithm enables fine-tuning of data distribution, significantly reducing data access for query answering. We experimentally evaluate our approach using both synthetic and real workloads, strictly dominating existing state-of-the-art, showing that we improve query answering in several cases by orders of magnitude.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Louis Jachiet submitted on 02/Dec/2022
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (4) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

==================== GENERAL FEEDBACK ============================

This paper introduces a new SparkSQL-based SPARQL query evaluator
called DIAERESIS. This system is based on a novel Data Partitioning
scheme that tries to split the graph using information derived from
the schema graph.

The paper is composed to two main parts, a first part explains the
novel partitioning scheme and second part validates the interest of
the approach by running a benchmark comparison with other Spark-based
SPARQL query evaluators, namely S2RDF, SPARQLGX and WORQ.

I think the approach has great qualities:

- extending the vertical partitioning with a partitioning on nodes is
a good idea, especially for a Spark-based method where this
partitioning serves as a sort of index.

- the approach was actually implemented with the code is available

- the authors ran a benchmark over several datasets with several
competitors, which is a long and challenging task

However, I also see important problems in the current papers:

- many parts of the paper are hard to understand. For instance, I
know RDF and BGP, but I found your explanation of those a bit
confusing. In particular, while I understand the overall approach
for your partitioning, much of the actual details explained in
section 4 remain unclear to me.

- The benchmark does not fully convince me of the interest of the
approach. The benchmark in the following directions (see the
detailed comments for more information):

- give more details in the setup

- use more/other datasets that do take into account the scale of the data

- use competitors that are not Spark-based. I fully agree that it
is not because Virtuoso (or any other) outperforms you in
certain cases that your method is not interesting, but we need to
see a comparison with up to date and actually used stores.

- explain the situation w.r.t. LUBM and SWDF(u) (see below)

- since your method is based on a parameter k, explain the effect
of k on the performance of your system. At the moment I am
completely unable to say whether your method performs well
because Spark SQL is fast or because the partitioning is working
as intended. Also fix the optimization bias error that, I think,
you make (or explain why it is not an issue).

For all those reasons I hesitated between reject and major revision.
I selected major revision because I think this work has potential in
terms of significance and originality but please note that this paper
will require a lot of work to be accepted both in the writing and in
extending the experimental part.

==================== DETAILED COMMENTS ============================

Note that I gave many comments and therefore there might be more
issues that I have forgot to explain here.

--- Comment #1

The first § of the introduction is a bit too much hyperbolic. For
instance in:

To store, manage and query these ever increasing RDF data, many
distributed big data processing engines have been developed, like
Hadoop, HBase and Impala

it supposes that those tools (Hadoop, etc.) were developed for RDF, this is false.

--- Comment #2

The problem § includes this sentence:

As Spark is focusing on a balanced data distribution, by default, it
partitions triples based on a hashing of the whole triple.

Of course Spark will use hashing for the operations such as groupBy, reduceBy, etc.
but the storage in HDFS does not use hashing.

--- Comment #3

In the following sentence:

The set C includes all classes and the set P includes all properties, except
rdf:type which connects individuals with the classes they are instantiated under.

What do you mean exactly? That rdf:type is not a member of P?

--- Comment #4

Why introduce complex RDF machinery (such as RDF datasets) when you
seem to ignore their existence in the following? It feels like you
could just say something along the line "the core component of any
SPARQL query evaluator consists in matching BGP to labeled
graphs". Indeed, in the SPARQL queries that you support there is no
way to actually navigate the various graphs of an RDF dataset.

--- Comment #5

I understand that you don't want to introduce the full SPARQL and
that you don't even want to explain in lots of details what is BGP
and what is the semantics of matching but the current explanation
is unsatisfactory: I believe it won't be able to help someone who
does not already know about SPARQL

The introduction of SPARQL is not very clear. It does not state that
you limit yourselves to BGP queries (even if SPARQL is bigger than
just BGP queries).

I don't fully agree with "Join operations are encoded in those queries
by sharing the same variable in more than one triple pattern". To me,
each TP defines a set of solution and then the BGP is just a join
between the solution of all TP composing the BGP.

The part on star and path queries should not be in the same § as the
introduction of what is a BGP.

For this whole section 2.2 an example would be useful.

--- Comment #6

In 2.3 and later in the related work you write about the "Spark’s
default implementation for querying RDF datasets" but Spark does not
have a "default implementation". You can talk about the "naive"
implementation if you want (storing everything in a single file).

--- Comment #7

The related work only mentions alternative based on Spark but many
SPARQL query evaluators are not based on Spark and some do use a
partitioning such as the one you have. Even if this work is the only
Spark-based query evaluator that does partitioning you have to
mention them.

--- Comment #8

I was not able to understand much of section 4.1 and 4.2. Thankfully
the paper is written in a way that I can read the rest supposing a
black box giving me the partition.

--- Comment #9

Equation (2) what does normal mean exactly? I get that it is
normalized but how? Is normal(BC(v)) divided by n² for n the number
of nodes? And for InstV(v)? Is it divided by the total number of nodes
in the graph?

--- Comment #10

The Definition 3, as it is written, does not make sense. You take
particular nodes u_i and u_j but how does the result does not depend
on the particular choice of u_i and u_j? Also, what |u_i| and |u_j|
means? Should I somehow interpret this as a sum over all possible u_i
and u_j? The notation Distinct(u_j) for the number of distinct pairs
(u_j,u_j) does not make sense to me. Is there a typo stating that one
u_j should be a u_i? if so why does it not appear as a parameter of
Distinct?

--- Comment #11

As said above, I did not get much of section 4.1 and 4.2 but the
claimed complexity in 5.1 does not depend on the number of edges in
the graph but CC(p(v_k,v_s)) was depending on the number of distinct
u_j, u_j. How do you compute this without looking at the edges?

More generally how, starting from a bunch of triples, do you compute
the number of distinct nodes without going through each triple? Do you
suppose that the input is already indexed in some way?

--- Comment #12

I understand that you add the constant (1+|v|) / |v| so that the CC
stays above 0 but why this specific number? Is there some reason to
use this and not e.g. 1? Note that on a large graph this number will
actually be very close to 1 in practice…

--- Comment #13

I don't see the point of giving the pseudo code "algorithm 1" without
giving what each function used by algorithm 1 does:
caclulateImportance (sic), selectTopKNodes, selectPartitionBalanced,
schemaNodesInPathWithMaxDependence, getNeighborsAndProperties,
instances.

--- Comment #14

LUBM is a benchmark with an ontology. If you do not take the ontology
into consideration many of the given queries return empty sets. How
did you take into consideration the ontology? If you did not take the
ontology into consideration discarding the queries returning empty
results might widely change the result.

--- Comment #15

LUBM comes with 14 queries, what are the 13 provided queries of line
21 page 13?

--- Comment #16

SWDF fits within 49.2 MB. I feel it is not very fair to include such a
small dataset in a comparison between Spark-based query evaluator
because the typical HDFS block size is bigger than this… Also it is
unclear what exact queries you used in the paper. On the github
https://github.com/isl/DIAERESIS/blob/master/queries/swdf/ we see a
lot of queries did you use those?

--- Comment #17

The selected datasets are relatively small for a big data tool
especially considering the potential problems with LUBM (see above).

--- Comment #18

It seems you are setting the parameter k (the number of partition)
using the benchmark and then publish the result with that benchmark.
It is an optimization bias. Furthermore, it would be much more
interesting to present a benchmark with different values of k so that
we see the effect of k.

--- Comment #19

Which exact version of the competitor did you use? They have been
published for some time with several versions and several options
available. What options did you use? For instance S2RDF and SPARQLGX
can use a statistic table, a prefix compression/expansion scheme and
Hadoop compression or an upper bond for extVP tables. Did you activate
some or these options or not? If not, how it is possible that DBpedia
is stored on 3 GB with SPARQLGX?

--- Comment #20

You use 4 physical machines but you only compare Spark-based query
evaluators. 4 machines is a bit small for large datasets…

Furthermore, you say that they have 400 GB of storage but what
kind of storage? For an experiment over large datasets it is really
important to know this as data reads and writes are generally the
bottleneck. If each machine has a single hard drive the 38 cores are
pretty much useless for most queries…

--- Comment #21

There seems to be very little change in processing time for SPARQLGX
between the various LUBM (except the big one and even the path queries
take less time than on LUBM2300). How is this the case? Is there some
"booting" time?

--- Comment #22

Fig 9 seems a bit useless to me.

--- Comment #23

For the Data Access Reduction I would like to see some sort of explanation.

--- Comment #24

You say that you competitors do not support the queries in SWDF(u), so
why include SWDF(u)? you can explain that you do support features that
they don't support but it seems a bit weird to include that has a
benchmark with 0 points of comparison. Also they both have a way of
supporting them by bypassing the (ext-)VP tables…

Review #2
Anonymous submitted on 03/Jan/2023
Suggestion:
Major Revision
Review Comment:

### Summary

The paper presents 'IAERESIS, ' SPARQL engine built on top of SPARK SQL. The system uses a two-level partitioning technique that involves clustering schema nodes and the second vertical partitioning step. DIAERESIS also includes its query processors, which exploits (SPARK's) indexes and query" optimisations (not specified). The system is compared with three baselines, i.e., S2RDF, WORQ, and SPARQLGX, over three datasets LUBM (different scales), DBPedia, and Semantic Web Dog Food.

Several attempts to execute SPARQL queries on top of Apache Spark (SQL or not) have been proposed. Usually, they fall back to engineering a relational schema that optimises a given query family. In these regards, the fundamental role of partitioning has been highlighted, yet the issues are far from being solved. Therefore, additional contributions in this space are more than welcome.

The paper is overall well written. However, there are some necessary improvements. In particular, it was not immediately clear whether the authors were proposing a fully distributed query engine or a new partitioning technique. This is because of the unbalance in the explanation: most of the part is used to describe partitioning (hence, pre-processing). On the other hand, the evaluation mainly focuses on assessing DIAERESIS superiority in query answering, limiting the study of the partitioning quality to only one experiment (differences in data access). This is essential, as it impacts how the work should be evaluated.

In its current form, I believe the paper requires a significant revision. As I will explain below, some unsolved points need action.

**Comments on Background**

The part on **RDF** should be more structured. It currently presents the concepts in line, including basic definitions.
Redefining the Dataset concept may be confusing considering its meaning in SPARQL query answering.
Finally, the interpretation of RDF data as a graph is necessary for the paper and should be expanded.

The subsection on **SPARQL** is extremely limited and not very rigorous. An exact characterisation of what queries can be answered is essential for judging the fairness of the comparison. In particular, what the authors refer to as "path" queries are, in fact ", chains" or "linear" as called in Schätzle, A., Przyjaciel-Zablocki, M., Skilevic, S., & Lausen, G. (2015). S2RDF: RDF querying with SPARQL on spark. _arXiv preprint arXiv:1512.07021_.

This last point frankly confused me quite, as I could not find a path query (using regular expressions on edge, for instance, in the repository).

**Part on SPARK**

A clear explanation of how indexing works in SPARQK is essential. To my knowledge, it is closer to an in-memory view over a dataframe. Is that the way indexing was implemented, or the authors did something different?

The spark query optimisation method should also be explained, as the authors later claim but don't provide details, about some implemented optimisations.

**Comments on the Contribution**

As mentioned above, the paper is centred around the novel partitioning technique proposed by the authors. In particular, the proposed approach is based on the assumption that workload and data don't change.

Although such an assumption could be reasonable in some contexts, DIAERESIS systematically takes more pre-processing time for one dataset. Yet, the robustness of its performance are not tested against different workloads. Therefore, understanding the impact of the Contribution is quite tricky. In practice, the paper as it currently does not answer the questions:

- is it worth it to be slower in pre-processing to gain in query execution for the given datasets and workloads?

An additional question regards the relational schema adopted by the system. Initially, i though it was a triple-table plus a partitioning technique. Then I understood a vertical partitioning step is included. Therefore there should be a relation for each property. I think this part should be explained better, especially elaborating on the query translation part.

- is the query translation performed manually or not?

The authors mention query optimisation based on statistics collected during pre-processing.

- did they modify the catalyst spark optimiser to take such statistics into account?

About the "Query Index": Aside from naming issues (query index is a bit confusing as it reuses an already defined term), I understood this is a de-fact planning step, which could, in principles implemented directly into the catalyst.

- How is the query index done in practice?

The technical quality of DIAERESIS is also hard to assess. The authors provide the repository, which is excellent, but I was not able to find the data (a sample would have been enough).

The repository is not well organized, and it was not easy for me to understand, by inspecting the code, how things were designed and implemented:

- are indexes implemented using SPARQL SQL?
- the parsing seems to be broken as specific queries have "<>" and some have not.
- The workloads are not organised as in the paper; it takes a bit more time to validate.

**Comments on the evaluation**

Most of my concerns regard the evaluation of the approach in relation to its positioning, as mentioned above. Trying to avoid the usual comment, "why not use dataset X" my first concern was why the authors did not reuse the datasets and workload (except LUBM) used in the paper of their baselines.

In addition, the most critical point is the lack of some ablation studies. DIAERESIS contains two levels of partitioning, indexing, and query optimisations. The impact of each of these components should be studied independently.

Moreover, the reduction of data access is the only assessment of the quality of the partitioning. This is not enough, considering the weight of such a contribution to the work. What should be verified is its performance variating the workload (splitting in 2/3 the existing ones) as well as the number of partitions available.

Regarding the failure of some systems,
- failure in pre-processing is a bit hard to justify; it is known that real-world datasets suffer in data preparation.
- I believe the issue with WORQ could be solved by rewriting the query as a free join plus a selection (FILTER== URL).

Finally, the following questions remain unanswered
- Why were only three out of the several state-of-the-art approaches where compared? (missing [A])
- How does DIAERESIS without optimisation performs?
- What are the consequences of using a real-world workload?
- is the comparison fair in terms of the used software?
- in particular, S2RDF was implemented in 2016, and Apache Spark has had two major releases since. I understand the cost of reimplementing the approach, but is it possible that the results also depend on the software version?

[A]: Cossu, Matteo, Michael Färber, and Georg Lausen. "Prost: Distributed execution of SPARQL queries using mixed partitioning strategies." _arXiv preprint arXiv:1802.05898_ (2018).

**Minors**

In Section 4, definitions 3 and 4 require two examples that can help explain the definitions.

Importance should be introduced as a definition, and its relation to [22] should be explained.

The description of competitors 6.1.2. can be omitted as it was covered before.

The repository can also host the evaluation results that do not fit the paper.

Review #3
Anonymous submitted on 10/Jan/2023
Suggestion:
Minor Revision
Review Comment:

This paper presents a data partitioning and query processing technique for semantic data (RDF format) using Apache Spark. The innovation in the work is their technique for partitioning which is different from the recent works that focus on extensions of vertical or horizontal partitioning. They use a technique to identify important schema nodes and create groups our clusters around those important ones using betweenness centrality, cardinality closeness of nodes, and dependence between nodes. They arrange data using this method and then apply vertical partitioning within the individual clusters before performing query processing.

Evaluation is done through experiments with the use of synthetic datasets such as LUBM and real datasets such as DBPedia.

Overall the presentation of the work is very good and reads well. The related work discussion is presented well. The use of a running example to describe their methodology works well for understanding the work. The evaluation results show that the proposed system outperforms the state-of-the-art systems. Overall this work is interesting and very useful to the research community.

Comments:
----------

Fig 6 shows the query execution performance of the different systems being compared. It is obtained using an average of 10 executions of each set of queries. It is not clear how many queries were used and which ones. Also, what is the rationale for taking the average of different categories of benchmark queries?

Q8 is snow-flake shaped query, is there a reason that distinction was not made and separated for the analysis of results?

A discussion of the analysis of why DIAERESIS outperforms in certain situations such as large datasets or specific query types is missing. This is important for the research community reading this paper. Is the system much better for a specific query category in comparison with others and why?

Was the Watdiv benchmark considered? It has queries categorized evenly into all 4 types (linear, star, snow-flake, complex).

Minor: Authors say this in multiple places, "... is presented in sequel". It would be better to reference using the section numbering.

Minor: Does DIAERESIS have a full form? Is it an acronym?

------

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (4) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.