SPARQL2FLINK: Evaluation of SPARQL queries on Apache Flink

Tracking #: 2008-3221

Oscar Ceballos
Carlos Ramirez
María-Constanza Pabón
Andres Mauricio Castillo
Oscar Corcho

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper
Abstract:

Increasingly larger RDF datasets are being made available on the Web of Data, either as Linked Data, via SPARQL endpoints, or both. Existing SPARQL query engines and triple stores are continuously improving to handle larger datasets. However, there is an opportunity to explore the use of Big Data technologies for SPARQL query evaluation. Several approaches have been developed in this context, proposing the storage and querying of RDF data in a distributed fashion, mainly using the MapReduce programming model and Hadoop-based ecosystems. New trends in Big Data technologies have also emerged (e.g., Apache Spark, Apache Flink); they use distributed in-memory processing and promise to deliver higher-performance data processing. In this paper we present an approach for transforming a given SPARQL query into an Apache Flink program for querying massive static RDF data. An implementation of this approach is presented and a preliminary evaluation with an Apache Flink cluster is documented. This is a first step towards this goal, but further work is needed to ensure the optimization and scalability of the system.


Solicited Reviews:
Review #1
By Joachim Van Herwegen, submitted on 10/Sep/2018
Major Revision
Review Comment:

In this paper the authors propose a system for solving SPARQL queries over distributed systems using Apache Flink. To do so, they formalize the Apache Flink operators and show how to convert SPARQL algebra into these operators so as to obtain correct results.

The work itself is interesting and could definitely help in solving an actual issue, but in my opinion there are too many problems for the work to be accepted in its current form, for the reasons given below.

There are a multitude of grammatical and structural errors which detract from the text. There are also some errors in the formal definitions themselves. The authors mention that a formal proof of correctness is out of scope, but I wonder whether some of these errors could have been avoided if a formal proof had been made. All problems found are listed below.

Additionally, the evaluation done is quite limited. The authors evaluated their system over the BSBM benchmark to test whether their results are identical to those of a normal SPARQL execution framework. Unfortunately, this is the only evaluation, resulting in a small table with 9 times "Yes" for the queries executed. In my opinion this work needs a much more extensive evaluation section. In the end, the formalization was created so that SPARQL queries can be executed over Apache Flink and the inherent advantages of such a system can be exploited. In that regard, I would want to see what the actual impact of this work is compared to existing SPARQL solutions. Query execution time and possibly other metrics over parameters such as data size, data distribution, and query type would give a clearer view of the impact of using this framework.

** Formalization problems **

definition 5: "[i1, ..., im] = [v1, v2, ..., vm]" should probably be "t[i1, ..., im] = [v1, v2, ..., vm]"

I think there is a problem with definition 6. The intent is to group all records based on a projection, but with the current definition it does more than that.
Assume we have a record set T' = { [a: 1], [a: 2] }, f is a reduce function that sums the 'a' values of records, and to keep it simple K is the empty set {} so all records get reduced together.
Following the given definition, the result would be { [a: 3], [a: 1], [a: 2] }. [a: 1] is in there because there exists a set of records ({ [a: 1] }) that is a subset of T' for which all the given conditions apply.
I would expect the definition to contain something to make sure it takes all records that match for the given set of K.
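To make the intended semantics concrete, here is a minimal Python sketch of a key-grouped reduce (my own illustration, not the paper's formalization; the names `grouped_reduce` and `sum_a` are invented for this example). The point is that the reduce function must consume each *entire* group determined by the key projection K, so for the counterexample above, with K = {}, the only correct output is { [a: 3] }.

```python
from collections import defaultdict

def grouped_reduce(records, key_attrs, reduce_fn):
    """Group dict-records by their projection onto key_attrs,
    then apply reduce_fn to each whole group."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(sorted((k, rec[k]) for k in key_attrs))
        groups[key].append(rec)
    # reduce_fn consumes the *entire* group, never an arbitrary subset,
    # which rules out the spurious results [a: 1] and [a: 2]
    return [reduce_fn(group) for group in groups.values()]

def sum_a(group):
    # illustrative reduce function: sums the 'a' values of a group
    return {"a": sum(rec["a"] for rec in group)}

T = [{"a": 1}, {"a": 2}]
print(grouped_reduce(T, [], sum_a))  # [{'a': 3}]
```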

definition 10: records that do have a matching record in the other set are also matched with the empty record. This differs from the definition of SPARQL OPTIONAL
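As an illustration of the OPTIONAL semantics the definition should capture, the following Python sketch (again my own, with invented names; not the paper's definition) pads a left record with the empty record *only* when no compatible right record exists:

```python
def left_outer_join(left, right, join_attrs):
    """SPARQL-OPTIONAL-style join on dict-records: a left record is
    extended by every matching right record; only when *no* right
    record matches is it kept as-is (i.e., joined with the empty record)."""
    result = []
    for l in left:
        matches = [r for r in right
                   if all(l.get(k) == r.get(k) for k in join_attrs)]
        if matches:
            # matched records are NOT additionally padded with {}
            result.extend({**l, **r} for r in matches)
        else:
            result.append({**l})  # unmatched: left side survives alone
    return result

people = [{"name": "ann", "city": "x"}, {"name": "bob", "city": "y"}]
cities = [{"city": "x", "pop": 10}]
# ann gains 'pop' and is not duplicated; bob survives unextended
print(left_outer_join(people, cities, ["city"]))
```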

definition 11: this potentially has the same problem as definition 6

definition 13: in situation 5 (filter) it is never mentioned what the solution is for ||P||^D

definition 16: in the definition of "order", <= should be used instead of < since it is possible that records have the same value for a certain variable (so < ordering can not be guaranteed)
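A two-line sketch of why strict < fails here: with duplicate sort keys no total strict ordering of the records exists, whereas a non-strict (<=) ordering is always satisfiable, e.g. by any stable sort.

```python
records = [{"name": "ann", "age": 30}, {"name": "bob", "age": 30}]
# With strict <, neither record can precede the other (30 < 30 is false),
# so no arrangement satisfies t_i[age] < t_{i+1}[age].
# A non-strict ordering (<=) is satisfied by any stable sort:
ordered = sorted(records, key=lambda t: t["age"])
assert all(ordered[i]["age"] <= ordered[i + 1]["age"]
           for i in range(len(ordered) - 1))
print(ordered)
```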

definition 17: it is not specified that M' contains the first m elements of M. Why not have M' = {t1, ..., tm}?

It would have been interesting to see actual performance evaluations, to see what the impact is of transforming queries and executing them over Flink.

** Grammatical/structural errors **

page 2 (25), left: "which transform" -> "which transforms"
page 2 (22), right: "Blank nodes" should not be capitalized. Unless perhaps the idea is to make clear where the name of the set "B" comes from, but then "literals" should also be capitalized
page 3 (24), left: "a initial set of five Input Contract" -> "an initial set of five Input Contracts"
page 3 (27), left: "function" -> "functions"
page 3 (29), left: "function" -> "functions"
page 3 (43), left: inconsistent capitalization: "Core" is capitalized, but a bit further it is not. Same with "Libraries"
page 3 (12), right: "transform" -> "transforms"
page 4 (37), left: "an scheme" -> "a scheme"
page 4 (21), right: "to precise" -> "to specify"/"to clarify"/"to highlight"
page 4 (23), right: "the keys order" -> "the key order"/"the keys' order"
page 4 (30-31), right: these are the wrong dashes to use when writing "birth-city" and "residence-city", should use the same ones as on the line below it
page 5 (37), left: "it is applied" -> "it applies"
page 5 (40), left: "due to several records can have same values" -> "due to several records having the same values"/"since several records have the same values"
page 5 (42), left: "takes" -> "take"
page 5 (44), left: "to a specific semantics" -> "to specific semantics"
page 5 (43), left: "produces" -> "produce"
page 5 (21), right: "if there not exists a correspondent record in the another dataset" -> "if a corresponding record does not exist in the other dataset"
page 5 (42), right: "there not exists" -> "there does not exist"
page 5 (48), right: "it is defined the cogroup transformation" -> "the cogroup transformation is defined"
page 6 (11), left: "processes conforms groups" -> "processes groups" (not sure what was meant here otherwise)
page 6 (40), left: "a RDF dataset" -> "an RDF dataset"
page 6 (47), left: "with keys as RDF variables" -> "with RDF variables as keys" (unless I misunderstood. I assume this means that all keys are RDF variables)
definition 13: some inconsistent spacing in the definitions of f1 and f2
page 7 (40), left: "by means function f2" -> "by means of function f2"
page 7 (41), left: "solutions mappings" -> "solution mappings"
page 7 (43), left: missing dash in front of new explanation block
page 7 (51), right: "a order by query" -> "an order by query"
page 8 (18), left: "ot" -> "of"
page 8 (11), right: "submodule loads" -> "This submodule loads"
page 8 (25), right: "Translation Query To Logical Query Plan" -> "Translate Query to Logical Query Plan"
page 8 (25-28), right: this entire sentence is messy and seems to be missing some components
page 8 (30), right: "with and" -> "with an"
page 8 (44), right: "submodule convert" -> "This submodule converts"
page 8 (46), right: "from DataSet API" -> "from the DataSet API"
page 9 (36), right: "testing empirically" -> "empirically testing"
page 10 (43), left: "the resulting query not cont
page 10 (3), right: "we details" -> "we detail"
page 10 (5), right: "to testing empirically" -> "to empirically test"
page 10 (33), right: "presents gave" -> "presents a"
page 10 (39), right: "identified and discussed" -> "it identifies and discusses"
page 11 (2): "Sparql" -> "SPARQL", "means" -> "mean"
page 11 (22), right: "provide" -> "provides"
page 11 (36), left: "has been" -> "have been"
page 12 (3), left: "which use of Google DataFlow to processing" -> "which use Google DataFlow to process"
page 12 (48), left: "we will to provide" -> "we will provide"