Semantics and Canonicalisation of SPARQL 1.1

Tracking #: 2757-3971

Jaime Salas
Aidan Hogan

Responsible editor: 
Guilin Qi

Submission type: 
Full Paper
We define a procedure for canonicalising SPARQL 1.1 queries. Specifically, given two input queries that return the same solutions modulo variable names over any RDF graph (which we call congruent queries), the canonicalisation procedure aims to rewrite both input queries to a syntactically canonical query that likewise returns the same results modulo variable re-naming. The use-cases for such canonicalisation include caching, optimisation, redundancy elimination, question answering, and more besides. To begin, we formally define the semantics of the SPARQL 1.1 language, including features often overlooked in the literature. We then propose a canonicalisation procedure based on mapping a SPARQL query to an RDF graph, applying algebraic rewritings, removing redundancy, and then using canonical labelling techniques to produce a canonical form. Unfortunately a full canonicalisation procedure for SPARQL 1.1 queries would be undecidable. We rather propose a procedure that we prove to be sound and complete for a decidable fragment of monotone queries under both set and bag semantics, and that is sound but incomplete in the case of the full SPARQL 1.1 query language. Although the worst case of the procedure is super-exponential, our experiments show that it is efficient for real-world queries, and that exponential cases are rare.
Full PDF Version: 

Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 30/Jun/2021
Minor Revision
Review Comment:

1.This paper gives a detailed semantics for SPARQL 1.1, including the set, bag and sequence semantics.
2.This paper offers a solution for canonicalising monotone SPARQL 1.1 queries.
3.The experiments demonstrate the practicability of the proposed canonicalisation method.

1.The proposed canonicalisation method only work for montone SPARQL queries. I think it would be better if the authors extend their canonicalization method to some non-montone SPARQL fragments, i.e. well-designed queries.
2.It would be better if the authors provide a complexity analysis for the canonicalisation method.

1.I don’t quite understand the content in Page 5: “and + for addition, and – for subtraction, * for multiplication, / for division, - are denoted as unary, binary and/or n-ary functions,”
2.In page 10, the definition of [Q_1 FE Q_2](D) and [Q_1 FNE Q_2](D) is problematic. E.g. Let $\mu_1 = \{?x \rightarrow 1\}$ and $Q = ((?x,?y,?z) FE bnd(?x))$ then $\mu(Q) = (1,?y,?z) FE bnd(1)$, which is syntactic wrong. I hope the authors can check the definitions in this paper:
3.I don’t understand the last sentence in the first paragraph in Page 20 “Note that even if there are duplicate triple patterns (more precisely, singleton BGPs) in a join, the result still holds under bag semantics, as each triple pattern can produce the same (compatible) solution only once when evaluated on an RDF graph: a set of RDF graphs”.
4.In the second paragraph in Page 20, I think the sentence “in other words, we translate unions of joins to (equivalent) patterns involving joins of unions” should be changed to “we translate joins of unions to unions of joins”.

This paper provides a clear semantics for SPARQL 1.1, including set, bag and sequence semantics. Also this paper proved that the canonicalization for montone SPARQL 1.1 fragments is decidable. I think the authors show dig a little bit deeper about the canonicalization for some non-montone SPARQL 1.1 fragments. Anyway, this paper is a good start for this line of research.

Review #2
Anonymous submitted on 01/Jul/2021
Minor Revision
Review Comment:

Overall, this is a high-quality paper that fills a gap in semantics and canonicalization for the SPARQL in terms of theory, algorithm, and experimental validation. The authors presented a formal semantics of SPARQL 1.1 and a canonicalisation procedure that produces a deterministic query string modulo congruence.

Personally, I love this paper, which was previously highlighted by SPARQL researchers in ISWC 2018, where it won the Best Student Paper Award. So I really appreciate that the authors were able to refine the work and tinker with a lot of theoretical and technical details after that.

Compared with the previous version, in this paper, the authors have done a lot of supplementary work, as well as expanded the canonicalisation from monotone SPARQL queries to SPARQL 1.1 level.

My major concerns are:

1) The author should explain the proportion or difference between monotone queries and other normal SPARQL 1.1 queries in the experiments, so as to better highlight the increment of this paper.

2) I would strongly encourage the authors to revise the writing of their paper, as throughout the text I found multiple issues, that often obscured clarity, for instance, in Page 2: “if we rewrite ?n to ?x in the third query”: should be ?n to ?z

3) Canonicalisation is very valuable for the SPARQL analysis study at the Session-level, see below work [1], and I recommend that the author analyze or mention this value in the introduction or related work.
[1] Xinyue Zhang, Meng Wang, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo, Guilin Qi, and Haofen Wang. "Revealing secrets in sparql session level." In International Semantic Web Conference, pp. 672-690. Springer, Cham, 2020.

4) The project on GitHub is not good enough, which undermines the impact of this work. It should have received more attention. I suggest that the authors continue to improve the Git project, such as provide more detailed dataset, algorithm and test instructions, so that more SPARQL researchers can use the results of this work.

Review #3
By Guohui Xiao submitted on 03/Jul/2021
Minor Revision
Review Comment:

The submitted manuscript "Semantics and Canonicalisation of SPARQL 1.1" studied the problem of computing the canonical representation of SPARQL queries. Query canonicalization is an important research topic, which can be used in several use-cases, e.g., caching, views, log analysis, optimization. The paper covers all the major aspects of this topic, including formalization, complexity, algorithms for different fragments, implementation, and an empirical evaluation. The obtained results are significant and comprehensive. I think the paper can be accepted after some minor changes.

1. One contribution of the paper is the Semantics of SPARQL 1.1. The authors claim that the paper includes "features often overlooked in the literature." Since the semantics of SPARQL 1.1 have been proposed in several papers as mentioned by the authors. Currently, there is some discussion comparing with existing semantics but spread in the text. It would be very helpful to explicitly list "features often overlooked" and compare them with existing proposals in a separate subsection.

2. The paper is very long. It is good that all the proofs are provided inline but it also makes the paper a bit difficult to follow since the long proofs break the reading flow. Please consider putting some proofs (in particular those of the lemmas) to the appendix to make the paper more streamlined.

3. Minor. On page 15, "Based on a query of the form SELECT_V(Q), we define *six* syntactic query classes". In the list, there are 11 classes.

4. In Conclusion, the authors mentioned OBDA [56]. However, citation [56] only considered CQs as the query language. Consider citing more recent works on OBDA where SPARQL is used as the query language. For example: Xiao et al. Ontology-based data access: A survey. In IJCAI-18, pages 5511–5519, 7 2018.

5. The code is a very useful resource. Here are some comments on the code in the Github repo:
- Please provide instructions on how to compile the code
- All the dependencies are provided as directly jar files, which are in general difficult to manage and upgrade. Consider using a build tool, e.g., Maven or Gradle.
- There seem to be some typos in the README. E.g., `java Benchmark -x filename -n numberOfQueries -o offset` should be `java main.Benchmark ...`