Complexity of redundancy detection on RDF graphs in the presence of rules, constraints, and queries

Paper Title: 
Complexity of redundancy detection on RDF graphs in the presence of rules, constraints, and queries
Reinhard Pichler, Axel Polleres, Sebastian Skritek, Stefan Woltran
Based on practical observations on rule-based inference on RDF data, we study the problem of redundancy elimination on RDF graphs in the presence of rules (in the form of Datalog rules) and constraints, (in the form of so-called tuple-generating dependencies), and with respect to queries (ranging from conjunctive queries up to more complex ones, particularly covering features of SPARQL, such as union, negation, or filters). To this end, we investigate the influence of several problem parameters (like restrictions on the size of the rules, the constraints, and/or the queries) on the complexity of detecting redundancy. The main result of this paper is a fine-grained complexity analysis of both graph and rule minimisation in various settings.
Full PDF Version: 
Submission type: 
Full Paper
Responsible editor: 

Revised manuscript after an accept with minor revisions. Now accepted for publication. Reviews of the first round are below.

Solicited review by anonymous reviewer:

The paper considers the problem of redundancy elimination in RDF graphs for various settings and extends a line of work that was begun in [23].
Figures and typesetting are good. It is very well-written, technically sound, and generally well-motivated. It clarifies its relationship to state-of-the-art SemWeb standards.

It should be accepted, pending minor revisions only.

Major comments (not all points are criticism, some are just thoughts):

- I am not sure whether it is a good idea to state that the author's work is based on practical observations. This term is irritating to me. The authors do not seem to have experimented with their approach, or at least they do not report about their experiments. Maybe you should reformulate this phrase such that it fits into your argumentation.

- Some proofs appear much too lengthy to me. Especially, the proofs of Lemma 3.5, 3.7, 4.3, and 5.5.

- In the paper you switch between SPARQL and conjunctive queries. You should provide a paragraph on the relationship between these two formalisms. Besides, you should formally introduce the SPARQL fragment that you consider. Defining SPARQL by a citation is not enough to my mind's eye.

- You provided many complexity results for decision problems in your paper. Is there a fundamental reason why you did not consider the corresponding construction problems? What about their complexity? Can you give an outlook?

- Please double-check your citations. Sometimes you use abbreviations for journals, sometimes not. For [13] the conference name is completely missing.

Minor comments:

- On page 2 you should provide a citation for tgds.
- On page 2 you state that a constraint may be read as a generalized rule. Please make this more explicit by citing the Motik paper on the semantic differnece between rules and constraints.
- On page (2) the query (16) is not a conjunctive query.
- On page 5 the definition og the semantics of a conjunctive query does not seem correct to me. Please check.
- On page 5 you should provide a citation for the completeness of the QSAT problems.
- Definition 2.1: are the 3-colorings of V_1, V_2 and V_3 with respect to the induced substructures?
- Table 1 is mentioned directly after Theorem 3.1. Please insert the table one page earlier.
- On page 37 you state that your work can be considered as the starting point for studying the equivalence of SPARQL queries under RDFS, RIF and OWL2RL semantics. You should provide a citation related to semantic query optimization, where query equivalence under generalized rules or constraints, that contain negation and disjunction, is considered. E.g. the PODS 2010 paper "Semantic Query Optimization in the Presence of Types". This work directly applies to some SPARQL fragments because RDF triples come, by definition, with a typing of its three components. So, SPARQL query optimization is an example for a scenario in which you naturally have rules that contain negation and disjunction.
- I doubt that citation [30] is really necessary.

Solicited review by Adila Krisnadhi:

This paper is a purely theoretical paper about complexity of redundancy elimination/detection of RDF graphs in various situations. It presents a fine-grained complexity analysis for the mentioned problem with respect to various settings of rules, constraints and queries. The results are particularly relevant for the application of rule-based inference on top of many RDF stores and also in the context of entailment regime for SPARQL.

This is a solid technical paper with important theoretical contributions as such a collection of complexity results are necessary to understand how easy/difficult the problem of redundancy elimination/detection of RDF graphs with respect to rules, constraints and queries. I also found no problem in all the proofs. Regarding the authors' claim that this paper is a significant extension over the earlier conference version, I agree that the results in this paper are indeed stronger. Overall, I believe that this paper merits an acceptance after taking into account my remarks below.

First, the title of the paper: "Redundancy Elimination on RDF Graphs ... " can be, in my opinion, a bit misleading. A better title would be "Complexity of Redundancy Detection on RDF Graphs ....". When I first read the title, I initially thought that the paper is really about methods/techniques to eliminate redundancy in RDF graphs which, in fact, is not really the case. All the results are about the complexity of the decision problems related to redundancy detection on RDF graphs. Even if there are parts that can actually be used to really eliminate redundancy on RDF graphs, I cannot clearly see it from the discussion. The reason of this might be due to the fact that the analysis is more fundamental than redundancy elimination as it focuses instead on the associated decision problems of redundancy detection. This is not necessarily bad by itself, but since the title somewhat indicates redundancy elimination as the main theme, I think that this is actually a slight discrepancy. To improve the paper, I suggest that the authors change the title to something similar to my suggestion earlier, and even better, add some discussion on the actual methods/techniques on redundancy elimination in which the results in this paper will be important ingredients.

Another issue with this paper is the readability. I applaud the approach the authors used which is by always starting with overview and rought intuition of the results before proceeding to the really technical parts of the results. Despite this, I still found it hard to follow the discussion because the paper really presents many reduction proofs which are similar to some degree. One way to alleviate this is to move some or all parts of the proofs into an appendix and then replace them in the main text with the suitable informal intuition. That way, the narration can be a bit easier to follow and a reader can more easily note the comparison between the theorems/lemmas.

Table 3: it would be better for comparison if the results on Table 1 and Table 2 are also included in this table. That way, the changes in complexity results can be more easily pointed out.

Several results regarding the complexity of checking whether a RDF graph satisfies a set of constraints are used in several places:
- The PI_2^P complexity (do the following refer to the same thing?):
- in Proof of Lemma 3.4 (step (2)), [10, Proposition 5.5,(1)] is refered.
- in Proof of Lemma 5.3 (step (3)), [11] is refered.
- The tractability: Proof of Lemma 3.6 refers to [23, Proposition 3].
==> As this is an important ingredients in the technical content, I suggest to put them somewhere in the preliminary, before using them in the proofs.

Claim 2 in Proof of Lemma 5.5 recalls the definition of G^i_v from the proof of Lemma 3.7 with slight change. I suggest that instead of referring to Lemma 3.7, the actual definition of G^i_v for this context is spelled out explicitly here.

typos, etc.:
- title text on page headers: is it deliberately shortened? (I'm not sure whether this is the requirement from the journal itself).
- p4, right column, paragraph 2, second sentence before last on definition of satisfication of a constraint over RDF graph:
should be ".. if for each homomorphism h on X mapping BGP(Ante) to G ..." (missing h)
- p4, right column, last paragraph, on the explanation of least fix-point of the immediate consequence operator:
- delete extra right curly brace after G
- should mention that /mu is a homomorphism.
- p6, the paragraph before Definition 2.3:
should be "Q-3COL_{forall,2}
- p7, last sentence of the paragraph below Proof of Lemma 3.2:
better wording is "The concrete proofs are given in Section 3.2, but beforehand, the rough intuition of these results is sketched."
- p9, first paragraph of Proof of Lemma 3.5:
need correction on whitespace in "Q-3COL_{exists,3}"
- p9, last line:
need correction on whitespace in "Q-3COL_{exists,3}"
- p10, right column, first sentence on the "only-if direction" paragraph:
whitespace correction similar to the one in p9.
- p16, right column, the paragraph before Claim 2:
" ... convenient for formulate ... " ==> " ... convenient to formulate ... "