Handling Qualitative Preferences in SPARQL over Virtual Ontology-Based Data Access

Tracking #: 2856-4070

Marlene Goncalves
David Chaves-Fraga
Oscar Corcho

Responsible editor: 
Guest Editors Web of Data 2020

Submission type: 
Full Paper
With the increase of data volume in heterogeneous datasets that are being published following Open Data initiatives, new operators are necessary to help users to find the subset of data that best satisfies their preference criteria. Quantitative approaches such as top-k queries may not be the most appropriate approaches as they require the user to assign weights that may not be known beforehand to a scoring function. Unlike the quantitative approach, under the qualitative approach, which includes the well-known skyline, preference criteria are more intuitive in certain cases and can be expressed more naturally. In this paper, we address the problem of evaluating SPARQL qualitative preference queries over an Ontology-Based Data Access (OBDA) approach, which provides uniform access over multiple and heterogeneous data sources. Our main contribution is Morph-Skyline++, a framework for processing SPARQL qualitative preference to directly query relational databases. Our framework implements a technique that translates SPARQL qualitative preference queries directly into queries that can be evaluated by a relational database management system. We evaluate our approach over different scenarios, reporting the effects of data distribution, data size, and query complexity on the performance of our proposed technique in comparison with state-of-the-art techniques. Obtained results suggest that the execution time can be reduced by up to two orders of magnitude in comparison to current techniques scaling up to larger datasets while identifying precisely the result set.
Full PDF Version: 

Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Ruben Taelman submitted on 06/Aug/2021
Review Comment:

I thank the authors for their reply letter and updated paper.
All my previous comments (which were minor) have been adressed:

* The inconsistent naming issue has been resolved by using Morph-Skyline++ instead of Morph-Pref.
* The confusion around TPF and brTPF executing preference-based queries has been resolved.
* The hypotheses have been made objectively testable.
* The discussion around TPF has been made more nuanced.

I have no further comments for acceptance.

Review #2
Anonymous submitted on 29/Aug/2021
Review Comment:

The revised version of the manuscript has adequately addressed my comments, so I now suggest acceptance.

Review #3
By Aidan Hogan submitted on 01/Sep/2021
Minor Revision
Review Comment:

I appreciate the authors' detailed response to my comments.


# Previous review

In my previous review, the concerns I outlined were:

W1: The contributions are a bit unclear to me in terms of what, precisely, are the novel aspects of this work.

W2: There are certain aspects of the paper that I found unclear (listing specific cases).

W3: As a general limitation of the approach, it only works when the base dataset is stored as an RDB.

In particular, I requested improvements regarding W1 and W2. (W3 is acknowledged in the paper.)


# Assessment

Regarding W1: I think that the paragraph on "Contributions" in the introduction addresses this quite well.

Regarding W2, I think that the paper has proved considerably. Overall, I find that the paper reads much better than before.

I do, however, have a number of comments that I feel must be addressed before the paper is accepted. Most of these comments relate to the formal content of the paper. Noting that the contributions of the paper are primarily practical, I do not see these as being critical issues in terms of the overall purpose of the paper, but I feel they must be addressed before the paper can be published. I think that there are relatively simple solutions to these issues (based, e.g., on simplifying or removing some of the formal content, again considering that the contributions here are mainly practical, not theoretical). With that in mind I recommend a Minor Revision to address these remaining comments.

C1: Definition 6, point 7 covers preferences for SPARQL, but it appears to be incorrect.

- You cannot test preferences based only on one solution, so it does not seem correct in such cases to use μ |= crit. Likewise the definitions of the cases only consider one solution, which does not seem correct (and the cases appear redundant ...).
- Definition 6, point 7: It is unclear why (a)-(d) are defined. In the case of (a), this is just σ, which was defined previously. Cases (b)-(d) are covered by previous definitions, as far as I can see. (Note that these cases are not defined in the corresponding relational calculus case, nor do they need to be, for the same reason.)

= C1: Suggested change => I would suggest to define the preference operator in a similar way to RA, along the lines of { μ \in Ω | \neg\exists μ' \in Ω : μ' \prec μ } and remove cases (a)-(d).

C2: Theorem 1 and Proof: In my previous review, I mentioned that I did not really understand the purpose of this proof. Now that the intent is more clear to me, I think that the theorem and proof are problematic.

- The statement "the translation of Q to an equivalent SQL qualitative preference query with an equivalent Relational Algebra expression RA \in E over a relational schema S is a semantics preserving translation" is not properly defined. Specifically, the translation is not defined, "an equivalent SQL qualitative preference query" is not defined, and "a semantics preserving translation" is not defined. Also no instance is given, or at least I do not see what constrains the instance of S to "encode" O in any way (maybe something is assumed here about the mapping, but it is not part of the claim of the theorem).
- It seems as if the proof is defining the translation mentioned in the theorem, and so is defining the meaning of the theorem, which is not valid. In each case it is claimed that the translation is equivalent, but this is not proved, nor is the notion of equivalence between SA and RA defined (SA returns solution mappings, while RA returns tuples, but I do not see a definition of which tuple is the same as which solution mapping).
- The cross-product of solution mappings for SA is never defined. Rather the definition of SA for cross product is actually that for join. (Cross products for SA are not well-defined in the case that there is some variable in common to both sides.)
- The join conditions for SA and RA are not properly aligned (SA uses a natural join).
- Various operators in SPARQL are more flexible than the relational equivalents, such as UNION: in SPARQL UNION, we do not need the same variables to appear on both sides, whereas in SQL UNION, the attributes on both sides need to correspond (as noted in Definition 8). It is not clear how these cases can be translated, or how the translation can be equivalent.
- It is also unclear how OPTIONAL in SPARQL is translated to SQL in the general case. The only case that is considered is the OPTIONAL/!bound case. How is OPTIONAL translated to SQL if there is no !bound?
- (Generalising the last two points, I do not see how you can translate SA queries that may return UNBOUND to RA, but have not seen these cases restricted anywhere.)
- In "Difference", it is claimed that SA is equivalent to RA, but the SA expression is never defined (SA_2)?

= C2: Suggested change => These issues have little to do with the topic of the paper, which relates to preferences. Rather they all relate to "core ODBA" for SPARQL, which is covered elsewhere. Hence I would rather propose to replace the theorem and proof with a definition of the translation itself (currently implicit in the proof), and mention that it extends the standard OBDA setting with preference operators. (In fact, since the translation itself seems to fall into standard OBDA territory, I think it would also be acceptable to only define the translation for the preference case). Since the additional preference operators in SA/RA are defined in the same way (assuming C1 is addressed), then *assuming the OBDA translation of SPARQL 1.0 -> SQL to be equivalent* (under whatever conditions this is normally assumed for ODBA), the extension with preference operators must be equivalent under the same conditions. I do not think this argument needs to be stated as a theorem or formally proved. Whether or not the base OBDA methods produce an equivalent translation is not really the topic of this paper. The above issues then disappear.

C3: Most of the minor issues have been fixed. However, below I will list some additional minor comments to address.

= C3: Suggestion => Revise the comments below.


# Minor comments

- "a framework for processing SPARQL qualitative preference to directly query relational databases" -> "a framework for processing SPARQL qualitative preference[s by] directly querying relational databases" Or remove "to directly query relational databases" as it is explained next. (The original phrasing is a bit awkward, especially for the abstract.)

Section 1:
- "scor[ing] function" A bit more natural like this.
- "(RDB)[1–3]" missing space
- "these variables impact []the total query execution time" (or "have impact on")
- "previous work [19] that only concerns in evaluating" -> "previous work [19] that only addresses the evaluation of"

Section 2:
- "If we consider [that] most"
- "converting these data to RDF and [making] them"
- Figure 2: Add a little space (\hfill) between (a) and (b) to make the captions readable.
- "[16] avoids" -> "Börzsönyi et al. [16] avoid" is much more readable, versus reading "Sixteen avoids", which makes no sense. (Or it can be "Börzsönyi and colleagues [16] avoid" for consistency.)
- "avoids either comparing the input instances against themselves in the best of the cases" Unclear to me what this means. I get the idea of avoiding unnecessary (reflexive) comparisons, but this could be better phrased.
- "to translate the SPARQL-to-SQL" -> "to apply a SPARQL-to-SQL translation"

Section 3:
- "based on the SPARQL NOT EXISTS" -> "based on[] SPARQL NOT EXISTS"

Section 4:
- Definition 3: It would be a little cleaner to define the atomic ones first, and then the combined preferences. It would also be good to introduce the terms used in the conditional case (φ appears to be a boolean condition, while γ and Φ are preferences?). Also it's a bit strange to use two variants of phi here; why not use a different letter?
- Definition 4: Instead of '\wedge' (and), you need something like ':' (such that). "There does not exist p_2 in \Sigma *such that* p_2 precedes p_1."
- Definition 5: Instead of the first '\wedge' (and), you need ':' (such that).
- Definition 6: "and μ(x) denotes the triple obtained by replacing the variables in a triple pattern x according to μ." x was defined as a variable just before this, so there is something wrong here. (Probably the quoted text could just be removed; the triple pattern case is covered later.)
- Definition 6: "A domain mapping set" A "domain mapping" is never defined.
- Definition 6: "A well-designed SPARQL graph pattern gp defined by ..." This does not define a well-designed SPARQL graph pattern. Maybe introduce it as "A SPARQL graph pattern gp defined by ..." and somewhere else add the assumption that it be well-defined?
- Definition 6, point 2: This defines any solution with these variables defined. It is missing the condition that μ and μ_{|v1,...,vn} be compatible.
- Definition 6, point 6: You do not need the '{' '}' as the union of sets returns a set.
- Definition 6, point 7: '\oplus' is never defined.
- "an element of gp to Ω" It should accept gp and an ontology? (I guess this is okay if considering the ontology as "given").
- Definition 8: What is the double bar used to define the domain relation? I'm not familiar with that notation.
- Definition 8: You have not defined the attributes of R_1 and R_2 to be A_1,...,A_n and B_1,...,B_n. Also you lose generality here as this defines the relations to be of the same arity (better A_m rather than A_n).
- "R2RML mappings to translate [the query] into SQL"

Section 5:
- "We have omitted a detailed description of the SPARQL query rewriting technique" That's reasonable, but maybe include just an example to give the "flavour"?
- "A[n] extended qualitative preference query"

Section 6:
- "As the [number of] preference criteria increases"?
- "and even [throws] timeout[s] in the last two queries"
- "and timeout[s] occur[]".
- "[The o]bserved results"
- "therefore, we think it is because all these queries include functions in their preference clause, in particular, the ABS function." Not really sure I understand why this would be an issue (or, in particular, why ABS is problematic)? I wonder: might it rather be the case that Virtuoso changed its plan because of the additional preference criteria, ending up with a much more costly plan?
- "4/15[ ]queries"
- "this kind of quer[y]"
- "from [] the theta self-join"
- Figures 8, 9: Would be good to follow the convention of the captions of the other figures, and provide details, such as dataset, etc.
- Figure 9 is very strange as some of the bars go downwards. While the origin for the x axis cannot be 0, due to the log scale, perhaps the unit could be changed to milliseconds, using 1 millisecond (or some other value before the min value in the data) as the origin.
- "this operator implemented [within] an RDB engine"
- "conceive [of] the"

Section 7:
- "allows [for] grouping" or better "supports grouping" or "enables grouping"


The limitations of related work as stated in the article are not comprehensible.
SkyTPF, for instance, only supports MIN and MAX preferences, so although it might not be possible to express the motivating example in SkyTPF, other types of queries are well supported.

In general though, a simple statement, such "skyTPF was not reported because it returned an empty answer" is a bit too bold. What causes this behaviour? Was there an issue when loading the data into the triple stores, was there a problem with parsing the query? Have the authors of this paper reached out to the authors of skyTPF to maybe receive support in running the evaluation?

The authors state that SPREFQL and SkyTPF produce incomplete and incorrect results and provide values for precision, recall and f-measure values in Table 4 and 5. It is not quite clear how these measurements were produced. How was the ground truth, i.e., the complete and correct result computed that all results are compared against? Since neither SPREFQL nor SkyTPF introduce any kind of approximation, incomplete results indicate more a technical error than a problem with the approach (maybe not all triples were loaded into the triple store?). Some elaboration and in particular sharing the queries would provide a higher degree of transparency and also enable other researchers to repeat and confirm the experimental results.

SkyTPF is reported to time out in 5 queries in Figure 7b. However, as the text describes, it is actually not a timeout, it's rather a Null Pointer Exception, which indicates an implementation error - but not a timeout, presenting it as such is misleading. As mentioned above, it should be possible to reach out to the SkyTPF authors for help fixing the issue. The authors of this paper are highly encouraged to do so. In fact, there has already been some communication and the offer of assistance - I (one of the SkyTPF authors) am happy to follow up on that.

In summary, the experimental section requires some more work to be acceptable. In particular, queries and results should be shared in more details so that the experiments can be repeated and verified by other researchers. In addition, the comparison against SkyTPF appears to be incomplete and not sufficiently fair. However, as described above, it should be possible to address these issues in a revision.