Similarity Joins and Clustering for SPARQL

Tracking #: 3392-4606

Sebastián Ferrada
Benjamin Bustos
Aidan Hogan

Responsible editor: 
Marta Sabou

Submission type: 
Full Paper
The SPARQL standard provides operators to retrieve exact matches on data, such as graph patterns, filters and grouping. This work proposes and evaluates two new algebraic operators for SPARQL 1.1 that return similarity-based results instead of exact results. First, a similarity join operator is presented, which brings together similar mappings from two sets of solution mappings. Second, a clustering solution modifier is introduced, which instead of grouping solution mappings according to exact values, brings them together by using similarity criteria. For both cases, a variety of algorithms are proposed and analysed, and use-case queries that showcase the relevance and usefulness of the novel operators are presented. For similarity joins, experimental results are provided by comparing different physical operators over a set of real world queries, as well as comparing our implementation to the closest work found in the literature, DBSimJoin, a PostgreSQL extension that supports similarity joins. For clustering, synthetic queries are designed in order to measure the performance of the different algorithms implemented.
Full PDF Version: 

Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 07/Apr/2023
Minor Revision
Review Comment:

The work describes an interesting proposal to extend the SPARQL query language with operators that are able to compute N-to-M similarity joins using distance metrics over multiple dimensions as well as a grouping operator based on clustering.

The work is an extension of a previous published work.

The proposal is well motivated and adds interesting functionalities, the implementation is also provided showing that is actually feasible to implement the proposed methods within a real system.

Overall, the paper is well written. It could improve if the two proposed operators and queries, with use cases where presented earlier in the introduction, this would make much easier to follow the exposition in the paper.

Review #2
By Agnieszka Lawrynowicz submitted on 16/Apr/2023
Major Revision
Review Comment:

The paper is engaging and well-written.
It provides a thorough theoretical analysis, implementation, and experiments, making the related artefacts publicly available.
Moreover, the described research results may be useful in practical systems in various applications.

I have some remarks, though, mainly dealing with the following:
1) novelty concerning the paper published at ISWC (reference [24]),
2) concerning the last release of the code,
3) for making the paper more understandable for the broader audience.

*** Novelty ***
Please provide a discussion with a clear distinction of the novelty of the current paper with respect to [24].
For instance, Table 3 is a copy of Table 1 from [24], etc.

*** Source code ***
Page 14: Evaluation
The code is from 3-4 years ago. Is that correct?

*** Other remarks ***

Page 6, line 5: Also grid-based, distribution-based, hierarchical clustering.

Page 7, line 9: Range-based semantics at this point needs to be clarified.

Page 8, Definition 4.7: I recommend adding information on the meaning of the r symbol to make the definition more self-contained and understandable.

Page 8, Definition 4.8: What is r'? (It is only explained on the next page) What is k? Is k related to k-nn?
The last line on page 8 needs to be clarified. What does it mean that k' may equal infinity?
Why does one need the term r' referring to the distance required to return at least k nearest neighbours? In the original k-nn formulation, there is no need for the such an additional parameter.

Page 15, Fig. 8: The country codes are barely visible. When printed in black-white, the figure does not show clusters, only (unclustered) points. I would use additional means to distinguish clusters (besides colors).

Page 18, subsection 6.1.2: from what I have understood the authors compare the running times of their method with the running times of DBSImJoin that was run on different hardware. It can make sense when comparing implementations rather than methods, but testing on the same hardware would be better. However, I can understand the challenges concerning reproducibility. But why then, are the results copied from [24]?

Page 23, Proofs:
To make the paper more comprehensive for a broader set of readers, I recommend the following:
1) To provide some examples illustrating the proofs.
2) Discuss why such algebraic properties as commutative, distributive, etc., are essential and for what purposes. Please explain here or in Section 4.2 (Semantics).

*** Further remarks ***

Check the pronunciations of all names of the authors as there are typos.
Perhaps number equations.
Page 11, Fig. 3: The example might be received as provocative. I would avoid such examples in already tense times. But I leave the decision to keep the example for the authors.