Similarity Joins and Clustering for SPARQL

Tracking #: 3540-4754

Sebastián Ferrada
Benjamin Bustos
Aidan Hogan

Responsible editor: Marta Sabou

Submission type: Full Paper
The SPARQL standard provides operators to retrieve exact matches on data, such as graph patterns, filters and grouping. This work proposes and evaluates two new algebraic operators for SPARQL 1.1 that return similarity-based results instead of exact results. First, a similarity join operator is presented, which brings together similar mappings from two sets of solution mappings. Second, a clustering solution modifier is introduced, which, instead of grouping solution mappings according to exact values, brings them together using similarity criteria. In both cases, a variety of algorithms are proposed and analysed, and use-case queries that showcase the relevance and usefulness of the novel operators are presented. For similarity joins, experimental results are provided by comparing different physical operators over a set of real-world queries, as well as by comparing our implementation to the closest work found in the literature, DBSimJoin, a PostgreSQL extension that supports similarity joins. For clustering, synthetic queries are designed to measure the performance of the different algorithms implemented.

Solicited Reviews:
Review #1
Anonymous submitted on 11/Sep/2023
Review Comment:

The revised introduction sufficiently addresses my previous comments.

Review #2
By Agnieszka Lawrynowicz submitted on 12/Jan/2024
Review Comment:

I thank the Authors for the work they have done on the revised version of the paper.
I am satisfied with the revisions and the answers.

What the Authors write in their response concerning Section 4.2 (Semantics) might make an excellent addition to the discussion towards the end of that section: "Perhaps a key finding here is that properties that hold for (equi) joins in SPARQL do not necessarily hold for similarity joins, and our results show when this is the case, and why this is the case. This means that common optimisations applied in SPARQL engines (such as join reordering) cannot be applied “as is” for similarity joins, and thus that there are interesting open challenges on how to optimise such queries."

I am inclined to accept the paper.