Editorial Board

Editor-in-Chief
Cogan Shimizu
Eva Blomqvist

Editorial Board
Mehwish Alam
Claudia d’Amato
Stefano Borgo
Boyan Brodaric
Philipp Cimiano
Michael Cochez
Oscar Corcho
Bernardo Cuenca-Grau
Elena Demidova
Jerome Euzenat
Mark Gahegan
Aldo Gangemi
Dagmar Gromann
Armin Haller
Pascal Hitzler
Aidan Hogan
Katja Hose
Eero Hyvönen
Sabrina Kirrane
Agnieszka Lawrynowicz
Freddy Lecue
Maria Maleshkova
Raghava Mutharaju
Axel Polleres
Guilin Qi
Marta Sabou
Harald Sack
Angelo Salatino
Christoph Schlieder
Stefan Schlobach
Cogan Shimizu
Blerina Spahiu
GQ Zhang
Rui Zhu

Former/Founding Editors-in-Chief
Krzysztof Janowicz
Pascal Hitzler

Editorial Assistants
Michael McCain

Syndicate

Similarity Joins and Clustering for SPARQL

Submitted by Sebastián Ferrada on 02/28/2023 - 02:24

Tracking #: 3392-4606

A new version of this paper is available

Authors:

Sebastián Ferrada

Benjamin Bustos

Aidan Hogan

Responsible editor:

Marta Sabou

Submission type:

Full Paper

Abstract:

The SPARQL standard provides operators to retrieve exact matches on data, such as graph patterns, filters and grouping. This work proposes and evaluates two new algebraic operators for SPARQL 1.1 that return similarity-based results instead of exact results. First, a similarity join operator is presented, which brings together similar mappings from two sets of solution mappings. Second, a clustering solution modifier is introduced, which instead of grouping solution mappings according to exact values, brings them together by using similarity criteria. For both cases, a variety of algorithms are proposed and analysed, and use-case queries that showcase the relevance and usefulness of the novel operators are presented. For similarity joins, experimental results are provided by comparing different physical operators over a set of real world queries, as well as comparing our implementation to the closest work found in the literature, DBSimJoin, a PostgreSQL extension that supports similarity joins. For clustering, synthetic queries are designed in order to measure the performance of the different algorithms implemented.

Full PDF Version:

swj3392.pdf

Revised Version:

Tags:

Reviewed

Long-term Stable Link to Resources:

https://github.com/scferrada/jena

Decision/Status:

Major Revision

Solicited Reviews:

Click to Expand/Collapse

Review #1

Anonymous submitted on 07/Apr/2023

Suggestion:
Minor Revision

Review Comment:

The work describes an interesting proposal to extend the SPARQL query language with operators that are able to compute N-to-M similarity joins using distance metrics over multiple dimensions as well as a grouping operator based on clustering.

The work is an extension of a previous published work.

The proposal is well motivated and adds interesting functionalities, the implementation is also provided showing that is actually feasible to implement the proposed methods within a real system.

Overall, the paper is well written. It could improve if the two proposed operators and queries, with use cases where presented earlier in the introduction, this would make much easier to follow the exposition in the paper.

Review #2

By Agnieszka Lawrynowicz submitted on 16/Apr/2023

Suggestion:
Major Revision

Review Comment:

The paper is engaging and well-written.
It provides a thorough theoretical analysis, implementation, and experiments, making the related artefacts publicly available.
Moreover, the described research results may be useful in practical systems in various applications.

I have some remarks, though, mainly dealing with the following:
1) novelty concerning the paper published at ISWC (reference [24]),
2) concerning the last release of the code,
3) for making the paper more understandable for the broader audience.

*** Novelty ***
Please provide a discussion with a clear distinction of the novelty of the current paper with respect to [24].
For instance, Table 3 is a copy of Table 1 from [24], etc.

*** Source code ***
Page 14: Evaluation
The code is from 3-4 years ago. Is that correct?

*** Other remarks ***

Page 6, line 5: Also grid-based, distribution-based, hierarchical clustering.

Page 7, line 9: Range-based semantics at this point needs to be clarified.

Page 8, Definition 4.7: I recommend adding information on the meaning of the r symbol to make the definition more self-contained and understandable.

Page 8, Definition 4.8: What is r'? (It is only explained on the next page) What is k? Is k related to k-nn?
The last line on page 8 needs to be clarified. What does it mean that k' may equal infinity?
Why does one need the term r' referring to the distance required to return at least k nearest neighbours? In the original k-nn formulation, there is no need for the such an additional parameter.

Page 15, Fig. 8: The country codes are barely visible. When printed in black-white, the figure does not show clusters, only (unclustered) points. I would use additional means to distinguish clusters (besides colors).

Page 18, subsection 6.1.2: from what I have understood the authors compare the running times of their method with the running times of DBSImJoin that was run on different hardware. It can make sense when comparing implementations rather than methods, but testing on the same hardware would be better. However, I can understand the challenges concerning reproducibility. But why then, are the results copied from [24]?

Page 23, Proofs:
To make the paper more comprehensive for a broader set of readers, I recommend the following:
1) To provide some examples illustrating the proofs.
2) Discuss why such algebraic properties as commutative, distributive, etc., are essential and for what purposes. Please explain here or in Section 4.2 (Semantics).

*** Further remarks ***

Check the pronunciations of all names of the authors as there are typos.
Perhaps number equations.
Page 11, Fig. 3: The example might be received as provocative. I would avoid such examples in already tense times. But I leave the decision to keep the example for the authors.

Log in or register to post comments
2503 reads

Main menu

Editorial Board

Syndicate

Similarity Joins and Clustering for SPARQL

Tracking #: 3392-4606

Reviewed Articles

Authors & Reviewers

Links

Recent blog posts

Accepted Articles

Search form

Main menu

Login

Editorial Board

Syndicate

Similarity Joins and Clustering for SPARQL

Tracking #: 3392-4606

Reviewed Articles

Authors & Reviewers

Links

Recent blog posts

Accepted Articles