Identifying, Querying, and Relating Large Heterogeneous RDF Sources

Tracking #: 2457-3671

Andre Valdestilhas
Tommaso Soru
Muhammad Saleem
Edgard Marx
Wouter Beek
Claus Stadler
Bernardo Pereira Nunes
Konrad Höffner
Thomas Riechert

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper
Although we have witnessed the growing adoption of Linked Open Data principles for publishing data on the Web, connecting data to third parties remains a difficult and time-consuming task. One question that often arises during the publication process is: "Is there any data set available on the Web we can connect with?" This simple question unfolds into a set of others that hinder data publishers from connecting to other data sources. For instance, if there are related data sets, where are they? How many are there? Do they share concepts and properties? How similar are they? Are there any duplicated data sets? How can a huge number of heterogeneous datasets be identified and queried? To answer these questions, this paper introduces: (i) a new class of data repositories; (ii) a method to identify datasets containing a given URI; (iii) a query engine and source-selection method for a large RDF dataset collection; (iv) a novel method to detect and store data set similarities, including the detection of duplicated data sets and data set chunks; (v) an index to store data set relatedness; and (vi) a search engine to find related data sets. To create the index, we harvested more than 668k data sets from LOD Stats and LOD Laundromat, along with 559 active SPARQL endpoints, corresponding to 221.7 billion triples or 5 terabytes of data. Our evaluation on state-of-the-art real data shows that more than 90% of the data sets in the LOD Laundromat do not use owl:equivalentProperty or owl:equivalentClass to relate their data to one another, which reaffirms and emphasizes the importance of our work.
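Contribution (ii), identifying which datasets contain a given URI, amounts to an inverted index from URIs to dataset identifiers. A minimal Python sketch of that idea (names and data are illustrative, not the paper's actual code or API):

```python
# Minimal sketch of an inverted index mapping URIs to the datasets
# that mention them. Illustrative only; not the paper's actual code.
from collections import defaultdict

def build_uri_index(datasets):
    """datasets: dict of dataset_id -> iterable of (s, p, o) triples."""
    index = defaultdict(set)
    for ds_id, triples in datasets.items():
        for s, p, o in triples:
            for term in (s, p, o):
                # Index only URI terms, not literals.
                if isinstance(term, str) and term.startswith("http"):
                    index[term].add(ds_id)
    return index

# Hypothetical sample data.
datasets = {
    "dbpedia-sample": [("http://dbpedia.org/resource/Leipzig",
                        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
                        "http://dbpedia.org/ontology/City")],
    "lgd-sample": [("http://linkedgeodata.org/triplify/node1",
                    "http://www.w3.org/2002/07/owl#sameAs",
                    "http://dbpedia.org/resource/Leipzig")],
}
index = build_uri_index(datasets)
print(sorted(index["http://dbpedia.org/resource/Leipzig"]))
# → ['dbpedia-sample', 'lgd-sample']
```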
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 11/May/2020
Major Revision
Review Comment:

The paper presents a framework that aims at guiding a user from the URIs/queries they have as input to a collection of relevant datasets and a mechanism to query them. The work builds on (at least) two existing publications and adds the dataset similarity index (ReLOD) to the picture. Each step is evaluated, and code and tools are available online.

Overall evaluation: I find the results presented in the paper valuable, but the presentation flaws lead me to ask for a major revision. Details follow below in my comments on the motivation of the research, the presentation, and smaller, more detailed points.


I come from a practical side of things: my current team has several commercial products in a natural science domain, running on top of triple stores, and we regularly face questions like "how can I find a dataset similar to ...? also talking about ...?" for bio, chemical, medical and bibliographic data. And I never even hoped to get some sort of automated answer to the question "Which of the data sets contains the most valuable results?" you pose in Scenario 1 in the introduction.

Therefore, I am excited by the direction this paper is taking, although I could not find many answers while playing with the tools linked in the paper. In the cases where I found related datasets, it was difficult to make any use of them, as the only information I got was a link to a file with a cryptic name (e.g. b63446ad7d9f8960762c50a5a3492120.hdt), and I did not understand where to get any quantification of the relatedness. Just saying "these two datasets are related" is still useful, but surfacing the explanation would help a lot. Perhaps the short time I spent playing with the tool is to blame, along with low coverage for some of the Linked Data types.

The notion of dataset similarity seems a useful construct, but, in reality, URIs are not that often shared among actually similar datasets describing the same things. I would be happy to see some use cases that demonstrate the usefulness of the chosen approach to defining dataset similarity, as well as usability studies or any applications of the approach (which is not meant to be purely theoretical, right?).

Presentation and structure:

This is the weak side of this paper. It is a very bumpy read :) Especially the introduction, but all other sections also desperately need proofreading. There are too many unnecessary commas and "which"s (the beloved construction being "in which", whose meaning often remains unclear), missing verbs, pronouns and prepositions, and some expressions that are not really English (e.g. "the first" instead of "firstly"). Some sentences give the impression of a literal translation from another language. So please proofread the paper; at present, the otherwise interesting results are very difficult to comprehend.

Related work is thorough, though in parts lengthy and not very concrete (e.g. page 7, column 1).

The role of the formalization and the unproved theorem in Section 3 remains unclear. Some parts (like the file structure) would be more appropriate in the tool documentation.

The evaluation section contains pages copied from the two other papers this work is based on, [10] and [11], and is lengthy, especially for WimuQ. I do not mind overlapping papers at all, but frankly, I do not see why it is done in this specific case. The big and very interesting contribution of this paper is ReLOD. Perhaps it would make more sense just to summarize the previously published research and explain the tables and graphs in 4.4 a bit more, maybe with examples and links to use cases. More details to help understand the tables and figures (e.g. Table 7) would also be beneficial.


Smaller comments:

p1 l38 "Those datasets represent now the well known as Web of Data, which represents..." - lost in English

p2 l7: "we created called "Where is my URI?" - created what? or remove "called"

p2 l14 "we show how to integrate and querying LOD datasets" - grammar

p3 l7 "ReLOD, which is the extension..." - main sentence not found
p3 l16 "Concerning the LOD-cloud dataset..." - same

p4 l12 "In this section we will present the state of the art relate to Identfying datasets" - grammar

p6 l44 "The approach was build to cluster entities, not LOD datasets, which we cannot use the same concepts here" - what does it mean?

p7 l9 " described on the paper[64]. In which discuss the identity crisis" - here and in many places before "which" is used where it shouldn't

p9 l43 "index of LOD datasets, in which involves" - no "in"

p16 l30 "While Ch queries often 30 require higher number of distributed datasets in order to compute the final resultset of the queries." - not a sentence, what's the meaning?

Tables 4-5: why are some datasets NOT similar even though one is contained in another (e.g. agrinepaldata and vivosearch)? DB sizes in Table 3 could be useful.

Table 6: the number of exact/similar properties is extremely low; why is that?

p18 l6 "Thus, an example of application could with a case when..." - something wrong with this sentence

p18 l16 " that one dataset can enrich each other with complementing information from another dataset." - again, grammar

p20 l44 "Where DsPropMatch on the Table 7 refers to the number of properties/classes the datasets share among each other." - grammar, why "where"? what do you want to say here? In Table, not on.

p20 l48 "the quality of the datasets should be considerate a important phase. " - you mean "should be considered"? in which sense can "quality" be "a phase" - quality evaluation?..

Why is Table 1 first referenced only on page 20, after Tables 2 to 7? What's the logic?

Table 16: how did you do this comparison, how was the gold standard created, etc.? A very interesting part (as it evaluates the novel results in this paper), explained in a cryptic way.

page 22, the paragraph starting at line 48: you already said this on page 14.

Review #2
By Anastasia Dimou submitted on 24/Aug/2020
Major Revision
Review Comment:

This paper presents a method to create and search an index. The paper relies heavily on already published work and in principle brings together existing work while adding some work on indexing.

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions, which include (1) originality, (2) significance of the results, and (3) quality of writing. As far as (1) is concerned, I do not think that the paper in its current state presents a ground-breaking and exceptionally innovative solution; the indexing approach that is proposed is rather straightforward. I include some more detailed comments/remarks below, which I hope will help the authors to shed some light on the innovative aspects of this paper. Besides my concerns related to the paper's originality, the evaluation method does not seem adequate to assess the proposed indexing methodology which, in fact, is the actual contribution of the paper. Thus, it is hard to assess its significance (2). Last but not least, as far as (3) is concerned, I think that there is a lot of room for improvement when it comes to the quality of writing. The paper suffers from repetition and many incomprehensible sentences, let alone grammar and syntax mistakes. Thus, the paper would benefit from a thorough proofreading.

The paper focuses on the use of owl:equivalentProperty or owl:equivalentClass for equivalence, but I do not think that this is the only way of doing so. There are other ways to show equivalence, which I would have expected to read about in the related work section, but this is not the case.

Most importantly though, the use of owl:equivalentProperty or owl:equivalentClass to relate data does not reaffirm and emphasise the importance of this work. I think that there is a more fundamental problem that is ignored, and the proposed work itself seems to ignore the role of semantics. The proposed work might solve the problem, but I am wondering what the role of semantics is after all. Indexing datasets and identifying similarities is a well-known problem in other communities, and I do not see how the proposed work harmonizes regular indexing with the merits of the Semantic Web.
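To make the point about equivalence predicates concrete: equivalence between RDF terms can be asserted with several predicates beyond owl:equivalentProperty and owl:equivalentClass, such as owl:sameAs and skos:exactMatch. A hypothetical sketch that scans a triple stream for any of them (the predicate list and data are illustrative):

```python
# Sketch: counting explicit equivalence links in a stream of triples.
# Illustrative only; owl:sameAs and skos:exactMatch are examples of
# alternative equivalence predicates a dataset may use.
EQUIVALENCE_PREDICATES = {
    "http://www.w3.org/2002/07/owl#equivalentProperty",
    "http://www.w3.org/2002/07/owl#equivalentClass",
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://www.w3.org/2004/02/skos/core#exactMatch",
}

def equivalence_links(triples):
    """Return the triples whose predicate asserts equivalence."""
    return [(s, p, o) for s, p, o in triples if p in EQUIVALENCE_PREDICATES]

# Hypothetical sample data: one equivalence link, one ordinary triple.
triples = [
    ("http://example.org/a", "http://www.w3.org/2002/07/owl#sameAs",
     "http://example.org/b"),
    ("http://example.org/a", "http://example.org/p", "x"),
]
print(len(equivalence_links(triples)))  # → 1
```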

It is mentioned in the paper that the four rules of Linked Open Data are followed, but what is the purpose of using HTTP URIs if the URIs are indexed in the end? One does not follow the HTTP URIs to discover linked datasets but rather an index, which could achieve the same even without URIs, especially if they are not dereferenceable. And what is the benefit of supporting non-dereferenceable URIs if they are eliminated later on when the proposed methodology is explained?

Overall, the paper suffers from repetition. The same statements are repeated all over, and most importantly, the paper repeats parts of the authors' previous works. This is evident from the fact that the new contributions, as mentioned in the paper, are only in subsections 2.3 (which is a related work subsection), 3.3 (which is literally only 5 lines, so I assume this was a sub-sectioning mistake and 3.3 - 3.5 is meant?) and 4.4. This way, the paper becomes unnecessarily lengthy while the actual new contributions are very limited.

The related work is fragmented, as it only mentions a couple of systems for, e.g., identifying URIs, without making clear how these tools were chosen. It is not clearly shown which directions are so far prevalent in identifying datasets.
In the case of querying, a few approaches are outlined, but it is not clear how they were chosen and what they represent with respect to the state of the art, let alone that half of them are over a decade old and were presented at workshops. In the meantime, other and more diverse approaches for federated querying have been proposed.

Among the papers that are discussed, we observe that the oldest ones consider indexing, whereas the more recent ones are either index-free or combine an index with ASK queries. Thus, reading the related work, it is not evident what the advantages and disadvantages of each approach are and why the authors chose an index-only approach.

The state of the art focuses on federated querying, which is understandable, but the text does not explain why federated systems are the most relevant as opposed to any querying system. It is also not clear what purpose the distinction between query federation and link-traversal federation serves with respect to the proposed approach. Despite the descriptions of the different systems, in the end it is concluded that WimuQ distinguishes which SPARQL endpoints are relevant to be queried. Why would that be better?

The related work contains wrong statements. For instance, it is mentioned that the difference between DataTank and WIMU is that WIMU provides a RESTful service instead of a setup to configure a Linked Data repository. However, if one visits the DataTank website, the first thing one reads is "Transform your datasets into a RESTful API.".

The workflow for the querying process that is described in section 3.2 is trivial and has nothing innovative to show.

In the deduplication part, it is mentioned that a blocking method is chosen, but no alternatives are mentioned in the related work, so it is not clear why this algorithm was chosen. Similarly, algorithms for string matching are not presented in the related work section, so it is not clear how adequate these algorithms are. In the end, neither the impact of the blocking method nor that of the string-matching algorithms was assessed.
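As a concrete illustration of the two techniques at issue here, blocking groups items by a cheap key so that only pairs within the same block undergo the expensive string-similarity check. This is a minimal sketch assuming a crude prefix-based blocking key and Python's difflib as the matcher; neither choice is necessarily what the paper uses:

```python
# Sketch of blocking + string matching for deduplication.
# The prefix key and difflib matcher are illustrative assumptions,
# not the algorithms chosen in the paper under review.
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def block_key(label):
    # Crude blocking key: the first three lowercase characters.
    return label.lower()[:3]

def similar_pairs(labels, threshold=0.8):
    # Group labels into blocks, then compare only within each block.
    blocks = defaultdict(list)
    for label in labels:
        blocks[block_key(label)].append(label)
    pairs = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((a, b))
    return pairs

print(similar_pairs(["dbpedia-en", "dbpedia_en", "geonames"]))
# → [('dbpedia-en', 'dbpedia_en')]
```

The point the review makes is that both the key function and the matcher materially affect which duplicates are found, so their choice needs justification and assessment.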

The file structure and querying-the-index subsections read more like a tool's manual. I do not think they add to the scientific content of the paper.

My main concern related to the evaluation is how it validates the contribution. While the paper points to the indexing method as its contribution, I think that in the end the whole "setup" is evaluated. Thus, there is a mismatch between what the contributions are, where the focus of the paper is, and what is evaluated. At some point, the evaluation sets as its goal to show that combining different SPARQL query processing approaches retrieves more complete results than the individual approaches. How does this validate the contribution?

From the text it is not clear why these three benchmarks were chosen over other benchmarks, but most importantly it is not clear what purpose they serve. The results show that WimuQ provides more results than a plain SPARQL endpoint federation engine. However, are more results better than fewer results? And how accurate are these results compared to other approaches? An evaluation based on precision/recall/F-measure would better answer the aforementioned questions and might be a more adequate choice for such a contribution.
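The precision/recall/F-measure evaluation suggested above is straightforward to compute given a gold-standard result set; a minimal sketch with illustrative data:

```python
# Sketch of precision/recall/F1 against a gold standard.
# The result identifiers are illustrative, not from the paper.
def prf(gold, retrieved):
    gold, retrieved = set(gold), set(retrieved)
    tp = len(gold & retrieved)                      # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: system returns 3 results, 2 of them correct,
# out of 4 results in the gold standard.
scores = prf({"r1", "r2", "r3", "r4"}, {"r1", "r2", "r5"})
print(tuple(round(x, 3) for x in scores))  # → (0.667, 0.5, 0.571)
```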

There are a lot of sentences that are badly shaped, e.g., “Concerning the LOD-cloud dataset, which currently contains 1239 datasets with 16 147 links (as of March 2019)[16].”.
Other sentences miss punctuation, e.g., “To deal with all needs presented before we create an integrated approach to identify, relate and query a massive amount of heterogeneous RDF sources, in which join the two previous works[10, 11] With this new approach ….”.
There are references that are not correct, e.g., “S. Harris, N. Gibbins and T.R. Payne, SemIndex: Preliminary Results from Semantic Web Indexing (2004).”.
There are multiple references for almost the same content, e.g [44], [45] and [46], even to non-peer reviewed content.
There are grammar and syntax mistakes, e.g., “both produce reasonably query runtime performances comparing to state-of the-art approaches.” and “FedX can only works with public SPARQL endpoints.”
Even multiple languages are combined in certain sentences “Contains de dataset matched containing the property”.
The situation worsens towards the end, in particular in the evaluation section, where many sentences are hard to read.
I would suggest proceeding with a very thorough proofreading of the text.

Based on the aforementioned comments, I do not think that the paper could be published in its current state. I would thus opt for a major revision of the content and evaluation.