Discovery of Related Semantic Datasets based on Frequent Subgraph Mining and String Matching Techniques

Tracking #: 1177-2389

Mikel Emaldi Manrique
Oscar Corcho
Diego López-de-Ipiña

Responsible editor: 
Claudia d'Amato

Submission type: 
Full Paper
We describe an approach to find similarities between RDF datasets, which may be applicable to tasks such as link discovery, dataset summarization or dataset understanding. Our approach builds on the assumption that similar datasets should have a similar structure and include semantically similar resources and relationships. It is based on Frequent Subgraph Mining (FSM) techniques, used to synthesize the datasets and find similarities among them. In addition, string matching techniques are used to improve the obtained results. The results of this work can be applied to ease the tasks of data interlinking and data reuse in the Semantic Web.
Solicited Reviews:
Review #1
Anonymous submitted on 23/Oct/2015
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

Review #2
Anonymous submitted on 03/Nov/2015
Review Comment:

This manuscript, submitted as a 'full paper', presents a method for discovering links between datasets in the context of the LOD cloud.

The method reuses and builds upon an existing sub-graph discovery method and a number of string matching measures.

(1) originality

The related work section may be considered satisfactory.
However, it suggests that natural candidates for a comparative experiment are available.
Hence the proposed experimentation should aim at showing the supposed superiority of the proposed method over the state of the art. Alternatively, the authors should argue why existing OM methods cannot be employed to solve the same problem.

The paper seems a combination of an existing FSM method and proposed string distances.

Concerning the method, it should be clarified if the difference of the various kinds of relations in either graphs has been taken into account.

(2) significance of the results

The contribution seems somewhat limited:
"some steps forward into some of the aforementioned limitations".

One of the claims is that the growth in the number of available linked datasets will trigger the need for methods for interlinking them automatically.

This objective would need a study of the complexity of the task with respect to the number and size of the involved datasets.

Overall there seems to be a strong connection with the aims pursued with Ontology Matching (OM) techniques.
One would expect a discussion of the differences and when applicable a comparison against techniques that may provide the same kind of result.
It is not clear whether the proposed method can be useful only for ultimately discovering links, as matches between classes on pairs of different datasets, or whether it can do something more
(current OM systems can actually find more informative mappings between classes).

Also the extent of the contribution coming from structural information on either graph is to be discussed more extensively.
Conversely it may seem that this information is only involved in a process of summarization aimed at reducing the overall complexity of the task.

As regards the summarization of the RDF graphs, it is probably worthwhile to discuss the consequent loss of semantics owing to the proposed transformation.
The considerations currently contained in the paper are quite vague.
What is the impact of replacing individual URIs with those of their types (classes)?
This rules out owl:sameAs links.
Are you targeting only rdfs:subClassOf, or also owl:equivalentClass?
What about links between properties ?
Aren't these tasks included in the standard tasks for OM systems?

In my experience, indirect relations (links) may sometimes emerge from existing/discovered direct ones. Have you investigated this possibility?

From the informal presentation of the method it appears that it may waste time rediscovering external links that may have been removed (see p. 7).
I wonder if that is plausible.
Couldn't such information be exploited and propagated to suggest new links?

Likewise, the choice concerning the removal of literals would deserve a less simplistic justification. For some datatypes one may think of generalizing with the use of sub-ranges (e.g. intervals for numeric types).
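To make this suggestion concrete, the idea would be to replace each numeric literal with a coarse interval label rather than dropping it, so that structurally similar graphs keep comparable abstract nodes. A minimal sketch, assuming hypothetical bin boundaries and label format (neither appears in the paper):

```python
from bisect import bisect_right

def interval_label(value, bounds=(0, 10, 100, 1000)):
    """Map a numeric literal to a sub-range label such as 'xsd:integer[10,100)'.

    The bin boundaries and the label syntax are illustrative assumptions,
    not part of the reviewed method.
    """
    i = bisect_right(bounds, value)
    if i == 0:
        return "xsd:integer(-inf,%d)" % bounds[0]
    if i == len(bounds):
        return "xsd:integer[%d,inf)" % bounds[-1]
    return "xsd:integer[%d,%d)" % (bounds[i - 1], bounds[i])

print(interval_label(42))   # falls in the [10,100) bin
print(interval_label(-5))   # below the lowest boundary
```

With such a generalization, two datasets that both describe, say, buildings with heights in the same range would still share an abstract edge, whereas removing the literals entirely discards that evidence.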

Whatever the structure of the synthesized graphs, the FSM problem is likely NP-complete. In this respect, the number of substructures that are considered (5?) is to be justified and, more importantly, their size.

Concerning the string similarity: are you comparing full URIs or just CURIEs ?

As the method is very related to OM techniques one would expect some sort of comparative evaluation.
The presented evaluation could be used as a preliminary tuning phase
for selecting the best parameters / similarity measures to be employed.
Together with the gold standard that was prepared, why has re-using OM testbeds not been considered?
This is worth a justification.
The complementation of the gold standard with human experts seems a bit weak
(also in consideration of the agreement level among the reviewers).
In the baseline (sect. 5.2) the usage of certain thresholds has to be justified.
How are they determined?
Is there a separate dataset used to tune these parameters?
The whole discussion of this point is a bit too vague.
Some hint of using ML techniques appears in the paper as future work.

Determining the best similarity measures to go with the presented methods is certainly an interesting result; however one would like to know whether these findings can be generalized to other related methods.

The intended usage of the proposed method is merely to find pairs of datasets to be further processed by OM systems.

It would be more interesting to see some of the developments, proposed in the final section, addressed.

(3) quality of writing

This is one of the weak points of this manuscript
which would surely be improved by a thorough revision of its writing.

There are a number of typos or misused forms that ought to be emended.
For your convenience some will be indicated below.

Overall, please avoid the use of citation numbers as subjects or objects of sentences.

page 2 column 1: "applying the brute force for applying"
repetition that may be avoided.

p.2, c.1: "Proposed solution can ..."
please reconsider these forms and the usage of the definite article "the" also in other parts of the paper.

p.2, c.1: "applying and applying" --> "repeatedly applying"

p.2, c.1: "At last" has a meaning of "After a long wait";
this can be probably replaced by "Finally".
This change can be repeated throughout the paper, e.g. at p. 9

p.2, c.2: simplify? "for allow-ing making queries" --> "for allowing queries"

p.3, c.1: please rewrite the beginning of section 2.2 (1st sentence).
The justification of the proposed method as compared with those referred to in sect. 2.2 appears a bit weak.

p.3, c.1: "Similar to" --> "Similarly to" ?

p.3, c.1: "An abstract graph is a graph that disposes the ontological classes...": could you please find a synonym for "dispose"?

p.3, c.2: "finding candidate datasets" --> "find candidate datasets"

p.3, c.2: "so it is no a suitable" --> "so it is not a suitable"

p.3, c.2: "cold-starting problem" --> "cold-start problem"

p.4, c.2: "Labeled edges:" please rewrite. "this solution...." ?

p.4, c.2: generally formal lang. textbooks use dot products for the concatenation rather than "+" used in prog. languages.

p.4, c.2: L* --> L^*
generally the closure of the alphabet is denoted with a superscripted asterisk

p.4, c.2: the string equality similarity \sigma seems to range over {0, 1}

p.5, c.1: [end] "Comm function computes common substrings" --> "The Comm function computes the number of common substrings" ?

p.5, c.2: in Eq. could you please clarify the range of values assumed by index i in \sum_i

p.5, c.2: "the method developed by Winkler [28] for improving the results obtained through this distance."
do you mean SMOA?

p.6, c.1: "Regarding to" --> "As regards" ?

p.6, c.1: could you please rewrite the sentence:
"Inspired by Ontosim 3, given two terms this method computes the distance among the synonyms of these terms, including the terms themselves."

p.7, c.1: "Regarding to" --> "As regards" ?

p.7, c.2: here and in the following, "vertices" can be used as the plural of "vertex"

p.7, c.2: "some lacks at computing" what does that mean? "failures", "weak points"?

p.8, c.2: "an unique" --> "a unique", also at p. 9

p.8, c.2: "at these step" --> "at this step" (also elsewhere)

p.8, c.2: "requested" --> "required"

p.8, c.2: "processes" --> "processed"

p.8, c.2: "Presented approach has been evaluated" --> "The presented approach has been evaluated"

p.9, c.1: "solve proposed problem" --> "solve the proposed problem"
please revise the whole paragraph

p.9, c.2: "these evidence" --> "this evidence"

p.10, c.1: please rephrase the first sentences of sect 5.2 (what do you mean by shared ontologies?)

p.13, c.2: "to determinate" --> "to determine"