An Authority-flow based Ranking Approach to Discover Potential Novel Associations Between Linked Data

Paper Title: 
An Authority-flow based Ranking Approach to Discover Potential Novel Associations Between Linked Data
Maria-Esther Vidal, Jean-Carlo Rivera, Luis-Daniel Ibanez, Louiqa Raschid, Hector Rodriguez, Edna Ruckhaus
Under the umbrella of the Semantic Web, Linking Open Data projects have the potential to discover links between datasets and make available a large number of semantically inter-connected data. Particularly, Health Care and Life Sciences have taken advantage of this research area, and publicly hyper-connected data about disorders and disease genes, drugs and clinical trials, are accessible on the Cloud of Linked Data. In addition, existing health care domain ontologies are usually comprised of large sets of facts, which have been used to annotate scientific data. For instance, annotations of controlled vocabularies such as MeSH or UMLS, describe the topics treated in PubMed publications, and these annotations have been successfully used to discover associations between drugs and diseases in the context of the Literature-Based Discovery area. However, given the size of the linked datasets, scientists have to spend uncountable periods of time, traversing links before making a discovery. This paper describes an authority-flow based ranking technique that is able to assign high scores to terms that correspond to potential novel discoveries, and to efficiently identify these highly scored terms. Two sampling techniques are proposed to only traverse the terms that may have high scores; terms, links and scores are modeled as a Bayesian network. The first technique implements a Direct Sampling reasoning algorithm to approximate the ranking scores of nodes in the Bayesian network and visits only the nodes with the highest probability. The second technique samples paths in the Bayesian network with the highest conditional probability. An experimental study reveals that the proposed ranking techniques are able to reproduce state-of-the-art discoveries; additionally, the sampling-based approaches are able to reduce the exact solution execution time by up to one order of magnitude and approximate human validated rankings with a precision of up to 80%.
Full PDF Version: 
Submission type: 
Full Paper
Responsible editor: 

Revision accepted by responsible editor.

The reviews below are from the previous version of the manuscript.

Review 1 by Thomas Scharrenbach

Accept with Major revision

The paper presents a technique for computing ranks of links in a layered Discovery Graph (lgDG), as well as two methods performing simple random sampling in this lgDG. One sampling methods explores the lgDG in a breadth-fist-search (BFS) manner whereas the second does in a depth-first-search (DFS) manner.

The paper is well motivated and technically sound. Its contributions are clearly stated but need to be distinguished from related approaches in more detail. The experimental results need revision as well as a more detailed discussion. The hypotheses need to be stressed out more clearly. Currently, the results as well as the limitations and future work are too briefly discussed and the conclusions presented in an over-optimistic way. In summary, the paper needs to undergo some major revisions.

The introduction lacks references to the databases (PubMed, BIOSIS), the terms RDF, OWL, and SPARQL – where it remains unclear why SPARQL is mentioned at all. The term annotations should be made clearer in its relation to ontologies, in particular how OWL 2 and RDF relate in the context of this paper. Furthermore, it was a good idea to include the reference to the Linked Data paper (Bizer et al 2009) in the introduction.

The claimed contributions need to be more clearly distinguished from other work.
Your approach for sampling Bayesian Networks seems to be the same as those defined by Henrion (1988). Please relate your work to this approach for random sampling Bayesian networks.
You state that your ranking approach exploits “semantic annotations and the topology of the graph induced by the data links” rather than “paths between data terms” (p. 4). Please clarify what you mean by “semantic annotations” and how your approach benefits from these whereas, for example, PageRank and HITS do not. Please also clarify why your approach benefits from the topology of the graph and the above approaches do not.

The methods section of the paper is technically sound. Some definitions could be given in a clearer way.
The motivation for choosing ranking-flow based solutions (p. 5) is not clear.
Are lgDGs allowed to be infinite? If no, please clarify this.
Are you really referencing to a metric on p. 6 for alpha? Isn’t this supposed to be a distance, rather than a metric?
Provide the number of paths to sample in terms of N (p. 9).
Try to avoid the double usage of the letter p for publications on the one hand and for probabilities, on the other hand.
It would be interesting to see how your approach would benefit from using ontologies, actually. Currently it seems that you are using RDF “just” as a data format without exploiting the corresponding terminologies.

The experiments section needs revision.
Please formulate the hypotheses for your experiments in a clearer way at the beginning of the section.
Please separate the presentation of your findings and their discussion.
Where do the formula and the value for the augmented document frequency come from (p. 10)? Please provide some motivation for this score, and the inverse term frequency, as well as the cosine normalization factor. The the last: how is this score computed?
Why do you use top-k Spearman’s rho metric?
Why are you describing the storage layout in the database in so much detail? If this was an important issue, then please stress out your reasons for that.
If you speak of the results of Srinivasan, then please provide a brief overview over what they looked like.
For the comparison, please mark the findings that you and Srinivasan’s approach had in common, for example in bold or italics (e.g. for Tables 3 and 6). This would clarify your findings. It would also be nice to get to know about how precision and recall behave amongst top-k.
Please motivate your choice for the experiment parameters like the number of paths (p. 12) or the value of k (p. 12).
How do Tables 8 and 9 correlate?
Providing also relative numbers in Table 10 would clarify the statements you make in the text.
The results from the second paragraph on p. 15 should be presented in a table or in a diagram.
Presenting the findings of Table 12 in a diagram would make the interpretation of those results easier.

The paper is lacking a detailed discussion of the findings. One of its main selling arguments, i.e. that a precision of up to 80% can be reached is misleading. In most of the experiments, only a much lower precision could be obtained. While the actual precision is less important, a discussion on the consequences would be very interesting, for example whether or not this is useful for the use case. The findings that I can see from this paper so far is: if we apply random sampling on Bayes networks for authority-flow, we decrease the runtime by X orders of magnitude while decreasing the precision by Y. Moreover, please provide a discussion on how the different domains influence your findings, in particular, why the result for DBLP were worse than for the life sciences example.

The conclusion is very short. It could address more of the findings from the experiments. Limitations and possible future work are only briefly mentioned and should be given more emphasis.

General remarks:
Please provide rounded numbers in the text. If you, for example, speak of 13,112,409,691 triples as on page 1, it would improve readability by changing this to 13 million triples.
When you speak of Srinivasan’s algorithm, it may be better to use an acronym instead.
The layout of the first sub-section of Section 6 (pp. 9-11) seems broken.
Make sure that all your tables are self-explanatory by providing the correct units and a more detailed caption. Do not use references in the tables themselves but only in the captions (e.g. Table 7).
Please use \left( and \right) for the layout of formulas. They are hard to read in the present layout.
Figures 7 and 8 are hard to read in black and white.
Try to use subfigures for the small small tables. Currently these waste a lot of space that you could use for a more detailed discussion of your findings.

Henrion, M. (1988). Propagation of Uncertainty by Probabilistic Logic Sampling in Bayes Networks. In J. F. Lemmer & L. N. Kanal (Eds.), Uncertainty in Artificial Intelligence 2 (pp. 149-163). New York, NY, USA: Elsevier Science Publishing Company, Inc.

Review 2 by anonymous reviewer


The paper can be accepted provided that the authors answer to the questions
posed in the review below (incl. some minor issues)

Comments to Authors

The paper presents an interesting approach to discovering
novel associations over linked data.
Given a probabilistic setting, the link method can be assisted by two
sampling techniques based on breadth-first or depth-first propagation.
An experimentation proves the effectiveness of the method
w.r.t. similar other methods on two domains.

Are novel associations relating just single resources.
How can they be generalized.
You seem to have considered only associations that are
already mentioned in the reference vocabularies/ontologies.
But novelty may be triggered also by new associations arising.
Have you considered the case of drift of know associations
as opposed to the detection of novel ones?

Section 3 presents a motivating example.
The example should better clarify which kind of semantics you are really referring to.
Can your representation cope with lexical semantics only?
Is there any form of reasoning required ?
What are the implications of an underlying
Open World Assumption made on the representation of the ontologies?
If the specificity of the tipical Sem-Web reprentations
is not considered, then what is the value added w.r.t. other
statistical relational learning methods ?

How does your work relate to further similar works in the area,
e.g. by Tresp et al. ?

The sampling methods don't seem that original.
Correct ?
Being based on ontology annotations
your methods seem to depend on their semantics.
Can you better explicitate when and how this is taken care of ?

In general, applying the method in a SemWeb applicative framework,
how is to be measured the novelty of the associations ?
Where is the ground truth to be taken from ?

The paper presents an extensive experimentation
which appears to be the real strong point of this work.
Nevertheless a few questions arise:

- What is the impact of the relative simplicity of the reference DB schemas ?
- Would the methods be still feasible when more complex ontologies were considered ?


Overall, the paper is well written and organized.

Section 2 may be broken in subsections
as the first part accounts for related LOD initiatives
while the second is devoted to ranking methods.

Some references are really succint:
they might be extended to include more details.


Minor issues

p.2 col.1:
"MeSH(Medical" --> "MeSH (Medical"

p.6 col.1:
in the ranking vector R equation
why M^{k-1} should be equal to the iterated product \prod_l M^l
whose last term is M^{k-1} itself ?

p.6, col.1:
\noindent where, M is a transition ...

p.6 col.1:
the \alpha(u, v) is not properly introduced _before_ its usage

p.6 col.2:
The Problem must be stated more formally in a single paragraph.

p.8 col.1:
Theorem 1 - The statement is not clearly
separated from its discussion (or proof).
Consider using a roman/italicized font and a single paragraph.

p.9 col.1:
Theorem 2 same as Theorem 1

p.10 col.1:
In the formula for augmented document frequency:
A should depend on publ. p (in the notation)?
the right parenthesis should be made bigger, \right) ?

Review 3 by Ross King

I recommend acceptance.

This manuscript propose a novel knowledge discovery method based on
identifying semantic associations. The core idea is an authority-flow
based ranking technique. Two sampling techniques (graph sampling, path
sampling) are proposed to increase the speed of the approach.

An experimental study demonstrates that the proposed method is able are
able to reproduce exiting ‘discoveries’.

The manuscript’s literature review is extensive

The complex methodologies used are well described.

A relational database is used in the implementation, rather than native
semantic web technology, this conflicts somewhat from the argued

p11 Table 2. It seems to me that ‘obesity’ is ontological distinct from
the other terms which are anatomical. If I were a user of the system I
would not like this semantic clash.

There is no Discussion section. Nor is there any description of ‘Further
work’, even though there is a section ‘Conclusions and Further Work’!

I could find no technical errors.

The weakest part of the manuscript is the Results section. The authors
justify their system by a series of application runs. I am not fully
convinced that the results they compare their system with are that solid
to start with.

What I would like to see to be fully convinced would be a truly novel
discovery backed up by new experimental results.

The English is good.

Minor points

There is no need to describe Spearman’s rank correlation.

p12 There is something strange with the printing of column two.