A Scalable Approach for Statistical Learning in Semantic Graphs

Paper Title: 
A Scalable Approach for Statistical Learning in Semantic Graphs
Yi Huang, Volker Tresp, Maximilian Nickel, Achim Rettinger, and Hans-Peter Kriegel
Increasingly, data is published in form of semantic graphs. The most notable example is the Linked Open Data (LOD) initiative where an increasing number of data sources are published in the Semantic Web’s Ressource Description Framework and where the various data sources are linked to reference one another. In this paper we apply machine learning to semantic graph data and argue that scalability and robustness can be achieved via an urn-based statistical sampling scheme. We apply the urn model to the SUNS framework which is based on multivariate prediction. We argue that multivariate prediction approaches are most suitable for dealing with the resulting high-dimensional sparse data matrix. Within the statistical framework, the approach scales up to large domains and is able to deal with highly sparse relationship data. We summarize experimental results using a friend-of-a-friend data set and a data set derived from DBpedia. In more detail, we describe novel experiments on disease gene prioritization using LOD data sources. The experiments confirm the ease-of-use, the scalability and the good performance of the approach.
Full PDF Version: 
Submission type: 
Full Paper
Responsible editor: 

Review 1 by Paulo Cesar G. Costa

Recommendation: accept, authors must address the two pending issues below

After reading the new version of the paper and the authors’ reply to the reviews, I keep my original opinion about the main aspects of the text w.r.t. its soundness and level of detail. I’m still a bit concerned about the writing, but acknowledge the current version as a considerable improvement over the original.
I also noticed that two of my concerns were not addressed properly, so I’m reproducing them below in more detail. These are not major issues so they are not influencing my final grading, although I’m expecting them to be addressed in the final version.
Therefore, I’m changing my grading to accept (from weak accept), while maintaining a recommendation that the authors address the two minor issues reproduced below.

Remaining issues (minor):

Pg. 3, left:
- “As machine learning methods, decision trees and similarity based methods are mainly used.” -> This phrase is incomplete.

The authors claim that the phrase is complete. I can read it as:
“Decision trees and similarity based methods are mainly used as machine learning methods.” This indeed makes it a self-sustained, stand-alone sentence. However, the authors’ option for a subject-verb grammatical inversion is and sounds strange to say the least. Also, the way it is written this sentence implies that decision trees and similarity-based methods are mainly used as machine learning techniques, which couldn’t be further from the truth (a brief analysis on the origins, meaning, and current applications of both decision trees and similarity-based methods can support this assessment).
I believe the authors wanted to say:
“As for machine learning techniques, decision trees and similarity-based methods are widely used.” (or some variation that emphasizes the context – machine learning techniques – in which the assertion about DTs and similarity-based methods is being made)
In any case, this is really a minor issue and I’m ok if the authors insist in using their construction, although I do believe it makes the text much harder to parse and open to interpretations that are different from the one they intended to convey.

Pg. 4, right:
- “The problem can sometime be alleviated, i.e., by lifted inference, and by…” -> sometimes … ,e.g.,… lifted inference (reference?)…

The revised text still has no reference to lifted inference.

Review 2 by Joaquin Vanschoren

The second version on this paper is clearly better throughout: it's easier
to read, loose ends and tied up and the quality of the experiments is
improved. I welcome the addition of the Pearson Correlation Coefficient as
a (much more meaningful) baseline in several of the experiments. It is
very good that the data has been made publicly available.

I still feel like the comparisons could have gone in more depth,
in sections 6 and 7. In section 6, it is good to at least have a baseline
(although including more 'rival' algorithms would of course been even
better - see the point below). It is not clear, though, on which data PCC
has been run. On the original data? Or the disamb data? Why not compare
both techniques directly over all data?

The main limitation of this paper is still that there is no true
with Bayesian methods for graph mining (for which the authors claim to
offer an alternative approach) in sections 6 and 7. The authors state in
their reply that they did try them (at least MLNs) but that they achieved
results close to the random baseline on their data. This seems a rather
unexpected result to me. If this is truly so, then this is an interesting
outcome and should be mentioned in the comparisons in the paper.


p.1 in form of -> in the form of

p.3 the bases for -> the basis for (as mentioned in the previous review).
The word is used here in singular, while 'bases' is the plural of 'basis'.

p.7 shows the situation -> concerns/covers the situation

p.8 The figure plot -> The figure plots

p.8 including e.g. -> remove the 'e.g.'

p.9 The caption of Fig. 4 mentions a black frame, but there is none shown

p.16 predicting diagnosis -> predicting diagnoses (or predicting the
diagnosis). You are predicting multiple diagnoses (plural).

The reviews below are from the previous version of the manuscript.

Review 1 by Paulo Cesar G. Costa

This paper focuses on applying the SUNS model to Linked Life Data, and claims that this technique provides good results when data sources are sparse and incomplete. The authors also claim their approach as being naturally scalable, although the size of the experiments isn’t large enough to provide empirical evidence in support of this claim.
The argumentation is generally solid and discussed at a level of detail compatible with the venue. The experiments are replicable, well designed and explained, and do provide enough support to the conclusions.
The text is sometimes truncated and negatively impacts the flow of ideas, but not to the point of jeopardizing the understanding of the technical material. I found the writing of the different sections to be relatively inconsistent, both regarding the grammatical correctness as well as the fluency of ideas. In general sections 2 and 3 seemed to be less carefully written than the remaining sessions. The references must pass through a comprehensive and careful review.
In summary, the ideas and the technical achievements are interesting enough to ensure acceptance of the paper, but the text and references do need to be revised. Special attention should be taken to session 3 (see below), but I didn’t find any issue requiring a major revision. Thus, my recommendation is for a conditional acceptance pending minor revisions.

Typos, grammatical errors, and specific issues:

Pg. 1, left:
- “…Web’s Ressource Description Framework (RDF)…” -> Resource
Pg. 2, bottom left:
- “…relational representation but focussed on the…” -> focused

Pg. 2, middle right:
- “… is quite interesting and we will apply it in the following.” -> In the following what? This seems a non sequitur.

Pg. 3, left:
- “As machine learning methods, decision trees and similarity based methods are mainly used.” -> This phrase is incomplete.
- “In this paper we consider semantic domains that form directed labeled graphs where nodes stand for concepts such as…” -> This is confusing. Any domain can be used to form labeled graphs where nodes stand for concepts. This paragraph can be improved in its terminology.
- “… semantic graphs are RDF (…)” -> you have already spelled out RDF, no need to do it again here.

Pg. 3, figure 1: This graph is ontologically incorrect. Jack is not a type of person. Jack is a person (or an individual of class person if you prefer). Calling an instance of a class as a type of that class is wrong. Please, change the relationship to something such as isA or similar.

Pg. 4, right:
- “Statistical network models, which have been developed form this view points…”-> Did you mean that they “… form this viewpoint”? This doesn’t seem right. They may be good examples of Bayesian approaches that represent what you explain above, but they don’t “form a viewpoint”.
- “The problem can sometime be alleviated, i.e., by lifted inference, and by…” -> sometimes … ,e.g.,… lifted inference (reference?)…
Apart from the above suggestions, this whole paragraph is confusing and controversial. Learning in highly expressive Bayesian approaches such as the ones you mentioned is still a very active research topic and different techniques that explore structural aspects of graphs will have distinct impact depending on the technique and the representational framework you use. For instance, implementing learning using influence counts, lifted inference, and others in ML (undirected graph, assigns weights to FOL sentences) will have a different impact than in PRMs or MEBNs (directed graphs, assign probabilities directly), or in models using BLogs, Plates, etc. It is not immediately clear to me how different are these representational approaches from a frequentist model in terms of their ability to generalize knowledge. All in all, I recommend you to either significantly improve this discussion or to remove the comparison. It seems to be doing more harm than good.

Pg. 11, left:
- “An analysis is further complicates since environmental….” -> This is not a well formed sentence.

Pg. 12, right:
- “This number was reduced to less than 100000 after cutting the attributes of which are associated…” -> cutting the attributes that are associated…

A careful review is needed here. There are typos, inconsistent capitalizations, and other flaws that detract from a quality publication. I’m listing a few (by reference number):
4, 33: semantic web -> capitalize.
7: sparql -> capitalize.
12: Poceedings -> Proceedings
18: ilp … semantic web -> capitalize
31: dSAmato -> D’Amato
49: mendelian -> capitalize
50-60: capitalize the Journal’s title
Dbpedia -> DBpedia

Review 2 by Joaquin Vanschoren

The authors present their SUNS framework, a scalable, frequentist
approach to graph mining and its applications in mining semantic
graphs. In is a very interesting approach that looks like it is very
scalable through the use of sampling and low rank approximations. They
also go to great lengths to illustrate various applications of their
algorithm, showing that it is a good all-round approach to graph
mining. My only gripe with the paper is that while the SUNS approach
is tested in detail, the comparison with other techniques is quite
shallow, and for some examples, no comparison is made at all. I think
this paper could be substantially improved by including in-depth
comparisons with other techniques.

Mayor comments
* Section 3.1: it is stated that you will motivate why a frequentist
perspective is preferred over a Bayesian perspective. You don't do
this. Implicitly, you suggest that it will be more scalable (and I can
see why from later subsections), but this is nor stated explicitly,
nor motivated in detail. The paper also doesn't compare very much with
Bayesian methods. I can see that this approach will be quite scalable
(using sampling and matrix approximations), yet this is never
empirically demonstrated? How much faster is it, really, against other
methods, e.g. Bayesian ones?
* Section 4.3: you mention an interesting kernel for exploiting
ontological background knowledge. Is this useful for the cases
described here, and if so, why don't you compare your method with it?
* Section 5.1.3: you define a FOF baseline method, yet you never use
it in your experiments?
* Section 5.2: you compare your method against three other methods
based on matrix completion. This is very interesting, but what about
other techniques that can be used in this setting? There are many
other SRL methods that could be used?
* Section 6: Here, you show that your method works well on the DBPedia
example. Still, there is absolutely no comparison against other
methods, not even a baseline.
* Section 7.4.2: The comparison with ToppGene is interesting, but I am
somewhat concerned by the fact that you only used default parameter
settings for ToppGene, while the parameters of your method are likely
optimized? Table one also suggests a very basic comparison, i.e. no
repetitions and the resulting standard deviations? Would it also be
possible to run ToppGene on the data discussed in section 7.4.1?

Minor comments
* Section 3.2: you mention pitfalls related to statistical analysis in
networked domains: can you briefly state which ones and how you solve
them? Now, you just state that there are issues, but it is not clear
you solved them, or took them into account.
* Fig.3: Please explain why NNMF is not used in Fig. 3a. I assume this
is because the training data cannot be handled by NNMF, but please
state so explicitly.
* You invested a lot of effort in preparing the data in section 7.2,
yet include little comparison with other methods on this data. Is the
final data available somewhere so other researchers could use it to
compare your method against theirs?

p.1 enriched via -> enriched with
p.2 the bases for -> the basis for
p.3 easy to use -> easy-to-use
p.4 developed form this view points, are -> developed from this view
point are (3 changes)
p.4 considers all nodes in a network -> sentence is not correct
p.4 sometime -> sometimes
p.4 semantic graph requiring -> semantic graph, requiring (comma)
p.4 The issue if one -> The question whether
p.5 Equation 3 -> put between comma's or parentheses
p.6 to predict social status -> to predict the social status
p.11 An analysis is further complicates since -> Analysis is further
complicated because (3 changes)
p.11 could be extracted -> can be extracted
p.11 manually curated well maintained -> manually curated and well maintained

Review 3 by Lorand Dali

Summary and Main Contributions
The authors present a framework for statistical relational learning on RDF graphs.
The motivation is to predict missing links in semantic graphs (e.g. friends in social networks, infuences of genes on diseases, membership of politicians in paries)
The approach presented uses a kernel solution to matrix factorization in order to predict missing values (relations) in a matrix. To make the approach scalable, the authors use the Nystrom approximation.
The experiments are done predicting friendship relations on a FOAF dataset, predicting membership in a party for German politicans in DBpedia and predicting influences of genes to diseases in Life Linked Data.

Strong Points of the Paper
The organisation and opresentation of the paper are good, the language is correct and easy to follow
The contributions are well motivated by applications to social networks and discovery of gene-disease relationships.
Related work is covered in detail
The paper contains many experiments which illustrate the performance and usefulness of the presented approach. The presented approach is able to significantly outyperform SVD, LDA and RRPP for friend suggestion in a social network, and to outperform the state of the art method in gene-disease relation prediction.

Possible Improvements
- the authors should correct the misspelling of "Ressource Description Framework" to "Resource Description Framework"
- in the experiments with DBpedia some examples about what other relations (except political parties) could be predicted by the presented method should be given. I am sure that many more interesting relations can be accurately predicted and if the authors tried or thought about some experiments in this direction, it would be interesting for the reader to find out