Deep learning for noise-tolerant RDFS reasoning

Tracking #: 1866-3079

Authors: 
Bassem Makni
James Hendler

Responsible editor: 
Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper
Abstract: 
Since the introduction of the Semantic Web vision in 2001 as an extension to the Web, the main research focus in semantic reasoning was on the soundness and completeness of the reasoners. While these reasoners assume the veracity of the input data, the reality is that the Web of data is inherently noisy. Recent research work on semantic reasoning with noise-tolerance focuses on type inference and does not aim for full RDFS reasoning. This paper documents a novel approach that takes previous research efforts in noise-tolerance in the Semantic Web to the next level of full RDFS reasoning by utilizing advances in deep learning research. This is a stepping stone towards bridging the Neural-Symbolic gap for RDFS reasoning which is accomplished through layering RDF graphs and encoding them in the form of 3D adjacency matrices where each layer layout forms a graph word. Every input graph and its corresponding inference are then represented as sequences of graph words. The RDFS inference becomes equivalent to the translation of graph words that is achieved through neural network translation. The evaluation confirms that deep learning can in fact be used to learn RDFS rules from both synthetic and real-world Semantic Web data while showing noise-tolerance capabilities as opposed to rule-based reasoners.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Alessandra Mileo submitted on 14/May/2018
Suggestion:
Minor Revision
Review Comment:

Summary

This paper proposes an approach for learning RDFS rules based on Recurrent Neural Networks, where each triple is represented as a word and sequence to sequence learning can be applied.
The approach builds upon research on noise-tolerance in semantic reasoning and it extends it to full RDFS reasoning beyond Type assertions. The evaluation is promising and otherwise reveals some interesting aspects. This is a new and interesting area of research that is definitely attracting the attention of the community and can inspire interesting follow-up work.

Comments

The paper in general is well written and organised. It provides a good summary and categorisation of approaches combining the Semantic Web and deep learning, and it clearly identifies where this work stands. The methodology, algorithms and related motivations and proofs are well presented and overall sound. The evaluation is conducted in a rigorous way on specifically designed corrupted datasets based on classes of identified noise, and it shows interesting results on the noise-tolerance of the approach but also on the way it works when it makes mistakes.

This is a sound paper but a few aspects would need to be better explained or adjusted as indicated below:

- The examples provided in Section 3.1 as noisy type and noisy property assertions are both examples of triple corruption according to your description. Would you not need to define triple corruption more formally in terms of morphing of at most one among SPO? More specifically, it needs to be clear to what extent the morphing corrupts the triple. In Table 2 you only consider corruption in the predicate for RDFS2/3/7 and in the object for RDFS9. But wouldn't it also be noise if a subject or an object for RDFS2/3/7 were corrupted?

- Also, is it correct to say that you assume only one of SPO is corrupted? This can have an impact on the extent to which a propagable noise affects the graph, or can relate to a quantification of the extent to which the noise is propagable.

- The authors use DBpedia and refer to 17 types of noise in that dataset. Can the authors provide an intuition of which of these types of noise are captured correctly by the proposed noise taxonomy in terms of their effects on full RDFS inference with ABox corruption?

- You indicate that around 12% of DBpedia triples are noisy. But how much of this noise is propagable? You only provide one example for dbo:Person and dbo:Place, but I think a deeper analysis is required to understand how much of this propagable noise is actually captured by the deep learning approach.

- Defs. 5 and 6 need to be better characterised, especially regarding each E_l (which is a set of tuples (e_i,e_j) such that the predicate p_l relates them, and they are subjects or objects in the graph). Def 6 should also be corrected where it defines e_i and e_j \in G(Subj-obj(T) …: this is incomplete, or should it just be e_i and e_j \in Subj-obj(T), since they are nodes and that is enough?

- An example of what a layered graph looks like for one of the RDF graphs mentioned earlier in the paper would help have a visual understanding of it.

- No intuitive explanation is provided of why two graphs with isomorphic RDFS inference graphs have the same layered representation. Again an example might help.

- It is not explicitly mentioned that the input RDF graphs in Figure 5 are layered graphs.

- The encoding dictionary is not introduced anywhere before being presented in Fig 5. This relates to the matrix encoding and dictionary presented later, and this should be said when introducing it in Fig 5.

- Is there a particular reason for the choice of the sample resources in Table 7, or did you just take a random sample? I suppose publication11 and GraduateCourse39 are also part of the local dictionary?

- In Section 5.5 you list graph words embedding as a drawback of the layered RDF graphs embedding, but this is not a drawback. It is rather a technique that can be used to handle unknown words and capture similarity. This should be rephrased.

- While the evaluation on LUBM is impressive in terms of noise-tolerance, the evaluation when training on a noisy dataset like DBpedia can be interpreted to see how noise-tolerance really depends on the noisiness of the training dataset (which is expected). This requires a bit more discussion and consideration. For example, the future research in 8.1 depends on the fact that you have ground truth graph words, which you do not have when using noisy DBpedia. Is that fair to say?
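
To make the request for a layered-graph example concrete, the reading of Defs. 5 and 6 given above (one edge set E_l per predicate p_l, over the subject and object nodes) can be sketched as follows. This is only an illustration of that reading, not the authors' implementation: the toy triples, the `ex:` names, and the helper functions `layer_graph` and `to_tensor` are all hypothetical.

```python
# Minimal sketch, assuming each layer E_l is the set of (subject, object)
# pairs related by predicate p_l. All names below are hypothetical.
triples = [
    ("ex:alice", "rdf:type", "ex:Student"),
    ("ex:alice", "ex:advisor", "ex:bob"),
    ("ex:bob", "rdf:type", "ex:Professor"),
]


def layer_graph(triples):
    """Group edges by predicate: one edge set E_l per predicate p_l."""
    nodes = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    predicates = sorted({p for _, p, _ in triples})
    node_id = {node: i for i, node in enumerate(nodes)}
    layers = {p: set() for p in predicates}
    for s, p, o in triples:
        layers[p].add((node_id[s], node_id[o]))
    return nodes, predicates, layers


def to_tensor(nodes, predicates, layers):
    """Stack one adjacency matrix per predicate into a 3D 0/1 tensor."""
    n = len(nodes)
    tensor = [[[0] * n for _ in range(n)] for _ in predicates]
    for l, p in enumerate(predicates):
        for i, j in layers[p]:
            tensor[l][i][j] = 1
    return tensor


nodes, predicates, layers = layer_graph(triples)
tensor = to_tensor(nodes, predicates, layers)
```

Under this reading, two graphs get the same tensor whenever they agree on the layer layout after node indexing, which is the property an example in the paper would need to illustrate.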

Other minor comments

The outline at the end of Section 1 needs to be improved with a full outline section by section, and also a resolution of unclear references:

- missing references and reference to Evaluation chapter instead of a section of the paper
- what is the role of section 3 and 4?
- authors claim RDF graphs are not designed for RDFS inference or to be input to a Neural Network. They should clarify in the outline this is discussed in Section 4.

Sec. 2.1 in the Adaptive Noise Handling paragraph you mention active noise is suitable for some type of noise described “in the following” —> change to reference to section 2.1.1

fig 5 too small to read.

fix reference to Table ?? in table 7 caption.

page 18 handeling —> handling

page 24 —> fix column content going beyond column in precision formula

Review #2
By Peter Bloem submitted on 14/May/2018
Suggestion:
Reject
Review Comment:

This paper aims to solve the problem of using RDFS reasoning in the presence of noise. The proposed model does so by embedding knowledge graphs as 3D tensors, and learning a neural model that implements RDFS rules in a noise-tolerant manner.

While this is a highly relevant problem, and the authors motivate the work well, I cannot recommend that the paper be accepted. The reasons are outlined below.

# Major flaws

The most serious flaw is the lack of comparison against baselines. The authors report several performance metrics, but it is entirely unclear how impressive it is for a model to score, say, a 93% per-graph accuracy. Would, for instance, a simple link-prediction method like DistMult that was not exposed explicitly to the RDFS rules score significantly lower? Without a comparison against another model, the results are entirely meaningless. If no competing models have yet been published, one or more naive approaches should be devised as baselines.

Due to the lack of baselines, I do not believe that the claim that "deep learning can [...] be used to learn semantic reasoning" has been proved empirically by these experiments.

The related work is severely lacking. The task of link prediction is sufficiently closely related to reasoning that models like RESCAL, DistMult and TransE should be mentioned. Even though these models are not specifically designed to consume inference rules like RDFS, they may well outperform the authors' model when trained on a full materialization of the training data. Even if they're not directly comparable, they may offer insight into how to encode knowledge graphs for the purposes of deep learning.

Other deep learning-based reasoning models that should be mentioned (if not explicitly compared against) include:
** Rocktäschel et al. End-to-end differentiable proving (2017)
** Serafini et al. Logic Tensor Networks: Deep Learning and Logical Reasoning from Data and Knowledge (2016)
While the authors state that "capitalizing on the emerging research on deep learning for graphs" is a reason for modeling RDF data as a graph, they cite very little of this research. For instance: Graph Convolutions, GGT-NN and DeepWalk are ways of applying deep learning to graphs. All are directly applicable to, or have been translated to, the domain of knowledge graphs.

The exposition is extremely verbose, covering a huge amount of implementation and task-specific detail, before getting to the point. It is almost impossible to see the wood for the trees. I recommend that the authors restructure their paper by first explaining the model in general terms, before moving on to the specifics of the dataset to which they apply it. Another case in point: it is not strictly necessary to describe all graph models that weren't used before describing the one that was. Removing such exposition, or moving it to the back of the paper would make it much easier to read.

# Minor flaws

* The autoencoder model used is very complicated, but its design is never justified. Would a simpler model produce similar results, or is the use of (for instance) GRU layers crucial to the performance? Why use a sequence model when the input is a set?
* It seems extremely counter-intuitive to use a Dense layer before a GRU layer in this way. This should be wrapped in a TimeDistributed layer to maintain the sequential nature of the model.
* The description of the model is very implementation-specific (using Keras-specific terminology and diagrams and referring to cuDNN). These aspects are not essential aspects of the model, and should be separated out in its description. The model should be just as easy to re-implement from its description for users of PyTorch or plain TensorFlow as it is for users of Keras.
* It's very odd to describe hyperparameters by referring to their _values_. I would prefer description lists with the _name_ of the hyperparameter in bold.
* The paper makes various statements (termed propositions) that are substantiated by proofs. I would not require an empirical paper like this to contain analytical proofs, but if something is labeled a proof it should actually be sufficiently rigorous to deserve that label. This is not true for the propositions and proofs in this paper. Most are simple informal arguments that could be left as running text instead.
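
As an illustration of the TimeDistributed point above: applying a dense layer in a time-distributed way means the same weights are applied independently to every step of the sequence, so the time axis survives and a recurrent layer can still consume the output. The sketch below shows this in plain Python; it is not the authors' model, and the weights, bias, and input sequence are made-up values. In Keras the corresponding wrapper is `TimeDistributed(Dense(...))`.

```python
def dense(vector, weights, bias):
    """One dense layer: weights holds one row of input weights per output unit."""
    return [
        sum(x * w for x, w in zip(vector, row)) + b
        for row, b in zip(weights, bias)
    ]


def time_distributed_dense(sequence, weights, bias):
    """Apply the same dense layer to each timestep, keeping the time axis."""
    return [dense(step, weights, bias) for step in sequence]


# Hypothetical values: 3 timesteps, 2 input features, 2 output units.
weights = [[1.0, 0.0], [1.0, 1.0]]
bias = [0.0, 0.5]
sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out = time_distributed_dense(sequence, weights, bias)
```

The output still has one vector per timestep, which is the property a subsequent GRU layer relies on.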

# Other comments
* Definition 6 is missing some brackets.
* Various multi-letter function names (Obj, Subj, etc) should be typeset with \text{}
* Subj-obj is a confusing function name (the hyphen could be mistaken for subtraction).
* Standard hyperparameters (batch size, learning rate, etc.) were not reported.
* Due to the changing nature of the DBpedia endpoint, the specific graph extracted should be provided for reproducibility.
* page 24: equation runs into the right column.

Review #3
By Dagmar Gromann submitted on 24/May/2018
Suggestion:
Major Revision
Review Comment:

Summary:
This paper presents an innovative approach to noise-tolerant RDFS reasoning by combining a layered RDF graph approach, its encoding in 3D tensors and mapping RDF graphs and their inference graphs using a GRU-based architecture. The inference graphs are generated using the rule-based reasoner Jena, the performance of which also provides the main comparison for the evaluation of the empirical setting, tested on the LUBM dataset and a DBpedia subset. In comparison, the RNN reasoner performed reasonably well on propagable noise (impact on inference graph) but was outperformed by Jena on non-propagable noise (no impact on inference graph). In a rather exhaustive description the paper contributes a typology of noise types, a layered graph model based on 3D tensors, a method for learning graph embeddings, and a GRU-based architecture to learn RDFS rules in a noisy environment.

In spite of a detailed and generally well written mode of presentation, explanations are exhaustive and could benefit from reduction and/or restructuring. For instance, a lengthy section on different graph types does not contribute to the readability of the paper or substantially to its content. This tendency to go into a lot of detail can be observed in other sections as well. Nevertheless, aside from the following comments, the research is well motivated and sufficiently novel, and its approach is well within the scope of this special issue. Several of the claims made, however, need proper reflection in the face of missing related work/events, and the entire argumentation and structure of the paper require a stronger focus (in particular Section 5).

Major comments:
- clear method overview: A succinct and detailed overview of the major methodology with all its individual steps and methods chosen for each step somewhere at the beginning of the paper would strongly contribute to its readability. There is an outline, however, it does not provide a very good overview of the process explained later, but instead only indicates where the individual steps are explained in detail.
- evaluation: The current evaluation metrics are quite opaque and difficult to interpret. Without any comparison to other models or datasets as baseline not much can be gathered from the presented metrics. Since the datasets are not standard knowledge base completion datasets, it would be useful to maybe apply a standard link prediction model, such as TransE, to the proposed datasets for better comparison or apply the proposed model architecture on one of the standard datasets.
- evaluation corrupted triples: how does the approach handle valid GRU-generated triples that are not in the OWL LUBM knowledge base against which their validity is tested?
- design choice: why is this RNN architecture superior to others? The chosen architecture is not sufficiently motivated. What are the merits of a simple sequence-to-sequence architecture over more recent graph-based architectures, such as the one used in the following work? Several design choices could be motivated, such as the positioning and high number of dropout layers. This motivation could also be of empirical nature.
Michael Schlichtkrull, Thomas Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, Max Welling (2018) "Modeling Relational Data with Graph Convolutional Networks", Proceedings of ESWC 2018
- related work is lacking (see below)
- deep learning is not equivalent to "classification algorithms" as stated on page 3

Related work:
- embedding generation: what is the relation of the proposed embedding learning approach and RDF2Vec or other knowledge graph embedding methods? I think it would be good to explain why RDF2Vec should not be sufficient for this scenario.
- knowledge graph completion approaches are related enough to be included as related work (e.g. Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., ... & Zhang, W. (2014, August). Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 601-610). ACM.); Additionally, Socher has a paper on knowledge-base completion using a similar method as the work cited in this paper, which would be much more related than the one cited; in fact the paper even claims later to be "capitalizing on the emerging research on deep learning for graphs" - but none of it is provided
- I fail to comprehend how ontology learning from text and the Section 2.2.2 are related to the proposed approach in any way - I strongly suggest removing those two sections and in general focus on approaches/details that are actually related to your approach throughout the whole paper (this shortens the paper and resolves some issues of the argumentation)
- if the Bipartite Graph Model is not suitable for this approach, there is no reason to describe it in detail - the same goes for the Metagraph and the Hypergraph model

Other comments:
- architecture description: In the architecture description there is a mix of implementation-specific details and architectural details. For instance, cuDNN and GRU are described at once. The hyperparameters should be described as parameters rather than values.
- some of the claims raised: as indicated above a large bulk of related work is missing. Thus, claims about there being no approaches to bridge the "Neural-Symbolic Gap" are problematic since there are so many Knowledge Graph Embeddings, including some specific to RDF graphs (all of which are "suitable for neural network input")
- similarly the claims raised about "initiating the communication" on Deep Learning and Semantic Web are not quite justified - the authors submitted to a special issue on Semantic Deep Learning, which is a series that has been providing a platform to bring together Deep Learning experts and Semantic Web researchers for more than a year, now holding its 4th workshop this year; in line with this comment, I am also not sure that the authors' classification of Semantic Web and Deep Learning research really also reflects approaches on Semantic Web injection, e.g. disjointness axiom injection, in the process of training Deep Learning models. In fact, the whole classification does not contribute to the claims made in the paper.

Minor comments:
Many references are missing (e.g. "Table ??" or Fig. 6 is missing completely - only the caption shows)
Please add to the running text that some algorithms and tables are in the appendix, e.g. Appendix B. Algorithm 2
- the references to the RDFS entailment patterns either should reference the online resource or the table provided in the paper so that the reader understands what rule RDFS9 refers to
- thousands separator (e.g. 17,189 instead of 17189) strongly increases readability of larger numbers throughout the paper
- variables used in running text should be written in italics, e.g. (s,p,o) => the "s", "p", "o" afterwards in running text
Spelling of the properties:
- according to RDF Schema 1.1 the namespace is rdfs and not RDFS; the same applies to the namespace "rdf" in RDF properties
- dbr also camel-cases, e.g. dbr:Semantic_Web rather than dbr:Semantic_web => please ensure that all properties are spelled correctly

Minor comments on orthography/language in order of appearance:
"the noise can be as a consequence" => the noise can be a consequence of
"efforts in noise-tolerance" => "efforts on noise-tolerance"
references for the claims of "many researchers" and "current work" would be nice in the introduction Section 1.1.
"combing sound symbolic reasoning with" => combining?
"The research hypothesis are" => hypotheses
"described respectively in ??" => there is some reference missing
"based on Lehigh University Benchmark" => based on the LUBM
"In SDType" => In the SDType algorithm
"infer the the" => infer the
"aim full" => "aim at full"
"Spacial region" => did you mean "Spatial"?
"domain specific" => "domain-specific"
"classes hierarchy" => "class hierarchy"
"different than" => different from (a number of times)
quantification of "University" is missing in Table 4
"foundations of the graph theory" => foundations of graph theory
"order triple (.." => closing bracket missing (p. 9)
"S ubj-obj(T)" => Subj-obj(T)
the text does not fit Figure 3
Definition 6 misses closing brackets ")"
"non isomorphic" => "non-isomorphic"
Figure 5 is barely readable in the current size - enlarge font?
Caption in Table 7 contains ??
"to not update" => "not to update"
"non zeros values" => "non-zero values"
"encoding the inputs graph" => input graph
"This way when the" => "This way, when the"
"the layer dul:isDescribed By" => dul:isDescribedBy
"be searched' two" => be searched two
I am not sure Figure 7 truly contributes to the content of this paper
"set of subjects and objects resources" => subject and object resources
"layers through out" => throughout
"the goal of words embedding" => word embeddings
"words embedding also solves" => word embeddings also solve
"seventeen thousands" => thousand
"Few hyper-parameters needed to be changed though" => A few hyperparameters
"training speeds for both models are" => "training speed for both models is"
NVidia => NVIDIA
12 Gb => GB
this qusi => quasi
the formulas on page 24 extend into the text of the other column
There is a reference to Section 7.2.3 describing its content within Section 7.2.3 (at its end)
"scientists dataset" => Scientists dataset
"88 inference" => inferences
"adversarial generative models" => generative adversarial models
"it can not only" => cannot


Comments

Is Figure 6: 3D Adjacency matrix on page 12 missing?

Yes, thanks for the observation!

I uploaded the figure here:

https://drive.google.com/file/d/1KgHBXAPlCMHI0oAmDEHjSLufbLzpTv8K/view?u...