Fuzzy Constraints for Knowledge Graph Embeddings

Tracking #: 3134-4348

Authors: 
Michael Weyns
Pieter Bonte
Filip De Turck
Femke Ongenae

Responsible editor: 
Guest Editors NeSy 2022

Submission type: 
Full Paper
Abstract: 
Knowledge graph embeddings can be trained to infer which missing facts are likely to be true. For this, false training examples need to be derived from the available set of positive facts, so that the embedding models can learn to recognise the boundary between fact and fiction. Various negative sampling strategies have been proposed to tackle this issue, some of which have tried to make use of axiomatic knowledge claims to minimise the number of nonsensical negative samples being generated. By putting constraints on the construction of each candidate sample, these techniques have tried to maximise the number of true negatives output by the procedure. Unfortunately, such strategies rely exclusively on binary interpretations of constraint-based reasoning and have so far also failed to incorporate literal-valued entities into the negative sampling procedure. To alleviate these shortcomings, we propose a negative sampling strategy based on a combination of fuzzy set theory and strict axiomatic semantics, which allows for the incorporation of literal awareness when determining domain or range membership values. When evaluated on the benchmark datasets AIFB and MUTAG, we found that these improvements offered significant performance gains across multiple metrics with respect to state-of-the-art negative sampling techniques.

Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 17/Jun/2022
Suggestion:
Major Revision
Review Comment:

Semantic Web Journal Evaluation Points:

Originality/Significance

The work is definitely original as I have not observed a similar idea used to generate negative samples.

Quality of the writing

I find the paper slightly confusing. Some sections would require a better organization (e.g., Related Work).

Resources

I have quickly checked the code and have been able to run the tests. Minor note: some folders have to be created before running the test script.

- Summary Review -

The authors propose a novel method for selecting negative samples using fuzzy constraints. While I think the paper is interesting, I found it a bit hard to read, and I also have some questions about how the evaluation was run (some parameters have not been reported, and some experimental configurations were not clear).

I think the idea has merit and the paper can become a good contribution. Using fuzzy constraints is in line with many neuro-symbolic systems and is definitely a new idea in this context. However, the paper requires better writing to make the method clear and the experimental section solid. The next sections of this review describe some items that I hope will help improve the paper.

- Comments -

Knowledge graph embeddings should be better introduced; the concept of “scoring function” appears only right before the evaluation section.
Indeed, the equations of the models used in the evaluation (18, 19) are introduced at the end of Section 4, but they would probably fit better in a dedicated section (or even in the related work, if meaningful there). The authors could also describe the scoring function in the problem definition (where I would also describe TransE and DistMult). I think another interesting work to take a look at is [7].
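
For reference, the standard scoring functions of the two models used in the evaluation, which could be stated in such a section, are (my notation, not necessarily the paper's):

    f_{\mathrm{TransE}}(h, r, t) = -\lVert \mathbf{e}_h + \mathbf{w}_r - \mathbf{e}_t \rVert_{1 \text{ or } 2},
    \qquad
    f_{\mathrm{DistMult}}(h, r, t) = \sum_i (\mathbf{e}_h)_i \, (\mathbf{w}_r)_i \, (\mathbf{e}_t)_i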

Moreover, it was not clear whether the authors use LiteralE or only some of its components. I also failed to understand the relation between this LiteralE-based part and the previous part on fuzzy constraints. I wonder whether this part of the pipeline should be evaluated separately, to understand its contribution to the entire system.

I am unsure about the use of the term "fuzzy": from the description, the membership score seems to be derived from cardinalities, and no other fuzzy-set concept seems to be used. I might be wrong here, but I would be grateful to the authors if they could provide a more in-depth explanation of this.
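
To make the concern concrete, a cardinality-based degree such as

    \mu_C(e) = \frac{\lvert \{ t \in T : t \text{ supports } C(e) \} \rvert}
                    {\max_{e'} \lvert \{ t \in T : t \text{ supports } C(e') \} \rvert}

(my notation, not necessarily the paper's definition) is a graded score, but it involves no fuzzy connectives such as t-norms or t-conorms, which is what I would expect under the label "fuzzy set theory".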

Also, I'd suggest the authors use a single notation. Eq. 18 uses h, r, t for a triple, Algorithm 2 uses s, p, o, the knowledge graph is introduced using e_i, r_k, e_j, and on page 19 the triple is e_s, r, e_o. "O" is used for the first time on page 5 but defined only on page 10, and I do not think it is the same "O".

I would suggest the authors move the definitions to a dedicated section where all the elements are introduced, and keep the notation coherent. There are many examples throughout the paper, but it would be nice if they were consistent (e.g., the authors could draw a mini knowledge graph in the introduction from which all the examples are taken).

I think that a better organization would make the paper easier to read. For example, fuzzy logic is introduced in the methodological section, which broke my reading flow; I believe fuzzy logic could be introduced in a background section instead. It is also not clear how the authors intend to organize their related work: it is a mix of KG embedding methods and negative sampling methods (subsections could help). The authors cite TRESCAL (without discussing RESCAL), but there are also many other approaches that use types/schemas; see [1,2,3]. The negative sampling methods described are only a small subset of those introduced in the literature; see for example [4,5,6].

Tables like the one shown at the end of the paper are difficult to read: I had to jump between the text and the tables (which are on a different page) many times. I wonder if plots could make this easier to read.

- Experimental Methodology -

The authors run the algorithms for 100 epochs, but no validation seems to be run to check or prevent model overfitting. I wonder if this can make a difference in the results.

Page 18, section 54: is it fair to filter out nonsensical candidates only for the models introduced in the paper? Shouldn't this be a constraint that can be applied to any model? Could it be applied to the baselines to see the difference?

Parameters are missing. Which optimizer was used to train the models? (It is Adam, looking at the code, but I believe this should also appear in the paper.) Which parameters were used to train the single fully connected layer between pages 14 and 15? A similar comment applies to doc2vec (e.g., are the default parameters used, other than those mentioned?).

How many times were the algorithms run (to get averaged/stabler results)? Can the authors assess whether the results are statistically significant?
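
For instance, a paired test over per-run scores would suffice (a minimal sketch; the numbers below are hypothetical):

    # Hypothetical per-run MRR values for two negative samplers, paired by seed.
    from scipy import stats

    mrr_proposed = [0.212, 0.208, 0.219, 0.215, 0.210]
    mrr_baseline = [0.201, 0.205, 0.198, 0.203, 0.200]

    # Paired t-test: each pair of runs shares the same seed/split.
    t, p = stats.ttest_rel(mrr_proposed, mrr_baseline)
    print(f"t = {t:.3f}, p = {p:.4f}")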

I might be wrong here, but the method seems to be compared mainly with baselines that are either old (one reference is to a paper from 2015) or introduced by the authors themselves. I believe a more in-depth comparison with the state of the art (or a discussion of why this is not possible) would be good.

I might have missed this in the paper, but is there a reason why standard datasets like FB15K/DBpedia/YAGO (with their respective subsets) are not considered? Is it because of the ontology used, the absence of literals, or because the original task is too different? While I understand this, it makes it more difficult to evaluate the contribution with respect to other negative sampling methods. It would also be nice to see what happens with more recent embedding models such as RotatE (TransE is known to have some limitations in how it scores triples [6]).

Are the metrics evaluated in the filtered setup described by Bordes et al. (2013)? From the text, I cannot tell.
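
For clarity, this is the protocol I mean (a minimal sketch; the function and variable names are mine):

    def filtered_rank(score, test_triple, all_entities, known_triples):
        """Rank the true tail among corruptions, skipping corruptions that
        are themselves known positives (the filtered setup of Bordes et al., 2013)."""
        s, p, o = test_triple
        true_score = score(s, p, o)
        rank = 1
        for e in all_entities:
            if e == o or (s, p, e) in known_triples:
                continue  # do not penalise the model for other true triples
            if score(s, p, e) > true_score:
                rank += 1
        return rank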

- Minor -

"algorithm 1" - Capital “A”

The authors mention equation (13) on page 14: while I understand why they do this, if the equation is never used again it feels like an element that distracts from the paper.

[1] https://luca.costabello.info/docs/ECML_PKDD_2017_embeddings.pdf
[2] https://dl.acm.org/doi/abs/10.1145/2851613.2851841
[3] https://ojs.aaai.org/index.php/AAAI/article/download/5701/5557
[4] https://ojs.aaai.org/index.php/AAAI/article/view/11536
[5] https://aclanthology.org/2020.emnlp-main.492.pdf
[6] https://arxiv.org/pdf/1902.10197v1.pdf (this is the RotatE paper, but the authors also introduced a novel method to generate negative samples).
[7] https://aclanthology.org/D16-1019.pdf

Review #2
Anonymous submitted on 18/Aug/2022
Suggestion:
Reject
Review Comment:

MOTIVATION:
In the development of machine learning models for link prediction on the Semantic Web, one of the challenges is generating embeddings that also take negative example triples into account. Negative statements are not directly present in the Semantic Web, i.e., it does not contain the information that two entities are _not_ related by a certain relation. Furthermore, the absence of positive information cannot be interpreted as negative information, since the Semantic Web is inherently incomplete.

PAPER SUMMARY:
The intended contribution of the paper is to describe different methods for generating negative examples and to evaluate how these methods affect the performance of embedding-based link prediction algorithms. The main idea is that negative sampling methods should take into account the restrictions imposed by the schema (specified either in RDFS or in OWL). The authors also claim to be the first to generate negative samples for literals. They evaluate the approach on two KGs, namely AIFB and MUTAG.

POSITIVE ASPECTS:

- I believe that considering the constraints provided by the schema in order to generate more appropriate negative examples for training knowledge graph embeddings is a good point.
- Considering literals is also a positive point.
- Taking into account the "degree of negativeness" of a training example is also a good idea.

NEGATIVE ASPECTS:
- The technical description is poor and not sufficiently precise.
- Some choices are not sufficiently justified.
- Some of the introduced restrictions (e.g., the ExistValue restriction) are not exercised in the experiments, since they are not present in the datasets.
- The results of the paper are not reproducible: although the code runs correctly (with minor fixes), the produced output cannot be mapped onto the tables reported in the paper.

TECHNICAL FEEDBACK:
The technical description contains a number of formal imprecisions. This negatively affects the clarity of the message and the overall readability. In the following, I list the points I managed to find (some minor, others more important).

- page 5, line 7: definition of knowledge graph. It is first defined as a pair (E, R), and then as a
subset of E x R x E; clearly the second is the correct one. After that, I believe the discussion
of how a knowledge graph can be seen could be summarized by simply saying that every element
of E is either an entity or a literal, and that literals can appear only as the object of a triple.
The proposed formalisation is awkward and does not add anything new.

- page 5, line 48 and after: the description of the random variable is not correct. Y is not defined, and y_ijk \in {0,1} is a value, not a random variable. Furthermore, the dimension N_e x N_r x N_e is not correct, since it does not take into account that literals appear only in the object position. Finally, the introduction of probabilities is useless, since the authors are not presenting a specific probabilistic model.
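
For concreteness, one precise formulation (my notation, which the authors could adopt or correct) would be:

    Y = (y_{ijk}) \in \{0,1\}^{N_e \times N_r \times N_e}, \qquad
    y_{ijk} = 1 \iff (e_i, r_k, e_j) \in \mathcal{O},

with Y the random tensor and y_ijk its realisations, and with the index set shrunk so that no literal appears in the subject position.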

- page 5, equations (1)-(3): "O", "O+" and "O-" are undefined.

- page 6, lines 8-10: the sentence "They [OWL axioms] do not actually restrict what we can express" is not correct. OWL axioms do impose constraints; for instance, the AllValuesFrom restriction imposes that the object of a certain relation belongs to a certain class.

- page 7, equations (6) and (7): it is not clear what you mean by "(e_i, r, e_j) is valid".
I think I get the intuition, i.e., that you consider valid only the triples that respect
the constraint, but you should clarify various points:
(1) Do you perform some form of reasoning on the T-box? For instance,
if the T-box contains r rdfs:domain C and D rdfs:subClassOf C, then I suppose that a triple
(e, r, e') where e is of type D is also valid?
(2) Suppose you have the constraint r rdfs:domain C in the T-box,
but your A-box says nothing about the two entities e and e':
is (e, r, e') valid? In the example you give on lines 43-49, is the triple
(ex:foo, ex:age, "33"^^xsd:int) valid or not, under the hypothesis that
no information about ex:foo can be derived from the knowledge base?
Finally, the use of the term "valid" can generate misunderstanding, since "valid" has
a very precise meaning in logic, i.e., true in all models.
Similar comments apply to equations (10) and (11): they are not formally clear.
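
One possible precise reading, which the authors could adopt or correct (my formalisation, not the paper's):

    \text{consistent}(e_i, r, e_j) \iff
      \big( \forall C.\ (r, \text{rdfs:domain}, C) \in \mathcal{T} \implies \mathcal{T} \cup \mathcal{A} \models C(e_i) \big)
      \;\wedge\;
      \big( \forall C.\ (r, \text{rdfs:range}, C) \in \mathcal{T} \implies \mathcal{T} \cup \mathcal{A} \models C(e_j) \big)

Under this reading, entailment (rather than explicit assertion) settles both questions above.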

- The distinction between OWA and CWA is not clearly stated. From what I could gather, under the CWA you
consider all the "potential triples" that respect the constraints, while under the OWA you also consider
the ones that do not respect the constraints. This looks quite contradictory to me.
For instance, if you have the relation "nationality", which requires that the object is a country, why
should you add the triple (john, nationality, mary) as a negative example? Isn't this what usually happens
when people do not consider range restrictions? I believe you should provide clearer examples and motivations to distinguish between the two.

- page 9, Algorithm 1: line 6 is hard to interpret. I suppose you mean that
domain - subject_types != emptyset, i.e., there is a type in the domain
which is not among the subject's types (see the sketch below). In the case of OWL restrictions, it seems that you
do not make any distinction between AllValuesFrom and SomeValuesFrom. By the way, the
T-boxes of the two datasets AIFB and MUTAG do not contain existential restrictions.
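
A minimal sketch of the check I believe line 6 intends (the names are mine):

    def violates_domain(domain_types: set, subject_types: set) -> bool:
        """True iff some class required by the domain is missing from the
        subject's (inferred) types, i.e. domain_types - subject_types != {}."""
        return bool(domain_types - subject_types)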

- page 10, Algorithm 2:
* line 6: "neg-ration": I suppose this is neg_ratio * batch_size.
* line 7: RANDOM(): I suppose you are sampling from a uniform distribution on [0, 1].
* line 9: BERNOULLI(p) -> BERNOULLI(pr). Furthermore, if my interpretation is correct, then
you do not need this double sampling: what you do is equivalent to performing the single test
1/2 < uniform(0,1).
* line 12: you select all entities and then check whether they are valid. Would it not be more efficient
to query the valid entities directly? My impression is that the validity condition can easily be encoded in a SPARQL query over the knowledge graph closed under RDF/OWL inference (see the sketch below). This would also avoid enumerating and testing every entity.
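
A sketch of what I have in mind, using rdflib (the file name and predicate IRI are placeholders; the graph is assumed to be already closed under RDFS inference):

    from rdflib import Graph, URIRef

    g = Graph()
    g.parse("kg_with_inferred_types.ttl")  # placeholder: materialised graph

    p = URIRef("http://example.org/nationality")  # placeholder predicate IRI

    # Fetch only the entities that satisfy the range constraint of p,
    # instead of enumerating all entities and filtering them one by one.
    valid_objects = g.query("""
        PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?e WHERE {
            ?p rdfs:range ?c .
            ?e rdf:type ?c .
        }
    """, initBindings={"p": p})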

- page 10, second column, line 47: "we can derive ...": I do not think this is always possible. Moreover,
you introduce mu^R_s and mu^R_o without providing their intuitive meaning. In particular, you should explain
what it means that mu^R_s(e) = 1. It seems more plausible that mu^R_s is obtained by aggregating mu^R over the object dimension; however, not knowing the meaning of mu^R_s, this is just a supposition.

- page 13, Algorithm 5: there are various (major/minor) points that should be fixed or clarified:
* line 5: the definition of n is a bit strange. A much simpler and (I believe) equivalent definition is
#triples_p / #objects_p + 1, which equals 1 + the average number of subjects (or, equivalently, triples)
per object; see the sketch after this list.
* line 10: if SN_max = 0 then r is undefined.
* line 10: where does the 0.8 come from? Is it a hyperparameter? How did you decide to fix this threshold to 0.8?
* line 13: if n == 1 then SN_max must be equal to 0, since the average number of subjects per object is 0, and card_p(k=1) will be the empty set. So there is something here I do not correctly understand.
* lines 22-23: SC_sp and SC_ge are not defined in the algorithm. You explain them later in the text, but you should also specify in the algorithm how they are computed.
* line 25: the coefficients 0.2 and 0.8 are not justified. Are they hyperparameters? How did you choose them?
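
A minimal sketch of the alternative definition of n (triples as (s, p, o) tuples; the names are mine):

    from collections import defaultdict

    def n_per_predicate(triples):
        """n(p) = 1 + #triples_p / #objects_p, i.e. one plus the average
        number of subjects (equivalently, triples) per distinct object."""
        n_triples, objects = defaultdict(int), defaultdict(set)
        for s, p, o in triples:
            n_triples[p] += 1
            objects[p].add(o)
        return {p: 1 + n_triples[p] / len(objects[p]) for p in n_triples}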

- page 14, first column, line 37: the radius is defined as the greatest distance from the centroid within the cluster. Is this the Euclidean distance or the cosine distance? (Looking at the code, it seems to be the cosine distance.)
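
For reference, the cosine-distance variant that the code appears to implement (a sketch; the names are mine):

    import numpy as np

    def cluster_radius_cosine(vectors: np.ndarray) -> float:
        """Greatest cosine distance (1 - cosine similarity) between any
        cluster member and the cluster centroid."""
        centroid = vectors.mean(axis=0)
        sims = (vectors @ centroid) / (
            np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid)
        )
        return float((1.0 - sims).max())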

EXPERIMENTAL EVALUATION:

The paper evaluates the approach on two relatively small datasets, AIFB and MUTAG. No comparison is made with state-of-the-art approaches that include background knowledge in embedding creation. I believe this is necessary, since the proposed negative example generation strategy is based on knowledge about the schema in addition to the training data. Given the current interest in neuro-symbolic integration, there are many works that attempt to include knowledge in embedding generation; the authors should compare with the most prominent approaches in this field.

REPRODUCIBILITY:

The code does not run as-is: it gives an error. I fixed it by adding the directory pickles, the two subdirectories pickle_AIFB and pickle_MUTAG, and the directory results. I then ran the code for one full day (24 hours), during which only 3 experiments completed, generating the following outputs:

TransE_MUTAG_f0_enhanced_fuzzy_r1.0

micro_mr micro_mrr micro_hits1 micro_hits5 micro_hits10 macro_mr macro_mrr macro_hits1 macro_hits5 macro_hits10
3864.86039402174 0.03798001324660996 0.016304347826086956 0.048403532608695655 0.07591711956521739 1488.9208258804508 0.026901632935893316 0.00966756184824349 0.02741521909152732 0.04888600866232436
2601.673063858696 0.0220483210491774 0.0078125 0.03413722826086957 0.043478260869565216 1222.513125737288 0.2415117168948162 0.09285711896944132 0.4634787838987777 0.4787529422986032
3233.266728940218 0.03001416714789368 0.012058423913043478 0.04127038043478261 0.059697690217391304 1355.7169758088694 0.13420667491535476 0.05126234040884241 0.24544700149515253 0.2638194754804638

TransE_MUTAG_f0_regular_fuzzy_r1.0

micro_mr micro_mrr micro_hits1 micro_hits5 micro_hits10 macro_mr macro_mrr macro_hits1 macro_hits5 macro_hits10
2947.084069293479 0.0427498926856862 0.004245923913043478 0.07659646739130435 0.11005434782608696 1268.2268652459247 0.02724853023751138 0.0020436034100543136 0.046428800748556986 0.06329901187446912
3630.568783967391 0.02149687400075085 0.001358695652173913 0.03838315217391305 0.048743206521739135 2019.7362883214437 0.20494059192714692 0.014236111111111113 0.4588717788808927 0.47242066250993253
3288.826426630435 0.03212338334321853 0.0028023097826086955 0.0574898097826087 0.07939877717391305 1643.9815767836842 0.11609456108232916 0.008139857260582714 0.2526502898147248 0.26785983719220086

TransE_MUTAG_f0_regular_fuzzy_r5.0

micro_mr micro_mrr micro_hits1 micro_hits5 micro_hits10 macro_mr macro_mrr macro_hits1 macro_hits5 macro_hits10
1893.1734035326087 0.08545384716686456 0.0016983695652173915 0.17646059782608697 0.22146739130434784 874.6933887696778 0.04683703771183237 0.0008243244984361537 0.0869351897853666 0.10745877797684515
2382.757472826087 0.03243278238131909 0.003057065217391304 0.050951086956521736 0.07931385869565216 1585.1805630781996 0.23068450766295595 0.04428887085137085 0.47828461955273355 0.49812720488000983
2137.965438179348 0.05894331477409183 0.002377717391304348 0.11370584239130435 0.150390625 1229.9369759239387 0.13876077268739417 0.022556597674903502 0.28260990466905006 0.30279299142842747

At that point I decided to stop the evaluation, since it is not easy to map these results onto the tables reported in the paper. I think the authors should clarify this mapping. Furthermore, neither the hardware on which the experiments were run nor the required computation time is specified. This information is important for reproducibility.