# Fuzzy Constraints for Knowledge Graph Embeddings

### Tracking #: 3134-4348

Authors:
Michael Weyns
Pieter Bonte
Filip De Turck
Femke Ongenae

Responsible editor:
Guest Editors NeSy 2022

Submission type:
Full Paper
Abstract:
Knowledge graph embeddings can be trained to infer which missing facts are likely to be true. For this, false training examples need to be derived from the available set of positive facts, so that the embedding models can learn to recognise the boundary between fact and fiction. Various negative sampling strategies have been proposed to tackle this issue, some of which have tried to make use of axiomatic knowledge claims to minimise the number of nonsensical negative samples being generated. By putting constraints on the construction of each candidate sample, these techniques have tried to maximise the number of true negatives output by the procedure. Unfortunately, such strategies rely exclusively on binary interpretations of constraint-based reasoning and have so far also failed to incorporate literal-valued entities into the negative sampling procedure. To alleviate these shortcomings, we propose a negative sampling strategy based on a combination of fuzzy set theory and strict axiomatic semantics, which allows for the incorporation of literal awareness when determining domain or range membership values. When evaluated on the benchmark datasets AIFB and MUTAG, we found that these improvements offered significant performance gains across multiple metrics with respect to state-of-the-art negative sampling techniques.
Tags:
Reviewed

Decision/Status:
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 17/Jun/2022
Suggestion: Reject

Review Comment:

MOTIVATION

In the development of machine learning models to solve the problem of link prediction in the semantic web, one of the challenges is the generation of embeddings that also takes negative examples of triples into account. Negative statements are not directly present in the semantic web, i.e., the semantic web does not contain information that two entities are _not_ related by a certain relation. Furthermore, the absence of positive information cannot be interpreted as negative information, because the semantic web is inherently incomplete.

PAPER SUMMARY

The intended contribution of the paper is to describe different methods for generating negative examples and to evaluate how these methods affect the performance of embedding-based link prediction algorithms. The main idea of the paper is that negative sampling methods should take into account the restrictions imposed by the schema (specified either in RDFS or in OWL). The authors also claim to be the first to generate negative samples for literals. They evaluate the approach on two knowledge graphs, namely AIFB and MUTAG.

POSITIVE ASPECTS

- I believe that considering constraints provided by the schema in order to generate more appropriate negative examples for training knowledge graph embeddings is a good point.
- Considering literals is also a positive point.
- Taking into account the "degree of negativeness" of a training example is also a good idea.

NEGATIVE ASPECTS

- The technical description of the paper is poor and not sufficiently precise.
- Some choices are not sufficiently justified.
- Some of the introduced restrictions (e.g., the ExistValue restriction) are never exercised in the experiments, since they are not present in the datasets.
- The results of the paper are not reproducible: although the code runs correctly (with minor fixes), the produced output cannot be mapped onto the tables reported in the paper.

TECHNICAL FEEDBACK

The technical description contains a number of imprecisions from the formal point of view. This negatively affects the clarity of the message and the overall readability. In the following I list all the points that I managed to find (some of them are minor, others are more important).

- Page 5, line 7: definition of knowledge graph. It is first defined as a pair (E, R), and then as a subset of $E \times R \times E$. Clearly the second is the correct one. Beyond that, I believe that the discussion of how a knowledge graph can be seen could be summarised by simply saying that every element of E is either an entity or a literal, and that literals can appear only as the object of a triple. The proposed formalisation is awkward and does not add anything new.
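To make this suggested simplification concrete, a minimal LaTeX sketch is given below; it reuses the paper's symbols $E$ and $R$, while the set of literals $L$ is my own notation, not taken from the paper:

```latex
% Sketch of the suggested simplification (L is an assumed symbol for the
% set of literals; it does not come from the paper).
\[
  E = E_{\mathrm{ent}} \cup L, \qquad
  \mathcal{G} \;\subseteq\; (E \setminus L) \times R \times E
\]
% i.e., every element of E is either an entity or a literal, and literals
% may appear only in the object position of a triple; subjects are entities.
```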
- Page 5, line 48 and after: the description of the random variable is not correct. $Y$ is not defined, and $y_{ijk} \in \{0,1\}$ is a value, not a random variable. Furthermore, the dimension $N_e \times N_r \times N_e$ is not correct, since it does not take into account that literals appear only on the right. Finally, the introduction of probabilities is useless, since the authors are not presenting a specific probabilistic model.

- Page 5, equations (1)-(3): "O", "O+" and "O-" are undefined.

- Page 6, lines 8-10: the sentence "They [OWL axioms] do not actually restrict what we can express" is not correct. OWL axioms do impose constraints. For instance, the AllValuesFrom restriction imposes that the object of a certain relation is of a certain class.

- Page 7, equations (6) and (7): it is not clear what you mean by "(e_i, r, e_j) is valid". I think I get the intuition, i.e., that you consider valid only the triples that respect the constraint. But for this you should clarify various points. (1) Do you perform some form of reasoning on the T-box? For instance, if the T-box contains "R rdfs:domain C" and "D rdfs:subClassOf C", then I suppose that a triple (e, R, e') where e is of type D is also valid? (2) Suppose you have the constraint "r rdfs:domain C" in the T-box, but your A-box does not say anything about the two entities e and e': is (e, r, e') valid? In the example you gave on lines 43-49, is the triple (ex:foo ex:age 33^^xsd:int) valid or not, under the hypothesis that you are not able to derive any information about foo from the knowledge base? Finally, the use of the term "valid" can generate misunderstanding, since "valid" has a very precise meaning in logic, i.e., true in all models. Similar comments apply to equations (10) and (11); they are not formally clear.

- The distinction between OWA and CWA is not clearly stated. From what I could gather, under the CWA you consider all the "potential triples" that respect the constraints, while under the OWA you also consider the ones that do not respect the constraints. This looks quite contradictory to me. For instance, if you have the relation "nationality", which requires the object to be a country, why should you add the triple (john, nationality, mary) as a negative example? Isn't this what usually happens when people do not consider range restrictions? I believe you should provide clearer examples and motivations to distinguish between the two.

- Page 9, Algorithm 1: line 6 is hard to interpret. I suppose you mean that "domain - subject_type != emptyset", i.e., there is a type in the domain which is not among the subject's types. In the case of OWL restrictions, it looks like you make no distinction between AllValuesFrom and SomeValuesFrom. By the way, the T-boxes of the two datasets AIFB and MUTAG do not contain existential restrictions.

- Page 10, Algorithm 2:
  * Line 6: "neg-ratio"; I suppose this is neg_ratio * batch_size.
  * Line 7: "RANDOM()"; I suppose that you are sampling from a uniform distribution over [0, 1].
  * Line 9: "BERNOULLI(p)" should presumably be "BERNOULLI(pr)". Furthermore, if my interpretation is correct, then you do not need this double sampling: what you do is equivalent to performing the test 1/2 < uniform(0, 1).
  * Line 12: you select all entities and then check whether they are valid. Wouldn't it be more efficient to query the valid entities directly? My impression is that the validity condition can easily be encoded as a SPARQL query over the knowledge graph closed under RDF/OWL inference; see the sketch below. This would be more efficient than filtering every entity after the fact.
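As an illustration of that last point, here is a minimal sketch, assuming rdflib and a graph that has already been closed under RDFS/OWL inference (e.g., materialised beforehand); the file name and the ex:nationality relation are placeholders of mine, not taken from the paper or its code:

```python
# Hypothetical sketch (not the authors' code): fetch only the valid candidate
# objects for one relation with a single SPARQL query, instead of enumerating
# all entities and filtering them afterwards. Assumes the inferred rdf:type
# triples are already materialised in the graph.
from rdflib import Graph

g = Graph()
g.parse("aifb_closed.ttl")  # placeholder file: the KG after RDFS/OWL closure

# Objects respecting the range constraint of ex:nationality are exactly the
# (inferred) instances of every class declared as its range.
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/>

SELECT DISTINCT ?candidate WHERE {
    ex:nationality rdfs:range ?c .
    ?candidate a ?c .
}
"""

valid_objects = {row.candidate for row in g.query(query)}
```

Sampling corrupted objects from valid_objects would then avoid re-checking validity for each candidate, which is what the per-entity check in Algorithm 2 appears to do.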
- Page 10, second column, line 47: "we can derive ...". I do not think this is always possible. By the way, you introduce $\mu^R_s$ and $\mu^R_o$ without providing their intuitive meaning. In particular, you should explain what it means that $\mu^R_s(e) = 1$. It is more plausible that $\mu^R_s$ can be obtained by aggregating $\mu^R$ over the object dimension; however, not knowing the meaning of $\mu^R_s$, this is just a supposition.

- Page 13, Algorithm 5: there are various (major/minor) points that should be fixed or clarified.
  * Line 5: the definition of n is a bit strange. A much simpler and, I believe, equivalent definition is #triples_p / #objects_p + 1 (which is equal to 1 plus the average number of subjects, or equivalently triples, per object).
  * Line 10: if SN_max = 0 then r is undefined.
  * Line 10: where does the 0.8 come from? Is this a hyperparameter? How did you decide to fix this threshold at 0.8?
  * Line 13: if n == 1 then SN_max must be equal to 0, since the average number of subjects per object is 0, and carp_p(k=1) will be the empty set. So there is something I do not correctly understand here.
  * Lines 23, 23: SC_sp and SC_ge are not defined in the algorithm. You only explain them later in the text, but you should also specify in the algorithm how they are computed.
  * Line 25: the coefficients 0.2 and 0.8 are not justified. Are they hyperparameters? How did you choose them?

- Page 14, first column, line 37: the radius is defined as the greatest distance from the centroid within the cluster; is this Euclidean distance or cosine distance? (Looking at the code, it seems to be cosine distance; see the sketch below.)
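To make my reading of the code explicit, here is a small sketch of the radius computation under the cosine-distance interpretation; the function and variable names are mine, not the authors':

```python
# Hypothetical sketch of the cluster radius as I read it from the code:
# the greatest cosine distance between any member embedding and the
# cluster centroid. Names are placeholders, not the authors'.
import numpy as np

def cluster_radius(embeddings: np.ndarray) -> float:
    """embeddings: (n_members, dim) array of entity embeddings in one cluster."""
    centroid = embeddings.mean(axis=0)
    # cosine distance = 1 - cosine similarity
    sims = (embeddings @ centroid) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid)
    )
    return float(np.max(1.0 - sims))

# Example: radius of a toy 3-member cluster in 2 dimensions.
# r = cluster_radius(np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]))
```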
EXPERIMENTAL EVALUATION

The paper evaluates the approach on two relatively small datasets, AIFB and MUTAG. No comparison is made with state-of-the-art approaches that include background knowledge in the task of embedding creation. I believe that this is necessary, since the proposed strategy for negative example generation is based on knowledge about the schema in addition to the training data. Given the current hype around neuro-symbolic integration, there are many works that attempt to include knowledge in embedding generation. I believe that the authors should compare with the most prominent approaches of the state of the art in this field.

REPRODUCIBILITY

The code does not run as it is; it gives an error. I tried to fix it by adding the directory pickles with the two subdirectories pickle_AIFB and pickle_MUTAG, and the directory results. I ran the code for one full day (24 hours) and only 3 experiments were completed, generating the following outputs:

TransE_MUTAG_f0_enhanced_fuzzy_r1.0
micro_mr micro_mrr micro_hits1 micro_hits5 micro_hits10 macro_mr macro_mrr macro_hits1 macro_hits5 macro_hits10
3864.86039402174 0.03798001324660996 0.016304347826086956 0.048403532608695655 0.07591711956521739 1488.9208258804508 0.026901632935893316 0.00966756184824349 0.02741521909152732 0.04888600866232436
2601.673063858696 0.0220483210491774 0.0078125 0.03413722826086957 0.043478260869565216 1222.513125737288 0.2415117168948162 0.09285711896944132 0.4634787838987777 0.4787529422986032
3233.266728940218 0.03001416714789368 0.012058423913043478 0.04127038043478261 0.059697690217391304 1355.7169758088694 0.13420667491535476 0.05126234040884241 0.24544700149515253 0.2638194754804638

TransE_MUTAG_f0_regular_fuzzy_r1.0
micro_mr micro_mrr micro_hits1 micro_hits5 micro_hits10 macro_mr macro_mrr macro_hits1 macro_hits5 macro_hits10
2947.084069293479 0.0427498926856862 0.004245923913043478 0.07659646739130435 0.11005434782608696 1268.2268652459247 0.02724853023751138 0.0020436034100543136 0.046428800748556986 0.06329901187446912
3630.568783967391 0.02149687400075085 0.001358695652173913 0.03838315217391305 0.048743206521739135 2019.7362883214437 0.20494059192714692 0.014236111111111113 0.4588717788808927 0.47242066250993253
3288.826426630435 0.03212338334321853 0.0028023097826086955 0.0574898097826087 0.07939877717391305 1643.9815767836842 0.11609456108232916 0.008139857260582714 0.2526502898147248 0.26785983719220086

TransE_MUTAG_f0_regular_fuzzy_r5.0
micro_mr micro_mrr micro_hits1 micro_hits5 micro_hits10 macro_mr macro_mrr macro_hits1 macro_hits5 macro_hits10
1893.1734035326087 0.08545384716686456 0.0016983695652173915 0.17646059782608697 0.22146739130434784 874.6933887696778 0.04683703771183237 0.0008243244984361537 0.0869351897853666 0.10745877797684515
2382.757472826087 0.03243278238131909 0.003057065217391304 0.050951086956521736 0.07931385869565216 1585.1805630781996 0.23068450766295595 0.04428887085137085 0.47828461955273355 0.49812720488000983
2137.965438179348 0.05894331477409183 0.002377717391304348 0.11370584239130435 0.150390625 1229.9369759239387 0.13876077268739417 0.022556597674903502 0.28260990466905006 0.30279299142842747

Then I decided to stop the evaluation, since it is not easy to map these results onto the tables reported in the paper. I think the authors should clarify this mapping. Furthermore, there is no specification of the hardware architecture on which the experiments were run, nor of the required computation time. This information is important for reproducibility.