# RDFRules: Making RDF Rule Mining Easier and Even More Efficient

Authors:
Vaclav Zeman
Tomas Kliegr
Vojtěch Svátek

AMIE+ (Galárraga et al., 2015) is a state-of-the-art algorithm for learning rules from RDF knowledge graphs (KGs). Based on association rule learning, AMIE+ constituted a breakthrough in terms of speed on large data compared to the previous generation of ILP-based systems. In this paper we present several algorithmic extensions to AMIE+ which make it faster and more practical to use. The main contributions relate to performance: the top-k approach, which addresses the combinatorial explosion that often results from a hand-set minimum support threshold; a grammar for defining fine-grained patterns that reduce the size of the search space; and faster projection binding, which reduces the number of repeated calculations. Other enhancements include the possibility to mine across multiple graphs, lift as a new rule interest measure adapted to RDF KGs, support for the discretization of continuous values, and the selection of the most representative rules using proven rule pruning and clustering algorithms. Benchmarks show considerable improvements compared to AMIE+ on some problems. An open-source reference implementation is available under the name RDFRules.
Review #1
By Emir Muñoz submitted on 25/Jan/2020
Suggestion: Major Revision

Review Comment: This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

# Summary

This paper proposes several optimisations to improve the performance of the AMIE+ algorithm for rule mining in knowledge graphs. The authors implemented these optimisations and open-sourced the code. The proposed optimisations are as follows:

- Reduction of the search space for rules
- Pre-processing to support numerical attributes (object literals)
- Post-processing with rule clustering, pruning, filtering, and sorting

Experimental results show that the optimisations provide a speedup over AMIE+. Given the increasing size of knowledge graphs, the paper is relevant to the journal and the communities interested in knowledge graphs.

Although the work is well written and presents a good narrative with examples, I have major concerns because the paper does not fully deliver what the authors state in the first two sections. This requires some significant clarification in the text and likely extra experiments.

I have put the paper with all my comments in the following link for the authors: https://drive.google.com/file/d/1O1z2XWC2S8Ba_KBL7UQLvtMmclci2ExU/view?u...

# Comments

I have three major comments about this paper. I leave out comments on the open-source tool.

1. There are several weak claims that require clarification in the text. For example:
   - Abstract: “In this paper we present several algorithmic extensions to AMIE+ which make it faster and more practical to use”. How much faster? How do you measure “practical to use”?
   - Page 2: “Experiments evaluating the proposed algorithms, showing considerable improvements over AMIE+.” What is considerable?
   - The discussion of AMIE+ is not really a review of RDF rule mining as stated in the contributions.
   - Something that troubles me is that the authors claim that RDFRules provides the same results as AMIE+, but this is not evaluated. Is the number of rules reported in Table 3 the same for AMIE+ and RDFRules?

2. The notation, algorithms, and equations are not clear or sound in many cases. For example:
   - Angle brackets are used for different purposes (e.g., notation of atoms), which are not clearly stated. On page 13, parentheses are used to denote an atom.
   - Coverage is used with a different definition. Coverage already has a clear definition in rule mining: the fraction of records that satisfy the antecedent of a rule. It is a percentage.
   - The ground substitution is not clearly explained and leaves unanswered questions. Could it be possible to have a $\theta = {?a=, ?b=}$? How do you handle owl:sameAs equality?
   - Page 6: In the bsize equation it is not clear how $B\theta$ can be compared to a conjunction of triples. The same issue arises in the $bsize_{pca}$ definition.
   - Algorithm 1, line 4: It is not clear if you process unique predicates or all instances of predicates for each triple.
   - Algorithm 2, line 9: $map[r]$ is treated both as a set and as a counter.
   - Algorithm 2, line 13: another use of angle brackets that is not clearly described.
   - Similar issues with Algorithm 3.
   - Algorithm 5 has a few issues with set operators, and it is not clear when the loop should be broken.
   - Similarity equations on page 16 have issues in their definitions. (See attached PDF for more details.)
   - The definitions of head coverage (hc) and confidence (conf) differ from the ones introduced in Galárraga et al. 2015. Could you explain why they differ?

3. A few of the proposed optimisations were not tested in the experiments section or were tested insufficiently.
   - The graph-aware rule mining is not evaluated at all. Furthermore, the only dataset, YAGO core, is only a part of YAGO that does not contain the YAGO taxonomy with the entity types. Therefore, the rules shown in the examples are not feasible to obtain. Finally, the whole point of improving speedup and scalability should be tested with datasets larger than 948K triples, e.g., DBpedia and Wikidata. This should improve the comparison with Galárraga et al. 2015.
   - The binning or discretisation approach for numerical attributes introduced in Section 5.2 is not evaluated.
   - The proposed lift metric in Section 5.6 is not measured at all in the experiments.
   - The proposed rule clustering in Section 5.7 is not used beyond mentioning that RDFRules supports it with the DBscan algorithm.
   - The proposed rule filtering in Section 5.8 is not used.
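
As a point of reference for the remark in comment 2 about head coverage and confidence, the following is a minimal Python sketch of how support, head coverage, and standard confidence are commonly computed in the style of Galárraga et al. 2015, restricted to a rule with a single body atom. The toy triples, the predicate names `bornIn` and `livesIn`, and the helper `rule_measures` are all hypothetical and chosen only for illustration; this is not code from RDFRules or AMIE+, and it does not reproduce the definitions used in the manuscript under review.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def pairs(triples: List[Triple], predicate: str) -> Set[Tuple[str, str]]:
    """Distinct (subject, object) pairs that occur with the given predicate."""
    return {(s, o) for s, p, o in triples if p == predicate}


def rule_measures(triples: List[Triple], body_pred: str, head_pred: str) -> Dict[str, float]:
    """Measures for the rule (?a body_pred ?b) => (?a head_pred ?b).

    Assumption: one body atom sharing both variables with the head, so the
    body and head instantiations can be compared directly as sets of pairs.
    """
    body = pairs(triples, body_pred)
    head = pairs(triples, head_pred)
    support = len(body & head)  # pairs for which body and head both hold
    return {
        "support": support,
        # share of the head relation's facts that the rule covers
        "head_coverage": support / len(head) if head else 0.0,
        # share of the body instantiations for which the head also holds
        "confidence": support / len(body) if body else 0.0,
    }


# Hypothetical toy graph, for illustration only.
TOY_GRAPH: List[Triple] = [
    ("alice", "bornIn", "Prague"),
    ("alice", "livesIn", "Prague"),
    ("bob", "bornIn", "Brno"),
    ("bob", "livesIn", "Prague"),
    ("carol", "bornIn", "Dublin"),
    ("carol", "livesIn", "Dublin"),
    ("dave", "livesIn", "Galway"),
]

measures = rule_measures(TOY_GRAPH, "bornIn", "livesIn")
print(measures)
```

On this toy graph the rule holds for two of the three `bornIn` pairs and covers two of the four `livesIn` pairs, so support is 2, head coverage is 0.5, and confidence is 2/3. Both ratios share the same numerator and differ only in the denominator, which is the point at which a deviating definition would become visible.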
Review #2
Anonymous submitted on 15/Feb/2020