RDFRules: Making RDF Rule Mining Easier and Even More Efficient

Tracking #: 2511-3725

Vaclav Zeman
Tomas Kliegr
Vojtěch Svátek

Responsible editor: 
Agnieszka Lawrynowicz

Submission type: 
Full Paper
Abstract:
AMIE+ is a state-of-the-art algorithm for learning rules from RDF knowledge graphs (KGs). Based on association rule learning, AMIE+ constituted a breakthrough in speed on large data compared to the previous generation of ILP-based systems. In this paper we present several algorithmic extensions to AMIE+, which make it faster, and support for data pre-processing and model post-processing, which provides more comprehensive coverage of the linked data mining process than the original AMIE+ implementation. The main contributions relate to performance improvement: the top-k approach, which addresses the problem of combinatorial explosion often resulting from a hand-set minimum support threshold; a grammar that allows defining fine-grained patterns to reduce the size of the search space; and a faster projection binding that reduces the number of repetitive calculations. Other enhancements include the possibility to mine across multiple graphs, support for discretization of continuous values, and the selection of the most representative rules using proven rule pruning and clustering algorithms. Benchmarks show reductions in mining time of up to several orders of magnitude compared to AMIE+. An open-source implementation is available under the name RDFRules at https://github.com/propi/rdfrules.

Solicited Reviews:
Review #1
By Jedrzej Potoniec submitted on 20/Jul/2020
Review Comment:

I thank the authors for their effort to improve the paper. My remarks were all addressed in a satisfactory manner and in my opinion, the paper is now ready for publication.

Review #2
By Emir Muñoz submitted on 22/Jul/2020
Review Comment:

I really appreciate the effort of the authors in addressing the comments made by me and the other reviewers. The quality of the paper was hugely improved in the last round. I now find the paper more self-contained and concise on the contributions made to the area. Thank you for publishing the scripts used to run the experiments. That is a great addition towards the reproducibility and re-usability of RDFRules.

I still have some really minor comments; however, I think the authors can address them easily before the camera-ready and there is no need for another round---at least from my side.

## Minor comments

1. Table captions should be on top of the tables and, in general, I think you could add some differentiation to the headers of all tables
2. In the abstract it would be better to enumerate the performance improvements, e.g., (1) the top-k …, (2) … , (3) …
3. Section 1: “the AMIE+ algorithm can take days, but the presented approach has a more than an order of magnitude shorter mining time” —> in which dataset did you observe this behaviour? It’d be better to name it or refer to your experiments
4. Section 2: “RDF-style KGs” —> I think that here it should just say RDF KGs, remove the “style” part. It’s true that other databases can be converted to RDF; however, you do not process the original format, but the final RDF format
5. Section 2: “Jena API” —> Please provide a link to the project web page
6. Section 2: “included on the output” —> included *in* the output
7. Section 2, typo: “Interestingy, both” —> *Interestingly*, both
8. Section 2: “suggesting subclassof relations” —> do you mean the rdfs:subClassOf relation?
9. Section 2: “which can facilitate work of ontology engineers” —> which can facilitate *the* work of ontology engineers or the work on ontology engineering
10. Section 2: “ILP systems have been reported too slow” —> ILP systems have been reported *to be* too slow
11. Section 2: First mention to YAGO is missing a reference or link to the project
12. Section 2: “ILP systems such as Aleph” —> ILP systems such as *ALEPH* (in uppercase)
13. Section 2, page 3: Third paragraph on the right has a sentence “however” followed by another sentence “however”
14. Section 2: “KGs [20] and is also used” —> KGs [20] and also used (without the *is*)
15. Section 2: First mention to Apache Spark is missing a link to the project
16. Section 2: When introducing SWARM, I think an explanation is needed for the association rule used as an example
17. Section 2: “Apriori algorithm. .” —> double full stop
18. Section 2: the following sentence seems odd “a use of a reasoner in combination with association rule mining results in a high chance of mining not finishing in ’reasonable time’ [26]” —> *the* use of a reasoner in combination with association rule mining is likely to exceed what one can consider a “reasonable time” [26]
19. Section 2: “rules.They” —> there is a missing space after the full stop
20. Section 2: “the SDType algorithm” —> missing reference to the paper
21. Section 2: “e.g. RESCAL [28], HolE [29] or TransE [30]” —> *and* TransE (you are listing examples not providing options)
22. Section 3: “E.g., the atom” —> *For example*, the atom
23. Section 3: “In this respect, AMIE-style rules” —> In this respect, *Horn rules*
24. Section 3.3: Nitpicking: in the example of substitution, the ?a doesn’t always have the same font
25. Section 3.4.2: “is also infrequent (this is called the upward closure property).” —> any reference you could provide here?
26. Section 3.4.2: When defining support you use the instantiation operator, which is used over a rule body and head implication. Should this be B \wedge H or am I missing something?
27. Section 3.4.3: Your equation defining bsize is the same as the one provided for support and the condition also involves instantiation with a rule. Following AMIE+, the instantiation right part should only involve the body of the rule
28. Section 3.4.3: “the number of unique objects in the (empirical) domain of p in KG” —> *subjects* not objects
29. Section 3.4.4: “Galárraga et al. [2] witness)” —> witness is a noun; use *witnessed* or *observed* instead
30. Section 5.1: “The recursion is stopped after all fresh atoms have been bound.” —> *found*
31. Section 5.6: “Suppose two rules U and V” —> I think previously you used $r_1$ and $r_2$ to describe two rules. I think you should keep consistency on that notation
32. Section 6: “from RDF-style data” —> from RDF data
33. Section 6.2: “Actions are further divided into the streaming and strict ones” —> I still believe you should use “batch”, where batch is a group of triples or all the set of triples, i.e., the whole KG
34. Section 6.3: “The operations implemented over the RDFDataset structure are the same as for the RDFGraph structure, aside a few extensions.” —> You mention some differences between RDFDataset structure and RDFGraph structure. Where are these extensions enumerated?
35. Section 6.5: “(cf. Algorithm 1)” —> “confer” (“cf.”) should be used to point to another source discussing the same issue or making the same argument. In this case, you should use “see” to point to your algorithm
36. Section 6, Table 2: This table is missing a clear highlight of the headers that would help improve the readability of the table. Also, if the first column is important, you could bold the text to differentiate it, instead of using the double line.
37. Section 7.1: “The number of triples and their unique elements for each used dataset are shown in Table 3.” —> remove the *used* word
38. Section 7, Table 4: Table 4 seems to have a formatting issue in the reported results for DBpedia 3.8. It is not clear what the setups for the different reported values are.
39. Section 7, Table 4: “with approx. 7.14 h“ —> How was this approximation made? Why do you first mention > 2 days?
40. Section 7.2.2: “mining time was set for two days” —> mining time was set *to* two days
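
Comments 26–28 above concern the definitions of support and bsize in Section 3.4. For reference, a sketch of the standard AMIE-style measures (following Galárraga et al.; the paper's own notation and operator names may differ) illustrates the distinction the reviewer is pointing at: support instantiates the *whole* rule (body and head), while bsize instantiates the body alone.

```latex
% Sketch of AMIE-style measures for a rule B => H(x,y),
% where z_1,...,z_m are the variables occurring only in the body B.
% Support counts head bindings for which body AND head hold;
% bsize counts bindings satisfying the body alone.
\begin{align*}
  \mathrm{supp}(B \Rightarrow H)  &= \#(x,y) : \exists z_1,\dots,z_m : B \wedge H(x,y) \\
  \mathrm{bsize}(B \Rightarrow H) &= \#(x,y) : \exists z_1,\dots,z_m : B \\
  \mathrm{conf}(B \Rightarrow H)  &= \frac{\mathrm{supp}(B \Rightarrow H)}{\mathrm{bsize}(B \Rightarrow H)}
\end{align*}
```

Under these definitions, the right-hand side of the support equation involves both body and head (comment 26), whereas bsize must instantiate the body only (comment 27), so the two equations cannot be identical.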

Review #3
Anonymous submitted on 07/Sep/2020
Minor Revision
Review Comment:

The authors addressed all of the comments made on the first version.

I acknowledge all of the additions in response to specific comments.
In particular the evaluation of specific new features has been covered to some extent.
The main focus is on tackling weaknesses in AMIE+, although the solutions are not based on the involvement of Web ontologies (e.g. see the comparison to ILP rule learners at p.4, esp. lines 33-38).

My main concern remains the motivation for learning this kind of rule.
I acknowledge the proposed advances regarding the various problems with AMIE+ elicited in the paper; still, I'm not convinced of the effectiveness of the rules for the claimed purposes.

I see that graph completion may be a valid scenario (as evidenced by the example added to sect. 7.4, which focuses on discretization, though), but in that case an empirical comparison to different link prediction methods / models may be required.
Triples may be missing owing to different causes: incompleteness is inherent to this context (as investigated in papers on this specific topic).

Triples may be even missing because they are intended to be derived through reasoning.
How many of the missing triples that can be found depend on ruling out reasoning?
Of course, noisy data may easily mislead a completion technique totally based on reasoning, but the same would apply to weak or low-quality rules elicited from (noisy) data.

The argument (see p.4, lines 43-46) involving the results contained in [17] considers inconsistency problems rather than the inherent incompleteness. As for the other systems cited as related work, it would be interesting to know whether these models are better or worse when applied to completion problems. New results achieved with AMIE+ are mentioned, but can we expect analogous performance from the proposed system?

In my opinion, for completion to be claimed as the main purpose of the rule extraction, the design of specific experiments regarding the quality of the extracted rules would be required.
But that would also call for a comparison to different treatments of the problem such as supervised statistical learning methods ultimately adapted to these representations.

Thus I would not object to accepting the paper provided that the utility of such approaches is more convincingly discussed.

Minor issues:

- comm. 2.36: owl:sameAs assertions are exactly one kind of info that may be missing due to the typical distributed scenarios; their availability may not be taken for granted
- comm. 2.41: "However, numerical values are still likely to be unique as long as their range is sufficiently large, never mind infinite." unclear
- comm. 2.45: I'm ok with the reply but semantic features should be inherently important in this context


- there are a few font issues to be amended, e.g. p.6, line 21: ?a and ?b, and fig. 1 on p.7


- the project Github URL would better fit a footnote than the abstract's main text
- the same applies to the link indicated in the conclusions (you might also use a \tt font or the \url command)


- the error rate measure for a KB mentioned from [17] is unclear

- ref. [12] "JÓzefowska" typo