# An Unsupervised Approach to Disjointness Learning based on Terminological Cluster Trees

### Tracking #: 2233-3446

Authors:
Giuseppe Rizzo
Claudia d'Amato
Nicola Fanizzi

Responsible editor:
Philipp Cimiano

Submission type:
Full Paper
Abstract:
In the context of the Semantic Web, regarded as a Web of Data, research efforts have been devoted to improving the quality of the ontologies that are used as vocabularies to enable complex services based on automated reasoning. Various surveys show that many domains would require better ontologies that include non-negligible constraints. In this respect, disjointness axioms are representative of this general problem: these axioms are essential for making the negative knowledge about the domain of interest explicit, yet they are often overlooked during the modeling process (thus affecting the efficacy of the reasoning services). To tackle this problem, automated methods for discovering these axioms can be used as a tool for supporting knowledge engineers in modeling new ontologies or evolving existing ones. The current solutions, whether based on statistical correlations or relying on external corpora, often do not fully exploit the terminology of the knowledge base. Stemming from this consideration, we have been investigating alternative methods for eliciting disjointness axioms from existing ontologies based on the induction of terminological cluster trees, which are logic trees in which each node stands for a cluster of individuals that emerges as a sub-concept. The growth of such trees relies on a divide-and-conquer procedure that assigns, to the cluster representing the root node, one of the concept descriptions generated via a refinement operator and selected according to a heuristic based on minimizing the risk of overlap between the candidate sub-clusters (quantified in terms of the distance between two prototypical individuals). Preliminary works have shown some shortcomings that are tackled in this paper.
To tackle the task of disjointness axiom discovery, we have extended the terminological cluster tree induction framework with various contributions, which can be summarized as follows: 1) the adoption of different distance measures for clustering the individuals of a knowledge base; 2) the adoption of different heuristics for selecting the most promising concept descriptions; 3) a modified version of the refinement operator that prevents the introduction of inconsistency during the elicitation of the new axioms; 4) the integration of a framework for distributed and efficient in-memory processing, namely Spark, for scaling up the set of candidate concepts generated through the refinement operator. A wide empirical evaluation showed the feasibility of the proposed extensions and the improvement with respect to alternative approaches.
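The split-selection step sketched in the abstract (candidate concepts partition a cluster, and the candidate whose sub-cluster prototypes lie farthest apart is preferred, minimizing the risk of overlap) can be illustrated with a minimal Python sketch. All names here (`medoid`, `best_split`, `covers`, `dist`) are illustrative assumptions, not the authors' actual implementation:

```python
# Hypothetical sketch of the TCT split-selection heuristic: among candidate
# concept descriptions from a refinement operator, choose the one whose
# positive/negative sub-clusters have the most distant medoids (prototypical
# individuals), i.e. the lowest estimated risk of overlap.

def medoid(cluster, dist):
    """Individual minimizing the total distance to all others in the cluster."""
    return min(cluster, key=lambda x: sum(dist(x, y) for y in cluster))

def best_split(cluster, candidates, covers, dist):
    """Pick the candidate concept maximizing medoid separation.

    cluster    -- individuals at the current tree node
    candidates -- concept descriptions produced by a refinement operator
    covers     -- covers(concept, individual) -> bool (instance check)
    dist       -- dissimilarity measure between individuals
    """
    best, best_sep = None, -1.0
    for concept in candidates:
        pos = [a for a in cluster if covers(concept, a)]
        neg = [a for a in cluster if not covers(concept, a)]
        if not pos or not neg:  # degenerate split: skip this candidate
            continue
        sep = dist(medoid(pos, dist), medoid(neg, dist))
        if sep > best_sep:
            best, best_sep = concept, sep
    return best
```

In the paper's setting the instance check is performed by a reasoner over the knowledge base and the dissimilarity is one of the measures discussed in the paper; here plain predicates and numbers stand in for both.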
Tags:
Reviewed

Decision/Status:
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 23/Feb/2020
 Suggestion: Accept
Review Comment: I was asked to review this new version of the paper, which was adapted according to the reviewer comments from an earlier submission of the paper. I have carefully studied the new version, particularly w.r.t. my earlier comments about the paper, and am satisfied with the improvements made. The only question you had was w.r.t. my request for more information about the ontologies. As discussed w.r.t. earlier comments, I was interested in the relation between your method and the complexity of the ontologies you applied your methods to. Therefore, I had asked for more details about the practical complexity of the ontologies, e.g. what the average size of the axioms was, and what type of operators were really used. But that might not add that much to the paper, I guess. So, I am happy to recommend accepting the paper in its current form.
Review #2
Anonymous submitted on 12/Mar/2020
 Suggestion: Minor Revision
Review Comment: The new version submitted by the authors has addressed satisfactorily most of the issues raised by the reviewers in the first round. Let me first say that I appreciate a lot the effort that the authors have made to improve the content and tighten the writing. The manuscript has improved a lot and is now well-organized and reads generally well. Yet, there is one thing that needs to be fixed and there are a number of minor things to fix that I mention below.
The major thing is that I am still not convinced about the Spark-based approach to computing the cluster tree. This part is not really strongly related to the other parts of the paper, that is, neither the algorithm for constructing TCTs nor the evaluation. Further, I really do not get an intuition of how the computation is sped up. I understand that the specialization procedure is computed in a distributed fashion, but I do not see any details in the paper to understand what exactly is distributed and how one avoids recomputing the same refinements in different threads. To me the solution seems to be a dynamic programming approach rather than a distributed approach. In any case, I am not convinced about this section and think that it should be removed from this paper to focus the paper on one contribution and for the paper to gain clarity. As it stands, the explanations are too sparse. This should be published either in a separate paper or not at all, as I do not see any particularly strong contribution in this part, which seems quite straightforward.
Now the minor issues:
- Abstract: "In the context Semantic Web context" -> double mention of "context"; "nonneglegible constraints" => unclear what non-neglegible means here
- Page 2: "The effectiveness of the mentioned complex inference services" => complex in which sense? Unclear. The sentence starting with "Reasoning under open world-semantics..." is unclear due to many brackets.
- Page 3: Unclear what "The former" refers to in "The former (with a similar structure...". Second column: "Despite... there are some issues" *no comma* "that were not"; further below: "indicates a truly erroneous axiom" *no comma* "or a special case"
- Page 6, Algorithm 1: CS in Induce(I, C, CS) is not introduced as input
- Page 9: Not clear what the following means: "It is important to avoid the generation of satisfiable concept descriptions for which the training individuals exhibit a neutral membership". What is neutral membership? Example 4: "Let us *s*uppose" (lower-case s). Second column: "to process such kind of data by means a transparent approach" => by means of a transparent approach
- Formula 3 on page 10: Why is F not part of the index of $\pi$?
- Page 11: Below formula 6, the bracketing for \pi(a) is wrong. It should be \pi_{(a)} but the authors seem to have written \pi_(a) in LaTeX. Second column: "for gathering concepts descriptions" => gathering concept descriptions
- Page 13: "The distance measure .. was selected from the family" *no comma* "with a context of features". "In all cases but the first release ..." -- this sentence seems grammatically odd
- Page 14: "For eliciting the target axioms..." -- this sentence reads oddly
- Page 16: "instances of C \cap D exceed..." -- the full stop is on the next line. "This could yield to limit" => awkward formulation
- Page 17, bottom: "In the experiments with GEOSKILLS both all methods" => "both all" sounds weird
- Page 18: "in the experiment on the original KBS" => KBs. "This depended on the complexity" *no comma* "in terms of syntactic length of the ..." -- odd sentence => "This depended on the syntactic length of the concept descriptions generated and the threshold on the number..."
- Page 20, Conclusion: "between the farthest elements of a cluster w.r.t. the medoid the other cluster resulting from" => grammatically odd sentence; "refinement operator with the one used in he previous" => in the previous version