# Ontology Understanding without Tears: The summarization approach

### Tracking #: 1248-2460

Authors:
Georgia Troullinou
Haridimos Kondylakis
Dimitris Plexousakis

Responsible editor:
Guest Editors ESWC2015

Submission type:
Full Paper
Abstract:
Given the explosive growth in both data size and schema complexity, data sources are becoming increasingly difficult to use and comprehend. Summarization aspires to produce an abridged version of the original data source highlighting its most representative concepts. In this paper, we present an advanced version of the RDF Digest, a novel platform that automatically produces and visualizes high quality summaries of RDF/S Knowledge Bases (KBs). A summary is a valid RDFS graph that includes the most representative concepts of the schema, adapted to the corresponding instances. To construct this graph our algorithm exploits the semantics, the structure of the schema and the distribution of the corresponding data/instances to initially identify the most important nodes. Then we explore how to select the edges connecting these nodes by maximizing either locally or globally the importance of the selected edges. The performed evaluation demonstrates the benefits of our approach and the considerable advantages gained. Furthermore, we present our first steps into enabling summary exploration through extensible summaries.
Revised Version:
Tags:
Reviewed

Decision/Status:
Major Revision

Solicited Reviews:
Review #1
By Szymon Klarman submitted on 27/Dec/2015
Review #2
By Silvio Peroni submitted on 12/Feb/2016
Review #3
Anonymous submitted on 18/Mar/2016
Suggestion: Major Revision

Review Comment:

General Evaluation

This paper introduces algorithms that take as input an RDF schema S and an RDF dataset I complying with S, and produce as output a summary of S, that is, a smaller RDF schema containing the most representative concepts of S. To do so, the proposed algorithms exploit the graph-like structure of S and the rdf:type declarations between instances of I and classes of S. The paper also contains an evaluation of the performance of the algorithms using real-world RDF schemas and datasets. The experimental results show that the algorithms obtain expected summarisations and, when compared to other summarisation tools, the results are similar or better. The approach is fairly novel, and has been evaluated using real-world datasets and compared to other state-of-the-art approaches. There are, however, some issues that should be addressed by the authors. Below I give general remarks, then more specific comments, and pose some questions to the authors.

First of all, readability, which is also related to technical soundness. The paper is hard to follow, mainly because of a lack of rigour. Many of the formulas given in the paper are incomplete, since they are completed using words. It is important to provide intuitive explanations of formulas, but the formulas themselves should be given explicitly. Take, e.g., the relative cardinality in Definition 3, $RC(e(v_{i_0},v_{j_0}))$ for a given $e(v_{i_0},v_{j_0})\in E$: the numerator should be (if I am not mistaken) $\{r(n_i,n_j) \in R : \tau_p(r(n_i,n_j)) = \lambda_p(e(v_{i_0},v_{j_0}))\}$ instead of just $\{r_m(n_i,n_j)\}$. The specific value of alpha should be given too. As another example, take Definition 9. Where are DistinctValues(e) and Instances(e) formally defined? This is actually misleading because $e \in E$ corresponds to a property, and properties have no instances (classes have instances).
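For concreteness, the reviewer's reading of relative cardinality can be sketched in a few lines of Python. This is only an illustration of the reviewer's reconstruction: the triple representation, the helper name, and the normalization by the total number of instance triples (and the role of alpha) are assumptions, since the paper's exact denominator is not reproduced in the review.

```python
def relative_cardinality(schema_edge_label, instance_triples, alpha=0):
    """Relative cardinality RC(e) per the reviewer's reading of Definition 3:
    count the instance triples r(n_i, n_j) whose property label tau_p equals
    the label lambda_p of the schema edge e. The normalization and the role
    of alpha are assumptions here -- the review notes that the paper never
    states alpha's value."""
    matching = sum(1 for (_s, p, _o) in instance_triples
                   if p == schema_edge_label)
    return matching / (len(instance_triples) + alpha)

# Toy instance graph: two of three triples instantiate the :worksFor edge.
triples = [("n1", ":worksFor", "n2"),
           ("n3", ":worksFor", "n4"),
           ("n1", ":knows",    "n3")]
rc = relative_cardinality(":worksFor", triples)
# With alpha = 0 this is simply 2/3.
```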
I suggest that the authors revise all the definitions of the paper and give complete and formal versions of the formulas.

Concerning the evaluation, the authors have used real datasets and have compared their algorithms with other existing tools, which is very good. However, even though their approach is based on instances, they have only used one knowledge base with instances (CIDOC-CRM). I think it is important to provide experimental results using another knowledge base with instances to reinforce the conclusions of the evaluation. This could also clarify whether there is a real difference between using the CM or RM algorithms when instances are available, which is not clear in the case of CIDOC-CRM. I do not agree with the authors when they say that using instances when comparing their approach with the other chosen methods (Peroni et al. and Queiroz-Sousa) would not be fair. Exploiting instances is an added value of their approach and, as such, should be used.

The two proposed algorithms are claimed to be correct and to always produce a single result. Proofs (or sketches of the proofs) of these results should be provided. In Section 2, validity constraints are introduced as a way to ensure the uniqueness of summaries. However, this issue is not discussed later.

Concerning Section 8, I consider that a journal paper should not contain a section of "work in progress". Since the authors are not far from finishing this work, I suggest that they complete it and include it in the paper.

Specific Remarks and Questions to the Authors

The role of RDFS inference in this approach is unclear to me. How does this approach benefit from RDFS inference? RDFS inference is not used for instance classification since "inference is implemented only at the RDF schema level" (Section 2). But the measure of relative cardinality, for example, uses available instances, and these may be incomplete due to the lack of reasoning. Please elaborate more on this.
The input of the proposed algorithms is an RDF schema and an RDF dataset complying with the schema. However, in practice, an RDF dataset usually makes use of more than one vocabulary, and vice versa, the same vocabulary may be used by different datasets. What are the implications of this for the current approach?

Concerning the evaluation, although the output of the algorithms is an RDF schema summary of classes and properties, the reference summaries only contain classes. How is the quality of the outputted properties assessed?

This approach clearly makes the unique name assumption, and there should be some discussion of it in the paper.

The validity constraints should be discussed in more detail. At first glance, they seem to limit the scope of the proposed approach. Type uniqueness, for instance, will not hold for all real datasets. If, for any RDF/S knowledge base, it is always possible to build an equivalent and valid RDF/S knowledge base, this should be mentioned in the paper, and it should also be explained how this can be done.

Concerning the formal part of the paper, the following points should be argued and/or solved. Concerning H=(N,<) in Section 2, first, <= should be used instead of < as the subclass/subproperty relations are reflexive. Also, H is not a poset but a preorder: the fact that two classes (properties) subsume each other does not mean they are equal, but equivalent. In addition, the authors write that "the domain of property p, i.e. domain(p) is a class". Does this mean that $domain(p) \in C$? The set C, however, is a set of class names, and, in RDFS, there might be more than one domain declaration, so domain(p) might be an intersection of classes, which, in turn, is beyond the expressivity of RDFS. Ditto for range(p). Moreover, I guess that by "smallest" the authors mean "minimal", and that by "literal" they mean "literal type name".
Finally, as a suggestion, I would explicitly say that $C \cup P$ is a disjoint union (even writing $C \uplus P$). Since the input of the summarisation algorithms will be an RDF/S knowledge base, i.e. an RDF schema S and an RDF dataset D that complies with S (two sets of triples), I suggest that the authors modify Definitions 1 and 2 and provide explicit definitions of an RDF schema graph associated with an RDF schema S, and an RDF instance graph associated with an RDF dataset D. These should be exhaustive (e.g. the hierarchy H built from S should be explicitly given). Now, if D complies with S, then, using the authors' terms, the two graphs will be "correlated" via the functions $\lambda_{p}$ and $\tau_{p}$, and $\lambda_{c}$ and $\tau_{c}$. But this will be a consequence, not a condition, of the definition. In its current version, Definition 2 uses elements of an RDF schema graph ($\lambda_{p}$, $\lambda_{c}$). Actually, Definition 2 defines an RDF instance graph I complying with an RDF schema graph S, and thus S should be given and fixed at the beginning of Definition 2. Moreover, what are e, $v_i$ and $v_j$? It should be written that for every $r(n_i,n_j)\in R_{I}$ there exists $e(v_i,v_j)\in E_S$ such that $\tau_p(r(n_i,n_j)) = \lambda_p(e(v_i,v_j))$. Is $\tau_c$ a function or a relation? There could be more than one rdf:type (not rdfs:type, as written in Definition 2) declaration for a given instance.

In Definition 1, $\lambda_{p}$ should be defined before it is used (second bullet). Ditto for Definition 2 and $\tau_{c}$. Also, is it $\lambda_{p}$ or $\lambda_{P}$? In Definition 1, $(s,p,o) = URIs \times URIs \times URIs$ should be $(s,p,o) \in URIs \times URIs \times URIs$. Ditto $(s,p,o) \in URIs \times URIs \times (URIs \cup Literals)$ in Definition 2. Avoid using technical terms that have not been defined previously (e.g. "interpretation" of p in Definition 2). In logic, the interpretation of a property has a very precise meaning, and it is not its domain/range.
Wouldn't it be better to consider a multigraph structure, since in an RDF/S knowledge base two URIs may be linked by more than one property? Aren't V and E in Definition 1 finite too? In Definition 4, the sum should be $\sum\limits_{i=1}^m$ and not $\sum\limits_{1}^m$ (check Definition 5 too). Actually, since the numbers of incoming and outgoing edges could be different, I suggest writing $C_{in}(v_0) = \sum\limits_{e(v_0,v)\in E} RC(e(v_0,v)) \ast w_p$ and $C_{out}(v_0) = \sum\limits_{e(v,v_0)\in E} RC(e(v,v_0)) \ast w_p$ instead. As written, $w_p$ seems to be constant and not to depend on e, so please change the notation. Also, the values of $w_p$ used in the experiments are not given in Section 6. In Definitions 8 and 10, specify where ${p(v_i\to v_j)}$ and ${p'(v_i\to v_j)}$ belong. Shouldn't it be "there exists ${p(v_i\to v_j)} \in E'$ such that there is no ${p'(v_i\to v_j)} \in B$..."?

Minor Comments

The authors often use "i.e." when it should be "denoted by" (e.g. "i.e. $RC(e(v_i,v_j))$" in Definition 3). Section 8 contains the same paragraph twice (!).
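The two direction-separated sums the review proposes can be written out directly. A minimal sketch, following the reviewer's formulas literally; the toy schema, the edge weights, and the dictionary-based representation are illustrative assumptions, and $w_p$ is deliberately per-edge, which is the reviewer's point:

```python
def closeness_sums(v0, edges, rc, w):
    """The reviewer's suggested rewriting of Definition 4: sum the
    relative-cardinality-weighted edges incident to v0, separating the two
    directions explicitly. `edges` is a set of (source, label, target)
    schema edges; `rc` maps an edge to its relative cardinality and `w`
    maps an edge to its property weight w_p (per-edge, addressing the
    remark that w_p should not look like a constant)."""
    c_in = sum(rc[e] * w[e] for e in edges if e[0] == v0)   # edges e(v0, v)
    c_out = sum(rc[e] * w[e] for e in edges if e[2] == v0)  # edges e(v, v0)
    return c_in, c_out

# Toy schema: Person --worksFor--> Company, Company --owns--> Asset.
edges = {("Person", "worksFor", "Company"), ("Company", "owns", "Asset")}
rc = {("Person", "worksFor", "Company"): 0.6, ("Company", "owns", "Asset"): 0.4}
w  = {("Person", "worksFor", "Company"): 1.0, ("Company", "owns", "Asset"): 0.5}
c_in, c_out = closeness_sums("Company", edges, rc, w)
# c_in = 0.4 * 0.5 = 0.2; c_out = 0.6 * 1.0 = 0.6
```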

This paper describes two methods (1. SummaryCM, an existing work; 2. SummaryRM, a new approach with added support for blank nodes) for summarizing RDF/S knowledge bases. The authors claim that their methods give summaries that correlate better with those prepared by human experts. In my opinion, the work is good with respect to the relevance of the topic and its presentation. The approaches are described using formal notations and are clearly explained. However, the manuscript requires a minor revision before publication.

One of my main concerns is how far the first method differs from your previous work [11] -- though you have included algorithm complexity details etc., I think a clear distinction or a summary (of the details of the method) is necessary. Many of the contributions mentioned at the end of the Introduction (as the current work's contributions) coincide with your previous work [11].

"Specifically the contributions of this paper are the following:" // you may change this sentence.

"Our previous work could not handle blank nodes. However...." // you have given a positive appeal for including blank nodes, but it turns out they have a negative impact on the generated summaries. I think you should briefly cover this point in the introduction.

* encountered many typos

* a few sentences need rewriting, to give more clarity!

* a few notation and reference issues

* clarity of the algorithms used

Detailed review:

---

Section 2: Preliminaries (Definition 1)

\lambda_{c} or \lambda_{C}??

\lambda_{p} or \lambda_{P}??

---

Definition 2 (last paragraph): Font of the "C" (in c \in C) looks different.

---

Section 3.1: Reference 13 is misleading.

---

Section 3.1.2: last paragraph

In "We consider...the latter"

"the former" -- is missing

You meant the other way around?

Which is more important? User-defined properties, right?

Kindly give more clarity to the sentence: "This is partly because the user defined properties correlate classes, each exposing the connectivity of the entire schema, in contrast to the hierarchical RDF/S properties."

---

Two sections with same titles (3.1 and 3.1.3) may confuse the reader.

---

In Algorithm-1 some functions look too abstract. E.g., the function path_with_max_cov(B, S, vi) -- provide more details.

---

“The correctness of the algorithm is proved by construction.” -- please give clarification.

---

Page-7 last paragraph.

To identify... the complexity "of" --missing-- its various components

---

In Definitions 8 and 10, "p" and "p'" should be italicized uniformly.

---

"Kruskal's greedy algorithm [16] is among the most efficient ones and we are using it in our implementation." --- efficient in what sense?
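On the quoted sentence: one concrete sense in which Kruskal's algorithm is efficient is its O(E log E) running time, dominated by the edge sort, when cycle detection uses a union-find structure. A minimal sketch of the maximum-weight variant (presumably what selecting the most important edges requires; the toy graph is illustrative):

```python
def kruskal_max_spanning_tree(nodes, weighted_edges):
    """Kruskal's greedy algorithm: sort edges by weight (descending here,
    since a summary wants to keep the heaviest edges) and add each edge
    unless it would close a cycle, tracked with union-find. The sort
    dominates, giving O(E log E) overall."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for u, v, w in sorted(weighted_edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:          # joining two components: no cycle is formed
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Toy graph: the maximum spanning tree keeps the two heaviest acyclic edges.
edges = [("a", "b", 3.0), ("b", "c", 2.0), ("a", "c", 1.0)]
tree = kruskal_max_spanning_tree({"a", "b", "c"}, edges)
# tree == [("a", "b", 3.0), ("b", "c", 2.0)]
```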

---

In Algorithm-2

What is "N" in line 7?

What is "r" in line 4?

"the result of our algorithm for a specific input is unique as well." --- how is this possible? Are the relevance values of all the properties unique?

---

Section-6

In-text repetitions of the class and property counts can be removed, since they are given in Table 1.

---

In Table 1, giving Property and User property counts together looks redundant, since one is a subset of the other. You may give RDF standard property and User property counts instead (makes more sense).

---

Section 6.2

---

Eq. for Sim(.)

You may use A for the automatically produced summary and R for the reference summary -- improves readability.

---

Page-17 1st column last paragraph

"As we will show latter --later-- the way that this value increases as the size of the summary becomes bigger gives us"

---

In Section 6.3: To evaluate..."these" four ontologies.

Which ones? The ontology names are too far back.

Section 6.3

Paragraph-2:

“We have to note that whereas the reference summaries on these” -- rewrite

Section 6.3

Paragraph-2 ending: It would be very interesting if you could include some examples of the selected classes / summary examples as an Appendix section.

---

In Fig. 7 you have given statistics for the Bank Ontology; do you mean the Financial Ontology? --- this mistake occurs in many other sections.

"biosphere" -- be consistent: use "BIOSPHERE" or "Biosphere"

---

Section 6.3:

For example, the Aktors Portal ontology contains a huge amount of blank nodes, and when considered by the SummaryRM, as shown in Fig. 6, the quality of the result is worse than the summary created by SummaryCM. //rewrite -- need more clarity

---

Section 6.4:

---

Section 6.5:

Using a comma (instead of a period) is a little confusing: is 1,29 equal to 1.29 seconds or 129 seconds?

"Finally we can see that SummaryRM is more efficient that --- "than" --- SummaryCM since the latter has to assess for each" // efficiency in terms of what?

---

Section 7:

"these works (e.g. [9], [10]) provide a list of the more important nodes, whereas others [8], [9], [17] and our approach, create a valid summary schema." // reference [9] is included in both types??

---

Section 8:

1. Italicize e in Def. 9: "values of the *e*.."

2. The paragraph "In our running example,..... 'E57 Material'" is repeated on Page 15.

3. In the paragraph above Definition 10: ...according "to" their instances... -- "to" is missing

Some relevant related works are missing:

[1] Wu, G.; Li, J.; Feng, L.; and Wang, K. 2008. Identifying potentially important concepts and relations in an ontology. In International Semantic Web Conference, 33–49