Ontology Understanding without Tears: The summarization approach

Tracking #: 1396-2608

Georgia Troullinou
Haridimos Kondylakis
Evangelia Daskalaki
Dimitris Plexousakis

Responsible editor: 
Guest Editors ESWC2015

Submission type: 
Full Paper

Abstract:
Given the explosive growth in both data size and schema complexity, data sources are becoming increasingly difficult to use and comprehend. Summarization aspires to produce an abridged version of the original data source highlighting its most representative concepts. In this paper, we present an advanced version of RDF Digest, a novel platform that automatically produces and visualizes high quality summaries of RDF/S Knowledge Bases (KBs). A summary is a valid RDFS graph that includes the most representative concepts of the schema, adapted to the corresponding instances. To construct this graph we designed and implemented two algorithms that exploit both the structure of the corresponding graph and the semantics of the KB. Initially we identify the most important nodes using the notion of relevance. Then we explore how to select the edges connecting these nodes by maximizing either locally or globally the importance of the selected edges. The extensive evaluation performed compares our system with two other systems and shows the benefits of our approach and the considerable advantages gained.

Minor Revision

Solicited Reviews:
Review #1
By Szymon Klarman submitted on 22/Jun/2016
Minor Revision
Review Comment:

I appreciate the authors’ responses and their effort put into revising the paper. I find most of the concerns I raised in my review to have been properly addressed. However, I still have the following issues that I’d like to be considered and resolved:

1) I’m afraid I am not convinced by the response regarding the use of RDF(S) semantics in computing the summary. Even if we restrict the claim to the schema graph alone, it does not hold. Let’s assume we deal with an empty instance graph and want to summarize only the schema graph. The algorithm makes references to certain measures, such as RC, Rel or Cov. All these measures evaluate the scores for the respective entities (edges, concepts, paths) by looking up the schema graph G_S only – not its deductive closure Cl(G_S). Because of that, whenever we have two syntactically different graphs G_S =/= G_S’, the resulting summary might be different for each of them, even when the graphs are semantically equivalent, i.e., when Cl(G_S) = Cl(G_S’). This follows directly from the definitions and the algorithm. And yes, the algorithm starts off with Cl(G_S), but since the measures only work over G_S this is not enough. So this either has to be changed, or it has to be made clear in the paper that the proposed method is largely syntactic – possibly making some use of the semantics, but not guaranteeing equivalent results for semantically equivalent graphs.
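A toy illustration of this point (plain triples and an invented structural "score", nothing from the paper under review): a measure computed on the raw graph G_S can separate two semantically equivalent graphs, while the same measure computed on Cl(G_S) cannot.

```python
# Toy sketch: a structural score computed on the raw schema graph G_S
# differs between two semantically equivalent graphs, but computing it
# on the closure Cl(G_S) makes it invariant. All names are illustrative.

SUB = "rdfs:subClassOf"

def closure(graph):
    """Deductive closure restricted to transitivity of rdfs:subClassOf."""
    g = set(graph)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(g):
            for (c, p2, d) in list(g):
                if p1 == p2 == SUB and b == c and (a, SUB, d) not in g:
                    g.add((a, SUB, d))
                    changed = True
    return g

def out_degree(graph, node):
    """A purely structural score: number of outgoing edges of `node`."""
    return sum(1 for (s, _, _) in graph if s == node)

# Two syntactically different but semantically equivalent schema graphs:
g1 = {("A", SUB, "B"), ("B", SUB, "C")}
g2 = g1 | {("A", SUB, "C")}          # adds a triple already entailed by g1

assert closure(g1) == closure(g2)    # Cl(G_S) = Cl(G_S'): equivalent
print(out_degree(g1, "A"), out_degree(g2, "A"))                    # 1 2
print(out_degree(closure(g1), "A"), out_degree(closure(g2), "A"))  # 2 2
```

On the raw graphs the score of A differs (1 vs 2), so any summary driven by it may differ too; on the closures it coincides.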

2) The notion of validity occurring on p.4 adds to the confusion regarding the semantics of RDFS. The definition of RDFS per se does not enforce any validity requirements related to the domain and range restrictions. It is the very idea of the open world architecture and logical entailment that ensures that if suitable assertions about certain individuals are not explicitly stated in the graph, they will be inferred from the schema axioms. But there’s no requirement for these assertions to be there to start with.

3) In Definition 1, how come the nodes (which should be concepts or datatypes, as I understand) can also be literals from the earlier defined set L?

Review #2
By Silvio Peroni submitted on 01/Jul/2016
Minor Revision
Review Comment:

I thank the authors for having considered all my comments in their revised version, and for having provided answers to all the issues I've highlighted.

While I'm reasonably happy with all the modifications they have implemented, there are a few points that should be discussed a bit further.

# About blank nodes

My previous comment: Again, about the use of blank nodes, there are some aspects that should be clarified as well. OWL ontologies usually adopt blank nodes for expressing class restrictions, group disjointness, etc. However, since these aspects are not considered at all in the algorithms proposed by the authors – at least, as far as I understood – it is not clear what they refer to with "blank nodes". Can they refer to individuals of a certain class that are not provided with a URL? This aspect should be explicitly discussed in the paper.

Authors' answer: Blank nodes do exist both in OWL and RDF/S ontologies. In RDF, a blank node is a node in an RDF graph representing a resource for which a URI or literal is not given.

Further comment: Yes, in RDF (which is the most generic framework involved in the authors' work, of course) that is the case. However, blank nodes can have specific meanings when we look at OWL in particular. OWL, of course, is still based on RDF, but also uses blank nodes for specific purposes, namely:

1. for allowing the description of individuals (i.e. instances of a particular class) without providing any explicit URL - and those are part of the ABox of the ontology;

2. for creating class/property/datatype restrictions, i.e. particular ontological entities that are functional to the organisation of the ontological concepts from a TBox perspective.

Thus, since the analysis presented by the authors actually uses OWL ontologies, I think that keeping in mind the ABox/TBox distinction for blank nodes (e.g. by excluding TBox blank nodes from the computation) is important and could substantially affect the outcomes of the algorithms presented.

If the authors still believe, however, that it is good for their purposes to use blank nodes independently of this distinction, I would suggest adding some explanatory text to the paper explaining it.

PS: blank node (i.e. anonymous resource) != literal
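A minimal sketch of the distinction (plain triples and hypothetical names, not the paper's data): a TBox blank node introduced by an owl:Restriction versus an ABox blank node standing for an anonymous individual. Blank nodes are written here as strings starting with "_:".

```python
# Sketch of the ABox/TBox distinction for blank nodes: an owl:Restriction
# (TBox) versus an anonymous individual (ABox). Illustrative data only.

RDF_TYPE = "rdf:type"
OWL_RESTRICTION = "owl:Restriction"

triples = [
    # TBox blank node: an owl:Restriction used in a class axiom
    ("ex:Person", "rdfs:subClassOf", "_:r1"),
    ("_:r1", RDF_TYPE, OWL_RESTRICTION),
    ("_:r1", "owl:onProperty", "ex:hasParent"),
    # ABox blank node: an individual with no IRI
    ("_:i1", RDF_TYPE, "ex:Person"),
    ("_:i1", "ex:name", '"Alice"'),
]

def is_bnode(term):
    return term.startswith("_:")

def tbox_bnodes(triples):
    """Blank nodes typed as owl:Restriction (one simple TBox criterion)."""
    return {s for (s, p, o) in triples
            if is_bnode(s) and p == RDF_TYPE and o == OWL_RESTRICTION}

def abox_bnodes(triples):
    """All remaining blank nodes: anonymous individuals."""
    tbox = tbox_bnodes(triples)
    return {t for (s, p, o) in triples for t in (s, o)
            if is_bnode(t) and t not in tbox}

print(tbox_bnodes(triples))  # {'_:r1'}
print(abox_bnodes(triples))  # {'_:i1'}
```

A summarization algorithm could then, for instance, skip the `tbox_bnodes` while still counting the `abox_bnodes` as ordinary instances.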

# About the new evaluation

My previous comment: The fact that the authors use the CIDOC-CRM core as summary of the full CIDOC-CRM is not convincing at all, and rather seems a simplification to me. Since the selection of the summaries for the other ontologies has been done by humans following a particular protocol, I would like to see a similar approach also for the CIDOC-CRM one, so as to basically compare summaries that have been done in the same way. In addition, that will result in a clear additional contribution of the work. Note that, since the source of KCE is available, the authors should also consider at least that method in the evaluation of CIDOC-CRM.

Authors' answer: According to the reviewers' comments, we conducted a new user study using three new ontologies (CRMdig, LUBM and eTMO), with and without instances, where three ontology experts generated the reference summaries for each ontology. In addition, we included KCE in the corresponding evaluation as well.

Further comment: It is evident that the three experts involved in this new evaluation are not the authors (but I found this information only in the acknowledgements) and, thus, are not biased at all by how the authors' algorithm actually works. This should be explicitly stated in the paper.

In addition, the number of experts involved in this new evaluation is lower than that used in [10] and, thus, I would suggest making this explicit as well.

# About the low quality of blank nodes

My previous comment: There is no evidence of "low quality" in [18] related to the use of blank nodes in ontologies. Actually, blank nodes are used to define class restrictions, and thus are rather useful and surely not of "low quality".

Authors' answer: The corresponding statement was removed from the paper. However [...] "Blank nodes are treated as simply indicating the existence of a thing, without using, or saying anything about, the name of that thing" [...] "We discourage the use of blank nodes. It is impossible to set external RDF links to a blank node, and merging data from different sources becomes much more difficult when blank nodes are used. Therefore, all resources of any importance should be named using URI references."

Further comment: The reported citations are indeed correct. However, they mainly refer to the use of blank nodes in the RDF / Linked Data domain. There, blank nodes are indeed "evil"! However, in the context of OWL ontologies, since they are the main tool for creating complex assertions such as restrictions (see my comment above), they are necessary, useful, and good. It strictly depends on the particular perspective from which one looks at them.

However, given this comment, I'm starting to think that the authors refer to them to say "individuals with no IRI associated" (see point 1 in "About blank nodes"). Thus, what happens to the other blank nodes during the algorithm process (i.e. those related to the TBox of the ontology)?

# Blank nodes and evaluation

Authors' text: The only case that our algorithms are worse than the other two algorithms is in the case of the Aktors Portal ontology. By trying to understand the reasons behind this, we identified that the Aktors Portal ontology contains a huge amount of blank nodes and this has a direct effect to the quality of our constructed summary, despite the fact that both our algorithms consider them when calculating the summary schema graphs.

Further comment: That is interesting indeed, and I suspect that this result concerns the issue I've already highlighted in "About blank nodes". This should be discussed in a bit more detail in the paper.

# Minor changes

I would suggest changing the sentence "It is well-known [22] that low-level deltas can be used..." to something along the lines of "In [22], the authors say that low-level deltas can be used...".

Review #3
Anonymous submitted on 15/Jul/2016
Review Comment:

General Evaluation

The authors have addressed all the remarks I made in my previous review. The formalisation of the approach is now in a much better shape, the authors have added proofs to the theorems and new experimental results.

In my opinion, the paper is very close to being ready for publication. Below I include some remarks that should be considered for the final version of the paper.

Specific Remarks to the Authors

  • Sometimes you speak of an RDF KB, and other times of an RDF/S KB (even in Definition~1). Please, be consistent.
  • In Definition 1, the first $L$ refers to datatypes, and the second one to literals. Please, choose different symbols.
  • Concerning Definition 2:
    • As a suggestion, you could write "the relative cardinality of an edge $p(v_i,v_j) \in E_S$" instead of "the relative cardinality of $p(v_i,v_j)$ $v_i, v_j \in G_S$". Also, I would write $p(v,w)$ and $p(n,m)$ for generic edges in $E_S$ and $E_I$, respectively (most of subscripts in the formula are actually unnecessary).
    • Isn't it $\lambda(r_p(n_i,n_j)) = \lambda(p(v_i,v_j))$ instead of $\tau_c(r_p(n_i,n_j)) = p(v_i,v_j)$? And since $\tau_c: I \to 2^C$, isn't it $v_i \in \tau_c(n_i)$ instead of $\tau_c(n_i) = v_i$?
    • Also, the condition should ensure that the denominator, and not the numerator, is not 0.
    • Inside the absolute value, you must write a set (a set with a defining condition in your case).
    • To sum up, if I'm not wrong, the definition of $RC(p(v_0,w_0))$ for a given $p(v_0,w_0)\in E_S$ should be something like (assuming $E_S \not= \emptyset$): if $\{ p(n,m) \in E_I : v_0 \in \tau_c(n) \} \not= \emptyset$ or $\{ p(n,m) \in E_I : w_0 \in \tau_c(m) \} \not= \emptyset$ then
    • $$
      RC(p(v_0,w_0)) = \frac{1}{| \{p(v,w) \in E_S\} |} + \frac{| \{p(n,m) \in E_I : \lambda(p(n,m)) = \lambda(p(v_0,w_0)), v_0 \in \tau_c(n) \mbox{ and } w_0 \in \tau_c(m) \} |}{| \{p(n,m) \in E_I : v_0 \in \tau_c(n) \} | +
      | \{p(n,m) \in E_I : w_0 \in \tau_c(m) \} |}
      RC(p(v_0,w_0)) = \frac{1}{| \{p(v,w) \in E_S\} |}

  • Concerning Definition 4, $c_1,\ldots,c_n\in V_S$ instead of $c_1,\ldots,c_n\in E_S$ (same for $c'_j$), which is actually redundant if you impose $p(c_i,c)\in E_S$. But better writing $c_1,\ldots,c_n\in V_S \cap C$.
  • In Algorithm~1, the initialisation of \textit{Nodes} is missing (Line~5). Also, isn't it $\mathit{RemNodes}:= \mathit{Nodes}$? Shouldn't Line~7 be: While $\mathit{RemNodes} \not= \{\}$?
  • In 6.2.1, if there was no agreement between the experts, which reference summary did you choose for the evaluation? Please, clarify it in the paper.
  • Minor Comments and typos:

    • Please, write all mathematical symbols/letters in italic (e.g. $V$ instead of V).
    • When citing, write, e.g., [6,7] instead of [6][7].
    • Write $G_I$ instead of $G_i$ whenever you write $V=(G_S,G_i,\lambda,\tau_c)$.
    • Page 2, "user-defined and standard RDF properties" should be "user-defined and standard RDF-S properties".
    • Page 3, "their datasets As such" should be "their datasets. As such".
    • Right parentheses are missing in $C_{out}(c)$ and $C_{in}(c)$ (Definition~3).
    • Page 5, "how central is a schema node in an RDF/S KB" should be "how central a schema node in an RDF/S KB is".
    • In Definition 5, "the following." should be "the following:".
    • In Definition 6, $G_{S'}$ should be $G'_S$.
    • Page 10, RDF Accessor should be RDF Assessor.
    • Use KCM or Peroni et al. when referring to the tool, but not both.
    • Page 12, "did not agreed" should be "did not agree".
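For concreteness, the corrected RC definition given in the remarks on Definition 2 above can be read as the following sketch (taking the denominator sets to range over all instance edges whose subject is an instance of $v_0$ or whose object is an instance of $w_0$; all names are illustrative, not the paper's code):

```python
# Sketch of the corrected RC (relative cardinality) definition. E_S and
# E_I are sets of labelled edges (subject, property, object); tau_c maps
# an instance to the set of classes it belongs to. Illustrative only.

def rc(edge, E_S, E_I, tau_c):
    v0, p0, w0 = edge
    assert edge in E_S and E_S           # RC is defined for schema edges
    subj_side = [(n, p, m) for (n, p, m) in E_I if v0 in tau_c.get(n, set())]
    obj_side  = [(n, p, m) for (n, p, m) in E_I if w0 in tau_c.get(m, set())]
    base = 1 / len(E_S)
    if not subj_side and not obj_side:   # denominator would be 0
        return base
    # instance edges with the same label, properly typed on both ends
    matching = [(n, p, m) for (n, p, m) in E_I
                if p == p0
                and v0 in tau_c.get(n, set())
                and w0 in tau_c.get(m, set())]
    return base + len(matching) / (len(subj_side) + len(obj_side))

# Tiny hypothetical KB: one schema edge, two instance edges.
E_S = {("Person", "knows", "Person")}
E_I = {("a", "knows", "b"), ("a", "age", "42")}
tau_c = {"a": {"Person"}, "b": {"Person"}}

print(rc(("Person", "knows", "Person"), E_S, E_I, tau_c))  # 1/1 + 1/(2+1)
```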
Review #4
By Vinu Ellampallil Venugopal submitted on 21/Jul/2016
Minor Revision
Review Comment:

The current version of the manuscript is well written and the authors have addressed all the issues which I had raised. However, the following minor changes should be made before publication.

The paper should be carefully proof-read by the authors once again, I have noted a few typos.

Minor revisions:

Suggestion: It is easier to understand schema content using only the summary graph since it contains the most important/representative nodes out of the initial graph. // A one- or two-line explanation of the domain significance of the summary nodes would further motivate the reader.

p-5: "We consider as more important the user-defined ones.."// We consider the user-defined ones as more important..

Def.3- RC( -- closing ")" is missing. You may use \big) and \big(

p-5: "Consider now the "E38 Image" class shown in Fig. 1. Assume...." // Not clear what exactly you are trying to convey. Please recheck the definition reference and address the sentence-construction issues ("are not any instances" --> "are no instances").

Initially we calculate the Cl(G_S) and we need O(|V|^3). // Provide a reference or explain why it is O(|V|^3).
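If the closure computation the authors have in mind is a Warshall-style transitive closure over an adjacency matrix (my assumption; the paper should state or cite this), the O(|V|^3) bound falls out of the three nested loops:

```python
# Warshall's algorithm: transitive closure of a boolean adjacency matrix.
# Three nested loops over the |V| vertices give the O(|V|^3) running time.
def warshall_closure(adj):
    n = len(adj)
    reach = [row[:] for row in adj]   # copy, so the input is untouched
    for k in range(n):
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    return reach

# subClassOf chain A -> B -> C: the closure adds A -> C.
adj = [[False, True,  False],
       [False, False, True],
       [False, False, False]]
print(warshall_closure(adj)[0][2])  # True
```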

..allowing users to use the second algorithm only whereas soon it will be updated to allow both the algorithms to be selected and used...//allowing users to use only the second algorithm (soon it will be updated with the other algorithm)

In the implemented system (online link), for all the summaries, there exists a node named "class" which connects all the other classes in the summary. Is it really necessary to include a built-in concept in the summary, or is it generated as part of the summary result? In that case, can there be a case where some user-defined classes (in the summary) are not related to the so-called *class* node?

Bellow we will describe in detail the performed evaluation.// "Below"

"By trying to understand the reasons behind this, we identified that the Aktors Portal ontology contains a huge amount of blank nodes and this has a direct effect to the quality of our constructed summary" // In general, I wonder how the presence of blank nodes affects a *schema* summary. Blank nodes can be thought of as instances without names; in that sense they do have some significance (the same as that of every other instance).

The equation in 6.3.1, Stage 3 Evaluation Measures, should be written properly. Please confirm whether |V_{summary}| is present in the denominator, or did you mean (V_{summary} \cup \delta^{+}_{summ,ref}) \backslash \delta^{+}_{summ,ref} ---> V_{reference}?

V1 and V2 or V_{1} and V_{2}, be consistent in Sec.6.3.1.

All the best.

