Ontology Understanding without Tears: The summarization approach

Tracking #: 1248-2460

Authors: 
Georgia Troullinou
Haridimos Kondylakis
Evangelia Daskalaki
Dimitris Plexousakis

Responsible editor: 
Guest Editors ESWC2015

Submission type: 
Full Paper
Abstract: 
Given the explosive growth in both data size and schema complexity, data sources are becoming increasingly difficult to use and comprehend. Summarization aspires to produce an abridged version of the original data source highlighting its most representa-tive concepts. In this paper, we present an advanced version of the RDF Digest, a novel platform that automatically produces and visualizes high quality summaries of RDF/S Knowledge Bases (KBs). A summary is a valid RDFS graph that includes the most representative concepts of the schema, adapted to the corresponding instances. To construct this graph our algorithm exploits the semantics, the structure of the schema and the distribution of the corresponding data/instances to initially identify the most im-portant nodes. Then we explore how to select the edges connecting these nodes by maximizing either locally or globally the im-portance of the selected edges. The performed evaluation demonstrates the benefits of our approach and the considerable ad-vantages gained. Furthermore, we present our first steps into enabling summary exploration through extensible summaries.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Szymon Klarman submitted on 27/Dec/2015
Suggestion:
Minor Revision
Review Comment:

In this paper the authors present a novel method of summarizing RDF/S ontologies (schemas). I am not an expert in this area, but I find the proposed approach formally well-founded, mostly intuitive and rather convincing in terms of the results reported in the evaluation. The foundational work is accompanied by a documented implementation of the presented algorithms with a user-friendly interface accessible online, which allows for an easy experimentation. The problem is definitely interesting and relevant for the Semantic Web community. The approach appears novel and significant, particularly in the idea of exploiting instance data distribution (across different classes and properties) in order to determine the schema summary, and, in general, intertwining this data-oriented with a more common network-oriented take on assessing the relevance of different elements in the schema.

The content added on top of the original ESWC-15 publication seems sufficient and agrees with what the authors declare in the attached letter. Altogether, I believe the paper presents a rather interesting and valuable contribution. I have two more critical comments, expressed in detail below, which I would like the authors to address. I do not think these highlighted issues compromise the ideas and the results of this work in any significant manner, but they should be carefully and systematically tackled for the paper to be published in SWJ.

Major comments:
---------------

My first comment concerns the claim that the summaries are constructed taking into account the semantics of the RDF/S graphs (p.2). The extent to which this is the case (if it is the case at all) is not very clear to me. In what way does the semantics of the “logical” RDFS properties (rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range) influence the computation of the summary and in what way are these properties distinguished from user-defined properties, which do not have such specific “logical” semantics? It seems to me that this dimension is not effectively exploited at all, and the method would generate exactly the same summaries (up to renaming) if we uniformly replaced such RDFS properties with user-defined ones. The authors make a number of other references to the semantics of the RDF/S graphs, summarized below, but these remain rather inconsequential in my view (p.3):

- The definition of the concept and property hierarchies is given as H = (N, <), but such hierarchy is never referred to, except when the validity constraints are imposed on H (e.g., acyclicity of <), which do not seem to be really necessary or justified.
- The structural inference (subclass and subproperty inference) is evoked and the deductive closure Cl(S) under this inference is defined, but again this never used again in the paper.
- The inference of rdf:type relations on the instance level is skipped in order to avoid “over-loading superclasses”.
- The “typing” inference (i.e., classifying resources as classes or properties, respectively) is indeed used, but this one seems the least interesting and is certainly not the core of what one typically understands by the RDF/S semantics. And actually, the presence of the node called “class” in the summaries appears quite redundant, just adding some trivial clutter to the picture, and could be easily skipped from the user’s perspective.

In my opinion the claim about the full use of the RDF/S semantics in summary construction would be completely justified only if the summary of a graph S would be guaranteed to be always the same as the summary of any graph S’ which is obtained from S by applying any number of inferences. If this is not the case, then I think that the claim about using the semantics should be carefully qualified and further explained.

The second issue concerns the use of formal notation in the paper, particularly in the definitions of some key notions such as RDF/S schema and instance graphs. This notation appears somewhat clumsy and not always properly justified, which is pretty confusing for the reader. For instance:

- In Definition 1, the expression T_S = (s, p, o) = URIs x URIs x URIs is not the right way to write down that T_S is a set of triples, i.e. a subset of the product URIs x URIs x URIs. Similarly in Definition 2.
- In Definition 2, \tau_C is a function mapping every instance in the (instance) graph to a unique class being the type of this instance. What about instances that can possibly belong to multiple different classes? Why is this uniqueness requirement needed at all?
- Under Definition 3, does expression “connections e(v_i, v_j)” refers to the edges between some specific nodes v_i and v_j or all edges in the entire schema?
- In the equation in Definition 7, the denominator should probably be d_p(n_s, n_i).
- In Algorithm 1, line 7, does < > denote inequality? If so, I would be more comfortable with a more standard symbol.
- In Algorithm 1, how is ADJ exactly defined? It is not clear to me what happens exactly in steps 8-10. If node v belongs to the set TOP but not to ADJ (that’s what v \in TOP\setminus ADJ means) than why do we next add it to ADJ and remove from TOP. Also, the addition/removal should be expressed using proper set-theoretic notation ADJ \setminus {v} and TOP \cup {v}.
- The set-theoretic notation should also be used more consistently in Section 6.2 when describing sets of classes.

Other remarks:
--------------

- I congratulate the authors on having a “top 3 rated paper” at ESWC-15 (reference [11]), but that remark should be definitely removed from the bibliography :)
- The approach seems to rely heavily on the presence of domain and range restrictions on the properties in the schema. I believe that in practice such restrictions are not always (or even often) provided. What would happen in such case? Can the method still generate interesting summaries, especially when the instance level data is given? If this is the case, then this could be usefully emphasized and explained in the paper.
- The notions of ontologies/KBs/RDFS graphs should be carefully explicated and used consistently across the paper. On p.3 KB is defined as an RDF/S schema closed under type inference. Definitions 8 and 10, however, seem to distinguish a KB B from the schema graph S it contains.
- On p.3, “range(p), is either a class or a literal” – it should be “datatype” instead of “literal” – a literal is an instance of a datatype. Following the same convention, I would replace “container type” in Definition 2 with “datatype”.
- If space is not a major issue, I would enlarge Figure 1 to make it more readable (and reorient it, placing the summary under the original graph). Similarly Figures 5 and 9. In Fidure 7, I would recommend differentiating the symbols used on the four charts (instead of the same dot currently used on all of them) to ease reading the paper on black and white printout.
- The word “be” seems to be sometimes missing in the expressions of the form “Let X be sth”, e.g.: Definitions 8 and 10.
- The link http://www.larflast.bas.bg/ontology from footnote 4, p.10 is dead.
- In Section 6.1, the reference in [10] is used in a slightly awkward place. What is it supposed to suggest, that the experts followed the methodology described in this reference?
- In Section 6.2, there is a reference to the bibliographic item [0].
- As a strongly subjective suggestion from my side regarding the ongoing and future work - I would find it extremely intuitive to apply some sort of a “zoom” operation on the summary graph, when a user requests more details about some region. By zooming in on particular region, the user should be given the additional neighbours with the highest relevance, in a similar way as it is done in different online map applications where the user starts observing smaller streets, objects and names once he zooms in on the map. The notion of cardinality closeness inspired by instance matching scenarios feels a bit too application-specific – but it might be just my uneducated impression.
- Observations made in Section 6.4 in the current shape appear very superficial, although they do seem to hint on some potentially interesting issues. Would it be possible to elaborate a bit deeper on these initial insights here? Any additional explanation or even conjecture could be useful.
- The paper should be carefully proof-read, as some editorial revisions are definitely in order, e.g.:
-- p.2, “presented on a new, updated evaluation section” - > “in a new”.
-- p.3, “edges of I. (such that” -> no full stop after I.
-- p.3, “rdfs:type” -> “rdf:type”.
-- p.12, “whereas for some ontologies is high” -> “ontologies it is high”.
-- p.12, “the biggest the size” -> “the bigger the size”.
-- p.12, “observations here remains” -> “remain”.
-- p.14, “a user may be only interested about” -> “interested in”.
-- p.14, “more informative that frequent ones” -> “more informative than”.

Review #2
By Silvio Peroni submitted on 12/Feb/2016
Suggestion:
Major Revision
Review Comment:

In this paper, the authors introduce two algorithms for automatically providing summaries of ontologies. They have tested their approach against an existing set of ontologies already used in several past works (e.g., [9] and [10]), which are included in the evaluation. Finally, they also provided a section where they present a possible approach to select relevant concepts in the original ontology that may be relevant for exploring it starting from specific nodes included in the related summary.

This is a work that have been chosen by the organizing committee of the Extended Semantic Web Conference 2015 to be published in the corresponding special issue organized at the Semantic Web Journal. Thus, the authors had to provide an extension that should include new valid research contributions and insights of the overal work. However, by reading the paper, the actual new original contribution is not enough to deserve a publication in the journal.

In particular, they added:
- Definition 8;
- The computational cost of all the algorithms proposed;
- Section 4.2, where they introduce the new algorithm proposed;
- The results of the new algorithm, which have been added to the evaluation;
- A new metric for calculating the relevance of a summary and a related comparison;
- A sketched approach for exploring the full ontology starting from the concept in the summary.

What it is needed for having an appropriate extension to deserve a publication in the ESWC 2015 special issue is:
- Extending in an appropriate way the evaluation
* by including at least a variant of the first algorithm presented by the authors (which should include blank nodes);
* by using the algorithm presented in [10] in all the evaluations (including CIDOC-CRM with and without instances, and the execution times);
* by preparing a new evaluation - with a new user study - for analysing also the properties their algorithms return in the summary, which is not analysed and evaluated at all in the paper;
* by evaluating the metrics presented in section 8.
- Correct several conceptual errors in the paper about the nomenclature they used to describe ontologies (i.e., RDFS vs. OWL) and blank nodes, unsupported claims, and the features of the other algorithms involved in the experiment, w.r.t. [10] in particular;

Please find detailed comments as follows.

- There are some inconsistencies and issues in the content of the paper. First of all, the authors say that their focus is on RDFS ontologies, and explicitly highlight the fact that the handling of OWL ontologies is left to future works. However, what the authors actually are saying is that they consider only RDFS-like concepts/relations of the analysed ontologies (i.e., class, properties, sub-class hierarchies, sub-property hierarchies, domain and range classes), without caring about any additional construct that Semantic Web ontologies can made available, such as class restrictions, property characteristics, etc. This doesn't mean they consider only RDFS ontologies – in fact several of the ontologies they have used in their experiments are, indeed, OWL ontologies –, but simply that they just consider a limited set of axioms in their analysis. This should be clarified.

- While developing an approach for such limited set of axioms is totally fine for producing an ontology summary, it is not true at all that RDFS is the de-facto standard for publishing and representing data on the web. First of all, I've not found any strong evidence of this claim in the cited work (i.e., [3]). Second, a good part of the vocabularies included in [3] are actually OWL ontologies.

- Again, about the use of blank nodes, there are some aspects that should be clarified as well. OWL ontologies usually adopts blank nodes for expressing class restrictions, group disjointness, etc. However, since these aspect are not considered at all in the algorithms proposed by the authors – at least, as far as I understood – it is not clear what they refer to with "black nodes". Can they refer to individuals of a certain class that are not provided with a URL? This aspect should be explicitly discussed in the paper.

- The sentence (after definition 1) "if p is a property, the triple {p, type, property} can be inferred" seems a bit tautological to me, since to say (in RDFS) that p is a property one has to write exactly that triple – it is not an inference at all. Did the authors want to say that if p is used as a predicate in a triple, then p can be inferred as a property?

- Related to definition 1 and 2, it is not clear where the property rdf:type is involved in. To me, it should be used in both the schema graph and the instance graph. However, it seems it is used only in the latter one. This part should be clarified.

- In section 3.2.1 the author say that the weights of the in-centrality formula have been experimentally defined. How?

- I would suggest to split definition 2 in two (let's say def 2a and def 2b) so as to clearly define in and out centrality formulas. It is quite difficult to follow such text in the present form.

- In definition 8, the situation described in point 3 is always true to me: there exists always a path from a concept to another by going through the top concept (e.g., owl:Thing for classes) by means of the sub-class relations.

- Why did not the authors implement the blank node handling also for the first algorithm? How better would it work if blank nodes are then taken into account? I would like to see in the evaluation also the use of this kind of improved version of first algorithm – it should not be too much difficult to be implemented.

- In section 4.2 the authors refer to Definition 6, while I think it should be definition 9.

- I strongly suggest to provide examples for explaining the execution of the two algorithm graphically, it could be very useful for improving the understandability of the process.

- I've tried to run the tool provided in the URL in footnote 2, but it doesn't work with both the ontologies used in the evaluation nor any other ontology available online. This is an issue for the reproducibility of the experiment.

- There are some misinterpretation of some of the algorithms used in the comparison, in particular with the Key Concept Extractor (KCE) algorithm [10]. First, the fact that the ontologies Biosphere, Financial and AKTOR have been used to evaluate past works on RDFS summarisation is just wrong, since they are OWL ontologies indeed – I think that here the problem is about the "nomenclature" used by the authors, see my comment above about it. Second, the claim that the authors cannot uses ontologies with instances since the other methods do not consider them is simply false for KCE. In fact, the density of such algorithm does use individuals of concepts in its formula – which are specified with a quite low weight in [10] (even if the algorithm is parametric), but still they are used. The fact that the ontologies used in [10] for the evaluation don't include any individual doesn't mean that the algorithm is not able to process them. Thus, I strongly suggest to the authors to include also KCE in the evaluation with individuals. Note that the algorithm is actually implemented and released with an open source license (https://github.com/essepuntato/kce). In addition, it is also available as a web application (http://www.essepuntato.it/kce).

- After the similarity formula, it is not clear what "the set of classes" actually refers to. Are those classes in A or in S?

- The fact that, in the similarity formula, the authors give more importance to superclasses instead of subclasses is not supported by any strong analysis. What would it happen if we consider sub and super classes in the same way (let's say 0.5 and 0.5)? What would it happen if we give more consideration to sub classes instead of super classes? All these variants seem quite reasonable to me. I suggest to the authors to consider all of these different similarity measures in their analysis, or to provide a strong explanation of why superclasses deserve more importance.

- The fact that the authors use the CIDOC-CRM core as summary of the full CIDOC-CRM is not convincing at all, and rather seems a simplification to me. Since the selection of the summaries for the other ontologies has been done by humans following a particular protocol, I would like to see a similar approach also for the CIDOC-CRM one, so as to basically compare summaries that have been done in the same way. In addition, that will result in a clear additional contribution of the work. Note that, since the source of KCE is available, the authors should also consider at least that method in the evaluation of CIDOC-CRM.

- There is no evidence of "low quality" in [18] related to the use of black nodes in ontologies. Actually, blank nodes are used to define class restriction, and thus are rather useful and surely not so low quality.

- While a great work has been done by the authors from a pure algorithmic perspective about including properties in the summary they produce, there is no evaluation about them in the paper. This basically means that the summary returned by authors' algorithms are evaluated only partially. I totally understand that providing an evaluation like that means additional work - they should have to involve users, asking them to provide summaries which also include also properties. However, I firmly believe that this contribution is necessary for deserving acceptance in the SWJ, since it would be a plausible extension of the conference paper for what concern the evaluation part.

- The execution time comparison (fig 8) should include also KCE, since the source is actually available online.

- In section 7, the authors claim that KCE [10] consider only hierarchical relationship. However, also properties associated to concept (i.e., those that have such concepts as domain) are considered in the computation - see the notion of density in the paper. In addition, still, KCE doesn't consider each node in isolation. In fact, there are a lot of metrics that are actually "local", i.e., that are computed taking into consideration the closest neighbours' value.

- The work in progress described in section 8 doesn't cite important references. In fact, studies for exploring an ontology starting from the summary provided by KCE have been done by running a quite huge user testing session in

Enrico Motta, Paul Mulholland, Silvio Peroni, Mathieu d'Aquin, José Manuél Gómez-Pérez, Victor Mendez, Fouad Zablith: A Novel Approach to Visualizing and Navigating Ontologies. International Semantic Web Conference (1) 2011: 470-486

by means of a tool, i.e., KC-Viz, that uses the output of the KCE algorithm for enabling novel ways for browsing and navigate ontologies:

Enrico Motta, Silvio Peroni, José Manuél Gómez-Pérez, Mathieu d'Aquin, Ning Li: Visualizing and Navigating Ontologies with KC-Viz. Ontology Engineering in a Networked World 2012: 343-362

In addition, the authors seem not to consider several important works done with the notion of dependency within the ontology domain, such as those presented by Del Vescovo et al., e.g.:

Chiara Del Vescovo, Damian Gessler, Pavel Klinov, Bijan Parsia, Ulrike Sattler, Thomas Schneider, Andrew Winget: Decomposition and Modular Structure of BioPortal Ontologies. International Semantic Web Conference (1) 2011: 130-145

In addition, the work the authors present in this section should be accompanied by an evaluation for proving its validity.

- In the last section of the paper, what does the authors mean with the fact that they will consider OWL in future developments of their algorithm? Will they consider class restrictions, property characteristics, disjointness, etc.? It is not clear in the paper.

Review #3
Anonymous submitted on 18/Mar/2016
Suggestion:
Major Revision
Review Comment:

General Evaluation

This paper introduces algorithms that take as an input an RDF schema S and an RDF dataset I complying to S, and produce as output a summary of S, that is, a smaller RDF schema containing the most representative concepts of S. To do so, the proposed algorithms exploit the graph-like structure of S, and the rdf:type declarations between instances of I and classes of S. Also, the paper contains an evaluation of the performance of the algorithms using real-world RDF schemas and datasets. The experimental results show that the algorithms obtain expected summarisations and, when compared to other summarisation tools, the results are similar or better.

The approach is fairly novel, and has been evaluated using real-world datasets and compared to other state-of-the-art approaches. There are, however, some issues that should be addressed by authors. Below I give general remarks and then more specific comments and pose some questions to the authors.

  • First of all, readability, also related to technical soundness. The paper is hard to follow and mainly because of lack of rigour. Many of the formulas given in the paper are incomplete, since they are completed using words. It is important to provide intuitive explanations of formulas, but the formulas themselves should be given explicitly. Take, e.g. the relative cardinality in Definition 3, $RC(e(v_{i_0},v_{j_0}))$ for a given $e(v_{i_0},v_{j_0})\in E$, the numerator should be (if I am not mistaken)
    $\{r(n_i,n_j) \in R : \tau_p(r(n_i,n_j)) = \lambda_p(e(v_{i_0},v_{j_0}))\}$
    instead of just $\{r_m(n_i,n_j)\}$. The specific value of alpha should be given too. As another example, take Definition 9. Where are DistinctValues(e) and Instances(e) formally defined? This is actually misleading because $e \in E$ corresponds to a property, and properties have no instances (classes have instances).
    I suggest the authors to revise all the definitions of the paper and give complete and formal versions of the formulas.
  • Concerning the evaluation, the authors have used real datasets and they have compared their algorithms with other existing tools, which is very good. However, even though their approach is based on instances, they have only used one knowledge base with instances (CICOC-CRM). I think it is important to provide experimental results using another knowledge base with instances to reinforce the conclusions of the evaluation. This also could clarify whether there is a real difference between using the CM or RM algorithms when instances are available, which is not clear in the case of CICOC-CRM.
    I do not agree with the authors when they say that, to use instances when comparing their approach with the other chosen methods (Peroni et al. and Queiroz-Sousa) would not be fair. Exploiting instances is an added value of their approach, and, as such, should be used.
  • The two proposed algorithms are claimed to be correct, and to always produce a single result. Proofs (or sketches of the proofs) of these results should be provided. In Section 2, validity constraints are introduced as a way to ensure the uniqueness of summaries. However, this issue is not discussed later.
  • Concerning Section 8, I consider that a journal paper should not contain a section of "work in progress". Since the authors are not far from finishing this work, I suggest them to complete it and include it in the paper.

Specific Remarks and Questions to the Authors

  • The role of RDFS inference in this approach is unclear to me. How does this approach benefit from RDFS inference? RDFS inference is not used for instance classification since "inference is implemented only at the RDF schema level" (Section 2). But the measure of relative cardinality, for example, uses available instances, and these may be incomplete due to the lack of reasoning. Please, elaborate more on this.
  • The input of the proposed algorithms is an RDF schema and an RDF dataset complying to the schema. However, in practice, an RDF dataset usually makes use of more than one vocabulary, and vice versa the same vocabulary may be used by different datasets. What are the implications of this in the current approach?
  • Concerning the evaluation, although the output of the algorithms is an RDF schema summary of classes and properties, the reference summaries only contain classes. How is the quality of the outputted properties assessed?
  • This approach clearly makes the unique name assumption, and there should be some discussion about it in the paper.
  • The validity constraints should be discussed more in detail. At a first glance, they seem to limit the scope of the proposed approach. Type uniqueness, for instance, will not hold for all real datasets. If, for any RDF/S knowledge base, it is always possible to build an equivalent and valid RDF/S knowledge base, this should be mentioned in the paper, and explained how it can be done too.
  • Concerning the formal part of the paper, the following points should be argued and/or solved in the paper.
    • Concerning H=(N,<) in Section 2, first, <= should be used instead of < as the subclass/subproperty relations are reflexive. Also, H is not a poset but a preorder - the fact that two classes (properties) subsume each other does not mean they are equal, but equivalent. In addition, the authors write that "the domain of property p, i.e. domain(p) is a class". Does it mean that $domain(p) \in C$? The set C, however, is a set of class names, and, in RDFS, there might be more than one domain declaration, so domain(p) might be an intersection of classes, which, in turn, is beyond the expressivity of RDFS. Ditto for range(p). Moreover, I guess that by "smallest" the authors mean "minimal", and that by "literal" they mean "literal type name". Finally, as a suggestion, I will explicitly say that $C \cup P$ is a disjoint union (even writing $C \uplus P$).
    • Since the input of the summarisation algorithms will be an RDF/S knowledge base, i.e. an RDF schema S and an RDF dataset D that complies to S (two sets of triples), I suggest the authors to modify Definitions 1 and 2 and provide explicit definitions of an RDF schema graph associated with an RDF schema S, and an RDF instance graph associated with an RDF dataset D. These should be exhaustive (e.g. the hierarchy H built from S should be explicitly given). Now, if D complies to S, then, using the authors' terms, the two graphs will be "correlated" via the functions $\lambda_{p}$ and $\tau_{p}$, and $\lambda_{c}$ and $\tau_{c}$. But this will be a consequence, not a condition of the definition.
    • In its current version, Definition 2 uses elements of an RDF schema graph ($\lambda_{p}$, $\lambda_{c}$). Actually, Definition 2 defines an RDF instance graph I complying to an RDF schema graph S, and, thus, S should be given and fixed at the beginning of Definition 2. Moreover, what are e, $v_i$ and $v_j$? It should be written that for every $r(n_i,n_j)\in R_{I}$ there exists $e(v_i,v_j)\in E_S$ such that $\tau_p(r(n_i,n_j)) = \lambda_p(e(v_i,v_j))$. Is $\tau_c$ a function or a relation? There could be more than one rdf:type (not rdfs:type like it is written in Definition 2) declaration for a given instance.
    • In Definition 1, $\lambda_{p}$ should be defined before it is used (second bullet). Ditto Definition 2 and $\tau_{c}$. Also, is it $\lambda_{p}$ or $\lambda_{P}$?
    • In Definition 1, $(s,p,o) = URIs x URIs x URIs$ should be $(s,p,o) \in URIs x URIs x URIs$. Ditto $(s,p,o) \in URIs x URIs x (URIs \cup Literals)$ in Definition 2.
    • Avoid using technical terms if you do not define them previously (e.g. "interpretation" of p in Definition 2). In logic, the interpretation of a property has a very precise meaning, and it's not its domain/range.
    • Wouldn't it better to consider a multigraph structure as in an RDF/S knowledge base two URIs may be linked by more than one property?
    • Aren't V and E in Definition 1 finite too?
    • In Definition 4, the sum should be $\sum\limits_{i=1}^m$ and not $\sum\limits_{1}^m$ (check Definition 5 too). Actually, since the numbers of incoming and outcoming edges could be different, I suggest to write
      $C_{in}(v_0) = \sum\limits_{e(v_0,v)\in E} RC(e(v_0,v)) \ast w_p \quad C_{out}(v_0) = \sum\limits_{e(v,v_0)\in E} RC(e(v,v_0)) \ast w_p}$
      instead. As it is written, $w_p$ seems to be constant and to not depend on e, so, please, change notation. Also, the values of $w_p$ used in the experiments are not given in Section 6.
    • In Definitions 8 and 10, specify where ${p(v_i\to v_j)}$ and ${p'(v_i\to v_j)}$ belong. Shouldn't it be "there exists ${p(v_i\to v_j)} \in E'$ such that there is no ${p'(v_i\to v_j)} \in B$..."?

Minor Comments

  • The authors often use "i.e." when it should be "denoted by" (e.g. "i.e. $RC(e(v_i,v_j))$" in Definition 3).
  • Section 8 contains the same paragraph twice (!).

Comments

This paper describes two methods (1. SummaryCM, is a existing work; 2. SummaryRM, which is a new approach, with added support for blank nodes) for summarizing RDF/S knowledge bases. The authors claim that their methods gives better correlation to those summaries which were prepared by human experts. In my opinion, the work is good w.r.t. the relevance of the topic and its presentation. The approaches are described using formal notations are clearly explained. However, the manuscript requires a minor revision before publication.

One of my main concern is how far the first method is different from your previous work[11] -- though you have included algorithm complexity details etc--- I think a clear distinction or a summarization (of the details of the method) is necessary. Many of the contributions mentioned (at the end) of the Introduction (as the current work's contributions) coincide with your previous work[11].

"Specifically the contributions of this paper are the following:" // you may change this sentence."

"Our previous work could not handle blank nodes. However...." // you have given a positive appeal for including the blank nodes, but it turns out to have negative impact on the generated summaries. I think you should briefly cover this point in the introduction.

In addition, you should address the following issues:

*encountered many typos

*need rewriting of a few sentances, to give more clarity!

*a few notation and reference issues

*clarity of the algorithms used.

Detailed review:

---

Section 2: Preliminaries (Definition 1)

\lambda_{c} or \lambda_{C}??

\lambda_{p} or \lambda_{P}??

Please confirm.

---

Definition 2 (last paragraph): Font of the "C" (in c \in C) looks different.

---

Section 3.1: Reference 13 is misleading.

---

Section: 3.1.2: last paragraph

In "We conisder...the latter"

"the former" -- is missing

You meant the other way around?

Which is more important, user defined properties right!

Kindly give more clarity to the sentance: “ This is partly because the user defined properties correlate classes, each exposing the connectivity of the entire schema, in contrast to

the hierarchical RDF/S properties.”

---

Two sections with same titles (3.1 and 3.1.3) may confuse the reader.

---

In Algorithm-1 some functions look too abstract. For e.g., the fn. path_with_max_cov(B, S, vi) – provide more details.

---

“The correctness of the algorithm is proved by construction.” -- please give clarification.

---

Page-7 last paragraph.

To identify... the complexity "of" --missing-- its various components

---

Definition 8 and 10, “p” and “p'” should be italiticed uniformily.

---

“ Kruskal's greedy algorithm [16] is amongthe most efficient ones and we are using it in our

implementation.” --- efficent in what sense?

---

In Algorithm-2

What is “N” in line 7.

What is “r” in line 4.

“the result of our algorithm for a specific input is unique as well.” --- how is it possible? Are the relevance of all the propertes are unique?

---

Section-6

Intext repetitions of the class and property counts can be removied, since they are given in Table-1.

---

In Table-1, giving Property and User property counts togeather looks redudnant, sicne oneis the subset of the other. You may give RDF standard peoprty and User property counts (makes sense).

---

Section 6.2

reference [0] is misleading!

---

Eq. for Sim(.)

You may use A and R, for automatically produced summary and R for reference summary -- inproves readability.

---

Page-17 1st column last paragraph

“As we will show latter --later-- the way that this value increases as

the size of the summary becomes bigger gives us”

---

In Section 6.3: To evaluate.."these" four ontologies.

which?? the ontology names are too far.

Section 6.3

Paragraph-2:

“We have to note that whereas the reference summaries on these” -- rewrite

Section 6.3

Paragraph-2 ending: It would be very interesting if you could include some egs. of the selected classes / summary egs. as an Appendix section.

---

In Fig. 7 you have give statistics of the Bank-Ontology, you mean the Financial Ontology?? --- this mistake has happend at many other sections.

"bioshere" -- be consistent use "BIOSPHERE" or "Biosphere"

---

Section 6.3:

For example, the Aktors Portal ontology contains a huge amount of blank nodes, and when considered by the SummaryRM, as shown in Fig. 6, the quality of the result is worse than the summary created by SummaryCM. //rewrite -- need more clarity

---

Section 6.4:

use "bigger"... instead of “biggest”

---

Section 6.5:

Using comma (instead of period) is little confusing. 1,29 is 1.29 secounds or 129 secounds??

“Finally we can see that SummaryRM is more efficient that --- “than” ---SummaryCM since the latter has to assess for each” // efficency in term of what?

---

Section 7:

“these works (e.g. [9], [10]) provide a list of the more important nodes, whereas others [8], [9], [17] and our approach, create a valid summary schema.” // reference [9] is included in both the types??

---

Section 8:

1. italicize e in Def.9 “values of the *e*..”

2. The paragragh "In our running example,..... "E57 Material". is repetated in Page-15

3. In the paragraph above Definition-10:...according "to" their instances... -- "to" is missing

Some relevant related works are missing:

[1] Wu, G.; Li, J.; Feng, L.; and Wang, K. 2008. Identifying potentially important concepts and relations in an ontology. In International Semantic Web Conference, 33–49