Ontology Understanding without Tears: The summarization approach

Tracking #: 1396-2608

Authors:
Georgia Troullinou
Haridimos Kondylakis
Dimitris Plexousakis

Responsible editor:
Guest Editors ESWC2015

Submission type:
Full Paper

Abstract:
Given the explosive growth in both data size and schema complexity, data sources are becoming increasingly difficult to use and comprehend. Summarization aspires to produce an abridged version of the original data source highlighting its most representative concepts. In this paper, we present an advanced version of the RDF Digest, a novel platform that automatically produces and visualizes high quality summaries of RDF/S Knowledge Bases (KBs). A summary is a valid RDFS graph that includes the most representative concepts of the schema, adapted to the corresponding instances. To construct this graph we designed and implemented two algorithms that exploit both the structure of the corresponding graph and the semantics of the KB. Initially we identify the most important nodes using the notion of relevance. Then we explore how to select the edges connecting these nodes by maximizing either locally or globally the importance of the selected edges. The extensive evaluation performed compares our system with two other systems and shows the benefits of our approach and the considerable advantages gained.

Tags:
Reviewed

Decision/Status:
Minor Revision

Solicited Reviews:
Review #1
By Szymon Klarman submitted on 22/Jun/2016
Suggestion: Minor Revision

Review Comment:
I appreciate the authors’ responses and their effort put into revising the paper. I find most of the concerns I raised in my review to have been properly addressed. However, I still have the following issues that I’d like to be considered and resolved:

1) I’m afraid I am not convinced by the response regarding the use of RDF(S) semantics in computing the summary. Even if we restrict the claim just to the schema graph, it just doesn’t hold. Let’s assume we deal with an empty instance graph and want to summarize only the schema graph. The algorithm makes references to certain measures, such as RC, Rel or Cov. All these measures evaluate the scores for the respective entities (edges, concepts, paths) by looking up the schema graph G_S only, not its deductive closure Cl(G_S). Because of that, whenever we have two syntactically different graphs G_S =/= G_S’, the resulting summary might be different for both of them, even if the graphs are semantically equivalent, i.e., when Cl(G_S) = Cl(G_S’). This basically follows from the definitions and the algorithm. And yes, the algorithm starts off with Cl(G_S), but since the measures only work over G_S this is not enough. So this has to be either changed, or it has to be made clear in the paper that the proposed method is largely syntactic: it possibly makes some use of the semantics, but it does not guarantee equivalent results for semantically equivalent graphs.

2) The notion of validity occurring on p.4 adds to the confusion regarding the semantics of RDFS. The definition of RDFS per se does not enforce any validity requirements related to the domain and range restrictions. It is the very idea of the open world architecture and logical entailment that ensures that if suitable assertions about certain individuals are not explicitly stated in the graph, they will be inferred from the schema axioms. But there’s no requirement for these assertions to be there to start with.

3) In Definition 1, how come the nodes (which should be concepts or datatypes, as I understand) can also be literals from the earlier defined set L?
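
The argument in point 1 can be illustrated with a small sketch. It is not taken from the paper: `subclass_closure` covers only the transitivity of rdfs:subClassOf (one of several RDFS entailment rules), and `in_degree` is a hypothetical stand-in for a structural measure, not the paper's RC, Rel or Cov. The two toy graphs have the same closure, yet the measure computed on the stated edges differs.

```python
from itertools import product


def subclass_closure(edges):
    """Transitive closure of a set of (sub, super) rdfs:subClassOf pairs."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def in_degree(edges, node):
    """Hypothetical structural measure: number of edges pointing at `node`."""
    return sum(1 for _, target in edges if target == node)


# g_s states A subClassOf B and B subClassOf C.
# g_s2 additionally states the entailed edge A subClassOf C.
g_s = {("A", "B"), ("B", "C")}
g_s2 = {("A", "B"), ("B", "C"), ("A", "C")}

same_closure = subclass_closure(g_s) == subclass_closure(g_s2)
score_g_s = in_degree(g_s, "C")
score_g_s2 = in_degree(g_s2, "C")
```

Under these assumptions the graphs are semantically equivalent with respect to subclass transitivity, while the score of `C` depends on which of the two syntactic forms was supplied. Evaluating the measure on the closure instead would give the same score for both.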
Review #2
By Silvio Peroni submitted on 01/Jul/2016
Review #3
Anonymous submitted on 15/Jul/2016
Suggestion: Accept

Review Comment:
General Evaluation

The authors have addressed all the remarks I made in my previous review. The formalisation of the approach is now in much better shape; the authors have added proofs to the theorems and new experimental results. In my opinion, the paper is very close to being ready for publication. Below I include some remarks that should be considered for the final version of the paper.

Specific Remarks to the Authors

Sometimes you speak of an RDF KB, and other times of an RDF/S KB (even in Definition 1). Please, be consistent.

In Definition 1, the first $L$ refers to datatypes, and the second one to literals. Please, choose different symbols.

Concerning Definition 2: As a suggestion, you could write "the relative cardinality of an edge $p(v_i,v_j) \in E_S$" instead of "the relative cardinality of $p(v_i,v_j)$ $v_i, v_j \in G_S$". Also, I would write $p(v,w)$ and $p(n,m)$ for generic edges in $E_S$ and $E_I$, respectively (most of the subscripts in the formula are actually unnecessary). Isn't it $\lambda(r_p(n_i,n_j)) = \lambda(p(v_i,v_j))$ instead of $\tau_c(r_p(n_i,n_j)) = p(v_i,v_j)$? And since $\tau_c: I \to 2^C$, isn't it $v_i \in \tau_c(n_i)$ instead of $\tau_c(n_i) = v_i$? Also, the condition should ensure that the denominator, and not the numerator, is not 0. Inside the absolute value, you must write a set (a set with a defining condition in your case).

To sum up, if I'm not wrong, the definition of $RC(p(v_0,w_0))$ for a given $p(v_0,w_0)\in E_S$ should be something like (assuming $E_S \neq \emptyset$):

if $\{ p(n,m) \in E_I : v_0 \in \tau_c(n) \} \neq \emptyset$ or $\{ p(n,m) \in E_I : w_0 \in \tau_c(m) \} \neq \emptyset$ then

$$RC(p(v_0,w_0)) = \frac{1}{| \{p(v,w) \in E_S\} |} + \frac{| \{p(n,m) \in E_I : \lambda(p(n,m)) = \lambda(p(v_0,w_0)), v_0 \in \tau_c(n) \mbox{ and } w_0 \in \tau_c(m) \} |}{| \{p(n,m) \in E_I : v_0 \in \tau_c(n) \} | + | \{p(n,m) \in E_I : w_0 \in \tau_c(m) \} |}$$

Otherwise,

$$RC(p(v_0,w_0)) = \frac{1}{| \{p(v,w) \in E_S\} |}$$

Concerning Definition 4, $c_1,\ldots,c_n\in V_S$ instead of $c_1,\ldots,c_n\in E_S$ (same for $c'_j$), which is actually redundant if you impose $p(c_i,c)\in E_S$. It would be better, though, to write $c_1,\ldots,c_n\in V_S \cap C$.

In Algorithm 1, the initialisation of $\mathit{Nodes}$ is missing (Line 5). Also, isn't it $\mathit{RemNodes}:= \mathit{Nodes}$? Shouldn't Line 7 be: While $\mathit{RemNodes} \neq \{\}$?

In 6.2.1, if there was no agreement between the experts, which reference summary did you choose for the evaluation? Please, clarify it in the paper.

Minor Comments and typos:

Please, write all mathematical symbols/letters in italic (e.g. $V$ instead of V).

When citing, write, e.g., [6,7] instead of [6][7].

Write $G_I$ instead of $G_i$ whenever you write $V=(G_S,G_i,\lambda,\tau_c)$.

Page 2, "user-defined and standard RDF properties" should be "user-defined and standard RDF-S properties".

Page 3, "their datasets As such" should be "their datasets. As such".

Right parentheses are missing in $C_{out}(c)$ and $C_{in}(c)$ (Definition 3).

Page 5, "how central is a schema node in an RDF/S KB" should be "how central a schema node in an RDF/S KB is".

In Definition 5, "the following." should be "the following:".

In Definition 6, $G_{S'}$ should be $G'_S$.

Page 10, RDF Accessor should be RDF Assessor.

Use KCM or Peroni et al. when referring to the tool, but not both.

Page 12, "did not agreed" should be "did not agree".
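
The piecewise definition of $RC$ suggested in this review can be checked on a small example. The sketch below is only one possible reading: edges are assumed to be `(label, source, target)` triples, `tau_c` is assumed to be a dictionary from instances to sets of classes, and $|\{p(v,w) \in E_S\}|$ is read as the total number of schema edges. The function name `relative_cardinality` and the toy data are hypothetical.

```python
from fractions import Fraction


def relative_cardinality(edge, schema_edges, instance_edges, tau_c):
    """RC of a schema edge, following the piecewise form suggested above."""
    label, v0, w0 = edge
    if not schema_edges:
        raise ValueError("E_S must be non-empty")
    # Assumption: the first term is 1 over the number of schema edges.
    base = Fraction(1, len(schema_edges))

    # Instance edges whose source is typed v0, and whose target is typed w0.
    from_v0 = [e for e in instance_edges if v0 in tau_c.get(e[1], set())]
    into_w0 = [e for e in instance_edges if w0 in tau_c.get(e[2], set())]
    if not from_v0 and not into_w0:
        # "Otherwise" branch: the denominator would be 0.
        return base

    # Instance edges with the same label that connect a v0 to a w0.
    matching = [
        e for e in instance_edges
        if e[0] == label
        and v0 in tau_c.get(e[1], set())
        and w0 in tau_c.get(e[2], set())
    ]
    return base + Fraction(len(matching), len(from_v0) + len(into_w0))


# Hypothetical toy KB.
schema_edges = {("paints", "Painter", "Painting"), ("owns", "Person", "Painting")}
instance_edges = {("paints", "p1", "m1"), ("owns", "p1", "m2")}
tau_c = {"p1": {"Painter", "Person"}, "m1": {"Painting"}, "m2": {"Painting"}}

edge = ("paints", "Painter", "Painting")
rc_with_instances = relative_cardinality(edge, schema_edges, instance_edges, tau_c)
rc_without_instances = relative_cardinality(edge, schema_edges, set(), tau_c)
```

With this toy data the first call evaluates to 1/2 + 1/4 = 3/4 and the second to 1/2, which matches the two branches of the formula. The guard on the denominator, rather than on the numerator, is what keeps the second call well defined.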
Review #4
By Vinu E. V submitted on 21/Jul/2016
Suggestion: Minor Revision

Review Comment:
The current version of the manuscript is well written and the authors have addressed all the issues which I had raised. However, the following minor changes should be made before publication. The paper should be carefully proof-read by the authors once again; I have noted a few typos.

Minor revisions:

Suggestion: It is easier to understand schema content using only the summary graph since it contains the most important/representative nodes out of the initial graph. // A one- or two-line explanation of the domain significance of the summary nodes would further motivate the reader.

p-5: "We consider as more important the user-defined ones.." // We consider the user-defined ones as more important..

Def.3- RC( -- closing ")" is missing. You may use \big) and \big(

p-5: "Consider now the “E38 Image” class shown in Fig. 1. Assume...." // It is not clear what exactly you are trying to convey. Please recheck the definition reference and address the sentence-construction issues (are not any instances --> are no instances..)

"Initially we calculate the Cl(G_S) and we need O(|V|^3)." // Provide a reference or explain why it is O(|V|^3).

"..allowing users to use the second algorithm only whereas soon it will be updated to allow both the algorithms to be selected and used..." // allowing users to use only the second algorithm (soon it will be updated with the other algorithm)

In the implemented system (online link), for all the summaries, there exists a node with the name "class" which connects all the other classes in the summary. Is it really necessary to include a built-in concept in the summary? Or is it generated as a part of the summary result? In that case, can there be user-defined classes (in the summary) that are not related to the so-called *class* node?

"Bellow we will describe in detail the performed evaluation." // "Below"

"By trying to understand the reasons behind this, we identified that the Aktors Portal ontology contains a huge amount of blank nodes and this has a direct effect to the quality of our constructed summary" // In general, I wonder how the presence of blank nodes affects a *schema* summary. Blank nodes can be thought of as instances without names; in that sense they do have some significance (the same as that of every other instance).

The equation in 6.3.1. Stage 3 Evaluation Measures, should be written properly. Please confirm whether |Vsummary| is present in the denominator OR Did you mean, (Vsummary \union \delta^{+}_{summ,ref})\backslash \delta^{+}_{summ,ref} ---> Vreference

V1 and V2 or V_{1} and V_{2}: be consistent in Sec.6.3.1.

All the best.
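
On the O(|V|^3) remark above: one standard way such a bound arises is a Warshall-style transitive closure, whose three nested loops each range over the nodes. The sketch below illustrates that general technique only. Whether the paper computes Cl(G_S) this way is an assumption, and the name `warshall_closure` is hypothetical.

```python
def warshall_closure(nodes, edges):
    """Transitive closure of a directed graph with Warshall's algorithm.

    The three nested loops over `nodes` give O(|V|^3) time.
    """
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    reach = [[False] * size for _ in range(size)]
    for source, target in edges:
        reach[index[source]][index[target]] = True

    for k in range(size):          # allowed intermediate node
        for i in range(size):      # path start
            if reach[i][k]:
                for j in range(size):  # path end
                    if reach[k][j]:
                        reach[i][j] = True

    return {
        (nodes[i], nodes[j])
        for i in range(size)
        for j in range(size)
        if reach[i][j]
    }


# Toy subclass chain A -> B -> C -> D.
chain_nodes = ["A", "B", "C", "D"]
chain_edges = [("A", "B"), ("B", "C"), ("C", "D")]
chain_closure = warshall_closure(chain_nodes, chain_edges)
```

For the four-node chain the closure contains the three stated edges plus the three derived ones, including `("A", "D")`.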