From datasets to datanodes: An Ontology Pattern for networks of data artifacts

Tracking #: 689-1899

Authors: 
Enrico Daga
Mathieu d’Aquin
Aldo Gangemi
Enrico Motta

Responsible editor: 
Krzysztof Janowicz

Submission type: 
Other
Abstract: 
Data is at the center of current developments on the World Wide Web. In the Semantic Web, different kinds of data artifacts (datasets, catalogues, provenance metadata, etc.) are published, exchanged, described and queried every day. Data hubs are also emerging in the context of the web of Linked Data, as a way to manage this heterogeneity. There are a number of use cases related to data hub management that can only be addressed if we are able to specify the relations between the managed data artifacts in a way that supports useful inferences. This includes the understanding of how the features of the data artifacts propagate. This may not be trivial if we consider complex relations, possibly including datasets in different repositories or data flows happening in independent processes and workflows. We propose an abstract, foundational model which focuses on the graph of relationships between generic “data objects” (which we call datanodes). Following an ontology building methodology based on the analysis of Semantic Web applications, we devise a foundational Ontology Design Pattern by collapsing different types of data objects together, and by remodelling structured relations to simple binary relations. Our pattern represents "a datanode related with a datanode", where the relation can be specified in six fundamental ways. We extend this foundational model and propose a conceptual framework designed to express relations between data nodes, implemented as an extendable OWL ontology covering the possible relationships between such data nodes. We show how the Datanode approach can support the management of (possibly distributed and interlinked) catalogues of web data and reasoning over the relationships between items in data catalogues.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Yingjie Hu submitted on 08/Jul/2014
Suggestion:
Minor Revision
Review Comment:

This paper proposes an ODP to specify the relations between datasets. Six fundamental relations have been defined, and their corresponding sub-relations have been discussed. The proposed ODP can be applied to facilitate resource discovery on the Semantic Web, as well as the development and maintenance of data hubs. This paper has a clear motivation, and the direction the authors are taking is sound. There are several questions that still need to be addressed.

1. Section 2.2 provides a four-step abstraction based on the snippet from DBRec. However, it seems that step 3 "remove duplicates (same as redundancies?)" is only a duplication of step 2 "remove redundancies and remove inconsistencies". The authors may need to provide justifications if step 3 is believed to be necessary.

2. In section 2.4 Testing, the authors state that "...we considered two more applications, namely Rexplore [20] and IBM Watson [21], to strenghten our evaluation. On each test iteration we descovered new implications derived from..." It would be better if the authors could provide a few more sentences to explain how the two applications are used in the testing, and what new implications are derived from the hierarchical organization.

3. Six foundational relations between datanodes are defined, but the authors didn't mention whether two datanodes can have multiple relations. This should be possible, since, for example, two datanodes can have both the relations of "derivation" (one dataset is derived from the other) and "adjacency" (two datasets are under the same catalog). Thus, I think it would be beneficial if the authors could provide a short discussion on this point after explaining the six relations.

4. When the relations between two datanodes are partial (e.g., only a portion of the dataset is derived from the other dataset), could the proposed ODP still handle this situation? It would be good if the authors could provide some discussion on this issue.
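Point 3 above can be illustrated with a minimal sketch, assuming relations are stored as plain (subject, relation, object) statements. The datanode names and the two relation labels are hypothetical placeholders, not terms taken from the ontology; the sketch only shows that nothing in a triple-based representation prevents the same pair of datanodes from carrying several relations at once.

```python
from collections import defaultdict

# Hypothetical statements: (subject, relation, object).
# "dn:A", "dn:B", "dn:C" and the relation labels are illustrative only.
statements = [
    ("dn:B", "derivation", "dn:A"),
    ("dn:B", "adjacency", "dn:A"),
    ("dn:C", "derivation", "dn:A"),
]


def relations_between(triples):
    """Group relation labels by ordered (subject, object) pair."""
    by_pair = defaultdict(set)
    for subj, rel, obj in triples:
        by_pair[(subj, obj)].add(rel)
    return dict(by_pair)


pairs = relations_between(statements)
# Keep only the pairs that are connected by more than one relation.
multi = {pair: rels for pair, rels in pairs.items() if len(rels) > 1}
print(multi)
```

Whether the pattern intends such combinations to be meaningful is exactly the question the authors are asked to discuss.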

There are many typos in the text, and some are listed below. I would strongly recommend that the authors do a comprehensive check to ensure the quality of the text:

1. Both "in the Semantic Web" and "on the Semantic Web" are used in the text. It would be good to be consistent and use only one of them.

2. Section 1: "Data hubs are emerging as a mean to to manage this heterogeneity" should be "Data hubs are emerging as a means to manage this heterogeneity"

3. "On each test iteration we descovered new implications..." should be "... discovered new implications..."

4. "direct graph" should be "directed graph".

5. "As illustration, we will follow the evolution of a snippet extracted from one of the system analysed, namely DBRec" should be "... from one of the systems analysed..."

6. reference 5 "Twc international open government dataset catalog." should be "TWC international open government dataset catalog"

Review #2
By Adila Krisnadhi submitted on 05/Oct/2014
Suggestion:
Major Revision
Review Comment:

The authors presented Datanode, an ontology pattern that models an abstraction of data artifacts, which can be datasets, repositories, catalogues, or registries. The aim was to fill the use-case gap not covered by other vocabularies such as VoID, DCAT, or PROV-O, in particular concerning the relationships between data artifacts, so as to allow for inferences about them.

Overall, the comments below mean that a major revision may be needed, at least in the writing of the paper.

ON READABILITY AND CLARITY OF PRESENTATION:

Readability is not a big problem, though the remark below should help improve it.

For a user who is not yet familiar with the pattern, the pattern description seems daunting and complicated due to the large number of properties in the pattern. Moreover, the six branches described by the authors are clearly not completely separated, because a significant number of properties belong to more than one branch. Although the tables do list them all explicitly, a more complete visual description would greatly help in understanding the property hierarchy. Figure 1 only visualizes the hierarchy up to the top-level properties, and does not indicate the shape of the hierarchy below them, which is definitely not just six simple, separated branches. The meaning of some properties is also not entirely clear (see below). The arrangement of the content in the tables should be improved: the order in which the properties appear in the tables seems random; it would be nicer if they were ordered alphabetically or in some other easily recognizable order to make it easier to locate a particular property.

ON THE PATTERN DESCRIPTION

I like the steps conducted by the authors in the development of the pattern. Typically, a content pattern is designed with some involvement of domain experts. However, in the case of the Datanode ODP, the authors themselves could also be considered domain experts. Hence, going through the steps as described by the authors would reasonably lead to a quite good abstraction of the use cases.

The resulting pattern and its description, however, need some improvements.

It is not entirely clear what some properties mean. This is rather unfortunate considering that the authors (seemingly) emphasize that the pattern can support more useful inferences about datanodes than the other vocabularies can.

1. hasInterpretation / isInterpretationOf

What is an interpretation in this context? If I have a datanode that is an interpretation of another datanode, what would it look like? Interpretation can mean completely different things to different users. The subproperties (hasExtraction, hasInference) are pretty clear, though. So, probably, hasInterpretation and isInterpretationOf are not needed in the pattern? Is there any use case for hasInterpretation besides what can be described by hasExtraction and hasInference?

2. hasStandIn / isStandInOf

What is a "stand in" of a datanode in this case? "Stand in" usually means "substitute". My understanding of these terms is that they depend on the context. That is, when I assert that datanode B stands in for datanode A, I would probably do it because my application's condition requires it. Without explicitly considering this context, I'm not sure the "stand in" relationship makes much sense. The authors should probably justify the use of these terms with some examples, or simply drop them altogether.

3. remodelledInto vs. refactoredInto

It might simply be that I am not familiar with the relevant use cases here, but I don't quite understand the differences between these two properties. Both the terms "remodeling" and "refactoring" are commonly used to describe a restructuring process (a building or a shape for the former; code or software for the latter). I'm not entirely clear what they mean for datanodes and whether they are actually different.

4. The adjacency branch

Implicitly, it seems that relations from this branch should only be used for two datanodes that belong to the same data container, such that one is not part of the other. This is, however, not formally indicated in the pattern. Also, what does it mean to have a disjointPartWith relationship between two datanodes? Why does it imply that they are part of the same dataset? This relationship can also be conceivably applied to two completely unrelated datanodes, which do not belong to the same dataset at all.

5. overlappingCapabilityWith vs. differentCapabilityFrom

Among the six top relations, overlappingCapabilityWith and differentCapabilityFrom are the most confusing ones. Can two datanodes be related by BOTH properties at the same time? Or is differentCapabilityFrom intended for two datanodes that share no capability at all? It is conceivable that two datanodes have disjoint populations, but use the same vocabulary. Hence, they would be related by both the overlappingCapabilityWith and differentCapabilityFrom properties. This may be surprising for some users, as they may use the differentCapabilityFrom property with the intention of indicating that the two datanodes share no capability at all.

The notes in the online version of the pattern say that those two properties are needed to generically express comparison of datanodes with respect to specific tasks, which is quite clearly the case when we look at the subproperties of the above two properties in Tables 6 and 7. However, I think using those two properties might not be the best choice to express a generic comparison. I would rather use something like shareCapabilityWith, together with disjointCapabilityWith.

6. differentVocabularyFrom or disjointVocabularyFrom?

Is differentVocabularyFrom intended to be used for datanodes that share no vocabulary terms at all? Wouldn't disjointVocabularyFrom be a better term? A vocabulary is usually seen as a collection of terms, hence it is possible for two datanodes to have overlapping, but different, vocabularies at the same time.

7. disjointPortionWith and disjointSectionWith

Was the intention to say that A "disjointPortionWith" B if A and B are both disjoint partitions of the same datanode C? Or was it to say that A "disjointPortionWith" B if A has some portion A' and B has some portion B' such that A' and B' are disjoint? My guess is that the authors intended the first reading, but unfortunately this is neither clear from the explanation nor formally asserted. If the second reading was intended, then every two datanodes would be trivially related through the disjointPortionWith property. In fact, a datanode would be "disjointPortionWith" itself because we can always conceive an empty portion of it. The remark for the disjointSectionWith property is similar.

8. redundantWith, sameCapabilityAs, duplicate

The authors said: "overlappingVocabularyWith and overlappingPopulationWith, both leading to redundantWith, sameCapabilityAs, and duplicate - all describing a similar phenomenon with different intentions". Could you please explain their differences? When is one term more appropriate than the others?

9. Leveling in the pattern
I noticed that the online version of the pattern at http://www.enridaga.net/datanode/0.3/ns/ contains some sort of division of the terms into levels (level 1 to 5). What do the authors mean by this? Why is this not reflected in the paper?

10. Are there any other useful inferences that can be drawn using the pattern aside from subproperty relationships? Some properties are asserted with certain property characteristics, e.g., symmetry, functionality, etc. The scenario, however, only talks about inferencing shortcuts (without even spelling out what actually happens in Figure 10). Are such property characteristics useful in some other scenarios? Are there any examples for them?
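Points 5 and 7 above can be made concrete with a toy set model. This is only a sketch under strong assumptions: a datanode is reduced to a set of vocabulary terms and a set of population items, a "portion" is taken to be any subset, and all names (`first_reading`, `second_reading`, the `ex:` and `foaf:` items) are hypothetical helpers for illustration, not definitions from the pattern.

```python
from itertools import combinations

# Point 5: two hypothetical datanodes with the same vocabulary
# but disjoint populations.
vocab_a = {"foaf:name", "foaf:knows"}
vocab_b = {"foaf:name", "foaf:knows"}
pop_a = {"ex:alice", "ex:bob"}
pop_b = {"ex:carol"}

shares_vocabulary = bool(vocab_a & vocab_b)        # an "overlapping" capability
disjoint_population = pop_a.isdisjoint(pop_b)      # a "different" capability
both_relations_hold = shares_vocabulary and disjoint_population


# Point 7: the two candidate readings of disjointPortionWith.
def portions(items):
    """All subsets of a small set, including the empty one."""
    items = sorted(items)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def first_reading(a, b, container):
    """A and B are disjoint parts of the same container C."""
    return a <= container and b <= container and a.isdisjoint(b)


def second_reading(a, b):
    """Some portion of A is disjoint from some portion of B."""
    return any(pa.isdisjoint(pb) for pa in portions(a) for pb in portions(b))


container = pop_a | pop_b
print(both_relations_hold)
print(first_reading(pop_a, pop_a, container))   # a node against itself
print(second_reading(pop_a, pop_a))             # a node against itself
```

Under these assumptions the second reading holds for any pair, including a datanode paired with itself, because the empty portion is disjoint from everything; the first reading does not.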

ON ALIGNMENT WITH EXISTING VOCABULARIES

From what I understand regarding the authors' motivation, the main aim of the pattern is to cater for use cases that cannot be covered by other existing vocabularies. If this is the motivation, I would think that there should be a much more detailed comparison, especially with VoID, DCAT, and PROV-O. The section describing the alignments with those existing ontologies only says which parts of VoID, DCAT, and PROV-O correspond to the Datanode pattern. In my opinion, it would be very useful for the users of the pattern if they also understood more clearly the differences between those existing vocabularies and the Datanode pattern, e.g., which features PROV-O possesses but Datanode does not, and vice versa. The application scenario does describe a situation where Datanode makes a difference, but the users (especially those who are very familiar with VoID, DCAT, or PROV-O) would be helped if other differences were detailed.

MINOR TYPOS, STYLES, etc.

Please be consistent: data node or datanode?

When you list more than two things separated by commas, with the last one preceded by the word "and", please put a comma before "and", e.g., write "item1, item2, and item3" not "item1, item2 and item3".

p5
left col, par 3: can be related each other --> can be related to each other
right col, par 2: mod-els --> mo-dels

p6
left col, spacing between par 1 and 2 of section 4 needs to be fixed.
right col, section 4.1: This relation has for inverse about --> The inverse of this relation is the property about

p7, section 4.4: Similarly to consistency --> Give pointer to section 4.5?

p9, par1: infererences --> inferences
p9, par3: hasUpdate --> The hasUpdate property

p14,
left col, line 3: partitioning etc... --> partitioning, etc. (no need to put three periods after etc)
left col, line 10: indirect affected --> indirectly affected
Fig. 10: some of the arrows (possibly the dotted ones) are not visible when printed in black-and-white.

References should be rechecked and better formatted; please put the information consistently, e.g., some conference names are only given as a short abbreviation, while others are given as a complete name. The following are the ones I found:
[1] lod --> LOD
[2] aurin --> AURIN, ands --> ANDS, pages 75-82
[4] uk --> UK
[6] Semanic --> Semantic
[8] the authors should be: Keith Alexander, Richard Cyganiak, Michael Hausenblas, and Jun Zhao.
in the title: void --> VoID ?
[9] this is a W3C working group note; please give more complete information in the reference
[12] extreme --> Extreme
[14] Dbrec --> DBRec
[19] lod --> LOD
[25] "Technical report" appeared twice.
[26] Is this a technical report?
[28] What's the venue? Citeseer?
[33] Is this a technical report?
[34] dl --> DL?
[35] cad --> CAD
[38] Technical report?
[40] pa --> PA? cnr --> CNR?
[41] ou --> OU?

Review #3
By Rinke Hoekstra submitted on 06/Oct/2014
Suggestion:
Reject
Review Comment:

The paper presents an ontology that can be used to represent the way in which different (versions of/parts of) "datasets" are reused and recombined in new datasets. Relations and classes in the ontology are collected from a selection of Semantic Web applications that use multiple datasets. The ontology is defined in OWL2, and (apparently) automatically infers the existence of the relations it defines from datasets described using several existing vocabularies (FOAF, PROV, DCAT, VOID, ...). I think this is the main contribution of the paper: the "remodelling" of existing vocabularies into one that expresses the dependencies between datasets from a data-management perspective. To put it bluntly: the ontology is a replacement for what otherwise would have been done using "ad hoc" SPARQL queries (fig. 8). I find the elegance of using OWL appealing, but the SPARQL queries do not have to be ad hoc at all: these can be just as generic. The argument should be that some of the features needed, such as the transitivity of relations defined in the ontology, cannot be expressed in SPARQL.

Unfortunately the paper is not very clear about this, and only in the later sections, where the LinkedUp use case is described, does the motivation for the ontology become clearer. This should be much more explicit: what are the benefits of having such a higher-level description of dataset usage from a data-management perspective? Also, the one use case is not sufficient to illustrate the importance. Why is it necessary to have a "generic conceptualisation for the heterogeneous data artefacts that compose the Semantic Web"? (p. 2) Also, on p. 4 the authors dismiss non-Semantic Web solutions to their problem, for the reason that they intend to improve the "methods for managing Semantic Web data". To me this argument does not hold: it seems very valuable to look beyond the SW world to learn what other, perhaps more mature, fields have designed as solutions.

The authors do not really make clear to me why their ontology is an "ontology pattern" (submitted to the special issue), and the other vocabularies cited (FOAF, PROV, etc...) are not (apparently). The claim that the classes and properties found are somehow "foundational" or "fundamental" is not substantiated. The survey of semantic web applications suggests a bottom-up approach that does not necessarily lead to foundational categories. Also the addition of new features based on the tests (Rexplore and Watson) does not convince me that you may not find new features if you look wider. In other words: what is the scope of the ontology?

In line with this last comment, the future work on property chains is another area where I think the paper falls a bit short in maturity for a journal publication. It is not that I do not believe in the solution offered. I suggest a revision with more real world scenarios, and a resubmission as a regular ontology paper.
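The kind of inference at stake in the transitivity remark of the first paragraph can be sketched as follows. This is an illustration only: it assumes a single hypothetical relation that is declared transitive, and the datanode names are invented. It shows the statements that follow from transitivity alone, not how any particular reasoner or query engine would compute them.

```python
def transitive_closure(pairs):
    """Return all (x, z) pairs reachable by chaining the given (x, y) pairs."""
    closure = set(pairs)
    while True:
        # Join the relation with itself and keep only genuinely new pairs.
        new = {(a, d)
               for (a, b) in closure
               for (c, d) in closure
               if b == c} - closure
        if not new:
            return closure
        closure |= new


# Hypothetical asserted statements of one transitive relation
# (read each pair as "first is derived from second").
asserted = {
    ("dn:report", "dn:cleaned"),
    ("dn:cleaned", "dn:raw"),
}

inferred = transitive_closure(asserted) - asserted
print(inferred)
```

A use case that depends on statements such as the inferred one would make the benefit of the ontology over hand-written queries easier to judge.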

A number of smaller remarks:

p1:

* Explaining the term 'data node' by saying that it is a kind of 'data object' does not help. What is a data object?
* 'mean' -> 'means' (throughout the paper)

p2:

* figure 1: why are these relations 'fundamental'?
* motivate our work -> motivateS our work
* devise generic -> devise a generic

p3:

* How can your 159 keywords, after removing redundancies, map onto 132 classes and 168 properties??
* istances -> instances
* formulated these -> formulate these

p4:

* Testing: what does it tell you if your tests show the need to add new features to your ontology?
* standard for -> standards for

p5:

* [29] define -> [29] defines (and you mention 'a number' and 'set of'... this is irritating, just give me the number so that I don't have to count myself)
* objective of make -> objective to make
* mean -> means

p6:

* The 'R' for reflexive property is not used in any of the tables
* Spacing is incorrect between the first two paragraphs of section 4

p9:

* Shares interpretation with is a *transitive* property... this seems a bit too firm a statement, since you may have partially overlapping interpretations.

p10:

* Table 10: it would be good to motivate why you do not reuse the listed properties directly, but rather define new superproperties for each.

p12:

* Figure 4 shows the summary of the scenario... but in the earlier sections you gave extensive lists of properties... it would be good to give a scenario that covers the span of your ontology, rather than such an overly simplistic one.

p13:

* Nowadays, datahubs includes -> include

p14:

* It would be really good if the authors could bring some of the insights discussed in this section to the beginning of the paper.

