Lifecycle Models of Data-centric Systems and Domains

Paper Title: 
Lifecycle Models of Data-centric Systems and Domains
Authors: 
Knud Möller
Abstract: 
The Semantic Web, especially in the light of the current focus on its nature as a Web of Data, is a data-centric system, and arguably the largest such system in existence. Data is being created, published, exported, imported, used, transformed and re-used, by different parties and for different purposes. Together, these aspects form a lifecycle of data on the Semantic Web. Understanding this lifecycle will help to better understand the nature of data on the SW, to explain paradigm shifts, to compare the functionality of different platforms, to aid the integration of previously disparate implementation efforts or to position various actors on the SW and relate them to each other. However, while conceptualisations on many aspects of the SW exist, no exhaustive data lifecycle has been proposed to our knowledge. This paper proposes a data lifecycle model for the Semantic Web by first looking outward, and performing an extensive survey of lifecycle models in other data-centric domains, such as digital libraries, multimedia, eLearning, knowledge and Web content management or ontology development. For each domain, an extensive list of models is taken from the literature, and then described and analysed in terms of its different phases, actor roles and other characteristics. By contrasting and comparing the existing models, a meta vocabulary of lifecycle models for data-centric systems (the Abstract Data Lifecycle Model, or ADLM) is developed. In particular, a common set of lifecycle phases, lifecycle features and lifecycle roles is established, as well as additional actor features and generic features of data and metadata. This vocabulary now provides a tool to describe each individual model, relate them to each other, determine similarities and overlaps and eventually establish a new such model for the Semantic Web.
Submission type: 
Survey Article
Responsible editor: 
Werner Kuhn
Decision/Status: 
Accept
Reviews: 

This is a revised submission, now accepted for publication. The reviews for the revision are below, followed by those for the original submission, which was accepted with minor revisions.

Review 1 by Tomi Kauppinen

The new version is sufficiently taking into account the suggestions I made in my review.

Review 2 by Todd Pehle

The clarifications from the first review appear to be sufficiently addressed so I accept as is. I'm still unsure about the image that was clipping some of the text in section 2.2. The author couldn't reproduce and it wasn't mentioned as an issue from other reviewers, so perhaps I have an outdated .pdf reader. I'll have to check on that!

Reviews for the original submission:

Review 1 by Tomi Kauppinen

As the submission is a survey article, the review is organized according to the criteria for survey articles listed at http://www.semantic-web-journal.net/reviewers.

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

The article introduces the topic by giving an overview of existing lifecycle models from the literature and by comparing them using what authors call the Abstract Data Lifecycle Model. This seems to be a good approach as it allows to discuss similarities and differences between existing models. As a result the phases and features of the examined models were summarized in Tables 1 and 2, and further explained in the text. Taking all this, the text serves well as an introductory text on the topic.

(2) How comprehensive and how balanced is the presentation and coverage.

Quite a few lifecycle models were presented in Section 2, and a reader gets a feeling of a full coverage of relevant models. However, proposals related to handling versioning, provenance and trust are also dealing with lifecycle of data and thus could have been included in the discussion or at least mentioned why they were left out. Otherwise a reader might be misled to believe that every possible aspect related to lifecycle of data is covered by this survey article.

(3) Readability and clarity of the presentation.

The text was mostly easy to read. One minor comment though: author sometimes uses "we" and sometimes "I" to refer to the author. I suggest author would harmonize this for the text to be more readable.

(4) Importance of the covered material to the broader Semantic Web community.

Taking into account the points related to (1) and (2) the article serves as a nice survey article on the topic and hence is important as such. What could be perhaps added would be a discussion of what is missing in the current lifecycle models (e.g. issues related to provenance, versioning and trust) and how/if lifecycle models should be enhanced in the future to serve better in creating linked data.

Review 2 by Todd Pehle

Specific Comments:

1. Introduction

"It is therefore crucial to have a common understanding of where and what fixed points are in which discussions can be anchored":

Since this is primary purpose of the research, may use a bit of clarification on "fixed points". For example, fixed points with respect to? Or perhaps a bit more concrete description may be good.

2.2 Lifecycles in eLearning

Minor issue at least on the .pdf that I downloaded. The image in Figure 4 on page 4 clips some of the text making it difficult to read on this particular page.

2.3 Lifecycles in Digital Libraries

"...it becomes obvious that it isn't really a lifecycle model for data, but rather a lifecycle model for ontology...":

Some in the Semantic Web may consider ontology schema along with instance data to all be considered data. It may be worth pointing out how data-centric system lifecycles may vary or differ amongst instance data-based lifecycles, metadata-based lifecycles and schema-based lifecycles.

2.5 Lifecycles in Databases

Just as a potential suggestion, this section may benefit from expanding on other lifecycles in the database realm in addition to CRUD due to relevance of database data to the Semantic Web realm. For instance, the paper cites examples of ontology lifecycles. It may be good to also cite database logical model lifecycles and compare to ontology lifecycles. It may serve as a way to study lifecycle differences between data built for single, closed-world domains (DB) vs. multiple, open-world domains (ontology). As such, the comparison itself may be outside the intended scope of the paper.

One other point I noticed as a reader was that at the beginning of the paper I presumed I understood what the term "lifecycle" itself means. As I came across the CRUD lifecycle example I realized I hadn't previously thought of CRUD as a data lifecycle, but instead more as operations on data. Hence, from my view at least, it may be good to cite the definition (if there is one) of a lifecycle. Perhaps this is best done in the introduction of the paper. For example, do they exhibit: temporal flow or state, specify an ordered set of tasks, mandate a beginning and an end, etc.?

3 The Abstract Data Lifecycle Model

"The alternative approach...would be to begin with the abstraction and then use it to classify a selection of instances...":

Just from a reader standpoint, I'd be curious to understand if this alternative top-down approach is not applicable, not correct or just simply wasn't selected as an abstract data lifecycle design methodology for this paper.

3.1 Lifecycle Phases

I like the identification of phases. I also wonder if phase state, inputs/outputs or relationships between phases should also be made explicit? Perhaps this is left as part of the definition of the phases themselves?

Where would data exploitation fit into the lifecycle phases? Is it in creation phase or refinement phase or other? It seems like a distinction should be drawn between knowledge acquisition of raw data and acquisition of knowledge based on data exploitation.

3.2 Distinction Data vs. Metadata

Referencing ontology "models" as data, is there or should there be a distinction made between instance data, metadata AND models?

3.4 Actor Features

Actor Humaness: Would a "Human and Machine" Actor also be needed? Or perhaps ADLMs can exhibit both "human" and "machine" roles via multi-inheritance? I only mention this because many tools that produce data are classified as manual, fully automated OR semi-automated. However, perhaps this is not the intended granularity the author is seeking.

4.1 Semantic Web Lifecycle Phases

Planning section:

There's a small grammatical error (I think) in the phrase "Planning must precede any creation of refinement of data". I think it was intended to read "creation OR refinement" of data.

Does reasoning on the Semantic Web fit under Refinement? Since reasoning is discussed frequently in Semantic Web, it may be worth dedicating a few sentences to the discussion.

Under Termination section, the sentence referencing "it is not possible...to completely terminate any piece of data..." brings up a good point and question that could be elaborated. Namely, does "data" in ADLM represent an instance of a single statement or multiple distributed serializations of the same piece of data?

General Notes:

Curious if there are any "Web Data" lifecycles that could be cited? Since the Web of Documents and Web of Data will co-exist in the same information space of the Web, it would be interesting to see similarities and differences between lifecycles for unstructured data vs. structured data. Perhaps this could also be reserved for future work.

I think the paper does a good job of crossing both a wide spectrum of specific data application domains as well as generalized data domains (instance, conceptual, metadata).

Curious if real-time data lifecycles exist or if there are differences in lifecycle models for these types of domains?

The paper has good clarity and is well written!

Comments

---> replies are below, indicated by arrows

As the submission is a survey article, the review is organized according to the criteria for survey articles listed at http://www.semantic-web-journal.net/reviewers.
(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.
The article introduces the topic by giving an overview of existing lifecycle models from the literature and by comparing them using what authors call the Abstract Data Lifecycle Model. This seems to be a good approach as it allows to discuss similarities and differences between existing models. As a result the phases and features of the examined models were summarized in Tables 1 and 2, and further explained in the text. Taking all this, the text serves well as an introductory text on the topic.
(2) How comprehensive and how balanced is the presentation and coverage.
Quite a few lifecycle models were presented in Section 2, and a reader gets a feeling of a full coverage of relevant models. However, proposals related to handling versioning, provenance and trust are also dealing with lifecycle of data and thus could have been included in the discussion or at least mentioned why they were left out. Otherwise a reader might be misled to believe that every possible aspect related to lifecycle of data is covered by this survey article.

---> some of these aspects, in particular versioning, are implicitly covered by the fact that many of the lifecycle models, and also the ADLM, allow for repetition. Data is created and used, there might be feedback, and then the cycle potentially starts all over again, resulting in a new version of the data. Trust and provenance are metadata about the primary data of a lifecycle. Several of the models presented in the paper have phases that allow for this (such as "annotation" in Hardman, or the whole metadata space in Kosch). I have added text explaining this to Sects. 3.1 and 4.1.
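
The point about repetition can be illustrated with a minimal sketch. It is not taken from the paper: the `DataItem` class, the `run_cycle` function and the reduced phase list are hypothetical names chosen for illustration. Only two ideas come from the reply above: each new pass through the cycle results in a new version, and provenance is metadata about the primary data.

```python
from dataclasses import dataclass, field

# Hypothetical, reduced phase list: only "creation", "use" and "feedback"
# are named in the reply above; the ADLM itself defines more phases.
PHASES = ("creation", "use", "feedback")


@dataclass
class DataItem:
    name: str
    version: int = 0
    # Log of (version, phase) pairs, one entry per phase the item passed through.
    history: list = field(default_factory=list)
    # Metadata about the primary data (e.g. provenance), keyed by version.
    metadata: dict = field(default_factory=dict)


def run_cycle(item: DataItem, source: str) -> DataItem:
    """Run one full pass through PHASES; each pass produces a new version."""
    item.version += 1
    item.metadata[item.version] = {"provenance": source}
    for phase in PHASES:
        item.history.append((item.version, phase))
    return item


item = DataItem("example-dataset")
run_cycle(item, source="initial import")
run_cycle(item, source="revised after feedback")
print(item.version)      # 2
print(item.metadata[2])  # {'provenance': 'revised after feedback'}
```

In this sketch versioning needs no phase of its own, since it falls out of running the cycle again, and provenance is kept next to the data rather than inside it.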

(3) Readability and clarity of the presentation.
The text was mostly easy to read. One minor comment though: author sometimes uses "we" and sometimes "I" to refer to the author. I suggest author would harmonize this for the text to be more readable.

----> the text was revised and made consistent to only use "we"

(4) Importance of the covered material to the broader Semantic Web community.
Taking into account the points related to (1) and (2) the article serves as a nice survey article on the topic and hence is important as such. What could be perhaps added would be a discussion of what is missing in the current lifecycle models (e.g. issues related to provenance, versioning and trust) and how/if lifecycle models should be enhanced in the future to serve better in creating linked data.

---> I added some text to the conclusion to better address this.

---> replies below, indicated by arrows

1. Introduction
"It is therefore crucial to have a common understanding of where and what fixed points are in which discussions can be anchored":
Since this is primary purpose of the research, may use a bit of clarification on "fixed points". For example, fixed points with respect to? Or perhaps a bit more concrete description may be good.

---> I have modified and extended the introduction to make clear what is meant by these points, and that they are what a conceptual model such as a lifecycle will provide.

2.2 Lifecycles in eLearning
Minor issue at least on the .pdf that I downloaded. The image in Figure 4 on page 4 clips some of the text making it difficult to read on this particular page.

---> Unfortunately, I was not able to reproduce this problem. Is it only in the printout or in the electronic PDF as well? Could you say exactly where the clipping occurs, i.e., which text is clipped?

2.3 Lifecycles in Digital Libraries
"...it becomes obvious that it isn't really a lifecycle model for data, but rather a lifecycle model for ontology...":
Some in the Semantic Web may consider ontology schema along with instance data to all be considered data.

---> Absolutely! I'm one of them.

It may be worth pointing out how data-centric system lifecycles may vary or differ amongst instance data-based lifecycles, metadata-based lifecycles and schema-based lifecycles.

---> I have extended Sect. 4.1 somewhat to clarify this question. In essence, a schema or ontology lifecycle would be separate from the instance data, but similar in its shape (it can involve the same phases and can be characterised using the same features).

2.5 Lifecycles in Databases
Just as a potential suggestion, this section may benefit from expanding on other lifecycles in the database realm in addition to CRUD due to relevance of database data to the Semantic Web realm. For instance, the paper cites examples of ontology lifecycles. It may be good to also cite database logical model lifecycles and compare to ontology lifecycles. It may serve as a way to study lifecycle differences between data built for single, closed-world domains (DB) vs. multiple, open-world domains (ontology). As such, the comparison itself may be outside the intended scope of the paper.
One other point I noticed as a reader was that at the beginning of the paper I presumed I understood what the term "lifecycle" itself means. As I came across the CRUD lifecycle example I realized I hadn't previously thought of CRUD as a data lifecycle, but instead more as operations on data.

---> It's true that CRUD is a set of processes, rather than a lifecycle model. However, the phases in the lifecycles surveyed in the paper, as well as in the ADLM, are all essentially actions or activities related to the data. I.e., they are very similar to processes. This, and the fact that it is very well known, are the reasons I added CRUD to the survey. I have extended this section slightly to make this clearer.

Hence, from my view at least, it may be good to cite the definition (if there is one) of a lifecycle. Perhaps this is best done in the introduction of the paper. For example, do they exhibit: temporal flow or state, specify an ordered set of tasks, mandate a beginning and an end, etc.?

---> To date, I haven't been able to find a proper definition of the term "lifecycle" (from a technical point of view). I suppose there is a general assumption that everyone just knows what it means. For completeness' sake, I have added a generic definition from the OED to the introduction of the paper.

3 The Abstract Data Lifecycle Model
"The alternative approach...would be to begin with the abstraction and then use it to classify a selection of instances...":
Just from a reader standpoint, I'd be curious to understand if this alternative top-down approach is not applicable, not correct or just simply wasn't selected as an abstract data lifecycle design methodology for this paper.

---> The top-down approach is not wrong per se. However, I chose the bottom-up approach to ground the ADLM in concrete, existing lifecycle models, and ensure it would at least cover those. It was supposed to be as broad as possible. The top-down approach might be more useful when one has a specific, restricted use case in mind. I have added a footnote with a comment on this.

3.1 Lifecycle Phases
I like the identification of phases. I also wonder if phase state, inputs/outputs or relationships between phases should also be made explicit? Perhaps this is left as part of the definition of the phases themselves?

---> this was left out so as not to overcomplicate the model. Also, the exact inputs and outputs would differ depending on the concrete usage of the lifecycle model.

Where would data exploitation fit into the lifecycle phases? Is it in creation phase or refinement phase or other? It seems like a distinction should be drawn between knowledge acquisition of raw data and acquisition of knowledge based on data exploitation.

---> the difference would be in how the model is used. In this case, I would suggest a low-level lifecycle to model raw data and a high-level lifecycle for data exploitation. Both would use the ADLM, but probably with a focus on different phases and a different level of granularity in terms of data.

3.2 Distinction Data vs. Metadata
Referencing ontology "models" as data, is there or should there be a distinction made between instance data, metadata AND models?

---> See comment above (2.3)

3.4 Actor Features
Actor Humaness: Would a "Human and Machine" Actor also be needed? Or perhaps ADLMs can exhibit both "human" and "machine" roles via multi-inheritance? I only mention this because many tools that produce data are classified as manual, fully automated OR semi-automated. However, perhaps this is not the intended granularity the author is seeking.

---> I would model semi-automated operations with two actors, one human and one machine. I have added clarification concerning this to 3.4.
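
A minimal sketch of the two-actor idea follows. It is an illustration under stated assumptions, not the paper's formalisation: `ActorKind`, `Actor`, `Operation` and `automation_level` are hypothetical names. The sketch only shows that no third "human and machine" actor type is needed, because the automation level can be derived from the kinds of actors taking part in an operation.

```python
from dataclasses import dataclass
from enum import Enum


class ActorKind(Enum):
    HUMAN = "human"
    MACHINE = "machine"


@dataclass(frozen=True)
class Actor:
    name: str
    kind: ActorKind


@dataclass(frozen=True)
class Operation:
    phase: str      # e.g. "creation" or "refinement"
    actors: tuple   # one or more Actor instances


def automation_level(op: Operation) -> str:
    """Derive the automation level from the kinds of actors involved."""
    kinds = {actor.kind for actor in op.actors}
    if not kinds:
        raise ValueError("an operation needs at least one actor")
    if kinds == {ActorKind.HUMAN}:
        return "manual"
    if kinds == {ActorKind.MACHINE}:
        return "fully automated"
    # Both kinds are present: one human and one machine actor.
    return "semi-automated"


curator = Actor("curator", ActorKind.HUMAN)      # hypothetical actor
extractor = Actor("extractor", ActorKind.MACHINE)  # hypothetical actor

semi = Operation("refinement", (curator, extractor))
print(automation_level(semi))  # semi-automated
```

Keeping each actor strictly human or machine avoids multiple inheritance; the mixed case is expressed by the operation, not by the actor.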

4.1 Semantic Web Lifecycle Phases
Planning section:
There's a small grammatical error (I think) in the phrase "Planning must precede any creation of refinement of data". I think it was intended to read "creation OR refinement" of data.

---> this was corrected

Does reasoning on the Semantic Web fit under Refinement? Since reasoning is discussed frequently in Semantic Web, it may be worth dedicating a few sentences to the discussion.

---> inference would indeed be refinement (or creation). I have added text to this effect to Sect. 4.1.

Under Termination section, the sentence referencing "it is not possible...to completely terminate any piece of data..." brings up a good point and question that could be elaborated. Namely, does "data" in ADLM represent an instance of a single statement or multiple distributed serializations of the same piece of data?
General Notes:
Curious if there are any "Web Data" lifecycles that could be cited?

---> some SW/Web of data lifecycles are discussed at various places throughout the paper. I have added an additional (unpublished) Linked Data lifecycle by Michael Hausenblas. Also, a short paragraph about this has been added to the conclusion.

Since the Web of Documents and Web of Data will co-exist in the same information space of the Web, it would be interesting to see similarities and differences between lifecycles for unstructured data vs. structured data. Perhaps this could also be reserved for future work.

---> that sounds like an interesting idea. I expect it should be possible to apply the ADLM to unstructured data in the same way as Sect. 4 does it for the Semantic Web, though the outcome would of course be different. I'd also like to point out that some of the lifecycle models in the survey do in fact deal with unstructured data!

I think the paper does a good job of crossing both a wide spectrum of specific data application domains as well as generalized data domains (instance, conceptual, metadata).
Curious if real-time data lifecycles exist or if there are differences in lifecycle models for these types of domains?
The paper has good clarity and is well written!