A collaborative methodology for developing a semantic model for interlinking Cancer Chemoprevention linked data sources

Dimitris Zeginis, Ali Hasnain, Nikolaos Loutas, Helena Futscher Deus, Ronan Fox, Konstantinos Tarabanis
This paper proposes a collaborative methodology for developing semantic data models. The proposed methodology for semantic model development follows a "meet-in-the-middle" approach: on the one hand, the concepts emerged in a bottom-up fashion from analyzing the domain and interviewing the domain experts regarding their data needs; on the other hand, it followed a top-down approach whereby existing ontologies, vocabularies and data models were analyzed and integrated with the model. The identified elements were then fed to a multiphase abstraction exercise in order to derive the concepts of the model. The derived model was also evaluated and validated by domain experts. The methodology is applied to the creation of the Cancer Chemoprevention semantic model, which formally defines the fundamental entities used for annotating and describing inter-connected cancer chemoprevention related data and knowledge resources on the Web. This model is meant to offer a single point of reference for biomedical researchers to search, retrieve and annotate linked cancer chemoprevention related data and web resources. The model covers four areas related to cancer chemoprevention: i) concepts from the literature that refer to cancer chemoprevention, ii) facts and resources relevant for cancer prevention, iii) collections of experimental data, procedures and protocols, and iv) concepts to facilitate the representation of results related to virtual screening of chemopreventive agents.
Submission type: 
Ontology Description
Responsible editor: 
Guest Editors

Submission in response to http://www.semantic-web-journal.net/blog/special-issue-linked-data-healt...

Resubmission after a "reject and resubmit" decision in rounds one and two, and an "accept with minor revisions" in round three. First round reviews (under the paper title "Collaborative development of a common semantic model for interlinking Cancer Chemoprevention linked data sources") appear beneath the second round reviews, which in turn appear beneath the third round reviews.

Solicited review by Mari Carmen Suárez-Figueroa:

The following issues should be corrected before accepting the paper:

- Authors should justify in the paper the need for another methodology for building ontologies. They answered my comment in the letter, but they did not include the justification in the manuscript.

- Reference [14] should be corrected in the following way: M.C. Suárez-Figueroa. "NeOn Methodology for Building Ontology Networks: Specification, Scheduling and Reuse". ISBN: 978-3-89838-338-7. IOS Press - AKA. 2012.

- Reference [15] should be updated in the following way: M.C. Suárez-Figueroa, A. Gómez-Pérez, E. Motta, A. Gangemi (eds.). "Ontology Engineering in a Networked World". ISBN: 978-3-642-24794-1. Springer 2012.

- Authors should correct the state of the art because On-To-Knowledge [16] neither supports collaborative ontology development nor involves end-users.

- Authors should clarify how (manually, automatically, etc.) the core concepts are identified by analyzing the competency questions.

- Authors should update reference about ODPs [19] in the following way: V. Presutti, E. Blomqvist, E. Daga, A. Gangemi. "Pattern-Based Ontology Design". Ontology Engineering in a Networked World (ISBN: 978-3-642-24794-1.), 35-64. Springer, 2012.

Solicited review by Alejandro Rodrigues Gonzalez:

Accept as is.

Solicited review by Erick Antezana:

Accept as is.

Second round reviews:

Solicited review by Erick Antezana:

use a spell checker and correct words like:

- chemoprevetion
- coneptualisation
- chemoprventive
- ontolology
- ...

some words should be in italics:

- in-silico
- e.g.
- i.e.
- ...

Ask a native English speaker to read the manuscript... many sentences need proper punctuation!

if "linked data" is used as an adjective, use a dash: linked-data.

"on the one hand " --> "on the one hand, "

elaborate legend of Fig 2.

"Various methodologies for the evaluation of ontologies have been considered in the literature" --> provide references

"implementation : " --> "implementation:"

"Evaluation conducted by people based on criteria and patterns." --> provide examples of criteria and patterns.

"model, that will " --> "model, which will "

create a table presenting the resources mentioned in the paragraph starting with "In order to identify the linked datasets, a thorough search was conducted .................on their Web site."

In table 2, I do not see any "Question" (header of 2nd column)....only assertions...

"linked dataset databases etc. " <-- punctuation!!

"developing and evaluating semantic model and ontologies. " --> model or models??

"The novel part of the approach is the active involvement of the end-users." --> I cannot buy in this sentence... rephrase it...

[3][4] --> [3,4]

limitations of the proposed methodology are poorly discussed.

again: why is this solution technically better than other ones?

if the methodology is the main product of this work why is it not mentioned in the title? (a methodology for...)

"method" and "methodology" are not properly used in some sentences: review

"reference" [94] is not a reference... fix it.

"Assuming the Application-based evaluation" <-- ??

"A significant percentage of the biomedical experts (42.86%) " <-- is the word 'significant' used in the statistical sense here?

still not clear what "completeness and correctness" mean in this context....

could you give an example of a model AND an ontology after the phrase "During the top-down conceptualization existing models and ontologies..."?

in the sentence "In order to identify the ontologies/models.. ", are ontologies and models synonyms?

unconnected phrase "This extensive search identified a total of 18 ontologies"... This = ?

in "This extensive search identified a total of 18 ontologies...": if an EXTENSIVE search identifies only 18 ontologies, I would be shocked... I guess you mean "18 relevant ontologies" or so, right?

"follows a two step approach." <-- use dashes where appropriate...

"...processing of publications and scientific papers in online ..." <-- what is the difference between publications and scientific papers in this context?

"This concept exists in the ISA model" <-- is ISA a model or a framework?

"Then representative concepts from every cluster were extracted. " <-- how? manually? how many in total?

"For each concept the table lists the Ontologies/Models " <-- capital case?

if we agree that it is CHEBI, and not Chebi, then please update all occurrences thereof (e.g. Fig 1)

Solicited review by Mari Carmen Suárez-Figueroa:

The paper presents a collaborative method for building semantic models and a use case in which such a method has been applied. The use case is focused on the biomedical domain, which is interesting. However, authors should better justify the need for another methodology for building ontologies. In the ontology engineering field, there are several methodologies, like Diligent and the NeOn Methodology, that take collaborative issues into account and interact with end users and domain experts. Thus, it is not clear enough why a new methodology is needed and/or why existing methodologies could not be used and adapted to the presented use case.

Comments in Section 2:
- At least the following methodologies should be included in the paper: OnToKnowledge and Diligent.
- The reference to the NeOn Methodology should be updated to the following ones: ref1 and ref2.
- It is not true that only the NeOn Methodology involves end-users, since Diligent also takes into account end-users and domain experts.
- Authors should also rewrite and review the sentence about the concrete steps of the NeOn Methodology. This methodology is based on scenarios and provides a plan for each particular case (this plan includes concrete steps to be followed).

Comments in Section 3:
- Authors should review the guidelines provided by the NeOn Methodology, since the specification phase presented in the paper is very similar to the methodological guidelines for ontology requirements specification within the NeOn Methodology.
- Authors should explain in more detail how the core concepts are identified. For example, in the NeOn Methodology the ontology requirements specification document provides the development team with the most common terms that appear in the requirements [ref3, ref4].
- A similar situation occurs with the identification of related models and ontologies. This is the reuse-based approach followed in the NeOn Methodology, in which ontologies, non-ontological resources and ontology design patterns are searched for and selected to be reused in the new ontology development. Authors should also take into account this reference [ref5].
- Regarding ontology evaluation, authors could refer to the following summary of methods and tools [ref6]. Authors could also consider the possibility of using OOPS! (www.oeg-upm.net/oops) for detecting possible pitfalls in the ontology.

Comments in Section 4:
- Authors could take a look at the following reference [ref7] regarding guidelines on how to publish data as LD.
- It would be nice to know whether the CQs are available (if so, the url should be provided) and how many requirements the development team gathered in the specification phase.
- Authors should explain why they did not apply a multi-criteria selection method for choosing the ontologies, as proposed in [ref5].
- Authors should mention in the paper whether they considered the use of ontology design patterns during the conceptualization phase.
- It would be nice to have the url in which the OWL ontology is available.
- Authors should explain why the alignment was manual instead of using any of the existing tools in the area.

- NeON methodology --> NeOn Methodology (Section 2)

* [ref1]: NeOn Methodology for Building Ontology Networks: Specification, Scheduling and Reuse (EAN/ISBN/ISSN: 978-3-89838-338-7)
* [ref2]: Ontology Engineering in a Networked World (http://www.springer.com/computer/ai/book/978-3-642-24793-4)
* [ref3]: How to Write and Use the Ontology Requirements Specification Document (ODBASE 2009)
* [ref4]: Ontology Requirements Specification (2012) (http://www.springer.com/computer/ai/book/978-3-642-24793-4)
* [ref5]: Chapter about Ontology Reuse in "NeOn Methodology for Building Ontology Networks: Specification, Scheduling and Reuse" (EAN/ISBN/ISSN: 978-3-89838-338-7)
* [ref6]: Ontology (Network) Evaluation (2012) (http://www.springer.com/computer/ai/book/978-3-642-24793-4)
* [ref7]: Methodological Guidelines for Publishing Government Linked Data, in Wood, David (Ed.), Linking Government Data (2011)

Solicited review by Alejandro Rodrigues Gonzalez:

I've made a first review of the paper without paying attention to previous reviews, and I've found the paper well-written and quite interesting (I just have some minor comments). Regarding the previous reviews, it seems that the authors have made a great effort to improve the paper in comparison with previous versions. My comments:

1) The datasets mentioned in the middle of the paper (55 datasets) are included in the text with the associated references. This is not bad, but I think that they could be included in a separate table.

2) In page 7, before the SPARQL query there is a typo: "ChemopreventiWe agent".

3) Authors mention OWL Light? Do you mean Lite?

4) The reference for Methontology is wrong. In the text you mention "Fernandez et al." but the first author in reference [1] is Corcho.

First round reviews:

Solicited review by Robert Stevens:

This paper describes the development of a Cancer Chemoprevention ontology to underpin the integration of diverse, heterogeneous data in that field, from molecular data to the literature. The paper is presented as a report on the ontology and the method used in its development. The ontology is an application ontology that takes other ontologies and fills the gaps, namely cancer chemoprevention, to "complete" the ontology for the evaluation.

The paper suffers in several aspects:
1. It reads just like a report of what was done. It needs to draw out the challenges in modelling and reconciliation so that others can learn from the experience of the authors.
2. It lacks details of the ontology and the method. I'd like to see much more of the ontology, how its components were aligned and reconciled, and how mis-matches were found and resolved. BFO was used as an upper-level ontology to ease integration. If some of the client ontologies were not BFO-compliant, how was this dealt with? Did it make any difference, etc.? More detail, and evidence that enabled conclusions to be drawn, would really help it as a paper.

There is a lack of detail on the crowd-sourcing. In the introduction it is a collaborative method and in the conclusion it is crowd-sourcing. How was this managed and motivated? If the "crowd" were all project members, then the motivation is clear; if not, it would be useful to know about the means of motivation. Numbers of participants and their demographic profiles etc. would be a useful detail. Lessons in how to crowd-source ontology evaluation and what the right questions to ask are would be a good contribution to the field. More detail in the method would be useful - is it anything new? It is just, as far as I can see, the same old model. It reads like a waterfall model, but the number of iterations (which are implied by the evaluation) would be useful. The evaluation only gives percentages of positive responses - the comments and the changes made are actually the interesting bit. How were they collected; what types of changes were made; how were decisions made; etc.? All of this would have lifted the paper above the simple report that it currently is.

Details of method: how were things searched for? Details of the review. Full list of resources as supplementary data.

3. The authors claim "It lowers the semantic interoperability barriers and thus contributes to the reusability of existing biomedical ontologies and data described using different semantic models.", but present no evidence. It probably does this job, but we are not shown what questions can be answered, what questions need to be answered, what the workflow is, and how much data actually end up in the knowledgebase. How are the data queried? Is it through DL queries against the ontology; is automated reasoning used; is it SPARQL? This is all unclear.
4. The investigation of the 70 resources mentioned at the start is potentially interesting. These could be listed in supplementary data and an analysis of the heterogeneities presented. This would give grounds for describing the reconciliation, for which I presume Google Refine was used. Again, how this worked would be a useful aspect of a report.
5. Perhaps most problematic for the special issue is the lack of linked data. It is in the title, but not really in the paper. Ontology description and method are in the call, but the topics (which I know are not limiting) are very heavy on the linked data aspect. The paper needs to mention how it pertains to linked data. This links back to the point about how it is used - competency questions, how they are asked, how well they are answered - the performance over the data, how the data were annotated, the error rate of this annotation etc.
6. There is a lot of literature cited, but I don't learn any lessons from the review. What was taken from the papers in terms of lessons, etc.?
7. The writing style is OK, but needs a lot of polish in the detail.

This could be a good paper. It needs a major re-working to draw out the challenges and the lessons learnt. A good separation of method and results would help the readability. More detail on the areas closest to the core challenges and lessons would help the paper work much better.

Minor points

- "Most of the biomedical experts (42.86%) found the model easy to understand (Question 2)." - how is this "most"?
- "Regarding questions 3 and 5, the answers vary about the theoretical support needed by the users to understand the model." - how?
- "It lowers the semantic interoperability barriers and thus contributes to the reusability of existing biomedical ontologies and data described using different semantic models. " how do you know?

These are just three examples; it is like this throughout the paper.

Solicited review by Erick Antezana:

Zeginis et al. present a work on the development of a semantic model to serve cancer chemoprevention applications. Although the work presented falls under the best practices to be followed in every serious (engineering) project development (i.e. the modeling phase, which is the first step that every relatively large project should embark on but which is usually ignored), it fails to properly present the model itself and the methodology:

- the novelty of the work is not properly highlighted and discussed
- there is no comparison with other established solutions ("active engagement" is a project management issue….)
- limitations are not discussed (or poorly mentioned)
- why is this solution technically better than other ones?
- how does the model handle the updates of the resources it will support (data, ontologies, etc...)?
- how does the model ensure data consistency? how to handle overlapping data? contradictory information?

A tangible application, use case or minimal setup showing the power of this model would make the case more convincing…

Finally, CanCO and the manuscript have many conceptual mistakes (as well as minor issues):

- The CanCO (http://bioportal.bioontology.org/ontologies/3030) has serious issues: Chemoterapy IS NOT a Molecule !!
- in CanCO, concept "Collection" is duplicated...
- the Gene Ontology (GO) [5] and BioPax [6] aim at standardizing the representation of genes and pathways, --> the Gene Ontology (GO) [5] and BioPax [6] aim at standardizing the representation of genes and pathways, respectively
- "chemoprevention action of an agent": what type of agent? define agent…
- CanCO stands for Cancer Chemoprevention Semantic Model; is there any reason why the letters S (Semantic) and M (Model) were not taken into account to come up with a maybe more natural acronym?
- How is the following sentence elaborated in the text: "CanCO facilitates the delivery of machine- interpretable information regarding their structure and content, supporting the on demand discovery of published cancer chemoprevention related data."?
- "and doctors", do you mean physicians?
- should "bioinformaticians" be explicitly shown in Fig 1?
- Neither "Li et al." nor "Öhgren et al." is a methodology… they each have suggested a methodology, though…
- In Fig 2, it seems that domain experts and ontology engineers do not interact directly at all…?
- in Fig 2, should readers assume that the steps must be followed from top to bottom (i.e. in that order)?
- "Specification. This phase investigates" : is the phase that investigates? or an investigation is performed during that phase?… same question/issue for the other phases..
- "the level of granularity of the concepts should also be taken into account": give an example..
- Why is the first phase called specification, and not conceptualization, if you already define the granularity of CONCEPTS? Should the conceptualization phase be performed before the specification one?
- On the one hand relevant --> On the one hand, relevant
- Legend of Fig 3 is useless…
- In Fig 3, what is the difference between 'available data sets' and 'experimental data'?
- Fig 2 should explicitly show the output of each phase (which in turn is the input for the next phase)
- define 'formal or semi-computable model'
- 'translated into a computable model in any ontology language.' : are you sure that the word ANY is appropriate in that sentence?
- how do you evaluate the "completeness, correctness, usability and simplicity of CanCO"? Could you elaborate more on the human assessment of that evaluation?
- Fig 4: what are the "spaces" exactly?
- Section 3.1, no need to justify the work here… all that justification should be in the introductory section…
- "The genericity of the existing ontologies that…" --> "The genericity of the existing ontologies, which …" review the usage of THAT and WHICH in the paper…
- Rewrite paragraph: "The model reflects the requirements of the bio……"
- "detect modeling needs"???
- Is it necessary to mention the workshop of May 2011???
- into 4 spaces --> into four spaces
- define: performance of cancer chemoprevention experiments
- how were the 18 ontologies "detected"? elaborate on the "extensive" literature review you mention
- ISA [23] is listed as being an ontology… this is not true… which definition of ontology do you use?
- "where the concepts of the models/ontologies were reviewed.": have you reviewed all the "concepts" in the resources you mention? have you found any issues? do you agree with everything those resources provide?
- how do you define the "clusters of high similarity"?? do you consider the definition of each concept? synonyms? what else?
- "This means that the elements of a specific cluster were conceptually/semantically related despite differences in terminology" : what do you mean with differences in terminology??
- Table 1 legend is useless…
- Chebi or CHEBI?
- Is Table 1 comprehensive? Annexes needed?
- Fig 6 is discussed BEFORE Fig 5…?
- Some elements shown in Fig 5 ARE NOT ontologies…
- a table grouping the 55 "resources" is needed… add there references, a summary of each resource, keywords, URL, if it has a SPARQL endpoint, what it contributes to the CanCO, etc?
- the URLs pointing to the questionnaires should not be presented as footnotes…
- Where can I find reference 78?
- Rewrite 1st paragraph of section 3.2.3
- What are the advantages and limitations (if any) of the modeling of Fig 6?
- What do you mean by "semi-computable model"?
- Which version of OWL (e.g. OWL DL) does CanCO belong to? and why?
- How is the system dealing with updates? (updating resources might produce some contradictory facts… )
- What about the overlapping concepts? is there any?
- Why do you have/use two top level ontologies: BFO and biotop? what about the top level concepts provided by the other resources?
- "Thus, it enables a semantic interoperability between a large number of ontologies which are accessible ranking "under" this upper ontology." : could you elaborate on the semantic interoperability aspect?
- "CanCO needs to be evaluated and tested according to specified criteria." : needs to be? or was?
- where is the reference to table 4 (in the text)?
- Is the paper's major product an ontology or a methodology?
- I would expect in section 4 (Demonstration and Evaluation of the model) a demonstration of the usability of the model and not an analysis of the questionnaire…
- How did the prospective system users evaluate the model?
- I don't see why the system is "hybrid"…
- How the proposed model "lowers the semantic interoperability barriers" is not well elaborated...
- URL: http://bit.ly/fZLh5K (translated into http://wapps.islab.uom.gr/limesurvey/index.php?sid=48484&newtest=Y) is broken:
Deprecated: Function eregi() is deprecated in C:\xampp\htdocs\limesurvey\common.php on line 86

Deprecated: Function eregi() is deprecated in C:\xampp\htdocs\limesurvey\common.php on line 86

Deprecated: Function ereg_replace() is deprecated in C:\xampp\htdocs\limesurvey\classes\core\sanitize.php on line 202

- "The model relies on widely known and adopted biomedical standards to…" : which are those widely known and adopted standards?

Solicited review by Iker Huerga:

The main aim of this paper is to investigate the very interesting and complex problem of building an ontological Cancer Chemoprevention model. In my opinion, the significance to the field and the contribution are substantial because i) cancer chemoprevention is very likely to become one of the key cancer research areas in the following years, so the need for a unified model exists, ii) currently there is no ontological model clearly designed for specifically targeting this domain, and iii) the innovative collaborative methodology introduced by the authors in this paper uses the inputs provided by medical experts. Thus, I would recommend including it in the journal.
This said, I would like the following comments/questions to be answered before publishing the paper.
First, the authors propose a combined bottom-up and top-down conceptualization of the model, which basically means a model composed of concepts extracted from either ontologies or publicly available datasets in the Linked Data cloud. Although where the data comes from is clearly defined, the methodology used to extract it is not. In the specific case of publicly available datasets the authors just mention "The analysis of the publicly available datasets was based either on the data provided through the SPARQL endpoints of each dataset or through the searching mechanism provided by their Web site", but the algorithms or techniques used to look for valuable entities in these datasets are not mentioned. Did the authors query the 55 datasets they claim to have analyzed by manually SPARQL-querying each of them? If so, how can they be completely sure that the extracted information is accurate and up to date? In my opinion just an approximation was made, hence the inaccurate results in Table 1. For instance, the entity Chemopreventive agent, which is the main contribution of CanCO (see last paragraph of section 3.2), is already present in UMLS as C1216463 and is already present in one of the analyzed datasets, i.e. Linked Life Data, with its own URI, i.e. http://linkedlifedata.com/resource/umls/id/C1516463; it is also mapped to NCI Thesaurus id C1892 as shown in http://proteins.wikiprofessional.org/index.php/Concept:f7d18794-2f6e-4d8... (open More About this concept -> Reference), whereas the Table 1 Top-Down column for Chemopreventive agent shows no results.
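As a rough illustration of the manual endpoint querying the reviewer describes, a label lookup of this kind might look like the sketch below. This is a hypothetical example, not a query from the paper: the `skos:prefLabel` predicate is an assumption, as each dataset exposes its own labelling properties, so the query would need to be adapted for every endpoint (and run against, e.g., the Linked Life Data SPARQL endpoint).

```sparql
# Sketch only: find resources whose label mentions "chemopreventive agent".
# skos:prefLabel is an assumed labelling property; real datasets may use
# rdfs:label or a dataset-specific predicate instead.
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept ?label
WHERE {
  ?concept skos:prefLabel ?label .
  FILTER (CONTAINS(LCASE(STR(?label)), "chemopreventive agent"))
}
LIMIT 10
```

Repeating such a query by hand across 55 endpoints is exactly the error-prone process the reviewer is questioning, which is why the extraction methodology deserves explicit description.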

Second, section 3.2.2 claims that the experimental data analysis was conducted through analyzing two sets of experimental data, see citations [79] and [80]. While [79] refers to a study for screening cancer chemopreventive agents for breast cancer in which MMOC was used, [80] refers to an article published in 2002 on in vitro screening of potential cancer chemopreventive agents. So my questions here are i) why did the authors not make a deeper literature review beyond these two articles? For instance, SPARQLing Linked Life Data to see articles that mention chemopreventive agents, and ii) why are already established resources such as the Cancer Prevention Network (http://www.cancerpreventionnetwork.org) not mentioned/used here?

Finally, in my opinion virtual screening is a really complicated discipline that is not properly modeled with the classes included in the CanCO ontology, and I think that approaches such as http://www.ncbi.nlm.nih.gov/pubmed/21613989, but applied to chemopreventive agents, should be considered here. Also, it would be worth mentioning how many experts, and from which disciplines, went over the questionnaires (see Table 4), since 71.49% of 1000 is not the same as 71.49% of 10.
In terms of the format and syntax of citations, references, tables and figures, I just noticed that references 2 and 4 are the same.