A semantic model for scholarly electronic publishing in Biomedical Sciences

Paper Title: 
A semantic model for scholarly electronic publishing in Biomedical Sciences
Carlos H. Marcondes, Luciana R. Malheiros, Leonardo C. da Costa
Despite numerous advancements in information technology, electronic publishing is still based on the print text model. The natural language textual format prevents programs from semantically processing article content. A semantic model for scholarly electronic publishing is proposed, in which the article conclusion is specified by the author and recorded in a machine-understandable format, enabling semantic retrieval and identification of traces of scientific discoveries and knowledge misunderstandings. 89 biomedical articles were analyzed for this purpose. A content model comprising semantic elements and their sequences in articles was developed. Four patterns of reasoning and sequencing of semantic elements were identified in the analyzed articles. The development and testing of a prototype of a Web submission interface to an electronic journal system that partially implements the proposed model are reported.
Submission type: 
Full Paper
Responsible editor: 
Krzysztof Janowicz

Submission in response to http://www.semantic-web-journal.net/blog/special-issue-new-models-semant...

Review 1 by Paul Groth
The authors have made appropriate changes since the last revision. In particular, the summary and rationale for the 89 papers is better. The paper can be accepted as is.

Resubmission after "accept with minor revisions." Previous round reviews are below.

Review 1 by Paul Groth
While the paper has a clearer focus, there are still a number of issues that are not completely addressed in this revision. In particular, the introduction is still rather long and does not convey the purpose of the paper. It starts with a long discourse about information retrieval systems and then abruptly moves on to discussing the imprints of relations in science and then skips to the semantic web and then to correlation of article content and scientific discoveries. A more straightforward introduction would be helpful to the reader and bring out the core contribution of the paper.

It's clear that the key contribution of the paper is the analysis of the 89 papers and their mapping to the proposed model of publication. However, this analysis is limited to one paragraph. The second reviewer had asked for a table making this analysis more prominent. Indeed, giving some examples of how papers fit with the identified patterns would help.

Finally, there are still some English errors in the paper; a thorough re-reading is necessary.

While I think this paper is almost there, there are still major revisions necessary with respect to the introduction and the addition of the suggested table.

Review 2 by Tim Clark

This paper has been substantially improved over the earlier version and is fine to accept as is for publication.

This is a resubmission after a reject and resubmit. The reviews for the original submission are below.

Review 1 by Paul Groth
This paper describes a semantic model for representing biomedical scientific articles. It discusses the implementation of a submission system for a journal that captures information about articles according to this model. Additionally, it discusses how articles with scientific advances do not fit into standard ontologies when they are first introduced in the literature.

Overall, I think this paper has a number of contributions but tries to do too much and thus does not provide enough information on any of the individual contributions. I believe the possible contributions of this paper are:

1) A semantic model of electronic publishing
2) A journal submission system to capture structured data around scientific articles
3) An analysis of whether one can detect whether scientific articles contain novel science by their lack of correspondence to existing ontologies.

I will discuss which additional details are missing for each contribution:

1) Semantic Model
It is unclear from the paper whether the model is the same as the one presented in [10] and, if so, what additions there are. It is also not clear how this differs from existing semantic representations of papers, for example SALT or SWAN, or even the old rhetorical structure theory.

I would suggest the following paper as a nice entry point into the literature for these comparisons.
de Waard, A. (2010b). From Proteins to Fairytales: Directions in Semantic Publishing. IEEE Intelligent Systems 25(2): 83-88 (2010)

The related work is OK, in particular as it pertains to biomedical publishing.

The paper also does not state whether the model is only for the biomedical area or whether it applies more generally. There is a general impression that the authors want to say it is general; however, it relies on the biomedical domain for its support. I think this should be a bit clearer.

Furthermore, Figure 2, which summarizes the model, does not provide details of the notation, which I think are critical to understanding the interrelationships of the model.

I think these contrasts must be made for this to be a clear contribution.

2) Journal submission system

This looks like an interesting contribution as it combines a knowledge capture interface with assistance through natural language processing. However, the algorithm details are not given for the system that automatically identifies the parts of the scientific article. This is necessary to fully understand the approach. Furthermore, no user evaluation was done to judge the efficacy of the system.

3) Detecting scientific discoveries
This is an interesting result showing that articles which do not fit into the standard organization of science (i.e. via its agreed upon ontology) are more likely to be precursors to major discoveries. This could be a paper on its own. I would like to see more information about the mapping procedure followed by the experts in the group. I also wondered whether this could be done automatically over a larger corpus. It would be interesting to discuss the ramifications of the inability to capture this sort of new discovery within ontologies. What does this mean for the systematization of knowledge in computer understandable forms and its use for knowledge discovery? This wider context is necessary to place this potential contribution.

Overall, I would suggest the authors pick a focus for the article. Additionally, I think they should ensure that the biomedical focus is clear.

Some minor notes:
- In the introduction, there is no transition between the discussion of the problems of text (2nd paragraph) and the discussion of bibliographic metadata.
- In the introduction, you say that bibliographic metadata is incapable of exploiting semantic information. Surely it is capable; it just doesn't currently use it.
- In the introduction, the discussion of the history of scientific articles seems out of place. Maybe it should be merged with paragraph 2.
- The paragraph starting section 2 needs to be rewritten into smaller sentences.
- The definition of ontology you cite for Ying Ding is actually from Tom Gruber.
- Maybe the discussion of Kuhn could be earlier in the related work.
- Figure 3, showing the RDF of your format, should just be the RDF and not a screenshot. The SWJ community can more easily read plain RDF, especially if it is in Turtle format.

Review 2 by Alexander García-Castro

Title: "A semantic model for scholarly electronic publishing in Biomedical Sciences"
From the Journal: "Full papers – containing original research results. Results previously published at conferences or workshops may be submitted as extended versions. These submissions will be reviewed along the usual dimensions for research contributions which include originality, significance of the results, and quality of writing."

The paper presents an original approach to the problem of delivering semantic representations for scientific documents –in this specific case research articles in biomedical sciences. The authors present an interesting, feasible and realistic method that combines automatic and semi-automatic metadata enrichment strategies. Papers are written for humans to read and consume; however, within the current landscape in life sciences researchers can hardly cope with the information overload. It is therefore important for us to enable machine readable documents; such an approach should make it possible for machines to identify relevant claims in the text as well as to reason over the text. The authors propose a metadata enrichment mechanism that is implementable. Although the research presented by the authors is relevant and interesting, the text does not make it easy for the reader to appreciate the importance of the work. The English needs major improvements; some reorganization is also necessary.

Accept Pending Minor Revisions

The approach presented by the authors is original. It is simple and implementable. Interestingly, this work is based upon the careful review of a number of papers –as described in "Materials and Methods". From the review this paper had throughout the SePublica editorial process: "The combination of the manual and automatic approach takes advantage of the best of both worlds. While the manual creation of semantic annotation warrants highest information quality, the automatic annotation is much faster. Since the problem of the ambiguity of natural language is not solved yet (and might be impossible to solve), human interaction in the semantic annotation process will stay necessary for an unpredictable period of time. Therefore the combination is a practical solution."

Significance of the results
The authors present interesting results. Very significant and useful.

A summary of the 89 papers analyzed by the authors is needed. Such a summary could be a table. The existing text could expand on the table. As the analyzed documents pertain to a specific domain in biosciences, the authors should address the issue of scalability: how scalable is their approach to other domains in biomedical sciences?

The results should also be summarized in a table. Furthermore, the authors have done interesting work analyzing the rhetorical structure; how could such a structure be aligned against previously presented structures? Also, the authors should better present their structure; this is a key result from their research.

Quality of writing:
The authors should present the reader with the problem at hand, the research statement, earlier in the manuscript. The actual problem is not presented until the second page, 9th paragraph. What their approach is, what problem they are studying, and what the contribution of the paper is should all be presented in a straightforward manner earlier in the document.

The "Introduction" is too wordy. It should be shortened to one full page.

"Related Work". The authors should focus this section on previous work related to the problem at hand. The way this section is written, it goes from ontologies, and opinions around biomedical ontologies, to the formulation of hypotheses –briefly. The authors should better focus this section. How is this related work making the case for the novelty in their approach?

"Materials and Methods". This is an interesting and useful section. It is written in a very bio style, which is good because it is rigorous in the description. Also, this section makes it easier for the reader to "imagine" how the experimental part presented in this paper could be replicated. However, the English needs to be improved.

"Results and Discussion"
This is an interesting section. However, it is difficult to read; it needs some reorganization. The English here needs improvements.

Again, the English needs to be improved. Also, the authors should focus this section on the actual contribution, avoiding wordiness. This should be related to the problem at hand declared in the introduction (which also needs some work). The reader should be able to "see" how the work addresses the research question and, more importantly, how this work makes a difference. This section should be more focused.

Issues to be addressed:

Authors: hereafter abbreviated as "SN"

Reviewer: putting the abbreviation in parentheses after the term is enough; there is no need for "hereafter abbreviated as 'SN'". For instance, in this case it should be "Semantic Network (SN)".

Authors: "unified medical language system"
Reviewer: use caps, Unified…System.
Authors: "Although this semantically richer schema is supported by the UMLS, the bibliographic record models in databases such as Medline are incapable of exploiting this potential."
Reviewer: such a strong claim should be illustrated. The authors should present an example, a real life one.
Authors: "and RDF Schema3 statements"
Reviewer: the authors probably mean RDF schemata. I don't understand the "statements" part.
Authors: "Through scientific articles"
Reviewer: the authors should consider rephrasing, "Through" is not the right word here.
Authors: "Electronic-Web-published"
Reviewer: the authors probably mean "web based publication of scientific articles". Please review/rewrite.
Authors: "humanity body of knowledge."
Reviewer: Please review/rewrite.
Authors: "Several alternatives have already been proposed as new types". Second page.
Reviewer: This paragraph could be better placed as part of the second section.
Authors: "From an ontological point of view, scientific articles are a- documents"
Reviewer: Please review/rewrite.
Authors: "The focus of the proposed model is the second aspect, i.e., the reasoning/rhetorical, and the semantic structure of the scientific articles in Biomedical Sciences."
Reviewer: reasoning and rhetorical? The use of the / implies that both concepts are interchangeable; this is hardly the case because these are concepts that are not intrinsically related to each other. Please review/rewrite.
Authors: "In the literature [29] the term biomedical ontology is a slight imprecise concept naming biomedical concept systems ranging from terminologies used to index scientific literature to highly formal computational ontologies such as OpenGALEN4."
Reviewer: this is quite true.
Authors: "We have been working for years [10] on the development of a"
Reviewer: Please review/rewrite.
Authors: "richer content surrogate"
Reviewer: why "surrogate"?
Authors: "semantic content model"
Reviewer: I find this problematic, the authors are not presenting any schemata, nor are they presenting an ontology modeling a specific section of a paper. The authors should be more specific in presenting the limitation of their model.
Authors: "Scientific claims made by authors in their papers are represented as relations between two different phenomena or between a phenomenon and its characteristics"
Reviewer: how are these "relationships" being modeled?

Review 3 by Tim Clark

Marcondes et al.

The authors propose a semantic model for publishing scholarly electronic articles and present a prototype user interface which authors would be required to use to submit their articles to journals.

The introductory and background material is not particularly well-done or well-reasoned and the referencing does not suggest a very deep familiarity with the subject. Examples:

- "Before the advent of the World Wide Web…man's scientific knowledge was fuzzy…"

- "Relations between concepts are the core of meaning."

This is an idealist theory of meaning which requires at least some citations and a defense. A materialist might say that the relationship between propositions / concepts and their physical referents is at the core of meaning. When we ask what something "means", are we asking for its related concepts, or do we want to know what it represents in the physical world? Regardless of one's answer to this question, the authors should at least present some backing for their view and recognize that it *is* a view.

- "Since the Actas of the Royal Society in the seventeenth century, …"

The Royal Society did not publish anything called an Acta in the seventeenth century. Their foundational publications were two books, Hooke's _Micrographia_, and John Evelyn's _Sylva_ in 1662; followed in 1665 by the _Philosophical Transactions_, which was the sole periodical publication of the Society until the nineteenth century when the predecessor of the Proceedings was initiated.

- "We also hypothesize that there is a correlation between the articles content and the fact that these articles report scientific discoveries."

- "we propose to engage authors in developing a richer content representation…"

This should cite A de Waard 2007, De Waard & Kircz 2008, and Ciccarese et al. 2008, at a minimum, and probably several of the earlier papers referenced by De Waard & Kircz, as the present authors are certainly not the first to attempt what they are reporting on in this paper.

- "Biomedical terminologies are evolving toward knowledge bases as they are becoming formal."

The authors need to cite some adequate definition of a knowledge base from the mainstream literature in this field.

- "Other important aspects of ontologies to Science are outlined by Yin Ding: Ontology is defined as a formal explicit specification of a shared conceptualization."

The definition cited and attributed to Ying Ding was originated by Thomas Gruber (Gruber 1995).

- "Thomas Kuhn his one of the most prominent authors in Philosophy of Science…"

Besides the fact that Kuhn is deceased and the authors should use the past tense in this sentence, it just seems lazy to throw Kuhn about in this way. Kuhn is no doubt the most well-known author in Philosophy of Science to non-philosophers, and though I am not a specialist, even I know that he cannot be considered representative or definitive of thinking in that domain.

And Kuhn does seem to be just thrown in there.

The authors go on to mention Kuhn's idea of a "pre-paradigmatic stage" in science where there is a lack of precise and agreed terminology. But that is clearly not applicable to biomedical research. Lacking a complete formal semantic specification for the content of research articles is very far from being "pre-paradigmatic" in Kuhn's formulation.

The authors then present a graphical model of knowledge representation in articles, MKA, and claim that it was then represented in RDF. But they do not present or make accessible any ontology or RDF schema for MKA.

They present a prototype application by way of a series of screenshots, with no discussion of implementation whatsoever. We do not know what language, framework or technology stack was employed, nor are we told what issues may have arisen in development or what lessons were learned, if any. The authors need to present some way for readers to access the code, with licensing requirements.

I consider that the authors have some interesting ideas but they have not done their homework. Because these ideas are of interest, I would encourage them to undertake a thorough rewrite based on some further study, and to resubmit after addressing the points indicated.