Publishing DisGeNET as Nanopublications

Tracking #: 879-2089

Núria Queralt-Rosinach
Tobias Kuhn
Christine Chichester
Michel Dumontier
Ferran Sanz
Laura I. Furlong

Responsible editor: 
Boyan Brodaric

Submission type: 
Dataset Description
The increasing and unprecedented publication rate in the biomedical field is a major bottleneck for discovery in Life Sciences. The scientific community cannot process assertions from biomedical publications and integrate them into the current knowledge at the same rate. The automatic extraction of assertions about entities and their relationships by text-mining the scientific literature is an extended approach to structure up-to-date knowledge. For knowledge integration, the publication of assertions in the Semantic Web is gaining adoption, but it opens new challenges regarding the tracking of the provenance, and how to ensure versioned data linking. Nanopublications are a new way of publishing structured data that consists of an assertion along with its provenance. Trusty URIs is a novel approach to make resources in the Web immutable, and to ensure the unambiguity of the data linking in the (semantic) Web. We present the publication of DisGeNET nanopublications as a new Linked Dataset implemented in combination of the Trusty URIs approach. DisGeNET is a database of human gene-disease associations from expert-curated databases and text-mining the scientific literature. With a series of illustrative queries we demonstrate its utility.

Minor revision

Solicited Reviews:
Review #1
By Amrapali Zaveri submitted on 11/Nov/2014
Minor Revision
Review Comment:

The article “Publishing DisGeNET as Nanopublications” describes the publication of DisGeNET nanopublications as a new Linked Dataset implemented in combination with the Trusty URIs approach. DisGeNET contains information on human gene-disease associations from expert-curated databases and from text-mining the scientific literature.

The dataset has been published by reusing a rich set of vocabularies and is interlinked with several other datasets, thus complying with the Linked Data principles. Other relevant information about the dataset has also been satisfactorily described. Thus, I recommend accepting the paper. However, I have a few queries/suggestions:
- Add a bit more information on the advantages of having a dataset in the form of nanopublications.
- According to Table 1, only 4% of the assertions are curated, which makes it a rather low-quality dataset. How do you plan to increase this? How accurate is this curation?
- Do you assess the accuracy/quality of the predicted and literature-extracted data?
- What is the GDA concept?
- How did you perform the interlinking? Did you assess the quality of the interlinks - accuracy and completeness?
- Figure 1 is illegible, please increase the font size.
- There is hardly any related work discussed.
- Did you come across any challenges during the conversion to nanopublications?
- How do you plan to update and maintain the dataset?
- There is not much evidence of third-party usage of this dataset.

The paper is well written; however, I encountered some formal errors:
1. Introduction
- IBI - provide full-form
- 7 - seven
2.1.1. GDA Content
- CUI - provide full-form
- de-referenceable - dereferenceable (also in 3.1)
- DisGeNET Nanopublication Dataset
- 4 - four
3.1 Ontologies
- Even though, the modeling - Even though the modeling
3.2 Schema
- e.g. examples - repetition
3.3. Metrics, Versioning, Licensing
- TriG syntax - provide reference
4.3. Linking with other LOD Resources
- Since in DisGeNET RDF is also represented the relation between gene and the protein/s that encodes, - please rephrase
6. Applications
- Open PHACTS Discovery platform - provide reference
- As a side note, I think section 3.3 and section 5 could be merged or put under one section.
- Also, I would prefer that you add the link to directly rather than link to the paper and have the reader look up the link there.

Review #2
By Eleni Mina submitted on 05/Dec/2014
Minor Revision
Review Comment:

The paper "Publishing DisGeNET as Nanopublications" presents the release of a very valuable source, gene-disease associations, using the nanopublication model and Trusty URIs. This paper is also indicative of the natural evolution in publishing information in science: adopting Semantic Web standards for publishing information together with provenance metadata and cryptographic hash values in the URIs. The paper is well written, well structured, and well motivated. The potential of such an effort is apparent, and I find it an excellent effort towards knowledge discovery. I definitely accept this paper, but I do have some minor comments that need to be addressed by the authors.

Minor revision comments
1. Virtuoso could be configured more appropriately so that it returns a message for the errors that result from a SPARQL query.

2. It is not very clear to me what exactly is already in nanopublication format and what is not. For example, substituting the disease ID in section 4.1 with the Huntington's disease ID, C0020179, does not return any hits. What percentage of the RDF data source has already been transformed into nanopublications?

3. I would personally find it very helpful to include a picture of the nanopublication schema. This would greatly help the reader to understand the model, and it would also save a lot of time and effort when performing SPARQL queries over this dataset.

4. It is not very clear to me (and maybe this also relates to the previous comment about the figure) how to retrieve, with the current model, assertions about the same GDA (e.g. geneX is associated with diseaseY) that have different types of evidence, e.g. literature and predicted.
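
To make comment 4 concrete, the following is a minimal Python sketch of the kind of retrieval being asked about. It is not the DisGeNET schema or an actual SPARQL query: the record layout, the helper names, and every identifier (np1, geneX, diseaseY, and so on) are hypothetical placeholders chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical records: (nanopublication id, gene, disease, evidence type).
# All identifiers are made-up placeholders, not real DisGeNET data.
records = [
    ("np1", "geneX", "diseaseY", "literature"),
    ("np2", "geneX", "diseaseY", "predicted"),
    ("np3", "geneZ", "diseaseY", "curated"),
]


def group_by_gda(rows):
    """Group nanopublication ids by (gene, disease) pair, then by evidence type."""
    grouped = defaultdict(lambda: defaultdict(list))
    for np_id, gene, disease, evidence in rows:
        grouped[(gene, disease)][evidence].append(np_id)
    return grouped


def gdas_with_mixed_evidence(rows):
    """Keep only the gene-disease pairs supported by more than one evidence type."""
    return {
        gda: dict(by_evidence)
        for gda, by_evidence in group_by_gda(rows).items()
        if len(by_evidence) > 1
    }


mixed = gdas_with_mixed_evidence(records)
for gda, by_evidence in mixed.items():
    print(gda, by_evidence)
```

In this sketch the same gene-disease pair is the grouping key and the evidence type is a secondary key; an equivalent query over the published dataset would depend on how the nanopublication schema links assertions to their evidence, which is what the comment asks the authors to clarify.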



The link to see the nanopub queries in the paper is DisGeNET nanopub queries.


Dear all,

Apologies: the hyperlink to the full nanopublication query example is missing from the submitted manuscript. Please follow this link to access it:

Kind regards,