Towards a Linked Open Dataset for Scholarly Publishing: Semantic Lancet Project

Tracking #: 744-1954

Authors: 
Andrea Bagnacani
Paolo Ciancarini
Angelo Di Iorio
Andrea Giovanni Nuzzolese
Silvio Peroni
Fabio Vitali

Responsible editor: 
Guest Editors EKAW 2014: Schlobach, Janowicz

Submission type: 
Conference Style
Abstract: 
There is an ever-increasing interest in publishing Linked Open Datasets about scientific papers. The current landscape is very fragmented: some projects focus on bibliographic data, others on authorship data, others on citations, and so on. The quality is also heterogeneous, and the production and maintenance of such datasets is difficult and time-consuming. In this paper we introduce the Semantic Lancet Project, whose goal is to make available rich semantic data about scholarly publications and to provide users with sophisticated services on top of those data. We developed a chain of tools that produce high-quality data from multiple sources. It has been successfully used to produce a rich and freely available LOD dataset, which is also described here.
Tags: 
Reviewed

Decision/Status: 
[EKAW] reject

Solicited Reviews:
Review #1
Anonymous submitted on 24/Aug/2014
Suggestion:
[EKAW] reject
Review Comment:


Overall evaluation: -1 (weak reject)

Reviewer's confidence: 4 (high)

Interest to the Knowledge Engineering and Knowledge Management Community: 4 (good)

Novelty: 3 (fair)

Technical quality: 4 (good)

Evaluation: 2 (poor)

Clarity and presentation: 5 (excellent)

Review

This paper describes the Semantic Lancet system, a system for constructing an integrated and enriched set of linked data for academic papers. The idea for such a system is not new, and the authors do a good job of providing a clear overview of projects in the space of triplifying scholarly information and where they are lacking. Based on this overview, the paper defines six attributes for making better linked data in this space. The authors describe how their system implements most of these attributes (unfortunately, provenance is still missing). Two applications based on the integrated datasets are presented. The paper is readable and the applications are compelling.

One of the novelties here is that the system works directly with a publisher's API in order to access this data. This is something that I believe is novel, as most other approaches use either open-access corpora (PubMed) or third-party indexes (DBLP).

Another positive of the paper is how the authors marry all the various components in this space together, from NLP tools like FRED to the rich set of SPAR ontologies, to build a cohesive system.

The major issue with the paper is the lack of an evaluation. The paper is a systems/application report, but here the requirements from both SWJ and EKAW include a demonstration of use of the application or some demonstration of impact. Unfortunately, the paper doesn't provide this evidence.

Overall, I would really like to recommend that this paper be accepted, but unfortunately it's just a bit early given the guidelines of these venues. I would suggest the authors focus on the validation of their system. (If this were a journal paper, I would have gone for major revisions, as I think the authors could provide some of this validation.) One approach could be to work directly with researchers in the social sciences to demonstrate applicability.

Minor comments
- “a rich and freely available” - remove the “a”
- “there is an ever increasing interest in publishing Linked Open Datasets about scientific papers.” - is there a citation/evidence to back this up?
- In the discussion of bibliometrics and research evaluation it would be good to cite some of the extensive literature from the research policy and scientometrics community, e.g. [1] [2].

[1] Borgman, Christine L., and Jonathan Furner. "Scholarly Communication and Bibliometrics." Annual Review of Information Science and Technology (ARIST) 36 (2002): 3-72.
[2] Smith, Derek R. "Impact factors, scientometrics and the history of citation-based research." Scientometrics 92.2 (2012): 419-427.

Review #2
Anonymous submitted on 25/Aug/2014
Suggestion:
[EKAW] reject
Review Comment:

Overall evaluation: -1 (weak reject)

Reviewer's confidence: 3 (medium)

Interest to the Knowledge Engineering and Knowledge Management Community: 5 (excellent)

Novelty: 3 (fair)

Technical quality: 3 (good)

Evaluation: 3 (good)

Clarity and presentation: 4 (good)

Review

The paper presents the Semantic Lancet Project. The main goal of the project
is to make available rich semantic data about scholarly publications and to
provide intelligent services on top of those data. The project developed
several tools to produce high-quality data from multiple sources.

1) Introduction

The introduction introduces the work context, that is to say the knowledge
management of scholarly products which is an emerging research area. At the
end of the introduction, the contributions and the plan of the paper are
presented.

2) Scholarly publishing and LOD

First of all, the paragraph describes and analyzes the different types of
data available and some of their features: bibliographic data, contributors,
citations, affiliations, classification and abstract.
Secondly, several Linked Open Data projects for scholarly publications are
studied (types of data, ontologies used, data size, etc.): NPG Linked Data
Platform, JISC OpenCitation Corpus, BioTea, DBLP++, ACM, Semantic Web Dog
Food, single journals, etc.

At the end, the authors highlight some shortcomings of these projects: not
all projects cover all aspects with the same precision and completeness, the
integration of multiple datasets is incomplete, there are ambiguities and
redundant information, and an automatic process for production and updating
is missing.

3) Towards better datasets

To improve the current situation, some issues are presented and discussed.
i) Data Diversity: most of the datasets (types, structures, methodologies and
assessments) depend on the discipline. It would be better to identify
patterns in literature across disciplines to enhance reusability and
associations between datasets.

ii) Data Richness: the authors propose seven levels of characterization of
the types of data to ensure richness, among which the context of a paper
(institutions involved, sources of funding, etc.), its structural components
(sections, blocks, tabular data, etc.) and its rhetorical structures.

iii) Data Correctness: the correctness, quality and interconnectedness of
the data result from the integration of the different data sources. It is
important to prevent redundant information, ambiguities, etc.

iv) Provenance information: it is important to record everything about the
origin of data items and the transformations they have undergone. It is a
matter of metadata about metadata.

v) Time-awareness: as data change over time, it is necessary to keep track
of the different changes.

vi) Ease of update and enrichment

The two previous paragraphs establish the state of the art and a set of
requirements (issues) for designing, managing and providing Linked Open Data
for scholarly publishing. The requirements seem to be well-founded and will
be useful for the community.

To some extent, this first part of the paper is rather at the conceptual
level. The rest of the paper is rather practical and focuses on the Semantic
Lancet Project and its current state of development.

4) Semantic Lancet Project

The Semantic Lancet project has two main goals: i) produce proper RDF data
compliant with the Semantic Publishing and Referencing (SPAR) ontologies
(which are only briefly introduced); ii) provide a huge and rich RDF
triplestore (with a SPARQL endpoint) and a series of services built upon the
RDF dataset.

The paragraph is composed of three sub-paragraphs, describing the data
engineering, the semantic enhancement and the services.

Data Engineering: the raw data from Scopus and ScienceDirect are translated
into RDF. The process, its technical features and the types of data
available are described in detail in this paragraph. The data model of all
journal articles is described and presented in a figure (which is a little
bit small).
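As a rough illustration of this reengineering step (the paper's actual scripts are not shown, so every IRI, field name and the sample record below are invented assumptions), one raw bibliographic record could be mapped to SPAR-style triples like this:

```python
# A minimal, dependency-free sketch of the data reengineering step: a raw
# bibliographic record (a dict standing in for a Scopus/ScienceDirect
# response) is mapped to SPAR-style (subject, predicate, object) triples.
# All IRIs and field names are illustrative, not the project's own.

FABIO = "http://purl.org/spar/fabio/"    # FaBiO, part of the SPAR suite
DCTERMS = "http://purl.org/dc/terms/"
PRISM = "http://prismstandard.org/namespaces/basic/2.0/"
BASE = "http://example.org/resource/"    # hypothetical base IRI

def record_to_triples(record):
    """Translate one raw record into a list of RDF-like triples."""
    s = BASE + record["id"]
    return [
        (s, "rdf:type", FABIO + "JournalArticle"),
        (s, DCTERMS + "title", record["title"]),
        (s, PRISM + "doi", record["doi"]),
    ]

# A fabricated sample record standing in for an API response:
sample = {"id": "article-1", "title": "An example paper", "doi": "10.1000/xyz123"}
print(len(record_to_triples(sample)))  # 3
```

In a real pipeline the same mapping would of course cover authors, citations and the other data types the paper discusses, and emit proper RDF terms rather than plain strings.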

Semantic Enhancement: a module for generating semantic abstracts is
implemented; it relies on FRED (a tool that implements deep machine reading
methods based on Discourse Representation Theory, linguistic frames and
ontology design patterns for deriving a logic representation, expressed as
OWL and Linked Data, from natural language). An example of the semantic
enhancement obtained with FRED from a sentence is described and presented in
a figure (which is also a little bit small). The different semantic features
are presented with their associated vocabularies and their LOD repositories
(DBpedia, for instance).
Disambiguation of authors and citation typing are ongoing.

The proposed semantic enhancement should enable the authors to provide
interesting services (cf. the abstract finder).

Services: two services are described, a data browser and an abstract finder.

5) Conclusions

In the conclusion, the authors introduce the time-awareness addressed by the
SPAR ontologies: the FRBR layered model and the PRO ontology. But how? It
would be interesting to describe that in more detail in the paper (cf. the
questions for paragraph 4).

Questions (for the entire paragraph 4):

Why is SPAR used? There is no justification. What are the relationships
between SPAR and the vocabularies presented in the state of the art? Is it
easy to link these vocabularies with the ones used in the other projects?

But maybe the Semantic Lancet project does not aim to ensure
interconnectedness with the previous projects?

How does the Semantic Lancet project address the shortcomings of the current
projects, and to what extent does it fulfil the requirements?

Some elements are present in the paper, but not systematically. The six
issues (or requirements) presented earlier should be discussed/analyzed at
the end of paragraph 4. As it stands, it is difficult to understand how and
to what extent the project (or its current state) fulfils the requirements.

In other words, the paper seems to be twofold, consisting of two different
parts which are not explicitly linked. Nevertheless, each part is well
written and easy to understand.

Review #3
Anonymous submitted on 02/Sep/2014
Suggestion:
[EKAW] conference only accept
Review Comment:

Overall evaluation: -2 (reject)

Reviewer's confidence: 3 (medium)

Interest to the Knowledge Engineering and Knowledge Management Community: 3 (fair)

Novelty: 2 (poor)

Technical quality: 3 (fair)

Evaluation: 1 (not present)

Clarity and presentation: 4 (good)

Review
This paper (1) gives an overview of existing datasets on scholarly publishing,
(2) enumerates a couple of issues related to data selection and integration,
and (3) explains some of the effort that has gone into building and disseminating
a dataset on scholarly publishing.

Even though I occasionally had problems understanding ungrammatical
sentences (see below), the overall reading experience of this paper was
quite good. The structure of the paper is fine, with the exception of future
work popping up in unexpected spots a couple of times (see below).
The main thing that stuck with me after reading this paper is that the
authors have done a ton of stuff, most of which is probably quite useful to
the community.
However, I do not believe the work effort has been sufficiently disseminated
in the present article.
I will explain why I believe the current article falls short
on each of the three aforementioned contributions.
I will also make a few generic remarks on wording and grammar.

Assessing existing datasets
===========================

The first contribution of the paper is that it gives an overview of
existing datasets in scholarly publishing. The datasets are compared
with respect to their structuredness, precision and quality.
However, none of these concepts is defined, let alone operationalized.
This results in a dataset overview that seems to depend largely on
the opinions of the authors.
More work should be done in order to make the assessment of
existing datasets more objective, e.g. by providing criteria a dataset should
meet in order to be called high-quality.

Issues of data selection
========================

The second contribution of the paper is that it identifies
"issues that [...] should characterize the selection process of data".
Firstly, I observe that the "selection process" is only a part
(although a very relevant one) of the process of
constructing and disseminating a dataset on scholarly publishing.
It is unclear to me why only this component is described in such detail
when none of the others are.
Secondly, although the "issues" the authors enumerate are all valid,
they are not novel, as they are all commonly known by researchers in the field,
and for some of the issues approaches exist that try to (partially)
mitigate them.
Thirdly, apart from not being novel, none of the issues identified are specific
to the construction and dissemination of data on scholarly publishing.
These issues are so generic that they would apply to
the construction and dissemination of any dataset that is based on
multiple raw data sources.
The generality of these issues is not inherently a bad thing,
but it is not something one would expect to be part of
a paper that focuses specifically on scholarly publishing.
Phrasing this last point in a different way,
the authors do not identify issues that are specific to
data on scholarly publishing, which is what the paper purports to be about.
Fourthly, it is unclear whether or not the list is exhaustive.

Semantic Lancet
===============

The third contribution of the paper is that it explains
how data on scholarly publishing is created and disseminated
within the Semantic Lancet project.
I consider this to be the main contribution of the paper,
and would suggest to expand on this third contribution a bit more,
e.g. by leaving out the second contribution entirely.
This part of the paper enumerates the components that have been built or reused and the ways in which they interact, as well as components that have
not yet been built or reused but are part of the envisioned framework.
For some of the components that have been built or reused,
problems are identified.

Motivations
-----------

The paper does not offer motivations for any of the implementation
choices that have been made.
E.g., we do not know why specific vocabularies (such as SPAR)
were chosen over others.
In the absence of reasons, most implementation decisions seem quite arbitrary.
As a consequence, most decisions are completely uninteresting to me,
unless I'm being told what motivated the authors to make them:
- The results of the three main data construction tasks
are stored into separate named graphs.
- Fuseki is used, Virtuoso is considered.
- Scripts are written in Python.
- 367 articles were converted to 80920 triples.
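The first bullet above (separate named graphs per construction task) can at least be made concrete with a sketch. The graph IRI, endpoint and data below are invented for illustration; this is not the project's actual code:

```python
# Sketch: keeping the output of each data-construction task in its own
# named graph via a SPARQL 1.1 Update string, which could then be POSTed
# to a Fuseki update endpoint (e.g. http://localhost:3030/ds/update with
# content type application/sparql-update). All IRIs are hypothetical.

def insert_data_update(graph_iri, triples):
    """Build an 'INSERT DATA { GRAPH <g> { ... } }' update string."""
    body = "\n".join(f"<{s}> <{p}> {o} ." for s, p, o in triples)
    return f"INSERT DATA {{ GRAPH <{graph_iri}> {{\n{body}\n}} }}"

update = insert_data_update(
    "http://example.org/graph/bibliographic",   # hypothetical graph IRI
    [("http://example.org/resource/article-1",
      "http://purl.org/dc/terms/title",
      '"An example paper"')],
)
print(update.splitlines()[0])
# INSERT DATA { GRAPH <http://example.org/graph/bibliographic> {
```

Separate graphs for, say, bibliographic data, semantic abstracts and citations would let each task's output be replaced independently, which may be what motivated the authors' choice; the paper itself does not say.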

Data reengineering
------------------

For the data reengineering scripts, the authors claim that existing datasets
are incorrect or incomplete. However, it is not clear what techniques are
used in their scripts in order to resolve these incorrectness and incompleteness
issues. It is also unclear how they would ascertain that a given dataset
is (more or less) incorrect/incomplete.

Semantic enhancement
--------------------

The semantic enhancement section gives a lengthy description of an example
which results from the use of FRED, which is a reused tool.
It also describes the problem of disambiguating authors,
which seems to be quite similar to the generic problem of resource disambiguation.
Citation typing is mentioned as future work.
So, in conclusion, section 4.2 does not introduce new research results.
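To illustrate why the generic resource-disambiguation framing fits: a common (and knowingly naive) baseline for author disambiguation blocks name variants on a normalized key. This sketch is emphatically not the paper's method, just a standard starting point:

```python
# A naive baseline for the generic author-disambiguation problem the
# review alludes to: group name variants by a normalized key of
# (surname, first initial). NOT the paper's method, just an illustration.

def name_key(name):
    """Normalize 'J. Smith' / 'John Smith' / 'Smith, John' to one key."""
    if "," in name:                      # 'Surname, Given' format
        surname, given = name.split(",", 1)
    else:                                # 'Given Surname' format
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    initial = given.strip()[:1].lower()
    return (surname.strip().lower(), initial)

authors = ["John Smith", "J. Smith", "Smith, John", "Jane Doe"]
groups = {}
for a in authors:
    groups.setdefault(name_key(a), []).append(a)
print(len(groups))  # 2: the three Smith variants collapse, Doe stays apart
```

Real systems add evidence such as affiliations and co-author networks, since a key like this cannot separate two distinct "J. Smith"s.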

Services
--------

The data browser is claimed to be user-friendly, but this is not evaluated.
The use of higher-level views is quite interesting,
but is not explained in great detail and is not compared to existing approaches.
E.g., I was wondering whether the authors' approach
is specific to the dataset they use or can be applied to different datasets as well?
Some of the other services result from the use of FRED or are again future work.

Future work
-----------

As a minor point, section 4 mixes things that have been built/reused
with things the authors would like to build/reuse in the future.
The latter would be much better positioned in a "Future work" section,
since now the reader has to be careful to distinguish what has already been done
from what is intended to be done.
E.g., the last paragraph of 4.2 and the last paragraph of 4.3.
Also, section 5 is mostly about future work.

Conclusion
==========

The paper draws almost no conclusions, but enumerates future work.
The only conclusion that I can find is that "The integration of multiple
sources, cross-checked and merged together,
increases the correctness of the dataset."
But this conclusion has not been shown in the paper.
The notion of correctness was not defined.
The methods of cross-checking and merging are not properly described,
and are surely not reproducible.
No evaluation has been conducted, e.g. comparing the correctness of
an integrated with a non-integrated dataset.

Wording
=======

Some of the wording seems to make little sense to me.
E.g., what is a "rich RDF triplestore" [p8]?
A special category of nonsensical wording is the use of
definable or quantifiable terms in an entirely subjective way.
E.g. "huge triplestore", "rich data", "high quality", "make sense of data".
Since none of these terms are defined or quantified, using them
makes little sense.

Grammar/spelling
================

The paper contains many grammatical and spelling errors,
most of which are easily unriddled.
In some cases, though, grammatical errors result in unclear sentences.
E.g., I have no idea what the following means: "[...] shaping the data structure of the LOD along one specific discipline to the detriment of all the others means that such discipline will be most probably the only one to which such LOD can be associated with." [p6]

As a closing remark, I believe the authors have done a lot of cool stuff,
but are now trying to cram everything they did into a single paper which lacks focus.