Review Comment:
Overall evaluation
Select your choice from the options below and write its number below.
== 3 strong accept
== 2 accept
== 1 weak accept
== 0 borderline paper
== -1 weak reject
== -2 reject
== -3 strong reject
-2
Reviewer's confidence
Select your choice from the options below and write its number below.
== 5 (expert)
== 4 (high)
== 3 (medium)
== 2 (low)
== 1 (none)
3
Interest to the Knowledge Engineering and Knowledge Management Community
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
3
Novelty
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
2
Technical quality
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
3
Evaluation
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 not present
1
Clarity and presentation
Select your choice from the options below and write its number below.
== 5 excellent
== 4 good
== 3 fair
== 2 poor
== 1 very poor
4
Review
This paper (1) gives an overview of existing datasets on scholarly publishing,
(2) enumerates a couple of issues related to data selection and integration,
and (3) explains some of the effort that has gone into building and disseminating
a dataset on scholarly publishing.
Even though I occasionally had trouble understanding
ungrammatical sentences (see below),
the overall reading experience of this paper was quite good.
The structure of the paper is fine, with the exception of future work
popping up in unexpected spots a couple of times (see below).
The main thing that stuck with me after reading this paper
is that the authors have done a great deal of work,
most of which is probably quite useful to the community.
However, I do not believe this effort has been sufficiently
disseminated in the present article.
I will explain why I believe the current article falls short
on each of the three aforementioned contributions.
I will also make a few generic remarks on wording and grammar.
Assessing existing datasets
===========================
The first contribution of the paper is that it gives an overview of
existing datasets in scholarly publishing. The datasets are compared
with respect to their structuredness, precision and quality.
However, none of these concepts are defined, let alone operationalized.
This results in a dataset overview that seems to depend largely on
the opinions of the authors.
More work should be done in order to make the assessment of
existing datasets more objective, e.g. by providing criteria a dataset should
meet in order to be called high-quality.
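As an illustration of what such an operationalization could look like,
a single quality dimension such as completeness can be made checkable
in a few lines. The field names and the threshold below are purely
hypothetical, not taken from the paper:

    # Hypothetical sketch: operationalizing one quality criterion
    # (completeness) as the fraction of records that carry all
    # mandatory metadata fields.
    MANDATORY_FIELDS = {"title", "authors", "year", "doi"}  # assumed field set

    def completeness(records):
        """Return the share of records where every mandatory field is non-empty."""
        if not records:
            return 0.0
        complete = sum(1 for r in records
                       if all(r.get(f) for f in MANDATORY_FIELDS))
        return complete / len(records)

A dataset could then be called high-quality with respect to completeness
only if, say, completeness(records) >= 0.95: an explicit, reproducible
threshold instead of an opinion.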
Issues of data selection
========================
The second contribution of the paper is that it identifies
"issues that [...] should characterize the selection process of data".
Firstly, I observe that the "selection process" is only a part
(although a very relevant one) of the process of
constructing and disseminating a dataset on scholarly publishing.
It is unclear to me why only this component is described in such detail
when none of the others are.
Secondly, although the "issues" the authors enumerate are all valid,
they are not novel, as they are all commonly known by researchers in the field,
and for some of the issues, approaches already exist that (partially)
mitigate them.
Thirdly, apart from not being novel, none of the issues identified are specific
to the construction and dissemination of data on scholarly publishing.
These issues are so generic that they would apply to
the construction and dissemination of any dataset that is based on
multiple raw data sources.
The generality of these issues is not inherently a bad thing,
but it is not something one would expect to be part of
a paper that focuses specifically on scholarly publishing.
To phrase this last point differently:
the authors do not identify issues that are specific to
data on scholarly publishing, which is what the paper purports to be about.
Fourthly, it is unclear whether or not the list is exhaustive.
Semantic Lancet
===============
The third contribution of the paper is that it explains
how data on scholarly publishing is created and disseminated
within the Semantic Lancet project.
I consider this to be the main contribution of the paper,
and would suggest expanding on it,
e.g. by leaving out the second contribution entirely.
This part of the paper enumerates the components that have been built
or reused and the ways in which they interact, as well as components
that have not yet been built or reused but are part of the envisioned
framework.
For some of the components that have been built or reused,
problems are identified.
Motivations
-----------
The paper does not offer motivations for any of the implementation
choices that have been made.
E.g., we do not know why specific vocabularies (such as SPAR)
were chosen over others.
In the absence of reasons, most implementation decisions seem quite arbitrary.
As a consequence, most decisions are completely uninteresting to me
unless I am told what motivated the authors to make them:
- The results of the three main data construction tasks
are stored into separate named graphs.
- Fuseki is used, Virtuoso is considered.
- Scripts are written in Python.
- 367 articles were converted to 80920 triples.
Data reengineering
------------------
For the data reengineering scripts, the authors claim that existing
datasets are incorrect or incomplete. However, it is not clear which
techniques their scripts use to resolve these incorrectness and
incompleteness issues. It is also unclear how they would ascertain
that a given dataset is (more or less) incorrect/incomplete.
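Even a small, explicit measurement against a manually verified sample
would address the latter point. The sketch below is only an illustration
of what I have in mind; the key field and record structure are invented:

    # Hypothetical sketch: estimating incorrectness against a
    # hand-checked gold sample, keyed on DOI.
    def error_rate(dataset, gold_sample, field):
        """Fraction of matched gold records whose field disagrees with the dataset."""
        keyed = {r["doi"]: r for r in dataset if r.get("doi")}
        checked = mismatches = 0
        for gold in gold_sample:
            record = keyed.get(gold["doi"])
            if record is None:
                continue  # missing records count toward incompleteness, not incorrectness
            checked += 1
            if record.get(field) != gold.get(field):
                mismatches += 1
        return mismatches / checked if checked else 0.0

Reporting such a number before and after reengineering would make the
correctness claims verifiable.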
Semantic enhancement
--------------------
The semantic enhancement section gives a lengthy description of an
example that results from the use of FRED, a reused tool.
It also describes the problem of disambiguating authors,
which seems to be an instance of the generic problem of resource
disambiguation.
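For reference, even a trivial baseline exists for this problem, and
anything the authors propose should arguably be compared against
something like the following (purely illustrative, not from the paper):

    # Hypothetical baseline sketch: grouping author mentions by
    # normalized name. Serious resource-disambiguation approaches
    # would add affiliation, co-author and venue features on top.
    import unicodedata
    from collections import defaultdict

    def normalize(name):
        """Lowercase, strip accents, drop punctuation, collapse whitespace."""
        ascii_name = unicodedata.normalize("NFKD", name) \
                                .encode("ascii", "ignore").decode()
        return " ".join(ascii_name.lower()
                                  .replace(".", " ").replace(",", " ").split())

    def naive_clusters(mentions):
        """Cluster author-name mentions that normalize to the same string."""
        clusters = defaultdict(list)
        for mention in mentions:
            clusters[normalize(mention)].append(mention)
        return clusters

Without such a comparison it is hard to tell whether the authors'
treatment of the problem goes beyond the state of the art.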
Citation typing is mentioned as future work.
So, in conclusion, section 4.2 does not introduce new research results.
Services
--------
The data browser is claimed to be user-friendly, but this is not evaluated.
The use of higher-level views is quite interesting,
but is not explained in great detail and is not compared to existing approaches.
E.g., I wondered whether the authors' approach is specific to the
dataset they use, or whether it can be applied to other datasets as well.
Some of the other services result from the use of FRED or are again future work.
Future work
-----------
As a minor point, section 4 mixes things that have been built/reused
with things the authors would like to build/reuse in the future.
The latter would be much better positioned in a "Future work" section,
since the reader now has to be careful to distinguish what has already
been done from what is intended to be done.
E.g., the last paragraph of section 4.2 and the last paragraph of section 4.3.
Also, section 5 is mostly about future work.
Conclusion
==========
The paper draws almost no conclusions, but enumerates future work.
The only conclusion that I can find is that "The integration of multiple
sources, cross-checked and merged together,
increases the correctness of the dataset."
But this conclusion has not been shown in the paper.
The notion of correctness was not defined.
The methods of cross-checking and merging are not properly described,
and are surely not reproducible.
No evaluation has been conducted, e.g. comparing the correctness of
an integrated dataset with that of a non-integrated one.
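Concretely, the comparison that the authors' conclusion calls for could
be as simple as the following miniature (all records are invented;
"gold" stands for a hand-checked sample):

    # Hypothetical sketch: the evaluation the conclusion calls for,
    # in miniature, comparing each source and the merged dataset
    # against the same gold sample.
    gold = {"10.1000/x1": "2013"}
    datasets = {
        "source A":   {"10.1000/x1": "2012"},   # wrong year
        "source B":   {"10.1000/x1": "2013"},
        "integrated": {"10.1000/x1": "2013"},   # after cross-checking and merging
    }
    for name, years in datasets.items():
        errors = sum(1 for doi, year in gold.items() if years.get(doi) != year)
        print(name, errors / len(gold))

If the integrated dataset beats every individual source on such a
measurement, the claimed increase in correctness would actually be shown.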
Wording
=======
Some of the wording seems to make little sense to me.
E.g., what is a "rich RDF triplestore" [p8]?
A special category of nonsensical wording is the use of
definable or quantifiable terms in an entirely subjective way.
E.g. "huge triplestore", "rich data", "high quality", "make sense of data".
Since none of these terms are defined or quantified, using them
makes little sense.
Grammar/spelling
================
The paper contains many grammatical and spelling errors,
most of which are easily deciphered.
In some cases, though, grammatical errors result in unclear sentences.
E.g., I have no idea what the following means: "[...] shaping the data structure of the LOD along one specific discipline to the detriment of all the others means that such discipline will be most probably the only one to which such LOD can be associated with." [p6]
As a closing remark, I believe the authors have done a lot of
interesting work, but are now trying to cram everything they did into
a single paper, which therefore lacks focus.