Facilitating Data-Flows at a Global Publisher using the LOD2 Stack

Christian Dirschl
Katja Eck
Jens Lehmann
Lorenz Bühmann
Sören Auer

Responsible editor: 
Michel Dumontier

Submission type: 
Application Report
The publishing industry is on the verge of an era in which professional customers in particular are no longer so much interested in comprehensive books and journals, i.e. traditional publishing products, as in possibly structured pieces of information delivered just-in-time as a certain information need arises. This requires a transformation of the publishing workflows towards the production of much richer meta-data for fine-grained and highly interlinked pieces of content. Linked Data can play a crucial role in this transition. The LOD2 Stack is an integrated distribution of aligned tools which support the whole lifecycle of Linked Data from extraction, authoring/creation via enrichment, interlinking, fusing to maintenance. In this application paper, we describe a real-world usage scenario of the LOD2 Stack at a global publishing company. We give an overview of the LOD2 Stack and the underlying life-cycle of Linked Data, describe data-flows and usage scenarios at a publisher and then show how the stack supports those scenarios.
Major Revision

Solicited Reviews:
Review #1
By Joachim Baran submitted on 03/Oct/2013
Minor Revision
Review Comment:

The journal's guidelines suggest that application reports are rather short documents, which the submitted manuscript with its 11 pages is certainly not. However, the work is presented in a well-written style, its sections are structured logically, and the report's narrative can be easily followed. The length of the document is therefore justified.

The presented application is of relevance to the scientific community and the authors place their contributions into context by providing ample related work and/or references. Most significantly, the presented work is being applied in real-world scenarios by a publisher.


Page 3, under "The Linked Data Lifecycle": The numbered list repeats the first ordinal (1, 1, 2, 3, etc., instead of 1, 2, 3, 4, etc.).

Page 3, Figure 1: Text of nodes with dark background color (blue, dark green) is hard to read. Lighter background colors should be chosen, or alternatively, the text should be white.

Page 4, Figure 2: Text shows "red squiggle" underlines that typically are used to denote incorrect words. These underlines should be removed.

Page 6, Figure 3: see comments regarding Figure 1.

Page 10, under "Usage of the LOD2 Stack at WKD": "aufsatz" should be [der] "Aufsatz". A translation of the word should be provided within the subsequent bracket that explains that the word is German.

Page 11, under "References": Many acronyms appear in lowercase, where they should be uppercase (RDF, RDBMS, etc.).

Review #2
By Jose Cruz submitted on 12/Nov/2013
Review Comment:

This application report describes the usage scenarios of the LOD2 stack at a global publishing company in Germany. Unfortunately, this report has not been written in a way that allows me to accurately review the technical soundness of its contents. The authors do not provide a satisfactory description of the methodology used, no clear and concise results are presented, and the conclusions in this report are vague. Moreover, this manuscript does not clearly delineate a problem statement that has been addressed/solved by the application of a Semantic Web technology.

Major Comments:
-Pages 1-2, Section 1: Introduction:
The authors describe the requirements of a real-world use case for the LOD2 stack at an accounting agency. However, no relevant technical details are given. For example, no explanation is given regarding how TWC manages to keep track of changes to value-added tax return regulations across Euro zone countries. How is the data retrieved, normalized (syntactically or semantically) and queried?

-Pages 2-3, Section 2: Overview of the LOD2 Stack:
In this section the authors attempt to provide a methodological overview of their use of the LOD2 stack; however, a description of how the data is gathered and transformed into RDF is missing. Also, the authors fail to provide a list of known ontologies or vocabularies that they are using to integrate their data. Lastly, an explanation of how the data is being normalized across the different datasets would be imperative.

- Pages 3-4, Section 3: The Linked Data Life Cycle:
In this section the authors do not introduce any of the 8 stages of LOD2's Linked Data Life Cycle with tangible, technically relevant examples, which makes this section somewhat difficult for the reader to follow.

-Pages 4-5, Section 4: Data-Flows at Global Publishers:
The authors provide an extraneous and extensive description (including Figure 2) of the “Business units” that form part of the company Wolters Kluwer Germany (WKG). Please summarize or remove this section.

-Pages 5-8: Section 5: Usage of the LOD2 stack at WKG
The authors should provide an explanation as to why they decided to use a tool like Valiant for directly transforming XML into RDF. Please provide details regarding the type of data being serialized by these XML documents and the inherent challenges in preserving the intended semantics when performing automatic syntactic transformations.

Minor Comments:
-Page 2, column 1, second paragraph:
In the sentence “The evolution of the extracted knowledge needs to be supported, since the original documents might change.”, the authors should clarify what is meant by “supporting” the evolution of extracted knowledge. How are they suggesting that this would be done?

- Page 2, column 1, second paragraph:
In the sentence: “We can apply reasoning techniques to enrich the knowledge bases with upper level structures”, what is meant by “structures”? Please clarify if the authors are referring to an ontology or a vocabulary.

- Page 2, column 1, last paragraph:
The acronym “ERP” has not been properly described.

- Page 3, column 2:
Please rectify the numbering of the stages that form part of the Linked Data life cycle.

-Page 4, column 1:
Please clarify the use of “mutual fertilization” in the sentence: “... but by investigating methods which facilitate a mutual fertilization of ...”.

-Page 5 Section 5:
Please revise the title of this section. I think it should read WKG and not WKD.

- I would advise the authors to make use of a proofreading service to check for the minor grammatical and spelling mistakes found throughout the manuscript.

Review #3
By Andrea Splendiani submitted on 26/Jan/2014
Major Revision
Review Comment:

This paper reports on the application of Semantic Technologies (the LOD2 stack in particular) in the operations of a specialized global publisher.
As an application report, this is a worthwhile contribution.

However, I think the paper should be improved in order to provide more value to its readers.
In particular, it would be good to provide measures for the data processed (size of terminologies, document sets).
It would be useful to know what assessment was made of the proposed technologies: how was the improvement over existing solutions measured?
The paper mentions the use of DBpedia for enrichment of information: was there any data-quality issue related to DBpedia?

From another point of view, this paper focuses on technology. It would probably be of interest to its "public" if there were a running example showing the different stages of a processed document.

Minor points:
On page 2, "let's assume the following real world user scenario": I think this example should be better introduced, as it is not evident at first that the "tax" example relates to publishing. Just noting that the example relates to a specialized field of publishing would help the reader.

Overview of the LOD2 Stack: perhaps there should be a mention that components of the LOD2 stack (or at least some of them) will be presented in more detail later. Maybe what is now on page 9 could be put in a table and cited here.

Page 4: the discussion on the use of Semantic Pingback seems to refer to future work: if so, this belongs in a discussion/future work section. If not, it should be clarified.

Page 7: there is really no point in mentioning that, since the technology was new, you preferred to opt for two commercial tools for support. It reads like gratuitous advertisement. It would be more sensible to use a more general sentence, such as "we preferred to rely on commercial providers when available", and to specify the nature of the providers/support for each component when it is presented.

"Since it is based on SKOS, it gave us all the freedom we needed for that purpose": can you elaborate? With SKOS you gain freedom, but you lose consistency. Can you explain how SKOS was a better fit in your case? What were the limitations of a stricter taxonomy/ontology approach?

Page 8:
"the data in the metadata... gave completely new possibilities": can you give an example?

Page 10:
Why OWL and closed-world assumptions and not a SPIN/SPARQL-based approach?