Facilitating Data-Flows at a Global Publisher using the Linked Data Stack

Tracking #: 1130-2342

Christian Dirschl
Katja Eck
Jens Lehmann
Lorenz Bühmann
Bert Van Nuffelen

Responsible editor: 
Michel Dumontier

Submission type: 
Application Report
Abstract: 
The publishing industry is on the verge of an era in which professional customers in particular are no longer primarily interested in comprehensive books and journals, i.e. traditional publishing products, but rather in structured pieces of information delivered just-in-time as a specific information need arises. This requires a transformation of publishing workflows towards the production of much richer metadata for fine-grained, highly interlinked pieces of content. Linked Data can play a crucial role in this transition. The Linked Data Stack is an integrated distribution of aligned tools that supports the whole lifecycle of Linked Data, from extraction and authoring/creation via enrichment, interlinking, and fusing to maintenance. In this application paper, we describe a real-world usage scenario of the Linked Data Stack at a global publishing company. We give an overview of the Linked Data Stack and the underlying lifecycle of Linked Data, describe data-flows and usage scenarios at a publisher, and then show how the stack supports those scenarios.


Solicited Reviews:
Review #1
By David Odgers submitted on 06/Aug/2015
Minor Revision
Review Comment:

-The application of Linked Data principles to the particular use case in this paper was well described.
-More concrete examples of the Linked Data Stack are needed in Sections 2 and 3.
-The storyline started in the Introduction concerning Gerhard is not carried forward as a motivating element.
-The Wolters Kluwer Germany storyline should be woven into the introduction.
-A storyline that begins directly after Section 3, “The Linked Data Life Cycle”, would make this much more readable.
-The description of WKG could be integrated into the storyline much more fluidly.
-Figure 2 has red underlines under the wording.
-Line spacing changes after Figure 3.

Review #2
Anonymous submitted on 14/Sep/2015
Major Revision
Review Comment:

In this paper the authors present a description of the use of a set of Linked Data management tools at a publishing company. This set of tools (the Linked Data Stack) helps the company manage its entire publishing process.

Introduction section
This section presents a data integration scenario based on a publishing company example. The example concerns local VAT modifications that have to be managed globally: every country has its own laws and amendments regarding VAT, and the company has to adapt to them.
I agree that this is a nice scenario for applying Semantic Web technologies. However, are there other scenarios in the company? Are there other scenarios more relevant to such publishing companies?

Overview of the Linked Data Stack Section
In this section the authors present the Linked Data Stack set of tools, a.k.a. the LOD2 Stack. The stack can be installed as a Debian package, uses RDF as its data model, and uses REST for accessing data. The authors point out that this work is an extension of [1] and [3]. They also point out that this paper is the application of [2] to this specific scenario.
Reading this section I got a bit lost: I do not understand the relation between [1], [2] and [3]. Can the authors clarify that relation a bit more?

Linked Data Lifecycle Section
Half of this section duplicates Section 2 of [1]; a reference to it would be appropriate. Not much more to say, since most of this section is based on [1] and [11].

Data-Flows at Global Publishers section
In this section the authors briefly present the company that provided the use case for this application report. The authors also describe briefly how the LOD2 Stack could help: the company's products are not connected with each other, and linking these products could be a way to provide a better customer service.
I miss more detail about how the LOD2 Stack could help to “provide a better customer service”. Is customer service the only use case? What does customer service mean exactly here?

Usage of the Linked Data Stack at WKG section
In this section the authors describe the requirements from WKG. These requirements drove the development of the solution to the data integration problems the company had (enriching content, vocabulary extensions, etc.). The major business requirements were:
- Processing and enriching mass content from partners
- Extension and consolidation of controlled vocabularies
- Managing content metadata addition depending on the sources
- Enabling vertical view of the content
Next the authors describe how they used the technologies in the LOD2 Stack: first explaining why they decided to use commercial software, then how they extracted content from documents, how they performed quality assurance, how they used the vocabularies for representing metadata, etc.
I have two comments in this section:
- How do all these processes relate to the major business requirements? The authors start by describing four requirements, but in this six-page section I do not see details about how the LOD2 Stack helps in solving them.
- I miss many details, such as the following:
  - Why did the authors decide to convert certain documents?
  - What vocabularies did they use/extend?
  - What are these vocabularies about?
  - How many controlled vocabularies are managed by PoolParty?
  - How many triples are stored in Virtuoso in total?
  - How many nodes/WKG departments are producing data?
  - How distributed is the data?
  - Do you use query federation?
  - Link discovery: are precision and recall 100%? How well does the tool perform? How often is the tool executed?
  - Is new data continuously added to the system? How often? How hard does the Linked Data Stack have to work?

Related work section
In this section the authors briefly describe some related work, namely the NY Times application and the BBC. However, since the authors talk about legal documents throughout the paper, I miss some related work on legal documents and the Semantic Web. Besides, isn't there already a company that has integrated highly distributed data from several corporate offices?

Overall comments:
In general I think this is a nicely motivated paper trying to show how a set of Semantic Web technologies works in a real scenario. However, I think it is hard to understand how these technologies are used to fulfil the company's requirements. There is a list of requirements, but I do not see how that list relates to the subsequent paragraphs in the same long section. Besides, I think that many interesting details are missing (details about the vocabularies, how URIs are generated, etc.). From my point of view, a Semantic Web journal should publish papers containing these details, not only a generic description of the use of these technologies, especially in an application report.

I also saw some typos:
Page 2: "After that, the vision of the vision" -> duplicated phrase
"but requirs" -> "but requires"
"afterwarts" -> "afterwards"
"All this are" -> "All these are"

Review #3
Anonymous submitted on 22/Oct/2015
Major Revision
Review Comment:

The main issue with this paper is that it lacks detail on the described processes, both in terms of concrete examples of the individual steps and in terms of overall "measures" of what is presented (data volume, concept space size, quality assessment for the detected links).
With respect to the previous version, there has been an attempt to fix this, but it falls short. Fig. 5 could be an interesting example, but it is not even cited in the text. Could the whole discussion in Section 5 refer to it?
Tables 2-4 also present information without context and are not very informative. Overall statistics of what? What is presented is a dynamic solution: what do these numbers refer to?
Are Tables 1-2 referenced in the text?

For instance, take the block "The creation and maintenance of domain specific Knowledge...": it would be much clearer if it were linked to something like Fig. 5, mentioning what happened in each step and providing some sense of measure: x documents were mapped to y concepts, z were enriched through DBpedia (approx. w% of the mappings had to be revised, ...).
At the moment, it is really hard to get a feeling from this discussion of what was done.

Minor issues:
The paper is well written until Section 4; then it starts to become confusing.

Block starting with "In general, there were to approaches for using tools...": what is specific to Linked Data in this?

Sentence "In the case of reconsiliation... the resulting data": can this be explained?

Sentence "The assessment of candidates... project budget": it is hard to read; can you rephrase it?

Block "The outcome of the extraction... SPARQL endpoint": a minor point, but it may give the idea that SKOS and RDF are different things.

"WKG decided to integarted": remove the "d".

Block "Despite the growing amount... publish legal resources": it is unclear what this means. Wasn't publishing legal resources part of WKG's core business?

Sentence "URIs in PoolParty based on controlled vocabularies are used by Valiant for the content transformation process and stored in Virtuoso": this is a bit imprecise. URIs are referenced in PoolParty; statements are stored in Virtuoso (which may in turn reference those URIs).