Reviewed

This category lists all reviewed submissions; for papers under review please visit the under review papers section (http://www.semantic-web-journal.net/category/tags/underreview).

EARTh: an Environmental Application Reference Thesaurus in the Linked Open Data Cloud

Paper Title: 
EARTh: an Environmental Application Reference Thesaurus in the Linked Open Data Cloud
Authors: 
R. Albertoni, M. De Martino, S. Di Franco, V. De Santis, P. Plini
Abstract: 
The paper aims at providing a description of EARTh, the Environmental Application Reference Thesaurus. EARTh represents a common general terminology for the environment, which has been published as a SKOS dataset in the Linked Open Data cloud. It promises to become a core tool for indexing and discovering environmental resources by refining and extending GEMET, which is considered the de facto standard when speaking of general-purpose thesauri for the environmental domain in Europe. The paper illustrates the main key characteristics of EARTh as a guide to its usage. It clarifies (i) the methodology adopted to define the EARTh content; (ii) the design and technological choices made publishing EARTh as Linked Data; (iii) the information pertaining to its access and maintenance. Descriptions of EARTh applications and future relevance are also highlighted.
Submission type: 

5

Responsible editor: 
Decision/Status: 
Reject and Resubmit
Reviews: 

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

Resubmission after a "reject and resubmit" in round one and also in round two. Round one reviews are beneath the round two reviews.

Solicited review by Natasha Noy:

I commend the authors for taking care to consider and address the reviewers' comments. The submission is much stronger now. However, I think it still requires improvements and clarifications in order to be published in SWJ.

First, the links, through GEMET, to other resources are very nice and indeed make it part of the "cloud" rather than an island. However, I would have liked to see more discussion about the additional, non-GEMET, links that were discovered through SILK and verified by domain experts. How many of these are there? How good were the SILK mappings? What fraction of EARTh is now linked to other thesauri?

The examples and the queries are nice and give a better idea of what's inside. However, for the SPARQL queries, it would be nice to give some intuition as to what these queries are supposed to find? What is specific about these datasets? For Query 3, you might want to focus on the links only as this is what (I think) you want to show off.

My major concern remains the usage of the dataset. Section 4 mentions several projects that link to the LOD version. The only "use", however, seems to be creating the exactMatch links. How are these links used by those projects? How did having the EARTh thesaurus available in the LOD cloud help those projects? Being able to link is probably not a goal per se. What can they do now that these links are available that they couldn't do before? I think without such discussion, it is hard to argue for the usage of the dataset.

Also, the submission "teases" with some technical details but never quite provides enough information. For instance, you mention that there are a number of additional relations among the classes. What are they? Table 1 lists only the "usual" ones. There is a mention of materializing relations. Which ones? It's a bit hard to understand exactly the content of the dataset without such details. I would suggest being much more precise about these types of details so that the readers can get a very clear idea of what to expect and what types of relationships they would see.

Solicited review by Marta Sabou:

The authors have significantly improved the EARTh LOD dataset by linking it to several other datasets (DBPEDIA, AGROVOC, EUROVOC, UMTHE). Additionally, the paper has been extended and improved, especially in terms of examples and a clear description of where and how EARTh is used (section 4). Therefore, I suggest accepting this re-submission as is.

Solicited review by Tomi Kauppinen:

I checked the new version, and the reply to reviewers. All the points have been fully addressed, so I recommend accepting the paper.

Round one reviews:

Solicited review by Natasha Noy:

The paper describes the publication of a thesaurus of bi-lingual environmental terminology as a linked dataset.

The authors spend a fair amount of time describing the database structure, which is perhaps less relevant for this special issue. Instead, the readers might be better served if there were some examples of the terms, perhaps topics that are covered, etc. I think the paper does not contain a single environmental term other than Earth anywhere in the discussion.

I would have also liked to see some discussion of linking this dataset to some other publicly available datasets. It seems that at the moment it is an "island" and the authors never discuss what, if anything, they might gain by using the linked data. It seems that it is more of a format-publishing decision than a true linked dataset.

Finally, it would have been nice to read a bit more about the relevance and applications. It is one of the key criteria for the special issue, and the authors don't do a very good job of convincing the reader that the others have found the dataset relevant. How is it used in projects? The only reference seems to be to another project by the same authors [11]. Does anyone outside of the authors' group use or plan to use the dataset? If they do, it would have been nice to have a description.

Solicited review by Marta Sabou:

The paper presents the EARTh LOD dataset, a thesaurus in the environmental domain derived through the refinement and extension of GEMET and enabling tasks such as indexing and discovering environmental resources. Based on the criteria of the call, I judge this paper as follows.

Quality of the dataset.
High. The paper reports on exposing a thesaurus (EARTh) that has been obtained through refining GEMET from as early as 2001. EARTh is used in several projects, which testifies to its quality and usefulness for the field. The EARTh data has been exposed using the D2R server and it provides both human accessible semantic descriptions and a SPARQL endpoint. In terms of linking, currently there are 4000 links to GEMET, which is a side-effect of the refinement process that led to EARTh rather than of a link-establishing procedure. Given the broad coverage of EARTh as well as the aim to use it for indexing, linking it to multiple other LOD sources (for example, the AGROVOC data set submitted to this call) would be a major benefit. The authors conclude their paper by envisioning future work in link creation. Could they extend this part with some concrete LOD datasets that they are considering for the linking process?

Usefulness (or potential usefulness) of the dataset.
High. Based on section 4, it is evident that this dataset could play a pivotal role in the environmental domain, not just for indexing documents but also by becoming a hub for interlinking with other thesauri in this domain (in the NatureSDIplus project). Section 4 should be improved by making it more concrete. For example, the authors use the rather generic term of "recognize" in relation to important institutions/projects. What is concretely meant here? Did these organizations/projects commit to use EARTh? For what purposes exactly? Such clarifications will greatly increase the quality of the paper.

Clarity and completeness of the descriptions.
Good. The authors provide enough details about the dataset and the publishing process, but the paper contains several typos and the URLs in the footnotes cannot be clicked (plus they do not print properly either, probably due to some font inclusion issue).

Minor comments and some typos:
* the references are formatted according to different styles, e.g., the publication year appears sometimes after the authors and other times at the end of the reference.
* abstract:
** "main key characteristics" => keep either main or key but not both
** "made publishing" => "made when publishing"

*section 2 - revise second sentence of intro text, currently it does not make sense

*section 3.1:
** "has been adopted" => "have been adopted"
** "Naturals Keys" => "Natural Keys"

*section 5: "evolves as result" => "evolves as a result"

Solicited review by Tomi Kauppinen:

The authors had the goal of providing a thesaurus called EARTh online as Linked Data. The authors state that the EARTh content is accessible via HTTP dereferenceable URIs. However, the content (i.e. relations to other concepts) is only delivered as HTML and not as RDF. For example, the URI http://linkeddata.ge.imati.cnr.it:2020/page/EARTh/34910 serves only an HTML version of the description of relations. The authors of course provide a SPARQL endpoint and an RDF dump, but it would be useful to get RDF directly served from the URIs as well. The authors should thus clarify this issue. Moreover, loosely speaking a thesaurus could perhaps be considered as a Linked Data dataset, but then it would be essential to have some convincing linkage to a variety of other thesauri. In other words, the concern is that EARTh seems not to be linked to other data in the Linked Open Data cloud, except via sharing the use of properties from the SKOS vocabulary. The authors mention linkage to GEMET, but it is not served as Linked Data at all when tested (e.g. http://www.eionet.europa.eu/gemet/concept?cp=9290). Taking all this into consideration, I am not convinced that this paper should be published in the SWJ special issue, at least not before the above mentioned issues are taken into account.
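The dereferencing concern above comes down to content negotiation: a client asks the same URI for RDF rather than HTML through the HTTP Accept header. The following is a minimal sketch of such a request, not a description of how the EARTh server behaves. The helper name `rdf_request` is hypothetical, the URI is the one quoted in the review, and the request is only built here, not sent.

```python
from urllib.request import Request


def rdf_request(uri: str, mime: str = "application/rdf+xml") -> Request:
    """Build (but do not send) a GET request that asks for RDF instead of HTML."""
    return Request(uri, headers={"Accept": mime}, method="GET")


# URI quoted in the review; whether the server honours the Accept header
# is exactly the open question raised by the reviewer.
req = rdf_request("http://linkeddata.ge.imati.cnr.it:2020/page/EARTh/34910")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would perform the lookup; a server that supports Linked Data access would be expected to answer with an RDF serialization, possibly after a redirect.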

One minor issue:

- The references list is quite messy: many references lack details, and they are formatted in a variety of different ways. Please consider polishing them.

Linked European Television Heritage

Paper Title: 
Europe’s Television Heritage
Authors: 
Nikolaos Simou, Nasos Drosopoulos and Vasillis Tzouvaras
Abstract: 
The EUscreen project represents the European television archives and acts as a domain aggregator for Europeana, Europe’s digital library. The main motivation for its creation was to provide unified access to a representative collection of television programs, secondary sources and articles, and in this way to allow students, scholars and the general public to study the history of television in its wider context. In this paper, we present the methodology followed for publishing the EUscreen dataset as Linked Open Data.
Submission type: 

5

Responsible editor: 
Decision/Status: 
Major Revision
Reviews: 

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

This is a revision after a "reject and resubmit", now "conditionally accepted with major revisions". The original submission was entitled "Europe’s Television Heritage", and its reviews can be found beneath the second round reviews.

Solicited review by Aidan Hogan:

Thanks to the authors for the revision and the response letter.

My main concerns with the paper related to presentation issues, as well as issues with the dataset itself. Both concerns have been partly but not fully addressed. A notable improvement is that the dereferenceability of URIs has been fixed: the dataset could now be considered as Linked Data.

In terms of presentation issues, the authors have provided some examples of RDF data produced by the process which helps get an idea of the dataset. They have also addressed various other issues raised by myself and other reviewers. Still, however, the paper suffers from presentational issues:

* There are still about 25/30 typos. (Some of these could be fixed with an automatic spell-checker.)

* Figure 2 needs to be resized.

* With the exception of the itemized URIs in Section 4.1, I think all URLs should be given as footnotes to prevent interrupting the text so frequently.

* The spacing (both horizontal with respect to justification and vertical with respect to spacing around figures, etc.) is a bit ugly. Perhaps this can be improved for the camera-ready version?

* The references are underlined, ugly and difficult to read.

All of that said, the paper is understandable and reads okay. Problems are more minor issues wrt. sloppiness rather than a lack of legibility. Please pay more attention to these issues!

My main concerns still lie with quality of the dataset. As I say, URIs now seem to dereference correctly to RDF, which is good. However, other issues are only partly addressed and I'm thus still concerned about the usability of the data:

* The overuse of literals is still present. Granted the authors mention this as a weakness of the data, but I don't buy their excuse for having literals like this. The problem that the authors encountered, as I understand it, is with extracting unique and unambiguous URIs for literals. But, for example, I see literals like "Stereo" for ebu:hasAudioFormat and "Colour" for ebu:hasVideoFormat. I don't see what the problem would be here: there's presumably a controlled set of literal values for these attributes, and converting them to URIs using the labels as suffixes would seem simple enough. In fact, the ontology *requires* that many of these properties are given URI values (discussed later) so it is erroneous to have literals in these positions.

* Relatedly, for me, sticking with the EBUCore ontology is really dragging down the quality of the dataset. The usability of the dataset is from the perspective of the consumer. As a generic Linked Data consumer, the EBUCore properties and classes used in the data currently mean nothing: (i) they cannot be dereferenced and (ii) they are not related to popular terms elsewhere.

~ First, the URIs are not dereferenceable. In their response letter, the authors acknowledge this problem and state that they have contacted the maintainers. However, as it stands at the moment (which is all I can evaluate the dataset on), the lack of dereferenceability means that the semantics of the class and property terms are lost: in a Linked Data setting, they're just URI strings.

~ Second, little re-use of *extremely* common existing terms is present. Thus, a Linked Data consumer that understands popular terms like "rdfs:label" as a name for something, or "dct:creator" as the person who created something, or "foaf:thumbnail" as a small image for something that can be displayed to users, can do very little with the current data. After dereferenceable URIs, a core tenet of Linked Data is to stop people from creating yet another "Document" or "Agent" class or yet another "title" or "name" or "latitude" or "longitude" property. In Linked Data, re-inventing such common terms *again* (and again and again…) is a capital offence.
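The earlier suggestion in this review, to convert controlled literal values such as "Stereo" and "Colour" into URIs using the labels as suffixes, can be sketched as follows. This is only an illustration: the function name `mint_uri`, the path segments `audio-format` and `video-format`, and the choice of base namespace are assumptions, not the scheme used by EUscreen.

```python
import re

# Assumed base namespace for illustration; the real dataset may use another.
BASE = "http://lod.euscreen.eu/resource/"


def mint_uri(kind: str, label: str, base: str = BASE) -> str:
    """Turn a controlled-vocabulary label into a URI by using it as a suffix.

    The label is lower-cased and every run of characters outside a-z and 0-9
    becomes a single hyphen, so non-ASCII labels would need extra handling.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", label.strip().lower()).strip("-")
    if not slug:
        raise ValueError("label yields an empty suffix; omit the property instead")
    return f"{base}{kind}/{slug}"


print(mint_uri("audio-format", "Stereo"))
print(mint_uri("video-format", " Colour "))
```

Rejecting empty labels mirrors the remark made later in the review that a missing value should lead to the attribute being omitted rather than filled with a blank literal.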

To me, it seems that EBUCore was designed as a "traditional" ontology: to be self-contained (it defines a lot of terms already made available in FOAF, DC, DCTERMS, GEO, etc.), to be loaded manually (it's not dereferenceable), etc. Unless the ontology is improved (made dereferenceable, linked to legacy terms, etc.), it's not suitable for a high-quality Linked Data export. In terms of possible remedies, in order of preference:

* Use the W3C Recommended Media Resources 1.0 vocabulary directly. It at least dereferences and seems to cover what you need. (Unfortunately it also reinvents several common terms, but manual mappings are described in the specification.)

* Get the EBUCore ontology to dereference. Ideally get terms linked to their legacy counterparts. If not, use legacy terms directly in the data as opposed to EBUCore ontology (e.g., dct:creator). If you still need the EBUCore properties in there, provide redundancy with both legacy and EBUCore properties.

* At the very least, discuss the weaknesses of the EBUCore ontology as weaknesses of your dataset.

In any case, much of your RDF export is not compatible with EBUCore (apologies for not noticing this in the previous review). Looking at the example data from:

http://lod.euscreen.eu/data/EUS_55F569268ACA42B186682960875F862B.rdf

I find the following issues (probably not a complete list):

* The following properties are defined as ObjectProperty in the ontology, but given literal values in the data (relating again to the issue of overuse of literals):
~ hasSubject
~ hasKeyword
~ hasObjectType
~ hasFormat
~ hasGenre
~ hasLanguage
~ hasVideoFormat
~ hasPublicationChannel (aside: if a value is not given, omit the attribute ... don't just give it a blank literal)

* The following properties have a defined range incompatible with how they are used in the data:
~ locator (range anyURI, given plain literal)
~ identifier (range anyURI, given plain literal)

* The property ebucore:topic is used but not defined
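A check of the kind listed above can be automated. The sketch below flags triples whose predicate is declared as an ObjectProperty but whose value is a literal. The property list is copied from the review; the `Triple` type, the `ebucore:` prefix notation, and the sample data are hypothetical and stand in for triples parsed from the real export.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    obj_is_literal: bool


# Properties the review reports as ObjectProperty in the ontology.
OBJECT_PROPERTIES = {
    "ebucore:hasSubject", "ebucore:hasKeyword", "ebucore:hasObjectType",
    "ebucore:hasFormat", "ebucore:hasGenre", "ebucore:hasLanguage",
    "ebucore:hasVideoFormat", "ebucore:hasPublicationChannel",
}


def literal_violations(triples):
    """Return triples that give a literal where the ontology expects a resource."""
    return [t for t in triples
            if t.predicate in OBJECT_PROPERTIES and t.obj_is_literal]


# Hypothetical sample data, for illustration only.
sample = [
    Triple("ex:programme1", "ebucore:hasVideoFormat", "Colour", True),
    Triple("ex:programme1", "ebucore:hasPublicationChannel", "", True),
    Triple("ex:programme1", "ebucore:hasGenre", "ex:genre/factual", False),
]
violations = literal_violations(sample)
for t in violations:
    print(t.predicate, repr(t.obj))
```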

The authors still have work to do on their dataset, and (to a lesser extent) on their presentation. Again, the dataset is interesting and the direction is encouraging, but *more attention to detail is needed* for both the paper and the data.

Solicited review by Michael Hausenblas:

The authors have addressed all the issues I've raised and made the paper much more readable, it is ready for publication now.

Solicited review by Emanuele Della Valle:

The paper has largely improved in content and form. I recommend the authors to address the following minor issues before accepting it:
- links in-line in the text may be replaced by footnotes. For instance, instead of writing "European (http://www.europeana.eu/)" the authors may add a footnote of the form "See http://www.europeana.eu/ September 18, 2012."
- on the left column of page two, the authors refer to a project survey and some reports. They are not available on the project website (http://euscreen.eu/). The authors should either avoid referring to them or should make them available and add appropriate references.
- Figure 1 is not very informative. The authors may instead show a high-level representation of EBU Core (e.g., figure 2 in http://tech.ebu.ch/docs/tech/tech3293v1_3.pdf)
- a direct link to EBU Core (http://tech.ebu.ch/lang/en/MetadataEbuCore), EBU Core ontology (http://www.ebu.ch/metadata/ontologies/ebucore/) and MAWG (http://www.w3.org/2008/WebVideo/Annotations/) should be added; the readers should not have to dig them out of the Web on their own.
- in Figure 2, baseURI should be replaced by lod.euscreen.eu
- the link to the google doc should be replaced by a reference to a project deliverable (see also the first comment in this list)
- how many links to DBpedia were added? 1365 (as written in the left column of page 5) or 1490 (as written in Table 1)?
- the sparql endpoint may be placed under lod.euscreen.eu
- the example SPARQL query may be replaced by the following federated query that shows the value of the linking
PREFIX ebu:
PREFIX dbp:
PREFIX dbp-onto:
SELECT ?video ?actor
WHERE {
SERVICE
{ dbp:James_Bond dbp-onto:portrayer ?actor }
SERVICE
{ ?video ebu:mentionedPersonInSummary ?actor }
}
- in the conclusion section, the following three statements appear weak; consider rethinking them:
- "However having in mind that the type of content served by EUscreen is European television programmes, we can say that its size is significant." -- why?
- "The reason why we preferred to keep the original values from metadata creating literals was because we did not want to destroy or lose this information." -- why? most of the values can be captured by ontological instances. For instance "Stereo" --> ":stereo", "Colour" --> ":colour", etc.
- "we intend to upgrade the existing triplestore with one that supports federated SPARQL queries" -- why?
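The federated query proposed in this review lost its prefix IRIs and SERVICE endpoints on this page. The sketch below shows how a query of that shape could be assembled as a string; it is not a reconstruction of the reviewer's exact text. All IRIs here are assumptions: the `ebu` namespace is inferred from the EBUCore URI quoted elsewhere in these reviews, the `dbp` and `dbp-onto` namespaces are the usual DBpedia ones, and the EUscreen endpoint passed in is a placeholder.

```python
# Assumed prefix IRIs; the originals are missing from the page.
PREFIXES = {
    "ebu": "http://www.ebu.ch/metadata/ontologies/ebucore#",
    "dbp": "http://dbpedia.org/resource/",
    "dbp-onto": "http://dbpedia.org/ontology/",
}


def federated_query(dbpedia_endpoint: str, euscreen_endpoint: str) -> str:
    """Assemble a SPARQL 1.1 query that joins two endpoints on ?actor."""
    prefix_lines = "\n".join(
        f"PREFIX {name}: <{iri}>" for name, iri in PREFIXES.items()
    )
    return (
        f"{prefix_lines}\n"
        "SELECT ?video ?actor\n"
        "WHERE {\n"
        f"  SERVICE <{dbpedia_endpoint}>\n"
        "    { dbp:James_Bond dbp-onto:portrayer ?actor }\n"
        f"  SERVICE <{euscreen_endpoint}>\n"
        "    { ?video ebu:mentionedPersonInSummary ?actor }\n"
        "}"
    )


# The second endpoint is a placeholder, not the project's real address.
query = federated_query("http://dbpedia.org/sparql", "http://example.org/sparql")
print(query)
```

The point of the shape is that the join variable `?actor` is bound by one service and reused by the other, which is what would demonstrate the value of the links to DBpedia.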

First-round reviews:

Solicited review by Aidan Hogan:

This paper presents ongoing efforts to expose metadata about European television heritage as Linked Data. The metadata describe various programmes selected as being relevant to significant 20th Century European historical events. As part of the EUscreen project, content providers select relevant programmes and upload them along with metadata offering information about title, series title, language, genre, subject, etc. Metadata is uploaded to the "MINT" service in (standard) XML or CSV format, which allows for various editing and transformation steps. The current paper proposes taking these metadata in XML format and converting them to RDF using the EBUCore ontology to model the output. A Linked Data platform is then built using dereferenceable naming schemes. Countries mentioned in the metadata have been linked to DBpedia. In total, 22,190 programme resources are currently described, with a total of 114,142 including related resources. 4store is used to host the data.

The data sound interesting to have available online as Linked Data, esp. if integrated with Europeana. In general, the description is fairly well written, though some parts of the text could do with a more thorough proofread. It does an adequate job of giving the reader an impression of what data is being exported and how. That said, the description does have some significant shortcomings that should be addressed:

* No overview of the model/vocabulary/ontology/schema is provided. The linked Google spreadsheet does not help to get an overall picture of the data being captured. A diagram showing the key classes, properties and their inter-relation would help a lot.

* Furthermore, instead of describing the resulting RDF data in prose, it would be better to give some example(s) of instance data created by the process and shorten the current text.

* I would like to see argumentation as to why making these metadata available as RDF is useful/important? What can people do with the data? Can they be combined with other datasets in a non-trivial way? If so, which datasets (DBpedia countries is not very convincing)? Can new questions be asked against it using SPARQL? This should be argued in the paper.

Without these details, the description falls short of communicating what kind of data is being exported and also falls short of arguing why the dataset is interesting for Linked Data consumers. These are important aspects of the evaluation of the submissions for this Special Issue.

The other important part of evaluating submissions is, of course, the dataset itself. I did manage to find the RDF linked in the paper online. However, I did find certain shortcomings in how it is published:

* The resource URIs do not seem to dereference correctly to their RDF/XML descriptions. This is of course the key aspect of Linked Data. If I look up:
- http://lod.euscreen.eu/resource/EUS_55F569268ACA42B186682960875F862B
(taken from the paper) with Accept: application/rdf+xml, I get a 303 redirect to:
- http://lod.euscreen.eu/data/EUS_55F5692.rdf
However, this URI gives a 404.

* The EBUcore vocabulary used also does not dereference. For example, if I look up
- http://www.ebu.ch/metadata/ontologies/ebucore#hasAffiliation
looking for RDF/XML, I get a 301 to a directory containing the relevant OWL description. This is no good for a software agent.

* I did find the HTML example and the RDF/XML example data at
- http://www.euscreen.eu/play.jsp?id=EUS_55F569268ACA42B186682960875F862B
- http://lod.euscreen.eu/data/EUS_55F569268ACA42B186682960875F862B
respectively. Having a ".rdf" extension on the latter would be welcome. Otherwise, I do have some comments about the data I found in the RDF file.

~ First, although there is some re-use of legacy terms for describing documents, there is the potential for a lot more re-use of existing vocabularies. One option is to use the external term directly. Another is to map to the external vocabularies from the EBUCore ontology. Some suggestions for re-use:
# ebucore:name -> foaf:name, rdfs:label, ...
# ebucore:summary -> rdfs:comment, dc[t]:description, ...
# ebucore:hasSubject/ebucore:topic -> dc:subject, ... preferable to use SKOS scheme
# ebucore:alternativeTitle -> skos:altLabel
# ebucore:dateCreated -> dcterms:created
# ebucore:rights -> better to try reuse cc: vocabulary and licence URIs where possible?
# ebucore:genre -> po:genre
...and so on. Also for linking, SKOS offers skos:exactMatch, skos:narrowMatch, skos:broadMatch and skos:relatedMatch. In general, the authors should look to either re-use or map to equivalent terms in DC(TERMS), RDFS, FOAF, SIOC, Music Ontology, Programme Ontology, SKOS, etc. (Note that re-use of vocabularies is a key feature of Linked Data and was one of the criteria mentioned in the evaluation of the submissions for this Special Issue.) Alternatively, the authors can look to re-use the "Ontology for Media Resources 1.0" as mentioned.

~ Second, there seems to be an overuse of literals. For example, formats like "Video" should be given a URI (possibly even a class), same for genres like "Factual", same for topics and subjects which should probably use SKOS, same for licences (though I note ebucore:rights does use URIs). Ideally keywords could also be given URIs (if they can be successfully disambiguated).
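The re-use suggestions in the first comment above amount to a small mapping table. The sketch below applies such a table to add common-vocabulary triples alongside the EBUCore ones, so that both remain available. The mapping entries come from the suggestions in the review; the function name, the tuple representation of triples, and the sample data are hypothetical.

```python
# EBUCore term -> widely used equivalents, following the review's suggestions.
TERM_MAP = {
    "ebucore:name": ["foaf:name", "rdfs:label"],
    "ebucore:summary": ["rdfs:comment", "dcterms:description"],
    "ebucore:alternativeTitle": ["skos:altLabel"],
    "ebucore:dateCreated": ["dcterms:created"],
    "ebucore:genre": ["po:genre"],
}


def add_common_terms(triples):
    """Keep every original triple and append one copy per mapped common term."""
    out = list(triples)
    for subject, predicate, obj in triples:
        for common in TERM_MAP.get(predicate, []):
            out.append((subject, common, obj))
    return out


# Hypothetical sample data, for illustration only.
sample = [
    ("ex:person1", "ebucore:name", "Jane Doe"),
    ("ex:programme1", "ebucore:hasFormat", "Video"),
]
enriched = add_common_terms(sample)
for triple in enriched:
    print(triple)
```

Emitting the common terms redundantly, rather than replacing the EBUCore ones, is one of the two options the review names; publishing the mapping in the ontology itself would be the other.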

Given the shortcomings in the description (esp. no overview of model or examples of data, no arguments as to why these data are good to have exposed as Linked Data) and the dataset (esp. problems with dereferenceability and lack of re-use of vocabularies), I cannot recommend an accept at this time.

Solicited review by Michael Hausenblas:

Overall the paper is a valid contribution and on-topic but has some presentation issues that should be addressed before it gets accepted.

The authors describe the publishing of the European television archives dataset through the EUscreen project at http://lod.euscreen.eu/ and provide insights into design decisions in the process. The dataset is relevant and of high quality, the potential usefulness is given (although could be extended beyond one use case). The dataset description seems complete but lacks clarity.

## Core DSD
Core questions concerning the DSD including licensing and availability are listed.

## Publishing and metrics
The authors clearly described the coverage and provided relevant metrics as well as discussed the access methods in Section 4. It appears to me that the authors performed a manual interlinking task ("the names of the local dataset countries were compared using SPARQL [7] to names of the countries resources served by DBpedia." in Section 4.2) - it would be good to highlight why this has been done and if semi-automatic approaches such as Silk or Limes could be useful.
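The name comparison described in the quoted sentence can be pictured as exact matching on normalized labels. The sketch below is neither Silk nor Limes, nor the authors' SPARQL-based procedure; it only shows the basic idea under stated assumptions. The function names, the `urn:example:` identifiers, and the sample labels are hypothetical.

```python
import unicodedata


def normalize(label: str) -> str:
    """Case-fold, strip accents and collapse whitespace before comparing."""
    decomposed = unicodedata.normalize("NFKD", label)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.casefold().split())


def match_by_label(local: dict, remote: dict) -> dict:
    """Map each local URI to the remote URI with the same normalized label."""
    index = {normalize(label): uri for uri, label in remote.items()}
    links = {}
    for uri, label in local.items():
        key = normalize(label)
        if key in index:
            links[uri] = index[key]
    return links


# Hypothetical sample data, for illustration only.
local = {"urn:example:country:1": "  greece ", "urn:example:country:2": "Atlantis"}
remote = {
    "http://dbpedia.org/resource/Greece": "Greece",
    "http://dbpedia.org/resource/France": "France",
}
links = match_by_label(local, remote)
print(links)
```

Exact matching of this kind misses spelling variants and translated names, which is where the similarity measures offered by semi-automatic link discovery tools would come in.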

## Examples, modeling patterns and shortcomings
Examples are provided (though one representative, complete example in RDF/Turtle syntax or as a graph figure might be beneficial to include) and the modeling process including the design decisions is present. I did not find a proper discussion about shortcomings of the dataset, though.

## What is missing
Besides a 'related work' section the authors have covered the relevant parts, content-wise. The main issue I have is with the presentation (see below).

## Editorial comments
Although the use of English is not too bad, the paper would benefit from another round of proofreading, ideally by a native speaker. In addition, the article is somewhat wordy - I will provide concrete suggestions on what could be cut down in the following.

Presentation:

Section 1 and Section 2 provide the background and should be dramatically shortened into one section. For example, the entire history (around EBUcore to MPEG7) can be removed as not directly relevant. Then, in Section 2 there is IMO no need to go into the details of the EUscreen project consortium and goals. Simply describe the topics in one paragraph (listing at the end of the section is core, I think).

Section 3 contains a number of irrelevant descriptions and can be cut down; for example, the entire paragraph "Registered users can start by uploading their metadata records in XML or CSV serialization, using the HTTP, FTP and OAI-PMH protocols. Users can also directly upload and validate records in a range of supported metadata standards (XSD). XML records are stored and indexed for statistics, previews, access from the mapping tool and subsequent services. Handling of metadata records includes indexing, retrieval, update and transformation of XML files and records. XML processors are used for validation and transformation tasks as well as for the visualization of XML and XSLT." can be stated in one short sentence.

In Section 4.1, the sentence "The complete set of properties and classes used for the mapping of all the harvesting schema's elements can be found at https://docs.google.com/spreadsheet/ccc?key=0Akru w5a0_oaLdEQyMl85NVQxZ2lmT00wcVU4ZVRJZ 0E&hl=en_US#gid=3" is sort of poor in terms of presentation - can this be made available via a nicer location and in a better digestible format?

In Section 4.2, I suggest to remove "External RDF links are crucial for the Web of Data as they are the glue that connects data islands into a global, interconnected data space [5]." as it is a generic statement and doesn't add anything here.

In Section 4.3, I suggest to turn the paragraph "At the moment the pilot holds 22.190 programme resources while the total amount of resources is 114.142. Among the total resources, 13.158 are made for persons individuals referring to the contributor of the programme while 582 are made for countries - linked to 1439 externals- and 22 for languages – linked to 63 externals. In addition by using spotlight, 1490 person resources are extracted to which links are made from 1133 programmes' English summaries." into a table that makes it easier to understand.

The Section 5 is again quite wordy and also introduces new facts: "In particular in total 2855 person resources were extracted and 1365 of them were wrong (manually filtered), despite the fact that the confidence value in the spotlight setup was set high." - I suggest to move this into Section 4.
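Taken at face value, the two figures in the quoted sentence determine how many extracted person resources survived the manual filtering, and hence the precision of the extraction step. The short calculation below makes that explicit; it assumes that "wrong" means every extracted resource that was discarded, which the quoted sentence does not state outright.

```python
extracted = 2855  # person resources extracted, as quoted
wrong = 1365      # of those, manually filtered out as wrong, as quoted

correct = extracted - wrong        # 1490
precision = correct / extracted    # share of extracted resources that were kept

print(f"{correct} of {extracted} kept, precision = {precision:.1%}")
```

The remainder, 1490, is the same number that is quoted elsewhere in these reviews for the person resources extracted with spotlight.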

Typos:

* Section 4.1: "that states the use of URIS for things" -> "that states the use of URIs for things"
* Section 4.1: "domain administered by the project (lod.euscreeen.eu)" -> "domain administered by the project (lod.euscreen.eu)"
* Section 4.2: "(info from google anytics)." -> "(info from Google Analytics)."

Solicited review by Emanuele Della Valle:

The paper presents in a well-written and correctly structured format an important dataset of the European Commission. The dataset is rather small, but it is externally connected to DBpedia, and Geonames. The vocabulary is presented at a level of details that allows readers to issue SPARQL queries against the dataset. An example of SPARQL query that bridges DBpedia is illustrated.

I only have minor comments:
- General
- is a VoID description of the dataset available?
- can the authors elaborate a bit more on the licensing? Why not use an Open Data Commons license?
- Page 1 column 2
- The RDF version of the dataset will eventually be hosted by the EU, i.e. the actual dataset owners, themselves, which ensures a long time availability of the data. -> The RDF version of the dataset will eventually be hosted by the EU, i.e. the actual dataset owner, itself, which ensures a long time availability of the data.
- Page 2 columns 2
- the IRI "fts-o:cofinancingRate" runs into the margin
- Page 3
- Figure 1 is difficult to read. The authors may consider redrawing it by hand. To save space they may want to remove instances.
- Page 3 column 1
- & -> and
- Page 3 column 2
- copmile -> compile
- consider adding a link to JAXB
- Page 4 column 1
- in table 1 the word "Commitments" touches 28114

Fiction Literature as Linked Open Data - the BookSampo Dataset

Paper Title: 
Fiction Literature as Linked Open Data - the BookSampo Dataset
Authors: 
Eetu Mäkelä, Kaisa Hypén, Eero Hyvönen
Abstract: 
The BookSampo dataset provides information as linked data on fiction literature published in Finland going back to the 15th century, along with rich descriptions of both their content and context. The dataset contains data on nearly 400,000 subjects, including literary works, authors, book covers, reviews, awards, images, and movies, over 3 million triples in total. The data has been applied as the basis of the BookSampo portal in public use in Finland, and is aligned with the cross-domain cultural heritage contents and ontologies of CultureSampo, another in-use semantic portal. The data has been used to answer complex questions, such as what topics should one write about, if one wants to get a literary award (based on statistics). The metadata was transformed into RDF from legacy library databases, then enriched manually by dozens of librarians in a Web 2.0 fashion in Finnish public libraries, and is constantly updated at a rate of some 90,000 new triples monthly.
Submission type: 

5

Responsible editor: 
Decision/Status: 
Accept
Reviews: 

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

Revised resubmission after an "accept with minor revisions", then accepted with minor revisions, and finally accepted for publication. The first round reviews are beneath the second round reviews.

Solicited review by Oscar Corcho:

Following on from the comments in my review of the earlier version of this paper, the paper describes a relevant dataset that can be useful for a large community of users, which indicates that the work presented is suitable for this special issue.

From a methodological point of view, the design decisions on the vocabularies reused and the part of the vocabulary that is created for the publication of this dataset are good, although the paper still lacks some clear descriptions of the design decisions for URIs. It is worrying that the constraints on RDF Schema that are presented in the paper still hold, when the vocabulary could have been presented in OWL instead.

I tried the URLs provided for the SPARQL endpoints and dumps tonight and they do not work, by the way. This should be fixed, the dataset should be registered in a registry, etc.

Solicited review by Fabien Gandon:

I am satisfied with the answers to the reviews.

Solicited review by Aba-Sah Dadzie:

My two main concerns have been much better addressed in the revised submission. I'd recommend accept, minor comments below. With respect to the key review criteria for the call:

* Quality of the dataset
* Usefulness (or potential usefulness) of the dataset - in use and continuously updated (from multiple sources). Specific use cases are also described.

* Clarity and completeness of the descriptions - good amount of detail provided. The paper is well written and overall, fairly easy to follow.

* Name, URL, version date and number, licensing, availability, etc. - licensing information provided, with relevant information on which types apply to which versions of the data.

* Topic coverage, source for the data, purpose and method of creation and maintenance, reported usage etc.
* Metrics and statistics on external and internal connectivity, use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language expressivity, growth. - provided in good detail. While most linking is within the Finnish cultural heritage system, reuse of other standard ontologies and vocabularies provides points from which to link to other "external" data.

* Examples and critical discussion of typical knowledge modeling patterns used - improved compared to original submission. One of my main concerns was design that limited the structure of the dataset. The reasons for the decision and its effects have been more fully addressed. Options for addressing this are also presented. Improving linking to the LOD cloud is discussed in "future work".

* Known shortcomings of the dataset - addressed (see above)

Additional points to address

RDF Export URL (section 2) not reachable (07-08.11.2012)

Table 2 (Important classes in BookSampo along with their instance counts) is missing

"[section] 5. Uses Cases for the BookSampo Dataset" -> "5. Use Cases for the BookSampo Dataset"

First round reviews:

Solicited review by Oscar Corcho:

The paper describes a dataset about fiction literature from libraries in Finland, which is continuously updated when new books are added to the source data system. The paper does not provide much information about typical aspects that would be necessary to make some checks of the quality of the data inside (e.g., a sample URI for a sample item), but clearly describes the types of entities that it deals with and the use cases that can be run on the dataset.

The quality of the dataset is, as in many of these papers for this special issue, relative to the quality of the source dataset. Here a clear potential problem of co-reference resolution appears, but it is nicely solved in the approach presented, and additionally Web2.0-like annotation from experts is used in order to curate and add more information into the set. The vocabularies used for the RDF export are adequate and those that are normally used in the bibliographic domain, and the design decisions on which parts of those vocabularies to use (e.g., from FRBR) are appropriate.

The dataset is useful, from what can be inferred from the use cases that are presented, although its use may be limited mainly to Finland. However, that should not be a problem, obviously.

Finally, the dataset is quite complete, considering the sources that are being used, and it would be nice if links to external resources were added, or at least described, such as authority records for the authors, which would increase the value and completeness of the dataset. A discussion on this should be available in the revised version.

Solicited review by Fabien Gandon:

The paper presents the BookSampo dataset that provides linked data on fiction literature published in Finland.

The provided URLs were working at the time of writing this review.
Metrics and stats are provided for the internal content.
Interlinking with external schemas and datasets is mentioned but no statistics are provided.

The authors identify a list of shortcomings among which one is very disturbing: "the schema definitions in the dataset virtually violate RDFS semantics in one major aspect, due to the specifics of the SAHA editor used: properties may have multiple separate domain and range constraint statements, but this doesn't imply that the instances related by these properties are members of the intersection of domain/range classes, as required in the RDF Schema specification."
This could be avoided with additional (abstract) classes: for the domain and range, use a class defined as the superclass of the united classes, i.e., replace the union by dedicated superclasses.
Breaking the RDFS semantics is a very big problem for interoperability: if I load your data in my triple store, I will draw false conclusions.
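
To make the interoperability concern concrete, here is a minimal Python sketch of the rdfs:domain entailment rule applied to plain tuples. All class and property names (ex:hasName, ex:Novel, ex:Author, ex:NamedThing) are hypothetical and are not taken from the BookSampo schema; the sketch only illustrates why multiple domain statements are read as an intersection, and how a dedicated superclass avoids the false conclusion.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def infer_domain_types(triples):
    """Apply the RDFS domain rule: (p rdfs:domain C) and (s p o) entail (s rdf:type C)."""
    # Collect every declared domain class per property.
    domains = {}
    for s, p, o in triples:
        if p == RDFS_DOMAIN:
            domains.setdefault(s, set()).add(o)
    # Each subject using a property is inferred to belong to ALL its domain classes.
    inferred = set()
    for s, p, o in triples:
        for cls in domains.get(p, ()):
            inferred.add((s, RDF_TYPE, cls))
    return inferred


# Hypothetical schema with two separate domain statements, intended as a union.
multi_domain = [
    ("ex:hasName", RDFS_DOMAIN, "ex:Novel"),
    ("ex:hasName", RDFS_DOMAIN, "ex:Author"),
    ("ex:book1", "ex:hasName", "A Title"),
]

# Hypothetical repair: one abstract superclass of both classes as the domain.
superclass_domain = [
    ("ex:hasName", RDFS_DOMAIN, "ex:NamedThing"),
    ("ex:book1", "ex:hasName", "A Title"),
]

multi_result = infer_domain_types(multi_domain)
super_result = infer_domain_types(superclass_domain)
```

Under the first schema a standard RDFS reasoner concludes that ex:book1 is both a novel and an author; under the second it only concludes that ex:book1 is an ex:NamedThing.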

"Bringing events to the fore, the approach fractured and distributed the metadata of the original primary objects. For example, people wanted much more to see information on authors' birth and death dates and places as simply attribute-object values of the author, instead of as events where the author was involved in. The project thus changed back to a more traditional model, where data about times and places of occurrences are directly saved as author, not event attributes. In the case of representing degrees attained by authors, this did lead to some loss of data, since the flat attributes allowed only representation of multiple degrees without dates. However, the librarians deemed the simplicity to outweigh the costs in this situation."

This is surprising: you changed the conceptual model because of an interaction design issue. Why not design an interaction mechanism that bridges the two worlds? For instance in RDF the Fresnel initiative was introduced to decouple RDF models and RDF views.

Solicited review by Aba-Sah Dadzie:

The paper discusses design decisions taken in building the Finnish BookSampo linked dataset, with the use of Web 2.0 technology to index fiction literature such that it provides a rich resource for browsing and analysis not possible with traditional indexing.
The main data source was an RDF dump provided by the Helsinki metropolitan area library; new data and annotations are provided by a named company. The dataset is already in use, and continues to grow, due to annotation carried out by librarians. In addition to web services that support this annotation, support for browsing by ordinary users of the libraries is provided through a dedicated web portal.

The authors describe the domain-specific ontologies used, out of a subset of Finnish resources for describing cultural information (primarily the KOKO ontology), and also the links to other standard ontologies and resources (e.g., DBPedia and GeoNames to match to physical locations). They also discuss restrictions to their model that limit interconnectivity with other resources, due mainly to the need to simplify the model to suit the librarians who provide the annotations and other related information (e.g., awards). A few examples of use are given, including statistical analysis to derive information about the dataset itself and its use for other purposes such as grant sourcing based on subject area. Design to enable easy linking to Finnish cultural heritage resources is also highlighted.
Conflicts in licensing are discussed; the dataset does however appear to be largely accessible for analysis and other (re)use.

DETAILED REVIEW

"In the case of representing degrees attained by authors ..." (p.4) - does "degrees" refer to academic degrees? If so this should be explicitly stated - the word is ambiguous.
I am a bit puzzled as to the justification for using a flat structure for storing data about dates, with the specific example of degrees. Based on the authors' description I (safely?) assume the librarians do not manually edit the backing ontologies - would it not have been possible to set up this structure to capture the date information as well and provide a more usable interface to support simpler provision of the information? Along with additional training as described for the example of annotating a part or a series of a book (p.5-6)? Alternatively or additionally, automatic methods could be used to (attempt to) retrieve this information by making use of other related author attributes, e.g., timestamped information about attendance at relevant institutions - information which IS also captured.

The paper ends suddenly. A brief conclusion with a discussion of future work is necessary; the discussion within the paper (mostly in section 4) covers specific design decisions and the data model, but does not identify any open issues and/or plans to revisit those areas the authors acknowledge to be less than optimal.

==============================

Figures & Tables

Convention places table captions at the top. Also, because the tables in the paper do not have a line at the bottom, the text in the bottom cells runs into the captions, making reading more difficult.

Citation & Bibliography

[1] is a non-English citation. While it may well be appropriate it requires at least a translation of its title into English to give some indication of its relevance - simply because THIS article is written in English, which is, fortunately or not, the lingua franca when it comes to scientific articles. (see, e.g., http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2570362)
Further, the claim it is meant to support appears to be the main justification for the creation of this linked dataset.

(p.2) - The "Getty AAT thesaurus" is not cited - at least a URL must be provided.

Language & Presentation

(p.3) - "so the user should be aware of a few conventions, or lacks thereof" -> "so the user should be aware of a few conventions, or lack[no 's'] thereof"

(p.4) - There is a weird split from the paragraph at the top to the next - it would make for better reading to move the 1st sentence in para2 to the end of the previous one - "The project thus changed back to a more traditional model, where data about times and places of occurrences are directly saved as author, not event attributes."

A Curated and Evolving Linguistic Linked Dataset

Paper Title: 
A Curated and Evolving Linguistic Linked Dataset
Authors: 
Emanuele Di Buccio, Giorgio Maria Di Nunzio, Gianmaria Silvello
Abstract: 
This paper describes the Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt) linguistic linked dataset. ASIt is a scientific project aiming to account for minimally different variants within a sample of closely related languages; it is part of the Edisyn network, the goal of which is to establish a European network of researchers in the area of language syntax that use similar standards with respect to methodology of data collection, data storage and annotation, data retrieval and cartography. In this context, ASIt is defined as a curated database which builds on dialectal data gathered during a twenty-year-long survey investigating the distribution of several grammatical phenomena across the dialects of Italy. Both the ASIt linguistic linked dataset and the Resource Description Framework Schema (RDF/S) on which it is based are publicly available and released with a Creative Commons license (CC BY-NC-SA 3.0). We report the characteristics of the data exposed by ASIt, the statistics about evolution of the data in the last two years, and the possible usages of the dataset, such as the generation of linguistic maps.
Submission type: 

5

Responsible editor: 
Decision/Status: 
Accept
Reviews: 

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

Revised submission, now accepted, after a "reject and resubmit" and a subsequent "accepted with minor revisions". Reviews of the first round are beneath the second round reviews, which are beneath the third round review.

Solicited review by Jesse Weaver:

My previous concern about the dereference behavior of /terms URIs has been sufficiently addressed. These URIs now 303 to the ontology document, which aligns with the current resolution of httpRange-14. The explanation of this behavior is also implied by the statement that publication of the Linked Data follows the guidelines of "Linked Data: Evolving the Web into a Global Data Space" by Heath and Bizer.

The URIs previously mentioned in table 1 have been corrected except for geo. As previously stated, the geo namespace URI should end with a '#' instead of a '/'. (Namespace URIs can be validated by checking prefix.cc, for example, http://prefix.cc/geo .) Other than that, the publication seems ready for acceptance.

Second round reviews:

Solicited review by Jesse Weaver:

My previous concern about the dereference behavior of the URIs has been addressed by the additional discussion at the end of section 3. This discussion is nearly satisfactory, agreeing with perusal of the data. However, when dereferencing ontology terms, like http://purl.org/asit/terms/Province , these terms 302 redirect to RDF/XML documents. Personally, I am not so strict as to require compliance with the current resolution of httpRange-14 (303 redirection for slash URIs) since there still seems to exist some debate on the matter, but if the behavior does not comply with httpRange-14, expectations must be managed. The paper addresses resource/ data/ and page/ URIs, but not terms/ URIs. In addition, the former URIs comply with httpRange-14 while the latter do not. Thus, there appears to be an inconsistency, which at the very least needs to be discussed and justified in the paper. Aside from this issue, the new URI design vastly improves the technical quality of the dataset, and the added discussion is a much welcomed addition to the paper.
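
The inconsistency described above can be sketched as a small Python check on the first HTTP status code returned for each URI pattern. The status codes in `observed` are illustrative: 302 for terms/ is the behaviour reported in this review, while 303 for resource/ is an assumption based on the statement that those URIs comply with httpRange-14. Fetching the actual status codes is left out, since it depends on the live server.

```python
def classify_first_response(status_code):
    """Classify the first HTTP response for a slash URI under the httpRange-14 resolution."""
    if status_code == 303:
        # 303 See Other: the URI may identify any resource, including a non-information one.
        return "see-other"
    if status_code in (301, 302, 307, 308):
        # A redirect, but not the one the resolution asks for.
        return "other-redirect"
    if 200 <= status_code <= 299:
        # A 2xx response means the URI identifies an information resource.
        return "information-resource"
    return "unclassified"


def has_inconsistent_behavior(observed):
    """Return True if the URI patterns do not all share the same classification."""
    return len({classify_first_response(code) for code in observed.values()}) > 1


# Illustrative status codes per URI pattern (see the assumptions stated above).
observed = {"resource/": 303, "terms/": 302}

terms_class = classify_first_response(observed["terms/"])
inconsistent = has_inconsistent_behavior(observed)
```

A check of this kind makes it easy to state in the paper which URI patterns follow the 303 convention and which do not.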

In Table 1, the gn namespace URI should be ended with a '#', that is, altogether, http://www.geonames.org/ontology# . The geo, owl, rdf, and rdfs namespace URIs should be ended with '#' instead of '/' . (These appear to be correct in the actual data, just not in the paper.)
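
A table of prefixes can be checked mechanically against the expected namespace URIs, as in the following Python sketch. The gn URI is the one given in this review; the owl, rdf, and rdfs URIs are the standard W3C namespaces; mapping geo to the W3C WGS84 vocabulary is an assumption. The `declared` values are a hypothetical reconstruction of a table with wrong terminators, not a copy of the paper's Table 1.

```python
EXPECTED_NAMESPACES = {
    "gn": "http://www.geonames.org/ontology#",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",  # assumed meaning of "geo"
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def namespace_errors(declared):
    """Map each mis-declared prefix to the namespace URI that was expected."""
    errors = {}
    for prefix, uri in declared.items():
        expected = EXPECTED_NAMESPACES.get(prefix)
        if expected is not None and uri != expected:
            errors[prefix] = expected
    return errors


# Hypothetical declarations: gn lacks the '#', owl ends with '/', rdfs is correct.
declared = {
    "gn": "http://www.geonames.org/ontology",
    "owl": "http://www.w3.org/2002/07/owl/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

errors = namespace_errors(declared)
```

The terminator matters because a term URI is formed by plain concatenation of namespace and local name, so a '/' in place of '#' yields a different URI for every term.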

The paper also needs to be revised for minor grammar errors. Additionally, the right column of the first page seems oddly formatted. [12] in the bibliography has two commas in a row. [13] has a title with two colons (at books.google.com, it seemed the appropriate title was "Language and Space: Language Mapping").

Solicited review by Marta Sabou:

I am satisfied with the way in which the authors have addressed my comments and recommend accepting the paper as is.

Solicited review by Ivan Herman:

The authors have answered my earlier comments in a satisfactory manner. As a result, I have increased the ratings and I am happy to see the paper published in the journal.

First round reviews:

Solicited review by Ivan Herman:

My biggest problem with the presented dataset is that I miss an explanation why this exercise is worthwhile. Of course, we all have the goal of having more and more open data available as Linked Data, but I did not understand the motivation of converting this particular data. The usage descriptions touched upon in section 5 are (besides being speculative at this point) all related to the particular usage of linguistic diversity which does not seem to refer to the extra possibilities offered by being linked to outside datasets at all; in other words, all those applications could be realized through any other type of data storage and publication mechanism. To summarize: how would applications benefit from the data in this format? What does linked data bring as a plus to this particular field?

The workflow between curation of the data and the final linked dataset is unclear. How faithfully does the LOD version of the dataset reflect the current status of curation? Is it a regular dump of the data? How frequent? I.e., if I rely on the RDF version, how up-to-date is that data?

Quality of the dataset: good
Usefulness (or potential usefulness) of the dataset: questionable
Clarity and completeness of the descriptions: good

Solicited review by Marta Sabou:

I organize my review according to the criteria of the Special Call for Linked Dataset descriptions

Quality of the dataset
Low. In itself, the linguistic dataset is very interesting, especially from a linguistic perspective. However, the exposure of this dataset as LOD is still in an initial stage and it amounts to making the entire RDF/s dataset available for download as a single file. The dataset has not been linked to other LOD datasets and there is no SPARQL endpoint for querying it. So at this stage I would consider this dataset as being a Semantic Web dataset, but more work needs to be done to expose it properly as a LOD dataset.

Usefulness (or potential usefulness) of the dataset
Low. While academically very interesting, this dataset of information and sample texts for Italian dialects will probably only be of interest to a niche segment, most probably in the linguistics area. However, innovatively linking this dataset to other sources might further increase its usefulness.

Clarity and completeness of the descriptions
Medium. The paper is easy to read, however, many of the details in Section 2 and 3 have a low relevance to the topic of the paper.

Solicited review by Jesse Weaver:

This article describes an RDF version of the Syntactic Atlas of Italy (ASIt) linguistic curated database containing data about dialects, sentences, words, translators, etc. associated with translation questionnaires. The usefulness of the content of the data seems sufficient, and the written description (aside from grammar errors) with associated figures and tables is excellent. However, there are major issues regarding quality of the RDF dataset as Linked Data.

The fundamental characteristic of Linked Data (in the Tim Berners-Lee sense) is that "when you have some of it, you can find other, related, data." [ http://www.w3.org/DesignIssues/LinkedData.html ]. In practice, this means that the URIs used to identify things should dereference in an appropriate manner to data about the thing identified by the URI. The authors do not discuss in what manner the URIs in their dataset should be dereferenced. Upon inspection, dereferencing URIs in the RDF dataset (e.g., http://purl.org/asit/Town/Ronago , http://purl.org/asit/Sentence/54151 ) using HTTP appears to result in 404s. If that is the general dereference behavior of the URIs in the dataset, then the dataset does not constitute Linked Data. Therefore, the usefulness and quality of the dataset (as Linked Data) is nullified.

This is unfortunate because, otherwise, the dataset appears interesting and the article well-organized. If the authors were to include some description regarding dereferencing of URIs in a manner that constitutes Linked Data, then this article could be considered for acceptance as a Linked Dataset description.

The Digital Agenda Scoreboard: An Statistical Anatomy of Europe’s way into the Information Age

Paper Title: 
The Digital Agenda Scoreboard: An Statistical Anatomy of Europe’s way into the Information Age
Authors: 
Michael Martin, Bert van Nuffelen, Stefano Abruzzini, Sören Auer
Abstract: 
Evidence-based policy is policy informed by rigorously established objective evidence. An important aspect of evidence-based policy is the use of scientifically rigorous studies to identify programs and practices capable of improving policy relevant outcomes. Statistics represent a crucial means to determine whether progress is made towards policy targets. In May 2010, the European Commission adopted the Digital Agenda for Europe, a strategy to take advantage of the potential offered by the rapid progress of digital technologies. The Digital Agenda contains commitments to undertake a number of specific policy actions intended to stimulate a circle of investment in and usage of digital technologies. It identifies 13 key performance targets. In order to chart the progress of both the announced policy actions and the key performance targets a scoreboard is published, thus allowing the monitoring and benchmarking of the main developments of information society in European countries. In addition to these human-readable browsing, visualization and exploration methods, machine-readable access facilitating re-usage and interlinking of the underlying data is provided by means of RDF and Linked Open Data. We sketch the transformation process from raw data up to rich, interlinked RDF, describe its publishing and the lessons learned.
Submission type: 

5

Responsible editor: 
Decision/Status: 
Reject and Resubmit
Reviews: 

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

Solicited review by Aidan Hogan:

The paper describes ongoing efforts to export and publish statistical information from the European Commission's Digital Agenda as Linked Data. The Digital Agenda consists of a set of thirteen ICT-related goals that the EC hopes member states can reach before 2020. To track progress, a Digital Agenda Scoreboard was initiated to collate statistics on these goals from different sources across various years and member states. The scoreboard consists of 108 indicators divided into 12 groups, with a total of 17,646 observations currently available for each indicator. Sources of information include eGov. reports, Eurostat, EU FP7 ICT project information, etc.

Primarily, the Linked Data representation of the data relies on the (existing) Data-Cube vocabulary, designed for representing statistical information (itself using parts of SKOS, DC, FOAF, SCOVO, etc.). Raw data are taken from spreadsheets and RDFised following the Data-Cube vocabulary. URIs are generated by appending various uniquely-identifying information onto a local namespace. Legacy URIs for countries and so forth are re-used. Provenance metadata are also attached. Links are generated to existing EUROSTAT exports using the SILK linking framework, and published under a separate namespace. A total of 33,465 links are generated, mostly using "skos:RelatedTo". A few owl:sameAs relations are provided for various countries and years.

In general, the work seems interesting and the data sound useful to have published as Linked Data, particularly in the context of other EUROSTAT exports. The high-level methodology, use of existing vocabularies and provision of external links seem sound. However, the dataset and the description do have some significant problems.

The provided description of the dataset is adequate, giving a good overview of the origin, purpose and nature of the data. However, presentation could be improved significantly: various typos and poorly constructed sentences are found throughout the text (please give a thorough proof-read). The figures are sometimes too small to be legible. Since 4-6 pages is only a guide, I would suggest the authors take another page and expand the figures. Otherwise, they may consider shortening the related work section.

In general, I feel that the paper misses some important discussion:
* The authors should clarify if the export is a first-party or third-party effort. It seems that RDF is made available through the public UIs, but the data is seemingly hosted on a lod2.eu namespace.
* What applications or use-cases could be envisaged for the data? Part of the evaluation of datasets for this special issue relates to their usefulness. Unfortunately, this is not directly addressed by the authors. What benefits does the RDFisation and Linked Data export bring to the table? I do believe that the dataset is potentially useful, but the authors should address this themselves directly.

Aside from high-level issues, I'm concerned by a number of other factors. For this special issue, practical details play a major role. Unfortunately, I sometimes find a worrying lack of attention to detail for this dataset/description.

First off, the dataset itself is being evaluated, not just the description. However, at the time of reviewing, I could not access the dataset on the Web. URLs under the main namespace were returning an error code:

"""
OntoWiki Error

Zend_Controller_Dispatcher_Exception: Invalid controller specified (scoreboard)
/var/www/data.lod2.eu/libraries/Zend/Controller/Dispatcher/Standard.php@242 (0)
"""

URLs I tried:

http://data.lod2.eu/scoreboard/links/
http://data.lod2.eu/scoreboard/items/
http://data.lod2.eu/scoreboard/indicators/
http://data.lod2.eu/scoreboard/indicators/i_igovrt_IND_TOTAL_ind
http://data.lod2.eu/scoreboard/properties/

Perhaps this is only a temporary issue, but it is discouraging. I did find the dump through -thedatahub- URL linked, but a dump is not a Linked Data site.

The description itself also contains some errors relating to the data produced.
* Listing 2: the property dcterms:publisher is used with a literal value. However, dcterms defines this property to have the rdfs:range dcterms:Agent [1].
* Section 2.3: There is no owl:ontology class. (owl:Ontology?)
* Section 2.4: There is no skos:RelatedTo property. (skos:related? [2])

Undoubtedly these are minor errors, but they have a significant influence on the quality of the dataset, and do not inspire confidence. (On a side note, I would also suggest to use language tags for labels.)
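
Term-level errors of the kind listed above can be caught before publication with a simple lookup against the terms each vocabulary actually defines. The Python sketch below uses a small, hand-picked excerpt of real terms from OWL, SKOS, and DCMI Terms; the excerpt is deliberately incomplete, and the function name and input list are hypothetical.

```python
# Hand-picked excerpt of terms defined by each vocabulary (not exhaustive).
KNOWN_TERMS = {
    "owl": {"Ontology", "Class", "sameAs"},
    "skos": {"Concept", "related", "broader", "narrower"},
    "dcterms": {"publisher", "Agent"},
}


def unknown_terms(curies):
    """Return the prefixed names whose local part is not in the known excerpt.

    The comparison is case-sensitive, as URIs are.
    """
    unknown = []
    for curie in curies:
        prefix, _, local = curie.partition(":")
        if local not in KNOWN_TERMS.get(prefix, set()):
            unknown.append(curie)
    return unknown


# Terms as reported in the review, next to their likely intended forms.
used = ["owl:ontology", "owl:Ontology", "skos:RelatedTo", "skos:related"]
flagged = unknown_terms(used)
```

In practice the excerpt would be replaced by the full term lists loaded from the published vocabulary documents, which is left out here to keep the sketch self-contained.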

So in general, although I believe the work to be of significant value, the description and the dataset have some significant shortcomings. In particular, the following issues should be addressed: (i) the lack of argumentation as to why making these statistics available as Linked Data is a good thing, (ii) the inaccessibility of the data, and (iii) various [admittedly minor] practical issues with the data.

[1] http://dublincore.org/documents/dcmi-terms/#terms-publisher
[2] http://www.w3.org/TR/2009/REC-skos-reference-20090818/#L2255

Solicited review by Oscar Corcho:

This paper describes the RDF-enabled dataset associated with the digital agenda scoreboard, which is one statistical dataset generated by one European institution with a set of indicators for each year and country of the European Union.

The paper follows the usual methodology for publishing data as Linked Data, from the description of the data sources, to the definition of the transformation process, to the selection of URIs and vocabularies, to the transformation itself, the generation of links and the publication of the dataset. In this context, there are no major innovations, but as this special issue is focused really on the datasets and not on innovations in the transformation and publication process, this would be perfectly ok for publication in that sense.

The call for papers puts an emphasis on three major topics: quality of the dataset, usefulness (or potential usefulness) of the dataset and clarity and completeness. My following set of comments is related to this.

With respect to the quality of the dataset, it is clear that the systematic transformation process that has been followed and the origin of the data (which comes from a well-curated source) ensure that the dataset will be generated with a high degree of quality. The dataset is small, given that it is based not on core data used to generate the indicators, but on the indicators as generated by the institution (it would have been nice to see a discussion of what added benefit could be generated for the evidence-based approach presented by the authors in also generating some of the micro-data used to generate indicators). The selection of the DataCube vocabulary is an obvious choice, and the selection of URIs seems quite sensible, although an important aspect that has to be considered here is the fact that trying the URIs that were given in footnotes 3 and 4 provided me with an error from OntoWiki, something that I would like to alert the authors about, and that motivates my request to resubmit. In fact, listing 1 suggests that the URIs are generated with "http://data.lod2.eu/scoreboard/obs/xxx" and then the beginning of page 4 says that they are generated as "/items/xxx". These contradictions should be resolved in order to consider this dataset of good quality.

Another aspect that is important in this context is the absence of links, which limits the usefulness of the dataset for allowing other forms of exploration not initially foreseen, and also limits the quality of the dataset. An example is provided of how links could be generated, which is quite obvious, but it is not clear what the process will be to generate those outgoing links or, the opposite, incoming links to the data, which are equally important and relevant. The usefulness of the dataset, that said, is heavily related to the usefulness of a scoreboard with these indicators. It will not be a source used every day, but it seems useful anyway.

As for the descriptions themselves, it is not fully clear why the authors talk at the beginning about spreadsheets, then say that data is in a relational database, then say that the application is PHP-based on top of a relational database, instead of a triple store. This is quite unclear and should be better explained to make it more compact and to add clarity about the structure and architecture of the application that has been developed.

Finally, I would suggest removing the related work, or shortening it. The transformation tools are not directly applicable here (I am surprised that tools for RDB2RDF are considered there, rather than tools to deal with spreadsheets), and the discussion on statistical datasets and on Government data in general is quite obvious, but probably not too necessary for this paper.

Finally, as a small point, the title is too verbose, and the authors should avoid repeating the whole first paragraph of the introduction as it is written in the abstract. These are minor points.

Solicited review by Axel Polleres:

The paper "The Digital Agenda Scoreboard" describes the Digital Agenda dataset and the corresponding visualization tool Digital Agenda Scoreboard (DAS). The linked dataset mainly uses an extended version of the DataCube vocabulary to encode the statistical source data which describe different parameters of EU countries.
The authors describe the Digital Agenda Scoreboard, its motivation and key functionalities.

One point of criticism is that the paper – in light of the call – has too much focus on the digital agenda scoreboard (DAS) tool and the transformation, rather than on the description of the dataset as such. So, it reads more like a paper describing the scoreboard than the linked dataset.

The paper is evaluated along the following three dimensions.

Quality of the dataset
The RDF dataset is published on the web. The authors describe the export of the whole dataset but miss giving a URL to download the data.

The dataset uses vocabularies (RDFS, DC, OWL, DOAP, CC) although I could not find the described CC triples in the actual dataset.

The sources of the data are described well.
The authors are extending the DataCube ontology for encoding the data.

Provenance: Each indicator links to its source via DC terms.

The ontowiki links are not working (footnotes 3,4,6).

Linkage: I could not find the linking dataset, so I was unable to verify the linkage to other datasets.
Usefulness of the dataset
So far the dataset is only used by the DAS. The authors do not describe any other (external) users. However, public data from the European Commission seems highly usable for all kinds of Linked Data applications. It would be nice if the authors could at least sketch some ideas for further usage; otherwise (i.e. if the dataset only serves the DAS tool), it is not really justified why the integration benefits from Linked Data.
Clarity and completeness of the descriptions
The descriptions focus too much on the DAS at the expense of the actual Linked dataset. Section 3 is a good example of this misplaced focus. I would suggest describing the publishing details in Section 3 and removing all references to the DAS. It is also strange that Section 2.3 (Transformation workflow) starts with a description of how to generate charts and the HTML output.

Section 2.2 Good description; What does "das:breakdown" mean and what is a "das:variable"?

The actual transformation is not described extensively enough in Section 2.3. The authors only implicitly mention that the data extracted from Excel is then stored in a relational database. How is this transformation done? The authors describe how the DAS output is generated, but a description of the actual RDF generation (triplification), which I think would be more in scope for the particular call, is not given.

Section 2.4 Interlinks is too short and confusing. How exactly is the LATC project related?

The prefixes are not explained.

What is a breakdown?

What is a das:variable?
Formatting

Links in footnotes 1,7,10 are duplicates. Consider using only a single footnote or moving to references.

Figure 1 and especially Figure 2 are hard to read on a printout: the font is small and pixelated, and the borders are fading.

The line numbers for listings are not needed.

Removing the Id column from Table 2 might save some space.

The sentence describing Listing 1 lacks a verb.
