A Curated and Evolving Linguistic Linked Dataset

Emanuele Di Buccio, Giorgio Maria Di Nunzio, Gianmaria Silvello
This paper describes the Atlante Sintattico d’Italia, Syntactic Atlas of Italy (ASIt) linguistic linked dataset. ASIt is a scientific project aiming to account for minimally different variants within a sample of closely related languages; it is part of the Edisyn network, the goal of which is to establish a European network of researchers in the area of language syntax who use similar standards with respect to methodology of data collection, data storage and annotation, data retrieval and cartography. In this context, ASIt is defined as a curated database which builds on dialectal data gathered during a twenty-year-long survey investigating the distribution of several grammatical phenomena across the dialects of Italy. Both the ASIt linguistic linked dataset and the Resource Description Framework Schema (RDF/S) on which it is based are publicly available and released with a Creative Commons license (CC BY-NC-SA 3.0). We report the characteristics of the data exposed by ASIt, the statistics about the evolution of the data in the last two years, and the possible usages of the dataset, such as the generation of linguistic maps.
Submission type: 
Dataset Description

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

Revised submission, now accepted, after a "reject and resubmit" and a subsequent "accepted with minor revisions". Reviews of the first round are beneath the second round reviews, which are beneath the third round review.

Solicited review by Jesse Weaver:

My previous concern about the dereference behavior of /terms URIs has been sufficiently addressed. These URIs now 303 to the ontology document, which aligns with the current resolution of httpRange-14. The explanation of this behavior is also implied by the statement that publication of the Linked Data follows the guidelines of "Linked Data: Evolving the Web into a Global Data Space" by Heath and Bizer.

The URIs previously mentioned in table 1 have been corrected except for geo. As previously stated, the geo namespace URI should end with a '#' instead of a '/'. (Namespace URIs can be validated by checking prefix.cc, for example, http://prefix.cc/geo .) Other than that, the publication seems ready for acceptance.

Second round reviews:

Solicited review by Jesse Weaver:

My previous concern about the dereference behavior of the URIs has been addressed by the additional discussion at the end of section 3. This discussion is nearly satisfactory and agrees with my perusal of the data. However, when dereferenced, ontology terms like http://purl.org/asit/terms/Province 302 redirect to RDF/XML documents. Personally, I am not so strict as to require compliance with the current resolution of httpRange-14 (303 redirection for slash URIs), since there still seems to exist some debate on the matter, but if the behavior does not comply with httpRange-14, expectations must be managed. The paper addresses resource/, data/, and page/ URIs, but not terms/ URIs. In addition, the former URIs comply with httpRange-14 while the latter do not. Thus, there appears to be an inconsistency, which at the very least needs to be discussed and justified in the paper. Aside from this issue, the new URI design vastly improves the technical quality of the dataset, and the added discussion is a very welcome addition to the paper.
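
The 302 versus 303 distinction raised above can be checked mechanically. The following is a minimal sketch in Python, not the reviewer's actual procedure: the function names `classify_redirect` and `fetch_status` are hypothetical, and the wording of the classifications is an assumption about how one might summarize the httpRange-14 resolution.

```python
import http.client
from urllib.parse import urlsplit


def classify_redirect(status):
    """Describe an HTTP status for a slash URI that names a non-information resource."""
    if status == 303:
        return "303 See Other: consistent with the httpRange-14 resolution"
    if status in (301, 302, 307, 308):
        return f"{status} redirect: resolves, but not the 303 the resolution calls for"
    if 200 <= status < 300:
        return f"{status}: served directly, which implies an information resource"
    return f"{status}: does not dereference"


def fetch_status(uri, accept="application/rdf+xml"):
    """Issue one GET without following redirects; return (status, Location header)."""
    parts = urlsplit(uri)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        # http.client never follows redirects, so the first status is what we see.
        conn.request("GET", parts.path or "/", headers={"Accept": accept})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


# Example use (requires network access; the result depends on the live server):
# status, location = fetch_status("http://purl.org/asit/terms/Province")
# print(classify_redirect(status), location)
```

Using `http.client` rather than a higher-level client is deliberate here, since it exposes the first response instead of silently following the redirect chain.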

In Table 1, the gn namespace URI should end with a '#', that is, altogether, http://www.geonames.org/ontology# . The geo, owl, rdf, and rdfs namespace URIs should end with '#' instead of '/' . (These appear to be correct in the actual data, just not in the paper.)
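
A check of this kind is easy to automate. The sketch below, in Python, compares declared prefixes against the commonly registered namespace URIs for the five prefixes named above; the function name `find_namespace_mismatches` and the sample `declared` input are hypothetical and do not reproduce the paper's actual Table 1.

```python
# Commonly registered namespace URIs for the prefixes in question; each ends with '#'.
EXPECTED_NAMESPACES = {
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "gn": "http://www.geonames.org/ontology#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def find_namespace_mismatches(declared):
    """Return {prefix: (declared_uri, expected_uri)} for known prefixes that differ."""
    mismatches = {}
    for prefix, uri in declared.items():
        expected = EXPECTED_NAMESPACES.get(prefix)
        # Prefixes without a known expectation are skipped rather than flagged.
        if expected is not None and uri != expected:
            mismatches[prefix] = (uri, expected)
    return mismatches


# Hypothetical declarations: 'geo' ends with '/' instead of '#'.
declared = {
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos/",
    "owl": "http://www.w3.org/2002/07/owl#",
}
mismatches = find_namespace_mismatches(declared)
```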

The paper also needs to be revised for minor grammar errors. Additionally, the right column of the first page seems oddly formatted. [12] in the bibliography has two commas in a row. [13] has a title with two colons (at books.google.com, it seemed the appropriate title was "Language and Space: Language Mapping").

Solicited review by Marta Sabou:

I am satisfied with the way in which the authors have addressed my comments and recommend accepting the paper as is.

Solicited review by Ivan Herman:

The authors have answered my earlier comments in a satisfactory manner. As a result, I have increased the ratings and I am happy to see the paper published in the journal.

First round reviews:

Solicited review by Ivan Herman:

My biggest problem with the presented dataset is that I miss an explanation of why this exercise is worthwhile. Of course, we all have the goal of having more and more open data available as Linked Data, but I did not understand the motivation for converting this particular data. The usage descriptions touched upon in section 5 are (besides being speculative at this point) all related to the particular usage of linguistic diversity, which does not seem to rely at all on the extra possibilities offered by being linked to outside datasets; in other words, all those applications could be realized through any other type of data storage and publication mechanism. To summarize: how would applications benefit from the data in this format? What does linked data bring as a plus to this particular field?

The workflow between curation of the data and the final linked dataset is unclear. How faithfully does the LOD version of the dataset reflect the current status of curation? Is it a regular dump of the data? How frequent? I.e., if I rely on the RDF version, how up-to-date is that data?

Quality of the dataset: good
Usefulness (or potential usefulness) of the dataset: questionable
Clarity and completeness of the descriptions: good

Solicited review by Marta Sabou:

I organize my review according to the criteria of the Special Call for Linked Dataset descriptions.

Quality of the dataset
Low. In itself, the linguistic dataset is very interesting, especially from a linguistic perspective. However, the exposure of this dataset as LOD is still in an initial stage and it amounts to making the entire RDF/S dataset available for download as a single file. The dataset has not been linked to other LOD datasets and there is no SPARQL endpoint for querying it. So at this stage I would consider this dataset as being a Semantic Web dataset, but more work needs to be done to expose it properly as a LOD dataset.

Usefulness (or potential usefulness) of the dataset
Low. While academically very interesting, this dataset of information and sample texts for Italian dialects will probably only be of interest to a niche segment, most probably in the linguistics area. However, innovatively linking this dataset to other sources might further increase its usefulness.

Clarity and completeness of the descriptions
Medium. The paper is easy to read; however, many of the details in Sections 2 and 3 have low relevance to the topic of the paper.

Solicited review by Jesse Weaver:

This article describes an RDF version of the Syntactic Atlas of Italy (ASIt) linguistic curated database containing data about dialects, sentences, words, translators, etc. associated with translation questionnaires. The usefulness of the content of the data seems sufficient, and the written description (aside from grammar errors) with associated figures and tables is excellent. However, there are major issues regarding the quality of the RDF dataset as Linked Data.

The fundamental characteristic of Linked Data (in the Tim Berners-Lee sense) is that "when you have some of it, you can find other, related, data." [ http://www.w3.org/DesignIssues/LinkedData.html ]. In practice, this means that the URIs used to identify things should dereference in an appropriate manner to data about the thing identified by the URI. The authors do not discuss in what manner the URIs in their dataset should be dereferenced. Upon inspection, dereferencing URIs in the RDF dataset (e.g., http://purl.org/asit/Town/Ronago , http://purl.org/asit/Sentence/54151 ) using HTTP appears to result in 404s. If that is the general dereference behavior of the URIs in the dataset, then the dataset does not constitute Linked Data. Therefore, the usefulness and quality of the dataset (as Linked Data) are nullified.
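
As an illustration of the kind of inspection described above, the following Python sketch requests a URI with an RDF `Accept` header and judges whether the final response looks like data about the resource. It is a rough heuristic under stated assumptions, not the procedure used in this review: the names `looks_like_linked_data` and `probe` are hypothetical, and the list of media types is an assumption that ignores, for example, HTML with embedded RDFa.

```python
import urllib.error
import urllib.request

# Media types treated as RDF for this sketch (an assumption, not an exhaustive list).
RDF_MEDIA_TYPES = (
    "application/rdf+xml",
    "text/turtle",
    "application/n-triples",
    "application/ld+json",
)


def looks_like_linked_data(status, content_type):
    """True if the final response is a 200 carrying one of the RDF media types."""
    if status != 200 or not content_type:
        return False
    # Drop parameters such as '; charset=utf-8' before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in RDF_MEDIA_TYPES


def probe(uri):
    """Follow redirects and return (final status, Content-Type) for the URI."""
    request = urllib.request.Request(
        uri, headers={"Accept": ", ".join(RDF_MEDIA_TYPES)}
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.headers.get("Content-Type")
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses; report the code instead of failing.
        return error.code, error.headers.get("Content-Type")


# Example use (requires network access; the result depends on the live server):
# status, content_type = probe("http://purl.org/asit/Town/Ronago")
# print(status, looks_like_linked_data(status, content_type))
```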

This is unfortunate because, otherwise, the dataset appears interesting and the article well-organized. If the authors were to include some description regarding dereferencing of URIs in a manner that constitutes Linked Data, then this article could be considered for acceptance as a Linked Dataset description.