Review Comment:
The IULA's META-SHARE LOD dataset, lessons learnt
Marta Villegas & Núria Bel
Assessment
----------
I suggest acceptance with minor revision (as detailed below).
It is a reasonable data set description in terms of
(1) Quality and stability of the dataset
The original data set is known in the NLP community as the result of a large-scale resource survey and thus provides state-of-the-art data quality. Quality improvements over the original data are a main point of the paper. It is further developed by the LD4LT W3C CG, so we can expect stability and maintenance out of this community effort. These aspects are also mentioned in the paper, but could be emphasized more elaborately.
(2) Usefulness of the dataset, which should be shown by corresponding third-party uses
Usefulness of the data in its new representation is illustrated by the Catalog Browser. The original data has a history of third-party usage, which will probably continue with the LOD version.
(3) Clarity and completeness of the descriptions
The description of the LOD conversion is concise and clear. The conversion, however, refers to the original XSD schema, where some design decisions (global vs. local elements) remain unclear.
Review
------
The authors describe the conversion of the META-SHARE data model from XSD to OWL, and of the XML content of the IULA dataset to instances thereof. META-SHARE provides a network of repositories of language resources, including both language data and language tools, described through a set of metadata.
They elaborate on specific details of the RDFication process, most notably the treatment of RDF enumerations (§2.1) and the conversion of certain free text entries to structured data (§2.2). This effort is motivated by the need for a more user-friendly and less technical formalism (p.2), which the authors provide through an RDF-based catalog. This catalog is one primary result of their work and the subject of this data set description, and in its current instantiation, it is the basis for discussing META-SHARE content within the Linked Data for Language Technology (LD4LT) Community Group.
One benefit of this representation is the linking of different data sets, which facilitates cleaning and distributed use of data sets. Another is that queries become possible over relations that "were hidden" in the original XML data [mentioned in the abstract, reference to example? -- this pertains to "unstructured" free-text content, I presume]. Both aspects are demonstrated in end-user oriented use cases in §4.
Usability is demonstrated through an RDF-based catalog browser (§1) which manages, disseminates and grants access to language resource metadata. Creating an LOD version of META-SHARE data does not immediately lower the entrance barrier for *non-technical* users. It would be more convincing to reverse the presentation of LOD pros on p.2, §1, top left: first emphasize the way that "LOD eases data enrichment", and then suggest that this may also provide a usability benefit over the original implementation (as hinted at in "MS nodes target language technology professionals and become too technical for a potentially wider community of users."). A nice feature of RDF is data correction/checking via SPARQL and SPARQL Update (§2.2, §2.3) -- can you identify advantages of this approach over, say, XUF (or whatever update formalism was used in the original implementation)? In §2.3, the notion of a local XML element should be explained in relation to MS: In the original XSD, this was a design decision (which seems to be inherited by the LOD version), but the motivations behind it are unclear.
Data and tool availability are discussed in §3, and the functionality of the catalog browser is illustrated in §4. I'm not fully convinced that SPARQL queries do indeed improve accessibility of the data to non-technical users, but the faceted browsing functionality probably does. Unless something like this was already available in the old implementation, the authors may consider emphasizing the linking aspects rather than usability.
Throughout the paper, design decisions and conversion steps are generally well explained, and the obligatory statistics about the data set are provided with a sufficient level of detail. The text is readable and generally well-structured. Stylistically, however, its descriptions are occasionally repetitive (e.g., multiple section summaries) or abstract (e.g., examples for simplification in §2.0). See below for detailed comments. The observations of the authors on the RDFization of semistructured (XML) data (§2) are not surprising, but although only moderately innovative, this description is valuable in that it provides a point of orientation for colleagues interested in projects of similar scale and scope. In that regard, the article is sufficiently relevant for inclusion in the journal.
At the beginning, I am missing a motivation for using OWL. Wouldn't RDFS be more natural, in particular for mixed types? (I would expect to find those as a result of converting free text into ObjectProperties.) Are there plans for using OWL's reasoning capabilities? If so, which OWL dialect are you aiming for?
In general, there is no discussion of related research. For a plain data set description, this is acceptable, but references to other efforts to LODify metadata repositories (e.g., ISOcat -- although this differs in scope) would improve the paper's usefulness to the intended audience.
Remarks and questions
---------------------
p.1, §1: The motivation doesn't clearly state the formalism applied (OWL is *implied only* by second paragraph).
p.1, §1: If there is an initial version of the model, is there an advanced one as well? If the authors want to refer to their MS model as the starting point for LD4LT work, they should rephrase.
p.1, §1: separate catalog from LD4LT: remove line break before "The catalog", add line break after "... [1]." (same sentence). Otherwise, LD4LT work appears to be an aspect of work on the catalog.
p.1, §1: Relation between LD4LT and the authors is unclear. If the authors see their work as contribution to LD4LT work, they should say so.
p.2, Tab.1: Please check fourth vs. sixth row: Shouldn't the global element be accompanied by owl:DatatypeProperty (along with rdfs:Datatype) as well?
p.2, §1: The motivation as a more user-friendly formalism should come earlier, maybe even with a clause in the abstract. Also, please mention earlier in what way this would be more user-friendly and less technical (SPARQL vs. XQuery?), or at least refer to the corresponding section.
p.2, §1: "maximize the information": Please rephrase, the informational content remains the same (I hope), just becomes more accessible.
p.2, §1: "The section focuses on ...": repetitive, move to and merge with §2 intro
p.2, §2: "SXD" -> "XSD"
p.2, §2: Please provide a *concrete* example for the simplification, for both `wrapping elements' and `superfluous elements'. This may require a new subsection.
p.2, §2: last paragraph redundant with beginning of section.
p.3, §2.1: the observed inconsistencies are nicely described and rather typical for pre-linked data, this is valuable to anyone considering RDFization, also beyond the MS data set.
p.3, §2.2: RDFying unstructured data: I really didn't get what you wanted here before I read the section -- mostly because it wasn't clear to what extent MS contains free textual descriptions. Maybe describe it (at the first encounter with "RDFying unstructured data") more explicitly as mining object properties from free-text content of the original MS data. In this regard, your term "text RDFication" would be more easily comprehensible. The extent of free text content in MS becomes clear only on p.5.
p.5, §2.3: The document discussion is interesting, but also of different character than that of person, project and organization. Maybe put into its own subsection.
p.8: The "CCC Browser" and the "catalog (browser)" basically refer to the same entity, or rather to its software and data components, respectively. To facilitate comprehensibility, please use the same term throughout the paper to refer to system and data as a unit.
Minor comments
--------------
p.1, abstract: "enumerations; the RDFication" check punctuation; if this is an enumeration, use ","
p.1, §1: "Currently, this initial version" I presume the authors mean the version described in this paper. In the current formulation, "this initial version" may refer to a predecessor. If the latter is the case, then please clarify in what respects they deviate.
p.1, §1: "repository of language resource*s*"
p.1, fn.3: In an RDF context "resource" is ambiguous, please clarify your term usage here more explicitly.
p.4, §2.3: Headline ": dealing" => ": Dealing"
p.8, §4: Headline "; making" => ": Making"
p.8, fn.23: Please put this in a listing (with proper formatting), not a footnote.
references: Check formatting and capitalization.
overall: Please check whether section-initial indent is style-conformant.
overall: Check American vs. British English: "Catalog" vs. "Catalogue" etc.