The IULA’s META-SHARE LOD dataset, lessons learnt

Tracking #: 900-2111

Marta Villegas
Núria Bel

Responsible editor: 
Philipp Cimiano

Submission type: 
Dataset Description
This article describes the IULA’s META-SHARE LOD dataset and the RDFication task performed when moving the original XSD/XML data into RDF/OWL. The dataset has to do with language resource descriptions and it includes the LOD version of the META-SHARE model plus the IULA’s language resource descriptions. The article focuses on some critical aspects when RDFying XSD/XML data. Essentially these include: the mapping of controlled vocabularies expressed in XML enumerations; the RDFication of certain unstructured data (those where unrestricted input strings may generate relevant instances) and the cleaning and linking tasks required once eventual instances are RDFied. Data cleaning and linking become crucial in a scenario where different distributed metadata nodes share their data. The eventual dataset proves efficient for data exploitation and capitalizes the efforts done. This is demonstrated by the catalog browser developed which allows retrieving relevant relations between tools and datasets, services and publications, people and projects etc.. that remained hidden in the original data and demonstrates some data mashups. This web application uses the dataset described to promote the use of language technology to researches of Humanities and Social Sciences as part of the CLARIN initiative.

Major Revision

Solicited Reviews:
Review #1
By John McCrae submitted on 25/Jan/2015
Major Revision
Review Comment:

This paper presents the conversion of the MetaShare LOD dataset to linked open data and as such is a timely and important dataset. The need for good metadata describing language resources is a key challenge to enabling NLP systems and other language technologies to quickly adapt to new domains and languages. This paper, however, has two major flaws. First, there is very little discussion of related work, and it would be good if the authors a) considered what other metadata resources there are and b) compared their approach to converting this dataset with the approaches used for other resources. Second, this paper describes only the conversion of the metadata stored at IULA-UPF and not the whole of the MetaShare data; it would be great if the authors could clarify whether there are any plans to complete the conversion.

There are a number of minor issues as well:

Abstract: "dataset has to do with" => "dataset is concerned with"
p1: "Humanities", "Social Sciences", "Catalog", "Language Resource", "Repository Node": None of these are proper nouns so please don't capitalize
p2: "become too technical" => "are too technical"
p2: "Besides, some interesting" => "Furthermore, some interesting"
p2: "MS nodes remain invisible to end users" please explain more about this
p2: "are plenty of data" => "are a large collection of data"
p2: "i) to generate... ii) to map" => "i) generating... ii) mapping"
p2: SXD should be XSD, right?
p2: "derive in" => "result in"
p2: "ulterior" (too pejorative) => "external"
p2: "in [2]" => "in Villegas et al. [2]"
p2: "eventual instances" => "instances"
p3: "of which 807 distinct ones" => "of which 807 are distinct"
p3: "let's mention" => "we note"
p3: "include the so common" => drop "so"
p3: Best practices recommend the Library of Congress or LexVo for language codes (see W3C BPM-LOD Community Group)
p4: The SPARQL queries should be as follows:
DELETE { ?s ms:mimeType "text" }

?s ms:mimeType
} WHERE { ?s ms:mimeType "text" }

?s ms:mimeType ?type .
FILTER(!strStarts(str(?type), ""))
or if you still want to use regex at least use "^"
p4: "eventual model" => "final model"
p4: "the fact is that free text is widely used" => "in fact free text is widely used" and could you provide a citation for this?
p5: "VIAF 14" (extra space)
p5: "oddities" => "outliers"?
p6: "(hopefully avoidable)" no brackets
p6: "ORCHID" => "ORCID"
p6: "xsd" => "XSD" for consistency
p6: What is the "CCC Browser"?
p6: "DBpedia resource X", what is X?
p6: "the Browser" no capitalization
p7: Capitals for "HTML" and "Turtle"
p8: "allowed making the most of the data" not grammatical
p8: "not much considering the node is a central one" please expand why this is surprising
p8: "IUA" => "IULA"
p8: "the DBpedia" no the
p9: Please include page numbering for citations from LREC
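
The p4 remark above (prefer strStarts, or at least anchor the regex with "^") can be illustrated outside SPARQL. The following is a minimal Python sketch, assuming a hypothetical namespace prefix and hypothetical mimeType values; none of them are taken from the dataset. It shows that an unanchored pattern also accepts values that merely contain the prefix, whereas the anchored pattern and the plain prefix test agree.

```python
import re

# Hypothetical prefix and values; the dataset's real IRIs are not shown in the review.
PREFIX = "http://example.org/mime/"
values = [
    "http://example.org/mime/text",
    "text",
    "see http://example.org/mime/text",
]

def unanchored(value):
    # Matches the prefix anywhere in the string.
    return re.search(re.escape(PREFIX), value) is not None

def anchored(value):
    # "^" restricts the match to the start of the string.
    return re.search("^" + re.escape(PREFIX), value) is not None

def starts_with(value):
    # Plain prefix test, the Python analogue of SPARQL's strStarts.
    return value.startswith(PREFIX)

# Values that each test would report as "not yet normalised".
not_normalised_unanchored = [v for v in values if not unanchored(v)]
not_normalised_anchored = [v for v in values if not anchored(v)]
not_normalised_prefix = [v for v in values if not starts_with(v)]
```

With the unanchored pattern, the third value is wrongly treated as already normalised; the anchored pattern and the prefix test both flag it.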

Review #2
By Jorge Gracia submitted on 28/Jan/2015
Minor Revision
Review Comment:

The article describes the conversion of the Meta-Share metadata for describing language resources from its original XSD/XML version into RDF. This work describes in detail the main steps carried out in the process, including the strategies for XML enumerations and for certain unstructured data, and the final data cleaning and linking step. Finally, a browser was developed to access the data. As a result, the so-called IULA’s META-SHARE LOD dataset was created, based on the metadata corresponding to the UPF's Meta-Share node.

The paper is clear and well structured. English needs further checking, though. One of the aspects I like most is that it clearly identifies the limitations of XSD/XML that can be overcome by using RDF and linked data. The strategies discussed here are general enough to be useful also for migrating other XML-based data schemes in other domains. The paper is timely in the sense that there is a growing interest in migrating linguistic data and metadata into the LOD cloud within the language resources community. In fact, the work described here has served as input to other community-based efforts such as the W3C LD4LT group, where an OWL model for describing language resources is currently under development.

There are, however, certain issues that the authors should address to improve the paper.

- I would like to read in the paper a more explicit analysis of which limitations are inherent to XML/XSD and which ones depend on the modelling choices made in MetaShare.
- A brief comparison with other non LOD-based browsers/repositories (Meta-Share, CLARIN, ...) would also be desirable.
- In I see no ontology, only a set of prefixes.
- At the time of writing this, the SPARQL endpoint is not available.
- For a "data description" type of submission I consider that the two previous issues are essential and should be fixed prior to the final acceptance of the paper.
- I would omit the last sentence in the abstract, which is not essential and is already explained in the motivating section.
- It is said that "the open world assumption of LOD eases data enrichment". I would not say that OWA is an assumption of linked data but of the Semantic Web. Linked data is just a set of best practises. In any case it is true that OWA makes data enrichment easier, which is a good point and should be further discussed in the paper.
- Also in the motivating section it is pointed out that "some interesting information in the MS nodes remained invisible to end users". In what ways did LOD help to extract such "hidden" existing information (maybe by inference mechanisms)? This needs further explanation.
- In Section 1 near the end, I would join the lines starting "Section 3..." and "Section 4..." into the previous paragraph.
- The rules summarised in table 1 for the XML-RDF mapping, where do they come from? Add citation.
- Why are flatter and shallower representations obtained after identifying the "superfluous" nodes?
- In section 2.1 they say "In OWL, these data categories are actually ordinary resources", what do the authors mean by "ordinary resources"? maybe ontology entities?
- The last sentence in Section 2.1 could be omitted or replaced by "as we will see in Section 2.3"
- Citations or footnotes are missing for LD4LT, Open Refine, Dublin Core, DBLP and Google Scholar
- In Section 2.3, about RDFying documents, they say "we use DBLP and Google Scholar to search for input unstructured documents". Is this a manual process?
- I would omit footnote 16
- In section 2.4 they say "We enriched our original dataset with additional documentation stuff...". From where?
- To justify that their "own records" were created for identifying persons (instead of linking to DBLP as initially planned) the authors claim that "the first tests we performed showed that the performance was neither good nor reliable enough". This should be further clarified, for instance by briefly explaining the type of experiment and the metrics considered.
- It should be clarified why 42 creators were encoded as string values instead of creating local URIs for them. In what way are they different from the other 29 creators locally declared?
- Is "CCC Browser" the same as "Catalog Browser"? This browser is repeatedly mentioned in the paper but a proper introduction to it is missing.
- The reference to Table 2 in Section 3 is wrong (it should be Table 3)
- The purpose of column "scope" in table 3 is unclear, it seems to mix vocabulary abbreviations (e.g., "dc") with vocabulary topics (e.g., "licenses")
- In the bibliography there is a typo: "Núia Bel" -> "Núria Bel"
- Please review English for typos. A couple of examples:
Abstract: "etc.." -> "etc."
Section 1: "MS is a network of repositories of language resource (LRs)" -> "...language resources..."; "illustrative use cases, etc)" -> "illustrative use cases, etc.)"
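
On the question above about string-valued creators versus local URIs, the following is a minimal Python sketch of how local URIs might be minted from name strings so that spelling variants of one name collapse to a single identifier. The base namespace and the normalisation rule are assumptions made for illustration; this is not the authors' procedure.

```python
import re
import unicodedata

BASE = "http://example.org/person/"  # hypothetical namespace, not the dataset's

def slug(name):
    # Strip accents, lowercase, and collapse non-alphanumeric runs to hyphens.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")

def mint(names):
    # Map every surface form to the URI of its normalised key.
    return {name: BASE + slug(name) for name in names}

# Three surface forms of the same name, differing in accents, case and spacing.
uris = mint(["Núria Bel", "Nuria  Bel", "NÚRIA BEL"])
```

A rule this simple merges accent, case and spacing variants only; a misspelling such as "Núia Bel" would still yield a separate URI and need manual or fuzzy reconciliation.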

Review #3
By Christian Chiarcos submitted on 17/Feb/2015
Minor Revision
Review Comment:


I suggest acceptance with minor revision (as detailed below).

It is a reasonable data set description in terms of

(1) Quality and stability of the dataset
The original data set is known in the NLP community as the result of a large-scale resource survey and thus provides state-of-the-art data quality. Quality improvements over the original data are a main point in the paper. It is further developed by the LD4LT W3C CG, so we can expect stability and maintenance from this community effort. These aspects are also mentioned in the paper, but could be emphasized more.

(2) Usefulness of the dataset, which should be shown by corresponding third-party uses
Usefulness of the data in its new representation is illustrated by the Catalog Browser. The original data has a history of third party usage which will probably continue with the LOD version.

(3) Clarity and completeness of the descriptions
The description of the LOD conversion is concise, clear and well described. The conversion, however, refers to the original XSD scheme, where some design decisions (global vs. local elements) remain unclear.


The authors describe the conversion of the META-SHARE data model from XSD to OWL, and of the XML content of the IULA dataset to instances thereof. META-SHARE provides a network of repositories of language resources, including both language data and language tools, described through a set of metadata.

They elaborate on specific details of the RDFication process, most notably the treatment of RDF enumerations (§2.1) and the conversion of certain free text entries to structured data (§2.2). This effort is motivated by the need for a more user-friendly and less technical formalism (p.2) which is provided by the authors through an RDF-based catalog. This catalog is one primary result of their work and subject to this data set description, and in its current instantiation, it is the basis for discussing META-SHARE content within the Linked Data for Language Technology Community Group (LD4LT) working group.

One benefit of this representation is the linking of different data sets which facilitates cleaning and distributed use of data sets. Another is that queries over relations become possible that "were hidden" in the original XML data [mentioned in the abstract, reference to example? -- this pertains to "unstructured" free-text content, I presume]. Both aspects are demonstrated in end-user oriented use cases in §4.

Usability is demonstrated through an RDF-based catalog browser (§1) which manages, disseminates and grants access to language resource metadata. Creating an LOD version of META-SHARE data does not immediately lower the entrance barrier for *non-technical* users. It would be more convincing to reverse the presentation of LOD pros on p.2, §1, top left: first emphasize the way that "LOD eases data enrichment", and then suggest that this may also provide a usability benefit over the original implementation (as hinted at in "MS nodes target language technology professionals and become too technical for a potentially wider community of users."). A nice feature of RDF is data correction/checking via SPARQL and SPARQL Update (§2.2, §2.3) -- can you identify advantages of this approach over, say, XUF (or whatever update formalism was used in the original implementation)? In §2.3, the notion of a local XML element should be explained in relation to MS: In the original XSD, this was a design decision (which seems to be inherited by the LOD version), but the motivations behind it are unclear.

Data and tool availability are discussed in §3; the functionality of the catalog browser is illustrated in §4. I'm not fully convinced that SPARQL queries do indeed improve accessibility of the data to non-technical users, but the faceted browsing functionality probably does. Unless something like this has already been available in the old implementation, the authors may consider emphasizing the linking aspects rather than usability.

Throughout the paper, design decisions and conversion steps are generally well explained and the obligatory statistics about the data set are provided with a sufficient level of detail. The text is readable and generally well-structured; stylistically, however, its descriptions are occasionally repetitive (e.g., multiple section summaries) or abstract (e.g., examples for simplification in §2.0). See below for detailed comments. The observations of the authors on the RDFization of semistructured (XML) data (§2) are not surprising, but although only moderately innovative, this description is valuable in that it provides a point of orientation for colleagues interested in projects of similar scale and scope. In that regard, the article is sufficiently relevant for inclusion in the journal.

At the beginning, I am missing a motivation for using OWL. Wouldn't RDFS be more natural, in particular for mixed types? (I would expect to find those as a result of converting free text into ObjectProperties.) Are there plans for using OWL's reasoning capabilities? If so, which OWL dialect are you aiming for?

In general, there is no discussion of related research. For a plain data set description, this is acceptable, but references to other efforts to LODify metadata repositories (e.g., ISOcat -- although this differs in scope) would improve its usability to the intended audience.

Remarks and questions

p.1, §1: The motivation doesn't clearly state the formalism applied (OWL is *implied only* by second paragraph).
p.1, §1: If there is an initial version of the model, is there an advanced one as well? If the authors want to refer to their MS model as the starting point for LD4LT work, they should rephrase.
p.1, §1: separate catalog from LD4LT: remove line break before "The catalog", add line break after "... [1]." (same sentence). Otherwise, LD4LT work appears to be an aspect of work on the catalog.
p.1, §1: Relation between LD4LT and the authors is unclear. If the authors see their work as contribution to LD4LT work, they should say so.
p.2, Tab.1: please check fourth vs. sixth row: Isn't the global element to be accompanied with owl:DatatypeProperty (along with rdfs:Datatype), as well?
p.2, §1: Motivation as a more user-friendly formalism should be earlier, maybe even with a clause in the abstract. Also, please mention earlier in what way this would be more user-friendly and less technical (SPARQL vs. XQuery?), or at least refer to the corresponding section.
p.2, §1: "maximize the information": Please rephrase, the informational content remains the same (I hope), just becomes more accessible.
p.2, §1: "The section focuses on ...": repetitive, move to and merge with §2 intro
p.2, §2: "SXD" -> "XSD"
p.2, §2: Please provide a *concrete* example for the simplification, for both `wrapping elements' and `superfluous elements'. This may require a new subsection.
p.2, §2: last paragraph redundant with beginning of section.
p.3, §2.1: the observed inconsistencies are nicely described and rather typical for pre-linked data, this is valuable to anyone considering RDFization, also beyond the MS data set.
p.3, §2.2: RDFying unstructured data: I really didn't get what you wanted here before I read the section -- mostly because it wasn't clear to what extent MS contains free textual descriptions. Maybe describe it (at the first encounter with "RDFying unstructured data") more explicitly as mining object properties from free-text content of the original MS data. In this regard, your term "text RDFication" would be more easily comprehensible. The extent of free text content in MS becomes clear only on p.5.
p.5, §2.3: The document discussion is interesting, but also of different character than that of person, project and organization. Maybe put into its own subsection.
p.8: The "CCC Browser" and the "catalog (browser)" are basically referring to the same entity, resp. its software and data components. To facilitate comprehensibility, please use the same term throughout the paper to refer to system and data as a unit.
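
As an indication of the kind of concrete example requested above for p.2, §2 (simplification of wrapping elements), the following is a minimal Python sketch. The element names and the record are hypothetical and not taken from the MS schema, and the tuple output only stands in for RDF triples. A wrapper element that merely groups repeated children is skipped, so its children attach directly to the resource and the result is flatter.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; element names are illustrative, not from the MS schema.
RECORD = """
<resource id="r1">
  <title>Sample corpus</title>
  <languages>
    <language>ca</language>
    <language>es</language>
  </languages>
</resource>
"""

# Elements assumed to do nothing but group repeated children.
WRAPPERS = {"languages"}

def flatten(element, subject, wrappers):
    """Yield (subject, property, value) tuples, skipping wrapper elements."""
    for child in element:
        if child.tag in wrappers:
            # Attach the wrapper's children directly to the current subject.
            yield from flatten(child, subject, wrappers)
        else:
            yield (subject, child.tag, (child.text or "").strip())

root = ET.fromstring(RECORD)
triples = list(flatten(root, root.get("id"), WRAPPERS))
```

Here the intermediate "languages" node produces no statement of its own; each "language" value hangs directly off the resource.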

Minor comments

p.1, abstract: "enumerations; the RDFication" check syntax, if enumeration, use ","
p.1, §1: "Currently, this initial version" I presume the authors mean the version described in this paper. In the current formulation, "this initial version" may refer to a predecessor. If the latter is the case, then please clarify in what respects they deviate.
p.1, §1: "repository of language resource*s*"
p.1, fn.3: In an RDF context "resource" is ambiguous, please clarify your term usage here more explicitly.
p.4, §2.3: Headline ": dealing" => ": Dealing"
p.8, §4, Headline "; making" => ": Making"
p.8, fn.23: please, put this to a listing (with proper formatting), not a footnote.
references: Check formatting and capitalization.
overall: Please check whether section-initial indent is style-conformant.
overall: Check American vs. British English: "Catalog" vs. "Catalogue" etc.