WOLD, WALS and IDS - RDF conversion and interoperability of linguistic datasets of the MPI EVA Leipzig

Tracking #: 423-1547

Martin Brümmer

Responsible editor: Guest editors Multilingual LOD 2012 JS

Submission type: Dataset Description
This paper describes the conversion into RDF, the internal structure, and the semantic content of three linguistic datasets of the Department of Linguistics at the Max Planck Institute for Evolutionary Anthropology. Two of the datasets were converted in the course of the MLODE 2012 workshop, while one is a pre-existing dataset converted by the MPI EVA. The description spans three datasets to illustrate similarities and differences, as well as common shortcomings in the conversion of linguistic datasets into RDF. Alongside the descriptions, the interoperability of the specific datasets and of Linguistic Linked Open Data as a whole is examined, and pitfalls common to the interaction of multiple datasets from different sources are discussed.

Major Revision

Solicited Reviews:
Review #1
By Sebastian Nordhoff submitted on 28/Jan/2013
Minor Revision
Review Comment:

The paper discusses an important aspect of the future development of the Linguistic Linked Open Data Cloud, namely interoperability. It is well structured and well argued. This being said, the paper cannot be published in its present state. There are close to 100 places where the English used does not meet academic standards. It is required that the paper be proof-read by a professional. I will omit all remarks regarding orthography, grammar, and style.

Other comments
- Capitalize Linked Open Data in keywords
- Acronyms of the projects should be given when introducing them on page 1.
- italicize 'feature' on p2l8
- p3 "Data entry ... [is] converted to RDF" ???
- p3 What does "granularity of language classes" mean? What is a "language class"?
- p4 The editors of the works are listed in the references, no need to repeat them here
- expand blocks (a) and (b) in Section 5
- p7 I have no idea what "elevated" is supposed to mean here
- some (cross-)references got lost and are shown as '??'. Enable warnings in LaTeX to see where this happens.

Review #2
By Christian Chiarcos submitted on 26/Feb/2013
Minor Revision
Review Comment:

Martin Brümmer: WOLD, WALS and IDS. RDF conversion and interoperability of linguistic datasets of the MPI EVA Leipzig

The paper describes the RDF conversions of three linguistic datasets of the MPI Leipzig. Despite numerous language issues (the language needs to be substantially improved), it is an interesting and insightful dataset description, and as it deals with a prominent set of resources from typology, it is certainly worth including in the special issue, once the issues mentioned below have been addressed.

In section 2, the author criticizes the use of Literals for linguistic features. It may be worth checking whether the recently released TDS ontology (http://languagelink.let.uu.nl/tds/ontology/LinguisticOntology.owl, see, e.g., Saulwick et al. 2005; link and publication under CC-BY were announced last month by Menzo Windhouwer in personal email communication) provides a formal representation of the necessary information and, if so, adding a reference to it. Sebastian Nordhoff and colleagues from the MPI have initiated integration efforts for TDS, MPI datasets and other resources, and they may be consulted on this.
Same section: "There was no Linked Data version or SPARQL endpoint available at the time of writing." If I recall correctly, the MLODE (workshop) organizers intended to provide an endpoint. The data is available, but at least, *efforts* to link the data sets should be mentioned. (And AFAIK, these are underway, for the state of development of late 2012, see Sect. 4 in Chiarcos, Moran et al. 2013. For more recent information, please double-check with Sebastian Nordhoff.)

As for sections 5.2 and 5.3, I wonder whether it would be possible for the final paper to use Glottolog language IDs instead of, or alongside, ISO 639 codes. As these were developed by the MPI itself, they should cover a greater portion of language identifiers than the ISO list. Also, they were specifically designed with the goal of addressing the owl:sameAs issue mentioned in 5.2. If time permits and if the Glottolog development has progressed sufficiently, the author might consider extending the experiments accordingly. This is, however, only a suggestion.

Most importantly, the language needs to be improved, e.g., the very first sentence of the abstract could be restructured such that the resources are introduced earlier: "This paper describes the RDF conversion of three linguistic datasets of the Department of Linguistics at the Max Planck Institute for Evolutionary Anthtopology, their internal structure, as well as the semantic content." It's "Anthropology", of course (occurs multiple times), etc. In the following, I only mention language issues where they affect the understandability of the text.
page 1: "problems unique to Linguistic Linked Open Data (LLOD) will occur and tried to be solved" => "... are tried ..." (or, better, use active sentences). Use consistent spelling of "code-a-thon". Check hyphenation (and language settings): "devel-oped"?
page 2: "code[2]. Instead, existing" => Instead of what?
page 2: "The features themselves are modeled as a property" => multiple properties, I guess
page 4, 5 (and elsewhere): Typographical issues, e.g., boundaries on p. 4, line breaks on page 5, etc.
page 7: "The most basic concept in the domain of Linguistic Linked Open Data is the concept of language." => "One fundamental concept ..."
page 7: "as mentioned in ??."
page 7: "research and interoperability" => "research, and interoperability"
page 7: "Linguistic field research may disagree" => researchers, not research
page 8: "The points made in section ?? can not yet seen as proven,"

Minor comments:
- First paragraph should explicitly introduce the abbreviations used in the paper title.
- No references or links for OLiA and ISOcat. In the final paper, these could be cross-references within the special issue, I presume, but this needs to be double-checked by the editors.

Saulwick, A., M. Windhouwer, A. Dimitriadis, R. Goedemans (2005), Distributed tasking in ontology mediated integration of typological databases for linguistic research, In: Proc. 17th Conf. on Advanced Information Systems Engineering (CAiSE'05), Porto.

Chiarcos, C., S. Moran, P. Mendes, S. Nordhoff, R. Littauer (to appear 2013), Building a Linked Open Data Cloud of Linguistic Resources: Motivations and Developments, In: Iryna Gurevych and Jungi Kim (eds.) The People's Web Meets NLP: Collaboratively Constructed Language Resources, Springer. [manuscript can be requested from the authors, responsible author for the corresponding subsection is Sebastian Nordhoff]

Review #3
Anonymous submitted on 06/Mar/2013
Review Comment: