Linguistic Resources Enhanced with Geospatial Information

Tracking #: 411-1525

Authors: 
Richard Littauer
Boris Villazon-Terrazas
Steven Moran

Responsible editor: 
Guest editors Multilingual LOD 2012 JMS

Submission type: 
Full Paper
Abstract: 
In this short report on language data and RDF tools, we describe the transformation process that we undertook to convert spreadsheet data about a group of endangered languages and where they are spoken in West Africa into an RDF triple store. We use RDF tools to organize and visualize these data on a world map, accessible through a web browser. The functionality we develop allows researchers to see where these languages are spoken and to query the language data. This type of development not only showcases the power of RDF, but it provides a powerful tool for linguists trying to solve the mysteries of the genealogical relatedness of the Dogon languages.
Tags: 
Reviewed

Decision/Status: 
Reject and Resubmit

Solicited Reviews:
Review #1
By Dongpo Deng submitted on 30/Jan/2013
Suggestion:
Minor Revision
Review Comment:

This paper describes the conversion of geo-referenced linguistic data to RDF, so that the linguistic dataset can be geo-visualized in the map4rdf software using the WGS84 Geo Basic Vocabulary. The paper gives linguistic researchers a geospatial overview of the Dogon languages and a data process from CSV to RDF. Linguistic resources converted to linked geo data offer researchers not only semantic browsing of the data but also geospatial exploration.
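As an illustration of the kind of conversion summarized above, the following is a minimal sketch rather than the authors' actual pipeline. It assumes a CSV with hypothetical column names `village`, `lat`, and `lon`, reuses the village URI pattern quoted elsewhere in these reviews, and types coordinates as `xsd:decimal` (an assumption; the WGS84 Geo Basic Vocabulary does not mandate a datatype). The sample row is a placeholder, not data from the Dogon spreadsheet.

```python
import csv
import io
from urllib.parse import quote

# W3C WGS84 Geo Basic Vocabulary namespace (provides geo:lat and geo:long).
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"
# URI pattern taken from the village links quoted in the reviews.
VILLAGE_BASE = "http://linguistic.linkeddata.es/mlode/resource/Village/"


def rows_to_ntriples(csv_text):
    """Turn CSV rows (hypothetical columns: village, lat, lon) into N-Triples lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = VILLAGE_BASE + quote(row["village"].strip())
        for column, prop in (("lat", "lat"), ("lon", "long")):
            raw = row[column].strip()
            float(raw)  # raises ValueError early on malformed coordinates
            lines.append(f'<{subject}> <{GEO}{prop}> "{raw}"^^<{XSD_DECIMAL}> .')
    return lines


if __name__ == "__main__":
    # Placeholder row for illustration only.
    sample = "village,lat,lon\nExample,14.5,-3.5\n"
    for line in rows_to_ntriples(sample):
        print(line)
```

Triples of this shape, with `geo:lat` and `geo:long` attached to each village resource, are what a spatial RDF viewer needs in order to place the resource on a map.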

My first question concerns the title of the paper, “Linguistic Resources Enhanced with Geospatial Information”. I wonder whether the phrase “enhanced with geospatial information” is appropriate. This research only uses the map4rdf software to geo-visualize the linguistic dataset on a map; it does not exploit geospatial information. Moreover, the paper focuses on the transformation of the linguistic data, not on how to link it with other geospatial information so that it becomes easy to see how endangered the Dogon languages are.

The research illustrates a linked data process from spreadsheet to RDF. Unfortunately, it does not go a step further and link to other linked datasets such as Geonames.org. The villages in the linguistic dataset are actually the same as the populated places in Geonames. For example, the village “Ninngari” (http://linguistic.linkeddata.es/mlode/resource/Village/Ninngari) can be linked to the Geonames entry (http://www.geonames.org/2452685/ninngari.html). Linking to Geonames.org can show researchers more of the villages surrounding the Dogon-speaking villages. It would then be possible to explore avenues of genealogical descent due to the geographic proximity of the villages.
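The suggested link could be expressed as a single triple. The sketch below is one possible way to do it, not something taken from the paper. It assumes `owl:sameAs` as the linking predicate (the review does not name one) and assumes the `http://sws.geonames.org/<id>/` form that GeoNames uses for its linked data identifiers. The helper names are hypothetical.

```python
import re

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def geonames_resource_uri(page_url):
    """Derive a GeoNames linked-data URI from a GeoNames page URL (assumed pattern)."""
    match = re.search(r"geonames\.org/(\d+)", page_url)
    if match is None:
        raise ValueError(f"no GeoNames id found in {page_url!r}")
    return f"http://sws.geonames.org/{match.group(1)}/"


def same_as_triple(subject_uri, object_uri):
    """Format one owl:sameAs statement as an N-Triples line."""
    return f"<{subject_uri}> <{OWL_SAME_AS}> <{object_uri}> ."


if __name__ == "__main__":
    village = "http://linguistic.linkeddata.es/mlode/resource/Village/Ninngari"
    page = "http://www.geonames.org/2452685/ninngari.html"
    print(same_as_triple(village, geonames_resource_uri(page)))
```

Whether `owl:sameAs` or a weaker predicate is appropriate depends on how confident the match between a village and a Geonames entry is; that choice is left open here.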

In Fig. 1, the label “geospatial information” on two of the columns is not correct: these columns actually hold coordinates. The geospatial information would be the attribute data (language and village names) together with the coordinates. Also, the resolution of Fig. 1 is not good enough, and its caption should be placed beneath the figure. In fact, this is a table. Why not convert it into a readable table?

The link http://linguistic.linkeddata.es/page/mlode/resource/Village/Boni is not available.

In the summary section, the sentence starting with “Whereas…” is very long and incorrect. Also, is the integration of WALS and Dogon data a result of this study or future work? If it is a research result, why is it mentioned only in the final section?

Review #2
Anonymous submitted on 02/Mar/2013
Suggestion:
Reject
Review Comment:

The authors present their experiences in converting a CSV file that contains geospatial content to RDF and then visualizing it on a map. The process and the tools required are presented in an adequate manner so that readers can follow a similar workflow.

Overall, the work presented could be an interesting presentation and/or tutorial targeting scientists from other scientific disciplines. It could serve as an interesting introduction to semantic web and/or geospatial data visualization.

However, this work should not be accepted as a "Full Paper". It includes neither scientific innovation nor technical contribution. Therefore its overall scientific quality fails to match the standards of the journal.

**** Revised Review *******************************************
This is a second review of the submitted paper, reviewed under the category and guidelines of “linked dataset description”.
The authors present a process of converting spreadsheet data containing geospatial information (coordinates) to RDF, and then visualize the produced RDF data using the map4rdf mapping framework.
The dataset used is described in Section 3.1. It contains GPS coordinates of villages where specific languages (including ISO 639-3 codes) are spoken. The authors only provide a URL of the “Dogon and Bangime Linguistics” project (http://dogonlanguages.org) and not a direct URL to the actual dataset.
My comments, according to the reviewing guidelines:
The authors provide a description of the linked dataset, but not the following information: name, URL, version date and number, licensing, availability, provenance. Further, searching the provided web site, I have downloaded the dataset from this page: http://dogonlanguages.org/geography.cfm. It contains several sheets with data (.xls file), but the authors do not mention exactly what sheet(s) they use in the paper.
Regarding the quality of the dataset, the authors do not provide any arguments.
Regarding the usefulness of the dataset, the authors do not provide any arguments. Instead they mention the potential application of the process they present for mapping linguistic resources in general. I strongly believe the usefulness of the dataset to be extremely low.
Regarding the clarity and completeness of the descriptions, I find them adequate.
My overall impression is that the paper presents an overview of a process for converting spreadsheet data to RDF and then visualizing it in web maps. Unfortunately, the focus of the paper should be on presenting a specific, interesting, and useful linked dataset. This is a strong weakness of the paper, and I do not believe it is something the authors can improve, even with a major revision. Their work is based on a niche, specialized, small and largely uninteresting dataset. Quite simply, while the submission has its own merits, it falls outside the scope of this journal.

Review #3
By Claus Stadler submitted on 02/Apr/2013
Suggestion:
Major Revision
Review Comment:

This paper describes the RDF conversion of a spreadsheet containing basic information about where specific varieties of the Dogon language family, found in Mali, are spoken.

The deployed technology stack leverages state-of-the-art open-source tools and demonstrates nicely how RDF data can be generated from spreadsheets and visualized, using an RDB-RDF mapper for the conversion, an RDF store for hosting, and a spatial RDF viewer for the visualization.

Unfortunately, there are several shortcomings:
- Most severe, no linking attempt has been made in any way, which is a key requirement for _Linked_ Data. For example, usual suspects for this (among others) could be DBpedia, LinkedGeoData, and Lexvo, although having more specialized link targets would be worthwhile.
- The input data is only a single CSV file with a manageable number of rows and columns containing rather basic content, so I would especially have expected more focus on the interlinking part that would make this data more interesting (such as statistics on how many speakers there are for each dialect).
- The RDF transformation was done in a very basic way: The column values were simply turned into string literals (without language tag; should be 'en'), although all of them would deserve to become resources on their own (e.g. language(Sub)Familiy, language code, etc.)
-- While for the Language{*} properties this is a flaw of the GOLD ontology, the authors' ad hoc vocabulary also stays with the use of literals, making interlinking impossible.
- The authors should consider whether the social_info property should be e.g. rdfs:comment, so that tools could honor it automatically in user interfaces.
- The ontology is not available with the dataset (e.g. this prevents quickly checking whether the use of literals is mandated by the used version of the ontology)
- There are no statistics (how big was the input data, how many triples are there)
- There is no link to the source data
- License information is missing
- Content negotiation with Linked Data is broken: curl -LH 'accept: text/plain' http://linguistic.linkeddata.es/mlode/resource/Village/Daka-Zigiya
- I could not find a landing page. This should contain (ideally versioned) data/ontology downloads, links to the SPARQL endpoint, the map application and basic documentation.
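To make the point about literals versus resources concrete, the sketch below contrasts the two encodings of the same column value. It is illustrative only: the namespace `http://example.org/`, the property URI, and the helper names are hypothetical and do not come from the dataset or from the GOLD ontology.

```python
from urllib.parse import quote

# Hypothetical namespace and property, used only for illustration.
BASE = "http://example.org/resource/"
FAMILY_PROP = "http://example.org/ontology/languageFamily"


def literal_triple(subject, predicate, value, lang="en"):
    """Encode a column value as a language-tagged string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}"@{lang} .'


def resource_triple(subject, predicate, value, kind):
    """Encode the same value as a resource with its own URI, which can then be linked."""
    slug = quote(value.strip().replace(" ", "_"))
    return f"<{subject}> <{predicate}> <{BASE}{kind}/{slug}> ."


if __name__ == "__main__":
    village = BASE + "Village/Example"
    print(literal_triple(village, FAMILY_PROP, "Dogon"))
    print(resource_triple(village, FAMILY_PROP, "Dogon", "LanguageFamily"))
```

With the second form, the language family has a URI of its own and can carry further statements, for instance links to external datasets. A bare literal cannot be the subject of such links.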

In conclusion, the approach followed is sound. There are several data-quality issues, most of which are easy to address. Yet substantially more work has to be invested for this dataset to meet the quality standards, most importantly on the linking part.