Countering language attrition with PanLex and the Web of Data

Tracking #: 509-1708

Patrick Westphal
Claus Stadler
Jonathan Pool

Responsible editor: Guest editors Multilingual LOD 2012 JS

Submission type: Dataset Description

At present, there are approximately 7,000 living languages in the world. However, some experts claim that the process of globalization may eventually lead to the world losing this linguistic diversity. The vision of the PanLex project is to help save these languages, especially low-density ones, by allowing them to be intertranslatable and thus to be a part of the Information Age. For this reason, PanLex gathers and integrates information from thousands of linguistic resources, such as monolingual dictionaries, bilingual dictionaries, multilingual dictionaries, glossaries, standards and thesauri. In this dataset description paper we detail how we transformed this data to RDF, interlinked it with Lexvo and DBpedia and published it as Linked Data and via SPARQL.

Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 31/Jul/2013
Minor Revision
Review Comment:

This is a new revision of a previously reviewed paper. Looking at the comments from the first round, some but not all have been addressed. I don't believe a version number of the dataset has been added.

A major issue is that of approvers. Ultimately, one should probably view PanLex not as a dataset, but as a large collection of datasets that have been brought into the same format. Instead of the approver terminology, the authors should use VoID descriptions together with existing standards to describe the individual datasets' licenses etc. In any case, the current modeling of approvers is a bit confusing. The term "approver" makes us think of these entities as people or institutions, but all the attributes used in the modeling make them seem like datasets. Hopefully, this is what the authors have in mind as future work when they say that information sources and approvers should be distinct entities.

While a conversion and aggregation of datasets is extremely useful, the project should give a lot more credit to the original datasets, among other things by making separate downloads available.
It would also be very useful to see how many approvers really are open source, because PanLex also seems to include data extracted from commercial sources. This could be provided in a simple table listing the most frequent licenses.

Review #2
Anonymous submitted on 07/Aug/2013
Minor Revision
Review Comment:

This is a resubmission of an earlier paper that I also reviewed. I have read through it side-by-side with the earlier submission, and the revisions do not appear to be especially substantial. For instance, an earlier reviewer (not me) wrote:

"I find the paper is a bit fragmented in that it presents PanLex as a project to counter language attrition, and provides some points regarding language preservation and PanLex as a tool to support pan-lingual translation, but after the introduction the rest of the paper describes PanLex's structure and its conversion to RDF."

I find this to be a very valid point, and I don't see much of anything in this revision to address it. Indeed, I'd recommend simply removing the paper's discussion on the topic of countering language attrition. It may be a goal of PanLex as a project, but does not seem very relevant here.

This example is indicative of the fact that various comments of the earlier reviewers do not appear to have been addressed, or at least not clearly. Some lack of response is to be expected, but there seems to be more here than I would normally like to see. I did, however, note that the online tools for accessing the data seem improved since my last attempt to use them, and that efforts have been made to begin addressing a key weakness of the present work, the lack of re-use of existing vocabularies. That should be taken into account as a positive sign.

However, I still do not find the paper publishable at present, since it is not clear to me what one can do with this resource, especially given how little use it makes of existing vocabularies and models that would help guide a user in this regard. At the same time, fortunately, it does not seem to me that this would be hard to address. For instance, the paper says, "For example, the TeraDict translation service could now be easily realized using simple SPARQL queries." If that's the case, including example SPARQL queries would go a long way to showing people how to use the database for their own work. As it stands, from the dataset description, I don't see how one could build a translation service, for instance, given the variety of "approvers" involved in the project. The "apple" example discussed in the paper indicates some of the difficulties.

In sum, my recommendation is that, if some of the introductory material were removed and more concrete discussion of how the dataset could be used could be given, then I think this paper would be publishable. I would also like to see issues surrounding re-use of vocabularies addressed better, and it would be nice if section 6 did not merely list other work but actually discussed how PanLex could relate to it in more detail, but these potential revisions are, in my view, less essential. Without some revision, however, I believe this paper does not quite manage to be a "dataset description", but, rather, is primarily a "discussion of data conversion". Of course, the latter topic is a good one, but not what this submission is supposed to be, as far as I understand it.