Countering language attrition with PanLex and the Web of Data

Tracking #: 422-1545

Authors: 
Patrick Westphal
Claus Stadler
Jonathan Pool

Responsible editor: 
Guest editors Multilingual LOD 2012 JS

Submission type: 
Dataset Description

Abstract: 
At present, there are approximately 7,000 living languages in the world. However, some experts claim that the process of globalization may eventually lead to the world losing this linguistic diversity. The vision of the PanLex project is to save these languages, especially low-density ones, by allowing them to be intertranslatable and thus to be a part of the Information Age. For this reason, PanLex gathers and integrates information from thousands of linguistic resources, such as monolingual dictionaries, bilingual dictionaries, multilingual dictionaries, glossaries, standards and thesauri. In this paper we describe how we transformed this data to RDF, interlinked it with Lexvo and DBpedia and published it as Linked Data and via SPARQL.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Steven Moran submitted on 01/Feb/2013
Suggestion:
Accept
Review Comment:

This paper describes PanLex, a project that integrates lexical information from thousands of resources including dictionaries, word lists, etc., into an open source database system. Word forms and phrases are gathered and then linked to various components (word classes, expressions, meanings, definitions, etc.) and to each other, and translations can be inferred from relations and patterns from these connections. In this way PanLex is taking disparate linguistic resources and leveraging graph technologies to create access to assertions in thousands of languages.

I think the vision of universal translation is a very welcome one, particularly given the perilous state of more than half of the world's languages, many of which are unfortunately not yet documented.

I find the paper a bit fragmented: it presents PanLex as a project to counter language attrition and makes some points about language preservation and about PanLex as a tool to support pan-lingual translation, but after the introduction the rest of the paper describes PanLex's structure and its conversion to RDF. For example, nothing is said about how PanLex will achieve one of its goals, that of saving thousands of languages; perhaps it would be better to say that PanLex will "preserve" the data from thousands of languages? Language preservation is a complicated issue and involves many factors. Devising tools to support global communication is just one facet of endangered-language preservation and revitalization. There is also no mention of countering attrition in the conclusion.

The authors' approach is in line with the Semantic Web vision of open data and with the theme of the present call for papers. It's also great that they use and highlight several open source standards and tools (e.g. Open Lexicon Interchange Format, Dublin Core Metadata, Sparqlify, Lexvo, etc.) and that they link their data to Lexvo, DBpedia, the Linguistic Linked Open Data cloud, etc. A further bonus is that they make their RDF conversion code openly available on GitHub and that the design, structure and various means of access to the PanLex data are also available via the PanLex website.

The project is also an interesting use case in RDF conversion due to the size of the PanLex database (18GB and growing!). The authors note that it "takes impractically long to convert all the data", so they take steps with RDB2RDF and Sparqlify to reach their goal.

Other criticisms:

Minor:

- the acronym ETL, used on page 4, is never defined, even though for this crowd it's probably obvious

- "paradim" misspelled on page 5

- "the co-occurences" -> "their co-occurrences"

More critical:

Although they are referenced in the text, at the time of my reading the SPARQL and SNORQL endpoint services were not available:

http://panlex.org/sparql
http://panlex.org/snorql

Review #2
Anonymous submitted on 15/Feb/2013
Suggestion:
Minor Revision
Review Comment:

Review of
Countering language attrition with PanLex and the Web of Data

Submitted to Semantic Web Journal

Overall recommendation: Accept with minor revisions (though on the borderline between major and minor revisions)

Summary review
This paper describes the efforts of the PanLex project to make its data available in Linked Data form. PanLex represents one of the largest efforts (if not the largest) to create a cross-linguistic lexical database where elements with similar meanings are linked together across thousands of languages. As such, a paper describing these efforts seems completely appropriate for a collection of papers on Multilingual Linked Open Data--and indeed, it would seem important to include it. At the same time, I think the paper could be improved in various ways, and probably the linked data itself could be improved too. I am not accustomed to reviewing papers on Linked Data datasets like this. So, I am not sure if it is appropriate to suggest changes to the actual RDF, but it does seem to me that, at least, the paper could address why certain choices in representing the data were made when others were available (see below for more details).

I will give detailed comments below, but first I will mention the most global concern I had about the paper: The paper does a good job of going through the mechanics of making the PanLex dataset available as Linked Data, but it was hard for me, on the whole, to understand the underlying conceptual model of linked lexical data that PanLex assumes. Because of this, it was not easy for me to see how similar or different it is from other projects aiming for interoperation of lexical data (e.g., for NLP). One of the apparent selling points of this database is its potential applications for translation, but I didn't see any discussion in the paper of how precisely this would be facilitated. I gather this is somehow done via the Meaning relation, but too little was said about how meanings were encoded for me to see how this might work. Is this essentially an "interlingual pivot" approach? If so, it would be good to know more about the "interlingua". If not, it would be good to make the nature of the intertranslatability mechanism clearer. More succinctly, I could summarize this comment as: The paper would have more impact, I believe, if the discussion were balanced more towards the conceptual side and less toward the technical one.

Detailed comments

Title: The title of the paper describes the goals of PanLex more than the contents of this specific paper.

p.2, section 2.1, first paragraphs: Can more be said about how meanings are represented? I quickly visited the link in footnote 4 and learned that meanings are approver-specific and that they have identifiers, but I still didn't know what they look like and, in any event, I probably shouldn't have to leave the paper to find this out.

p.2, section 2.1, second column: Can PanLex really extract lemmas reliably from so many sources?

p.2, section 2.1, towards the bottom: The label "approver" is very confusing for this object! Also, can the logic of making this pairing a foundational object for PanLex be made clearer? On its surface, it would seem to hinder potential applications for intertranslatability.

p.2, figure 1: The use of the singular "Metadatum" seems odd here. Is there only one metadata field being encoded?

p.3, section 2.1: As with meanings, I don't have a good sense of what kind of content is associated with a denotation. How are similar denotations across languages identified as such? Also, the end of this section implies an annotation schema for things like register, but can the possible feature-value pairs be consulted somewhere?

p.3, section 2.2: The discussion of how PanLex meanings are encoded in RDF clarified some aspects of the structure of meanings that would have been better stated explicitly earlier, e.g., that they can include "definitions" represented as strings. Also, why is rdfs:label considered an appropriate way to encode these definitions? A definition does not strike me as a "label", or do I misunderstand? Also, I wonder if the use of SKOS terms, such as skos:definition, might be more appropriate here. (In general, my sense is that this dataset could make better use of available data encoding schemes like SKOS, TEI, etc.)

p.3, section 2.2: Why use a plx:wordClass concept for denotations when there must be other places where this concept has been defined for Semantic Web purposes?

p.4, bottom of second column: "vocubulary"->"vocabulary"; also I was not familiar with the acronym ETL. Can it be expanded?

p.5, section 3: Can a brief note be made about the reason why linking to DBpedia is deemed desirable in the context of this dataset? What specific applications is this linking expected to support?

p.5, table 3: Is there some way for us to evaluate if the number of links obtained is "good" or not? For example, how many potential links would one have estimated as possible? What percentage of potential links were realized? How was the linking better or worse across languages? In what cases may "missing" links be due to problems with PanLex, when might DBpedia be the problem, etc.?

p.6, section 6: This section describes related work in very general terms, but it would be good to talk more specifically about how standards for lexical data interoperation relate to this project. Some relevant work was (relatively recently) summarized in this paper, http://www.aclweb.org/anthology-new/W/W10/W10-2101.pdf, and it may serve as a good starting point. It would be welcome, in particular, if there was explicit discussion about why existing coding systems weren't used for the RDF version of PanLex. For instance, lemon is mentioned, but that makes one wonder why it wasn't used to encode the data. In general, merely stating that related work exists is less interesting than describing why the conventions of that work were or were not adopted and/or placing this project more explicitly with respect to the goals of other projects.

p.6, section 7: Discussion of weaknesses seems relevant, but do these weaknesses actually affect the ability of this resource to be used effectively as linked data? Those are the kinds of weaknesses that I would think would be most relevant here.

Addendum to earlier review to ensure coverage of key areas of review:
---------------------------------------------------------------------

URL for the data: Various URLs to PanLex are given, but it was not clear to me if there was a standard URL for the RDF data. Also, I tried various links to get at the RDF in the paper but couldn't access any of them (e.g., the /sparql URL gave an error and http://ld.panlex.org/rdf.html, found on the project's main webpage, never resolved). This should be checked before publication.

Version date and number: I didn't see this explicitly discussed. It may be that there is no single version date and number since each source is associated with its own version date and number. As far as I know, this seems fine, but it would be good to make this clear.

Licensing: There was discussion of licensing of the data in PanLex in general, but I didn't see discussion of how licensing applies to the RDF data. Is it merely derivative of the licensing used for each of the PanLex sources or does the RDF version have its own license? (Is the licensing information encoded in the RDF if it is different for each set of data?)

Availability: See above. I could not access the URLs where the data was supposed to be available. Certainly wide availability is intended, and the PanLex data has long been available. So, I imagine this is merely a technical glitch.

Authorship: I am not sure of the relationship of the first two authors to the dataset, but the third author is director of the PanLex project, meaning at least one author has a close connection to the data.

Use of vocabularies: As noted in my review, I have concerns that, more often than necessary, PanLex internal concepts are used in the RDF when equivalent concepts may already be available in, for instance, SKOS.

I believe my earlier review covered the core review criteria, but I repeat my evaluation of them explicitly here for ease of editorial decision making.

1. Quality of the dataset: PanLex represents one of the largest efforts (if not the largest) to create a cross-linguistic lexical database where elements with similar meanings are linked together across thousands of languages. I have not explored the dataset thoroughly, but, from what I know, it is a high-quality dataset--moreover, there is nothing comparable of higher quality that I am aware of.

2. Usefulness of the dataset: I am not sure it is explicitly described in the paper (but if it is not, it should be), but the coverage of this dataset, in terms of number of languages, is at a level that goes well beyond any other publicly available resource to the best of my knowledge. Moreover, its attempts to interlink lexical items give it clear applications for (certain kinds of) multilingual translation and interoperation, making it clearly useful.

3. Clarity and completeness of the descriptions: I raised this point explicitly in my initial review. It is hard to understand key aspects of the PanLex conceptual data model from this paper. I think the paper can be readily revised to address this since this is almost certainly an issue regarding presentation of the project rather than resulting from a lack of internal understanding on this point.

Review #3
Anonymous submitted on 21/Feb/2013
Suggestion:
Accept
Review Comment:

This paper describes a conversion of the PanLex data collection to RDF.
PanLex is a large project that has gathered lexical data from around
2,000 different sources, mostly online dictionaries and thesauri.
It covers 20 million meanings (including duplicates, however).
For the conversion, the authors define new RDF classes and properties
corresponding to the types of tables used in the PanLex database design.
The conversion is performed on-the-fly using a relational database to
RDF mapping system, with additional links to two other Linked Data resources.

Among the data collected by PanLex, there are bound to be some sources
that have already been converted to RDF by others. Perhaps Wiktionary
or WordNet? For these, it would be good to see whether links can be created
at the Meaning level, as these are more useful than links at the
Expression level.

Overall, this is a well-written paper that serves as a very good example
of how one would go about taking an existing data source, modeling the data
in RDF, and establishing a Linked Data server. It would have been even better
if the authors had re-used more existing vocabulary, but I understand that
defining new items often is the most practical choice.

Minor issues and suggestions:
- The caption of Table 2 says that "rdfs:label" is omitted for brevity,
but it still seems to be there several times in the table.
- There is also a Lexvo paper submitted to the same journal special
issue. If both papers are accepted, a cross-reference could be added.
- In Table 3, I don't think it is necessary to use two forms of ISO codes.
- The term "approver" is introduced under "Meanings", which a reader could
easily miss. Perhaps add a separate list item for approvers so that the
definition of this term can be found more easily.