Lexvo.org: Language-Related Information for the Linguistic Linked Data Cloud

Tracking #: 420-1543

Authors: 
Gerard de Melo

Responsible editor: 
Guest editors Multilingual Linked Open Data 2012

Submission type: 
Dataset Description
Abstract: 
Lexvo.org brings information about languages, words, and other linguistic entities to the Web of Linked Data. It defines URIs for terms, languages, scripts, and characters, which are not only highly interconnected but also linked to a variety of resources on the Web. Additionally, new datasets are being published to contribute to the emerging Linked Data Cloud of Language-Related information.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Menzo Windhouwer submitted on 28/Jan/2013
Suggestion:
Minor Revision
Review Comment:

This paper is a well-written description of lexvo.org, which provides an interesting hub for language-related information in the linked data cloud.

(0) Information on the data set

General information like name, URL, and version date are given, although the specific version is not available for download on the website (http://www.lexvo.org/linkeddata/resources.html).

The license under which the linked data set is available (http://www.lexvo.org/legal.html) should be mentioned in the paper.

For some data sets or code tables the sources are not clear, e.g., the ISO 639-3 code table is downloaded from its Registration Authority (SIL) (see http://www.lexvo.org/linkeddata/references.html) but this is not stated in the paper. The authors could include an overview as given on the website in the paper.

(1) Quality of the data set

Lexvo.org integrates important but scattered data sets and code tables into the linked data cloud. The quality of the source data can vary depending on the origin; the quality of Lexvo.org lies in the mapping from one to the other, which is sound. Some of the data sets and code tables are updated regularly. The paper currently lacks a description of how Lexvo.org keeps up with those changes, e.g., is there an automatic harvesting process, how long does it take for changes in the source data to be reflected in Lexvo.org, and are there provisions for retired codes (as linked resources might still use them)?

(2) Usefulness

The current integrated set of data sources and code tables is very powerful and already functions as a hub between other linguistic data sources. The paper mentions several of these users. Some metrics and statistics on this connectivity would strengthen these claims.

Furthermore, some suggestions for additional entry points/data sets:

* many older resources still use SIL Ethnologue 14 (or older) language codes; using the code tables available at http://www.ethnologue.com/ it is possible to create mappings from 14 to 15 and thus to ISO 639-3; making version 14 codes available would help link in older data sets

* actually, http://www.sil.org/iso639-3/ retains mapping tables for codes retired due to merges; these codes might also help to link in older data sets

* I think an entry point for full @xml:lang tags (see http://tools.ietf.org/html/bcp47), e.g., "sr-Latn-RS", which represents Serbian ('sr') written using Latin script ('Latn') as used in Serbia ('RS'), would be valuable: it would make it possible to follow the information available on the various parts without needing to understand BCP 47, i.e., it would enable easier linkage to any dataset using @xml:lang
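
To illustrate the last suggestion, here is a minimal sketch of how a tag such as "sr-Latn-RS" could be split into its parts before each part is looked up separately. It is a deliberate simplification: it handles only the language, script, and region subtags and ignores the variants, extensions, private-use subtags, and grandfathered tags that BCP 47 also allows. The function name `split_language_tag` is hypothetical and is not part of Lexvo.org or its Java API.

```python
import re

# Simplified pattern (an assumption, not the full BCP 47 grammar):
# a 2-3 letter language, an optional 4-letter script,
# and an optional region (2 letters or 3 digits).
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def split_language_tag(tag):
    """Split a simple language tag into language, script, and region."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed language tag: {tag!r}")
    parts = match.groupdict()
    # Apply the conventional casing: 'sr', 'Latn', 'RS'.
    return {
        "language": parts["language"].lower(),
        "script": parts["script"].title() if parts["script"] else None,
        "region": parts["region"].upper() if parts["region"] else None,
    }


print(split_language_tag("sr-Latn-RS"))
```

Each returned part could then be resolved against the corresponding language, script, or region resource, which is the kind of linkage the suggestion has in mind.
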
(3) Clarity and completeness of the descriptions

In general the descriptions are clear. Here is some feedback to improve the paper:

* how to group languages into families is an ongoing debate; give the reader a pointer to find the background of this grouping, e.g., http://en.wikipedia.org/wiki/Language_family or http://en.wikipedia.org/wiki/List_of_language_families

* in section 3.1.1 the steps to construct a term URI are given; a small example would help to see what is going on, e.g., just a result URL like http://lexvo.org/id/term/cmn/%E6%9C%8B%E5%8F%8B

* in section 4.5 a reference to the CMU Pronunciation Dictionary is missing
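
The kind of small example asked for in the term URI item above might look like the following sketch. It assumes that a term URI is the prefix http://lexvo.org/id/term/, a language code, and the UTF-8 percent-encoded term, which is what the example URL in that item suggests. The NFC normalization step and the function name `term_uri` are assumptions made for illustration; the exact steps are those given in section 3.1.1 of the paper and may differ.

```python
import unicodedata
from urllib.parse import quote

# Prefix taken from the example URL in the review.
TERM_BASE = "http://lexvo.org/id/term"


def term_uri(language_code, term):
    """Build a term URI by percent-encoding the UTF-8 bytes of the term."""
    # Assumption: normalize to NFC so equivalent strings yield the same URI.
    normalized = unicodedata.normalize("NFC", term)
    # safe="" also encodes '/' so the term stays a single path segment.
    return f"{TERM_BASE}/{language_code}/{quote(normalized, safe='')}"


print(term_uri("cmn", "朋友"))
# prints http://lexvo.org/id/term/cmn/%E6%9C%8B%E5%8F%8B
```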

Review #2
By Jose Emilio Labra Gayo submitted on 07/Feb/2013
Suggestion:
Minor Revision
Review Comment:

The paper describes the lexvo.org datasets, which contain very useful
linguistic information as linked data. The dataset is a reference in
the multilingual linked data field and so, the paper describing it is
very interesting.

The paper describes the ontology employed, the main datasets (in my
opinion lexvo.org contains more than one dataset) and the motivation
for their inclusion.

Although the paper is readable as is, some information is missing
about the technical aspects of lexvo.org. For example, I would be
interested to know about the methodology of the creation and the data
sources that were employed. Also, the authors do not say anything about
the availability of the dataset. Is there any SPARQL endpoint?

Following the SWJ reviewers' guidelines for "Dataset Descriptions", I
think the quality of the dataset is good (although the authors do not
provide any hint on its quality), the usefulness is very good (I have
already been using lexvo.org for some projects) and the clarity and
completeness of the descriptions are normal (I think some more
examples and some description about how to consume the vocabulary
would improve the paper).

Some minor points:

Page 1, Section 2.1, 2nd paragraph. "instead of having a language
column in a database that might <> values like"

Page 2, Section 2.1. "Lexvo.org's language identifiers are used by
<> British Library..."

Page 3. The authors talk about the Java API but they don't give any
further description of it. I think some description of its methods
would at least be necessary.