Review Comment:
This paper presents a survey of patterns for publishing Multilingual Linked Data (MLD) on the Web. First, the problems of internationalisation on the Web and multilingualism in LD are introduced. Then, several patterns are proposed for naming, dereferencing, labelling, long descriptions, linking, and reusing linked data in a multilingual environment. The patterns are described comprehensively, each with a description, context (with examples), discussion, and pointers to related information in the literature. The article finishes by describing DBpedia as a use case of some of these patterns.
This work is relevant and timely for the community. The paper is very clear, very well structured, and covers the problem well. The authors' position is not to give a set of best practices but a survey of patterns, so that each MLD creator can choose the ones that best fit their needs. This is a good point and is consistent with the general discourse of the paper. Nevertheless, the abstract states that best practices for publishing MLD will be presented in the paper, which contradicts what Section 1 says about best practices (this can be fixed by simply omitting that reference in the abstract).
Here are some detailed comments that I hope will help to improve the quality of the paper:
- The authors frequently refer to multilingualism in Linked Open Data. I wonder which parts of the discussion also apply to Linked Data in general (not only to LOD). This could be briefly mentioned somewhere in the paper.
- I would reformulate the first sentence in Section 2. It reads as if there were no language barriers in the current WWW.
- Citations or footnotes describing technical concepts (like SPARQL, RDF, ASCII, etc.) should be introduced the first time that they are mentioned in the paper.
- In the introduction it is said that 4.7% of the non-information resources employ one language tag. The notion of "non-information resources" should be explained at that point.
- In Section 2.2, the sentence "Although IRI supported increases incrementally..." has to be rephrased for clarity. It would be good to add some examples of concrete technologies (RDF or SPARQL specifications?) that support IRIs and of others that do not.
- In Section 2.2, I would explain "homograph attacks" a little more (one or two sentences).
- In Section 3, I would say that numeric data are "language neutral" rather than "intrinsically multilingual". Strictly speaking, though, even this is arguable, as numeral systems are culturally dependent (see http://en.wikipedia.org/wiki/Armenian_numerals for instance).
- The penultimate paragraph in Section 3 adds very little and could be omitted.
- In the last paragraph of Section 3, the term "localized" should be explained (for readers not familiar with that expression).
- At the beginning of Section 4.1 (and later in the paper) the authors use the term "URI schemes", but I think not in its most commonly used meaning (see the RFC 3986 specification for URIs, and the examples of URI schemes at http://www.iana.org/assignments/uri-schemes.html). The term "URI scheme" should not be overloaded; for instance, the sentence "the first step in a linked data development lifecycle is to design good URI schemes" could safely be changed to "... to design good URIs".
- In Section 4.1.1 they mention "local names". The term should be defined before it is used.
- Regarding Sections 4.1.1 and 4.1.2, it has to be further clarified what the authors understand by opaque/descriptive URIs: is it the whole URI? Or is it just the local name?
- In 4.1.2, the sentence "Using opaque URIs may help to separate the concept from its different labels" is a bit confusing. It would be good to say something more about this.
- The last sentence of Section 4.2.1 is syntactically ambiguous: it is unclear which one is the "above mentioned functionality" (language or different representation of content).
- In 4.3 it is said that "Labels could be considered as units of textual information." I would remove that sentence to avoid wrong interpretations (like interpreting labels as words or as lexical entries).
- In Section 4.3.2, about the "multilingual labels" pattern, the authors wrote that "this pattern can be applied when labels have information in some natural language." I guess it should be "...several natural languages".
- In Section 4.4.1 ("divide long descriptions" pattern), they state that shorter descriptions benefit localisation, but this is not clear to me. In fact, statistical machine translation (SMT) systems typically work better with longer texts (which provide more context to disambiguate the meaning).
- In 4.4.2 ("lexical description" pattern), the authors write: "Using this pattern, we can describe the lexical content of longer descriptions". Actually, this pattern can also be applied to short labels, if richer lexical information is needed. The lemon model has to be briefly introduced before Example 11, so that this and other examples can be better understood. Also, Example 11 could be rewritten in terms of lemon only, replacing rdfs:label with lemon:writtenRep, for instance:
:University a lemon:LexicalEntry ; lemon:form [ lemon:writtenRep "University"@en].
- In the discussion of 4.4.2, they state that providing lexical metadata for a resource supports fully automated software agents. An example is needed here to illustrate this.
- Example 13 has to be reviewed: first, two separate URIs are introduced to represent Armenia, but then one of them changes when they are linked with sameAs.
- Section 4.5.3 ("add linguistic metadata") seems to overlap with 4.4.2. I think that the differences have to be emphasised. For instance, why not include lemon-based metadata here as well? Further, I would mention SKOS-XL, which reifies labels as instances of the class skosxl:Label, so that RDF assertions can be made about them.
- In 4.5.3, in addition to Lexvo, more references (and maybe a comparison) could be added to resources that can provide URIs to represent languages, such as id.loc.gov or http://www.lingvoj.org. See this interesting thread: http://lists.w3.org/Archives/Public/public-lod/2012Feb/0073.html
- The English is generally correct but has to be reviewed for typos. Some examples:
In abstract: "data usually contains labels" -> "data usually contain labels".
In Section 1: in the last paragraph, the sentence starting "Section 5 describes..." lacks a connector ("and"?) before "we describe..."
In Section 2: "Browsers supporting punycode automatically and convert the IRI to its punycode representation." Delete "and"?
Section 4: "Not all textual information attached to resources are labels and in fact," -> another comma before "in fact" would make the sentence clearer. Actually the whole sentence is very long and could be split in two.
Section 4.5.2: "...hence must be used careful" -> "...carefully".
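To make two of the technical points above concrete (the RFC 3986 sense of "URI scheme" and the punycode conversion of IRIs), here is a minimal Python sketch. The URIs, the host name bücher.example, and the helper names uri_scheme and host_to_punycode are my own illustrative assumptions and are not taken from the paper. Note also that Python's built-in idna codec follows the older IDNA 2003 rules, so browsers may behave differently for some names.

```python
from urllib.parse import urlsplit


def uri_scheme(uri: str) -> str:
    """Return the scheme in the RFC 3986 sense: the part before the first ':'."""
    return urlsplit(uri).scheme


def host_to_punycode(host: str) -> str:
    """Convert an internationalised host name to its ASCII (punycode) form.

    Uses Python's built-in 'idna' codec (IDNA 2003); hypothetical helper.
    """
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    # The scheme is only "http"; the rest is what the paper calls a "URI scheme".
    print(uri_scheme("http://example.org/resource/Armenia"))
    # Non-ASCII labels get an "xn--" prefixed ASCII encoding.
    print(host_to_punycode("bücher.example"))
```

In this sense "http" or "mailto" is a URI scheme, whereas the naming conventions discussed in Section 4.1 are better described as URI design or URI patterns.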