Review Comment:
Disclaimer: this review was written together with Bettina Klimek.
The present paper "The Apertium Bilingual Dictionaries on the Web of Data" describes a Linked Data dataset based on the Apertium family of bilingual dictionaries. The aim of the paper lies in the presentation of the development and usage of the Apertium RDF dataset as a Linked Data conversion of 22 Apertium LMF-in-XML bilingual dictionaries. Overall, the paper is understandable and well written. Furthermore, the following **positive aspects** can be enumerated:
* for the dataset URLs, a dedicated website, a SPARQL endpoint and a web portal allowing human-readable search of translations are provided and working
* direct download of and access to the datasets are enabled and working via an external repository (datahub.io)
* the data has always been available when accessed
* SPARQL query results are provided in Linked Data and various other formats
* the data is integrated into the LLOD cloud and linked to other datasets; the number of internal and external links is given and correct
* metadata stating the creator, license and source of the data is explicitly declared in RDF
* the creation method described follows best practices and standards for Linked Data dataset creation
* the usage of the data is fully described
* added value has been proposed by using Apertium RDF as a multilingual dataset in contrast to the bilingual source data
The following aspects of the paper require **major revision**:
*1. vocabulary use*
For the conversion of the Apertium dictionaries, a representation model has been presented which consists of the two well-established *lemon* and LexInfo vocabularies and the less-established vocabulary of the *lemon* translation module. A detailed investigation of the representation model with regard to the aim of the paper reveals that it is highly appropriate for describing lexical translations between two or more languages. However, from what has been practically undertaken it seems that the model has not been fully used, which is due to the information in the source data. As the authors correctly state, “translations occur between specific meanings of the words” (cf. p.4), but the underlying LMF model does not provide specific meanings. Rather, the provided Sense IDs (cf. example on p.2) state only two orthographic representations of words in two different languages which are supposed to share the same meaning. Nevertheless, there is neither explicit information about the content of the meaning nor a relation stating the semantic similarity between the words. As a consequence, the Apertium RDF (without any external links) provides no meanings according to the proposed usage of the *lemon* translation model (as described in J. Gracia et al. “Enabling language resources to expose translations as linked data on the web.” 2014). Thus, the lemon:LexicalSense resources do not point to an ontological entity via lemon:reference but rather serve as a kind of placeholder. What is more, the only properties of tr:Translation used in the data are tr:translationSource and tr:translationTarget, thus omitting the also explained tr:context and the especially important tr:translationCategory properties. This rather insufficient usage of the available vocabulary reveals that the Apertium RDF datasets are a mere transformation from LMF XML to RDF that adopts the flaws of the original dataset. The authors are advised to critically discuss the points just mentioned and to justify their actual vocabulary usage.
*2. Usage of Apertium RDF as multilingual language resource*
In the paper, the authors introduce an additional value of Apertium RDF in contrast to the original Apertium bilingual dictionaries in that the Linked Data transformation results in a (potential) multilingual dataset. However, the quality of the obtained indirect translations between languages, which are traversed via a pivot language, is not convincing. It is comprehensible that the One Time Inverse Consultation (OTIC) method has been chosen to propose a way of identifying correct indirect translation candidates, given that the data does not contain explicit sense references. Nonetheless, an enrichment of such references has been undertaken by adding BabelSynset resources to the lexical senses, which enables a more straightforward approach to creating further direct multilingual translation links. The fact that many translations are linked to BabelSynsets which are identical for both the translation target and the translation source enables the introduction of the translation categories. That means that for each translation resource which fulfils this condition, a translational equivalence relation could have been stated, e.g.:
apertium:tranSetEN-ES/bench_banco-n-en-sense-banco_bench-n-es-sense-trans a
tr:Translation ;
tr:translationSource apertium:tranSetEN-ES/bench_banco-n-en-sense ;
tr:translationTarget apertium:tranSetEN-ES/banco_bench-n-es-sense ;
**tr:translationCategory trcat:directEquivalent .**
With this information, correct indirect translations are obtainable without any scoring measure or threshold filter. This can be shown by taking the same example as proposed. The task was to find the correct Catalan translation for the Spanish word “banco” by using English as the pivot language (cf. p.8). What is known from traversing the two direct ES-EN and CA-EN Apertium RDF graphs is that “banco”ES has the two direct translations “bank” and “bench” in English, that “bank”EN has the two direct translations “banc” and “riba” in Catalan, and that “bench”EN translates directly to “banc” in Catalan, resulting in five translation pairs altogether. Looking up the lexical senses of those pairs reveals that for each translation (except for the translation of “bank”EN and “riba”CA, which has no BabelSynset) the lexical senses of both the translation source and the target point to the same BabelSynset, meaning that they are direct translation equivalents. The manual compilation of the bilingual dictionaries adds to the quality of these sense references. Under the assumption that BabelSynsets are senses and therefore language-independent concepts, it holds true that each translation that shares the same underlying concept is a direct translation of expressions in two different languages. Consequently, the correct translation of “banco”ES into Catalan can only be the one which shares the same BabelSynset(s) with an English word as the Spanish word “banco”. As the query showed, the translation pairs “banc”CA-“bank”EN and “banco”ES-“bank”EN share the same BabelSynset () and “banco”ES-“bench”EN and “bench”EN-“banc”CA share the same BabelSynset (). This means that the concept defined in the resource is encoded in the three expressions “bank”EN, “banco”ES and “banc”CA and the concept defined in the resource is encoded in the three expressions “bench”EN, “banco”ES and “banc”CA. That there are two concepts involved is not problematic. All this states is that there are two concepts which are encoded with the same expression in Catalan and Spanish but with two different expressions in English. What matters is that the same BabelSynset is shared between at least one EN-ES and EN-CA translation pair of “banco”ES, which applies to “banc”CA in this case. This is the same result as calculated by the authors with the OTIC method but with higher precision. This synonymy investigation and matching could have been undertaken by the authors. Of course, this is only a reliable method provided that the links to the BabelSynsets are correct. Since the authors omit an explanation of how these links have been created and of what quality they are, the correctness of the created links cannot be judged. The authors are advised to add this missing information.
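The sense matching described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the five translation pairs are those named in this review, while the identifiers `synset:A` and `synset:B` are hypothetical placeholders (not real BabelNet IDs), and the function name `pivot_translations` is invented here. A pair whose source and target senses do not share a BabelSynset is marked `None`.

```python
# (language pair, non-English word, English pivot word) -> shared synset.
# "synset:A" / "synset:B" are hypothetical placeholders, NOT real BabelNet IDs.
# None means source and target senses share no BabelSynset.
SHARED_SYNSET = {
    ("es-en", "banco", "bank"): "synset:A",
    ("es-en", "banco", "bench"): "synset:B",
    ("ca-en", "banc", "bank"): "synset:A",
    ("ca-en", "banc", "bench"): "synset:B",
    ("ca-en", "riba", "bank"): None,
}


def pivot_translations(word, source_pair, target_pair, shared=SHARED_SYNSET):
    """Return target-language words that share a synset with `word`
    on both hops of the same English pivot word."""
    found = {}
    for (pair_s, word_s, pivot_s), syn_s in shared.items():
        if pair_s != source_pair or word_s != word or syn_s is None:
            continue
        for (pair_t, word_t, pivot_t), syn_t in shared.items():
            # Same pivot word and same synset on both sides of the pivot.
            if pair_t == target_pair and pivot_t == pivot_s and syn_t == syn_s:
                found.setdefault(word_t, set()).add(syn_s)
    return {w: sorted(s) for w, s in found.items()}


print(pivot_translations("banco", "es-en", "ca-en"))
# {'banc': ['synset:A', 'synset:B']}
```

Under these assumptions “riba”CA is never returned, because its only link to the pivot “bank”EN carries no shared synset; no score or threshold is involved.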
Overall, the current state of multilinguality in the Apertium RDF dataset carries a degree of uncertainty in the obtained translations, which reduces the quality of the multilingual data presented. Also, no use of the calculated multilingual translations by third parties in machine translation could be demonstrated.
Besides these two major issues, the following **minor** points also require **revision**:
*a) version information*
- Given that future work shall include an investigation of the quality of the dataset and future changes/extensions might occur, the authors are advised to add a version number to the files.
*b) http://linguistic.linkeddata.es/def/translation/lemonTranslation.owl contains Turtle, not RDF/XML*
i: curl -L -H "Accept: text/turtle" http://purl.org/net/translation redirects to the .owl file
ii: better to use owl:subClassOf than rdfs:subClassOf
*c) wrong content-type header*
- should be text/turtle instead of application/x-turtle, see http://www.w3.org/TR/turtle/#sec-mime
curl -I -H "Accept: text/turtle" -L http://linguistic.linkeddata.es/id/apertium/tranSetEN-ES
HTTP/1.1 303 See Other
Date: Tue, 08 Dec 2015 11:54:57 GMT
Server: Apache-Coyote/1.1
Vary: Accept,User-Agent,Accept-Encoding
Location: http://linguistic.linkeddata.es/data/id/apertium/tranSetEN-ES
Content-Type: text/plain
Content-Length: 114
Via: 1.1 linguistic.linkeddata.es
HTTP/1.1 200 OK
Date: Tue, 08 Dec 2015 11:55:02 GMT
Server: Apache-Coyote/1.1
Vary: Accept
Content-Type: application/x-turtle
Content-Length: 5451525
Via: 1.1 linguistic.linkeddata.es
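The check in c) can be repeated offline on a saved header dump. The following Python sketch is an assumption-laden illustration: the helper name `final_content_type` is invented here, and the `RAW` string is an abridged copy of the curl output quoted above. It picks the Content-Type of the last response in the redirect chain and compares it with the registered Turtle media type.

```python
def final_content_type(raw_headers):
    """Return the media type of the last response in a `curl -I -L` dump."""
    content_type = None
    for line in raw_headers.splitlines():
        line = line.strip()
        if line.upper().startswith("HTTP/"):
            # A new response in the redirect chain starts here.
            content_type = None
        elif line.lower().startswith("content-type:"):
            value = line.split(":", 1)[1]
            # Drop parameters such as "; charset=utf-8".
            content_type = value.split(";", 1)[0].strip().lower()
    return content_type


# Abridged from the curl output quoted in this review.
RAW = """HTTP/1.1 303 See Other
Location: http://linguistic.linkeddata.es/data/id/apertium/tranSetEN-ES
Content-Type: text/plain

HTTP/1.1 200 OK
Content-Type: application/x-turtle
"""

REGISTERED_TURTLE = "text/turtle"  # media type registered by the Turtle spec
served = final_content_type(RAW)
print(served, served == REGISTERED_TURTLE)
# application/x-turtle False
```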
*d) curl -I -L http://linguistic.linkeddata.es/id/apertium/tranSetEN-ES redirects to Location:http://linguistic.linkeddata.es/page/id/apertium/tranSetEN-ES*
- Why /page/id and not just /page?
*e) wrong link redirects to lemon*
- the *lemon* vocabulary links used in the data (e.g. lemon:LexicalSense here:http://linguistic.linkeddata.es/page/id/apertium/tranSetEN-ES/bank_bankD...) redirect to a page resulting in a 404 error. All lemon URIs should be checked so that http://lemon-model.net/lemon#Form is used instead of http://www.lemon-model.net/lemon#Form (note the www. subdomain)
*f) on section 5.2*
- The calculation with the equation on page 8 with the examples in Fig.4 results in a score of 0.66 for "riba"@ca and not 0.5. If the given score results are based on information not shown in Fig. 4 this should be explicitly stated or the necessary information added to the figure.
- With regard to the table here http://figshare.com/download/file/2201205/1 precision for threshold = 1 ranges from 61% - 83%.
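The recalculation in the first point of f) can be reproduced with a short Python sketch. It assumes that the equation on page 8 is the usual Dice-style OTIC score, i.e. twice the number of shared pivot translations divided by the sum of both pivot translation counts; the paper's equation is not reproduced in this review, so this is an assumption. The input sets are the “banco” pairs named under major point 2, and the function name `otic_score` is invented here.

```python
def otic_score(source_pivots, candidate_pivots):
    """Dice-style overlap between two sets of pivot-language translations.

    Assumed form of the OTIC score: 2 * |common| / (|source| + |candidate|).
    """
    total = len(source_pivots) + len(candidate_pivots)
    if total == 0:
        return 0.0
    common = source_pivots & candidate_pivots
    return 2 * len(common) / total


# English translations of "banco"@es, and English (inverse) translations
# of each Catalan candidate, as listed in this review.
banco_en = {"bank", "bench"}
inverse_en = {"banc": {"bank", "bench"}, "riba": {"bank"}}

for candidate, pivots in inverse_en.items():
    print(candidate, round(otic_score(banco_en, pivots), 2))
# banc 1.0
# riba 0.67
```

With these inputs “riba”@ca obtains 2·1/(2+1) = 2/3, which is the value cited above as 0.66, and not 0.5.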
*g) http://linguistic.linkeddata.es/def/translation-categories is not well-formed Turtle*
- it mixes qnames with angle brackets: rdfs:label "direct equivalent"@en ;
*h) An update of the Virtuoso version which enables SPARQL 1.1 would be appreciated.*
*i) Orthography and mode of expression*
- Linked Data is a proper noun and is to be written capitalized.
- The whole paper should be checked for minor mistakes, e.g. “As result of..” → “As a result of..” (p.1 and p.6), “..described in the remainder if this section” → “..described in the remainder of this section” (p.4).
**Summary:**
Overall, the Apertium RDF dataset presented in this paper is of reasonable quality and provides references to all resources involved as well as to the evaluation results. Further, the data complies with the [five star rating for Linked Open Data](http://www.w3.org/DesignIssues/LinkedData.html) and is a valuable addition of linguistic resources, including currently underrepresented languages, to the Web of Data. The representation model is well chosen and sufficiently explained. Also, the RDF generation process is clearly described and follows W3C best practices and standards for creating multilingual Linked Open Data. With regard to the vocabulary choice and the described Linked Data generation method, the Apertium RDF dataset can therefore be seen as a showcase for other linguistic Linked Data datasets. The aim of the authors to convert the initial Apertium bilingual dictionaries into RDF has been fulfilled. However, with regard to the actual usage of the vocabulary and the unified graph as a multilingual dataset extension, the full potential of the Semantic Web technologies described in the paper is not exploited. With regard to the [Linked Data vocabulary use rating](http://www.semantic-web-journal.net/system/files/swj653.pdf), the applied *lemon* translation module achieves four out of five stars due to missing links pointing to the dataset. In order to raise the usability and the quality of the Apertium RDF dataset, it is proposed to clearly state the known shortcomings of the original Apertium data and to extend the current Linked Data with triples identifying translation equivalents. Further, the generation method of the BabelSynset links and an evaluation of their quality should be included. Additionally, a sense matching as proposed could also be undertaken and its results could be evaluated in comparison to the OTIC method already applied (optional).
Therefore, the paper is rated as a minor reject and the authors are strongly encouraged to revise the paper according to the major and minor critical aspects outlined. Having done that, it is more likely that third parties will recognize, and thus make proper use of, the Apertium RDF dataset.