Dbnary: Wiktionary as a Lemon Based RDF Multilingual Lexical Ressource

Tracking #: 401-1509

Authors: 
Gilles Sérasset

Responsible editor: 
Guest editors Multilingual Linked Open Data 2012

Submission type: 
Dataset Description
Abstract: 
Contributive resources, such as wikipedia, have proved to be valuable in Natural Language Processing or Multilingual Information Retrieval applications. This work focusses on Wiktionary, the dictionary part of the collaborative resources sponsored by the Wikimedia foundation. In this article, we present our effort to extract Multilingual Lexical Data from wiktionary data and to provide it to the community as a Multilingual Lexical Linked Open Data (MLLOD). This lexical resource is structured using the LEMON Model. This data, called "dbnary", is registered at http://thedatahub.org/dataset/dbnary.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Jorge Gracia submitted on 07/Feb/2013
Suggestion:
Minor Revision
Review Comment:

This short paper describes Dbnary, an effort to extract multilingual lexical data from wiktionary and to provide it to the community as Linked Open Data. The goal is to create a lexical resource structured as a set of monolingual dictionaries and their corresponding bilingual translations. The system reported here won the first prize in the Monnet Challenge that was organised at the MLODE'12 workshop. The paper is clear and well structured. It is of "dataset description" type and fits into the topics of the special issue very well.

Dbnary uses the lemon model to structure and represent the extracted data from wiktionary in RDF. This is one of the first "practical" realisations of lemon beyond its community of creators, thus constituting a valuable experience for those interested in representing lexical information as linked data on the Web. The author, though, found that lemon did not meet some of his expectations for representing lexical data, and extended the model with some additional ingredients under the Dbnary namespace. For instance, lemon does not contain mechanisms to represent direct translations between lexical senses as a reified relation. In fact, some proposals have been made in that direction (see E. Montiel-Ponsoda et al., "Representing Translations on the Semantic Web", Proc. of the 2nd Workshop on the Multilingual Semantic Web, at ISWC'11), but they were not available at the time Dbnary was developed.

Further, the author did not use the lemon properties between LexicalSenses (such as synonymy) because, in his words, "LEMON assumes that all data is well formed and fully specified […]. While this is correct to assume as a principle, this does not account for the huge amount of legacy data that is available in dictionaries and lexical databases." In order to cope with this legacy data, the author extended lemon by adding new classes and properties, such as "Vocable", "LexicalEntity", "Nyms", etc. Nevertheless, this is arguable, and I did not find strong reasons in the paper against using lemon ingredients in many of these cases. For instance, the dbnary:LexicalEntity class, which is the union of lemon:LexicalEntry and lemon:LexicalSense, was defined as a way to cope with underspecified lexical relations in legacy systems, which are not clearly specified as holding either between lexical entries or between lexical senses. In my opinion, other design decisions could have been adopted at this point to solve this issue: for instance, lexico-semantic relations (e.g., synonymy) in the legacy source could have been defined as relations between lexical senses in lemon, no matter how they were (under)specified in the legacy source.

Also, the translation is given as a string through the dbnary:writtenForm property. Why not define new written representations in lemon, associated with their corresponding lexical entries, for the target translations? There is also a dbnary:glose property associated with a translation. Maybe lemon:definition (as a property of the lexical senses involved in the translation) would have been a better option.
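To illustrate what I have in mind, here is a minimal sketch (in Python with rdflib, assuming the lemon core namespace http://lemon-model.net/lemon#; the entry URIs and the translation-linking properties are purely illustrative, not the actual Dbnary vocabulary) of how the target of a translation could be modelled as a lemon lexical entry with its own written representation instead of a plain string:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

LEMON = Namespace("http://lemon-model.net/lemon#")
EX = Namespace("http://example.org/")  # illustrative namespace, not the Dbnary one

g = Graph()

# Source entry and sense for the English entry "cat"
cat = EX["cat__Noun__1"]
cat_sense = EX["cat__Noun__1__sense_1"]
g.add((cat, RDF.type, LEMON.LexicalEntry))
g.add((cat, LEMON.sense, cat_sense))
g.add((cat_sense, RDF.type, LEMON.LexicalSense))

# The French target is itself a lemon lexical entry with a written
# representation, instead of a plain string on the translation node.
chat = EX["chat__Noun__1"]
chat_form = EX["chat__Noun__1__canonicalForm"]
g.add((chat, RDF.type, LEMON.LexicalEntry))
g.add((chat, LEMON.canonicalForm, chat_form))
g.add((chat_form, LEMON.writtenRep, Literal("chat", lang="fr")))

# Hypothetical reified translation relation between source sense and target entry
trans = EX["translation_cat_chat"]
g.add((trans, EX.translationSource, cat_sense))
g.add((trans, EX.translationTarget, chat))

print(g.serialize(format="turtle"))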

Despite the above, it is clear that the author adopted a pragmatic solution for this initial version of Dbnary which, in my opinion, is fine for its current purposes. In fact, Dbnary is at an early stage and will evolve with time, so it can still be further enriched as well as more tightly coupled with lemon if necessary. I would appreciate, though, some additional comments on the role of lemon in the future of Dbnary, as well as some additional counterexamples for the cases in which lemon cannot be used.

A few more (minor) comments:
- dbnary:writtenForm (a property of Translation) could have a more significant name like dbnary:targetWrittenForm.
- The definition that appears in the paper for "dbnary:glose" should be rephrased for clarity.
- In section 2.1, second paragraph, in the example "(e.g. {fr} precedes French translations)", are the translations FROM French or INTO French?
- English is good, although a further review for typos is needed (e.g., in Section 4: "Table 1 give" -> "Table 1 gives"; in Section 5: "This resources describes" -> "These resources describe" or "This resource describes")

Review #2
By Judith Eckle-Kohler submitted on 24/Feb/2013
Suggestion:
Minor Revision
Review Comment:

This paper describes Dbnary, a Semantic Web dataset consisting of Wiktionary data from 6 different Wiktionary language editions. Dbnary is represented in the lemon lexicon model, which had to be extended in order to account for Wiktionary-specific characteristics.

Quality of the dataset
The Dbnary-specific URIs given in the ontology provided at http://kaiko.getalp.org/about-dbnary/lemon/dbnary.owl are not properly dereferenceable, i.e. they do not link to data about the dbnary-specific extensions of the lemon model (instead, they link to the dbnary homepage, see e.g. http://kaiko.getalp.org/about-dbnary/#senseNumber).

The dataset has not been linked to other LOD datasets.
While links to other datasets could be created in a separate effort, I see an issue with the Dbnary senseNumber which is likely to change between different Wiktionary dumps. It would be interesting to see if and how different Dbnary versions could be linked to the LLOD cloud at the sense level.

Usefulness (or potential usefulness) of the dataset
Medium. Currently, the dataset is not linked to other datasets in the LLOD cloud.
Yet, Dbnary provides translation data from 6 different language editions of Wiktionary, which could be very useful in NLP applications.

Clarity and completeness of the descriptions
Overall, the paper is well organized and clearly written.

I was wondering about a small contradiction between sec. 2.1 and sec. 2.2:
- sec. 2.1: "we decided that we would extract as much data as we can from wiktionary"
- sec. 2.2: "The monolingual data is always extracted from its own wiktionary language edition. For instance, the French lexical data is extracted from French language edition. Hence, we completely disregard the French data that may be found in other language editions."

Regarding the decision to disregard data from other languages in each particular language edition:
Was this decision based on some quantitative analysis of the data from other languages?

The paper needs to be revised for minor spelling and grammar issues.

Consistent use of terminology should be improved, e.g.
Contributive resources vs collaborative resources
Wiktionary vs wiktionary

Review #3
By Sebastian Hellmann submitted on 12/Mar/2013
Suggestion:
Major Revision
Review Comment:

The submitted paper talks about a data set extracted from several language editions of the freely available Wiktionary Wikimedia project. The data is converted via a software framework, made available online under open licenses and hosted as Linked Data.

The data set has good uptime, and everything I tested was technically working and at a very high level. Also the usage of the lemon vocabulary, where applicable, seems semantically correct. Still, there are some issues with the paper as well as with the data set, which I will outline in the remaining sections.
The paper should receive a major revision for not including relevant information in the text. The work as such is quite good, so I have a good feeling that the author will succeed in submitting an acceptable revision of the paper.

# Details
## The title seems to have a spelling mistake: English is "resource". The double "ss" is the French (and also German) spelling of resource.

## Vocabulary
I was not able to access http://kaiko.getalp.org/dbnary#Vocable, and therefore I was not able to look at the lemon extension created by dbnary.
Getting this straight is always a nuisance, especially with linked data and '#' URIs, as the part after '#' gets cut away during the HTTP request. Maybe you can copy some .htaccess rules from here to do static file hosting of the schema: https://github.com/NLP2RDF/persistence.uni-leipzig.org/blob/master/ontol...
Switching to '/' URIs might solve the problem as well.
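To make the '#' problem concrete, here is a small sketch (Python with requests; the Accept header values are just common choices) showing that the fragment never reaches the server, so the whole vocabulary document has to be served, or redirected to, for any hash URI:

import requests

uri = "http://kaiko.getalp.org/dbnary#Vocable"

# Everything after '#' is stripped by the client before the request is sent,
# so the server only ever sees the document part of the URI.
doc = uri.split("#")[0]

resp = requests.get(doc,
                    headers={"Accept": "text/turtle, application/rdf+xml"},
                    allow_redirects=True)
print(resp.status_code, resp.headers.get("Content-Type"))
# A dereferenceable vocabulary would answer with an RDF media type here,
# not with the HTML homepage.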

- I noticed, however, that dbnary:Vocable is capitalized in the paper, but in the data it is dbnary:vocable. What exactly is the difference between lemon:LexicalEntry and dbnary:Vocable? Subclassing should imply that there exist resources that are of type lemon:LexicalEntry but not of type dbnary:Vocable. Is this the case in the dbnary data set? Otherwise, the only distinguishing criterion would be that Vocables were extracted from Wiktionary, but this might not justify an extra OWL class.
- dbnary:Equivalent might be quite misleading, as it is not clear what exactly is equivalent. In translations, the "equivalence" relation does not normally hold between source and target. For the words "cat"@en, "chat"@fr and "Katze"@de, the expected gender does not match in a way I would consider "equivalence", i.e. the complete agreement of all properties (with my Leibniz hat on). Why not simply call it dbnary:Translation?
- The explanation for dbnary:glose is lost on me: "used to dentate the lexical sense of the source of the equivalent"
- dbnary:targetLanguage could link to lexvo.org instead of being a literal
- Figures 2 and 3 are quite confusing, as they display a class dbnary:LexicalEnty, and I am not sure whether this should be "Entry" or "Entity".

## Usefulness
The usefulness of the data is obvious. The paper would gain a great deal if the use cases were made explicit and really well explained. Please clarify why and for what exactly GETALP and LIG need this data; it remains quite vague in the paper. Could you expand upon this and maybe even give a concrete example? This would help motivate the work you did on extracting RDF from Wiktionary. Do you already have any hits on your endpoint or linked data interface, or any reported usage of your resource?

## Quality
The data you submitted in the paper is quite raw, and I am not really able to judge the quality of your extracted data. One way to improve upon this is to elaborate on what you are using the data for (see the section above). It might be possible to deduce that your data quality is sufficient to be useful in certain use cases and NLP methods.
Several other projects and approaches have tried to evaluate their Wiktionary extraction. Some of these statistics would be nice for dbnary as well:
http://code.google.com/p/wikokit/#Statistics
http://svn.aksw.org/papers/2012/JIST_Wiktionary/public.pdf
http://downloads.dbpedia.org/wiktionary/stats_2013_04_06.csv
created with this query: Select ?g ?p count(?p) as ?count where { Graph ?g { ?s ?p ?o } } group by ?p ?g order by desc (?g) desc(?count)
http://www.igi-global.com/chapter/ontowiktionary-constructing-ontology-c... (not openly available online)
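As a sketch of how such per-graph/per-property statistics could be computed directly on your endpoint (Python with requests against http://kaiko.getalp.org/sparql, using a standards-compliant rewrite of the Virtuoso-style query above; note that a full scan over all graphs may well hit the endpoint's result or time limits):

import requests

ENDPOINT = "http://kaiko.getalp.org/sparql"

QUERY = """
SELECT ?g ?p (COUNT(?p) AS ?count)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g ?p
ORDER BY DESC(?g) DESC(?count)
"""

resp = requests.get(ENDPOINT,
                    params={"query": QUERY},
                    headers={"Accept": "application/sparql-results+json"})
resp.raise_for_status()

# Print one line per (graph, property) pair with its triple count
for row in resp.json()["results"]["bindings"]:
    print(row["g"]["value"], row["p"]["value"], row["count"]["value"])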

You might even have a look here to compare with official stats:
http://meta.wikimedia.org/wiki/Wiktionary

I am writing this because I found Table 3 suboptimal. Ordering alphabetically by ISO code is confusing. The table as such takes up a lot of space but does not really provide any insight. It might be better moved to an HTML page at http://kaiko.getalp.org/about-dbnary/ as extended statistics.

There is a way to query the MediaWiki api for interwiki links. See e.g.
http://en.wiktionary.org/w/api.php
http://en.wiktionary.org/w/api.php?action=parse&page=flight&format=json

"parse":{
"title":"flight",
"revid":20059852,
...
"iwlinks":[
...
{
"prefix":"fr",
"url":"http://fr.wiktionary.org/wiki/vol",
"*":"fr:vol"
},
{
"prefix":"fr",
"url":"http://fr.wiktionary.org/wiki/fuite",
"*":"fr:fuite"
},
...

This might be used to evaluate Table 3 and your extractor for translations (not perfect, of course, but it would help).
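For example, a small sketch (Python with requests; the JSON layout follows the excerpt above) that collects the fr: interwiki links of an English Wiktionary page, which could then be compared with the French translations dbnary extracts for that entry:

import requests

API = "http://en.wiktionary.org/w/api.php"

def french_iwlinks(page):
    """Return the fr: interwiki links reported by the parse API for one page."""
    resp = requests.get(API, params={"action": "parse",
                                     "page": page,
                                     "format": "json"})
    resp.raise_for_status()
    iwlinks = resp.json().get("parse", {}).get("iwlinks", [])
    return [link["*"] for link in iwlinks if link.get("prefix") == "fr"]

print(french_iwlinks("flight"))   # e.g. ['fr:vol', 'fr:fuite', ...]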
I am fully aware that evaluation is quite difficult. I don't expect that you will realize all the things I wrote above, but I would definitely require that the second submission contains a better evaluation of quality in some form.

## Namespaces
The namespaces are not resolved in the paper. They should be resolved at least once so that readers know what dbnary stands for. Daily votes at http://prefix.cc/dbnary also help establish this namespace.

## Interlinking and related work
There is also a similar approach as a subproject of DBpedia, called Wiktionary2RDF.
Sebastian Hellmann, Jonas Brekle, Sören Auer: Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud. JIST 2012, http://svn.aksw.org/papers/2012/JIST_Wiktionary/public.pdf
Note that I am not trying to coerce a citation (http://en.wikipedia.org/wiki/Coercive_citation). Your paper is about your data set, and the criteria are quality, usefulness and completeness of description, not so much comparison with other approaches. However, mutual links between the data sets would be useful. There might even be a way to merge or fuse both approaches in the future, although I am not yet sure how.

We are also double-typing some resources as LexicalEntry and LexicalSense, so schematically the dbnary-lemon extension would be applicable to this project as well.

## Interlinking part 2
Is the data set interlinked to anything?
The query ASK { ?s owl:sameAs ?o } on http://kaiko.getalp.org/sparql returns false.

## Technical details about the data:
### Seems like http://datahub.io/dataset/dbnary/resource/2002de88-2f86-48c6-a24c-f70d0e... was uploaded to CKAN. Although technically available to everyone, this feature was created for people without hosting capabilities (e.g. researchers from the humanities) and should be used with care.

### Linked Data is working for URIs. IRIs do not seem to be supported, compare:
curl -L -H "Accept: text/rdf+n3" "http://de.dbpedia.org/resource/Rüdiger"
with
curl -L -H "Accept: text/rdf+n3" "http://kaiko.getalp.org/dbnary/fra/thésaurus"
There has been a proposal for Transparent Content Negotiation rules here:
Internationalization of Linked Data. The case of the Greek DBpedia edition, Dimitris Kontokostas, Charalampos Bratsas, Sören Auer, Sebastian Hellmann, Ioannis Antoniou, George Metakides, Journal of Web Semantics: http://www.websemanticsjournal.org/index.php/ps/article/view/319

However, it is difficult to set up if the tool does not support it out of the box. Virtuoso can be configured to do so, but it is not an easy task.
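A possible client-side workaround, as a sketch (Python with requests; whether your server then answers correctly for the percent-encoded form is exactly what would need to be checked): percent-encode the non-ASCII local name so the request is sent as a plain URI:

import requests
from urllib.parse import quote

iri = "http://kaiko.getalp.org/dbnary/fra/thésaurus"

# Percent-encode the non-ASCII local name (UTF-8), leaving the rest untouched
prefix, local = iri.rsplit("/", 1)
uri = prefix + "/" + quote(local)
print(uri)  # http://kaiko.getalp.org/dbnary/fra/th%C3%A9saurus

resp = requests.get(uri,
                    headers={"Accept": "text/rdf+n3"},
                    allow_redirects=True)
print(resp.status_code, resp.headers.get("Content-Type"))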

### The graph name in Virtuoso contains a trailing '/' (http://kaiko.getalp.org/dbnary/). I am not sure what the best practice is here. Personally, I prefer no '/', but I am not insisting, rather asking.

## Minor
### The article contains a lot of capitalization issues:
- Wikipedia as well as Wiktionary should always be capitalized
- Multilingual Information Retrieval -> multilingual Information Retrieval
- extract Multilingual Lexical Data -> multilingual lexical data
- Why is Wikimedia italic?
- (Study group -> (study group

### Please check: IIRC, footnotes go after punctuation:
language edition^2. -> language edition.^2


Comments

A resubmission of this article has been uploaded with tracking number 504:

http://www.semantic-web-journal.net/content/dbnary-wiktionary-lemon-base...