Dbnary: Wiktionary as a Lemon Based RDF Multilingual Lexical Resource

Tracking #: 504-1702

Gilles Sérasset

Responsible editor: 
Guest editors Multilingual Linked Open Data 2012

Submission type: 
Dataset Description
Contributive resources, such as Wikipedia, have proved to be valuable to Natural Language Processing or multilingual Information Retrieval applications. This work focusses on Wiktionary, the dictionary part of the resources sponsored by the Wikimedia Foundation. In this article, we present our effort to extract multilingual lexical data from Wiktionary and to provide it to the community as Multilingual Lexical Linked Open Data (MLLOD). This lexical resource is structured using the LEMON Model. The dataset, called dbnary, is registered at http://thedatahub.org/dataset/dbnary.

Minor Revision

Solicited Reviews:
Review #1
By Judith Eckle-Kohler submitted on 31/Jul/2013
Minor Revision
Review Comment:

The author addressed my comments. I appreciate in particular that he has pointed out the problem of potentially changing URIs of lexical senses in different Wiktionary dumps along with a possible strategy to further investigate this issue (by way of diachronic studies across Wiktionary dumps of a longer time span).

However, there is a minor issue left:
The proper reference for UBY is not Zesch et al. (2008), which is the reference for JWKTL, the Java API to Wiktionary developed at UKP Darmstadt, but this one:

author = {Iryna Gurevych and Judith Eckle-Kohler and Silvana Hartmann and Michael
Matuschek and Christian M. Meyer and Christian Wirth},
title = {Uby - A Large-Scale Unified Lexical-Semantic Resource Based on LMF},
booktitle = {Proceedings of the 13th Conference of the European Chapter of the
Association for Computational Linguistics (EACL 2012)},
year = {2012},
pages = {580--590},
month = {Apr},
location = {Avignon, France},
pdf = {fileadmin/user_upload/Group_UKP/publikationen/2012/uby_eacl2012_cameraready.pdf},
pubkey = {TUD-CS-2012-0023},
research_area = {Ubiquitous Knowledge Processing},
research_sub_area = {UKP_p_QAEL, UKP_p_EduWeb, UKP_a_ENLP, UKP_p_UBY, UKP_p_InCoRe},
website = {www.ukp.tu-darmstadt.de/uby}

Review #2
By Jorge Gracia submitted on 12/Aug/2013
Minor Revision
Review Comment:

As I wrote in my first review, this short paper fits the topics of the special issue very well, as a "dataset description" paper. In this version, many typos were corrected, details were added, and more recent data were provided.

I still miss, though, a more critical analysis of the chosen representation scheme and how the authors expect to evolve it in the future. In fact I have the feeling that they are underutilising lemon, and creating some weird constructs such as the union between LexicalSense and LexicalEntry (see my previous review). I understand, though, that some pragmatic temporary solutions were adopted at the beginning of this work, which is perfectly OK. But I miss more details about how the model will evolve (towards lemon or not).

Review #3
By Sebastian Hellmann submitted on 12/Dec/2013
Minor Revision
Review Comment:

Overall, the quality of the paper has improved a lot and the issues raised have been addressed. I also checked all the technical issues; the database now uses IRIs and responds well.

Section "2.2 Scope..." now describes the usefulness of the system. The link to http://blexisma.ligforge.imag.fr seems to be missing, but would be useful.

Furthermore the quality evaluation has been addressed in an optimal way. According to my judgement the measures "comparison to the MediaWiki API" and the "evaluation of time slices" are two very good ideas and help to sustainably track data quality.

The description is now also easy to understand, clear, and complete.

Below are some minor comments, which can be fixed quite fast:

I am still unsure about the class "LexicalEntity". Are there any advantages to the current definition, i.e. when querying the model?
If it were just for the rdfs:domain and rdfs:range of properties, then one could just define the owl:unionOf there without introducing a new class.
However, this is more a question out of interest and not a request to change the ontology.
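
To illustrate the alternative mentioned above, here is a minimal sketch in Python (standard library only) that prints a Turtle fragment giving a property an anonymous owl:unionOf domain instead of a named union class. The property name ex:someProperty, the ex: namespace, and the lemon: namespace IRI are assumptions made for illustration; only the class names LexicalEntry and LexicalSense come from the reviews.

```python
# Prefix IRIs: rdfs and owl are the standard W3C namespaces.
# The lemon IRI is assumed, and ex: is a placeholder namespace.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "lemon": "http://lemon-model.net/lemon#",  # assumed namespace IRI
    "ex": "http://example.org/ns#",            # hypothetical namespace
}


def union_domain(prop, classes):
    """Return Turtle that gives `prop` an anonymous owl:unionOf domain."""
    header = "\n".join(
        f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()
    )
    members = " ".join(classes)
    body = (
        f"{prop} rdfs:domain [\n"
        f"    a owl:Class ;\n"
        f"    owl:unionOf ( {members} )\n"
        f"] ."
    )
    return header + "\n\n" + body


# ex:someProperty is a hypothetical property, not one taken from the dataset.
turtle = union_domain(
    "ex:someProperty", ["lemon:LexicalEntry", "lemon:LexicalSense"]
)
print(turtle)
```

With this form the union exists only as a blank node in the property's domain, so queries cannot refer to it by name; whether that is an advantage or a drawback depends on how the model is queried.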

One more thing: I would like to see the title changed to:
Wiktionary as a Lemon-Based Multilingual Lexical Resource in RDF.

"Lemon Based" -> "Lemon-Based" http://www.grammar-monster.com/lessons/hyphens_in_compound_adjectives.htm
"RDF Multilingual Lexical" is too much -> "Resource in RDF"

"However, such studies are not trivial to implements as a change in a definition does not necessarily implies that the lexical sense has changed."
-> "However, such studies are not trivial to implement as a change in a definition does not necessarily imply that the lexical sense has changed."

"legacy lexical data is underspecified" -> "are underspecified"

"an unusually ambiguous" -> "a unusually ambiguous"

"# to transl." -> "# of transl."

"lexicon-semantic" -> "lexica-semantic"


I realized that the urge to submit the revised version of the paper led to (at least) 2 errors in the paper:

1. In Figure 2, dbnary:Equivalent has not been changed to dbnary:Translation as in all other places in the paper.
2. In Table 3, the caption is not clear enough; it should read "Extracted translations vs interwiki links RATIO, on a random sample of 1000 entries". Moreover, the ratios are presented as percentages (99.1% instead of .991), which is not a good way to present a ratio, especially when such a ratio can take values above 1.
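
The presentation point in item 2 can be sketched as follows. This is a minimal illustration in Python, assuming the ratio is the number of extracted translations divided by the number of interwiki links; the function name and the counts are hypothetical and are not taken from the paper.

```python
def translation_ratio(extracted, interwiki):
    """Ratio of extracted translations to interwiki links (assumed definition)."""
    if interwiki == 0:
        raise ValueError("no interwiki links to compare against")
    return extracted / interwiki


# Hypothetical counts, chosen only to show the formatting.
below_one = translation_ratio(991, 1000)
above_one = translation_ratio(1050, 1000)

# Print plain ratios rather than percentages, so that values above 1
# read naturally.
print(f"{below_one:.3f}")
print(f"{above_one:.3f}")
```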

Should the paper be accepted, these mistakes will be corrected (I did not find a way to update the submission PDF).