Converting the PAROLE SIMPLE CLIPS Lexicon into RDF with lemon

Tracking #: 487-1683

Authors: 
Riccardo del Gratta
Francesca Frontini
Fahad Khan
Monica Monachini

Responsible editor: 
Guest editors Multilingual Linked Open Data 2012

Submission type: 
Dataset Description
Abstract: 
This paper describes the publication and linking of (parts of) PAROLE SIMPLE CLIPS (PSC), a large-scale Italian lexicon, to the Semantic Web and the Linked Data cloud using the lemon model. The main challenge of the conversion is discussed, namely the reconciliation between the PSC semantic structure, which contains richly encoded semantic information following the qualia structure of the generative lexicon theory, and the lemon view of lexical sense as a reified pairing of a lexical item and a concept in an ontology. The result is two datasets: one consists of a list of lemon lexical entries with their lexical properties, relations and senses; the other consists of a list of OWL individuals representing the referents for the lexical senses. These OWL individuals are linked to each other by a set of semantic relations and mapped onto the SIMPLE OWL ontology of higher-level semantic types.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 30/Jul/2013
Suggestion:
Accept
Review Comment:

This paper describes the work done in order to publish some parts of the Italian lexicon Parole Simple Clips (PSC) as linked data according to the lemon ontology-lexicon model.
The authors have successfully addressed the comments made in my previous review, and I would be very happy if the paper is accepted for publication.

The process performed in the transformation of the PSC into lemon is now much clearer, and the examples provided in section 3 and section 4 are very illustrative and contribute to the understanding of the process.

Nevertheless, I have some comments and/or questions that could still be addressed by the authors in order to help them “polish” the paper.
In section 2, at the end of the first paragraph, the authors say “This is particular useful when it comes to modeling the meaning of terms in different domains”. What exactly do the authors mean? This is not quite clear.
They should add some references (or pointers to) regarding the work done by UKP and UPF with lemon.

The third paragraph in section 3, the one that describes the qualia structures, is difficult to read. The authors should review the English of that paragraph.
In section 4, 3rd paragraph, I think that the argument they make to justify the difficulties in identifying PSC USems with lexical sense objects is not clear. I don’t think you can just relate lexical senses by “incompatible” or “equivalent” relations. I recommend that the authors check this.

In section 4, when they describe the final output of the conversion, and refer to SIMPLE Entries, they refer to them as “individuals”, but also as “concepts”. I would recommend they stick to the first denomination for the sake of clarity.
At the end of section 4, when they refer to the SIMPLE-OWL ontology and the set of relations contained there, do they mean the 66 relations subsumed by the original four?
In Fig. 3, couldn’t a link “reference” be established between frutto 1 and simple:Fruit?

Spelling mistakes:
• Shouldn’t generative lexicon be capitalized in the abstract?
• Section 3. 3rd paragraph: These qualia structures plays ->play
• Closing parenthesis missing at the end of section 3 (Figure 1
• First sentence in section 5.3, …resources are… ->resources were
• Section 6, 1st paragraph, …, in which the all SIMPLE -> … in which all

Review #2
Anonymous submitted on 31/Jul/2013
Suggestion:
Minor Revision
Review Comment:

This paper describes a conversion of the noun parts of the Italian PAROLE-SIMPLE-CLIPS lexica to RDF. The authors had to duplicate each entry to satisfy the distinction between lexical and ontological items required by the Lemon model. This also entailed splitting the original relations and properties accordingly among these two types of items. The authors drew on previously published work (the Simple OWL project) to map legacy semantic types to OWL classes.

I am not quite sure whether the paper makes a significant enough contribution, given that a large part of the paper just summarizes previous work (the lexicon and the lemon model). I am a bit hesitant to see simple data transformations published in a journal. The provided data is useful, but I think it would be even more useful if the verbs in the PAROLE-SIMPLE-CLIPS data had also been converted. The paper would be much stronger if it included a discussion of how to model verb frames in RDF and some of the issues that arise when doing this.

Review #3
By John McCrae submitted on 12/Dec/2013
Suggestion:
Minor Revision
Review Comment:

This paper describes the PAROLE/SIMPLE Clips dataset and its publishing as linked data. Such a lexicon containing deep semantic information is clearly of use to a large number of researchers working on Italian, and the resource is mostly of good quality and well described in the paper. I have a few issues:

lemon is not a "set of core modules", but "a core and a set of modules".

lemon does not limit relations between senses to only equivalent/incompatible/narrower/broader. Rather, these properties are all subproperties of senseRelation and other properties may be defined by the dataset provider or in a linguistic category ontology, such as LexInfo. However, the authors are correct to distinguish the USem into both a lexical sense and an ontological entity, as a lexical relation (which refers to a particular word) should be made at the level of lexical senses and a conceptual relation (which is agnostic to the word used to express the concept) should be made at the ontological level.

The base URL is given as http://www.languagelibrary.eu/owl/simple, which gives a 403 error. While URIs for individual entities do resolve, I would recommend the following:
* An HTML page or similar is served at the URL above
* An 'index' or search interface is made available and the URL included in the paper to demonstrate how to access all individual entries
* Make a download of all the data (pref. in N-Triples) available

I also note that the files returned by the service have the MIME-type "text/xml" instead of "application/rdf+xml" and no HTML (or Turtle, JSON, NT, etc.) version is available.

Tested with:
curl -D headers -H "Accept: application/rdf+xml" http://www.languagelibrary.eu/owl/simple/inds/2/299/USem1450limone
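
The check above can also be sketched in Python. This is a minimal illustration, not the reviewer's own procedure: the helper names `build_request`, `media_type` and `is_rdf_xml` are hypothetical, and only the URL and the two MIME types come from the review text.

```python
from urllib.request import Request

# Entry URL quoted in the review.
ENTRY_URL = "http://www.languagelibrary.eu/owl/simple/inds/2/299/USem1450limone"


def build_request(url: str) -> Request:
    """Build a GET request that asks for RDF/XML, like the curl call above."""
    return Request(url, headers={"Accept": "application/rdf+xml"})


def media_type(content_type: str) -> str:
    """Drop parameters such as charset and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()


def is_rdf_xml(content_type: str) -> bool:
    """True only if the server labels the payload as RDF/XML."""
    return media_type(content_type) == "application/rdf+xml"
```

Against a live server, the value to inspect would be `urllib.request.urlopen(build_request(ENTRY_URL)).headers.get("Content-Type")`; a response of `text/xml`, as reported above, fails the `is_rdf_xml` check.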

Minor:
Please write full names of conferences in the references (i.e., not LREC, ESWC, MSW, etc.)
Ref 8. "In in Wordnet"
Ref 14,15,16. Paper titles incorrectly capitalized
p2. PSC*,* the lexical resource ... *,* is a ...
Turtle is a proper noun, please capitalize

Review #4
By Sebastian Hellmann submitted on 26/Jun/2014
Suggestion:
Minor Revision
Review Comment:

## General
The SIMPLE ontology (http://www.languagelibrary.eu/owl/simple/SimpleOntology) as such is not well documented. I opened it in Protege and there wasn't a single label or comment in it documenting the concepts. So if you want to look up http://www.languagelibrary.eu/owl/simple/SimpleOntology#hasAgentive, there is no documentation whatsoever on the web. The authors should try to include some information there.

For the sake of self-containedness, I would like to see the quality evaluation of the legacy data mentioned in this article (maybe a short summary of the results). There were two EU projects, so I assume there has been some quality control. Which one?

Furthermore, the usefulness is obvious but not well described.

simple:hasIsa ;
simple:hasIsamemberof ;
a simple:Animal, owl:NamedIndividual ;
rdfs:comment " The lemma of USem873animale is animale" ;
rdfs:label " animale_as_Animal" .

One obvious use would be to build an index from the lemmas. But there is really no property that allows one to get the lemma.
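
To illustrate the point, the only place the lemma appears in the snippet above is inside the free-text `rdfs:comment`, so an index would have to be recovered by string matching. The following is a rough sketch of that workaround, assuming every comment follows the pattern "The lemma of X is Y" seen in the example; that assumption, and the function names, are not confirmed by the dataset.

```python
import re
from collections import defaultdict

# Pattern assumed from the single rdfs:comment shown in the review.
LEMMA_COMMENT = re.compile(r"^\s*The lemma of (?P<usem>\S+) is (?P<lemma>.+?)\s*$")


def lemma_from_comment(comment):
    """Return (usem_id, lemma) if the comment matches the assumed pattern, else None."""
    match = LEMMA_COMMENT.match(comment)
    if match is None:
        return None
    return match.group("usem"), match.group("lemma")


def build_lemma_index(comments):
    """Map each lemma to the list of USem identifiers whose comment names it."""
    index = defaultdict(list)
    for comment in comments:
        parsed = lemma_from_comment(comment)
        if parsed is not None:
            usem, lemma = parsed
            index[lemma].append(usem)
    return dict(index)
```

Parsing comments this way is fragile, which is why a dedicated property linking each individual to its lemma would be preferable.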

## Layout
The image is hardly readable; the script should be the same size as in the article. On my print-out there is a lot of space wasted.

## Technical issues:
There are still quite a few technical issues remaining, which should be resolved:

1. http://www.languagelibrary.eu/owl/simple/inds/5/5c2/USem873animale

contains the definition of an Ontology:

Actually this should be removed as it is expressed by:

2. The Ontology at http://www.languagelibrary.eu/owl/simple/SimpleOntology
is provided in functional syntax, which is quite unusual. In fact, I think only OWL API-based tools such as Protege can open this kind of syntax. Even Apache Jena can neither serialize nor read it. I am unsure whether it is a standardized syntax at all. Normally, Turtle or RDF/XML is used: http://jena.apache.org/documentation/io/#formats

3. It is still weird to duplicate URIs:
http://www.languagelibrary.eu/owl/simple/inds/SimpleEntries#USem59452pub...
will retrieve 15MB of data. So if you crawl this, you will download 15MB for each URI. I think this is really tough on your servers, causing 750MB of traffic per crawl.
I think it is best to simply replace '#' in the URIs with '/'. 50k files in the same folder shouldn't be a problem for a normal file system.
Otherwise you can just use http://www.languagelibrary.eu/owl/simple/USem/59452/pubblico
Does the number have any meaning?
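
The reason hash URIs cause this repeated download can be sketched as follows: an HTTP client strips the fragment before making the request, so every entry under the same document resolves to the same large file. This is a minimal illustration using the standard library; the entry names `HypotheticalEntry0` etc. are invented for the example, while the base document URL and the slash-style URI are those quoted above.

```python
from urllib.parse import urldefrag


def retrieval_url(uri: str) -> str:
    """Return the URL an HTTP client actually requests (fragment removed)."""
    return urldefrag(uri).url


# Document URL quoted in the review; the fragments below are hypothetical.
BASE = "http://www.languagelibrary.eu/owl/simple/inds/SimpleEntries"
hash_uris = [f"{BASE}#HypotheticalEntry{i}" for i in range(3)]

# Slash-style alternative suggested in the review.
slash_uri = "http://www.languagelibrary.eu/owl/simple/USem/59452/pubblico"

# All hash URIs collapse to one document, fetched once per URI by a naive crawler.
distinct_documents = {retrieval_url(u) for u in hash_uris}
```

With slash URIs, `retrieval_url` leaves each identifier unchanged, so each entry can be served as its own small document.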

Minor:
- CLIPS is explained in a footnote; I would rather have it moved into the main text.
- EuroWordNet is mentioned in the context of "Linked Open Data", but I am unsure whether it is "open" in any sense. As far as I know, there is quite a big fee for obtaining it, and it is neither free for scientific use nor open access. Could you please clarify this? I am well aware that the previous reviewer asked to include the reference.