PAROLE/SIMPLE ‘LexInfo’ ontology and lexicons

Tracking #: 398-1505

Authors: 
Marta Villegas
Núria Bel

Responsible editor: 
Guest editors Multilingual Linked Open Data 2012

Submission type: 
Dataset Description
Abstract: 
The PAROLE/SIMPLE 'LexInfo' Ontology and Lexicon are the OWL/RDF version of the PAROLE & SIMPLE lexicons (defined during the PAROLE (LE2-4017) and SIMPLE (LE4-8346) IV FP EU projects) once mapped to the LexInfo model. The original PAROLE/SIMPLE lexicons contain morphological, syntactic and semantic information, organized according to a common model and to common linguistic specifications for 12 European languages. The data set we describe includes the common PAROLE/SIMPLE model mapped to the LexInfo ontology and the Spanish & Catalan lexicons. All data are published in the Data Hub and are distributed under the CC Attribution 3.0 Unported licence. The Spanish lexicon contains 199466 triples and 7572 lexical entries fully annotated with syntactic and semantic information. The Catalan lexicon contains 343714 triples and 20545 lexical entries annotated with syntactic information, half of which are also annotated with semantic information. In this paper we briefly describe the resulting data, the mapping process and the benefits obtained.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Judith Eckle-Kohler submitted on 08/Jan/2013
Suggestion:
Major Revision
Review Comment:

This dataset description is about the OWL/RDF version of two manually constructed lexicons providing morphological, syntactic and semantic information for the languages Spanish and Catalan.
(1) Quality of the dataset: the two lexicons have been created manually by linguistic experts in the PAROLE/SIMPLE projects. The data quality can be assumed to be excellent. Therefore I strongly recommend accepting this paper.
(2) Usefulness: In the context of the emerging Linguistic Linked Open Data cloud, this dataset can be expected to be of great value, as there are a number of highly interesting linkings that could be created based on this dataset.
(3) Clarity
The paper lacks clarity. Many details on the lexicons which have been converted to OWL/RDF are not described explicitly. Important references are missing. The "big picture" is missing as well. I had to google "PAROLE SIMPLE lexicons" and look at the "Final SIMPLE Spanish Lexicon Report" in order to get an overview of the two projects and the lexicons created in these projects.

Please address the following comments in order to improve the clarity:
General comments:
1) The introduction needs to be completely restructured. This section does not "introduce" anything. Some content of the introduction could be moved to sec 2.
2) Please make the relationship between LexInfo and Lemon clear. Although Lemon is one of the keywords, the first mention of "the Lemon model" occurs on p. 3, sec. 4. A reference for Lemon is missing.
3) The paper does not mention which parts-of-speech are included in the lexicons.
4) It would be nice to mention the relationship between GENELEX and LMF somewhere in the paper.

Some more detailed comments:
- briefly introduce the LexInfo model and the Lemon model, and describe the relationship between the two
- p.1 Sanfilippo et al.: add a proper reference to the references section
- the references list a paper on LMF, but it is not cited in the text; as LMF derives from GENELEX, it would be good to mention LMF
- sec 3: "Syntax semantic linking in the PAROLE/SIMPLE model is also complex and, in most cases, useless." Why is it useless? Please elaborate.

*********** Revised Review *****************************************************************

This paper describes the OWL/RDF version of
- two manually constructed lexicons providing morphological,
syntactic and semantic information for the languages Spanish and Catalan,
- the PAROLE/SIMPLE lexicon model.
The latter has been mapped to an OWL/RDF ontology based on the LexInfo/lemon lexicon model.

Quality of the dataset
The two lexicons have been created manually by linguistic experts in the PAROLE/SIMPLE projects.
Therefore, the quality of the original data can be assumed to be very high.
Since the mapping to OWL/RDF is grounded in a mapping of the PAROLE/SIMPLE lexicon model
to the LexInfo/lemon lexicon model, this is likely to hold for the
dataset described in the paper as well.

Usefulness (or potential usefulness) of the dataset
Medium. Currently, the dataset is not linked to other datasets in the LLOD cloud.
Yet, this lexical resource provides syntactic information which could be
very useful in NLP applications.
There are a number of highly interesting linkings that could be created in the future based on this dataset,
e.g., a linking to lemonUby at the sense level based on subcategorization frames.

Clarity and completeness of the descriptions
The paper lacks clarity. Many details on the lexicons which have been converted to OWL/RDF are not described explicitly. Important references are missing. The "big picture" is missing as well. I had to google "PAROLE SIMPLE lexicons" and look at the "Final SIMPLE Spanish Lexicon Report" in order to get an overview of the two projects and the lexicons created in these projects.

Please address the following comments in order to improve the clarity:
General comments:
1) The introduction needs to be completely restructured. This section does not "introduce" anything. Some content of the introduction could be moved to sec 2.
2) Please make the relationship between LexInfo and Lemon clear. Although Lemon is one of the keywords, the first mention of "the Lemon model" occurs on p. 3, sec. 4. A reference for Lemon is missing.
3) The paper does not mention which parts-of-speech are included in the lexicons.
4) It would be nice to mention the relationship between GENELEX and LMF somewhere in the paper.
5) sec. 2: this section refers to Classes and Properties. In the context of RDF/OWL and LMF-based lexicon models, "Class" is a very ambiguous term:
- class in the UML sense (the sense in which the lexicon model uses it)
- rdfs:Class
- owl:Class
The same is true for Property: rdf:Property, owl:ObjectProperty or owl:DatatypeProperty?
This data modeling section would gain much clarity if you could be more specific when using the terms Class and Property.

Some additional comments:
- briefly introduce the LexInfo model and the Lemon model, and describe the relationship between the two
- p.1 Sanfilippo et al.: add a proper reference to the references section
- the references list a paper on LMF, but it is not cited in the text; as LMF derives from GENELEX, it would be good to mention LMF
- sec 3: "Syntax semantic linking in the PAROLE/SIMPLE model is also complex and, in most cases, useless." Why is it useless? Please elaborate.

The paper needs to be revised for minor spelling and grammar issues.

Review #2
Anonymous submitted on 10/Feb/2013
Suggestion:
Minor Revision
Review Comment:

This paper attempts to show how the PAROLE and SIMPLE lexicons can be mapped to the LexInfo model and turned into OWL/RDF versions. It presents the benefits of this process at three different layers in a well-organized way. Moreover, the authors describe resources for Spanish and Catalan, two languages that are under-represented in the LLOD cloud. So the topic is relevant for the Journal. However, the description of the process carried out at the three layers is really brief. Admittedly, it isn't an easy task to include much information in a short paper, but a few more explicit explanations would make the paper clearer to the reader, and the authors have the possibility of expanding it to six pages.
Some of the sections that would need further clarification:
- Section 2 should clarify the changes carried out, with some further examples.
- The beginning of section 3 is quite general, so a more specific explanation is needed to clarify "the aspects" and to show the relations to lemon and LexInfo. Section 4 would also benefit from a clearer explanation.
- Some tables would be necessary to show changes performed at different layers. For example, figures 3 & 4 in section 5 are very illustrative, so other tables in the rest of the sections would help to improve the paper. Part of the explanation in the paper is also provided in the following link in which they add some other example/ figure: http://gilmere.upf.edu/corpus_data/ParoleSimpleOntology/ParoleOntology.html.
- A few lines providing more information about the amount of work needed and whether all the information in the original data sets could be mapped would be advisable.
As for the formal aspects, the second paragraph in the Introduction needs some syntactical revision.
- Some punctuation needs fixing: in the Introduction there is an unnecessary comma in (i) after "encoding", whereas a comma is needed before "(v) Semantic Roles".
- In section 2, (iv), "...element become..." should be "element becomes".
- In section 2, (v), second line, "results in relevant...." should be "results in a relevant property".
- In section 3, second line, third page, it says "in the latter, case the lexical sense..". It should say "In the latter case, the lexical sense ...".
On the whole, the paper is relevant for the Journal and I recommend acceptance with these minor changes.

Review #3
By Philipp Cimiano submitted on 27/Feb/2013
Suggestion:
Minor Revision
Review Comment:

This short paper describes interesting work in mapping the PAROLE/SIMPLE model to the lemon/LexInfo ontology. It describes the mapping at a level of detail sufficient for it to be understood. A nice aspect of the paper is that it describes in detail the benefits we obtain by modelling lexica in RDF. The query at the end of the paper actually shows this. Nevertheless, I have a few issues that in my view should be addressed:

1) The authors should differentiate between lemon and LexInfo. The title should mention lemon and not LexInfo IMHO. Lemon can be seen as a meta-model for lexica expressed in RDF and thus provides the structure for such a lexicon. Lemon is thus comparable to LMF in the sense that it is a lexicon format. LexInfo is a linguistic ontology that provides suitable data categories to be used in creating a particular instance of a lemon lexicon. The paper should be reworked with this in mind, referencing either lemon or LexInfo as appropriate.

2) The authors could summarize the benefits of using lemon/RDF for modelling lexica in the conclusion. They mention these benefits in the text, but it would be good to have them as a summary.

3) When talking about the benefits, it would be good to make clear whether these benefits come from i) the modelling as RDF, ii) the use of lemon, or iii) the use of the LexInfo ontology.

4) The provision of data is certainly nice, but the rationale for providing the data in this way should be better explained. First of all, why did the authors choose the datahub site? Why are there always two files per lexicon, an OWL file and a Turtle RDF file? As the lexica contain actually mainly instances, I was expecting that one RDF file per lexicon would be enough. Releasing the data as Linked Data or through a SPARQL endpoint would be really good.

5) Reading the paper, it sounds like it was really straightforward to map the PAROLE/SIMPLE model to lemon / LexInfo. Is that really the case? It would be interesting to see if the authors found any problems in doing the mapping.

=========

I downloaded the dataset and after inspecting it I can definitely say that this is a very useful dataset that may be exploited by the NLP community.

However, at the same time, I urge the authors to give more details about the dataset in the final version of the paper including some statistics related to:

1) number of triples
2) number of lexical entries
2.5) average number of triples per lexical entry => should follow from 1 and 2
3) number of lexical entries with syntactic frames
4) number of types of frames used
5) linking to other datasets (e.g. by use of vocabulary), i.e. how many vocabulary elements are linked, how many individual links are there to other datasets
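
Item 2.5 can already be derived from the counts quoted in the abstract. The following is only a minimal sketch of that arithmetic; the dictionary layout and variable names are illustrative assumptions, and the figures are taken from the abstract rather than recomputed from the published files.

```python
# Counts quoted in the abstract of the paper under review.
counts = {
    "Spanish": {"triples": 199466, "entries": 7572},
    "Catalan": {"triples": 343714, "entries": 20545},
}

# Average number of triples per lexical entry (item 2.5 follows from items 1 and 2).
averages = {lang: c["triples"] / c["entries"] for lang, c in counts.items()}

for lang, avg in averages.items():
    print(f"{lang}: {avg:.2f} triples per lexical entry")
```

With these figures the averages come out at roughly 26.3 triples per entry for Spanish and 16.7 for Catalan, which is consistent with the Spanish entries being fully annotated with both syntactic and semantic information.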

It would be great if the authors could add some statements describing the intended use of the resource.

I am happy to review a second version of this paper to see if my recommendations have been implemented.

Overall, this is a very useful resource and it is certainly good that it has been converted into Linked Data.