LIdioms: A Multilingual Linked Idioms Data Set

Tracking #: 1554-2766

Diego Moussallem
Mohamed Sherif
Diego Esteves
Axel-Cyrille Ngonga Ngomo

Responsible editor: 
Philipp Cimiano

Submission type: 
Dataset Description
In this paper, we describe the LIDIOMS data set, a multilingual RDF representation of idioms containing five languages. The data set is intended to support natural-language processing applications by providing links between idioms across languages. The underlying data was crawled and integrated from various sources. To ensure the quality of the crawled data, all idioms were evaluated by at least two native speakers. Herein, we present the model devised for structuring the data. We also provide the transformation rules implemented in our extraction framework. The resulting data set relies on best practices in accordance with Linguistic Linked Open Data Community. We also detail the link creation process as well as possible usage scenarios for the linked idioms data set.

Reject (Two Strikes)

Solicited Reviews:
Review #1
By Maud Ehrmann submitted on 03/Mar/2017
Minor Revision
Review Comment:

In this revised version of the paper, the authors took into account most of the reviewers' comments. The situation regarding copyrights has been clarified; the methods used to collect and interlink data are more detailed; and the model used to represent information is clearer.
These modifications bring significant improvements, and what has been done and why is now clear for the reasonably well-prepared reader. However, there are 2 main aspects which, in my opinion, still need to be improved.
The first one regards some remaining inaccuracies and unclear points about, among others, the building process and interlinking.
The second and major one regards the overall quality of the text which, in my opinion, needs a great deal of work. There are minor English mistakes but also, and this is much more critical, incorrect sentences, inaccuracies and inconsistencies. This is unfortunate for I believe the resource presented in this paper -- although small -- is valuable and lays the groundwork for a better handling of idioms on the LLOD.
The following subsections detail these 2 points. Regarding the second subsection, on text editing, comments might not be exhaustive and stop after section 5.
It is imperative to check extremely carefully the whole article, and to ask several persons to proofread it.

The paper introduces potential application scenarios but there is no evidence of real third-party use.

Additional comments

- "The Portuguese idioms were written by four native speakers"
It would be good to provide a bit more information about this. How was it done? Did the native speakers start from scratch, or did they use the collected resources?

- Listing 1: Shouldn't it contain an example of indirect translation, i.e. having a link between the it and de, or de and pt, or pt and it idiom lexical senses?

- [inaccurate/incomplete] "BabelNet started by extracting language resources from WordNet. [...] Posteriorly, BabelNet compiled knowledge from various lexical resources."
=> BabelNet integrates WordNet and Wikipedia right from the beginning, cf. the cited paper.
=> WordNet is an English resource (the one cited), therefore the proposition "language resources from WordNet" with a plural on resource is not really exact, or it is unclear.

- Section 3.2 should present all information related to data collection: the fact that definitions are also retrieved should be mentioned (it is not expressed), and the fact that 13% of them are translated should also be mentioned here, instead of in the section related to semantic representation.

- Quality evaluation of retrieved idioms, section 3.3 first §: the reader understands between the lines, but it is never said how evaluators had to evaluate definitions. What did they have to judge? What is a "wrong" definition? I suppose it is similar to the filtering criteria used beforehand, but it should be specified.

- Section 4: "we use the ISO according to the best practices"
=> ISO should be specified

- "string distances between property values (such as cosine, n-grams and levenshtein distances)."
=> This formulation is too vague and inexact.
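
To illustrate why the formulation is inexact: Levenshtein is an edit distance over characters, cosine is a similarity computed over some vector representation, and "n-grams" names a representation rather than a measure. The sketch below is a minimal illustration under my own assumptions (character trigram counts as the vector representation); it is not the set of measures as implemented in LIMES or as configured by the authors.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_ngrams(s: str, n: int = 3) -> Counter:
    """Multiset of character n-grams of s (assumed representation)."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity between the n-gram count vectors of a and b."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va)
    norm = (sqrt(sum(v * v for v in va.values()))
            * sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

A distance grows as strings diverge while a similarity shrinks, so the paper should state which measure was applied to which property and with which threshold.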

- "Linking LIDIOMS to other external knowledge bases is based on the string similarities between LIDIOMS’s resources and the other data sets’ resources."
=> Linking towards Babelnet was not done based on string similarities.

- "We chose LIMES because it has been shown to be time-efficient in previous works"
=> Is the time-efficiency criterion important for a small-sized dataset?

- Regarding interlinking towards BabelNet, it is not clear why filtering candidates based on the synsetType is not possible. Given the fact that BabelNet contains encyclopedic knowledge, that would be helpful.

- In Table 3, the number of English idioms is different than in Table 1.

- In Table 3, the precision is not the one of the resources but the one of the linking strategies. Moreover, these strategies are different so not comparable.

- Section 6.3, if it appears to the authors that BabelNet contains imperfect information encoding in some cases, best would be to report it directly so that it could be corrected. If this regards the fact that labels contain underscores, it is relatively easy to circumvent.
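
If the issue is indeed underscores in labels, a normalisation step before matching would be enough. The following is a minimal sketch; the function name and the example label are hypothetical, and the actual form of BabelNet labels should be checked against the resource.

```python
def normalize_label(label: str) -> str:
    """Replace underscores with spaces, collapse whitespace and
    case-fold, so that labels can be compared as plain strings."""
    return " ".join(label.replace("_", " ").split()).casefold()


# Hypothetical label, used only to show the effect of the normalisation.
print(normalize_label("All_in_the_same_boat"))
```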

- Section 7.3: From previous section 5 ("we computed the multilingual translations by inference") we understand that 315 indirect translations are already computed and available. Does the query in Listing 5 bring additional information?
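
For reference, this is how I understand inference through a pivot; it is my own reading of "by inference", not the authors' implementation, and all identifiers are placeholders. Two idioms in different languages are paired when they share the same English pivot.

```python
from collections import defaultdict
from itertools import combinations


def infer_indirect(direct):
    """direct: iterable of ((lang, idiom_id), pivot_id) pairs, where
    pivot_id identifies the shared English pivot.
    Returns the set of cross-language pairs that share a pivot."""
    by_pivot = defaultdict(set)
    for entry, pivot in direct:
        by_pivot[pivot].add(entry)

    inferred = set()
    for entries in by_pivot.values():
        # sorted() makes the order inside each pair deterministic
        for a, b in combinations(sorted(entries), 2):
            if a[0] != b[0]:  # keep only pairs across different languages
                inferred.add((a, b))
    return inferred
```

If the 315 indirect translations are already materialised in the data set, the query in Listing 5 would be expected to return the same pairs as such an inference step, which is why its added value should be made explicit.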

- Conclusion: "We thus regard LIDIOMS as a first effort towards a better LLOD"
Many research works/initiatives are making efforts towards a "better LLOD" and LIDIOMS is not the only one; this proposition could be toned down.

- Unfortunately the SPARQL endpoint was down when I accessed it so I could not test the queries of Listing 3/4/5/6.

- "A main limitation in the currently available data sets in LLOD is the lack of proper categorization of MWE."
=> The lack of proper categorization of MWE is often emphasized throughout the paper. Beyond this, it would be good to briefly elaborate on the reasons why.

Text editing

> abstract
- [minor] "natural-language processing" => no hyphen

- [inappropriate?] "The resulting data set relies on best practices in accordance with Linguistic Linked Open Data Community."
=> do you mean rely or comply?

> Introduction

- [poorly worded] "a large number of diverse linguistic data sets types"
=> maybe a bit too many modifiers, which all apply to "types" (therefore nothing is said about the large number of datasets)
"data set" should be singular.

- [inappropriate?] "However, most of these resources are still described in a bilingual way on the LLOD."
=> "describe" is a bit weird here and maybe not the most appropriate verb, linguistic resources are in this or that language.

- [incorrect] "Thus, becoming worthwhile to develop multilingual knowledge bases reusing the bilingual contents."
=> incorrect sentence, parts are missing.

- [unclear] "Multilingualism is important not only for sharing contents but also for learning new concepts from other cultures."
=> This is not really clear.

- [grammar] "have been extracted from various sources and being represented as Linked Data (LD)."
being => been

- [minor] "Although the current linguistic data sets on the LLOD are able to cover different types of linguistic resources as LD"
=> "are able" can be removed

- [grammar] "Idioms have no clear definition due to their sense comes from concepts"
=> due to the fact that their senses come (plural on senses) OR because their senses

- [unclear] "Idioms have no clear definition due to their sense comes from concepts of everyday life particular to a given culture."
=> isn't this the case for most aspects of language, to come from everyday life and to belong to a given culture?

- [inappropriate] "whose meaning cannot be derived from the semantic meaning of words that constitute them"
=> semantic can be removed

- [inappropriate] "Idioms are classified in general as “non-compositional”."
=> there is no need for quotes around non-compositional.

- the 4th § of the introduction repeats most of the 3rd one.

- [inconsistent] "Additionally, idioms are divided into several syntactic types including noun phrases, verb phrases and partly compositional idioms"
=> "syntactic type" is a bit vague, especially since you talk about part of speech later on. Terminology should be consistent throughout the article, or it should be clear what is being talked about.
=> "partly compositional idioms" is a category which is not of syntactic nature. Plus it is contradictory with what has been said just before. Here is the whole difficulty of defining idioms; their compositionality varies, and this aspect is not always recognized as a definition criterion.

- [grammar] "Russian was chosen due to it comes from Cyrillic alphabet."
=> needs correction

- [unclear] "to the same old Germanic family"
=> is old necessary? From the full sentence, one infers that the Germanic family is older than the Romance, this would need to be supported by a reference.

- [minor] "LIDIOMS aims to support other Multilingual NLP tools" => multilingual does not need uppercase

> Section 2

- [inaccurate/incomplete] "A large range of ontologies has been developed to represent natural language data as LD on the Web of Data."
=> In my opinion this is not really exact. Ontologies are primarily developed to model linguistic information (in our context), and they are used to represent linguistic resources as LD on the Web of Data.

- [inconsistent] "Thus, the well-known ontology lemon was originally developed"
=> "thus" should not be used, there is no causal link. Maybe "in this context"

- [inconsistent] "Although these resources are linked lexical multilingual data sets, they contain a limited number of idioms described correctly along with their respective translations across languages."
It is not because a resource is multilingual that it should contain idioms. "Although" is not really appropriate here. If you wish to express that it is good that these resources are multilingual, but that none covers idioms correctly, this would need to be reformulated, maybe using "however".

- [poorly worded] "Despite Lexinfo ontology [19] contains a property about idioms, appropriated classes did not exist to reuse this information."
=> to be reformulated. "appropriated" => appropriate

- [minor] "for handling on this phenomena properly"
=> "on" should be removed

> Section 3.1

- [minor] "we discuss how we assure the quality of the collected data."
=> ensure

- [minor] "as well as the well-known linguistics dictionaries of Collins and Oxford"
=> "linguistics" and "of" should be removed, or simpler: as well as Oxford and Collins dictionaries.

- [minor] "Although the most expressions are provided by crowd-sourcing which means there is no strict evaluation about their
lexical definitions in terms of linguistic professional accuracy, the knowledge base is extremely worthwhile."
=> "Although the most" => Although most
=> "which means" => comma before
=> "linguistic professional accuracy" => "professional linguistic accuracy" would be better, but "accuracy" would suffice.

- [minor] "post-graduate research into computational linguistics." into => in

- [minor] "big-name" => no hyphen

- [inaccurate] "lexical sources" => lexical resources

- [unclear/grammar] "Memrise and Oxford contributes to English, German, Italian, and Russian languages as well as Phrase Finder and Collins only for English."
=> needs reformulation, it is not understandable.

- [minor] "which belong to our computer science department" => can be deleted

- [minor] "provide to us" => without to

- [minor] "in terms of research and replication of the data." => of data

> Section 3.2

- "The retrieval was straightforward because each source has specific pages about MWE thus making the configuration of each page easier
as a seed page."
=> unclear, and not sure it is useful.

- [incorrect/unclear] "Note that, all data sources are bilingual not restricted from any language to English."
=> Really not understandable. suggestion: translations hold between language pairs which do not necessarily contain English (?)

- [poorly worded] "Moreover, we had two main observations going through the crawled data."

- [minor] First one => First

- [inaccurate/poorly worded] "... some MWE have meanings deductible by the semantic meaning of each contained word e.g “by the book”"
This sentence could be rephrased, e.g. "in some cases the meaning of MWE can be deduced from the meaning of their components.... while in others...."
You want to say that semantic compositionality applies for some MWE, but not for others. In this last case, this is not a "pragmatic meaning" (pragmatics has to do with taking into account speakers and the use they make of language in a specific context), but non-compositional meaning (non-compositional expressions can still be pragmatically equivalent).
"deductible by" => from
"the semantic meaning" => the meaning

- [minor] "The phrase finder" => rm The

- [poorly worded] "The phrase finder has expressions in English divided by location where American idioms which come from The United States of America and British idioms which come from England. Therefore, this characteristic is an important and shows that it always reflects
in the idioms meanings."
=> should be reformulated

- [unclear] "For instance, idioms which contain the word “run” have different meanings. This observation is explained properly in the next section."
=> The example is not explained in the next section.

- [poorly worded] "the collection of only idioms was a hard task."
=> needs reformulation

- [unclear] "It has been done manually by discarding the entries in the resources that were semantically equivalent to their lexical definitions."
=> unclear formulation. You want to express the fact that you adopted, as selection/filtering criteria, strict non-compositionality.

- [poorly worded] "is deductible by the semantic meaning of each word"
=> from the meaning of each of its components

- [poorly worded] "Through this semantically-based selection, we selected 50% of the MWE retrieved by our process on average to be idioms"
=> last proposition would need to be reformulated

> Section 3.3

- "each of the two"
=> each

- [inaccuracy] "This procedure resulted in a gold standard data set"
=> a data set is usually called "gold standard" in an evaluation context. Here the data set is manually checked, or corrected.

- [poorly worded/incorrect] "The representation model of LIDIOMS aims to describe the idioms correctly as a sub-type of MWE
also modelling their translations along with their geographical usage area."
=> needs to be rephrased

> Section 4

- [minor] footnote "11/ Prefix lexinfo:idiom stands for http://www."
=> idiom should be removed

- [poorly worded] "LIDIOMS has the core of the Ontolex model containing the main class ontolex:LexicalEntry to represent a lexical entry"
=> LIDIOMS uses one of the core classes of Ontolex, that is to say...

- [poorly worded] the second § of Section 4 should be re-written
"Thus becoming indispensable for our model as well as the ontolex:LexicalSense class."
=> misses the subject
"and ontolex:Lexicon is responsible for indicating details about the language and the entries in general."
=> the main role of ontolex:Lexicon is to gather lexical entries.

- [inaccurate] "The vartrans:Translation has the property vartrans:category which is responsible for describing relations among translations"
=> for describing translations. the "relations among translations" are the translations themselves, and the relations are between lexical senses.
Cf. the ontolex definition "The category property indicates the specific type of relation by which two lexical entries or two lexical senses are related."

- [unclear] "...and also representing variations of these relations across entries in the same (i.e same entry from a given language with different meanings) or different languages."
=> This is unclear.

- [unclear] "In addition, we had an important insight for modeling the translations in LIDIOMS"
=> Not really clear what the insight is + modeling => modelling

- [poorly worded] "While most of the retrieved idiom definitions were in English, just in some cases the definition was in other language."
=> while most of the definitions of the retrieved idioms (or did you retrieve definitions separately?)
=> the second proposition needs to be rephrased + in other => in another

- [minor] " We then decide to provide the definition of all idioms"
=> decided + definitions plural

- [poorly worded] "The other (13%) of idioms which have the definitions in other language, they were translated by a native
speaker to English without keeping the definition in the original language."

- [incomplete] "Thus, the English language became our pivot language."
=> English definitions are the pivot (should be specified for it could be confused with English idioms)

- [minor] "the provision of indirect translations (represented by dotted line) of idioms through a pivot language relying on Semantic
Web technologies."
=> "relying on Semantic Web technologies" could be deleted

- [minor] "The geographical area is of great importance because the meaning of an idiom can vary in the same language depending on where it is used not only in English"
=> not only in English could be deleted

> Section 5

- [minor] "The original idioms we crawled from the Web"
=> we can be removed

- [minor] "The model underlying the RDF translation"
=> the RDF conversion?

- [minor] missing dot after "Table 2"

> Section 6.3

- [grammar, missing parts, incorrect] "DBnary has presented a good precision, the only its lower score comes from Portuguese. It is due to the Portuguese to be a language a bit exploited by linguistic resources in terms of MWE as well as Russian thus containing only few idioms."

> Section 8
- [inaccurate] "Also, the idiom “All in the same boat” has the same name as a book"
=> it is rather the contrary

- [grammar] "qualitatively superior NLP systems based on Semantic Web technologies, including but not limited to better machine translation applications."

- Gilles Srasset => Sérasset

References: title case needs to be double-checked.

Review #2
By John McCrae submitted on 04/Mar/2017
Major Revision
Review Comment:

This paper presents the construction of a dataset of idioms that is available across multiple languages. This dataset is quite small and I feel the use cases for this dataset could be better motivated. The paper describes four `use cases', which are really little more than SPARQL queries, and only one of which (translation) is clearly useful for other tasks. Moreover, the authors claim that their dataset is useful because existing resources such as BabelNet do not identify multi-word expressions, but this is quite trivial to do (MWEs are the entries with more than one word!). The authors should better motivate why idioms (non-compositional MWEs) are of such interest. Moreover, the authors fail to give a good definition of idioms or define the criteria that they gave to the annotators. However, other works (e.g., have developed complex guidelines for this task. This is particularly troubling as the authors claim in one example that `by the book' is not an idiom, yet it is a very syntactically inflexible MWE and (this) native speaker would reject something like `we did it by his book'. Nevertheless, the manual effort that has gone into creating and linking idioms makes this a valuable resource that is of use to researchers investigating figurative language processing.
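
To make the point about MWE identification concrete, a minimal sketch follows. It assumes lemmas are available as plain strings in which words are separated by spaces or underscores; the function name is hypothetical and no BabelNet API call is implied.

```python
def is_multiword(lemma: str) -> bool:
    """True if the lemma consists of more than one whitespace- or
    underscore-separated token."""
    return len(lemma.replace("_", " ").split()) > 1


entries = ["by the book", "book", "all_in_the_same_boat"]
mwes = [e for e in entries if is_multiword(e)]
```

Such a filter finds MWEs but says nothing about compositionality, which is exactly why the criteria for selecting idioms need to be stated.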

As far as I can see there is no third party use (and none described in the paper). The dataset is also quite small and it is not clearly stated why this would be of more interest to a linguist than existing larger resources such as WordNet, BabelNet etc.

There are many minor issues.
p1 (abs). "idioms *in* five languages"
p1. "Thus, *it is* becoming"
p1. "due to *the fact that* their sense"
p2. "(no old) Germanic family *and* Portuguese"
p2. "due to it comes from Cyrillic alphabet" => "due to it being written in the Cyrillic alphabet" I guess but why does this matter??
p2. "Thenceforth" is not a word in (modern) English
p2. Neither really is "posteriorly"
p2. "turning up": also doesn't mean what the authors seem to want here
p2. "appropriated" => "appropriate" (twice on this page)
p2. "Despite Lexinfo ontology" is not grammatical
p3. Why aren't Phrase Finder and Memrise capitalised?
p3. "big-name" => "widely-known"
p3. "speakers..., *who* belong"
p3. "provide to us a fully grant permission" => "grant us full permission"
p3. "Note that all" (no comma)
p3. "restricted from any language to English" I don't understand this
p3. "First one" => "Firstly"
p3. "deductible" => "deducible"
p4. What is an "ISO resource" and why doesn't Brazilian Portuguese have one. Do you mean ISO 639 codes? In which case "pt-BR" should be used.
p4. "just in some cases" => "only in a few cases" I think
p5. Please add literal translations for all the Portuguese
Fig 1. Lexico*n* Italian (and it should really be 'Italian Lexicon' also)
p7. "in contrast *to* BabelNet (see 18*)*. This problem contributes *to*"
p7. "does not handle with it": do you mean "handle it"?
p8. Latex quote is wrong before "it's raining cats and dogs"
p10. I am not really sure adding more fine-grained locations is a good idea. You are already claiming that `it's raining cats and dogs' is only known in England, yet it is used in Scotland, Wales and Ireland...
p10. "Gilles Sérasset"

Review #3
By Elena Montiel Ponsoda submitted on 08/Mar/2017
Minor Revision
Review Comment:

In this resubmitted version of the paper, the authors have taken into account most of the recommendations made by the reviewers, but there are some aspects that still need to be reconsidered.
Following the evaluation dimensions suggested by the journal, the paper has been evaluated for: quality and stability of the dataset - evidence must be provided; clarity and completeness of the descriptions; usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided.

Quality and stability of the dataset - evidence must be provided

The code is available on GitHub, but the SPARQL endpoint was not working.
As for the data collection process, the authors had to perform a manual review of the automatically retrieved idioms (by a custom web crawler) to discard those multiword expressions that could not be considered idioms. Then, native speakers and linguists checked the corresponding set of idioms in their respective languages. After this evaluation step, the automatically retrieved idioms were reduced by approximately half, considerably reducing the impact of the dataset. (It is worth noting that, because of potential law infringement issues raised by one of the reviewers, the authors had previously had to remove idioms from Cambridge and Wiktionary.)

Clarity and completeness of the descriptions

As for the Related Work section, the authors adequately refer to datasets that represent linguistic and terminological information in Linked Data. However, when identifying the gap that this resource is going to fill (namely, no specific resource in the LOD cloud about idioms), they refer to lexinfo, ontolex, etc., which are models intended to represent linguistic information in RDF. So, in my opinion, two levels are being mixed up here: the modeling level and the instantiation level. This should be clarified.
Section 4 is devoted to the representation model, namely, ontolex, and its vartrans module for representing term variants and translations. Some modeling decisions would need to be better clarified. For instance, what is the interest of including lexical concepts? How is vartrans:category defined? Why did they remove those definitions of the idioms which were in the original language (after translating them into English)? Why not use the usage property of the LexicalSense class for restricting the meaning of an idiom to a geographical area?
Regarding this last question, the authors mention that they provide an example of (and I quote) “a translation of two idioms from Portuguese to English”, but I only see one idiom and its equivalent in English. In fact, it would be interesting to see how they have solved the issue of an idiom in the same language with two different senses in different geographical areas.

Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided.

In section 6 the authors describe both the internal and the external linking processes. The internal linking process was performed manually. The external one was performed with LIMES in the case of DBnary, and with the BabelNet API or manually in the case of BabelNet.
The language in section 6.3 needs to be reviewed, especially the last paragraph in that section. It is not clear why precision was poorer in the case of BabelNet, or whether this could be easily solved.
Then, section 7 deals with “application scenarios for the dataset”. Three use case scenarios are depicted, but these do not seem to be real uses of the data by third-party users, which may limit the interest of the work presented here.