Review Comment:
In this revised version of the paper, the authors took into account most of the reviewers' comments. The situation regarding copyrights has been clarified; the methods used to collect and interlink data are more detailed; and the model used to represent information is clearer.
These modifications bring significant improvements, and what has been done and why is now clear for the reasonably well-prepared reader. However, there are two main aspects which, in my opinion, still need to be improved.
The first one regards some remaining inaccuracies and unclear points about, among others, the building process and interlinking.
The second and major one regards the overall quality of the text which, in my opinion, needs a great deal of work. There are minor English mistakes but also, and this is much more critical, incorrect sentences, inaccuracies and inconsistencies. This is unfortunate for I believe the resource presented in this paper -- although small -- is valuable and lays the groundwork for a better handling of idioms on the LLOD.
The following subsections detail these two points. Regarding the second subsection, on text editing, the comments may not be exhaustive and stop after section 5.
It is imperative to check the whole article extremely carefully, and to ask several people to proofread it.
The paper introduces potential application scenarios but there is no evidence of real third-party use.
- "The Portuguese idioms were written by four native speakers"
It would be good to provide a bit more information about this. How was it done? Did the native speakers start from scratch, or did they use the collected resources?
- Listing 1: Shouldn't it contain an example of indirect translation, i.e. having a link between the it and de, or de and pt, or pt and it idiom lexical senses?
- [inaccurate/incomplete] "BabelNet started by extracting language resources from WordNet. [...] Posteriorly, BabelNet compiled knowledge from various lexical resources."
=> BabelNet integrates WordNet and Wikipedia right from the beginning, cf. the cited paper.
=> WordNet (the one cited) is an English resource; the phrase "language resources from WordNet", with "resources" in the plural, is therefore inexact, or at least unclear.
- Section 3.2 should present all information related to data collection: the fact that definitions are also retrieved should be mentioned (it is not stated), and the fact that 13% of them are translated should also be mentioned here, instead of in the section related to semantic representation.
- Quality evaluation of retrieved idioms, section 3.3 first §: the reader can read between the lines, but it is never said how evaluators had to evaluate definitions. What did they have to judge? What is a "wrong" definition? I suppose it is similar to the filtering criteria used beforehand, but it should be specified.
- Section 4: "we use the ISO according to the best practices"
=> the ISO standard in question should be specified
- "string distances between property values (such as cosine, n-grams and levenshtein distances)."
=> This formulation is too vague and inexact.
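To make the distinction concrete: Levenshtein is an edit distance over characters, n-grams are a way of decomposing a string rather than a distance in themselves, and cosine is a similarity between vectors (for instance, vectors of n-gram counts). The following is a minimal Python sketch of that distinction. The function names and the choice of character trigrams are my own assumptions for illustration; this is not the configuration used by the authors or by LIMES.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Multiset of character n-grams (n=3 is an illustrative choice)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va)
    norm = (sqrt(sum(v * v for v in va.values()))
            * sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

Note that the first function returns a distance (0 means identical) while the last returns a similarity (1 means identical), which is one reason why listing them together as "string distances" is inexact.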
- "Linking LIDIOMS to other external knowledge bases is based on the string similarities between LIDIOMS’s resources and the other data sets’ resources."
=> Linking towards BabelNet was not done based on string similarities.
- "We chose LIMES because it has been shown to be time-efficient in previous works"
=> Is the time-efficiency criterion important for a small data set?
- Regarding interlinking towards BabelNet, it is not clear why filtering candidates based on the synsetType is not possible. Given the fact that BabelNet contains encyclopedic knowledge, that would be helpful.
- In Table 3, the number of English idioms differs from the one in Table 1.
- In Table 3, the precision reported is that of the linking strategies, not that of the resources. Moreover, these strategies differ, so the figures are not comparable.
- Section 6.3: if it appears to the authors that BabelNet contains imperfect information encoding in some cases, it would be best to report it directly so that it can be corrected. If this concerns the fact that labels contain underscores, it is relatively easy to circumvent.
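As an illustration of how the underscore issue could be circumvented before matching, here is a minimal Python sketch. The function name and the underscore-separated label form used in the usage note are assumptions on my part, not taken from the paper or from the BabelNet documentation.

```python
def normalize_label(label: str) -> str:
    """Replace underscores with spaces, collapse whitespace, and case-fold.

    Hypothetical helper: applied to both sides before comparing labels.
    """
    return " ".join(label.replace("_", " ").split()).casefold()
```

For example, under this assumption a label of the form `All_in_the_same_boat` and the idiom "All in the same boat" would normalize to the same string.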
- Section 7.3: From previous section 5 ("we computed the multilingual translations by inference") we understand that 315 indirect translations are already computed and available. Does the query in Listing 5 bring additional information?
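To clarify what I understand by "translations by inference": two idioms in different languages are indirectly linked when they share the same pivot entry. The following is a minimal Python sketch of that join. The data layout, identifiers and function name are hypothetical placeholders of mine and do not reflect the actual LIDIOMS model or data.

```python
from collections import defaultdict
from itertools import combinations


def indirect_translations(direct):
    """Pair up idioms that share a pivot entry but are in different languages.

    `direct` maps (language, idiom_id) -> pivot_id; all values are hypothetical.
    """
    by_pivot = defaultdict(list)
    for entry, pivot in direct.items():
        by_pivot[pivot].append(entry)
    pairs = set()
    for entries in by_pivot.values():
        # Sorting gives each unordered pair a single canonical orientation.
        for (lang_a, id_a), (lang_b, id_b) in combinations(sorted(entries), 2):
            if lang_a != lang_b:
                pairs.add(((lang_a, id_a), (lang_b, id_b)))
    return pairs


# Placeholder data: three idioms share one pivot entry, a fourth stands alone.
direct = {
    ("it", "idiom_1"): "en_sense_1",
    ("de", "idiom_2"): "en_sense_1",
    ("pt", "idiom_3"): "en_sense_1",
    ("ru", "idiom_4"): "en_sense_2",
}
inferred = indirect_translations(direct)
```

If the 315 indirect translations were materialized through a join of this kind, a query performing the same join at query time would presumably return nothing new, hence my question.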
- Conclusion: "We thus regard LIDIOMS as a first effort towards a better LLOD"
Many research works and initiatives are making efforts towards a "better LLOD", and LIDIOMS is not the only one; this claim could be toned down.
- Unfortunately the SPARQL endpoint was down when I accessed it, so I could not test the queries of Listings 3, 4, 5 and 6.
- "A main limitation in the currently available data sets in LLOD is the lack of proper categorization of MWE."
=> The lack of proper categorization of MWE is often emphasized throughout the paper. Beyond this, it would be good to briefly elaborate on the reasons why.
- [minor] "natural-language processing" => no hyphen
- [inappropriate?] "The resulting data set relies on best practices in accordance with Linguistic Linked Open Data Community."
=> do you mean "relies on" or "complies with"?
- [poorly worded] "a large number of diverse linguistic data sets types"
=> perhaps too many modifiers, all of which apply to "types" (therefore nothing is said about the large number of data sets)
"data set" should be singular.
- [inappropriate?] "However, most of these resources are still described in a bilingual way on the LLOD."
=> "described" is odd here and perhaps not the most appropriate verb; linguistic resources are in this or that language.
- [incorrect] "Thus, becoming worthwhile to develop multilingual knowledge bases reusing the bilingual contents."
=> incorrect sentence, parts are missing.
- [unclear] "Multilingualism is important not only for sharing contents but also for learning new concepts from other cultures."
=> This is not really clear.
- [grammar] "have been extracted from various sources and being represented as Linked Data (LD)."
being => been
- [minor] "Although the current linguistic data sets on the LLOD are able to cover different types of linguistic resources as LD"
=> "are able" can be removed
- [grammar] "Idioms have no clear definition due to their sense comes from concepts"
=> due to the fact that their senses come (plural on senses) OR because their senses
- [unclear] "Idioms have no clear definition due to their sense comes from concepts of everyday life particular to a given culture."
=> isn't it the case that most aspects of language come from everyday life and belong to a given culture?
- [inappropriate] "whose meaning cannot be derived from the semantic meaning of words that constitute them"
=> semantic can be removed
- [inappropriate] "Idioms are classified in general as “non-compositional”."
=> there is no need for quotes around non-compositional.
- the 4th § of the introduction repeats most of the 3rd one.
- [inconsistent] "Additionally, idioms are divided into several syntactic types including noun phrases, verb phrases and partly compositional idioms"
=> "syntactic type" is a bit vague, especially since you talk about part of speech later on. Terminology should be consistent throughout the article, or it should be clear what is being talked about.
=> "partly compositional idioms" is a category which is not syntactic in nature. Moreover, it contradicts what has been said just before. This is precisely the difficulty of defining idioms: their compositionality varies, and this aspect is not always recognized as a defining criterion.
- [grammar] "Russian was chosen due to it comes from Cyrillic alphabet."
=> needs correction
- [unclear] "to the same old Germanic family"
=> is "old" necessary? From the full sentence, one infers that the Germanic family is older than the Romance one; this would need to be supported by a reference.
- [minor] "LIDIOMS aims to support other Multilingual NLP tools" => multilingual does not need uppercase
> Section 2
- [inaccurate/incomplete] "A large range of ontologies has been developed to represent natural language data as LD on the Web of Data."
=> In my opinion this is not really exact. Ontologies are primarily developed to model linguistic information (in our context), and they are used to represent linguistic resources as LD on the Web of Data.
- [inconsistent] "Thus, the well-known ontology lemon was originally developed"
=> "thus" should be removed, as there is no causal link; "in this context" would be more appropriate.
- [inconsistent] "Although these resources are linked lexical multilingual data sets, they contain a limited number of idioms described correctly along with their respective translations across languages."
=> A resource being multilingual does not imply that it should contain idioms, so "Although" is not really appropriate here. If you wish to express that it is good that these resources are multilingual, but that none of them covers idioms correctly, this would need to be reformulated, maybe using "however".
- [poorly worded] "Despite Lexinfo ontology  contains a property about idioms, appropriated classes did not exist to reuse this information."
=> to be reformulated. "appropriated" => appropriate
- [minor] "for handling on this phenomena properly"
=> "on" should be removed
> Section 3.1
- [minor] "we discuss how we assure the quality of the collected data."
- [minor] "as well as the well-known linguistics dictionaries of Collins and Oxford"
=> "linguistics" and "of" should be removed, or simpler: as well as the Oxford and Collins dictionaries.
- [minor] "Although the most expressions are provided by crowd-sourcing which means there is no strict evaluation about their lexical definitions in terms of linguistic professional accuracy, the knowledge base is extremely worthwhile."
=> "Although the most" => Although most
=> "which means" => comma before
=> "linguistic professional accuracy" => "professional linguistic accuracy" would be better, but "accuracy" would suffice.
- [minor] "post-graduate research into computational linguistics." into => in
- [minor] "big-name" => no hyphen
- [inaccurate] "lexical sources" => lexical resources
- [unclear/grammar] "Memrise and Oxford contributes to English, German, Italian, and Russian languages as well as Phrase Finder and Collins only for English."
=> needs reformulation, it is not understandable.
- [minor] "which belong to our computer science department" => can be deleted
- [minor] "provide to us" => without to
- [minor] "in terms of research and replication of the data." => of data
> Section 3.2
- "The retrieval was straightforward because each source has specific pages about MWE thus making the configuration of each page easier as a seed page."
=> unclear, and not sure it is useful.
- [incorrect/unclear] "Note that, all data sources are bilingual not restricted from any language to English."
=> Really not understandable. Suggestion: translations hold between language pairs which do not necessarily contain English (?)
- [poorly worded] "Moreover, we had two main observations going through the crawled data."
- [minor] First one => First
- [inaccurate/poorly worded] "... some MWE have meanings deductible by the semantic meaning of each contained word e.g “by the book”"
=> This sentence could be rephrased, e.g. "in some cases the meaning of MWE can be deduced from the meaning of their components.... while in others...."
You want to say that semantic compositionality applies to some MWE, but not to others. In the latter case, this is not a "pragmatic meaning" (pragmatics has to do with taking into account speakers and the use they make of language in a specific context), but a non-compositional meaning (non-compositional expressions can still be pragmatically equivalent).
"deductible by" => from
"the semantic meaning" => the meaning
- [minor] "The phrase finder" => rm The
- [poorly worded] "The phrase finder has expressions in English divided by location where American idioms which come from The United States of America and British idioms which come from England. Therefore, this characteristic is an important and shows that it always reflects in the idioms meanings."
=> should be reformulated
- [unclear] "For instance, idioms which contain the word “run” have different meanings. This observation is explained properly in the next section."
=> The example is not explained in the next section.
- [poorly worded] "the collection of only idioms was a hard task."
=> needs reformulation
- [unclear] "It has been done manually by discarding the entries in the resources that were semantically equivalent to their lexical definitions."
=> unclear formulation. You want to express the fact that you adopted strict non-compositionality as the selection/filtering criterion.
- [poorly worded] "is deductible by the semantic meaning of each word"
=> from the meaning of each of its components
- [poorly worded] "Through this semantically-based selection, we selected 50% of the MWE retrieved by our process on average to be idioms"
=> the last clause would need to be reformulated
> Section 3.3
- "each of the two"
- [inaccuracy] "This procedure resulted in a gold standard data set"
=> a data set is usually called "gold standard" in an evaluation context. Here the data set is manually checked, or corrected.
- [poorly worded/incorrect] "The representation model of LIDIOMS aims to describe the idioms correctly as a sub-type of MWE also modelling their translations along with their geographical usage area."
=> needs to be rephrased
> Section 4
- [minor] footnote "11/ Prefix lexinfo:idiom stands for http://www. lexinfo.net/ontology/2.0/lexinfo"
=> idiom should be removed
- [poorly worded] "LIDIOMS has the core of the Ontolex model containing the main class ontolex:LexicalEntry to represent a lexical entry"
=> LIDIOMS uses one of Ontolex's core classes, that is to say...
- [poorly worded] the second § of Section 4 should be re-written
"Thus becoming indispensable for our model as well as the ontolex:LexicalSense class."
=> the subject is missing
"and ontolex:Lexicon is responsible for indicating details about the language and the entries in general."
=> the main role of ontolex:Lexicon is to gather lexical entries.
- [inaccurate] "The vartrans:Translation has the property vartrans:category which is responsible for describing relations among translations"
=> for describing translations. The "relations among translations" are the translations themselves, and the relations hold between lexical senses.
Cf. the ontolex definition "The category property indicates the specific type of relation by which two lexical entries or two lexical senses are related."
- [unclear] "...and also representing variations of these relations across entries in the same (i.e same entry from a given language with different meanings) or different languages."
=> This is unclear.
- [unclear] "In addition, we had an important insight for modeling the translations in LIDIOMS"
=> It is not really clear what the insight is. Also: modeling => modelling
- [poorly worded] "While most of the retrieved idiom definitions were in English, just in some cases the definition was in other language."
=> while most of the definitions of the retrieved idioms (or did you retrieve definitions separately?)
=> the second clause needs to be rephrased; "in other" => "in another"
- [minor] " We then decide to provide the definition of all idioms"
=> decided + definitions plural
- [poorly worded] "The other (13%) of idioms which have the definitions in other language, they were translated by a native speaker to English without keeping the definition in the original language."
- [incomplete] "Thus, the English language became our pivot language."
=> English definitions are the pivot (should be specified for it could be confused with English idioms)
- [minor] "the provision of indirect translations (represented by dotted line) of idioms through a pivot language relying on Semantic
=> "relying on Semantic Web technologies" could be deleted
- [minor] "The geographical area is of great importance because the meaning of an idiom can vary in the same language depending on where it is used not only in English"
=> not only in English could be deleted
> Section 5
- [minor] "The original idioms we crawled from the Web"
=> we can be removed
- [minor] "The model underlying the RDF translation"
=> the RDF conversion?
- [minor] missing dot after "Table 2"
- [grammar, missing parts, incorrect] "DBnary has presented a good precision, the only its lower score comes from Portuguese. It is due to the Portuguese to be a language a bit exploited by linguistic resources in terms of MWE as well as Russian thus containing only few idioms."
- [inaccurate] "Also, the idiom “All in the same boat” has the same name as a book"
=> it is rather the contrary
- [grammar] "qualitatively superior NLP systems based on Semantic Web technologies, including but not limited to better machine translation applications."
- Gilles Srasset => Sérasset
References: title case needs to be double-checked.