MMoOn Core - The Multilingual Morpheme Ontology

Tracking #: 2419-3633

Bettina Klimek
Markus Ackermann
Martin Brümmer
Sebastian Hellmann

Responsible editor: 
Philipp Cimiano

Submission type: 
Ontology Description
In the last years a rapid emergence of lexical resources evolved in the Semantic Web. Whereas most of the linguistic information is already machine-readable, we found that morphological information is mostly absent or only contained in semi-structured strings. An integration of morphemic data has not yet been undertaken due to the lack of existing domain-specific ontologies and explicit morphemic data. In this paper, we present the Multilingual Morpheme Ontology called MMoOn Core which can be regarded as the first comprehensive ontology for the linguistic domain of morphological language data. It will be described how crucial concepts like morphs, morphemes, word-forms and meanings are represented and interrelated and how language-specific morpheme inventories as a new possibility of morphological datasets can be created. The aim of the MMoOn Core ontology is to serve as a shared semantic model for linguists and NLP researchers alike to enable the creation, conversion, exchange, reuse and enrichment of morphological language data across different data-dependent language sciences. Therefore, various use cases are illustrated to draw attention to the cross-disciplinary potential which can be realized with the MMoOn Core ontology in the context of the existing Linguistic Linked Data research landscape.

Minor Revision

Solicited Reviews:
Review #1
By John McCrae submitted on 19/Mar/2020
Minor Revision
Review Comment:

This paper presents the MMoOn ontology for the representation of morphological linguistic databases. This model has been around for a while and this paper presents a welcome overview of the model. The ontology seems in general to be soundly designed and well suited to its intended use, and the authors have shown that it is applicable to a couple of use cases, with other future use cases, including machine translation, planned out in this paper. Moreover, this model has been integrated with relevant similar standards, most notably the OntoLex model for ontology-lexica, as well as Ligt, OLiA and DBnary. The authors are well-connected and well-versed in the current state of the art around linguistic linked open data and this shows in the quality of this paper.

I had one technical question: the authors talk about assigning meanings to morphs, however they do not make it clear what happens if there are multiple morphs with the same form but different function. For example the authors propose a '-er' suffix for German identified as `deu_inventory:Suffix_er1`, however the '-er' suffix can do several things, including making adjectives comparative and deriving nouns from verbs. It would be good to see how this ambiguity is modelled.
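To make the question concrete, the following is a minimal sketch in plain Python, not the authors' model. It stores triples as tuples and lists the forms that are shared by several morph resources with different meanings. Only `deu_inventory:Suffix_er1` and the `hasMeaning` property are named in the paper and reviews; `deu_inventory:Suffix_er2`, the `hasRepresentation` property, and the pairing of the meanings `AgentNominalizer` and `Comparative` with the German suffixes are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical data: one resource per function of the German suffix "-er".
# Suffix_er2, hasRepresentation and the meaning labels are illustrative assumptions.
TRIPLES = [
    ("deu_inventory:Suffix_er1", "hasRepresentation", "er"),
    ("deu_inventory:Suffix_er1", "hasMeaning", "AgentNominalizer"),
    ("deu_inventory:Suffix_er2", "hasRepresentation", "er"),
    ("deu_inventory:Suffix_er2", "hasMeaning", "Comparative"),
]


def ambiguous_forms(triples):
    """Map each surface form to its meanings, keeping only forms with more than one."""
    # Look up the surface form of every morph resource.
    form_of = {s: o for s, p, o in triples if p == "hasRepresentation"}
    meanings = defaultdict(set)
    for s, p, o in triples:
        if p == "hasMeaning" and s in form_of:
            meanings[form_of[s]].add(o)
    return {form: ms for form, ms in meanings.items() if len(ms) > 1}


print(ambiguous_forms(TRIPLES))
```

Whether the ontology intends one resource per function, as assumed here, or a single resource carrying several meanings is exactly what the paper should state.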

Minor comments:
There is a bit of Denglisch in this paper that should be corrected.

The authors like hyphenating terms such as "word-form", "word-formation" and "word-class". This is not generally correct in English.

p1. "AbstractIn"
p1. "resources *has* evolved"
p1. "inventories *can be created* as a new possibility"
p1. "has *ever since* played" => do you mean 'always'?
p1. no "the" before "linguistic knowledge"
p2. "in a granular way *and* assign"
p2. Many sentences start with "I.e.", which is not conventional English style. It should be lowercase, italicised or, better yet, avoided.
p3. Sentence "Especially field researchers ... linguistic data" is very hard to understand
p4. "big range" => "large range"
p4. "Insofar" => "Thus far" I think... or do you mean "Insofern" = "To this extent/end"?
p5. "allows to represent" => "allows (the) representation of"
p7. "*Given* the current state *of the art*, Linked Data vocabularies"
p8. "It describes which morphemes words can be segmented *into*"
p12. "relations *for modelling*"
p14. "allow (the) modelling of"
p16. "functions *at* the language-independent level"
p16. "on a kind of meta-level" - I don't know what you mean here
p21. "Eventually, the creators" - I don't know what is meant by "eventually" here
p22. "Insofar" again
p23. LaTeX quotes `the Alps', `female ...', `I have...'
p23. English is abbreviated as "eng." and a line later as "engl."
p25. capitalization of Xhosa in [5]

Review #2
Anonymous submitted on 01/Apr/2020
Minor Revision
Review Comment:

This manuscript was submitted as 'Ontology Description' and should be reviewed along the following dimensions: (1) Quality and relevance of the described ontology (convincing evidence must be provided). (2) Illustration, clarity and readability of the describing paper, which shall convey to the reader the key aspects of the described ontology.

The manuscript describes a multilingual morpheme ontology called MMoOn Core (MC) intended by its authors to support the modelling and publication of morphological language data, and in particular morpheme inventories, as linked data datasets. MC is meant to be used both in the case of NLP-oriented resources as well as for linguistic datasets; it is language independent and its approach to creating language specific datasets is based on a specially defined layered architecture.

The submission starts by motivating the need for such a vocabulary/ontology and by relating MC to other vocabularies/ontologies in the domain of linguistic linked data, in particular to ontolex-lemon and ligt; this part of the submission also features a subsection on already existing morphological resources. This is followed by a domain analysis and then a detailed description of the MC ontology: first its main classes, then its properties; the use of some of these is illustrated by an example. Next, the authors lay out how language-specific resources can be integrated into their proposed architecture. Other aspects of the ontology’s design are also touched upon, and the relationship between ontolex-lemon and MC is discussed at length. The final section gives a number of possible use cases for MC.

The paper gives a convincing argument for the need for an LLD ontology/vocabulary of the kind described in this paper. However, I would have liked to have seen some discussion of or at least reference made to previous work in publishing morphological data as computational resources in non-Semantic Web contexts since this work does have bearing on the current case. In particular a comparison with the approach taken by LMF for both intensional and extensional morphological data would be useful here: especially since LMF is a format neutral model (insofar as it is described in UML and not in any specific serialisation format such as RDF) and was extremely influential on lemon (although, alas, not from the point of view of morphology).

Another thing which I would like to see better described and justified is the distinction between the classes Morph and Morpheme in MC. Morphemes are defined, in very many texts, as the smallest meaning-bearing elements in a language, whereas morphs are defined as essentially strings of phonemes that can represent one or more morphemes (in the MC case morphs can only represent one morpheme), and indeed morphs get their meanings in specific cases from these morphemes. In addition, affixes are usually described as kinds of morphemes and not morphs (in MC the affix class is a subclass of Morph, which means that in MC being an affix is not part of the meaning of a Morpheme). In the MC model the distinction between morpheme and morph is somewhat blurred, and both morphemes and morphs can have meanings via the hasMeaning property. This means that in the player example given in the paper both the ‘er’ morph and its morpheme have the AgentNominalizer meaning. Since there doesn’t seem to be any constraint on the use of hasMeaning for morphs and their corresponding morphemes, this could lead to differing interpretations of the model, which might make datasets that use it less interoperable than they might otherwise be.

I’m also puzzled as to how LexicalEntry and WordForm can both be subclasses of Word when they are two different kinds of conceptual entity (this is bad form in ontology modelling) -- and indeed in ontolex-lemon and LMF Word is a subclass of LexicalEntry (this, for instance, would make interconnecting MC with ontolex-lemon as proposed in Section 9 much more challenging) -- this choice needs to be motivated. Moreover, the difference between Lexical Entry and Lexeme in Figure 2 should also be explained and made clear to users of MC, since the two terms are often used interchangeably.
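The interoperability concern about hasMeaning can be pictured with a small consistency check. This is a sketch in plain Python under stated assumptions, not something the paper defines: the `eng_inventory:` identifiers and the `correspondsToMorpheme` property name are hypothetical, and only `hasMeaning` and `AgentNominalizer` are taken from the paper's player example. The check reports every morph whose set of meanings differs from that of its morpheme.

```python
# Hypothetical identifiers; only hasMeaning and AgentNominalizer come from the paper.
CONSISTENT = [
    ("eng_inventory:Suffix_er1", "correspondsToMorpheme", "eng_inventory:Morpheme_er"),
    ("eng_inventory:Suffix_er1", "hasMeaning", "AgentNominalizer"),
    ("eng_inventory:Morpheme_er", "hasMeaning", "AgentNominalizer"),
]
# A second dataset that attaches the meaning to the morph only.
DIVERGENT = CONSISTENT[:2]


def meaning_mismatches(triples):
    """Return (morph, morpheme, morph_only, morpheme_only) where the meaning sets differ."""
    # Collect the meanings attached to each resource.
    meanings = {}
    for s, p, o in triples:
        if p == "hasMeaning":
            meanings.setdefault(s, set()).add(o)
    mismatches = []
    for s, p, o in triples:
        if p == "correspondsToMorpheme":
            morph_m = meanings.get(s, set())
            morpheme_m = meanings.get(o, set())
            if morph_m != morpheme_m:
                mismatches.append((s, o, morph_m - morpheme_m, morpheme_m - morph_m))
    return mismatches


print(meaning_mismatches(CONSISTENT))
print(meaning_mismatches(DIVERGENT))
```

Both datasets are plausible readings of the model as described, yet a query for the meaning of the morpheme succeeds on one and fails on the other; an explicit constraint or guideline in the paper would remove this ambiguity.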

Overall I think some of the individual ontological decisions taken in the model, at least those pertaining to the main classes and properties, should be better explained, especially those which seem to differ from how numerous other sources (especially in the domain of morphology) define them. An expansion of Section 5, including descriptions of some of the other classes in the ontology (e.g., Lexeme, LexicalEntry, and those whose definition might not be immediately obvious to non-linguists) and better descriptions of the classes already featured (especially MorphologicalRelationship) with some more illustrative examples, would make this paper much more useful; something similar should also be done with the main properties in the model. It would also make the article much better suited to the ontology description submission track.

In addition there are numerous errors in English/general typos to be found throughout the paper. For instance the first line reads “Morphological language Data (MLD) has ever since played a crucial role across various interdisciplinary research fields”...ever since what? The first paragraph also mentions “large text amounts” instead of “large amounts of text”. The paper would definitely benefit from being proofread by a native speaker of English.