When Linguistics Meets Web Technologies. Recent advances in Modelling Linguistic Linked Open Data

Tracking #: 2709-3923

Authors: 
Fahad Khan
Christian Chiarcos
Thierry Declerck
Daniela Gifu
Elena González-Blanco García
Jorge Gracia
Max Ionov
Penny Labropoulou
Francesco Mambrini
John McCrae
Émilie Pagé-Perron
Marco Passarotti
Salvador Ros
Ciprian-Octavian Truica

Responsible editor: 
Philipp Cimiano

Submission type: 
Survey Article
Abstract: 
This article provides an up-to-date and comprehensive survey of models (including vocabularies, taxonomies and ontologies) used for representing linguistic linked data (LLD). It focuses on the latest developments and both builds upon and complements previous works covering similar territory. The article begins with an overview of recent trends which have had an impact on linked data models and vocabularies, such as the growing influence of the FAIR guidelines, the funding of several major projects in which LLD is a key component, and the increasing importance of the relationship between the Digital Humanities and LLD. Next, we give an overview of some of the best-known vocabularies and models in LLD. After this we look at some of the latest developments in community standards and initiatives, such as OntoLex-lemon, as well as recent work which has been carried out on corpora and annotation in LLD, including a discussion of the LLD metadata vocabularies META-SHARE and lime and of language identifiers. Following this we look at work which has been realised in a number of recent projects and which has had a significant impact on LLD vocabularies and models.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 03/May/2021
Suggestion:
Major Revision
Review Comment:

This paper provides a very comprehensive state-of-the-art survey on research on linguistic linked data (LLD), in general, and on Semantic Web models/vocabularies currently being used (or proposed) to represent LLD. It establishes a relation to the Digital Humanities field, in which such models are being employed nowadays, and refers to current projects in the areas of Natural Language Processing (NLP), Semantic Web (SW), as well as Digital Humanities (DH), in which these models are being further developed and used.

The paper starts by justifying the importance of LLD in the framework of Open Science initiatives, specifically the one promoting the so-called FAIR principles for data resources. The purpose of this section is to prove that SW models have a clear advantage over other (non-SW) standards. The authors also devote a section of the introduction to highlighting the overlaps that exist between research on LLD and some projects in the DH area.

In section 3, the authors give an overview of the most widely used models and vocabularies in LLD, according to the type of language resource they model, that is: Corpora (and Linguistic Annotations); Lexicons and Dictionaries; Terminologies, Thesauri and Knowledge Bases; Linguistic Resource Metadata; Linguistic Data Categories; and Typological Databases. They make substantial references to previous papers and surveys in this area, and focus on those extensions or further developments of the models that have not been dealt with in much detail in previous works. This is probably the right approach, but it may provide an unbalanced picture of the models currently used in the modelling of language resources in the Linguistic Linked Open Data cloud, and, specifically, of their relevance in the overall picture. Therefore, I would suggest that the authors use Table 1 to provide a summary of LLD vocabularies in general, and somehow highlight the ones they provide more details on in the paper.

After that, in section 4, the authors refer to current standardisation activities related to LLD. The first subsection refers to the Ontology-Lexica Community Group and the latest extensions to the OntoLex-Lemon model, intended to represent information in lexicographic resources (the lexicog module), to represent morphological data in lexical resources, and to enrich lexical resources with information from annotations in corpora (FrAC). The different modules seem to be treated at different levels of detail. Even when referring to previous publications, my impression is that the section would be better balanced if all OntoLex-Lemon extensions received similar attention and a similar number of examples and pictures (I think that this is what one would expect to find in such a survey paper). Moreover, if survey articles in this journal are to be presented as “introductory texts to get started on the covered topic”, some description of the OntoLex-Lemon core is needed for readers to understand some of the details or examples provided in the descriptions of the extensions.
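To give a sense of what such a core description would cover, here is a minimal OntoLex-Lemon core example in Turtle (a sketch only; the entry, form, sense and DBpedia URIs are illustrative):

    @prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
    @prefix lexinfo: <http://www.lexinfo.net/ontology/3.0/lexinfo#> .
    @prefix dbr:     <http://dbpedia.org/resource/> .
    @prefix :        <http://example.org/lexicon#> .

    # A lexical entry with its canonical form and one sense
    :bank_n a ontolex:LexicalEntry ;
        lexinfo:partOfSpeech lexinfo:noun ;
        ontolex:canonicalForm :bank_n_form ;
        ontolex:sense :bank_n_sense1 .

    :bank_n_form a ontolex:Form ;
        ontolex:writtenRep "bank"@en .

    # The sense anchors the entry's meaning to an ontology/KB concept
    :bank_n_sense1 a ontolex:LexicalSense ;
        ontolex:reference dbr:Bank .

The extensions under discussion (lexicog, morphology, FrAC) all attach to these core classes, which is why some introduction to them seems needed first.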

The second part refers to standards proposed for the annotation of linguistic information in corpora, although some are not widely used in the Linguistic Linked Open Data cloud context (as the authors themselves note), namely NIF, Web Annotation, TEI/XML and LAF. Additionally, the authors refer to vocabularies developed to address specific problems of certain user communities, such as Ligt or CoNLL-RDF, and to the problems of convergence among such annotation standards. The level of detail here again seems a bit unbalanced in favour of some vocabularies over others. I would suggest that the authors provide a similar type and amount of information for all vocabularies, or justify why some get more attention than others. This section would benefit significantly from a final summarising table with the most relevant features of the vocabularies with respect to research in LLD and the FAIR principles. Finally, section 4.3 is devoted to metadata and standards for language identification.
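For illustration of the kind of model under comparison, a minimal NIF-style annotation of a single entity mention might look as follows (a sketch only; the document URI and the DBpedia link are illustrative):

    @prefix nif:    <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
    @prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
    @prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

    # The whole document text serves as the annotation context
    <http://example.org/doc1#char=0,12> a nif:Context ;
        nif:isString "Hello Berlin" ;
        nif:beginIndex "0"^^xsd:nonNegativeInteger ;
        nif:endIndex "12"^^xsd:nonNegativeInteger .

    # A substring of the context, annotated with an entity link
    <http://example.org/doc1#char=6,12> a nif:String ;
        nif:referenceContext <http://example.org/doc1#char=0,12> ;
        nif:anchorOf "Berlin" ;
        itsrdf:taIdentRef <http://dbpedia.org/resource/Berlin> .

A summarising table could, for instance, contrast how each vocabulary identifies such text spans (offset-based URIs here, versus selectors in Web Annotation).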

Section 5 is devoted to providing a historical account of EU and nationally funded projects that have contributed to the development of models and vocabularies for LLD, or to the conversion of resources to the LLOD cloud. The summarising table here seems very appropriate, although it was difficult to read because, in the PDF version made available for review, the table overlapped with the text. In the project descriptions, the imbalance among sections mentioned above is observed again. Some provide details even for modelling solutions adopted for certain use cases (see the ELEXIS description in section 5.7), in which case the reader needs to be well acquainted with the mentioned models, whereas others give a very brief overview of the project objectives (as is the case for the Prêt-à-LLOD project).

The conclusions seem quite succinct in comparison with the length of the paper and the wide range of topics covered. Some hints as to future challenges or research directions would be welcome.

In summary, this paper intends to provide a very comprehensive overview of models, standardisation initiatives and projects related to LLD. It is an ambitious, but much needed, survey for those who aim to get an overview of past and current work in this area.

I would suggest that the authors review the paper for some parts that sound a bit repetitive and that could be shortened for the sake of conciseness (for instance, section 2.2, given that there is a whole section devoted to projects, or section 2.3).

Such an extensive review would benefit significantly from two additions:
(1) summarising tables at the end of each of the sections or even sub-sections, as suggested above;
(2) introductory paragraphs in the main sections of the paper, in which the content of the section is anticipated (as a bulleted list, for example).

Review #2
By Armando Stellato submitted on 12/May/2021
Suggestion:
Minor Revision
Review Comment:

The presented article aims at providing a survey of models for the representation of several kinds of language resources, which go under the common umbrella name of “Linguistic Linked Data”, touching on several related topics such as the FAIR guidelines (and the compliance of these models with FAIR requirements, or how they support this compliance in datasets modelled after them), the relationship with Digital Humanities, metadata targeted at linguistic assets, and projects impacting the development of LLD vocabularies and models.

The article is nicely written and well organised and, to my knowledge, effectively covers the state of the art on the matter, providing indeed a great introduction to Linked Open Data in Linguistics for new, inexperienced readers and a good compendium for experts in the field. The theme is also gaining momentum and is surely of interest to the broader audience of the Semantic Web community.

Personally, I find the sections on FAIR unnecessary and quite detached from the rest. The FAIR principles surely have great merit in that they extended the sensibility towards openness, reusability, discoverability and interoperability to communities such as digital humanities, digital archives, etc., which have not often been close to the stream of innovation brought by the Semantic Web. Concretely, however, they added nothing to what SW standards, L(O)D policies and, on a more popular, disseminative level, TBL's 5-star path to open data all together did to guide towards an enlightened publication of data. Shortly put, there is a thesis to be proven, i.e. that the SW stack of protocols and languages fully supports compliance with the FAIR principles. However, this is in no way specific to data for linguistics, nor should it be proven in this survey. I understand, though, that given the current global scenario and the separate identities (though with overlap) and diverse acknowledgements that the FAIR principles and SW technologies have, it might be useful to restate the obvious. I thus limit myself to pointing out this redundancy and leave to the authors the choice of whether to reduce or remove all the content related to FAIR.

For an article whose title starts with “When Linguistics Meets Web Technologies”, I would have expected more emphasis also on the technological stack that can enable the proliferation and use of LLOD, such as editing tools. This is not only a technical aspect in that, as the authors themselves stress in one paragraph, there are various stages in the evolution of the diffusion of given standards, which start from convergence towards common models, then move to the production of data (in some cases, flooding the market with something that is not felt to be necessary, until its wide availability kickstarts acknowledgement of it and fosters the need for it), and finally arrive at real adoption. We are now in the “data production” phase, which does not strongly require editing systems, as it mostly involves the conversion of existing resources; however, it is through platforms that allow for a thorough analysis and development of resources that we can:

* discover resources that are not properly developed. If we just “fire off resources on the web” and consider the task done by setting up a SPARQL endpoint, we might miss many issues in the data and, as long as the models behind them are still young, potential issues in the models that govern them as well. Such issues are more easily spotted when the resources have to be properly loaded and read by a system that conforms to the same standards they conform to.

* develop a new generation of resources that really exploits the full power offered by these new standards, instead of adapting poorer information coming from legacy resources to modern vocabularies (or suites of vocabularies) such as OntoLex-Lemon and LexInfo.

My personal experience with VocBench 3 [1], which offers support for OntoLex, in the context of the PMKI project [2], which, among other things, included bringing a few resources to the modern OntoLex-Lemon model, has been that many resources developed in the context of other projects (GWA WordNet, IATE converted from the LIDER project, etc.) contained several (in some cases major) conversion bugs that made them unusable by OntoLex-compliant tools. Similar experiences came when we simply tried to host these resources through OntoLex-compliant publication tools, which led back to fix-and-retry iterations. While this is perfectly normal (it is part of the lifecycle of a resource, and all findings have been contributed to the maintainers of the original resources or of the converters that produced their porting to OntoLex), it is in this “normality” that the role of development and publication platforms emerges.
I would thus suggest dedicating a small section to this aspect, mentioning existing systems (not many currently, if we consider OntoLex) like the already mentioned Lexvo, VocBench and others enabling the editing and/or publication of LLD resources.

[1] Armando Stellato, Manuel Fiorelli, Andrea Turbati, Tiziano Lorenzetti, Willem van Gemert, Denis Dechandon, Christine Laaboudi-Spoiden, Anikó Gerencsér, Anne Waniart, Eugeniu Costetchi and Johannes Keizer. VocBench 3: A collaborative Semantic Web editor for ontologies, thesauri and lexicons. Semantic Web, 2020. doi:10.3233/SW-200370
[2] https://ec.europa.eu/isa2/actions/overcoming-language-barriers_en

Besides the few considerations above (which are not prescriptive), I think the article is already in a very good state and is almost ready for publication. I leave here a few more notes touching some specific points of the work and that can be easily dealt with.

TECHNICAL NOTES:

References have the form pXcYrZ, meaning: page X, column Y, row Z.

p3c1r41: since registries are mentioned, maybe mention “reachability” / “discoverability” (or “findability”, as it is called in the mentioned FAIR principles), as the mentioned qualities refer only to the use of domain vocabularies rather than the use of registries.

p3c2r26: The description of the advantages of OWL for LLD models seems to evoke some mumbo-jumbo (i.e. unexplained) capabilities of OWL which I do not think it possesses. While it is true that the shared semantics of OWL allow for a better axiomatisation of terms (to paraphrase: to add further characteristics of them that hold in all possible interpretations of the logical term), the authors seem to stress the ability to disambiguate the meaning of the terms, that is, their interpretation, which is something that a logical modelling language does not do. The authors seem to hint at this aspect (talking in terms of limitations) in the sentence within round brackets in row 36. This, however, should not be a detail, but rather the whole point being made.
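To make the point concrete with a toy sketch (the class names are illustrative):

    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix :    <http://example.org/toy#> .

    # The axiom excludes any interpretation in which the classes overlap
    :FinancialBank a owl:Class .
    :RiverBank     a owl:Class .
    :FinancialBank owl:disjointWith :RiverBank .

The axiom rules out every interpretation in which the two classes share an instance, but it does not determine what either class denotes: an interpretation mapping :FinancialBank to the set of all teapots still satisfies it. Axioms constrain the admissible interpretations; they do not pick one out.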

p4c1r3: I think saying “OWL and PROV-O” is confusing. The described characteristics belong to PROV-O alone. Possibly what the authors mean is that OWL has the advantage of being a general-purpose KR language, binding then all those models (modelled in turn after it) under one common umbrella. This is something that is missing from other specific (and preceding) initiatives and could be stated in a separate paragraph. Putting OWL and PROV-O together in that sentence is too much of a simplification. Furthermore, I could possibly be missing something in PROV-O, but the way the authors say that it allows one to specify “whether we are describing a hypothesis or not” might suggest that PROV-O injects some modal support into OWL (i.e. that anything described in OWL can then be framed into a “hypothesis” dimension, separated from explicit facts), which is not the case. PROV-O simply offers the possibility to describe events, processes… and hypotheses.
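As a minimal sketch of what such a PROV-O description amounts to (the reading, activity and editor URIs are illustrative):

    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix :     <http://example.org/edition#> .

    # Provenance of a conjectured reading: who produced it, and how
    :reading1 a prov:Entity ;
        prov:wasGeneratedBy :manualCollation ;
        prov:wasAttributedTo :editorX .

    :manualCollation a prov:Activity .
    :editorX a prov:Agent .

These triples record how :reading1 came about, but any statements made about the reading itself remain ordinary assertions; nothing here confers a separate “hypothetical” modality on them.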

p10c1r42: “The latter has been described above”. Lime has actually been “previously introduced”, while its description follows in the dedicated section 4.3.2.

p11c1: It might be worth highlighting that LexInfo 3.0 is the first version that is compliant with OntoLex-Lemon.

MINOR REMARKS and TYPOs:

Abstract
“complement” → “complements”

p2c1r29: not sure whether “build upon and complement” is intended as attached to the auxiliary “will”; hence not sure whether the auxiliary needs to be repeated for them, whether they require the “s” for the third person, or whether the expression can be left as is.

p12c2r38: “converge in also in” → “converge also in”

p28c1r32: “so this is allows a compact”: either “this is a compact” or “this allows for a compact”

Fig. 4 exceeds the first column, overlapping with the second column


Comments

LexVo (mentioned as an editing tool) is actually LexO (currently LexO-Lite).
I would also add Evoke (paper currently under review here: http://semantic-web-journal.net/content/evoke-exploring-and-extending-le...).

We will take the errata into consideration in our revised version.