Multilinguality and LLOD: A Survey Across Linguistic Description Levels

Tracking #: 3259-4473

Authors: 
Dagmar Gromann
Elena-Simona Apostol
Christian Chiarcos
Marco Cremaschi
Jorge Gracia
Katerina Gkirtzou
Chaya Liebeskind
Verginica Barbu Mititelu
Liudmila Mockiene
Michael Rosner
Ineke Schuurman
Gilles Sérasset
Purificação Silvano
Blerina Spahiu
Ciprian-Octavian Truica
Andrius Utka
Giedrė Valūnaitė Oleškevičienė

Responsible editor: 
Harald Sack

Submission type: 
Survey Article
Abstract: 
Limited accessibility to language resources and technologies challenges communities of speakers of any language other than English. Linguistic Linked (Open) Data (LLOD) holds the promise to ease the creation, linking, and reuse of multilingual linguistic data across distributed and heterogeneous resources. However, individual language resources and technologies accommodate or target different linguistic description levels, e.g. morphology, syntax, phonology, and pragmatics. In this comprehensive survey, the state-of-the-art of multilinguality and LLOD is being represented with a particular focus on linguistic description levels, identifying open challenges and gaps as well as proposing an ideal ecosystem for multilingual LLOD across description levels. This survey seeks to contribute an introductory text for newcomers to the field of multilingual LLOD, uncover gaps and challenges to be tackled by the LLOD community in reference to linguistic description levels, and present a solid basis for a future best practice of multilingual LLOD across description levels.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 03/Nov/2022
Suggestion:
Major Revision
Review Comment:

The survey categorizes the state-of-the-art on linguistic linked open data from the aspect of multilingualism. It targets a very interesting topic, where the authors at the end of each category/work discuss open problems. It also discusses broad open challenges and the research gaps.

Section 2: Subsections 2.3 and 2.4 could be merged since this seems to be repetitive information.

Section 3.1.1 discusses the defined keywords. The authors could provide those keywords as a list in a file and share the link there. This would help future surveys.

Figure 1 could be made a bit smaller. A graphical representation of the procedure followed, i.e., at which steps the experts were involved, would also help in getting a clearer picture at a glance. In this figure, it is also hard to figure out the final number of submissions (these details are of course available in the text).

The studies discussed are divided into seven categories. The authors should also provide one table listing the works in each of the categories with the targeted languages. A bit more structure within each of the broader categories would help in a better representation of the works that are being discussed there.

Even though the paper is a good depiction of existing studies, the authors should also discuss the application of these resources in NLP tasks, for example, Framester is evaluated on the task of frame disambiguation (frame identification).

The authors did not discuss the linguistic resources such as BabelNet, VerbNet, WordNet, etc. in any of the categories. Is there a particular reason for this or is it assumed that Framester covers it?

More details in Section 6.6 about under-resourced languages where the references 34-37 and 204 are cited should be discussed.

There has also been a recent special issue in SWJ on linguistic linked open data [1], which is in print and in which several papers discuss LLOD from different perspectives.

The open challenges which are discussed in the paper are very interesting.

Minor comments:
- page 12: SKOS has already been used as an acronym, the full form should be introduced the first time.
- page 13: DBPedia --> DBpedia
- page 15: Linguistic Linked Open Data formalism (a link should be cited)

[1] https://www.semantic-web-journal.net/blog/call-papers-special-issue-late...

Review #2
Anonymous submitted on 13/Nov/2022
Suggestion:
Major Revision
Review Comment:

This survey provides an overview of the current state regarding the treatment of multilinguality in the Linguistic Linked Open Data Cloud. It focuses in particular on foundations and recent developments with respect to the following levels of description: lexical semantics, pragmatics, lexicography, etymology and diachronicity, translation and terminology,

Overall, this is a very diligently conducted systematic review providing a synthesis of the state-of-the-art in dealing with multiple languages on the LLOD. The review has been very thoroughly conducted following the PRISMA approach and includes a vast amount of references and pointers to relevant work. This review has the potential to become the reference for researchers wanting to get an overview of recent work on multilinguality in the context of the LLOD.

I have two major comments on the article and a number of minor / stylistic points:

Major points

It would be good if the authors could add a glossary to the paper in which they explain the linguistic terms used in the text that might not be known by a general reader of the journal. Examples are discourse, etymology, diachronicity, typologies, inflectional morphology, etc. This could be done as part of an appendix.

In Sections 4.1 - 4.7 it would be good to close the sections with a summary of the key standards and methods that are available and a summary of the problems / questions that are still open so that the reader has a clear take-home message for each section. As the sections stand, it remains unclear what problems are solved and which ones are still open or unaddressed at the respective level of description. The authors could add sth. like "In summary, X and Y have been addressed successfully / there is substantial work on X and Y, but Z is still open / insufficiently addressed."

A general comment: the article mentions at multiple places, including in the abstract, that "speakers" are affected by the discussed challenges. I have doubts whether it is helpful to talk about "speakers" in general, as the common speaker might not even be aware of LLOD or of the fact that their own language is under-resourced. The authors could be more precise in terms of who exactly is affected by the challenges they discuss.

Minor points

In general, the authors are using references as part of a sentence which is bad style, I point to the relevant places below.

Page 2 “Introduction”

The 1st paragraph of the introduction is quite unfocused and not really easy to digest / read. A lot of things are mixed and mentioned that are not really related to each other. The authors talk about the fact that language shapes interaction, that they also “conceptualize” the world (this is wrong IMHO as only agents but not languages can conceptualize sth.). They talk about “language pluralism”, “pressure by major language”, “linguistic relativity”, etc.

The 1st paragraph should be rewritten to have a clearer focus and motivation for the work presented. As it stands, the first paragraph is an enumeration of several aspects of languages that do not form a coherent view or perspective. It is btw. puzzling to use “Furthermore” in “Furthermore digital language data” as there is no obvious connection to the previous sentence which talks about pluralism and cultural heritage. The authors should have a more focused introductory set of sentences that does not need to refer to such a breadth of concepts / aspects of language.

Page 3 bottom / Top of page 4

The sentences are quite long and should be shortened. In general, the article tends to use longer sentences spanning 3-4 lines oftentimes. Shorter sentences of 2-3 lines should be favoured to facilitate reading and understanding.

“will to link” sounds weird, “desire to link” sounds better to me.

The following sentence / claim is puzzling:

“As a result of these trends *comma missing* we find ourselves today in a situation where the semantic layer is no longer the only bridge between languages. Translations are, in principle, possible via the linguistic layer, …”

This is not clear as translation is inherently a semantic task; I simply do not understand what the authors are implying here. This needs to be clarified.

Page 5

I cannot really follow why static vs. dynamic is a reasonable distinguishing criterion between language resources on the one hand and services and tools on the other. Services and tools typically compute an output from an input, but that does not necessarily make them dynamic, as the computing function can stay the same over time. Language resources also might evolve over time with new texts being added etc.

The definition of “knowledge-based structures” is not a good one IMHO. It talks mainly about what knowledge-based structures are *not* (natural language words) but fails to give a good positive definition or examples.

Section 2.3

“It is a truism” => this is not a scientific statement, nothing in science is a truism.

“there may be several possible kinds of connection” => connections

2.3.1.

"Ultimately, it has to bottom out in the association of an entity of some kind with a universally accepted language label."

=> Not clear what the authors mean here; to which “some kind” of entities does this apply? What is a universally accepted language label?

2.3.2.

mini-language corpora; it is quite a stretch to call the input to a service a “mini-language corpus”. Corpora represent a collection of texts or other linguistic materials that are assembled together for some purpose. They are intentionally created artifacts that have a purpose, and deliberate choices are made regarding what to include and what not to include in the corpus. Calling the text input to a service a “corpus” is in this sense a stretch and unnecessary IMHO.

2.3.3.

I have not heard before that a “proposition” can be seen as a knowledge structure. In which sense of a “proposition” is this the case?

What do the authors mean by: “However, that connection is less direct than for a string”? Please elaborate.

In general, conceptual structures are linked to language not only to make the concept understandable, but to ground the concept in some symbol system that has already a meaning. Otherwise it would be difficult to define / express the meaning of the concept without making reference to an existing system of reference (language). This aspect could also be highlighted.

Section 2.4.

Who is “they” in “but they emphasize in addition two key points, …”?

Page 13

DBpedia is written as DBPedia on the same page, please be consistent. DBpedia is the right spelling btw.

syntactic information is provided by [57]. => reference is used as sentence element

“Several future, additional features that should be addressed” => ungrammatical start of sentence, not clear

Page 14

Sentence: “This general line of research from work on ontology-based parsing…”: the verb seems to be missing from this sentence.

Page 15

“the area is generally suffering from” => suffers from

What is PDTB ? This is not explained / described.

"Bosque-Gil et al. discusses" => discuss

What is bidix? This is not explained / described.

Page 16

Top: “propose in [109]” => reference used as sentence element

“such as cuneiform signs in LLOD should be considered” => no comma

“Diatopic-diachronic as well as diatopcy-synchronic representations of languages is one description” => “are” ???

Page 17

“Phonetics studies… Phonology studies”. Repetition at the beginning of both sentences. This is suboptimal from a stylistic point of view.

role as in [23] => reference as sentence element

the method proposed by [122] => reference as sentence element

which LIDIOMS [126] introduces by means of ontolex and vartrans => which are introduced in LIDIOMS by …

Page 18

as LLOD described in Lewis => as LLOD as described by Lewis ???

Furthermore, in terminology and translation *comma* varying degrees

in order to allow *for* a cross-resource analysis

Page 20

demonstrated the applicability of *the* Multilingual Morpheme Ontology

Page 21

presented in [159] => reference used as sentence element

DBpedia project consists => The DBpedia project consists …

In [160] => reference used as sentence element

In [162] => reference used as sentence element

Page 22

interlinking high-quality government data via *no the* RDF and SPARQL

in [174] => same as above

Such moderated repositories enables => enable

Add enumeration to the different annotation scheme levels in the paragraph on “Linguistic Data Categories”

Page 23

Autom ted Similarity Judgement Program => blank instead of “a”

GLOTTOLOG is written uppercase and lower cases as “Glottolog” on the same page, please be consistent throughout the article

Page 26

PHOIBLE … 2.000 language. Start with new sentence: “However, …”

communal base => common base?

Page 27

"the use of lOD for research …. require" => requires

From the perspective of conceptualisation, another issues arise => other issues arise ?

The TIAD task has being beneficial => been beneficial ?

Section 6.6.

There are a number of works in the scientific literature that clearly illustrate**. => no "s"

“which contrasts with the still low adoption” => “contrasts” is not the right word I think.

Review #3
Anonymous submitted on 24/Jul/2023
Suggestion:
Minor Revision
Review Comment:

Multilingual Linked Open Data is essential for breaking down language barriers and promoting global knowledge sharing. It encourages cross-lingual information retrieval and analysis and allows users with varied language backgrounds to access and contribute to a shared pool of knowledge. This paper provides an interesting one-stop solution for beginners on how to promote linguistic description levels in the LLOD, identifies challenges and gaps, and lays the groundwork for future best practices in expressing, modelling, and linking various levels across multilingual LLOD resources.

Pros:

1. The survey detailed in the texts provides an overview of available representation models, tools, and methodologies for and across different language description levels, highlighting existing issues and limitations.
2. It has also provided a detailed description of the selection of the papers for this survey.
3. It offers a state-of-the-art reference for academics and practitioners interested in making their linguistic data available as LLOD, with an emphasis on the methods that are currently being used for various linguistic description levels.
4. Additionally, it exposes unresolved issues and limitations in the multilingual LLOD resources' capacity to accommodate particular linguistic description levels.
5. In order to identify gaps and obstacles and focus future cooperative efforts on multilinguality and LLOD, the survey can therefore aid both researchers and professionals in the field of multilingual linguistic linked data.

Cons:

1. Since the survey intends to serve as an introductory text for beginners in the field of multilingual LLOD, it would be interesting and beneficial to know the different application areas of multilingual LLOD. There are many uses for multilingual linked open data, including efficient cross-lingual information retrieval, global knowledge sharing, cultural heritage preservation, healthcare and medical research, e-government services, media and journalism, language learning and education, cross-cultural business and commerce, translation and localization services, disaster response and humanitarian aid, environmental monitoring and research, as well as multilingual digital libraries. It would be interesting to know how much multilingual LLOD has contributed to the different application areas until now.
2. Would it be possible to provide an overview of the coverage of information across different languages already enclosed in the multilingual LLOD out there? This would provide better insight into the missing multilingual information in the existing LOD.