Paving the Way for Enriched Metadata of Linguistic Linked Data

Tracking #: 2994-4208

Authors: 
Maria Pia di Buono
Hugo Gonçalo Oliveira
Verginica Barbu Mititelu
Blerina Spahiu
Gennaro Nolano

Responsible editor: 
Guest Editors Advancements in Linguistics Linked Data 2021

Submission type: 
Full Paper
Abstract: 
The need for reusable, interoperable, and interlinked linguistic resources in Natural Language Processing downstream tasks has been proved by the increasing efforts to develop standards and metadata suitable to represent several layers of information. Nevertheless, despite these efforts, the achievement of full compatibility for metadata in linguistic resource production is still far from being reached. Access to resources observing these standards is hindered either by (i) lack of or incomplete information, (ii) inconsistent ways of coding their metadata, and (iii) lack of maintenance. In this paper, we offer a quantitative and qualitative analysis of descriptive metadata and resources availability of two main metadata repositories: LOD Cloud and Annohub. Furthermore, we introduce a metadata enrichment, which aims at improving resource information, and a metadata alignment to META-SHARE ontology, suitable for easing the accessibility and interoperability of such resources.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
By Manuel Fiorelli submitted on 20/Feb/2022
Suggestion:
Minor Revision
Review Comment:

I read the response letter and the resubmitted manuscript carefully. Again, I was impressed by the breadth of its survey of related work, which also serves to set the context for this work. Still, despite the success in addressing the concerns I expressed about the previous submission, I have found new problems in the amended manuscript that need to be fixed before it can be considered ready for publication.

In my opinion, the biggest problems in the manuscript are related to the RDF representation of MELLD and how it has been described in the paper. I will thus discuss these first and leave other minor considerations to the end, in particular those related to typographic/writing errors the correction of which is obvious and unquestionable.

* In Section 2, the authors introduced META-SHARE, its XSD metadata schema and the effort [30] to convert the latter into an OWL ontology.
However, they also cited [29], which discusses meta-share.owl. Quoting [29], the “Meta-Share.owl ontology was designed from the META-SHARE XML-based model as starting point”. This ontology was presented in the introduction of [29] as part of “a more general effort [that] was carried out in the context of the W3C Linked Data for language Technologies (LD4LT) community group […]”. The ontology is available at http://purl.org/net/def/metashare but a new version was being developed (at the time) in the GitHub repository https://github.com/ld4lt/metashare. Interestingly, the ontology IRI resolves to “http://ld4lt.github.io/metashare/metashare.owl”, i.e. the meta-share.owl file is hosted via GitHub Pages in the LD4LT GitHub repository.
The LD4LT GitHub repository contains three branches: master, dev and gh-pages. Based on the README, the master branch contains the stable v1 of the ontology, while the dev branch contains the v2 of the ontology being developed. Based on the GitHub documentation (https://docs.github.com/en/pages/getting-started-with-github-pages/about...), the branch gh-pages contains the source for the GitHub Pages of the project. By comparing the owl files in the master and gh-pages branches, I verified that they are the same. This should prove that (at least today) meta-share.owl mentioned in [29] is actually the stable v1 of the ontology by LD4LT.
Another problem is that in section 4.1 the authors introduced the META-SHARE ontology published by META-SHARE itself (in footnote 45) without any mention of it in Section 2.
Interestingly, this ontology looks very similar to the v2 ontology contained in the dev branch of the LD4LT repo, as they have the same IRI (http://w3id.org/meta-share/meta-share) and the same creators.
I suggest that the authors reorganize the citations, making it clear that meta-share.owl is another attempt.
I’ve also found a new ontology based on the META-SHARE model:

Villegas, M., Melero, M., & Bel, N. (2014, May). Metadata as Linked Open Data: mapping disparate XML metadata registries into one RDF/OWL registry. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14) (pp. 393-400).

In my opinion, the authors should reorganize the citations to these ontologies, clarifying their chronological order and, if possible, their lineage (i.e. is one explicitly based on another?). E.g. the official meta-share ontology seems to be somehow based on the one being developed by the LD4LT group. I suggest that the authors check whether this is true.

Going on, I will use the term META-SHARE ontology to refer to the official version used by the authors.

* On p. 8, it seems that the authors have deleted “domain” in front of “categoryLabel”. Looking at the META-SHARE ontology, categoryLabel "Introduces a human readable name (label) by which a classification category (e.g. text type, text genre, domain, etc.) is known". So, I think it is better to use the term "domain" if this is what they intended to capture. Furthermore, the words categoryLabel and domain are sometimes used inconsistently in the rest of the paper.

* On p. 10, the authors claim to construct language identifiers appending an ISO code to “http://www.lexvo.org/page/iso639-3/”. This mechanism is flawed since the identifier must be constructed by appending the code to “http://lexvo.org/id/iso639-3/”: note “id” in place of “page”. This is the result of applying the pattern “303 See Also”: https://www.w3.org/TR/cooluris/#r303uri
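The corrected construction can be sketched as follows. This is only a minimal illustration, not the authors' code: the function name and the input check are assumptions, and the sketch does not verify that Lexvo actually knows the given code.

```python
# Namespace of Lexvo identifiers for ISO 639-3 codes ("id", not "page").
LEXVO_ID_PREFIX = "http://lexvo.org/id/iso639-3/"


def lexvo_language_iri(iso_code: str) -> str:
    """Build the Lexvo identifier for an ISO 639-3 code (hypothetical helper)."""
    code = iso_code.strip().lower()
    # ISO 639-3 codes are three ASCII letters; reject anything else early.
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not an ISO 639-3 code: {iso_code!r}")
    return LEXVO_ID_PREFIX + code


print(lexvo_language_iri("swe"))
```

The "page" URL is what a browser is redirected to when the "id" IRI is dereferenced, so only the latter should appear in the data.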

* When introducing the ISA Programme Person Core Vocabulary on p. 10, the authors did not make clear what they have done. Looking at the RDF files of MELLD, I found that the authors used this core vocabulary incorrectly. Indeed, they use the namespace of the vocabulary (https://www.w3.org/ns/person#) to mint IRIs for new agents (e.g. https://www.w3.org/ns/person#1), which are then described using the Meta-Share ontology. This is incorrect from a linked data perspective, as they are stealing someone else's namespace to mint new IRIs. Above all, the authors disregarded the real purpose of the Core Person ontology, which is to provide a vocabulary to describe persons. As the authors are committed to the use of the Meta-Share ontology, I suggest abandoning Core Person altogether and instead minting IRIs in a new dedicated namespace.

* The authors should consider whether to include an example of the RDF serialization of MELLD in the paper.

* Actually, the worst problems are in the RDF files of MELLD hosted at GitHub (https://github.com/unior-nlp-research-group/melld)

I took a sample resource

a ms:LanguageResource ;
dct:source "AnnoHub" ;
ms:resourceName "Apertium Dictionary English-Spanish" ;
ms:resourceCreator ;
ms:licence "GPL" ;
ms:categoryLabel "linguistics" ;
ms:lcrSubclass "lexicons and dictionaries" ;
ms:lingualityInfo [ ms:lingualityType ] ;
ms:language , ;
ms:annotationScheme "OntoLex-lemon" .

* It reuses IRIs from annohub and mints a new IRI for the creator in the namespace of the Core Person ontology. The first is problematic as it is not possible to set up the IRI so that it resolves to this resource description. The second is, instead, wrong for the reasons I have mentioned previously.
My suggestion is to define a new prefix, say melld, and associate it with a namespace controlled by the authors, so that they could serve MELLD as linked data. The new IRI for the language resource could be linked to the one in annohub through the property prov:wasDerivedFrom defined by PROV-O (https://www.w3.org/TR/prov-o/).
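A minimal sketch of this suggestion, under stated assumptions: the namespace https://example.org/melld/, the path layout and the source IRI below are placeholders invented for illustration (the authors would substitute a namespace they control and the real annohub IRI), and the helper names are hypothetical.

```python
from urllib.parse import quote

# Placeholder namespace; it must be replaced by one the authors control.
MELLD_NS = "https://example.org/melld/"
# prov:wasDerivedFrom, as defined by PROV-O.
PROV_WAS_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"


def mint_iri(kind: str, local_id: str) -> str:
    """Mint an IRI in the dedicated namespace (hypothetical layout: kind/id)."""
    return f"{MELLD_NS}{kind}/{quote(local_id, safe='')}"


def derivation_triple(new_iri: str, source_iri: str) -> str:
    """Return one N-Triples line linking the new IRI to its source."""
    return f"<{new_iri}> <{PROV_WAS_DERIVED_FROM}> <{source_iri}> ."


resource = mint_iri("resource", "vu-wordnet")
agent = mint_iri("agent", "128")
# Placeholder source IRI, standing in for the original annohub one.
print(derivation_triple(resource, "https://example.org/annohub/some-dataset"))
print(agent)
```

With such a scheme, agents no longer occupy the Core Person namespace, and the resource IRIs can be made to resolve to the MELLD descriptions.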

* The property ms:licence does not exist. It should be ms:licenceTerms ("license" is only the label of the property). ms:licenceTerms is an object property, the value of which should be a resource rather than a literal. Moreover, its domain is ms:Distribution. So, you should not apply it to a language resource directly. The range of ms:licenceTerms is ms:LicenceTerms.
In theory, different licenses should be represented as distinct instances of this class. I am unsure whether there is an already compiled list of instances for common licenses, though.
Alternatively, the authors could define new instances.
° Then, they might consider giving a value for ms:licenceTermsShortName (e.g. ms:licenceTermsShortName “CC-BY-1.0”) and ms:licenceTermsURL (e.g. ms:licenceTermsURL “https://creativecommons.org/licenses/by/1.0/”^^xsd:anyURI).
° As an alternative, they could instead set a value for the property datacite:hasIdentifier. Following this example http://www.sparontologies.net/examples#datacite_1, the right approach should be (the authors should verify!!!):

melld:licence_CC_BY_1_0
a ms:LicenceTerms ;
rdfs:label "CC-BY 1.0" ;
datacite:hasIdentifier melld:licence_identifier_CC_BY_1_0 .

melld:licence_identifier_CC_BY_1_0
a ms:LicenceIdentifier ;
literal:hasLiteralValue "CC-BY-1.0" ;
rdfs:label "CC-BY-1.0";
datacite:usesIdentifierScheme ms:SPDX
.

Note that ms:SPDX is a LicenceIdentifierScheme “referring to the codes (identifiers) for licenses used by SPDX (https://spdx.org/licenses/))”. The authors might consider other predefined licenceIdentifierScheme or mint a new one.

* ms:categoryLabel has ms:Classification as domain, so it should not be applied to language resources directly. Instead, you should link the language resource to a classification item via the property ms:domain. The authors should look at ms:history (domain), ms:DDC900 (domain identifier) and ms:DDC_classification (domain classification). They probably want to define new individuals for the domains used in MELLD, the corresponding identifiers in the LOD cloud (which probably are the same) and an individual for the LOD Cloud domain classification.

* ms:lcrSubclass is an object property, and as such its value should be a resource identified by an IRI. I have noticed that META-SHARE already defines individuals (as instances of the class ms:LCRSubclass) to represent ontologies, thesauri, dictionaries, etc. The problem is that most values in the column are underspecified, such as in the case of “lexicons and dictionaries”. In my opinion, there are two approaches to address this:
° mint a new individual representing this underspecified type (maybe not fully correct but quite easy to work with)
° for the language resource, assert that it belongs (via rdf:type) to the anonymous class consisting of a restriction on ms:lcrSubclass such that it has a value in the anonymous class consisting of exactly the two individuals. In Turtle,

melld:some_language_resource

a [ a owl:Restriction ;
owl:onProperty ms:lcrSubclass ;
owl:someValuesFrom [ a owl:Class ;
owl:oneOf (ms:lexicon ms:dictionary)
]
]
.

If you prefer, use an OWL2 qualified number restriction to say that the resource has exactly one among the two provided values.
The latter solution is probably more semantically accurate, but for sure more difficult to work with.

* A greater problem with the property ms:lcrSubclass is that it also has the value “corpora”. Looking for a predefined instance, I’ve noticed the description of the class ms:LCRSubclass: “A classification of lexical/conceptual resources into types (used for descriptive reasons)”. However, “corpora” and “lexical/conceptual resources” are distinct subclasses of ms:DataLanguageResource, which in turn is a subclass of ms:LanguageResource (distinguished from ms:ToolService).
In my opinion, the right approach is not to use the generic type ms:LanguageResource and to prefer ms:Corpus and ms:LexicalConceptualResource. Then, only for the latter, use the property ms:lcrSubclass.

* The property ms:language is a datatype property with range xsd:anyURI, so its value should be given as a typed literal: “http://lexvo.org/id/iso639-3/swe”^^xsd:anyURI.
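For illustration, a typed literal of this kind could be serialized as sketched below. The helper name is hypothetical, and the output uses N-Triples syntax with the datatype IRI written in full rather than through the xsd: prefix.

```python
# Full IRI of the xsd:anyURI datatype.
XSD_ANY_URI = "http://www.w3.org/2001/XMLSchema#anyURI"


def any_uri_literal(value: str) -> str:
    """Serialize a string as an xsd:anyURI typed literal (hypothetical helper)."""
    # Escape backslashes first, then double quotes, as N-Triples requires.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{XSD_ANY_URI}>'


print(any_uri_literal("http://lexvo.org/id/iso639-3/swe"))
```

The point of the sketch is the contrast with an IRI in angle brackets: the value is a quoted string carrying a datatype, not a resource.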

* Moreover, the property ms:language does not have ms:LanguageResource in its domain. It seems that the right mechanism to use it is to state that a language resource has a part with a certain language.

* ms:annotationScheme is a named individual, not a property. I think you should use ms:annotationSchema (ending with an a), whose value is an IRI of type ms:LexicalConceptualResource.

* I have noticed

[ ms:lingualityType ]

You should not use the prefix ms: within angle brackets. It should be

[ ms:lingualityType ms:bilingual ]

* I can't find the property ms:lingualityInfo. It seems to me that ms:lingualityType applies to a media part just like ms:language.

* I’ve found another interesting example

a ms:LanguageResource ;
dct:source "LOD Cloud" ;
ms:resourceName "WordNet 3.0 (VU Amsterdam)" ;
ms:metadataRecordIdentifier "vu-wordnet" ;
ms:licence "CC-BY" ;
ms:sizeInfo [ ms:size 4573749.0 ;
ms:sizeUnit "triples" ] ;
ms:categoryLabel "linguistics" ;
ms:lcrSubclass "lexicons and dictionaries" ;
ms:lingualityInfo [ ms:lingualityType ] ;
ms:language ;
ms:landingPage ;
ms:dataFormat "rdf" ;
ms:ontology "ttl" ;
ms:externalReference ;
ms:downloadLocation ;
ms:accessLocation "unknown" ;
ms:contact .

In addition to the already mentioned issues, here I’ve noticed a few more.

* The property ms:metadataRecordIdentifier should be applied to metadata records, not to language resources. You should use more terms from the META-SHARE ontology to say that this language resource ms:hasMetadata in some record that datacite:hasIdentifier, etc.

* The property ms:sizeInfo does not exist: it should be ms:size. Moreover, this property applies to distributions of language resources. Similarly, ms:accessLocation and ms:downloadLocation also apply to distributions. Both should be given as typed-literals with data type xsd:anyURI. I would avoid the value “unknown” and simply skip that property.
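Skipping placeholder values before serialization could look like the sketch below. It assumes, purely for illustration, that each resource is held as a dictionary of field names to values; the set of placeholder strings and the function name are assumptions as well.

```python
# Strings treated as "no information" (an assumed list, to be adapted).
PLACEHOLDERS = {"", "unknown", "n/a", "none"}


def drop_placeholders(record: dict) -> dict:
    """Return a copy of the record without placeholder string values."""
    return {
        key: value
        for key, value in record.items()
        if not (isinstance(value, str) and value.strip().lower() in PLACEHOLDERS)
    }


record = {
    "downloadLocation": "https://example.org/dump.ttl",  # placeholder URL
    "accessLocation": "unknown",
}
print(drop_placeholders(record))
```

A property that is simply absent is easier for consumers to handle than a literal "unknown" that is typed as a URI.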

* Looking at the definition of a resource used as download location, I found a definition like the following:

a ms:ToolService ;
ms:comment "Working" .

I think that ms:ToolService is not intended to capture this, since it is defined as “a tool/service/any piece of software that performs language processing and/or any Language Technology related operation.”
Moreover, if the download location is modeled as a literal following the specifications of the META-SHARE ontology, it is no longer possible to describe it in such a way (as a literal can’t be the subject of a triple).

* Again for the size, the value of ms:sizeUnit should be an individual standing for “triples”. Take as an example ms:byte.

* The resource ms:ontology is an individual, not a property.
The property ms:dataFormat seems not directly applicable to language resources.
The definition of the property ms:externalReference “Provides reference to another resource to which the lexicalConceptualResource is linked (e.g., link to a wordnet or ontology)” seems incompatible with the use made by the authors to link the void file providing metadata about the resource.

* Looking at the definition of https://www.w3.org/ns/person#128, I’ve found

a ms:actor ;
ms:name "Mark van Assem, Antoine Isaac, Jacco van Ossenbruggen" ;
ms:email "Jacco.van.Ossenbruggen@cwi.nl" .

I’ve verified that this problem originates from LOD Cloud itself. However, I would have expected that such errors were corrected during the manual error fixing step.
The class name is ms:Actor with upper-case initial.
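A first pass at separating such combined name strings could be as simple as the sketch below. It is an assumption-laden illustration: the function name is hypothetical, and a plain comma split would break on names written as "Surname, Given", so the output would still need manual checking.

```python
from typing import List


def split_agent_names(raw: str) -> List[str]:
    """Split a comma-separated string of names into individual names."""
    return [part.strip() for part in raw.split(",") if part.strip()]


names = split_agent_names("Mark van Assem, Antoine Isaac, Jacco van Ossenbruggen")
for name in names:
    print(name)
```

Each resulting name would then get its own agent IRI. The single e-mail address in the record cannot be assigned automatically and would have to be attached by hand to the right person.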

Minor formal issues that need to be fixed
-------------------------------------------------------

* A lot of references in the bibliography are not actually cited in the paper.

* When enumerating several initiatives to support data harmonization on p. 1, the authors may consider complementing the citations of their web pages with citations of the associated research papers. Without pretending that they are the best and most up-to-date references, I provide the following example references. In fact, the paper contains reference [59], which seems relevant but has not been cited in the text.

LRE map:
Calzolari, N., Del Gratta, R., Francopoulo, G., Mariani, J., Rubino, F., Russo, I., & Soria, C. (2012, May). The LRE map. Harmonising community descriptions of resources. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12) (pp. 1084-1089).

ELG:
Rehm, G., Piperidis, S., Bontcheva, K., Hajic, J., Arranz, V., Vasiļjevs, A., ... & Renals, S. (2021, April). European language grid: A joint platform for the european language technology community. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations (pp. 221-230).

CLARIN:
Hinrichs, E., & Krauwer, S. (2014, May). The CLARIN research infrastructure: Resources and tools for ehumanities scholars. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14) (pp. 1525-1531).

Prêt-à-LLOD:
Declerck, T., McCrae, J. P., Hartung, M., Gracia, J., Chiarcos, C., Montiel-Ponsoda, E., ... & Cooney, K. (2020, May). Recent developments for the linguistic linked open data infrastructure. In Proceedings of the 12th Language Resources and Evaluation Conference (pp. 5660-5667).

Elexis:
Krek, S., Kosem, I., McCrae, J. P., Navigli, R., Pedersen, B. S., Tiberius, C., & Wissik, T. (2018, July). European lexicographic infrastructure (elexis). In Proceedings of the XVIII EURALEX International Congress on Lexicography in Global Contexts (pp. 881-892).

* While I argued that “principles” is probably not ideal to use in conjunction with Linked Data, I am unsatisfied by the replacements “fundamental” (used on p. 3, row 33) and “foundations” (used on p. 8, row 15): I strongly suggest that the authors revert to the original word “principles”.

* On p. 2, row 19: “for different use cases such as: they provide […]”. I would replace “such as:” with just “as”, since what comes after the colon is not stated as use cases, but rather as assertions about what metadata can do.

* On p. 3, row 45: “linguistic resources,, in Chiarcos et al. […]”. Double comma ”,,”

* On p. 3, row 1: the citation to LexInfo is numbered with a question mark

* In the footnote on p. 11, there are two typos: 1) an unmatched closed parenthesis after the pypi URL, 2) “the domaincategoryLabel” seems a typo as well

* On p. 17, rows 6-9: while Jena, RDF4J and RDFLib are advertised as frameworks or libraries by their authors, I am unsure whether Virtuoso is a framework. I suggest rephrasing: “Triple stores and RDF processing frameworks, such as Virtuoso, … , usually offer a SPARQL interface”.

* Reference 47 is wrong, as the word “SPARQL” has been omitted from the title

Minor issues that do not necessarily require a change
-----------------------------------------------------------------------

* It seems to me that the authors did not reply to the following concern that I expressed in the previous review:

Concerning metadata alignment/mapping, I am satisfied by the table just added to the manuscript to make the alignment explicit. Still, I have noticed that downloadLocation is not mapped. I am sure that LOD cloud provides the address of data dumps, and probably AnnoHub as well. ORCID is not mapped either, although in the section on metadata enrichment the authors explain how to derive it from author/contact names.

Review #2
By Frank Abromeit submitted on 21/Feb/2022
Suggestion:
Minor Revision
Review Comment:

In their essay the authors provide an analysis of the LOD (LLOD) cloud which gives valuable insights about available linguistic Linked Data resources. In particular, the paper reveals shortcomings, such as underrepresented languages or the problem of the unavailability of LOD resources due to broken links or unavailable SPARQL services. As such, I like the paper very much.

Additionally, the authors conduct a detailed case study for aligning the metadata present in two prominent metadata sources (LOD-cloud and Annohub). They describe the obstacles in doing so, and develop an exemplary solution with the MELLD dataset, which includes a combination of the metadata of the two.

Apart from the presented approach, I am missing a more general view of the alignment problem in the paper, and it should be discussed whether alternative solutions exist. So, maybe instead of fixing the problem of incompatible metadata afterwards, mechanisms for a validation or certification of Linked Data sets could be implemented, to ensure that the metadata in LLOD datasets complies with certain standards.

Issues:

Regarding the usage of an older version of the Annohub dataset for the evaluation:
I can understand that updating the data basis to the most recent version of Annohub would require too much effort. Nevertheless, the information about the included metadata in Annohub should be correct. By the way, the missing information listed below is already present in the version of Annohub that was used for this study!

See p. 11, Table 1: the Annohub column marks as missing some attributes that actually do exist in the Annohub dataset:
ontology: rdfs:isDefinedBy
size: dct:bytesSize
email: vcard:hasEmail

p.4, right col. 20: repetition
Despite the attempts, none of the approaches was able to correct and complete knowledge at the
same time. In particular, what is highlighted is the absence of approaches that are able to find and correct
errors at the same time.

p.8 left col.49
Annohub uses dc:subject for that purpose

p.10 right col. 33
http://lexvo.org/id/iso639-3/ must be used as a prefix NOT http://lexvo.org/page/iso639-3/ which is the web-URL.
Also, beware that lexvo is not complete. Not all ISO-639-3 codes have a corresponding lexvo web-URL.

p.17 right col. 14
As a workaround, the Linghub SPARQL service utilizes a SPARQL version with a reduced instruction set.
https://linghub.org/sparql/
https://github.com/jmccrae/yuzu/blob/master/YuzuQL.md