Paving the Way for Enriched Metadata of Linguistic Linked Data

Tracking #: 2697-3911

Maria Pia di Buono
Hugo Gonçalo Oliveira
Verginica Barbu Mititelu
Blerina Spahiu
Gennaro Nolano

Responsible editor: 
Guest Editors Advancements in Linguistics Linked Data 2021

Submission type: 
Full Paper

The need for reusable, interoperable, and interlinked linguistic resources in Natural Language Processing downstream tasks has been proved by the increasing efforts to develop standards and metadata suitable to represent several layers of information. Nevertheless, despite those efforts, the achievement of full compatibility for metadata in linguistic resource production is still far from being reached. On the other hand, access to resources observing these standards is hindered either by (i) lack of or incomplete information, (ii) inconsistent ways of coding their metadata, and (iii) lack of maintenance. In this paper, we offer a quantitative and qualitative analysis of descriptive metadata and resources availability of two main metadata repositories: LOD Cloud and Annohub. Furthermore, we introduce a metadata enrichment, which aims at improving resource information, and a metadata alignment to a descriptive schema, suitable for easing the accessibility and interoperability of such resources.

Minor Revision

Solicited Reviews:
Review #1
By Frank Abromeit submitted on 05/Apr/2021
Minor Revision
Review Comment:

This paper describes a quantitative and qualitative evaluation of descriptive metadata for linguistic Linked Data resources. Aside from that, the text provides a very good overview on Linked Data in general, especially its use in the linguistic domain.
For their evaluation the authors choose two repositories that host metadata of Linked Data resources: the 'Linked Open Data Cloud' website and the Annohub RDF dataset. After giving an overview of the purpose of these metadata providers, the authors evaluate the given metadata on different levels, e.g. for wrong data, incomplete data, unavailability of the data, etc. It is reported that metadata quality overall could be improved by correcting and completing missing values, e.g. for language and licensing info. Finally, in order to publish the enriched metadata, a new Linked Data set is proposed.

The text is well written and contains an excellent list of references of research done in this field. The metadata analysis provides valuable insights about available linguistic Linked Data resources. In particular, it reveals shortcomings, such as underrepresented languages or the unavailability of resources due to broken links or unavailable services. Also, an analysis of the distribution of languages covered in existing linguistic Linked Data resources by type (e.g. corpus, dictionary, etc.) should prove very useful for creating new data sets in the future. The authors deserve thanks for the manual work they have invested, for example in checking SPARQL services and resource availability.
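For illustration, the kind of availability check mentioned here could be automated roughly as follows (a sketch only; the URLs a caller would pass in are placeholders, and production monitors such as SPARQLES are far more thorough):

```python
import urllib.request
import urllib.parse

def endpoint_alive(endpoint_url, timeout=10):
    """Probe a SPARQL endpoint with a trivial ASK query over HTTP GET."""
    query = urllib.parse.urlencode({"query": "ASK { ?s ?p ?o }"})
    req = urllib.request.Request(
        f"{endpoint_url}?{query}",
        headers={"Accept": "application/sparql-results+json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # DNS failure, timeout, HTTP error, etc. all count as "not alive"
        return False

def dump_available(dump_url, timeout=10):
    """Check a data-dump URL with a cheap HTTP HEAD request."""
    req = urllib.request.Request(dump_url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False
```

Running such probes periodically, rather than once by hand, would also address the staleness concern raised later about the Spring 2019 check.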


1) p.4, right column, row 30

With respect to the Linked Data resources in the Annohub repository, some clarification is needed, since metadata included in Annohub not only refers to Linked Data (data that is modeled using RDF technology) but also to metadata for language resources that are distributed in non-Linked Data formats like CoNLL (TSV) or XML. In particular, the most recent Annohub release contains 297 RDF, 61 CoNLL and 249 XML data sets. However, non-RDF data can be converted to an RDF representation by means of already available converter tools.
In this regard, the statement at the beginning of section 3: "As previously stated, our survey, conducted within the framework of Nexus Linguarum CA 18209, is based on the information about resources in linked data formats available from two sources: the LOD Cloud and Annohub." is only partially true for Annohub, but can be maintained by clarifying the background.
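To make the conversion point concrete, a toy mapping from tabular CoNLL rows to N-Triples might look as follows (the namespace URIs are invented for illustration; actual converters such as CoNLL-RDF handle sentence boundaries, comments and much richer annotation layers):

```python
def conll_to_ntriples(conll: str, base="http://example.org/token/"):
    """Toy conversion of 3-column CoNLL rows (ID, FORM, UPOS)
    into N-Triples lines. Namespaces are hypothetical."""
    triples = []
    for line in conll.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and CoNLL comments
        cols = line.split("\t")
        if len(cols) < 3:
            continue
        tok_id, form, upos = cols[:3]
        subj = f"<{base}{tok_id}>"
        triples.append(f'{subj} <http://example.org/vocab#form> "{form}" .')
        triples.append(f'{subj} <http://example.org/vocab#upos> "{upos}" .')
    return "\n".join(triples)
```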

2) p.4, footnote 18

Looking up the definitions for resourceType, mediaType and lingualityType mentioned in section 2, p.4, with the URL provided in footnote 18 was not possible. Could a working reference be used instead?

3) p. 5, right column, row 17

The statement "At the moment, the Annohub repository contains 604 resources with associated metadata, automatically enriched with information about the language and the annotation model used" might be misleading in that the Annohub metadata is solely the product of an automated process. In fact, all automatically extracted language and annotation scheme information have been been thoroughly reviewed by expert linguists. Wrong or incomplete results from the NLP-detection have been corrected or completed accordingly.

4) p.6 right column, row 9, and also the example on p.8 for Universal Dependencies

It is mentioned that Annohub aggregates resource type info, etc. in the description field. This is not true. The description information presented here is taken directly from the original metadata providers (Linghub, Clarin).
Also, Annohub does not harvest any metadata information from resource descriptions. Instead, the type of a resource, the languages used and the tagset information are automatically determined from the resource data itself. We regard this as more reliable, since descriptive metadata is often incomplete or ambiguous and is notoriously hard to parse. In addition, the Annohub metadata contains more than just plain-text info for languages and annotation schemes used. All metadata extracted from the resource data is encoded with dedicated RDF properties (see the Annohub metadata model [8]), among them a resource's type (corpus/lexicon/ontology), the included languages as ISO 639-3 codes, and annotation tag information as OLiA references.

5) p.7, left column, row 10

Using the META-SHARE size property for indicating the quality of a resource is not recommended.

6) p.8, left column, row 37

The sentence "On the other hand, Annohub covers 530 resources, all with an assigned language and all from the domain of linguistics Annohub resource metadata is generated or manually assigned." may be misleading because language resources included in Annohub contain at least one language. On the other hand, some resources, e.g. DBpedia, may contain thousands of languages.

7) p.12, right column, row 35

The current Annohub RDF dataset contains 607 resources, of which 559 are verified to be online. Of the 48 missing datasets, 36 are RDF versions of Universal Dependencies treebanks.

8) As the text is in manuscript state, it contains typos and requires slight revision.

Review #2
By Manuel Fiorelli submitted on 21/Apr/2021
Minor Revision
Review Comment:

The paper contributes an empirical study of descriptive metadata of LOD datasets, as found in the LOD Cloud and Annohub repositories, focusing on linguistics datasets. In fact, the study is largely based on an enriched repository, which combines and enriches the information found in either metadata repository.

The empirical study is certainly inspired by similar efforts concerned with the whole LOD cloud, and for this reason it can be considered incremental research; nonetheless, the focus on the Linguistic Linked Open Data sub-cloud improves the originality of the research and makes the findings significant to the community around the LLOD cloud.

The paper cites numerous papers relevant to the proposed empirical study; however, the authors should put some effort into better explaining how the paper is positioned with respect to related work, e.g. whether it confirms the findings of related work, falsifies them, or adds new (perhaps complementary) results.

From a methodological viewpoint, the paper should clarify right from the beginning of Section 4 the role of the enriched metadata. Indeed, in the subsequent sections it appears clear that the authors will compute some statistics on the original metadata and on the enriched one. Another related concern is whether some inconsistencies (e.g. different names for the same language, linguistic vs linguistics) were fixed in the “original” metadata or just in the “enriched metadata”. In the end, I would like a better explanation of the methodology, possibly accompanied by a figure depicting it.

The paper should include concrete examples of metadata for all repositories, in order to make the discourse more grounded. Additionally, the alignment of the two metadata repositories should be given in a more explicit manner (e.g. a table).

It is unclear to me how languages have been represented, e.g. by name, by language code in some standard, or by language URI (as in some datasets, such as Lexvo). The authors should clarify (they could look at this message on the OntoLex mailing list for interesting insights).

Furthermore, it seems that accessibility has been reported as a comment, which may not be considered a self-describing, semantically clear approach.

I am unsure whether the enriched metadata repository is valuable as a standalone catalog or is just an artifact instrumental to the quantitative study. Furthermore, the paper does not provide a link to access, download or otherwise evaluate this repository. Moreover, I do not understand whether the authors have a plan to maintain/update the enriched metadata as the source repositories are updated.

The paper does not discuss the automatic enrichment process in sufficient detail, as explained later in the review.

Other major concerns per section which should be addressed by the authors.

* Introduction *

I am concerned about the absence of any citation to the FAIR principles:

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., ... & Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific data, 3(1), 1-9.

* Background *

As the authors are addressing metadata about language resources published as Linked Data, I think that it would be appropriate to include approaches for the description of linked datasets, such as VoID, VOAF, DCAT, DataID, HCLS, LIME. The latter is a module of OntoLex-Lemon, a release-candidate version of which has been described in a research paper:

Fiorelli M., Stellato A., McCrae J.P., Cimiano P., Pazienza M.T. (2015) LIME: The Metadata Module for OntoLex. In: Gandon F., Sabou M., Sack H., d’Amato C., Cudré-Mauroux P., Zimmermann A. (eds) The Semantic Web. Latest Advances and New Domains. ESWC 2015. Lecture Notes in Computer Science, vol 9088. Springer, Cham.

The authors also forgot to mention the metadata vocabularies adopted in the repositories they are going to use. In fact, these repositories use metadata profiles combining several metadata vocabularies. For example, Annohub reuses Dublin Core and DCAT, among others.

The authors SHALL look at and cite the following work, which describes DCAT, DCTERMS and META-SHARE OWL, and explicitly take into consideration its findings:

Cimiano P., Chiarcos C., McCrae J.P., Gracia J. (2020) Modelling Metadata of Language Resources. In: Linguistic Linked Data. Springer, Cham.

The authors should differentiate between the original XML-based schema and the new META-SHARE ontology. The following work describes the effort for developing the META-SHARE ontology:

McCrae J.P., Labropoulou P., Gracia J., Villegas M., Rodríguez-Doncel V., Cimiano P. (2015) One Ontology to Bind Them All: The META-SHARE OWL Ontology for the Interoperability of Linguistic Datasets on the Web. In: Gandon F., Guéret C., Villata S., Breslin J., Faron-Zucker C., Zimmermann A. (eds) The Semantic Web: ESWC 2015 Satellite Events. ESWC 2015. Lecture Notes in Computer Science, vol 9341. Springer, Cham.

The authors should also consider whether it is appropriate to consider and cite the following references:

Cimiano P., Chiarcos C., McCrae J.P., Gracia J. (2020) Discovery of Language Resources. In: Linguistic Linked Data. Springer, Cham.

Chapman, A., Simperl, E., Koesten, L. et al. Dataset search: a survey. The VLDB Journal 29, 251–272 (2020).

Ben Ellefi, M., Bellahsene, Z., Breslin, J. G., Demidova, E., Dietze, S., Szymański, J., & Todorov, K. (2018). RDF dataset profiling–a survey of features, methods, vocabularies and applications. Semantic Web, 9(5), 677-705.

Jonquet, C., Toulet, A., Dutta, B. et al. Harnessing the Power of Unified Metadata in an Ontology Repository: The Case of AgroPortal. J Data Semant 7, 191–221 (2018).

Vandenbussche, P. Y., Umbrich, J., Matteis, L., Hogan, A., & Buil-Aranda, C. (2017). SPARQLES: Monitoring public SPARQL endpoints. Semantic web, 8(6), 1049-1065.

Ermilov I., Lehmann J., Martin M., Auer S. (2016) LODStats: The Data Web Census Dataset. In: Groth P. et al. (eds) The Semantic Web – ISWC 2016. ISWC 2016. Lecture Notes in Computer Science, vol 9982. Springer, Cham.

* Repositories *

From a reproducibility point of view, I would like to know when the repositories have been downloaded and (if available from the repositories or republished by the authors) a persistent link to the dumped data.

I would like to see example metadata about datasets in each repository.

* Methodology *

As anticipated, I would like to see a diagram/picture summarizing the methodology.

* Metadata Alignment/Mapping *

As anticipated, I miss a table or other structured means to indicate the mapping of the original metadata repositories to this mediated schema. That would also help understand the coverage of the original metadata repositories.

* Metadata Enrichment *

As anticipated, I am concerned about the automatic extraction process, which is described in a very implicit manner. Regarding the software implementation, I would like to know the following.
- license (is it open source?)
- availability (is it downloadable, reusable, free-of-charge?)
- architectural/implementation details
- more details on the supported extraction techniques

I do not like the extensive use of the modal "could", since it is unclear whether the authors have really implemented the software or are describing extraction strategies that they could implement (in the future).

Another big concern is about the enriched metadata resource, which is – at the time of review – not publicly available, and consequently it has been impossible to carefully review it.

* Metadata Overview *

It is unclear to me how the number of distinct datasets can be 1908 if the LOD Cloud and Annohub contain 1440 and 530 datasets, respectively, and 69 datasets are in common.
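Spelled out, the arithmetic being applied here is simple inclusion–exclusion over the two catalogues, using the counts quoted in the paper:

```python
# Counts quoted in the paper under review
lod_cloud = 1440   # datasets listed in the LOD Cloud
annohub = 530      # datasets listed in Annohub
shared = 69        # datasets appearing in both

# |A ∪ B| = |A| + |B| - |A ∩ B|
distinct = lod_cloud + annohub - shared
print(distinct)  # 1901, not the reported 1908
```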

* LCRSubclass *

The authors seem not to use the members of LCRSubclass defined by META-SHARE; instead, they use new categories borrowed from the LLOD cloud diagram.
Why does Table 5 only report datasets in the enriched repository? Perhaps because these categories are not present in the original repositories.
The enriched model seems to lose the information on the annotation model found in Annohub.

* Resource Accessibility *

The authors claim that Annohub datasets are accessible since their accessibility was checked in Spring 2019. In my opinion, that was too long ago. Moreover, they did not say whether this check was done by them (for this paper). Concerning the LOD Cloud, the authors did not say when they checked the availability of the listed datasets.
The authors say that “There exist two ways to consume LLOD data”: in fact, there is another one, that is to say the HTTP resolution of the resource URI (remember that the second Linked Data rule requires the use of HTTP(S) identifiers for resources).

Table 7 does not list Annohub. Indeed, Annohub at least provides links to the downloadable dump. Additionally, the authors should say whether HTTP resolution works.

I am concerned about how the numbers at the end of the section (e.g. “Only 21% of linguistics…”) have been derived from the available tables. Please clarify.

Minor concerns:

P1, 26: “On the other hand, …” The use of this conjunctive adverb seems incorrect to me, since this sentence somehow confirms what has been said in the previous one.

P1, 30: “… alignment to a descriptive schema …” The authors could say explicitly that they used the META-SHARE ontology

P1, 41-42: “to develop standards and metadata suitable to…” the authors should phrase this better, as it is unclear whether they are talking about standards for representation and for metadata or standards (for something) and actual metadata about concrete resources. In the rest of the paper, the term “metadata” is often used to mean “metadata vocabularies/properties”, while many would have interpreted it as “description of an actual resource”.

P2, 1: “Semantic Web [3]” The authors should expand the citation: i) add the canonical citation about the Semantic Web [Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American, 284(5), 34-43.], and ii) cite OntoLex-Lemon using the already cited paper and the URI of its specs. Maybe spend a few words saying that it is the result of a community group that sought consensus and agreement on a shared model.

P2, 3: A different, canonical citation should be used for Linked Data here. Additionally, the “principles” of LD are usually termed “rules”.

P2, 9: I think that it could be appropriate to also cite: Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.

P2, 26-27: “provide an analysis of the current status of linguistics resources” the authors should perhaps proofread their work and see whether linguistic or linguistics should be used (consistently) in this and other contexts.

P3, 21-22: “…with the aim to produce a web application to make such data querable..” I do not think that the goal of Chiarcos et al is to produce a web application.

P3, 27-29: “(i.e., the variations of language encoding standards and the lack of common metadata schemas for LD),”, Looking at the cited paper, I think that more than the lack of common metadata schemas, Annohub attempts to precisely describe the language and the language annotations (e.g. tagsets used in a resource, etc.). The authors should revise.

P3, 37-38: “An attempt to model linguistic LD datasets is the one by Bosque et al.”, I think that more than "an attempt to model linguistic LD datasets", the cited paper is a survey of models.

P4, 45: “The LOD Cloud is a diagram that offers…” I am unsure whether it is appropriate to call the LOD Cloud a diagram, since the authors have referred to it as a metadata repository.

P5, 31-33: “(e.g., thesauri from tourism or life sciences, such as EARTh – the Environmental Applications Reference Thesaurus [21]);” speaking of thesauri, I suggest the authors look at “The AGROVOC Linked Dataset”: Caracciolo, C., Stellato, A., Morshed, A., Johannsen, G., Rajbhandari, S., Jaques, Y., & Keizer, J. (2013). The AGROVOC linked dataset. Semantic Web, 4(3), 341-348.
Indeed, AGROVOC is part of the LLOD cloud and provides machine-readable metadata in the form of a VoID description, also containing LIME metadata allowing for a richer linguistic description. Following the recipes contained in the VoID specification, metadata can be found easily, as each resource is linked (through the property void:inDataset) to the dataset description. Tools may benefit from this mechanism to automatically find a dataset description once any of its resources has been reached. One such tool is VocBench 3:
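The void:inDataset backlink recipe described here can be sketched as follows (a deliberately naive N-Triples scan; the URIs in the example are hypothetical, and a real implementation would use an RDF library such as rdflib):

```python
VOID_IN_DATASET = "http://rdfs.org/ns/void#inDataset"

def find_dataset_description(ntriples: str, resource_uri: str):
    """Scan N-Triples for the void:inDataset link of a resource,
    i.e. the VoID 'backlink' discovery recipe."""
    for line in ntriples.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # crude N-Triples split: <s> <p> <o> .
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        s, p, o = parts[0], parts[1], parts[2].rstrip(" .")
        if s == f"<{resource_uri}>" and p == f"<{VOID_IN_DATASET}>":
            return o.strip("<>")
    return None
```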

Stellato, A., Fiorelli, M., Turbati, A., Lorenzetti, T., van Gemert, W., Dechandon, D., ... & Keizer, J. (2020). VocBench 3: A collaborative Semantic Web editor for ontologies, thesauri and lexicons. Semantic Web, 11(5), 855-881.


P5, 26-31: “The main reason for choosing these repositories relies in the type of information they encompass. The LOD Cloud [..] about annotated linguistic resources from reliable sources.” This paragraph at the end of section 3.2 (dedicated to Annohub) should be moved elsewhere, perhaps to the beginning of section 3.

P7, 34-35: “description to report a short free-text account”: missing bullet

P7, 45-46: “AccessLocation A URL to the SPARQL endopoint;” Looking at the documentation, it seems to me that this bullet describes the specific use of this property made by the authors.

P8, 47: “from Annohub (Original)” I think that the attribute “original” is not necessary here
P8, Table 1: the caption only cites LOD metadata (again, “original” may be unnecessary) and forgets Annohub. Furthermore, the second column should be named LOD Cloud rather than LOD.

P9, 1 “BioPortal”: add a citation to a scientific paper. Try to find a reference citation on the project web site.

P9, 5-6 “federal or local governments”, I am fairly sure that this expression has been borrowed from some publication about data initiatives in the USA (which is a federal state).

P9, 11 “Prominent datasets…” add links and references to the mentioned datasets

P10, Table 2, The second column should be named “LOD Cloud”

P10, 43-45, “Apart from Swedish …. more than 100 linguistics datasets” The authors may consider to add a table or a chart representing the statistics/aggregations presented in this paragraph

P13, 34 “Jena”: Jena is in fact a framework, providing implementations of triple stores such as TDB and Fuseki. I would add a reference to the other important RDF framework for Java, Eclipse RDF4J, formerly OpenRDF Sesame.

P14, 21, “In the metadata of the original LOD”, This should be phrased to state more clearly that the authors are talking about the metadata provided by the LOD cloud (if I am right)

Review #3
By Sebastian Hellmann submitted on 09/Jul/2021
Minor Revision
Review Comment:

The paper is about the relevant problem of metadata for (Linguistic) Linked Data datasets to boost discovery, reuse, integration and other purposes. The paper specifically investigated Annohub and the LLOD cloud, but also includes the whole LOD Cloud metadata in its observation. The authors followed a manual annotation/curation approach to improve metadata and used the META-SHARE schema as the target for integrating the metadata.

Regarding the paper I have these major comments:

# Linked Data
There is a discrepancy between the introduction and the rest of the paper. The intro talks a lot about Linked Data (minor comment on Footnote 7: I would prefer a different reference).
Section 6.4 "There exist two ways to consume LLOD data; (i)
download their data dump, and (ii) query them by their SPARQL endpoint. The SPARQL language is the standard query language proposed by the W3C^45 to query a collection of RDF triples [33]". I really think this should be written differently: 1. the main or intended way to consume LLOD is via Linked Data, i.e. HTTP requests to a URI that 303-redirects to a URL that delivers the data. Recently that access mechanism has been complemented by embedded in HTML. This mechanism is the most important one, because that forms the largest knowledge graph on earth, i.e. LOD. Here you can easily jump from one dataset to the next and access data for the entities under the URI. This access mechanism can be cached well and it is easy to make it scale similar to Web pages. Then additionally there are dumps and SPARQL, whereas a) dumps offer bulk download, so if you need all the data they are prefered, but of course they get stale fast unlike Linked Data access. Also dumps are difficult to deploy. Providing dumps is actually the cheapest option in terms of stability as you only need a file server, storage and traffic and maybe do backups, but hardly any other maintenance. SPARQL on the other hand can be up-to-date as well, but needs more resources depending on the amount and complexity of queries answered in a certain time period.
So as a guideline:
1. Dumps for slow-changing data, very cheap and therefore sustainable, i.e. a good option for post-project provision of results via simple file-hosting.
2. Linked Data: access via resolvable URIs. Good for long-term projects that provide a website as well as data. The dataset by Gerard de Melo is a good example here. The URIs can be reused by other projects, if stable, and they can fetch fresh data now and then. This is the most scalable paradigm.
3. SPARQL: The idea that we have data available via SPARQL and can do federated querying is quite nice, of course. However, this seems to be too much of a burden for individual researchers. In my opinion, just providing dumps or Linked Data should be sufficient. John McCrae follows this approach with iLOD [1], i.e. whoever needs the data can load the dumps more easily and have SPARQL.

[1] same special issue
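A minimal sketch of the Linked Data access mechanism from point 2, assuming a server that implements content negotiation and the 303 redirect (which urllib follows transparently; the URI below is a placeholder):

```python
import urllib.request

# Ask for RDF serializations, preferring Turtle
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.8, */*;q=0.1"

def build_request(resource_uri):
    """Request asking for RDF; a Linked Data server answers with a
    303 See Other pointing at the data document."""
    return urllib.request.Request(resource_uri, headers={"Accept": RDF_ACCEPT})

def dereference(resource_uri, timeout=10):
    """Fetch the RDF description of a resource URI via HTTP resolution."""
    with urllib.request.urlopen(build_request(resource_uri), timeout=timeout) as resp:
        # resp.geturl() is the post-redirect data URL (after the 303 hop)
        return resp.geturl(), resp.headers.get("Content-Type"), resp.read()
```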

The main point I have trouble with is that the authors completely ignored option 2 in their assessment. The paper has enough other merits, so re-doing the Linked Data evaluation is not necessary. However, I think that this part should be clarified and it should be mentioned that it was out of scope.

# META-SHARE Schema & Use Cases
I was unable to find the exact reference to the META-SHARE schema used. [18] and a web page were referenced, but neither contains any schematic information beyond a pdf and a picture. At the moment, I am unable to judge this material. In principle, it looks like the META-SHARE schema's sole purpose is to enable facet-based browsing.
Actually, I was wondering why the META-SHARE Ontology was not used instead: it looks like a good foundation to build upon and to create extensions in a sustainable manner.

I am also puzzled by footnote 28 "The metadata resource will be made available after the review process ends." Normally, it is the other way round: Authors definitely have to disclose to the reviewers, but then might not make things public after acceptance. Could I get access?

# Accessibility property
In the Accessibility section on page 7, Linked Data is missing; admittedly, no standard for declaring it is accepted at the moment, with VoID being the most popular.

# META SHARE Accessibility
Personally, I consider availability/accessibility an important factor for data. Most metadata items become quite pointless if the data behind them is not available; e.g. license or language do not matter if you have no chance of downloading the data anyhow. I had a quick look at META-SHARE and it seems that the accessibility situation is tricky there as well. Only ~160 of 2888 (5.5%) language resources have more than 1 download, and the total download count is 27,630 for 2880 datasets. These 27,630 downloads mostly come from the top 20 resources, it seems. I worked with the LREC MAP a while back and, if I remember correctly, it had a lot of issues with availability of resources as well.
I would be quite curious what the big picture is here. Some datasets in the LOD cloud are unavailable (dump, Linked Data, SPARQL), of course, but it is unclear whether, overall, the (L)LOD cloud is far better or worse compared to other approaches. It could actually be a very good improvement over the state of the art. Also, the iLOD approach by McCrae and Nasir [1] seems to improve this further. Maybe the authors could comment on this and, depending on their answer, rewrite certain parts like: "Even though LOD Cloud is considered a gold mine, its value is threatened by the unavailability of resources over time." Are there other approaches at all in the linguistic area (e.g. LREC, CLARIN, META-SHARE) that have yielded that many available datasets (available = actually findable & downloadable)? Such a comparison would also clarify whether it is a gold mine or a problem. At least it should be mentioned in the Future Work section that such a comparison has not yet been done.

# Overall
Above I mentioned many general issues that I see in the research area of the paper, i.e. high-quality metadata of data. Overall, I see a lot of merit in the current paper, as the authors were able to produce a good status report and analysis of the current situation. My expectation as a reviewer is not for them to fix all of the above, but to clarify what has been done and what is still unclear or to do at the moment. I would consider this a good fit for this kind of paper, i.e. an analysis of the status based on these manual improvements, and then also clear points to be addressed in future work, either by the authors or other researchers.

# Minor
* SPARQL endopoint -> SPARQL endpoint
* SPARQL enpoint -> SPARQL endpoint (other than that the whole paper was very well written and without spelling issues, good job ☝)

* SentiWS was hosted in SPARQL as part of a hackathon. The download link of the original data has changed a bit. It is true that I googled it, which can probably be considered a metadata search.