Review Comment:
With the revision of the paper the authors present the results of the metadata alignment process for the language resource metadata from The Linked Open Data Cloud (https://lod-cloud.net) and Annohub (https://annohub.linguistik.de/) as a CSV file available at
https://github.com/unior-nlp-research-group/melld.
I evaluated the data in the CSV file against the latest Annohub RDF data dump (see above).
First of all, I have to report that the authors evidently built their analysis on an outdated version of Annohub.
(The latest Annohub data was released 10/2020 and can be obtained at https://annohub.linguistik.de/)
MELLD.csv evaluation
====================
The MELLD.csv file includes 666 records, of which 461 are taken from Annohub and 205 from the
https://www.lod-cloud.net website. My evaluation shows that the Annohub metadata in the MELLD CSV file is clearly a subset of the latest Annohub data, since 74 datasets are completely missing.
Otherwise, the language metadata is nearly identical; only about 0.06 % of it differs.
Summary
=======
Annohub records in the latest RDF dump : 535
Annohub records in MELLD.csv : 461
1) Identical languages for 451 records
2) Difference in languages only for 10 records
3) Out of the 10 datasets of Annohub in MELLD that have a different language assignment :
- 8 times a single language was missing in Annohub
- 1 time a single language was missing in MELLD
- 1 time two languages were missing in MELLD
4) The total number of languages in Annohub records that appear in MELLD is 13080. Therefore, the rate of different language metadata is only about 0.06 percent.
5) Four Annohub records in MELLD appear under a different name in the latest Annohub release.
6) For all 461 Annohub datasets in MELLD an ORCID identifier could be assigned.
More detailed results can be found in appendices A and B.
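The per-record comparison behind this summary can be illustrated with a minimal Python sketch. It is not the script used for the evaluation: the function name, the dictionary layout and the toy records ("Corpus A" etc.) are assumptions made for illustration only. The output keys follow the labels used in Appendix A.

```python
# Hypothetical sketch: compare the language sets of records present in both sources.
# The dictionaries stand in for data parsed from the Annohub RDF dump and MELLD.csv.

def compare_languages(annohub, melld):
    """Return the differences for all titles that occur in both sources."""
    differences = []
    for title in sorted(set(annohub) & set(melld)):
        # Strip whitespace so that " Czech" and "Czech" count as the same name.
        in_annohub = {name.strip() for name in annohub[title]}
        in_melld = {name.strip() for name in melld[title]}
        if in_annohub != in_melld:
            differences.append({
                "title": title,
                "notInAnnohub": sorted(in_melld - in_annohub),
                "notMelld": sorted(in_annohub - in_melld),
            })
    return differences


# Invented toy data, not taken from either catalog.
annohub_records = {
    "Corpus A": ["Swedish"],
    "Corpus B": ["English", "German"],
    "Corpus C": ["Basque"],
}
melld_records = {
    "Corpus A": ["Swedish", " Czech"],
    "Corpus B": ["English"],
    "Corpus C": ["Basque"],
}

differences = compare_languages(annohub_records, melld_records)
for difference in differences:
    print(difference)
```

Records that exist in only one of the two sources are ignored here; they would have to be reported separately, as done for the 74 missing datasets.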
Similarly, I compared the data from https://www.lod-cloud.net with the data in MELLD. I discovered 6 metadata
types in MELLD that are not present in the original lod-cloud data.
1) Language (e.g. Basque), (is displayed in the html at https://lod-cloud.net,
but not included in the lod-cloud JSON export)
2) ORCID id
3) LCRsubclass (e.g. lexicons and dictionaries)
4) META-SHARE property : (float value, e.g. 813.0)
5) distributionLocation/comment (available yes/no)
6) accessibleThroughQuery (available yes/no)
On the other hand, some metadata from lod-cloud was pruned, for example keyword information, but also other information such as citation details, which I could not find in the CSV.
In general, the data in the CSV file is sparse, since only 43% (8000/18648) of all possible attributes
are filled with values.
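A fill rate of this kind can be computed as the number of non-empty cells divided by the number of records times the number of columns (18648 equals 666 records times 28 columns). The following Python sketch shows one way to do this; the function name and the three-column sample are assumptions for illustration, not the actual MELLD.csv layout.

```python
import csv
import io


def fill_rate(csv_text):
    """Return (filled_cells, total_cells) for CSV text with a header row."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, records = rows[0], rows[1:]
    total = len(header) * len(records)
    # A cell counts as filled if it contains anything besides whitespace.
    filled = sum(
        1
        for record in records
        for cell in record[:len(header)]
        if cell.strip()
    )
    return filled, total


# Invented sample with two records and three columns.
sample = "title,language,ORCID\nCorpus A,Swedish,\nCorpus B,,\n"
filled, total = fill_rate(sample)
print(f"{filled}/{total} = {100 * filled / total:.1f}%")
```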
In the data model of MELLD I found several issues. Foremost, I regard the absence of references to the original
metadata records of lod-cloud and Annohub as a fatal error. Following Linked Data principles, I would suggest linking
entries in the MELLD dataset to the specific resource entries in the catalogs they refer to, such as Annohub, lod-cloud,
Linghub etc., because these provide much other useful metadata. Also, the language information is not
represented as a URL or ISO code, but simply as plain text. Finally, a complex datatype is used for modelling the
size/amount of a resource. The respective column contains triple counts, but I suspect that it
is meant to hold file sizes as well (as the name suggests).
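To make the point about language identifiers concrete, the sketch below maps plain-text language names to ISO 639-3 codes and to Lexvo URIs. This is only one possible choice of identifier scheme. The lookup table is a hand-written assumption covering four languages mentioned in this review; a real conversion would need a complete ISO 639-3 table.

```python
# Hypothetical lookup table; only four languages are covered for illustration.
ISO_639_3 = {
    "basque": "eus",
    "swedish": "swe",
    "czech": "ces",
    "lower sorbian": "dsb",
}

# Lexvo is assumed here as the URI scheme for ISO 639-3 codes.
LEXVO_PREFIX = "http://lexvo.org/id/iso639-3/"


def language_uri(name):
    """Return a language URI for a plain-text name, or None if it is unknown."""
    code = ISO_639_3.get(name.strip().lower())
    return LEXVO_PREFIX + code if code else None


for name in ["Basque", " Lower Sorbian", "Unknown language"]:
    print(repr(name), "->", language_uri(name))
```

Names that cannot be resolved return None, so they can be collected and checked manually instead of being silently dropped.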
Conclusion
==========
Despite the additions, such as ORCID ids and checks of SPARQL endpoints and dataset availability,
most of the metadata in MELLD is simply a copy of already existing metadata from Annohub and lod-cloud.
The benefit of the resulting dataset is therefore questionable. Also, I would not regard the approach described in the paper as best practice. Instead of creating a compilation of metadata, I would rather like to see the existing metadata from https://www.lod-cloud.net converted to RDF. This would allow linking its metadata to Annohub and other LLOD datasets, and also querying it via SPARQL.
Nevertheless, I found the paper to be very informative. The analysis of the LOD (LLOD) cloud provides valuable insights
about available linguistic Linked Data resources. In particular, it reveals shortcomings, such as underrepresented
languages or the problem of the unavailability of resources due to broken links or unavailable services.
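The conversion of lod-cloud metadata to RDF suggested above could start from something like the following Python sketch, which writes N-Triples using only the standard library. The JSON field names (identifier, title, keywords), the base URL for dataset IRIs and the choice of DCAT and Dublin Core properties are assumptions for illustration; they would have to be checked against the actual lod-cloud JSON export.

```python
import json
from urllib.parse import quote

DCT = "http://purl.org/dc/terms/"
DCAT = "http://www.w3.org/ns/dcat#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(value):
    """Serialise a string as an N-Triples literal with basic escaping."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def record_to_ntriples(record, base="https://lod-cloud.net/dataset/"):
    """Convert one metadata record (assumed field names) to N-Triples lines."""
    # Percent-encode the identifier so that the subject is a valid IRI.
    subject = f"<{base}{quote(record['identifier'], safe='')}>"
    lines = [
        f"{subject} <{RDF_TYPE}> <{DCAT}Dataset> .",
        f"{subject} <{DCT}title> {literal(record['title'])} .",
    ]
    for keyword in record.get("keywords", []):
        lines.append(f"{subject} <{DCAT}keyword> {literal(keyword)} .")
    return lines


# Invented record; not taken from the lod-cloud export.
sample_json = (
    '{"identifier": "example-dataset", "title": "Example dataset", '
    '"keywords": ["lexicon", "linguistics"]}'
)
triples = record_to_ntriples(json.loads(sample_json))
print("\n".join(triples))
```

Once the metadata is available as triples, it can be loaded into any triple store and linked to the corresponding Annohub resources by their URLs.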
Formal issues :
===============
p3, footnote 24, link is not available
https://ckan.org/datahub/
p5, right column 24
"The main reason for choosing these repositories" (over which other repositories?)
A short overview of other available language resource providers might be useful,
for example http://www.meta-share.org/.
p4, right column 26, check spelling
However the attempts, none of the approaches was able to correct and ...
p6, footnote 39, check spelling
There were only 133 resources ...
p6, right column 7, check spelling
Annohub also comes with tools for type of resources, language and annotation model detection from the resource content
and even encodes metadata in linked data format ->
Annohub also comes with tools for resource type, language and annotation model detection and represents all generated
metadata as RDF.
p7, right, column 37
Information on annotation models, languages and resource types is encoded in dedicated RDF properties
in the Annohub metadata. In rare cases, some information gain can be achieved by harvesting the description info,
for example if an appropriate OLiA annotation model is not available for a certain tagset,
e.g. for "Interset interlingua for morphosyntactic tagsets", as described in the example.
p9, left, column 27, check spelling
p10, left, column 28, check spelling
In addition to this, in some cases, when available, the resources content has also been considered ...
p11, table 1, The following metadata is included in Annohub, but marked as not available in the table:
1. metadataRecordIdentifier : obviously each RDF resource record is identified by its unique URL.
Although this is not an explicit property, it should be mentioned in the table.
2. ontology : for each Annohub resource the URL of a used annotation scheme is encoded with a dedicated
RDF property. Its value is the URL of the used ontology for annotations, e.g.
http://svn.code.sf.net/p/olia/code/trunk/owl/stable/suc-link.rdf.
3. size/amount : dcat:byteSize
4. contactEmail : vcard:hasEmail
5. downloadLocation : dcat:accessURL
6. accessibleThroughQuery : Note that, so far, none of the Annohub resources has a designated
SPARQL endpoint.
Appendix A
Records in Annohub and MELLD.csv with different language assignment :
1 title :ASPAC – Swedish-Lower Sorbian (2017-10-16); ASPAC – svenska-lågsorbiska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Lower Sorbian] error
notMelld : []
2 title :ASPAC – Swedish-Czech (2017-10-16); ASPAC – svenska-tjeckiska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Czech] error
notMelld : []
3 title :ASPAC – Swedish-Macedonian (2017-10-16); ASPAC – svenska-makedonska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Macedonian] error
notMelld : []
4 title :EMEA
#languages in Melld 1
#languages in Annohub 2
notInAnnohub : []
notMelld : [German] error
5 title :ASPAC – Swedish-Greek (2017-10-16); ASPAC – svenska-grekiska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Modern Greek (1453-)] error
notMelld : []
6 title :ASPAC – Swedish-Molise Slavic (2017-10-16); ASPAC – svenska-moliseslaviska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Molise Slavic] error
notMelld : []
7 title :ASPAC – Swedish-English (2017-10-16); ASPAC – svenska-engelska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ English] error
notMelld : []
8 title :ASPAC – Swedish-Bulgarian (2017-10-16); ASPAC – svenska-bulgariska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Swedish] error
notMelld : []
9 title :ASPAC – Swedish-Croatian (2017-10-16); ASPAC – svenska-kroatiska (2017-10-16)
#languages in Melld 2
#languages in Annohub 1
notInAnnohub : [ Croatian] error
notMelld : []
10 title :Freedict RDF dictionary Afrikaans-English
#languages in Melld 1
#languages in Annohub 2
notInAnnohub : [Modern Greek (1453-)] ok
notMelld : [English, Afrikaans] error
Appendix B
Records in MELLD that appear under a different name in the latest Annohub release:
1) DBnary - Wiktionary as Linguistic Linked Open Data (English Morphology)
-> DBnary - Wiktionary as Linguistic Linked Open Data (English Edition w. Morphology)
2) DBnary - Wiktionary as Linguistic Linked Open Data (Serbo-Croatian Morphology)
-> DBnary - Wiktionary as Linguistic Linked Open Data (Serbo-Croatian Edition w. Morphology)
3) DBnary - Wiktionary as Linguistic Linked Open Data (German Morphology)
-> DBnary - Wiktionary as Linguistic Linked Open Data (German Edition w. Morphology)
4) DBnary - Wiktionary as Linguistic Linked Open Data (French Morphology)
-> DBnary - Wiktionary as Linguistic Linked Open Data (French Edition w. Morphology)