A smart data case study using Wikidata to expand access and discovery in the Schoenberg Database of Manuscripts

Tracking #: 3377-4591

L.P. Coladangelo
Lynn Ransom

Responsible editor: 
Guest Editors Wikidata 2022

Submission type: 
Full Paper

Abstract:
This case study explored the results and lessons learned from the initial contribution of over 9,600 name identifiers to Wikidata and considered the use of Wikidata for the enhancement of data related to premodern manuscripts. Wikidata, as a Linked Open Data (LOD) repository and hub, was used in the semantic enrichment of a particular dataset from the Schoenberg Database of Manuscripts (SDBM) Name Authority, yielding unique insights only possible from linking data from Wikidata and the SDBM. Mapping named entity metadata related to premodern manuscripts from one context to another was also explored, with a particular emphasis on determining property alignments between the linked data models of the SDBM and Wikidata. This resulted in a workflow model for LOD management and enhancement of name authority data in library, archive, and museum (LAM) contexts to encourage the manuscript studies community to contribute further data to Wikidata. This research demonstrates how the application of smart data principles to an existing dataset can address knowledge gaps related to people traditionally underrepresented in the digital record and opens new possibilities for access and discovery.
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 18/Mar/2023
Review Comment:

The paper presents a case study in which links between 9,600 Schoenberg Database of Manuscripts (SDBM) entries and Wikidata entries were created in a semi-automatic way. The linking is based on the VIAF IDs recorded in both knowledge bases (KBs). A dedicated property (P9756) was added to the Wikidata model to record the links.
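For reference, the core of such identifier-based linking amounts to a join on shared VIAF IDs. A minimal sketch follows; the records and field names are illustrative, not the authors' actual SDBM or Wikidata data:

```python
# Sketch of identifier-based entity linking on shared VIAF IDs.
# All records below are illustrative examples, not real SDBM/Wikidata data.

sdbm_names = [
    {"sdbm_id": "1234", "name": "Example Collector", "viaf": "100000001"},
    {"sdbm_id": "5678", "name": "Example Scribe", "viaf": "100000002"},
    {"sdbm_id": "9012", "name": "Unmatched Name", "viaf": None},
]

wikidata_items = [
    {"qid": "Q111", "viaf": "100000001"},
    {"qid": "Q222", "viaf": "100000002"},
]

# Index Wikidata items by VIAF ID, then join SDBM records against the index.
by_viaf = {item["viaf"]: item["qid"] for item in wikidata_items}
links = [
    (rec["sdbm_id"], by_viaf[rec["viaf"]])
    for rec in sdbm_names
    if rec["viaf"] in by_viaf
]
# Each resulting pair would become a P9756 statement on the Wikidata item;
# entries without a VIAF ID (like "9012" above) remain unmatched.
```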
Overall, the paper is well written. However, I suggest expanding the LAM acronym (in the abstract and Section 1) and rewriting the final paragraph of Section 2 in a more readable way.
Linking information spaces is an important research field for enriching semantic datasets. In this context, Wikidata is a valuable and extensive knowledge base that can be used to represent and add knowledge to existing KBs. Despite the interesting research topic, the manuscript presents several limitations.
The first limitation of the paper is the literature review: the authors report general descriptions of the concepts of Big Data, Smart Data, semantic enrichment, and Wikidata, but they do not provide any information about the context of their research. No similar project or approach is reported and described. Nevertheless, browsing Wikidata, it appears that several libraries and archives have already performed this kind of linking.
In Section 4, the authors state that the linking was established using the VIAF IDs associated with the SDBM and Wikidata KBs. They write that an SDBM RDF dataset exists, but no detail is given about the ontological model on which it is based. Is this model based on standard ontologies? How are named entities represented in the SDBM model? Which queries did the authors perform to retrieve the information reported in Section 4.1?
Furthermore, the semi-automatic workflow proposed in Section 4.1 is not supplied with supplementary material (including queries and software) that allows researchers to replicate the proposed process. Why did the authors not use an open science-oriented methodology?
In Section 5, the authors report some example queries that enhance searches over the SDBM dataset by exploiting Wikidata properties. My question is: are these queries implemented as pre-defined queries integrated into the SDBM website, so that SDBM users (e.g. scholars, students, and general users) can benefit from this KB enrichment? Indeed, an assessment of the enrichment of the SDBM KB is entirely missing from the manuscript.
In conclusion, the proposed workflow implements a basic approach, and technically speaking, it does not seem very generalizable, except for datasets that already have a direct link with VIAF. Furthermore, no software is provided to replicate the experiment or use the workflow with different input data.

Review #2
Anonymous submitted on 21/Mar/2023
Major Revision
Review Comment:

This work presents a domain-specific workflow of semantic enrichment using Wikidata. The enrichment is performed on the SDBM, an open-access RDF dataset of premodern manuscripts. Relying on VIAF identifiers and OpenRefine, the authors locate the SDBM entities in Wikidata and enrich the corresponding Wikidata data by appending the SDBM IDs to the aligned entities through a dedicated new property.

Strengths: (3) quality of writing

The paper is very well written, considering the following details:
1. The background knowledge is well covered, making it easy for a reader with a different background to get on board.
2. It provides an extensive overview of related work.
3. The approach and the contributions are well organized and precisely explained.

Weaknesses: (1) originality, (2) significance of the results

(1) The contribution of the paper involves the application of an existing tool, i.e., OpenRefine, and relies on the availability of an existing ID, i.e., VIAF, in order to align the entities between the SDBM and Wikidata. Considering that one of the main obstacles in entity alignment is the lack of an expected/unified identifier in one of the compared data/knowledge bases, I do not see the contribution as generic enough to fall into the category of "research papers".

(2) The authors demonstrate the significance of the achieved result, i.e., the added links, through a set of SPARQL queries, which I personally appreciate very much. The queries are also clearly provided and accessible. However, considering that the paper is a research paper and not an "application" or "tool" report, a stronger experimental study section addressing a set of concrete research questions is expected. For instance, experiments investigating: a) how do the added links improve the connectivity of the nodes (entities) in the graph in terms of network analysis measures? b) what is the accuracy/precision of the aligned entities, and how is it assessed?
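As a concrete illustration of the evaluation asked for in (b), precision could be estimated over a hand-verified random sample of the produced alignments. The sketch below uses entirely hypothetical alignment pairs and manual judgments:

```python
import random

# Hypothetical alignment results produced by a linking workflow:
# (sdbm_id, qid) pairs. These are placeholders, not real links.
alignments = [("s%d" % i, "Q%d" % i) for i in range(1000)]

# Draw a reproducible random sample for manual verification.
random.seed(42)
sample = random.sample(alignments, 50)

# Hypothetical manual judgments: True = correct match, False = incorrect.
judgments = {pair: True for pair in sample}
judgments[sample[0]] = False  # suppose one sampled link was judged wrong

# Precision over the verified sample: correct matches / sample size.
precision = sum(judgments[pair] for pair in sample) / len(sample)
```

With one error in a sample of 50, the estimated precision would be 0.98; in practice one would also report a confidence interval for the estimate.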

Furthermore, there are statements regarding the selection of the strategy for the entity alignment process in Section 4.3 that deserve more explanation. For instance, in "the most successful strategy to secure automatic matches was to use the SDBM name, the corresponding VIAF ID recorded in the SDBM, and the Wikidata item type. Any additional information did not improve matching and ranking", what is the definition of "successful" here? Is it an exact match? Was any statistical evaluation applied? It would also be helpful to add some examples of the additional information that failed to improve matching.

Overall, I see the contribution of this paper as very valuable and significant for the community, specifically for domain knowledge engineers, and would consider the paper a good candidate for the "Application reports" or "Reports on tools and systems" tracks. As a candidate for "Full papers", I recommend the authors enhance their empirical studies.

Review #3
Anonymous submitted on 05/Apr/2023
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

The paper presents a case study of identifier-based linking between a very domain-specific database (SDBM) and a very general one (Wikidata). The process of data preparation, linking, and contribution is described and discussed. Finally, some examples of using the newly added data in queries are given and discussed, along with some findings on how to promote LOD practices within the community.

1) originality

As stated/cited in the paper, Wikidata has become a major hub for identifiers from a variety of sources. The process, tools (OpenRefine), and additional data sources (VIAF) presented in this paper do not show considerable originality. Linking only persons with VIAF identifiers is the "low-hanging fruit" of person linking, and I would have welcomed more discussion of the difficulties and failed experiments in the harder linking cases. The originality of the paper lies in the mentioned recommendations and data model alignments related to LOD practices for the premodern manuscript community.

2) significance of the results

The authors position their contribution under the topic of "smart data". According to Zeng (2017), also cited in the paper, smart data relates especially to making sense of big datasets for quality assurance. In that theme, I would have liked to see more examples involving both the SDBM and Wikidata datasets, i.e., queries that would not have been possible without the linking of the datasets. On the other hand, one could argue that since the described linking process is based on shared VIAF identifiers, similar results would have been achievable without the data contribution, using federated SPARQL queries. The authors emphasize that the SDBM data are well structured and verified, but the examples demonstrate the added value in only one direction. The text mentions that linked places carry identifiers from TGN and GeoNames, but it does not provide statistics on the percentage of matched entities comparable to those given for persons.
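To make the federated-query point concrete: joining the two datasets on shared VIAF IDs via SPARQL's SERVICE clause would look roughly as sketched below. The SDBM endpoint URL and the sdbm: predicate names are assumptions for illustration (it is not established that such an endpoint exists); wdt:P214 is Wikidata's VIAF ID property:

```python
# Sketch of a federated SPARQL query joining Wikidata and a hypothetical
# SDBM endpoint on shared VIAF IDs, requiring no contributed P9756 links.
SDBM_ENDPOINT = "https://sdbm.example.org/sparql"  # hypothetical URL

query = f"""
SELECT ?person ?sdbmName WHERE {{
  ?person wdt:P214 ?viaf .                 # VIAF ID on the Wikidata side
  SERVICE <{SDBM_ENDPOINT}> {{
    ?sdbmEntry sdbm:viafId ?viaf ;         # assumed SDBM predicates
               sdbm:name ?sdbmName .
  }}
}}
LIMIT 10
"""
```

Run against the Wikidata Query Service, such a query would join on ?viaf at query time, which is why the reviewer notes that similar results might be obtained without materializing the links.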

I would like to see more, and more concrete, results from the LOD-practices side of the work, especially as the use of community-curated databases such as Wikidata and the use of LOD for this use case are argued convincingly in the literature section.

3) quality of the writing

The quality of the writing is good and without any major (or minor) mistakes that I could identify.

Small issues:

- The footnote link (2) on page 2 does not work and returns "The page you were looking for doesn't exist." error.

Zeng, M. L. (2017). Smart data for digital humanities. Journal of Data and Information Science, 2(1), 1-12. https://doi.org/10.1515/jdis-2017-0001