MCKG: A Community-Driven Knowledge Graph on Medieval Charters

Tracking #: 3881-5095

Authors: 
Jorge Álvarez-Fidalgo
Enrique Rodriguez-Martin
Jose Emilio Labra-Gayo

Responsible editor: 
Guest Editors 2025 OD+CH

Submission type: 
Dataset Description
Abstract: 
Medieval charters serve as essential primary sources for understanding medieval societies, yet their analysis remains a labor-intensive process reliant on domain experts, leaving large digitized collections largely unexplored. To address this challenge, we present a Wikibase-based Medieval Charters Community-Driven Knowledge Graph (MCKG) that combines expert annotations with community contributions through a provenance-aware framework, ensuring data quality while enabling scalable data integration. Our solution features a hybrid data model that combines elements from CIDOC-CRM and the Wikidata data model to capture the complex legal, social, and biographical relationships in medieval charters. A standardized pipeline enables efficient corpus integration into the MCKG. We demonstrate this approach's effectiveness by populating the MCKG with a corpus of Spanish medieval charters using our integration pipeline, and resolving interdisciplinary, SPARQL-based competency questions, showcasing its research value for historical studies.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 16/Jun/2025
Suggestion:
Major Revision
Review Comment:

Overall, the quality of the article is reasonable. It is well-structured and referenced. In particular, the documentation of the data modelling and mapping is well written and detailed. However, there are more issues regarding the data publication side, which can be improved.

The figure captions would be better center-aligned.

In Figure 1, CRM CLASS, CRM+WD CLASS, and WD CLASS use similar colors for text and graphics. For readability, it would be better to use more distinct colors.

While there is no doubt that the effort of the article contributes to the KG community and medieval historians alike, it is not very clear if the article is about MCKG as a dataset or CKG as an approach. More explanation will help readers understand it.

It would be valuable to publish and disseminate the data in different formats and channels. I suggest providing a clear explanation about the GitHub repository and Wiki, as well as reciprocal links between them. In addition, the Turtle files should have more visibility, so that the users can use them more easily.

There are SPARQL query examples. It would be good to put them on a more visible page of the Wiki, because they are valuable for data analysis.

It seems that there are no online resources available for the charters themselves (e.g., an online charter image viewer and/or metadata including the index number of the charter in an archive). Is this why there are almost no URLs in the Turtle files? (This is especially a pity, because the "provenance tracking" is excellent for distinguishing community and expert information in the data.) If so, it is hard for end-users to know what the KG actually represents and whether it is trustworthy. Please describe where users can find more information about the charters, in the paper/Wiki/GitHub repo/Turtle files.

One of the unique points of this article is the community-driven approach. However, there is limited information about both the experts and the community: who are they? How many are there? Exactly what do/did they do for the creation, maintenance, and preservation of MCKG? It is crucial to describe this in more detail to make this article a valuable contribution to the journal. In addition, there are some tools that facilitate the community approach. They should be described in more detail, because data often comes with the tools used to generate and use it.

I suggest adding more basic information (and/or metadata) about the data on the main page of Wiki and GitHub repository (not just in the article):

- What kind of medieval charters are they (about where? when? who? etc.)? (i.e., the AMSPO medieval charter corpus)
- Who created and is responsible for MCKG?
- How to contact them?
- For whom is MCKG intended?
- Size of data, quality of data, version, licenses, terms of use, date of creation, last update, etc.
- Explain what CIDOC-CRM and Wikidata (and possibly Knowledge Graphs) are, and provide links

The Future Work section provides a brief outlook, but the entire article describes mostly positive aspects of MCKG. Are there no limitations? It would be nice to describe the challenges and limitations encountered in this research (e.g., how successful the data modelling and NER/NEL were, in quantitative terms). I would also suggest considering/discussing OpenRefine (and its Reconciliation API) for NEL.

Maybe trivial, but isn't the name "Medieval Charters Community-Driven Knowledge Graph (MCKG)" too generic, and perhaps misleading for end-users, given the actual scope/coverage of the KG case study?

Review #2
Anonymous submitted on 24/Sep/2025
Suggestion:
Minor Revision
Review Comment:

Summary
The paper presents the Medieval Charters Knowledge Graph (MCKG), an RDF-based dataset designed to support scholarly exploration of medieval charters. The infrastructure is hosted on Wikibase Cloud, and the modeling approach uses a hybrid of CIDOC-CRM and Wikidata-style properties. Validation is performed using Shape Expressions (ShEx) and sheXer, with shapes stored as EntitySchemas in the instance.

The authors are transparent about simulating community contributions, and demonstrate that the platform is extensible and viable for cultural heritage applications. While at an early stage, the dataset aligns well with Semantic Web principles and provides significant scholarly value.

Overall the paper presents a well-executed dataset with strong potential impact. It is technically sound, clearly written, and thoughtfully designed for future extensibility and community involvement. While a few minor revisions are necessary to improve metadata completeness and usability, the core contribution is solid and aligns well with the goals of the Semantic Web Journal and this special issue.

Evaluation Against SWJ Dataset Criteria
Dataset metadata (name, URL, version, license): URL provided; license and version not specified
Topic coverage, domain, and source description: Highly relevant and clearly defined.
Use of standard vocabularies (RDF, OWL, SKOS, FOAF, CIDOC-CRM): Strong vocabulary reuse (CIDOC-CRM, FOAF, Wikidata)
Language expressivity & modelling patterns: Clearly explained hybrid model (CIDOC + Wikidata)
Validation schema and completeness: ShEx shapes provided and validated via sheXer
Internal linking (within dataset): Well-modeled through events and roles
External linking (to other KGs): Not implemented yet; acknowledged as future work
Documentation clarity and completeness: Excellent; transparent and well-written
Usefulness and third-party usage: No external adoption yet; simulated contributions
Long-term hosting and accessibility: Stable on Wikibase Cloud
README / data completeness: Generally complete.
Known limitations: Clearly stated
5-star vocabulary reuse assessment: Not provided; recommended by SWJ guidelines

Section-by-Section Comments

Introduction
Provides a clear and concise motivation for the work
Well framed within the context of digital humanities and Semantic Web practices

Related Work
Covers relevant datasets and projects (e.g., FactGrid, WarSampo)
Could benefit from a more explicit comparison to datasets with active external linking

Methodology and Modelling
Hybrid model (CIDOC + Wikidata properties) is pragmatic and clearly described
Event-based modelling aligns with best practices in cultural heritage KGs
The authors acknowledge modelling trade-offs (e.g., direct document links for incidental mentions)
Optional: Consider exporting the schema as OWL/RDF (e.g., in Turtle) for improved interoperability

Validation and Shape Expressions
ShEx shapes are auto-generated using sheXer and published as EntitySchemas
Example SPARQL queries function as intended
Recommendation: Provide direct link to the EntitySchema list for easier access

Infrastructure and Dataset Access
Dataset is hosted on Wikibase Cloud with stable URIs and GitHub repo
However, no explicit license or version number is visible; this should be addressed

Evaluation
Summary statistics provided (2,211 entities, 12,429 statements)
Competency questions are practical and meaningful
Contributions are currently simulated—this is clearly stated
Corpus coverage metrics (e.g., percentage of AMSPO corpus processed) would improve transparency

Applications and Utility
Use cases (e.g., querying persons by office, exploring geographic coverage) show real potential
No third-party adoption yet; understandable at this stage, but this should be noted explicitly.

Additional Insights from Graph Exploration

UI Navigation Asymmetry
The graph currently supports one-way navigation: entity pages (e.g., persons, places) list the documents in which the entity appears, but document pages do not display the entities they mention. This creates a usability barrier for non-SPARQL users.

External Linking to Other Knowledge Graphs
Currently, the dataset does not include any data-level links to external datasets such as Wikidata, GeoNames, DBpedia, or VIAF. This prevents it from reaching the 5th star in the 5-Star Linked Open Data model. The authors acknowledge this and list it as future work.

Required and Recommended Revisions

Required
Add License and Version Information
Clearly specify an open license (e.g., CC0, CC-BY) and include versioning metadata for the dataset. This is critical for proper reuse, citation, and compliance with Linked Open Data standards.

Improve Navigability from Document Pages
Document pages currently do not display the entities they reference. To improve usability, especially for non-technical users, add inverse properties or helper statements (e.g., mentionsEntity) so entities documented in a charter are directly visible on that charter's page.
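Such a helper statement could look like the following Turtle sketch. This is only an illustration of the suggestion: the namespace, the item identifiers, the labels, and the mentionsEntity property are hypothetical placeholders, not actual MCKG terms.

```turtle
@prefix mckg: <https://example.org/mckg/> .   # hypothetical namespace
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Hypothetical charter item stating, on the document side,
# which entities it mentions, so they appear on the charter's page.
mckg:Charter_Q101 mckg:mentionsEntity mckg:Person_Q42 ,
                                      mckg:Place_Q77 .

mckg:Person_Q42 rdfs:label "Example person"@en .
mckg:Place_Q77  rdfs:label "Example place"@en .
```

In Wikibase, such statements could be added with an ordinary property and kept in sync with the entity-side statements by a bot or during the integration pipeline.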

Strongly Recommended

Add External Entity Linking (for 5-Star LOD Compliance)
The dataset currently lacks data-level links to external knowledge graphs such as Wikidata, VIAF, GeoNames, and DBpedia. Begin linking key entities (especially places, persons, and offices) to existing URIs to enable broader interoperability and LOD Cloud eligibility. Tools like LIMES, SILK, or OpenRefine reconciliation can assist with this.
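Such data-level links could take the form below. The local item and namespace are hypothetical placeholders; the external URIs follow the real Wikidata and VIAF URI patterns, but the specific identifiers shown are illustrative only.

```turtle
@prefix mckg: <https://example.org/mckg/> .          # hypothetical namespace
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# Hypothetical place item linked to external KGs.
# skos:exactMatch is often preferred over owl:sameAs in
# cultural-heritage data, where identity is not always strict.
mckg:Place_Q77 skos:exactMatch <http://www.wikidata.org/entity/Q14317> ;  # illustrative Wikidata ID
               skos:exactMatch <http://viaf.org/viaf/123456789> .         # illustrative VIAF ID
```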

Include a 5-Star Vocabulary Reuse Self-Assessment
The Semantic Web Journal encourages authors to assess their dataset using the 5-Star Vocabulary Use guidelines. Including this will help demonstrate the dataset's alignment with Linked Data best practices.

Clarify Future Community Contribution Plan
While the authors currently simulate contributions, a roadmap for real user engagement would improve confidence in long-term sustainability. Suggestions include contributor guidelines, editorial workflows, validation mechanisms, or outreach to domain experts and institutions.

Provide Corpus Coverage Metrics
To help reviewers and future users understand dataset completeness, include quantitative coverage information. For example, state how many charters from the AMSPO corpus have been processed, or what percentage of referenced entities (people, places) have been modeled.

Add a Direct Link to EntitySchemas
The paper references Shape Expressions (ShEx) and their use in validating entities via Wikibase EntitySchemas, but no direct link is given. Include a link to the EntitySchema list.

Review #3
Anonymous submitted on 13/Feb/2026
Suggestion:
Minor Revision
Review Comment:

This paper presents the Medieval Charters Community-Driven Knowledge Graph (MCKG), a Wikibase-based framework that integrates expert annotations with community contributions through a provenance-aware pipeline. The Knowledge Graph (KG) utilizes a hybrid data model based on CIDOC CRM and Wikidata properties, incorporating knowledge from domain experts alongside automated inferences and NLP tasks.
While the paper is categorized as a “data description” article, it focuses more extensively on the data model than on the dataset itself. The choice and existence of the data seem secondary to the pipeline and modeling framework. In my view, this is the paper’s primary weakness. Conversely, the proposed data model is flexible and well-reasoned, and the pipeline offers a useful methodology that could be successfully adapted to other case studies.
The journal’s requirements for data description papers are only partially met. Specifically:
- Accessibility: The authors provide a link to the Wikibase instance, but the link to the full dataset is missing from the manuscript (I was only able to locate it via the review platform).
- Metadata: The GitHub repository, while well-structured, lacks essential information regarding the dataset’s versioning, date, and licensing.
- Other Criteria: Other relevant aspects—such as topic coverage, data sources, purpose, maintenance methods, and modeling patterns—are described briefly but adequately.
Overall, the paper is engaging and easy to follow. However, the writing style requires refinement:
- Structure: The paper outline (at the transition between pages 1 and 2) fails to mention Section 3. Additionally, Section 2 feels loosely connected to the overall scope of the paper.
- Redundancy: There are several "truisms" and repetitions (e.g., p. 1: “inference mechanisms which could enable to infer”; abstract: “leaving large digitized collections largely unexplored”).
- Clarity and Grammar: Some phrasing is overly vague (e.g., the description of historians' tasks on p. 1). There are also minor grammatical errors (e.g., p. 1: “the vast majority remain” should be “remains”; p. 6: “various research domain” should be “domains”) and inconsistent use of capitalization and bold text. Increased use of punctuation would also improve readability.
More specific comments:
- P. 4: The role of the "native label" and the property crm:P70i should be better defined.
- P. 6: The phrasing regarding "NER-detected entities as native Spanish speakers" is unclear and requires clarification.
- Section 5: The expression “the resolution of use cases” should be rephrased for clarity.
- P. 7: The authors claim that Wikibase features ensure "truthfulness." Since domain expertise does not inherently guarantee absolute "truth," I suggest replacing “truthfulness” with “reliability” to better reflect the nature of historical data.