Semantic Enrichment of Hadith Corpus - Knowledge Graph Generation from Islamic Text

Tracking #: 3791-5005

Authors: 
Amna Binte Kamran
Nigar Azhar Butt
Amna Basharat

Responsible editor: 
Guest Editors KG Gen from Text 2023

Submission type: 
Full Paper
Abstract: 
Knowledge graphs from text have garnered substantial interest across various domains due to their potential to facilitate efficient information retrieval and knowledge exploration. However, knowledge graph generation from textual sources presents unique challenges, particularly in the Islamic domain, where primary sources of knowledge are texts in Arabic, which exhibit complex linguistic and cultural nuances. This paper presents a comprehensive methodology for generating a knowledge graph from the hadith corpus. Hadith, a fundamental resource in the Islamic domain, stands as one of the primary sources of Islamic legislation, encompassing the sayings, actions, and silent approvals of the Prophet Muhammad ﷺ. Leveraging Natural Language Processing techniques, we systematically extract, annotate, and interlink semantic entities and relationships from the hadith corpus, extend the SemanticHadith ontology for entity organisation, and compute textual similarities to establish semantic connections. We generate a comprehensive knowledge graph by applying these methods to six hadith collections, facilitating efficient information retrieval and knowledge exploration in the Islamic domain. This is an essential step towards annotating and linking the hadith corpus to allow semantic search to support scholars or students in creating, evolving, and consulting a digital representation of Islamic knowledge. The SemanticHadith knowledge graph is freely accessible at http://www.semantichadith.com.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
By Md Kamruzzaman Sarker submitted on 11/Jan/2025
Suggestion:
Accept
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

Review #2
Anonymous submitted on 22/Sep/2025
Suggestion:
Minor Revision
Review Comment:

Knowing that this work is an extension of the first study, I began by reviewing the original version to understand its methodology, ontology design, and competency questions.

The competency questions (CQs) presented in the second paper are essentially the same as those in the first paper, with only minor surface-level variations in wording. For example, the first paper includes questions such as “What is the lineage of a particular narrator?” or “Find hadith narrated by Narrator A,” whereas the second paper rephrases these into “Which narrators have not narrated any sacred hadith?” or “Find hadith ‘discussesTopic’ Topic X.” Despite slight differences in terminology (e.g., shifting focus from narrators to topics or events), the underlying structure, intent, and scope of the CQs remain unchanged.

This repetition indicates that the second paper does not demonstrate new competency questions that reflect the distinct methodological contribution it claims (i.e., text-driven knowledge graph generation with NLP and similarity computation). Instead, the CQs are largely recycled from the first work, which suggests that the extension is not fully articulated at the level of use cases and evaluation.

Page 15, line 24: As mentioned in Section 5, the authors state that the scope of the ontology is defined through a set of competency questions (CQs). However, they do not acknowledge that these CQs are essentially the same as those used in their earlier work. This omission is problematic because it suggests a lack of transparency in how the evaluation requirements were established. If the paper is positioned as an extension, the authors should have either (i) clarified that the CQs are intentionally reused and explained why they remain appropriate, or (ii) introduced new or refined CQs aligned with the novel contributions of the work. By failing to do so, the paper leaves uncertainty about the standards used to validate the extended ontology and weakens the claim of substantive advancement over the previous study.

As previously mentioned by a reviewer: “1. Generalization Issues: The section mentions that existing studies in hadith primarily focus on specific domains, like prophetic medicine and the chain of narrators, too broadly and do not acknowledge other studies in the domain.”

Although the authors have added detailed explanations about leveraging NLP techniques, preprocessing steps (pages 2, lines 18–49), and the integration of external knowledge graphs to enhance the SemanticHadith ontology, they do not address the previous concern regarding generalization across the hadith domain. The discussion focuses on technical methodology, linguistic processing, and practical applications, but it does not acknowledge other studies beyond the specific domains of prophetic medicine or chains of narrators, nor does it provide evidence that the framework generalizes to the full breadth of hadith literature. Without such references or broader validation, the previous reviewer comment regarding generalization remains unaddressed, even though there is an enhanced discussion of NLP techniques in sections 2.3 and 2.4.

Finally, the SemanticHadith knowledge graph is claimed to be freely accessible at http://www.semantichadith.com. However, despite numerous attempts, I was unable to access the knowledge graph. Therefore, I am unable to judge its applicability, SPARQL functionality, or other practical capabilities, preventing a proper evaluation of the extended ontology in practice.

To strengthen the paper and address these shortcomings, the authors may consider referencing additional relevant work, including:

=Quran Knowledge Graphs=

Elsayed, E., & Fathy, D. R. (2019). Evaluation of Quran recitation via OWL ontology-based system. Proceedings of the International Conference on Computer and Communication Engineering (ICCCE 2019).

Iqbal, R., Azmi Murad, M. A., & Ashraf, A. (2020). Quantitative assessment of concept maps for conceptualizing domain ontologies: A case of Quran. Proceedings of the International Conference on Computer and Communication Engineering (ICCCE 2020).

Jiang, S., & Mosa, M. A. (2022). Reliable semantic communication system enabled by knowledge graph. Entropy, 24(12), 1704. https://doi.org/10.3390/e24121704

The authors of SemanticHadith 2.0 do not sufficiently discuss prior closely related works, particularly Mosa (2025) and Shafie (2021-2023). Both of these studies already apply hybrid AI and knowledge graph approaches to address critical tasks in Hadith analysis—narrator disambiguation in Mosa and retrieval plus semantic-similarity classification in KASHAF. By failing to acknowledge these contributions, the manuscript overlooks important context for positioning its methodology and novelty. A proper comparative discussion would clarify how SemanticHadith 2.0 extends, complements, or differentiates itself from these approaches, especially in terms of multi-collection coverage, expert-labeled similarity pairs, and explainable reasoning. Without this, readers may misinterpret the claimed contributions as more original or distinct than they are.

Mosa, M. A. (2025). Synergizing structure and semantics: a knowledge graph-transformer framework for narrator disambiguation in hadith networks. Digital Scholarship in the Humanities, fqaf088.

Shafie, O. A. (2021). KASHAF: A Knowledge-Graphs Approach Search-Engine for Hadith Analysis & Flow-Visualization. Master's thesis, Hamad Bin Khalifa University (Qatar).

Shafie, O., Darwish, K., & Jansen, B. J. (2023, July). Robust Hadith IR using Knowledge-Graphs and Semantic-Similarity Classification. In CS & IT Conference Proceedings (Vol. 13, No. 12). CS & IT Conference Proceedings.

===Method and results===

1. Section 6.3 Expert Validation and Insights

The authors state that 100 hadith pairs were “randomly selected from the top similarity bins” for expert validation, but they do not clarify the method used for random selection. Without a specified procedure (simple random sampling, stratified sampling across collections, or another approach), it is unclear how representative these pairs are of the broader corpus. This is particularly important given the hierarchical and uneven distribution of hadith across collections, and the potential for systematic biases in similarity scores. In religious domains, unlike industrial or scientific datasets, interpretations can vary, and without a transparent selection process, it is difficult to assess whether the expert validation accurately reflects the reliability of similarity computations across the full corpus.
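To make the distinction concrete, the following is a minimal sketch of the two sampling procedures named above. It is not the authors' method, which the manuscript does not specify. The pair records, the collection labels, and the function names are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def simple_random_sample(pairs, k, seed=0):
    """Draw k pairs uniformly at random, ignoring which collection they come from."""
    rng = random.Random(seed)
    return rng.sample(pairs, k)


def stratified_sample(pairs, k, key, seed=0):
    """Draw k pairs with proportional allocation per stratum (largest-remainder rounding).

    `key` maps a pair to its stratum, e.g. the hadith collection it belongs to.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for pair in pairs:
        strata[key(pair)].append(pair)

    total = len(pairs)
    # Integer arithmetic: floor of the proportional share, plus its remainder.
    quota = {s: (k * len(items)) // total for s, items in strata.items()}
    remainder = {s: (k * len(items)) % total for s, items in strata.items()}

    # Hand out the seats lost to rounding to the strata with the largest remainders.
    leftover = k - sum(quota.values())
    for s in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        quota[s] += 1

    sample = []
    for s, items in strata.items():
        sample.extend(rng.sample(items, quota[s]))
    return sample


# Hypothetical input: (collection, pair_id) records with an uneven distribution.
pairs = [("collection_a", i) for i in range(200)] + [("collection_b", i) for i in range(100)]
validated = stratified_sample(pairs, 30, key=lambda p: p[0])
```

With simple random sampling, a small collection can be under-represented or missed entirely by chance; proportional stratification guarantees each collection a share of the validated pairs that matches its share of the candidate pool. Reporting which procedure was used, together with the seed, would let readers judge representativeness.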

2. Section 6.4 Integration into knowledge graph

The section appears intended to justify why only expert-validated hadith pairs were included in the knowledge graph, while also highlighting a refinement (the “strongly similar” property) for the top similarity bin. However, it is confusing because it suggests a broader application to other collections (Sahih Muslim, Ibn Maja, Sunan Abi Dawood, and Nisai) without explaining whether this has actually been implemented or is purely prospective. The paragraph blurs the line between what has been done in the current study and what is planned for future work, making it unclear to the reader whether the knowledge graph currently contains only Sahih Bukhari pairs or has been extended to other collections. It is not immediately clear what assumptions or decisions led the authors to limit the graph to expert-labeled pairs, and the mention of future expansion could be misinterpreted as an existing contribution.

Mirarab, A. (2024). Explainable large language model for Islamic and humanities studies. STIM Journal of Islamic Studies and Technology. https://stim.qom.ac.ir/article_3085.html

3. Section 6.5 Challenges and Insights

The authors discuss the potential use of LLMs in the future work section as a way to improve similarity computations and capture semantic and contextual nuances. However, the manuscript does not emphasize in the literature review that such approaches already exist and have been applied in related domains, such as Quranic verse similarity, religious question answering, and semantic knowledge graphs. By not situating LLM-based methods within the existing body of work, the paper misses an opportunity to acknowledge prior approaches, contrast them with their current methodology, and justify the novelty or limitations of their proposed framework. This omission weakens the contextualization of the proposed future directions.

================================================================================================

1. Originality: The work is original in the context of Islamic hadith collections, but not with respect to general knowledge graph development. The extension largely recycles competency questions from the previous work.

2. Significance of the results: Without access to the knowledge graph and ontology, the practical impact and applicability of the results remain uncertain.

3. Quality of writing: The manuscript is well-written. However, the organization of tables and figures could be improved for clarity. Figures should appear close to the discussion about them; currently, some figures (e.g., Figure 4) are referenced on one page but appear much later (page 18), disrupting the flow for the reader.

4. Data availability

a. Data organization and README: The repository on GitHub is reasonably organized and contains a README file, which provides basic guidance for understanding the provided files. This makes it easier to navigate the resources.

b. Completeness for replication: While the repository contains ontology and knowledge graph files, it lacks a functional SPARQL endpoint or fully accessible knowledge graph. As a result, replication of experiments or direct verification of similarity computation and interlinking results is not possible.

c. Repository suitability: The repository is hosted on GitHub, which is a recognized platform for long-term accessibility and basic discoverability. However, the main resources (knowledge graph and SPARQL endpoint) are not operational, limiting the practical utility of the repository.

d. Completeness of data artifacts: The provided data artifacts are incomplete for full replication or evaluation. Without a working knowledge graph or query endpoint, critical aspects of the study, such as interlinking and similarity computations, cannot be independently verified.

Review #3
Anonymous submitted on 28/Sep/2025
Suggestion:
Accept
Review Comment:

Review Report

The revised version of the manuscript demonstrates significant improvements and shows that the authors have taken great care in addressing the concerns raised in the earlier submission. Overall, the manuscript is well-structured, coherent, and clearly contributes to the field.

Summary

The manuscript presents an updated methodology for constructing a KG from the hadith corpus, building upon the SemanticHadith ontological data model (a domain-centric ontology). The authors have refined their data processing, NLP-based entity extraction, and semantic modelling approaches to improve interoperability and accessibility of Islamic knowledge resources. The revised version has comprehensively addressed the shortcomings in the first version, strengthened the manuscript considerably, and now offers methodological rigour and practical relevance.

Strengths of the Revised Version

• Responsiveness to Feedback: The authors have diligently addressed the comments from the previous review round. They have improved the manuscript by resolving the issues raised by reviewers concerning generalization, detail on NLP techniques, clarity, methodology transparency, assumptions and limitations, comparative analysis, and reference consistency.
• Clarity and Structure: The manuscript now flows more smoothly, with more precise articulation of the problem statement, background, methodology, and outcomes.
• Methodological Detail: The revisions add greater depth to the NLP techniques, data curation processes, and expert validation steps. This makes the work more transparent and reproducible.
• Ontology Design and Results: The extended SemanticHadith ontology is now more comprehensively described, with improved explanations of modelling decisions and interoperability considerations.
• Overall Contribution: The work demonstrates both scholarly and practical significance. It meaningfully contributes to the growing research on semantic technologies in underrepresented domains such as Islamic studies.

Issues Previously Raised – Now Addressed

• Lack of detail on NLP methodologies: the authors addressed this with more precise explanations and examples (see Section 2.4, Section 4.1.1, 4.1.2 with concrete examples).
• Terminology inconsistencies and formatting: the authors corrected these throughout the manuscript.
• Limited discussion of assumptions and limitations: the authors now offer a more balanced perspective, with concrete examples throughout the manuscript.
• Missing clarity on expert validation processes: now elaborated with examples (see Section 3.6).
• Comparative analysis: The authors describe the training setup for a customized NER model using the spaCy NLP library (see Section 4.2.1) and justify the performance metrics, including precision, recall and F1-score (see Section 4.2.2). They also improved the observations and results section with a micro-average strategy (see Table 2), illustrated the entity extraction process with a concrete example, and addressed entity variations.
• Integration challenges: the authors discuss the challenges of reconciling structural and semantic differences between SemanticHadith and external ontologies.
• Future directions: the authors also discuss future approaches that isolate the matan by preprocessing the text to exclude the sanad before embedding, and that use Euclidean or Manhattan distance for comparison. They also mention using LLMs fine-tuned on Islamic texts to better capture semantic and contextual nuances.
• Reference inconsistencies: updated and properly cited.
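
For readers unfamiliar with the micro-average strategy mentioned in the comparative-analysis point above, the following is a minimal sketch of how micro-averaged precision, recall and F1-score are computed. The entity labels and counts are invented for illustration and are not taken from Table 2 of the manuscript; the function name is hypothetical.

```python
def micro_prf(counts):
    """Micro-averaged precision, recall and F1 from per-label tp/fp/fn counts.

    Micro-averaging pools the counts over all labels before computing the
    metrics, so frequent entity types weigh more than rare ones.
    """
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical per-label counts for two entity types (not the paper's results).
counts = {
    "PERSON": {"tp": 80, "fp": 20, "fn": 10},
    "PLACE": {"tp": 10, "fp": 0, "fn": 10},
}
precision, recall, f1 = micro_prf(counts)
```

In this made-up example the pooled counts are 90 true positives, 20 false positives and 20 false negatives, so all three micro-averaged scores equal 90/110. A macro average would instead average the per-label scores and give the rarer label equal weight, which is why the choice of averaging strategy matters when entity types are unevenly distributed.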

Review #4
Anonymous submitted on 15/Nov/2025
Suggestion:
Accept
Review Comment:

The authors have provided the details I had mentioned earlier.