Semantic Quran: a Multilingual Resource for Natural-Language Processing

Tracking #: 405-1518

Mohamed Sherif
Axel-Cyrille Ngonga Ngomo

Responsible editor: 
Guest editors Multilingual LOD 2012 JS

Submission type: 
Dataset Description
In this paper we describe the Semantic Quran dataset, a multilingual RDF representation of translations of the Quran. The dataset was created by integrating data from two different semi-structured sources and was aligned to an ontology designed to represent multilingual data from sources with a hierarchical structure. The resulting RDF data encompasses 43 different languages, which belong to the most under-represented languages in Linked Data, including Arabic, Amharic and Amazigh. We designed the dataset to be easily usable in natural-language processing applications, with the goal of facilitating the development of knowledge extraction tools for these languages. In particular, the Semantic Quran is compatible with the NLP Interchange Format (NIF) and contains explicit morpho-syntactic information on the utilized terms. We present the ontology devised for structuring the data and provide the transformation rules implemented in our extraction framework. Finally, we detail the link creation process as well as possible usage scenarios for the Semantic Quran dataset.
Major Revision

Solicited Reviews:
Review #1
By Riccardo del Gratta submitted on 15/Jan/2013
Review Comment:

The paper is interesting from many points of view, which, to my mind, need to be better specified. Some answers can be found in the manuscript, but not immediately.
1) Section 4 -> one-to-one mapping, "Given the regularity... etc.": It is clear from the previous description of the source dataset, but a plain example of how the structure 'chapterIndex|verse|verseText' is mapped onto 'LOCATION FORM TAG FEATURES' would be helpful for readers. Such an example would better clarify the ontology.
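For illustration, a one-to-one mapping of such pipe-delimited records could be sketched as follows. The namespace, property names, and sample verse text below are hypothetical placeholders, not the paper's actual ontology:

```python
# Hypothetical sketch: turn a 'chapterIndex|verse|verseText' record into
# subject/predicate/object triples, one property per source field.
BASE = "http://example.org/quran/"  # placeholder namespace, not the real one

def record_to_triples(record: str, lang: str):
    chapter, verse, text = record.split("|", 2)
    verse_uri = f"{BASE}chapter{chapter}/verse{verse}"
    return [
        (verse_uri, f"{BASE}chapterIndex", chapter),
        (verse_uri, f"{BASE}verseIndex", verse),
        (verse_uri, f'{BASE}verseText', f'"{text}"@{lang}'),  # language-tagged literal
    ]

for triple in record_to_triples("1|1|In the name of God...", "en"):
    print(triple)
```

The point of such an example would simply be to show that every source field ends up as exactly one property of the verse resource, which is what makes the mapping one-to-one.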

2) Parallel corpus -> In Sections 6 and 7, "parallel corpus... 42 languages" appears. I guess this refers to the work done on the sources (manual revision and so on), but I would like this aspect to be specified more precisely, for example which kind of parallelism: sentence, word, lexical item...

3) lemon -> Section 5, Listing 1, '?y rdf:type lemon:LexicalEntry': Are you planning to extract a lexicon from the corpora? If so, add it to the conclusion; if it has already been done, just say so.

Minor notes
- add a reference to the Jena framework
- is it possible to move Table 1 after the citation?
- should 'to words in the same language with have exactly the same label' read 'to words in the same language with exactly the same label' or 'to words in the same language which have exactly the same label'?

Review #2
By John McCrae submitted on 04/Feb/2013
Major Revision
Review Comment:

This paper describes a multilingual corpus based on the Koran, in 42 languages, many of which are severely under-resourced. As such, this work is of interest to researchers in both translation and comparative linguistics.

It was significantly difficult to review this paper, as the resource described did not seem to be available! The authors must clarify how they intend to maintain this resource's availability and future evolution.

The resource itself is described by an ontology that reuses several other linked data vocabularies. The authors do not clarify why they made these particular choices, e.g., why GOLD was used for basic linguistic categories, as opposed to ISOcat or OLiA, or why existing linked data models of corpora, such as POWLA, were not used.

The linking section is particularly weak: links are made to DBpedia and Wiktionary, but it is not clear whether any disambiguation is applied. The only description of this mapping is an XML snippet showing the input to one of the authors' systems. It needs to be defined more clearly what this matching is... from reading the original paper, I guess the "trigram" metric means it is a fuzzy string-based match? Furthermore, there should be some evaluation of the correctness of these mappings.

The use-case section is a bit too technical and only really understandable to someone familiar with SPARQL. A more general, accessible overview of the intended uses would be desirable.

p1. "English and American labels"?? American is not another language!

Review #3
Anonymous submitted on 16/Feb/2013
Major Revision
Review Comment: