A Topic Ontology for Modeling Topics of Old Press Articles

Tracking #: 2366-3579

Authors: 
Mirna El Ghosh
Nicolas Delestre
Jean-Philippe Kotowicz
Cecilia Zanni-Merk

Responsible editor: 
Special Issue Cultural Heritage 2019

Submission type: 
Ontology Description
Abstract: 
This article introduces Topic-OPA, a general topic ontology for modeling topics of old press articles. In Topic-OPA, topics are represented as nodes in a structure with two different schemes: hierarchical and non-hierarchical. The hierarchical scheme is expressed by taxonomic (is-a) edges among the topics. The non-hierarchical scheme is represented by cross-references that relate different topics. The hierarchy of topics is extracted from the open knowledge graph Wikidata using SPARQL queries. Furthermore, a curation process is applied to refine and enrich the results. Topic-OPA is designed to be small enough for maintainability and curation, and it aims to cover the most relevant topics of the old press articles domain. An experimental use case is presented to demonstrate the utility of Topic-OPA for topic labeling of old press articles.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 18/Dec/2019
Suggestion:
Major Revision
Review Comment:

I enjoyed reading this paper. It is quite well-written and motivated, and the proposed approach is simple and interesting.

The biggest concern I have with this submission is that the use-case experiment of Section 6 does very little to validate the approach: sure, we can use Topic-OPA to compute path-based similarity measures between concepts, but how can we tell whether this similarity measure (and, as a consequence, Topic-OPA) is suitable for representing old press articles? I would have liked to see, for example, whether co-occurrence of terms in old press articles is more strongly correlated with these Topic-OPA-based similarity measures than with similarity measures based on more generic topic ontologies. Something like that could be a powerful argument for the suitability of the proposed ontology, whereas the described use case does, in my opinion, very little to "demonstrate the utility of Topic-OPA for topic labeling of old press articles".
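To make the notion of a path-based similarity measure concrete, here is a minimal sketch. It assumes one common formulation, 1 / (1 + shortest path length), which is not necessarily the measure used in the paper. The toy taxonomy `IS_A` and all function names are hypothetical and are not taken from Topic-OPA.

```python
from collections import deque

# Hypothetical toy taxonomy as child -> parent (is-a) edges.
# This is an illustration only, not the real Topic-OPA hierarchy.
IS_A = {
    "football": "sport",
    "tennis": "sport",
    "sport": "topic",
    "election": "politics",
    "politics": "topic",
}


def neighbours(is_a):
    """Build an undirected adjacency map from child -> parent edges."""
    adj = {}
    for child, parent in is_a.items():
        adj.setdefault(child, set()).add(parent)
        adj.setdefault(parent, set()).add(child)
    return adj


def path_length(adj, a, b):
    """Return the number of edges on the shortest path, or None if unconnected."""
    if a not in adj or b not in adj:
        return None
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(adj, a, b):
    """Assumed measure: 1 / (1 + shortest path length); 0.0 if unconnected."""
    dist = path_length(adj, a, b)
    return 0.0 if dist is None else 1.0 / (1 + dist)


ADJ = neighbours(IS_A)
```

With this toy data, `path_similarity(ADJ, "football", "tennis")` is higher than `path_similarity(ADJ, "football", "election")`, since the first pair shares a direct parent. The validation asked for above would compare such scores against term co-occurrence statistics from the corpus.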

Topic-OPA is largely handcrafted: the primary data obtained from Wikidata required quite a lot of human curation to clear it of anomalies and of entities not relevant to the topic, and to enrich it further, and the SPARQL queries used to extract the primary data were also handcrafted. This is not a criticism, but it further highlights the necessity of validating the resulting ontology for the chosen domain. Such a validation is largely absent from the submission; but if it were to be added, I believe that this could be a very solid contribution and one certainly deserving of publication.

Review #2
Anonymous submitted on 02/Jan/2020
Suggestion:
Reject
Review Comment:

The paper presents an ontology-based method for topic extraction from old press articles.

Although the paper is fairly well written and the scope of the work is clear, I am sorry that I cannot recommend publication of the work in SWJ, at least in its present form. The decision is based on the evaluation criteria specified for works submitted to the special issue as 'ontology description', as well as on further considerations.

This manuscript was submitted as 'Ontology Description' and should be reviewed along the following dimensions: (1) Quality and relevance of the described ontology (convincing evidence must be provided).

The underlying ontology is not provided in full. Furthermore, as far as I could understand, the core of the ontology is actually the Wikidata ontology, while the contribution of the authors is rather limited.

(2) Illustration, clarity and readability of the describing paper, which shall convey to the reader the key aspects of the described ontology.

The manuscript is generally well written, but it lacks details on some very relevant aspects of the work carried out.

1. What, specifically, is the set of classes and properties of the proposed ontology?

2. How much of the proposed ontology is a reproduction of Wikidata and how much is novel?

Besides the above comments, there are two further points which I believe should be better explained and/or methodologically revisited:

1. The authors eventually aim at associating documents to topics. They choose the following topics: politics, political system, sport, economy, activity, location, news media, humanities, science, product, transport and organization. The methodological grounds behind this decision should be further explained.

2. The final linkage to topics is effectively carried out via named entity recognition, which allows one to map the documents' textual content to the ontology. However, no details on how this task is carried out are provided. Because this step is key for actually applying the ontology, the authors should make an effort to clarify how they deal with this issue. Furthermore, it would be relevant to know, at least for a sample set, what percentage of documents is effectively covered by the ontology. This would help assess the completeness and the applicability of the proposed ontology for the topic classification task.
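The coverage figure requested in point 2 could be computed along the following lines. This is only a sketch under stated assumptions: a document is counted as covered when at least one of its recognised entities is known to the ontology, and the function name `coverage` and the input shapes are hypothetical.

```python
def coverage(doc_entities, ontology_entities):
    """Fraction of documents with at least one entity known to the ontology.

    doc_entities: dict mapping a document id to the set of entity names
                  recognised in that document (assumed input shape).
    ontology_entities: set of entity names that the ontology can map to topics.
    """
    if not doc_entities:
        return 0.0
    covered = sum(
        1
        for entities in doc_entities.values()
        if any(entity in ontology_entities for entity in entities)
    )
    return covered / len(doc_entities)
```

A stricter variant could require every recognised entity to be mapped, or report the share of mapped entities per document instead of a yes/no count.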

Review #3
Anonymous submitted on 03/Jan/2020
Suggestion:
Reject
Review Comment:

The paper under review describes how a topic ontology can be developed and applied for the task of topic labelling of documents.
The general idea behind the latter task is that:
1. for each document we find a set of named entities
2. for each named entity we find a set of its topics
3. the document is then labelled with the topics found thereby.
The topic ontology described in the paper is used in stage 2.
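The three stages above can be sketched as follows. This is a toy illustration, not the authors' implementation: stage 1 is replaced by a naive case-insensitive substring lookup that stands in for a real named entity recogniser, and the `entity_topics` mapping is a hypothetical stand-in for the ontology lookup of stage 2.

```python
def find_entities(text, known_entities):
    """Stage 1 (placeholder): naive substring match instead of real NER."""
    lowered = text.lower()
    return {entity for entity in known_entities if entity.lower() in lowered}


def label_document(text, entity_topics):
    """Stages 2 and 3: look up each entity's topics and take their union.

    entity_topics: dict mapping an entity name to a set of topic names
                   (hypothetical stand-in for the topic ontology).
    """
    topics = set()
    for entity in find_entities(text, entity_topics.keys()):
        topics |= entity_topics[entity]
    return topics
```

For example, with `entity_topics = {"Rouen": {"location"}, "Tour de France": {"sport"}}`, a text mentioning both entities is labelled with both `location` and `sport`.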

I found the paper's solution to the problem of document annotation to be of minor novelty and originality, and unfortunately the paper is also plagued with a number of more substantial issues.

The main defect of the paper is that it does not give its reader access to the ontology itself ('Topic-OPA').
So although the paper describes some general features of this ontology and provides its, somewhat idiosyncratic, evaluation, its main result is not verifiable.

The next serious hindrance is the description of the ontology development process.
If I got it right, it goes as follows:
A. We start from a set of named entities: 'The named entities are extracted, in previous work, from the knowledge graph Wikidata.' (p. 4)
B. Given this set, a set of general topics is found: 'By analyzing these entities, we have defined the most relevant general topics of the old press articles domain' (p. 4)
C. Given the set of general topics, a set of topics is found by collecting all subconcepts of the general topics.
Only step C is described in sufficient detail. Steps A and B are, as far as I can tell, described only by the sentences quoted above and these descriptions are appallingly insufficient.
For example, what does it mean that the authors analysed the named entities (found in the previous work)? How am I to verify that they didn't make any mistake doing this? Etc.

The third main problem with the paper is that it does not describe the corpus of the 'old press articles' in any detail.
We don't know how many documents the corpus contains, what language they are written in, how long they are, etc.
(There is a reference to some ASTURIAS project, but without any specification of this project.)
Also, I suspect that, given Topic-OPA and the methodology described below, it would be relatively easy to annotate the documents in the corpus with the topics and evaluate the annotations.
Still we find no such result in the paper.

Some minor issues:
- it is rather odd to put the related works section in the middle of the paper;
- some claims/sentences look obscure to me. For example, what does '130k' stand for in 'The results of this phase (130k) contain direct inconsistencies and anomalies that need to be revisited and curated by human curators.'?
The number of topics, nodes, edges?
- I am not a native speaker of English, but the grammar of some sentences looks wrong to me. For instance: 'Given a named entity n, from a set of named entities N, and a topic t, from a set of topics T.'