Ontology-Driven Extraction of Research Processes

Tracking #: 1668-2880

Authors: 
Vayianos Pertsas
Panos Constantopoulos

Responsible editor: 
Andreas Hotho

Submission type: 
Full Paper
Abstract: 
Extracting information from a research article, associating it with information from other sources and deriving new knowledge constitute a challenging process that has not yet been fully addressed. Here we present Research Spotlight, a system that leverages existing information from DBpedia, retrieves articles from repositories, extracts and interrelates various kinds of named and non-named entities by exploiting article metadata, the structure of text as well as syntactic and semantic constraints, and populates a knowledge base in the form of RDF triples. An ontology specifically designed to represent research processes and practices drives the whole process, and the outcome adheres to linked data standards. The system is evaluated through two experiments that measure the overall accuracy in terms of token- and entity-based precision, recall and F1 scores, as well as entity boundary detection, with promising results. Error analysis provides useful insights into the capabilities and optimization possibilities of each module of the system.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 26/Jul/2017
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

Summary:
In this paper, the authors present a multi-step workflow that builds a research spotlight from web data sources. The output of this workflow is an RDF knowledge base. First, using a set of keywords, they generate a list of named entities from the DBpedia knowledge base. Then, using this set of named entities and a set of articles selected from the Web (after a segmentation step and classifier training), they extract metadata, named entities and non-named entities. Finally, a relation extraction step is applied: it takes the set of named/non-named entities and a set of constraints and produces related entities (e.g. has-goal) as well as parthood/sequence relations such as part-of and isFollowedBy. An experimental evaluation has been conducted using 9659 extracted entities. The results have been evaluated against manual annotations produced by two human experts. Eight research domains were considered in these experiments.
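To make the first step of this workflow concrete, a minimal sketch of keyword-driven seed-entity retrieval from DBpedia is given below; the endpoint, query shape and label-matching strategy are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch (not the authors' implementation) of retrieving candidate
# named entities from DBpedia for a given keyword via SPARQL.
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_entities_for_keyword(keyword, limit=50):
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
        SELECT DISTINCT ?entity ?label WHERE {{
            ?entity rdfs:label ?label .
            FILTER (lang(?label) = "en" &&
                    CONTAINS(LCASE(STR(?label)), LCASE("{keyword}")))
        }} LIMIT {limit}
    """)
    results = sparql.query().convert()
    return [(b["entity"]["value"], b["label"]["value"])
            for b in results["results"]["bindings"]]

# e.g. dbpedia_entities_for_keyword("random forest")
```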

1- Originality:
The problem of building a research spotlight has received little attention, which justifies the work. Rather than presenting an approach from scratch, the authors present a novel and interesting combination of existing techniques. The use of a scholarly ontology and declared constraints makes the approach generic and easy to apply to different domains.

2- Significance of the results:
Providing an RDF knowledge base that can be used by researchers as well as governmental and non-governmental research organizations to find relevant, synthesized information about research work is of good value. Designing a complete workflow that builds such a knowledge base is a challenging task, since every tool that is introduced and used adds its own sources of error. However, the approach could be improved from a usability point of view. The user interface should be improved, and the querying modes should be developed and better discussed in the paper.
The evaluation part should also be improved: as presented in the paper, it evaluates the system in a “holistic” way rather than each module independently of the others. One would expect a separate evaluation for named and non-named entity extraction, for relation and sequence extraction, and for the linking modules. How trustworthy is the gold standard, given that only two human experts were solicited to annotate the data from eight different domains? Are they experts in all these domains?
In addition, for the linking step with linked open data, it is not clear which kinds of links are generated: sameAs only, or others? For sameAs links, why are dedicated data-linking tools not used? Why has only DBpedia been used, and why not the DBLP and Google Scholar datasets?
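For clarity, the kind of link this question refers to would look roughly like the following sketch, which mints a hypothetical local entity and asserts an owl:sameAs link to its DBpedia counterpart; both URIs are invented for illustration.

```python
# Hypothetical illustration of an owl:sameAs link from a locally minted entity
# to its DBpedia counterpart; both URIs are invented for this example.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
local_entity = URIRef("http://example.org/rs/method/kernel-pca")
dbpedia_entity = URIRef("http://dbpedia.org/resource/Kernel_principal_component_analysis")
g.add((local_entity, OWL.sameAs, dbpedia_entity))

print(g.serialize(format="turtle"))
```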

3- Quality of writing:
The paper is readable but needs improvements and clarifications. The explanation of the algorithms is hard to follow since they are not always illustrated. The related work should also be better organized, with subsections that correspond to the different aspects addressed in the paper: entity extraction, relation extraction and linking.

Some minor remarks:
- A more readable algorithm style should be used (e.g. the algorithm2e LaTeX package) with numbering and boxes.
- Figures 4, 5, 6, 9 and 10 are too small and not readable.
- In Figure 7 there is a mistake: the relation “isFollowedBy” should be placed between the activity “applied Kernel PCA …” and the activity “applied random forest …”, not between “gathered the data using …” and “applied Kernel PCA …”.

Review #2
By Giuseppe Rizzo submitted on 10/Nov/2017
Suggestion:
Reject
Review Comment:

This paper presents Research Spotlight, a system that extracts key information such as entities and relations from the scientific literature, classifies it using an already published ontology, and populates a knowledge base that is ultimately published using semantic technologies. The system is a hybrid of machine learning and rule-based reasoning: the machine learning part reuses popular tools, while numerous task-specific rules are proposed in this paper. The evaluation is two-fold, blending quantitative and qualitative analysis: i) entities and tokens (i.e. non-classified sequences of characters) are analyzed through a conventional linguistic approach using a gold standard generated by human annotators over a dataset created for this purpose; ii) errors are inspected over the output produced by the entity and token classifiers. The work is strongly inspired by automated systems that extract information from textual content and by linking to external knowledge, these being the two main inspirational points reported in the related work.

This review follows the rules for reviewing a Full Paper; however, I strongly recommend that the authors, when re-submitting, change the paper type, since as presented it is a System Report.

(1) originality

There is a body of related work in the field of Systematic Literature Review (SLR) concerning automated approaches to classifying research materials. In addition, the SciKnow workshop has proposed numerous initiatives towards the automated deep inspection and analysis of scientific studies. I suggest the authors dig into these resources and compare their work against them.
In any case, the complexity of the task, as presented by the authors, lies in aggregating different modules and streamlining the entire value chain from the scientific publication (HTML/PDF) to the generation of a knowledge base.
The generation of a knowledge base has been investigated extensively in numerous other fields leveraging semi-structured content. For this reason, given the great emphasis on the entire solution value chain and the lack of an original point, I suggest changing the paper type to System Report.

(2) significance of the results

There is no doubt that this system may be of use to a large community, which is already something that I value significantly. However, the evaluation section is built around measuring how good or bad the entity/token classification is. The annotators created a gold standard that is used as a reference benchmark, but no details are given about the annotators (see below for recommendations) or about the dataset. Moreover, the qualitative analysis is performed only on these two contributions, disregarding the relation extraction and the overall "usability and correctness" of the system as a whole from a user-centric point of view.

(3) quality of writing.

The concepts are clear, and the phrasing is OK. However, the paper needs a re-organization before re-submission. A few suggestions are reported below.

Questions and comments that can be a starting point for a re-submission:

- In Sec 1, the authors mention that "The assessment of content, the extraction of valuable information, ... are left to the reader", which I find a bit bold, as approaches going in that direction have been proposed both commercially (see Google Scholar, for instance) and experimentally (automated SLR and SciKnow-related materials). Perhaps the authors want to discuss this further?

- In Sec 1, the authors present this approach as the one-size-fits-all solution for extracting valuable content from scientific materials (PDF/HTML) and saving it into queryable knowledge bases. Taking again the example of Google Scholar: what is the real added value of this approach, and why is it convenient/profitable for a researcher or a library? The authors fail to really motivate this work and to shed light on the real contribution. A revision in that perspective would strengthen the contribution.

- Sec 2 presents an already published ontology. I understand that this section has been added for the sake of being comprehensive and self-contained; however, I recommend reworking it to reduce the level of detail about the ontology and to explain why this ontology is useful in this domain.

- Sec 3 is very fragmented: many concepts are only introduced, with references to subsequent sections. This creates confusion and generates redundant content across Sec 3 and 4. The two could be merged.

- The IO format is mentioned. Please provide a reference and explain why IO was chosen instead of IOB. Is this a technology constraint?
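For reference, the difference between the two tagging schemes can be illustrated with an invented example (sentence and labels are hypothetical, not taken from the paper's data):

```python
# Invented illustration of IO versus IOB token tagging; labels are hypothetical.
tokens   = ["We", "applied", "Kernel", "PCA", "and", "random", "forest", "."]
io_tags  = ["O", "O", "I-METHOD", "I-METHOD", "O", "I-METHOD", "I-METHOD", "O"]
iob_tags = ["O", "O", "B-METHOD", "I-METHOD", "O", "B-METHOD", "I-METHOD", "O"]

# IO only marks whether a token is inside an entity of a given type; IOB adds
# a B- prefix so the boundary between two adjacent entities of the same type
# can be recovered.
for tok, io, iob in zip(tokens, io_tags, iob_tags):
    print(f"{tok:10s} IO={io:10s} IOB={iob}")
```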

- In Sec 4:
+ what are the keywords? Are they paper-specific (as I would expect)?
+ "through the APIs of various ..." meaning what? what is the criteria of using those APIs? Pertinence with the content of the paper or purely aiming to maximize the recall? If so, how scalable is this solution?
+ annotating articles with rule-based approaches is questionable, given that such approaches are considered obsolete due to their low performance (this is also highlighted in Sec 5.3). Why such a choice?
+ it is unclear what the training set of the Stanford CRF is. Does it come from the rule-based approach? If so, the authors need to show that the rule-based approach is almost perfect; otherwise the CRF learns from bad data, which degrades its predictions (see the sketch after this list). This could be supported with an extensive analysis of the performance of the rule-based component under different selection criteria and in different domains
+ it is clear that, as a general rule of thumb, it is easier to process structured content such as HTML/XML than PDF. However, since this is one of the claims of the paper, did the authors consider an error analysis for measuring the metadata extraction?
+ is "the module" mentioned in Sec 4.3 the one based on the CRF? please clarify
+ Alg 3/4/5/6/7 are, again, based on rules. They need to be further motivated and evaluated extensively to prove that they work in practice, since this usage goes against the last 20+ years of research in linguistics/machine learning
+ for all algorithm formulations, it would be better to stick to the same notation in the boxes and in the sections (names of algorithms, variables, ...)
+ I recommend re-organizing this section so that each task is presented with its formulation first, followed by its empirical assessment
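Regarding the Stanford CRF training set raised above: if the CRF is bootstrapped from the rule-based annotator, the training data would typically be a token/label file of the kind Stanford NER's CRFClassifier consumes, so any systematic rule error propagates directly into the learned model. A hypothetical sketch of such a conversion (sentences and labels are invented):

```python
# Hypothetical sketch: turning rule-based annotations into the tab-separated
# token/label format commonly used to train Stanford NER's CRFClassifier
# (one "token<TAB>label" pair per line, blank line between sentences).
# Systematic errors of the rule-based annotator end up here and are learned.

def write_crf_training_file(annotated_sentences, path):
    """annotated_sentences: list of sentences, each a list of (token, label) pairs."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in annotated_sentences:
            for token, label in sentence:
                f.write(f"{token}\t{label}\n")
            f.write("\n")  # sentence boundary

# Invented rule-based output:
sentences = [[("We", "O"), ("applied", "O"), ("Kernel", "METHOD"),
              ("PCA", "METHOD"), (".", "O")]]
write_crf_training_file(sentences, "crf_train.tsv")
```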

- how successful Research Spotlight is, is measured according to its ability to recognize entities and tokens. To some extent, this means being able to generate the right nodes in the graph. What about the relations?

- please report the statistics of the dataset and the number and profile of the annotators

- the error analysis highlights the problem of using rule-based approaches. The authors frame it as the rules not being robust enough to cope with writing style, but this is an expected problem of rule-based approaches: they usually work when the domain matches their design, and otherwise their contribution is faulty. How do the authors intend to go beyond this limitation? Since this is an essential part of the overall contribution, it needs further work in the resubmission

- the selection of the links is neglected in the evaluation, even though it is the first component of the entire solution. This needs to be properly studied

- Sec 6 would be better moved after Sec 1, as I had the impression that many concepts in Sec 1 needed a proper introduction and comparison with state-of-the-art approaches at the beginning of the paper

Review #3
Anonymous submitted on 17/Nov/2017
Suggestion:
Major Revision
Review Comment:

The paper describes a tool for extracting knowledge about research processes from scientific papers and related data sources. Such a tool is of great interest to the scientific community, as it might support researchers in retrieving relevant information, in quickly browsing large numbers of publications, and in discovering dynamics of scientific domains.

The paper is well written in terms of organisation and usage of the English language. My impression is that the authors have invested a lot of effort in implementing, researching and evaluating the system, and that their work is worth being published. However, it remains unclear to me what the genuine contribution of this paper is. Is it the presentation of Research Spotlight as a new tool, which relies on the authors' previously published research; or is it a research paper whose main contribution is the content of Chapter 4 and its evaluation in Chapter 5? Both would have strong benefits, and likely high impact; but it should be made clear which is attempted. From the introduction I derived that this paper is primarily the former. If this is the case, however, I would have expected
1) to get access to a web interface of the system for first-hand experience, or at least to see a file with the entities and relationships derived in Chapter 5, given that the system is located in an open data setting,
2) to see an explanation which algorithms in Chapter 4 are presented here for the first time, and which have been published before.

In conclusion, I recommend asking the authors to submit a substantially revised version that addresses the intention of the paper more clearly and focuses on it in the presentation of the results.

Disclaimer: I am more an informed outsider than an expert in NLP. Therefore my review focuses on structural aspects of the overall approach.
