Semantic Enrichment for Recommendation of Primary Studies in a Systematic Literature Review

Authors: 
Giuseppe Rizzo, Federico Tomassetti, Antonio Vetro’, Luca Ardito, Marco Torchiano, Maurizio Morisio, Raphaël Troncy
Abstract: 
A Systematic Literature Review (SLR) identifies, evaluates and synthesizes the literature available for a given topic. This generally requires a significant human workload and has a subjectivity bias that could affect the results of such a review. Automated document classification can be a valuable tool for recommending the selection of primary studies. In this paper, we propose an automated pre-selection approach based on text mining and semantic enrichment techniques. Each candidate document is first processed by a named entity extractor. The DBpedia URIs coming from the entity linking process are used as external sources of information. Our system collects the bags of words of those sources and adds them to the initial document. A Multinomial Naive Bayes classifier discriminates whether the enriched document belongs to the positive example set or not. We used an existing, manually performed SLR as a gold-standard dataset. We trained our system with different configurations of relevant papers and assessed the goodness of our approach statistically. Results show an 18% reduction in the manual workload that a human researcher has to spend. As a baseline, we compared the enriched approach with one based on a plain Multinomial Naive Bayes classifier. The improvements range from 2.5% to 5% depending on the size of the trained model.
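The core enrichment step described in the abstract — merging a document's bag of words with the bags of words of the external sources found by entity linking — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the token lists stand in for the output of the named entity extractor and the DBpedia lookup.

```python
from collections import Counter

def enrich_document(doc_tokens, source_token_lists):
    """Merge the document's bag of words with the bags of words of
    the external sources linked to its named entities."""
    bag = Counter(doc_tokens)
    for tokens in source_token_lists:
        bag.update(tokens)
    return bag

# Illustrative run with made-up tokens: the enriched bag would then be
# fed to the classifier in place of the original document's bag.
doc = ["naive", "bayes", "classifier"]
linked_sources = [["bayes", "theorem", "probability"]]
enriched = enrich_document(doc, linked_sources)
```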
Submission type: 
Full Paper
Decision/Status: 
Reject and Resubmit
Reviews: 

Solicited review by Tuukka Ruotsalo:

The paper proposes a method to reduce human workload and subjectivity bias in selecting related publications given a set of source publications. It compares two bag-of-words settings of a naive Bayes classifier: one with only text features originating from the source documents, and another with text features from the source documents plus the associated Wikipedia pages linked using the Open Calais tagger.

The problem is interesting and worth exploring. There are three major problems with the paper: 1) the experimental setup is invalid, 2) the approach is not novel, and 3) the results do not show improvement over the baseline. The approach uses a relatively simple document expansion and a well-known (not novel) classifier. The empirical comparison only compares a bag-of-words classifier with a bag-of-words classifier enriched with Wikipedia content. Expanding the bag of words with Wikipedia content may be a good idea, but the experimental setup is not adequate and no conclusions can be drawn. There is far too little data, it is not clear how the gold standard is produced, and the evaluation is not adequate.

If I understood the numbers correctly, the precision at the 95% recall level can be computed from the data reported in the paper. Assume 50 relevant documents, of which 5 were used as input, and a total of 1829 documents to be read (what the authors call workload; Table 2), with the difference between the baseline and the enriched method being 49.99 ≈ 50 (Table 3). The result is a precision of 0.11 for the baseline and 0.12 for the enriched method.
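The reviewer's conversion from workload figures to precision follows from the standard definitions: at a fixed recall level, the number of true positives is recall × relevant, and precision is that count divided by the number of retrieved documents. A minimal sketch, with purely illustrative numbers rather than the paper's exact figures:

```python
def precision_at_recall(n_relevant, recall, n_retrieved):
    """Precision implied by reaching a given recall level over
    n_relevant relevant documents while retrieving n_retrieved."""
    true_positives = recall * n_relevant
    return true_positives / n_retrieved

# Illustrative only: 45 relevant papers, 95% recall, 400 retrieved.
p = precision_at_recall(45, 0.95, 400)
```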

Given that the approach is not really novel, the experiment is invalid and the results do not show improvement over the baseline, I suggest to reject the manuscript.

Detailed comments:

I don't understand how exactly relevance assessments were created. In particular, the sentence "all of them evaluated from the SRL taken as reference" is unclear.

I don't understand how exactly the input sets of papers were created: "we built 30 different I0 sets per each dimension choosing them randomly among 50 relevant papers." The resulting sets differ in size, but how exactly are these sets formed if the procedure is random?

The statistical tests and hypothesis are ill-defined. The experimental setup is invalid. The problems, in particular, are:

- I don't see why a recall requirement of 95% or greater should be selected.

- The authors should use precision and recall to report the results. This is a simple classification task; the manual workload measure is artificial and hides the actual performance.

- The paper engages in fishing: it runs the method with all possible configurations (over 30,000) and then selects the configurations for which a difference between the sets can be found.

- If I understand correctly, the set of relevant papers is always at or below 50, while the number of false positives is over 2000; this is far too little and too unbalanced to draw conclusions.

Solicited review by Siegfried Handschuh:

Good Points:
* The paper is well written and the structure is clear. The problem is interesting, and a clear reduction of human effort with respect to scientific literature review is important.
* The algorithm is clearly defined. An evaluation methodology is provided with a gold standard dataset. Results are reasonably described and the work is interesting.

Critique:
* The paper seems too short and the content too immature for a journal publication. The result of the experiment (from 2100 to 1800 papers to read) seems disappointingly low and needs more explanation, i.e. why is this such a hard problem?
* Please clarify the justification for the 95% recall assumption. I would assume that a good F-measure (a balance of recall and precision) should be the goal; why do you use a different assumption?
* Related work should be discussed much more for a journal publication. E.g. you hint that related work might have better results than your approach. You should clarify that and compare your results directly to the state of the art. Much more content should be added to the related work. The reader needs a better justification of how you improve on the state of the art; that point is not very clear to me.
* Much more experimental work is required. Other ML classification algorithms, such as SVMs or Markov chains, should be tested as indicated. You should provide a more thorough and broader text classification experiment.
* The disambiguation quality of Open Calais in your test-set should be measured, i.e. how many of the entities are misclassified by Open Calais in your experiment? What is the precision?
* The effect of using NERD vs Open Calais should also be measured.
* You propose a k-fold cross-validation. I think you rightly proposed that, and you should provide it in a journal publication.
* Can you explain your dataset? Why did you choose this dataset? Are there datasets available from related work, so you could compare your approach more directly and effectively?
* Table 2 is misleading with respect to the reduction of workload. It seems that you do not achieve a substantial reduction in manual workload, do you? Your proof is basically that doing something is better than doing nothing. You reduce the effort from 2100 to 1800 papers, which is still a massive effort for the researcher. Where is the benefit? And why is this apparently such a hard problem to solve? Please discuss.
* Also, the improvement of 2.5% by semantic enrichment seems somewhat low. Can you discuss this problem? And point to related work in semantic enrichment?
* Future work should also consider a user study with researchers.
* A minor issue: maybe you can improve the variable naming? I personally found it confusing that 'w' is used as a variable for documents, where 'd' would seem more logical.
* Overall the paper is promising but more clarification, discussion and experimental work is required.
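The k-fold cross-validation protocol suggested in the review above can be sketched with a generic split-and-evaluate loop. This is an illustrative outline, not the paper's setup: the classifier passed in (here a trivial majority-class model) is a placeholder for MNB, SVM, or any other algorithm under test.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k shuffled, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(items, labels, k, train_fn, predict_fn):
    """Use each fold once as the test set; return per-fold accuracy."""
    folds = k_fold_indices(len(items), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn([items[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        hits = sum(predict_fn(model, items[j]) == labels[j]
                   for j in test_idx)
        scores.append(hits / len(test_idx))
    return scores

# Placeholder classifier: always predicts the majority training label.
train_majority = lambda X, y: max(set(y), key=y.count)
predict_majority = lambda model, x: model
```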

Solicited review by Anna Lisa Gentile:

The work explores an application of semantic techniques and provides a in-vivo evaluation, based on the task of Systematic Literature Review (SLR). The authors tackle the problem of semi-automation of SLR, which is the process aimed at identifying, evaluating and synthesising the literature available for a given topic.
They exploit text mining and semantic enrichment techniques to perform an automated pre-selection of the available literature.
The research questions concern the improvement of human performance on the task of SLR (quantified as the amount of time saved) with the introduction of the automated selection step. Specifically, two approaches are compared: (i) one using only text mining techniques and (ii) one enriching the original texts by exploiting a Knowledge Source.
The gold standard used for the evaluation is an existing manually performed SLR on the field of Software Engineering consisting of 50 papers. The pool of candidate literature is selected from the IEEEXplore portal (for a total of 2215 papers, including the 50 correct ones).
Results show a reduction of the manual workload of 18% by using approach (i), with an additional improvement of 2.5% to 5% by using approach (ii). The workload is quantified in terms of number of papers to read, and the improvement is calculated considering as baseline that all available literature is to be read by the researcher if no automation is available.

The proposed work is interesting and introduces the usage of semantic techniques for the task of SLR.
My two main concerns regard:

- the proposed "enriching approach" consists of: (i) extracting a Bag of Words (BoW) from each paper using only the text of the title and abstract, and (ii) enriching the BoW by exploiting linkage to DBpedia for the extraction of additional keywords.
The authors argue that discarding other important pieces of text (such as the conclusion) is compensated by the semantic enrichment. My question is why not compare the BoW obtained with the "enriching approach" against the BoW from the full text, with no semantic enrichment. If the performance of the "enriching approach" is still better, this would further motivate the additional effort of semantic enrichment.

- the baseline used for comparisons depicts the worst-case scenario, as it assumes that all available literature must be read, although a partial reading is sometimes sufficient for a human reader to discard a paper.
As future work, such information could be collected using the tool that the authors describe.

Minor notes:
- the linked SourceForge page does not contain a downloadable tool (in the Files section).
- check consistency in the introduction of acronyms, e.g. Complement Naive Bayes (CNB) is used in section 3.4 but introduced in section 6; Multinomial Naive Bayes (MNB) could also be introduced at its first usage (abstract or section 1) rather than later (section 3.4).
