Analyzing the generalizability of the network-based topic emergence identification method

Tracking #: 2689-3903

Authors: 
Sukhwan Jung
Aviv Segev

Responsible editor: 
Guest Editors DeepL4KGs 2021

Submission type: 
Full Paper
Abstract: 
The field of topic evolution helps in understanding current research topics and their histories by automatically modeling and detecting the sets of shared research fields in academic papers as topics. This paper provides a generalized analysis of the topic evolution method for predicting the emergence of new topics, where topics are defined by the relationships of their neighborhoods in the past, allowing the results to be extrapolated to future topics. Twenty fields-of-study keywords were selected from the Microsoft Academic Graph dataset, each representing a specific topic within a hierarchical research field. The binary classification of newly introduced topics from the years 2000 to 2019 consistently resulted in accuracy and F1 over 0.91 for all twenty datasets, a performance retained with one-third of the 15 features used in the experiment. Incremental learning resulted in a slight performance improvement, indicating that there is an underlying pattern to the neighbors of new topics. The results show that network-based new topic prediction can be applied to various research domains with different research patterns.

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Angelo Salatino submitted on 28/Mar/2021
Suggestion:
Major Revision
Review Comment:

Review

In this paper, the authors present an approach for detecting the emergence of new topics. The authors argue that common NLP approaches are limited, but if they are integrated with network-based approaches it is indeed possible to forecast the emergence of new topics in the following years.

Although the approach is fairly good, interesting, and timely, the narrative requires extensive rework before being worthy of acceptance. Many parts of the paper are confusing, and I had to read paragraphs multiple times before being able to picture the process in my mind; it is not straightforward.

In what follows, I highlight some of the weaknesses in more detail.

The abstract does not sell the work properly. Indeed, you mention that you select 20 FoS from MAG.
How different are they from each other? Are they all in one giant field, or are they spread over the whole domain of science? First, you say you provide a topic evolution method for predicting the emergence of new topics, then you say you selected 20 fields of study. Is your predictor going to detect the emergence of only those 20 topics? There is ambiguity.
In general, these questions are somewhat cleared up later in the paper, but since our reading proceeds from the title toward the conclusion, it would be nice to be on the same page from the beginning. Can you add some more details?

In general, your work is very interesting and timely; however, it does not fully compare with the state of the art. You propose a *very* similar approach to Salatino et al. [1], but you do not explain how you differentiate from them.

In the introduction, you state “twenty topic networks were generated”. So is this 20 networks over 20 years, or one for each topic? Has the dataset been verticalized into certain areas? How many papers did you analyse in the end?
Did you analyse whether there was a large overlap between the datasets? This could explain why the classifier retains high classification accuracy.

In related work, you describe current approaches for identifying the evolution of topics, and such a phase is enabled once we have a clear idea of how to identify/extract topics from a corpus. The same research team as above developed an interesting approach [2] that you might want to look at.

In section 3.2 you state “where topics in year y are classified as new or old…”. Why do you need to do so? I mean, this is clarified later, but it is not clear at this stage why. The same applies to the state of a node. Why do you need to compute the state of a node?
Later in the same paragraph, you talk about the more prominent topics. What are the top 100 topics with the largest number of nodes? If a node represents a topic, I read that sentence as the 100 topics with the largest number of topics. I am not sure what exactly it means.

In section 3.3 you talk about pairs K*(K-1). Should it be (K*(K-1))/2 to avoid duplicate pairs?
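
For reference, a quick check of the two counts (ordered versus unordered pairs), purely illustrative:

```python
from itertools import combinations, permutations

K = 20
items = range(K)

# Unordered pairs: K*(K-1)/2 = 190 for K = 20
print(len(list(combinations(items, 2))))   # 190

# Ordered pairs, counting both (a, b) and (b, a): K*(K-1) = 380
print(len(list(permutations(items, 2))))   # 380
```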

In section 4 you give more details about the experiments. In particular, as topics, you use Microsoft’s Fields of Study. However, I would like to point out two drawbacks of using FoS.
----- First:
I think there isn't a report showing the performance (precision, recall, F-measure) of the concept tagging done by Microsoft. We work a lot with MAG and, on several occasions, we found inconsistencies. My suggestion is that, until Microsoft provides some estimate of the performance of its algorithms, it is better to use FoS only at the first level (Computer Science, Medicine, Mathematics, Economics, and so on). After all, FoS is very granular, and it is pretty normal to have misclassifications when you have more than half a million entities.
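
For example, restricting the analysis to the top-level fields is straightforward once the FieldsOfStudy dump is at hand; a minimal sketch (file name and column list assumed from the MAG schema, to be adapted to the actual dump):

```python
import pandas as pd

# Column list assumed from the MAG FieldsOfStudy schema; adjust to the
# actual dump (the TSV files ship without headers).
fos = pd.read_csv(
    "FieldsOfStudy.txt", sep="\t", header=None,
    names=["FieldOfStudyId", "Rank", "NormalizedName", "DisplayName",
           "MainType", "Level", "PaperCount", "PaperFamilyCount",
           "CitationCount", "CreatedDate"],
)

# Keep only the top-level disciplines (Level == 0), e.g. Computer science,
# Medicine, Mathematics, Economics, ...
top_level = fos[fos["Level"] == 0]
print(top_level[["FieldOfStudyId", "DisplayName"]])
```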

----- Second:
Papers have been tagged retroactively with concepts in FoS! So the year the topic is first used in the dataset, fy, is misleading. Indeed, Salatino et al. [1] use the authors' provided keywords, as they better reflect the status of science at that given time. Just to give you an actual example, I checked in my version of MAG, and the topic Semantic Web (which we all know emerged around the beginning of this century) first appears in 1920, because Microsoft tagged https://www.journals.uchicago.edu/doi/abs/10.1086/360262 with the FoS ‘semantic web’.

In Table 2, you list the 20 FoS, but it is not clear: are those 20 FoS the seed topics used to extract all papers tagged with them?
And what are the final FoS analysed? Just those 20, or all the FoS appearing in the papers extracted using those 20 seed topics?
Here there is another issue with identifying the year of first usage. The year in which a topic is first used, identified from those vertical datasets, might be later than the actual year of first usage in the whole dataset. This is because you are leaving out some papers that are tagged with that topic and were potentially published before the identified year.
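
The effect is easy to see: the minimum publication year over a topical subset can only be greater than or equal to the minimum over the whole corpus. A toy illustration with hypothetical annotations (not the authors' data):

```python
import pandas as pd

# Hypothetical paper-topic annotations: (PaperId, Year, FoS)
full = pd.DataFrame({
    "PaperId": [1, 2, 3, 4],
    "Year":    [1995, 2003, 2007, 2010],
    "FoS":     ["semantic web"] * 4,
})

# A vertical subset seeded on another topic happens to miss paper 1
subset = full[full["PaperId"] != 1]

print(full.groupby("FoS")["Year"].min())    # 1995: first use in the whole corpus
print(subset.groupby("FoS")["Year"].min())  # 2003: apparent first use in the subset
```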

In section 4.3, you state: “Training size is set to t=9 as the increase”. What does t stand for?
Then, in section 4.4, you state “This results in a total of 380 pairs for FoS used in the experiment”. Should it be 190? The number of possible unique pairs from 20 items is 190; see the formula above.

In Table 5, you report decimal values for TP, FP, and FN. This is very odd: TP should be a natural number, as it counts the instances that are correctly classified as positive. The same applies to FP and FN.

Finally, I would expect to find a dedicated section on the gold standard, explaining in detail how you built it.

In general, I find this work really interesting; however, before recommending it for acceptance, I would like to see an effort from the authors to extensively rewrite the paper.

References:
[1] Salatino, Angelo A., Francesco Osborne, and Enrico Motta. "AUGUR: forecasting the emergence of new research topics." Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries. 2018.
[2] Salatino, Angelo; Osborne, Francesco; Thanapalasingam, Thiviyan and Motta, Enrico (2019). The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly Articles. In: TPDL 2019: 23rd International Conference on Theory and Practice of Digital Libraries, Lecture Notes in Computer Science, Springer, pp. 296–311.

Review #2
Anonymous submitted on 18/May/2021
Suggestion:
Major Revision
Review Comment:

This paper presents a network-based topic emergence identification method and supplies some analysis in a specific scenario on the Microsoft Academic Graph.

The first impression is that the method is novel and potentially interesting, taking advantage of relation evolution in a co-occurrence graph. Here comes a first negative note: the relationships are merely co-occurrence relations, not semantic ones. A second negative note is that the topics seem to come from pre-redacted fields, which would limit the applicability of the method to those collections in which papers have fields that specify topics (or, in any case, an additional NLP-based preliminary step would be required to process the text and retrieve the relevant keywords and topics, which is not always easy to do). However, even with these two problems, which could undermine the applicability of the method, it still seems worth considering for publication.
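
For clarity, the core data structure, as I understand it, is a per-year co-occurrence graph over topic annotations, roughly along the lines of the following sketch (toy data and my own reading, not the authors' code):

```python
from itertools import combinations
import networkx as nx

# Toy paper -> topic annotations per year (illustrative only)
papers = [
    (2018, {"topic modeling", "neural network"}),
    (2019, {"topic modeling", "neural network", "topic evolution"}),
    (2019, {"topic evolution", "citation analysis"}),
]

def topic_network(papers, year):
    """Co-occurrence graph over the topics of papers published up to `year`."""
    g = nx.Graph()
    for y, topics in papers:
        if y > year:
            continue
        for a, b in combinations(sorted(topics), 2):
            if g.has_edge(a, b):
                g[a][b]["weight"] += 1
            else:
                g.add_edge(a, b, weight=1)
    return g

print(topic_network(papers, 2018).edges(data=True))
print(topic_network(papers, 2019).edges(data=True))
```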

Here comes another series of issues with the paper, concerning structure and readability. I had to re-read the paper multiple times because I found the organization particularly lacking, and the information is not presented very well. For instance, it is not clear to me what baseline the authors are comparing with. I suggest the authors create a sub-section explicitly dedicated to presenting the baseline and regroup all that information there.

Some specific remarks:
- The notation used for the topic network definition (formula 1) is not clear. Is V a set or a tuple? Where does the lowercase v come from? Why and when does fy become a function? The general gist of the idea is understood, but the presentation can be improved, at least from a formal point of view (one possible consistent reading is sketched after these remarks).

- Fig. 2: it is not clear whether the orange line means that only the one feature has been excluded to produce the result, or all the features up to the one for which the result is calculated.

- Fig. 3: it is not clear what each point represents: a topic in the considered FoS? Please clarify.
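
On the notation point above, one possible consistent reading, given purely as an illustration and not as the authors' intended definition:

```latex
% One consistent reading of "topic network in year y" (illustrative only):
% V is the set of all topics, f : V -> N maps each topic to its first-use
% year, and the year-y network keeps the topics introduced up to year y.
\[
  G_y = (V_y, E_y), \qquad
  V_y = \{\, v \in V \mid f(v) \le y \,\}, \qquad
  E_y \subseteq \{\, \{u, v\} \mid u, v \in V_y,\ u \ne v \,\}
\]
```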

In the state of the art, a mention could be made of Embedded Topic Models (another paradigm, in which a topic is represented by an embedding).

Finally, I found the evaluation a bit lacking. There is only a comparison against the baseline, but not against any comparable method.

The conclusions about generalizability also seem not completely supported. At the least, it seems that most of the FoS are quite specific, even if they pertain to different domains. It keeps me wondering what would happen if the FoS were a bit more general than those selected for the evaluation.

In conclusion, I find the idea interesting, but the authors need to improve the presentation, because the current version is a bit confusing and makes it difficult to appreciate their work.

Review #3
Anonymous submitted on 23/May/2021
Suggestion:
Minor Revision
Review Comment:

Originality: the paper describes a network-based approach for predicting the emergence of new research topics. The topic is very relevant and compelling. In my opinion, the application of semantic-web technologies in this work is not extensive; however, it fits the special issue's main topic. To this extent, some ML purists would raise concerns that the approach does not use deep NNs, as it mainly relies on logistic regression (see the sketch after this assessment).
Significance of the results: the methodology is described in detail and with competence. The results are interesting, as the authors report quite a high F1-score (above 0.9). A few aspects, however, are missing or should be better clarified (see comments below).
Quality of writing: the paper is well written and understandable.
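
To make the above concrete, the kind of pipeline I understand the paper to use is essentially a logistic-regression binary classifier over per-topic network features; a minimal sketch with synthetic data and made-up feature semantics (my reading, not the authors' implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic per-topic network features (e.g. degree, clustering, neighborhood
# growth) and a binary label: 1 = topic newly introduced in the target year.
X = rng.normal(size=(1000, 15))          # 15 features, as in the paper
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```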

Comments:
- Reproducibility and open science: this aspect seems largely unaddressed by the authors, who did not share, nor promise to share, source code, datasets, and several other implementation details. I invite the authors to share (and to organise well and document) the code and the data, so as to foster openness and reproducibility of the study and the results described here.
- Sadly, MAG will be closing by the end of the year. How tightly is this study coupled with MAG? Could it be easily transferred to another scholarly data source? How would this affect the approach? It would be nice if the authors could spend a few words on this. Please note that many other sources do not model FoS, but at the same time I don't think you leverage the hierarchical structure in any part of your analysis, right?
- The literature review is extensive and well taken care of; however, the authors left out a few references which I reckon should be included. [1] is quite relevant for the technology forecasting section, while [2,3] are central to new research topic emergence.
- Which topics are predicted? Could you give some examples? This is entirely left to the reader's imagination. Compelling examples (e.g., the emergence of a new field, say computational linguistics, from the cross-pollination of two others) would make your analysis stronger and give a better idea of its potential.
- For every FoS you selected, it would be interesting to have some statistics about the extracted datasets; e.g., how many co-occurring topics, how many emerging topics are found, and so on.
- Please rewrite the part where you describe the creation of your datasets, in particular the part “FieldsOfStudy and FieldsOfStudy tables are retrieved for FoS used in the journal and how they are assigned to individual publications.” Which journal? This comes out of the blue.
- In the inter-domain setting, why are the topic pairs k*(k-1) = 380, and not k*(k-1)/2 = 190?
- Typo: PaperFieldsOfStury

[1] Osborne, F., Mannocci, A. and Motta, E., 2017. Forecasting the spreading of technologies in research communities. In Proceedings of the Knowledge Capture Conference (pp. 1-8).
[2] Salatino, A.A., Osborne, F. and Motta, E., 2017. How are topics born? Understanding the research dynamics preceding the emergence of new areas. PeerJ Computer Science, 3, p.e119.
[3] PhD thesis: https://arxiv.org/pdf/1912.08928.pdf