Analyzing the generalizability of the network-based topic emergence identification method

Tracking #: 2862-4076

Sukhwan Jung
Aviv Segev

Responsible editor: Guest Editors DeepL4KGs 2021

Submission type: Full Paper
The field of topic evolution helps the understanding of the current research topics and their histories by automatically modeling and detecting the set of shared research fields in the academic papers as topics. This paper provides a generalized analysis of the topic evolution method for predicting the emergence of new topics, which can operate on any dataset where the topics are defined as the relationships of its neighborhoods in the past by extrapolating to the future topics. Twenty sample topic networks were built with various fields-of-study keywords as seeds, covering domains such as business, materials, diseases, and computer science from the Microsoft Academic Graph dataset. The binary classifier was trained for each topic network using 15 structural features of emerging and existing topics and consistently resulted in accuracy and F1 over 0.91 for all twenty datasets over the periods of 2000 to 2019. Feature selection showed that the models retained most of the performance using only one-third of the features used. Incremental learning was tested within the same topic over time and between different topics, which resulted in slight performance improvements in both cases. This indicates there is an underlying pattern to the neighbors of new topics common to research domains, likely beyond the sample topics used in the experiment. The result showed the network-based new topic prediction can be applied to various research domains with different research patterns.

Minor Revision

Solicited Reviews:
Review #1
By Angelo Salatino submitted on 31/Aug/2021
Minor Revision
Review Comment:

Review (2nd round)

I really appreciate the effort of the authors in addressing reviewers’ comments. I read both the rebuttal letter and the manuscript, and I must admit that the latter is in much better shape compared to the initially submitted version.

However, I do feel this paper hasn’t addressed all comments and therefore must go through another revision phase before considering it for acceptance.

In particular, there are two points that I would like to stress and make sure the authors have addressed for the next iteration.

The first point concerns the literature review. In my previous review, I pointed out a very similar approach (Salatino et al. 2018). Regardless of the underlying technology (clique-based vs machine learning), the idea/objective is very similar: predicting emerging topics. And in both, this is done by looking at networks of topics. I do appreciate the effort of the authors in including this piece of literature; however, I do not agree with how it has been described and compared. Indeed, the authors state: “The proposed method is not tied to a specific algorithm such as ACPM, which have a scalability issue with dense topic networks”, which is false in two ways. First, ACPM is a clustering algorithm that extends the famous Clique Percolation Method (Palla 2005). In principle, any other algorithm could work on those networks, but ACPM so far outperforms them. Second, ACPM was developed precisely to tackle scalability issues, so the statement above is not correct.
Then, the authors continue with “The patterns are found within the data itself using machine learning models.”, whose utility in this context I am not sure of. In Salatino et al., too, the patterns are found within the data. Where else would one find the patterns?

The second point was raised in the previous review iteration, but I am not yet satisfied with how it has been addressed. Specifically, I am talking about the drawbacks of using FOS in Section 4. It is clear to both me and the authors that the classification of MAG papers with FOS can be erroneous (as it is done by an algorithm) but also that it is retroactive, which makes a topic appear (emerge) earlier than it is supposed to. Since this is a major factor for this paper, I encourage the authors to write a “limitations” section at the end of the paper highlighting the current limitations of this approach. In this way, we can also make future readers aware of the FOS limitation.
Also in section 4, the authors added the statement: “…Using a set of pre-defined topics also keeps the proposed method from getting an undesired performance boost from the semantic detection methods shadowing the performance of the network-based topic evolution approach. High classification accuracy with the pre-defined topics would indicate even when non-goal oriented and non-domain specialized topic sets were used.”
I feel this sentence is a big stretch. The authors claim a “performance boost”, but does it actually exist? Do they have proof of it? If not, please delete it.

In general, I believe the authors have made a great effort to welcome reviewers’ comments. However, before recommending it for acceptance it is important the authors perform another iteration over the paper.

Review #2
Anonymous submitted on 09/Sep/2021
Minor Revision
Review Comment:

Having read the rebuttal from the authors and the new paper iteration, I have to say that the paper improved its shape, but sadly it is not yet there for acceptance as-is.
I think some points raised by the other reviewers, with which I could not agree more, are not fully addressed yet (e.g., methodology, evaluation baseline, dataset construction), and I am sure the other colleagues will elaborate further in this regard.
In my opinion, the major flaw of the paper boils down to the clarity of the presentation, which I find convoluted and hard to parse most of the time.
Mainly, the reader needs multiple goes to get the gist of the whole thing, which is somehow undesirable; let alone delving into full-on implementation details, which requires overcoming the barrier of reading between the lines and connecting dots.

- A running ambiguity between "(research) field" and "topic" can be found throughout the paper. This emerges from many statements such as "related topics representing the set of shared themes, or research fields." or "Researchers understand the topics by first reviewing a multitude of articles, internalising the evolution occurring within the researchers' fields of interest", or "the field of polysaccharide" (this being one of the 20 selected FoS). Is a field just a topic at a "higher level" in the taxonomy? Isn't the hierarchical topic/subtopic organisation enough? If not, what makes a topic a topic, and what makes a field a field and not a topic? I.e., is chemistry a topic, a field, or a domain? Why so? You also use "knowledge domain", "concept", and "theme", leaving it to the reader's imagination what they do and do not stand for. For the record, you explicitly define FoS (the ones you select, and the other ones contained in the topic networks) and knowledge domains "such as business, chemistry, law, and medicine.". Any time you use such terms differently, you increase ambiguity, which is bad. I would suggest reducing jargon variability if these terms are all used interchangeably, or instead explaining the differences, if any, w.r.t. your application.
- Similarly, in 3.3, when you describe inter-domain and intra-domain, you refer to "pairs of domains". I assume you mean domain = FoS; is this the case? Please clarify.
- Also, "The field of topic evolution…" could work better as just "(Research) topic evolution…".
- In the introduction, you write, "The topic networks are first extracted from an open bibliographical dataset, with each network representing publications in a specific research journal with a focused set of research interests." You mention a journal (or a set of journals?), but it is unclear where these come into play. Later on, you write that you select 20 FoS seeds to generate the evolving topic networks. No journal seems to take part in this process. Please drop it if this is the case.
- Still on this note, in section 4.1, you describe the MAG snapshot, then move on to describing the FoS selection process; then at the end, after having presented Table 2 (which should come last, IMHO), you go back to the MAG tables and their structure. Isn't this unnecessarily convoluted? Shouldn't you mention this earlier to support the FoS selection process in the first place? Also, you mention "filtered papers", which come out of the blue. How do you filter papers? Why? Is this by any chance where you use journal(s) (see point above) or something else? In the rebuttal, at some point you mention "journal was supposed to refer to the SWJ", which however is not explicit anywhere in the paper. Also, in that case, shouldn't the FoS all be relevant to the Semantic Web then?
- Also, "from an open bibliographical dataset" -> I think you can say up front that you are using MAG. No mystery reveal is required. Finally, at the bottom of the same column, after having hinted at the binary classification process, you go back to topic network extraction, reiterating that they have been extracted from MAG. Isn't this an unneeded repetition? Couldn't this be moved up and integrated with the previous point? Please, improve clarity.
- What happens if a new topic appears in a year Y and promptly disappears in Y+1? Is the stability/persistence of a new topic addressed at all? Would this still count as a new topic in your analysis? In my mind, an emerging topic is something never seen before that's meant to stay; e.g., computational genomics.
- It is still unclear how you arrived at the prediction examples you provided in the results section (e.g. "with possible invisibility using its photoluminescence properties"). Can your system make predictions such as "in year Y a new topic T was indeed flagged at the intersection of topic A and topic B"?
- Maybe a section describing the limitations of your approach could be added to the closing remarks of the paper.
- State of the Art. Better than before, but I noticed that [1] is mentioned in the introduction, but I didn't get why it is not mentioned in the "technology forecasting" paragraph in the related work.
- The text description citing table 2 lacks the newly added columns.
- "to understand the topic in each document" -> I would say topics as one paper generally deals with more than one topic.
- "is then tried" -> better tested
- "hierarchical concepts are then tagged to the papers" -> I would rather say that papers are tagged with concepts, not the other way round.
- "tagged FoS" what do you mean? FoS used in tagging papers?
- Sentences throughout the paper sometimes feel overloaded with the article "the"; e.g., "Researchers understand the topics…". Please revise accordingly.

Review #3
Anonymous submitted on 03/Oct/2021
Review Comment:

The paper has significantly improved from its first version. I appreciate the efforts made by the authors to address some issues that hampered the readability and understanding of the work.

I recommend acceptance in current form.