Large-Scale Semantic Exploration of Scientific Literature using Topic-based Hashing Algorithms

Tracking #: 2239-3452

Carlos Badenes-Olmedo
José Luís Redondo-García
Oscar Corcho

Responsible editor: 
Guest Editors Semantic E-Science 2018

Submission type: 
Full Paper
Abstract:
Searching for similar documents and exploring major themes covered across groups of documents are common actions when browsing collections of scientific papers. This manual, knowledge-intensive task may become less tedious and even lead to unforeseen relevant findings if unsupervised algorithms are applied to help researchers. Most text mining algorithms represent documents in a common feature space that abstracts away from the specific sequence of words used in them. Probabilistic Topic Models reduce that feature space by annotating documents with thematic information. Over this low-dimensional latent space some locality-sensitive hashing algorithms have been proposed to perform document similarity search. However, thematic information is hidden behind hash codes, preventing thematic exploration and limiting the explanatory capability of topics to justify content-based similarities. This paper presents a novel hashing algorithm based on approximate nearest-neighbor techniques that uses hierarchical sets of topics as hash codes. It not only performs efficient similarity searches, but also allows extending those queries with thematic restrictions explaining the similarity score from the most relevant topics. Extensive evaluations on both scientific and industrial text datasets validate the proposed algorithm in terms of accuracy and efficiency.
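The core idea described in the abstract, using sets of topics ranked by relevance as hash codes so that documents sharing their most relevant topics land in the same bucket, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact algorithm: the `topic_hash` and `TopicHashIndex` names, the equal-size grouping of ranked topics into relevance levels, and bucketing on only the top level are all assumptions made for the example.

```python
from collections import defaultdict

def topic_hash(topic_dist, levels=3):
    """Build a hierarchical hash code from a document's topic distribution:
    topics are ranked by weight and split into `levels` relevance groups."""
    ranked = sorted(range(len(topic_dist)), key=lambda t: -topic_dist[t])
    size = max(1, len(ranked) // levels)  # assumption: equal-size levels
    return tuple(
        frozenset(ranked[i * size:(i + 1) * size]) for i in range(levels)
    )

class TopicHashIndex:
    """Bucket documents by the set of their most relevant topics."""
    def __init__(self):
        self.buckets = defaultdict(list)

    def add(self, doc_id, topic_dist):
        # Use the top relevance level as the bucket key.
        key = topic_hash(topic_dist)[0]
        self.buckets[key].append(doc_id)

    def candidates(self, topic_dist):
        # Candidate set for similarity search: documents sharing
        # the query's most relevant topics.
        return self.buckets.get(topic_hash(topic_dist)[0], [])
```

Because the hash key is itself a set of topic identifiers, a query can be restricted or explained in thematic terms (e.g., only buckets containing a given topic), which is the property the abstract contrasts with opaque locality-sensitive hash codes.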

Solicited Reviews:
Review #1
By Anita de Waard submitted on 25/Jun/2019
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

Review #2
Anonymous submitted on 08/Jul/2019
Review Comment:

Compared to the previous version, reproducibility has improved, as more implementation details have been added, including the libraries used for the experiments.

Some new figures show that the precision obtained by the algorithm is indeed robust to the dimensionality (the number of topics). As future work, the authors may try a non-parametric version of LDA, which infers the number of topics automatically.

Although I agree that the proposed method offers some nice properties that other methods do not, additional comparisons between algorithms would show that the proposed method does not sacrifice performance in other respects, such as precision.

In general, the authors have addressed most of the questions I raised in my last review.

Review #3
By Daniel Garijo submitted on 14/Sep/2019
Review Comment:

The authors have thoroughly answered all my comments, and I think the paper should be accepted as part of the journal.

I leave just a few small comments that would be great to address in the final version of the paper:

- I recommend another proofreading pass, because some of the introduced changes have typos. For example, "Trained in corpora with different parameter" --> parameters.
- Try to avoid using "very" so often. For instance, instead of "very difficult" you could use "challenging". There are other adjectives that help you quantify your text :)
- In the second paragraph of the intro, remove "Therefore". I think this work is not a consequence of the problem, but a contribution to address it.
- Footnote 4 should become a release, if possible with a corresponding citation (e.g., Zenodo). I ask this because the dataset and pre-trained materials may change in the future, whereas the ones produced in this paper are concrete.
- Finally, please try to reduce the number of acronyms used in the paper. It's hard to follow, especially when some of them overlap with other acronyms in the state of the art (e.g., ANN also stands for Artificial Neural Networks).