Assessing deep learning for query expansion in domain-specific Arabic information retrieval

Tracking #: 2026-3239

This paper is currently under review
Wiem Lahbib
Ibrahim Bounhas
Yahya Slimani

Responsible editor: 
Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper
Abstract. In information retrieval (IR), user queries are generally imprecise and incomplete, which is especially challenging for complex languages such as Arabic. IR systems are limited by the term mismatch phenomenon, since they rely on models that compute relevance scores from exact matches between documents and queries. In this article, we propose to integrate domain terminologies into the query expansion (QE) process in order to improve Arabic IR results. To this end, we investigate two categories of corpus representation models, namely (i) word embeddings and (ii) graph-based representations. In the first category, we compare Latent Semantic Analysis (LSA) with neural, deep learning-based models (i.e., Skip-gram, CBOW, and GloVe). In the second, we build a co-occurrence-based probabilistic graph and compute similarities with BM25. To evaluate our approaches, we conduct multiple experimental scenarios. All experiments are performed on a test collection called Kunuz, whose documents are organized into several domains. This allows us to assess the impact of domain knowledge on QE. According to multiple state-of-the-art evaluation metrics, the results show that incorporating domain terminologies into the QE process outperforms the same process without terminologies. The results also show that deep learning-based QE enhances recall.
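As an illustration only, the idea of restricting embedding-based query expansion to domain terminology can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (cosine, expand_query), the top_k parameter, the three-dimensional toy vectors, and the English placeholder words standing in for Arabic terms are all hypothetical. In practice the vectors would come from a trained model such as Skip-gram, CBOW, or GloVe.

```python
import math

# Toy embeddings (made-up values, NOT taken from Kunuz or any trained model).
# English placeholder words stand in for Arabic terms.
TOY_EMBEDDINGS = {
    "prayer":   [0.9, 0.1, 0.0],
    "ablution": [0.8, 0.2, 0.1],
    "mosque":   [0.7, 0.3, 0.0],
    "market":   [0.1, 0.9, 0.2],
    "fasting":  [0.6, 0.1, 0.5],
}

# Hypothetical domain terminology: only these terms may be added to a query.
TOY_DOMAIN_TERMS = {"prayer", "ablution", "fasting"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_query(query_terms, embeddings, domain_terms=None, top_k=2):
    """Append to the query the top_k nearest neighbours of each query term.

    If domain_terms is given, candidates outside that set are discarded,
    which mimics terminology-constrained expansion. With domain_terms=None
    the expansion is unconstrained.
    """
    expanded = list(query_terms)
    for term in query_terms:
        if term not in embeddings:
            continue  # out-of-vocabulary terms are left unexpanded
        candidates = []
        for cand, vec in embeddings.items():
            if cand in expanded:
                continue  # skip original terms and already-added expansions
            if domain_terms is not None and cand not in domain_terms:
                continue  # keep only domain terminology
            candidates.append((cosine(embeddings[term], vec), cand))
        # Highest similarity first; ties broken alphabetically for determinism.
        candidates.sort(key=lambda pair: (-pair[0], pair[1]))
        expanded.extend(cand for _, cand in candidates[:top_k])
    return expanded


if __name__ == "__main__":
    print(expand_query(["prayer"], TOY_EMBEDDINGS, TOY_DOMAIN_TERMS))
    print(expand_query(["prayer"], TOY_EMBEDDINGS, None))
```

Calling expand_query with and without the domain_terms argument on the same query gives a rough feel for the comparison described in the abstract: the constrained call can only add terms from the terminology, whereas the unconstrained call may add any nearby word in the embedding space.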
Full PDF Version: 
Under Review