Studying the Impact of the Full-Network Embedding on Multimodal Pipelines

Tracking #: 1962-3175

Authors: 
Armand Vilalta
Dario Garcia-Gasulla
Ferran Parés
Eduard Ayguade
Jesus Labarta
E Ulises Moya-Sánchez
Ulises Cortés

Responsible editor: 
Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper
Abstract: 
The current state-of-the-art for image annotation and image retrieval tasks is obtained through deep neural network multimodal pipelines, which combine an image representation and a text representation into a shared embedding space. In this paper we evaluate the impact of using the Full-Network embedding (FNE) in this setting, replacing the original image representation in four competitive multimodal embedding generation schemes. Unlike the one-layer image embeddings typically used by most approaches, the Full-Network embedding provides a multi-scale discrete representation of images, which results in richer characterisations. Extensive testing is performed on three different datasets comparing the performance of the studied variants and the impact of the FNE on a levelled playground, i.e., under equality of data used, source CNN models and hyper-parameter tuning. The results obtained indicate that the Full-Network embedding is consistently superior to the one-layer embedding. Furthermore, its impact on performance is superior to the improvement stemming from the other variants studied. These results motivate the integration of the Full-Network embedding on any multimodal embedding generation scheme.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Vered Shwartz submitted on 05/Aug/2018
Suggestion:
Minor Revision
Review Comment:

This manuscript focuses on multimodal image and sentence embeddings. It is an extension of previous work by Vilalta et al. (2017), in which the authors replaced a typical image representation (the last layer of a CNN) with the full-network embeddings (FNE) suggested by Garcia-Gasulla et al. (2017), a representation which offers a richer visual embedding space by deriving features from multiple layers while also applying discretization. These embeddings are evaluated on the parallel tasks of image annotation and retrieval.

The contribution of the current work is in (1) integrating FNE into the original pipeline of Kiros et al. (2014), as well as into two versions of order embeddings, showing consistently improved performance over using a standard image representation across the three datasets on which the methods are evaluated; (2) performing extensive experiments to study the causes of performance gains, when the methods are trained with the best hyper-parameters; and (3) using curriculum learning to increase the stability and performance of these methods.

The evaluation is extensive and fair: the model is compared with the state-of-the-art methods for each task, as well as with the original model by Vendrov et al. (2015) without the FNE component. The proposed extension on its own performs worse than the state of the art, but it surpasses the performance of the same method with a typical image representation. Moreover, the authors study the sources of performance gains in the different methods and control for various factors such as hyper-parameter values, dataset splits and training time.

Detailed Comments:

* Connection to semantic web: Thanks for addressing my concern about the connection to semantic web. The introduction reads a bit vague, and I think it can benefit from a motivating example - e.g. a concept which is vague (unclear how to represent in “semantic web”), how word embeddings make it representable, and how image embeddings improve its representation further.

* Performance: There is absolutely no problem with publishing a paper about a method that achieves less than state-of-the-art performance. I believe the comment (from all of the reviewers) was because the previous version didn’t present the goal of the paper clearly: "testing of the proposed methods, showing clearly the real impact of its main contributions" - the current introduction is clearer about this goal. [no changes required]

* Introduction: the first 3 contributions could be written as one (“Integrating the FNE into x, y, and z”) - the current form is unnecessarily repetitive.

* Table 1: which values were tested for each hyper-parameter? You need not mention all the results with all the different hyper-parameters, but if one of the main goals of the paper is to study the sources of empirical gains, it is imperative to list the values tested (can be placed in an appendix).

* When referring to experiment results I’m always careful not to use “significant” when I didn’t run significance tests. I choose “substantial” instead. Although they are synonymous, the word “significant” often implies achieving a certain p value in a significance test, so it is a bit misleading.

* References: Regarding Jamie Ryan Kiros, this is the name written in her PhD thesis: https://tspace.library.utoronto.ca/handle/1807/89798, and on her Google Scholar profile: https://scholar.google.ca/citations?hl=en&user=b_MXwoAAAAAJ. I believe she changed her name from Ryan to Jamie but keeps both names in citations of papers published under the name Ryan. It’s good that you emailed her and if she answers, do as she tells you, of course.

Review #2
By Luis Espinosa Anke submitted on 06/Sep/2018
Suggestion:
Accept
Review Comment:

I think the resubmitted version does three things well. First, it addresses the somewhat tricky issue of contextualizing the contribution within the scope of semantic web technologies (although this could be improved, see my last point). The introduction has indeed been reworked and the thread seems clearer in what the paper aims to achieve and how its findings and contributions can directly impact semantic web technologies. Second, the authors have made a convincing case for the motivation and narrative behind the results they report. In such a "fast-paced" research field, it is highly commendable to see papers that make the effort to replicate and understand others' work. And third, the paper generally reads better.

- Minors -

"We did not observe a significant impact on performance with this reduction of the text pre-processing"
- The authors might want to reference/get inspiration from this paper for this or future work:
"On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis
Jose Camacho-Collados, Mohammad Taher Pilehvar" - https://arxiv.org/abs/1707.01780
... where different pre-processing choices are explored for neural text classification. While this is not the same task, I wonder whether the performance would really not change with additional preprocessing steps. Also it would be good to know, at least at the highest level, which variant is better (no casing? no stopwords? any intuition why, perhaps related to the very short length of the text you work with?)

"Previous contributions [3, 9, 10] set the word embedding dimensionality to 300"
- In fact, it is (mostly) agreed that the optimal dimensionality of word embeddings (trained for capturing word similarity, as in glove or w2v) is 300 or somewhere between 300 and 400. I would assume this is why the relevant related work cited also uses this size.

- Typically I would use "state-of-the-art" (hyphenated) as an adjective, as in "state-of-the-art results", but not when using it as a phrase, as in "the current state of the art in this task is ...".

- "Depending on the random initialization of the weights, the same model may start training or not" - Do you mean "start learning"?

- I am still missing an explicit reference to how semantic web technologies may benefit from this study in the Conclusion and Future Work sections, although most readers will easily find direct connections given the reworked introduction.

Review #3
By Jindrich Helcl submitted on 09/Sep/2018
Suggestion:
Accept
Review Comment:

I am satisfied with the authors' response.