Review Comment:
Vecsigrafo: Corpus-based Word-Concept Embeddings
The paper presents an approach to simultaneously training word-concept (word-sense) embeddings from a semantically annotated corpus. The approach treats the corpus as a sequence of tokens, each of which can be related to different linguistic and knowledge entries such as word forms, lemmas, grammatical features, senses, and concepts.
Each of these entries can then play two roles: (1) being a focus entry for which an embedding is trained, and (2) being a context entry for the training of the embedding for a focus entry. Entries of different kinds co-occur at the same position (the same token) in the corpus. The training is performed within a window. Thus, an embedding for a lemma is trained with respect to the other lemmas, word forms, and senses in the selected window. The approach also adopts a dynamic context window by weighting co-occurrences by the distance between the focus and the context entries.
After presenting related work and the formal model of embedding training, the paper gives an extensive list of evaluations of the proposed approach. These evaluations show that the trained embeddings are much better with respect to different tasks than the embeddings trained by other approaches.
Thus, I think the paper deserves to be published. The following is a list of some problematic issues:
In section 3.1. Notation and Preliminaries (page 4, second column)
1. A suggestion: V = T \cup C
2. D is defined as a collection of focus-context entry pairs.
Although this is intuitively clear, I think it needs a better definition. The whole approach is based on one entry for which an entry embedding will be trained and a second entry for which a context embedding will be trained.
First, to make the following definitions more precise, it would be good to define the position of an entry in the sequence of entries. The position first needs to be defined for tokens (base case) and then inductively for the other entries related to each token.
Then, in the definition of #(x_i,x_j), it would be good to state explicitly that x_i is in V and x_j is in V.
It would also be useful to define here the distance \delta(x_i,x_j), where again x_i is in V and x_j is in V, but they are related to two concrete positions in the sequence of tokens in the same window (or in whatever other way the context of co-occurrence of entries is defined).
Although the distance depends on the definition of the context of co-occurrence, it would be good to select one such definition in the paper.
Compare this to the definitions of distance in Section 3.2 Formal Definition on page 5, second column, equations (1) and (2). There are three notations: \delta^c_{x_i,x_j}, related to the context c; \delta_(x_i,x_j), which is left without a formal definition; and \delta'_(x_i,x_j), which is defined in terms of it in equation (2).
This definition is problematic in the sense that the distance between two entries for the same token is 1 and the distance between an entry for a given token and an entry for the next token is also 1.
In my opinion, if distance is defined in section 3.1, the definition could already include the correction made in equation (2), namely adding 1 to all distances induced by the distance between the corresponding tokens.
Also, please note that the notation \delta'_(x_i,x_j) is not used anywhere in the paper except in eq. (2).
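The suggestion in item 2 can be sketched as follows. This is only an illustration: the `Entry` class, its `position` field, and the `distance` function are hypothetical names, and the definition |p - q| + 1 is one possible reading of the proposed correction, not the paper's own formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """An entry (word form, lemma, concept, ...) attached to one token.

    Hypothetical structure: the position of an entry is inherited
    from the token it is related to (the inductive step in item 2).
    """
    value: str
    position: int  # index of the token in the token sequence


def distance(x_i: Entry, x_j: Entry) -> int:
    """Token-position distance shifted by 1 (assumed definition).

    Entries on the same token get distance 1; entries on adjacent
    tokens get distance 2, so the two cases no longer coincide.
    """
    return abs(x_i.position - x_j.position) + 1
```

With this definition, a lemma and a concept attached to the same token are at distance 1, while a lemma and an entry of the next token are at distance 2.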
3. In equation (1) on page 5, some parts are unclear.
First, the summation runs from c=1 to #(x_i), but #(x) is defined (on page 4) only on the basis of the collection D, which does not contain information about the actual positions and distances within any context of tokens.
I think that this needs to be fixed in the definition of D in such a way that each pair (x_i,x_j), where x_i is in V and x_j is in V, is connected to all token positions at which it occurs in the sequence of tokens.
Also at the end of the equation there is just (x_i,x_j) which is not defined. Probably it is #(x_i,x_j).
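One way to make D carry the positional information that equation (1) needs is sketched below. Everything here is an assumption made for illustration: the function names, the representation of the corpus as a list of per-token entry lists, the symmetric window, and the harmonic weight 1/distance with distance = |p - q| + 1. The paper may define the weighting differently.

```python
from collections import defaultdict


def cooccurrences(tokens, window=2):
    """Yield (focus, context, focus_pos, context_pos) tuples.

    `tokens` is a list in which each element is the list of entries
    attached to one token, so every pair keeps its token positions.
    """
    for p, focus_entries in enumerate(tokens):
        lo = max(0, p - window)
        hi = min(len(tokens), p + window + 1)
        for q in range(lo, hi):
            for x_i in focus_entries:
                for x_j in tokens[q]:
                    if p == q and x_i == x_j:
                        continue  # do not pair an entry with itself
                    yield x_i, x_j, p, q


def weighted_counts(tokens, window=2):
    """Sum the assumed weight 1 / (|p - q| + 1) for each (focus, context) pair."""
    counts = defaultdict(float)
    for x_i, x_j, p, q in cooccurrences(tokens, window):
        counts[(x_i, x_j)] += 1.0 / (abs(p - q) + 1)
    return dict(counts)
```

Because each element of the collection records both positions, a sum over the contexts of x_i and the distance inside each context are both well defined.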
4. In the definition of the Swivel loss function, I think L_0 and L_1 are swapped.
As stated now (second equation at the beginning of page 6), L_1 is always 0, because #(x_i,x_j) is 0.
I think that intuitively it is clear what is the intention of the authors, but the formalization needs to be fixed.
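A minimal sketch of the intended piecewise loss, assuming the usual Swivel formulation in which L_1 (squared error against the PMI) applies to observed pairs and L_0 (a soft hinge against a smoothed PMI) applies to unobserved pairs. The function name, the argument names, and the default confidence function are hypothetical and should be checked against the original Swivel paper.

```python
import math


def swivel_loss(dot, count_ij, count_i, count_j, total, confidence=math.sqrt):
    """Piecewise loss for a single (focus, context) pair.

    dot        : dot product of the focus and context embeddings
    count_ij   : co-occurrence count #(x_i, x_j)
    count_i    : marginal count #(x_i)
    count_j    : marginal count #(x_j)
    total      : |D|, the total co-occurrence mass
    confidence : stand-in for the confidence function f (an assumption)
    """
    log_marginals = math.log(total) - math.log(count_i) - math.log(count_j)
    if count_ij > 0:
        # L_1: observed pair, squared error against the actual PMI
        pmi = math.log(count_ij) + log_marginals
        return 0.5 * confidence(count_ij) * (dot - pmi) ** 2
    # L_0: unobserved pair, PMI smoothed as if the pair had occurred once
    pmi_star = log_marginals
    return math.log1p(math.exp(dot - pmi_star))
```

With the cases assigned this way, L_1 is evaluated only when #(x_i,x_j) > 0, so it is not trivially zero.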
5. Some of the evaluations are not described in enough detail or are not entirely sound.
An example of the first case is the comparison presented in section 4.4.3, whose results need some explanation.
An example of the second case is section 4.3.6 Effect of the Corpus.
The conclusions in this section are based on the size of the corpora, but the corpora are not comparable in topic. The Europarl and UN corpora are quite different from the Wikipedia corpus in terms of style and topics. Thus, some of the differences reported in the section could be due to these genre differences.
These two sections could be dropped in the final version, or some explanations could be added.
