Vecsigrafo: Corpus-based Word-Concept Embeddings - Bridging the Statistic-Symbolic Representational Gap in Natural Language Processing

Tracking #: 1988-3201

Authors: 
José Manuel Gómez-Pérez
Ronald Denaux

Responsible editor: 
Guest Editors Semantic Deep Learning 2018

Submission type: 
Full Paper
Abstract: 
The proliferation of knowledge graphs and recent advances in Artificial Intelligence have raised great expectations related to the combination of symbolic and distributional semantics in cognitive tasks. This is particularly the case for knowledge-based approaches to natural language processing, as near-human symbolic understanding relies on expressive, structured knowledge representations. Engineered by humans, such knowledge graphs are frequently well curated and of high quality, but at the same time can be labor-intensive, brittle or biased. The work reported in this paper aims to address such limitations, bringing together bottom-up, corpus-based knowledge and top-down, structured knowledge graphs by capturing as embeddings in a joint space the semantics of both words and concepts from large document corpora. To evaluate our results, we perform the largest and most comprehensive empirical study around this topic that we are aware of, analyzing and comparing the quality of the resulting embeddings over competing approaches. We include a detailed ablation study on the different strategies and components our approach comprises and show that our method outperforms the previous state of the art according to standard benchmarks.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 17/Sep/2018
Suggestion:
Minor Revision
Review Comment:

The article is definitely in better shape than the initial version, and the authors made a huge effort to address the reviewers' comments.

However, there are still some points that could definitely improve the quality of the paper. The main one is how the authors present and report their findings. As one of the criticisms in my first review, I noted that it is very hard for a reader to get to the point the authors want to make. This is especially true in the evaluation, where the different tasks and sections are not well connected and do not seem to answer specific research questions. The organization of that section is also a bit strange, with the ablation test in the middle of it; it takes a lot of time and re-reading to understand. This is not a major issue, as most of the important points are there, but it is one I would highly encourage the authors to work on (e.g. instead of long chunks of text in every subsection, some bullet points or organized conclusions may help).

In Section 4.3.5 it is interesting that the disambiguation algorithm does not seem relevant to the final results. A similar conclusion was drawn in a very recent EMNLP paper (which came out after this one): http://www.cs.huji.ac.il/~daphna/papers/DubossarskyEtal_EMNLP2018.pdf. I am not suggesting that it be included in the paper, but rather that something more be added to the discussion in Section 5.1.1, as together these two results suggest that word similarity in isolation may not be the most reliable task for testing the performance of these models.

Other comments to improve the article follow:

The formatting of the article is still not consistent: e.g. on pages 4 and 18 the columns have a different format.

At the beginning of Section 2: “To the best of our knowledge, this is the first work that studies jointly learning embeddings for words and concepts from a large disambiguated corpus”. This does not seem to be entirely true, as the SW2V model, which also does that, is presented afterwards. The same applies to the sentence “although no directly related approaches have been proposed...”.

In Section 2.2 it is mentioned that AutoExtend [33] is only for senses, but synsets are also learned in this model.

In general I would avoid the use of “seem” when stating facts that can be verified, for example in Section 2.2 (“this approach only seems to produce embeddings for nouns”), and particularly in the results section (e.g. “seems to improve significantly…”), where it is repeated several times. I am not asking the authors to remove all sentences about claims that have not been fully verified, but in some cases a sentence stating the objective results instead of the generalization would be better suited.

In equation (1), “W” is not defined, the sum should run to #(x_i, x_j), and it is not clear what the final (x_i, x_j) stands for.

In Section 3.3 it is mentioned that “the disambiguation algorithm is proprietary and out of the scope of this paper”. I would add some details about it, as it is clearly important for the construction of the model; not knowing how it works prevents the reader from fully understanding the model (which is partially based on it), even if the conclusion in Section 4.3.5 is that it is not that relevant. Otherwise I would report the results on known disambiguation algorithms only.

Finally, in addition to the formatting issues mentioned above, the use of spaces before references is not consistent, e.g. “Swivel[30]” in Section 2.1 and “NASARI[23]” in Section 2.2 (no spaces). There are other cases.

Review #2
Anonymous submitted on 19/Oct/2018
Suggestion:
Minor Revision
Review Comment:

Vecsigrafo: Corpus-based Word-Concept Embeddings

The paper presents an approach to simultaneously training word-concept (word-sense) embeddings from a semantically annotated corpus. The approach treats the corpus as a sequence of tokens, where each token can be related to different linguistic and knowledge entries such as word forms, lemmas, grammatical features, senses, and concepts.

Each of these entries can then play two roles: (1) being a focus entry for which an embedding is trained, and (2) being a context entry used in training the embedding of a focus entry. Entries of different kinds co-occur at the same position (the same token) in the corpus. Training is performed within a window; thus, an embedding for a lemma is trained with respect to the other lemmas, word forms, and senses in the selected window. The approach also adopts a dynamic context window, weighting co-occurrences by the distance between the focus and the context entries.
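To make this pairing scheme concrete, here is a minimal Python sketch of distance-weighted focus-context counting over a multi-entry token sequence. The token structure, field names, and the word2vec-style 1/d weight are illustrative assumptions, not the authors' implementation:

```python
from collections import defaultdict

# Each token carries several entries (word form, lemma, concept id).
# The entries and identifiers below are invented for illustration.
corpus = [
    {"form": "banks", "lemma": "bank",  "concept": "wn:bank.n.01"},
    {"form": "lend",  "lemma": "lend",  "concept": "wn:lend.v.01"},
    {"form": "money", "lemma": "money", "concept": "wn:money.n.01"},
]

WINDOW = 2  # context window radius, in tokens

def cooccurrence_counts(tokens, window=WINDOW):
    """Accumulate distance-weighted focus-context entry counts.

    Every entry of the focus token is paired with every entry of each
    context token; each co-occurrence is weighted by 1/d, one common
    form of dynamic context window (the paper may use another weight).
    """
    counts = defaultdict(float)
    for i, focus_tok in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if i == j:
                continue  # pairing entries of the same token would need
                          # a distance convention; see the discussion below
            d = abs(i - j)
            for x_f in focus_tok.values():      # focus entries
                for x_c in tokens[j].values():  # context entries
                    counts[(x_f, x_c)] += 1.0 / d
    return counts

print(sorted(cooccurrence_counts(corpus).items())[:5])
```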

After presenting related work and the formal model of embedding training, the paper reports an extensive list of evaluations of the proposed approach. These evaluations show that the trained embeddings perform much better on different tasks than embeddings trained by other approaches.

Thus, I think the paper deserves to be published. The following is a list of some problematic issues:

In Section 3.1, Notation and Preliminaries (page 4, second column):

1. A suggestion: V = T \cup C

2. D is defined as a collection of focus-context entry pairs.

Although this is intuitively clear, I think it needs a better definition. The whole approach is based on one entry for which an entry embedding will be trained and a second entry for which a context embedding will be trained.

First, in order for the following definitions to be more precise, it would be good to define the position of an entry in the sequence of entries. The position first needs to be defined for tokens (the base case) and then inductively for the other entries related to each token.

Then, in the definition of #(x_i, x_j), it would be good to stress explicitly that x_i is in V and x_j is in V.

It would then be useful to define the distance \delta(x_i, x_j), where again x_i is in V and x_j is in V, but they are related to two concrete positions in the sequence of tokens within the same window (or to some other definition of the context of co-occurrence of entries).

Although the distance depends on the definition of the context of co-occurrence, it would be good to select one such definition in the paper.

Compare this to the definitions of distance in Section 3.2, Formal Definition (page 5, second column, equations (1) and (2)). There are three notations: \delta^c_{x_i,x_j}, related to the context c; \delta_{x_i,x_j}, which is left without a formal definition; and \delta'_{x_i,x_j}, which is defined in terms of the previous one in equation (2).

This definition is problematic in the sense that the distance between two entries for the same token is 1, and the distance between an entry for a given token and an entry for the next token is also 1.

In my opinion, if the distance is defined in Section 3.1, it could already contain the correction of adding 1 to all distances induced by the distance between the corresponding tokens, as is done in equation (2).

Also, please note that the notation \delta'_{x_i,x_j} is not used anywhere in the paper except in equation (2).
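For concreteness, one way to write the single definition suggested above, assuming pos(x) denotes the position of the token to which entry x is attached (both pos and this exact formula are my formalization of the suggestion, not notation from the paper), is:

\delta(x_i, x_j) = |pos(x_i) - pos(x_j)| + 1, with x_i, x_j \in V

Under this convention, two entries of the same token are at distance 1, entries of adjacent tokens are at distance 2, and the separate correction of equation (2) becomes unnecessary.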

3. In equation (1) on page 5, there are some unclear parts.

First, the summation runs from c=1 to #(x_i), but #(x) is defined (on page 4) only on the basis of the collection D, which does not contain information about the actual positions and distances within any context of tokens.

I think this needs to be fixed in the definition of D, in such a way that each pair (x_i, x_j), where x_i is in V and x_j is in V, is connected to all token positions in the sequence of tokens.

Also, at the end of the equation there is just (x_i, x_j), which is not defined; probably it should be #(x_i, x_j).
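One consistent reading of equation (1) with these fixes applied — a guess at the intended form, not the equation as printed in the paper — is a distance-weighted co-occurrence count with one term per co-occurrence context:

#_\delta(x_i, x_j) = \sum_{c=1}^{#(x_i,x_j)} 1/\delta'^c_{x_i,x_j}

i.e. the summation bound becomes #(x_i, x_j) rather than #(x_i), and the undefined trailing (x_i, x_j) disappears.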

4. In the definition of the Swivel loss function, I think L_0 and L_1 are exchanged.

As it is stated now (second equation at the beginning of page 6), L_1 is always 0, because #(x_i, x_j) is 0 in that case.

I think the authors' intention is intuitively clear, but the formalization needs to be fixed.
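For reference, in the original Swivel paper [30] the two cases read as follows (adapted here to the notation of this paper; w_i and \tilde{w}_j denote the focus and context embedding vectors):

L_1(x_i, x_j) = 1/2 f(#(x_i,x_j)) (w_i^T \tilde{w}_j - pmi(x_i,x_j))^2, when #(x_i,x_j) > 0
L_0(x_i, x_j) = log[1 + exp(w_i^T \tilde{w}_j - pmi*(x_i,x_j))], when #(x_i,x_j) = 0

where f is a monotonic confidence weight of the observed count and pmi* is the PMI value computed as if the count were 1. That is, L_1 should apply to the observed pairs and L_0 to the unobserved ones.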

5. Some of the evaluations are not described in enough detail or are not entirely correct.

The first case is the comparison presented in Section 4.4.3, which needs some explanation of the results.

The second case is Section 4.3.6, Effect of the Corpus.
The conclusions in this section are based on the sizes of the corpora, but the corpora are not comparable in their topics: the Europarl and UN corpora are quite different from the Wikipedia corpus in terms of style and subject matter. Some of the differences reported in the section could therefore be due to these genre differences.

These two sections could be dropped in the final version, or some explanations could be added.

Review #3
Anonymous submitted on 29/Nov/2018
Suggestion:
Minor Revision
Review Comment:

The revised version of the paper takes into account in a convincing way the points raised in my previous review and contains an extended evaluation and a more self-contained explanation of the obtained results. In my opinion this version is greatly improved with respect to the previous one and presents interesting results. A minor element that still needs to be fixed concerns the formatting of the second column of page 4, which seems different from the other columns.