Adversarial Transformer Language Models for Contextual Commonsense Inference

Tracking #: 3086-4300

Pedro Colon-Hernandez
Henry Lieberman
Yida Xin
Claire Yin
Cynthia Breazeal
Peter Chin

Responsible editor: Guest Editors Commonsense 2021

Submission type: Full Paper
Contextualized, or discourse-aware, commonsense inference [1] is the task of generating commonsense assertions (i.e., facts) from a given story and a sentence from that story. (Here, we think of a story as a sequence of causally-related events and descriptions of situations.) This task is hard, even for modern contextual language models. Some problems with the task are: lack of controllability for topics of the inferred assertions; lack of commonsense knowledge during pre-training; and, possibly, hallucinated or false assertions. The task’s goals are to ensure that the generated assertions are (1) plausible as commonsense and (2) appropriate to the particular context of the story. We utilize a transformer model as a base inference engine to infer commonsense assertions from a sentence within the context of a story. With our inference engine we address lack of controllability, lack of sufficient commonsense knowledge, and plausibility of assertions through three techniques. We control the inference by introducing a new technique we call “hinting”. Hinting is a kind of language model prompting [2] that utilizes both hard prompts (specific words) and soft prompts (virtual learnable templates). This serves as a control signal to advise the language model “what to talk about”. Next, we establish a methodology for performing joint inference with multiple commonsense knowledge bases. While in logic, joint inference is just a matter of a conjunction of assertions, joint inference of commonsense requires more care, because it is imprecise and the level of generality is more flexible. You want to be sure that the results “still make sense” for the context. To this end, we align the assertions in three knowledge graphs (ConceptNet [3], ATOMIC2020 [4], and GLUCOSE [5]) with a story and a target sentence, and replace their symbolic assertions with textual versions of them.
This combination allows us to train a single model to perform joint inference with multiple knowledge graphs. We show experimental results for the three knowledge graphs on joint inference. Our final contribution is a GAN architecture that generates the contextualized commonsense inference from stories and scores the generated assertions as to their plausibility through a discriminator. The result is an integrated system for contextual commonsense inference in stories that can controllably generate plausible commonsense assertions and takes advantage of joint inference between multiple commonsense knowledge bases.

Reject (Two Strikes)

Solicited Reviews:
Review #1
Anonymous submitted on 17/Mar/2022
Minor Revision
Review Comment:

This paper proposes a complex framework for solving the contextual commonsense inference problem. Specifically, it contains three major components: (1) hinting; (2) joint inference from multiple knowledge graphs; (3) adversarial training. Compared with the previous version, the writing of this paper has been significantly improved. However, I still think the technical contribution of this paper is limited.

Even though this paper proposes three modules to solve the commonsense inference problem, most of them are borrowed from other tasks. I appreciate the effort of conducting extensive experiments; however, the performance gain is not very significant based on the reported experimental results.

All my previous comments regarding the paper writing have been addressed.

Review #2
Anonymous submitted on 09/Apr/2022
Major Revision
Review Comment:

Overall, this paper is clearly written. This paper is about contextualized commonsense reasoning, and it has three main contributions: (1) a hinting method to control the conditional generation; (2) a method for aligning three different knowledge bases to stories for contextualized generation; (3) an adversarial training method for generating and discriminating the commonsense assertions.
Below are my major concerns about the proposed method.
Regarding the hinting method, Table 1 makes it clear that hinting can steer the model towards generating different outputs, which is nice. However, it looks like you’re providing parts of the information from the target assertions at both training and test time. If so, the task just becomes much easier to learn, and it’s not surprising to see improvements in automatic evaluations. Also, this method can hardly generalize to settings where no information about the targets is available. It would be good to propose a method that does not rely on ground truth targets to derive hints, e.g. some heuristics, so that we can understand the true generalization ability of the hinting method.
From human evaluations, outputs generated with the hinting method are not preferred by humans, so one way to interpret this is that there is a trade-off between quality and diversity. If the ratings from annotators are not reliable (page 10, line 21), then a larger scale annotation study may help alleviate the issue.
I’m not sure if the results from Table 6 and Table 8 are comparable. For the adversarial training experiments, it would be good to show the gains from adding either adversarial training or confounder loss. However, the numbers from Table 6 and Table 8 are drastically different, so it’s hard to understand the benefit of using adversarial training.
Minor point: in Section 3.3, it’s better to keep the order of columns for hint/no-hint consistent across tables.

Review #3
By Julien Romero submitted on 11/Apr/2022
Review Comment:

This paper is a resubmission that I already reviewed previously. The authors improved the overall presentation of the article. However, the fundamental elements did not change, and most of the paper is identical to the previous submission. I recall that the authors got one reject and two major revisions, which means there were substantial issues with the article. It is a bit disrespectful to resubmit and ignore the previous reviews. It takes time to read a paper and write a review, so please consider them.

Some additional points:
Putting the code on GitHub does not make it clean. Currently, it is just a big mess.
I think there is confusion about what prompting is. What the authors do is closer to COMET than prompting (hard or soft) (i.e., it is more finetuning than prompting).
The authors should check what exactly COMET inputs are and how they differ from PARACOMET. Maybe also compare with COMET-2020.
I gave pointers to previous works in my last review. It was not for the authors to just include them to please me, but to really bring additional value!
The paper still lacks relevant comparisons with other works. The metrics are also different, and the results vary a lot on the baselines compared to previous works.
It is still very unclear how hinting works. Is it also used during testing? Here it seems to be the case, but this should not be the case! In practice, we do not have hints.