A Theoretically-Grounded Benchmark for the Evaluation of Machine Common Sense

Tracking #: 3200-4414

Henrique Santos
Ke Shen
Alice M. Mulvehill
Yasaman Razeghi
Deborah L McGuinness
Mayank Kejriwal

Responsible editor: 
Guest Editors NeSy 2022

Submission type: 
Dataset Description
Abstract:
Achieving machine common sense has been a longstanding problem within Artificial Intelligence. Thus far, benchmarks that are grounded in a theory of common sense, and can be used to conduct rigorous, semantic evaluations of commonsense reasoning (CSR) systems, have been lacking. One expectation of the AI community is that neuro-symbolic reasoners can help bridge this gap towards more dependable systems with common sense. We propose a novel benchmark, called Theoretically-Grounded Commonsense Reasoning (TG-CSR), modeled as a set of question-answering instances, with each instance grounded in a semantic category of common sense, such as space, time, and emotions. The benchmark is few-shot, i.e., only a few training and validation examples are provided in the public release to preempt overfitting problems. Current evaluations suggest that TG-CSR is challenging even for state-of-the-art statistical models. Due to its semantic rigor, this benchmark can be used to evaluate the commonsense reasoning capabilities of neuro-symbolic systems.

Solicited Reviews:
Review #1
Anonymous submitted on 26/Aug/2022
Review Comment:

In this paper, the authors address the lack of a theoretically-grounded dataset for evaluating machine common sense (MCS). In fact, despite several datasets being available in the state of the art (e.g., PIQA, SocialIQA, CommonsenseQA), mostly in the area of discriminative Q/A, it is often unclear which principles (if any) were followed in their construction. This issue also has repercussions for evaluating language models against such datasets: in the area of MCS, the sole measure of accuracy against ground truth does not indicate whether a model is able to address specific dimensions of commonsense (spatial, temporal, causal, qualitative, etc.; see [1]), especially if these are not explicitly associated with question and answer pairs.
This leads me to highlight the first problem in this paper: TG-CSR is considered to be better than previous datasets because it is grounded in the Gordon-Hobbs theory and because, as initial results show, neural models are very far from performing like humans. The latter was true of previous datasets at the time of their release. The former is, at least, questionable. A compelling investigation, which I think is missing in general and not specifically in this work, is a study where humans annotate the most relevant MCS Q/A datasets with elements of the Gordon-Hobbs theory, and the results of this (admittedly challenging) task are used to rank them in terms of semantic relevance. Without such an evaluation, I feel that stating that one dataset is better grounded than another may lack sufficient evidence. Given the recent proliferation of tasks and benchmarks in the MCS domain, I believe that before creating new datasets we should try to better characterize the existing ones.
The major issue I see with this paper, though, is that it is out of scope for the Semantic Web Journal: there is no explicit use of semantic standards. The authors mention some ontologies/schemas that can be used to model the Gordon-Hobbs categories (Section 3.1) but, as they acknowledge on p. 6, lines 30-32, all benchmark data and metadata were constructed in spreadsheets and later exported to JSON to facilitate the process of training language models.

[1] Ilievski, F., Oltramari, A., Ma, K., Zhang, B., McGuinness, D.L. and Szekely, P., 2021. Dimensions of commonsense knowledge. Knowledge-Based Systems, 229, p.107347.

Review #2
Anonymous submitted on 06/Sep/2022
Major Revision
Review Comment:

Text summary:

The article introduces a new dataset testing commonsense knowledge. Following current standard practice, it uses a question-answer format. What is different from existing resources is that this dataset is based on an ontology of commonsense knowledge from a book by Gordon and Hobbs. Each question and answer in the new dataset is categorized with respect to a type of commonsense knowledge, with the eventual aim of coverage that is as comprehensive as possible. Dataset construction proceeded as follows: Given a topical domain, such as "bad weather", a first group of annotators designed a story context, and questions and answers matching some commonsense category, with at least one correct answer per question. A separate set of annotators labeled each answer with a rating from 1-4 (where 1=bad fit, 4=very good fit). The aim of the dataset is to enable few-shot prediction, especially from neural-symbolic systems. The article reports on inter-annotator agreement on answer ratings, and on a benchmark system based on a large language model.

Overall assessment:
This seems a useful dataset; in particular, the ability to evaluate by type of commonsense knowledge is appealing. However, the article has several larger issues. The discussion of related work is not a fair comparison. The main feature of the dataset, as touted in the introduction, is not measured in the benchmark evaluation. And there is no discussion of how naturalistic the hand-created data is, even though non-naturalistic formulations could in principle throw off the performance of large language models such as the one used in the benchmark evaluation.

Main comments on the data construction

The choice of commonsense categories for the dataset makes a lot of sense. It is great that the authors did a prior annotation experiment to identify categories with good inter-annotator agreement, and it is good that they included emotion as one of the categories.

One issue I worry about, which is not discussed at all in the paper, is how natural the text in the dataset (contexts, questions, answers) sounds, especially given that annotation guidelines narrowly controlled what it could look like. The authors argue that because they target a few-shot setting, they do not need to worry about annotation artifacts, and that may be true. But if the texts do not sound natural, that may still affect results, especially for systems based on language models, such as the one used for the benchmark. How do the authors control for this possible issue?

It is good that ratings on the answers were collected on a graded scale, allowing annotators to express nuance. But it is odd that the authors do nothing at all with the original, nuanced annotation. For all purposes, both inter-annotator agreement and system prediction, human ratings are collapsed into 1-2 for "no" and 3-4 for "yes". Would it be possible to do an analysis of the graded ratings, for example to test whether some commonsense categories received more middling ratings than others?
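The analysis suggested here could be sketched as follows. This is only an illustration of the suggestion, not the authors' code; the category names and ratings are hypothetical, and "middling" is taken to mean the middle of the 1-4 scale (ratings 2 or 3):

```python
from collections import defaultdict

def middling_rate(ratings):
    """ratings: list of (category, rating) pairs on a 1-4 scale.
    Returns, per category, the fraction of ratings in the middle of
    the scale (2 or 3), i.e. where annotators were least decisive."""
    counts = defaultdict(lambda: [0, 0])  # category -> [middling, total]
    for category, r in ratings:
        counts[category][0] += r in (2, 3)  # bool adds as 0/1
        counts[category][1] += 1
    return {c: mid / total for c, (mid, total) in counts.items()}

# hypothetical ratings for two Gordon-Hobbs categories
ratings = [("space", 4), ("space", 1), ("space", 4), ("space", 3),
           ("emotions", 2), ("emotions", 3), ("emotions", 4), ("emotions", 2)]
```

A category with a much higher middling rate than the others would indicate that annotators found its answers genuinely harder to judge, which the binary collapse hides.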

Main comments on the evaluation:

The evaluation of inter-annotator agreement is quite odd: "we use the labels of each annotator in turn as the ground-truth, and take the mode of the remaining k − 1 annotators as the human prediction." It should always be that a single annotator is the human prediction and an aggregate of all the others is the ground truth, not the other way round; this is the standard way to do it.
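For a binary label and a symmetric agreement metric such as accuracy, the two directions happen to give the same number, which may be why the paper's phrasing went unnoticed; a minimal sketch of the leave-one-out protocol (with hypothetical yes/no labels, not the paper's data) is:

```python
from collections import Counter

def mode(labels):
    """Most common label among a collection of annotator labels."""
    return Counter(labels).most_common(1)[0][0]

def leave_one_out_agreement(annotations):
    """annotations: list of per-item label tuples, one label per annotator.
    Each single annotator in turn plays the 'human prediction'; the mode
    of the remaining k-1 annotators is the ground truth. Returns the
    average per-annotator accuracy."""
    k = len(annotations[0])
    accuracies = []
    for a in range(k):
        correct = 0
        for labels in annotations:
            rest = labels[:a] + labels[a + 1:]
            if labels[a] == mode(rest):
                correct += 1
        accuracies.append(correct / len(annotations))
    return sum(accuracies) / k

# five annotators, four items (hypothetical)
data = [("y", "y", "y", "y", "y"),
        ("y", "y", "y", "y", "n"),
        ("n", "n", "n", "n", "n"),
        ("y", "n", "n", "n", "n")]
```

Note that for asymmetric metrics such as precision, recall, or F1 (which the paper also reports), swapping prediction and ground truth does change the numbers, so the direction matters there.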

Please show precision and recall too, not just F1. You argue that the T0 system has very low recall but you never actually show the numbers.

The introduction placed great emphasis on the fact that this dataset, but not other existing ones, could be used to test commonsense knowledge of a system by separate commonsense categories to achieve greater insight -- but for the benchmark T0 system you do not mention its results by category. You need to either remove the claim of evaluation by category, or show the detailed results for T0.

T0 small versus large: The text says that the largest model with the most pretraining data achieves the best performance -- but you aren't showing any numbers for the smaller T0 system with more pretraining data. Can you show those numbers too?

Main comments on the argument made in the introduction:

The article references existing datasets with commonsense question-answer pairs, but dismisses them as "ad-hoc" and overall makes it sound as if those datasets were of lesser quality. This is not a fair argument. There is a reason that these commonsense datasets look the way they do: they are aimed at fine-tuning black-box neural systems, and focus on naturalistic data.

There is an argument to be made for the framework chosen for the current dataset, and it can be made while being fair to existing datasets. It would go something like this: The categorization of commonsense knowledge into types, for a more in-depth analysis, could generate useful insights, in a similar way as the categorization of Natural Language Inference datapoints for linguistic phenomena was very useful for testing systems' strengths and weaknesses. A downside of the highly controlled setup for the dataset presented in this article is that there are likely to be artifacts, so the dataset is not set up to provide more than a few points of training data. This is a downside because it is known that fine-tuning is helpful for neural systems, but given the advantage of having categorization of questions and answers, this is a downside that the authors are willing to accept. Also, they hope that their dataset will be mainly useful for neural-symbolic approaches, which may not need fine-tuning on the end task (right?).

But this is not an argument that the current paper makes.

Detail comments:

[about the few-shot setup] "this forces the model to rely on more fundamental, neuro-symbolic aspects of common sense, including commonsense theory and semantics,"
Does it? You use a large language model for benchmarking, and do not show any use of a neural-symbolic system, even though you repeatedly mention this as the main application area.

The paper says that the Gordon-Hobbs theory provides a "formal semantics for the benchmark". What does that mean? It obviously does not mean that there is a model-theoretic semantics, or proof-theoretic semantics, provided for all questions and answers. I think it means that questions and answers are linked to a classification through an ontology, but how is that a "formal semantics" for the actual data?

p. 6:
"To scale the benchmark, answers from all questions in a given category were combined into a global per-category set."
I did not understand this; please explain.

"By using averaging, rather than annotations from only one individual, we were able to verify high agreement (correlation between annotators was greater than 0.9) and to obtain a more statistically rigorous annotation per QA instance."
You are not averaging, you use a majority vote. Unclear: here you report correlation, later only F-score. What is this correlation? Is this before or after label collapse?

The first dataset is said to have 331 q/a pairs -- but how many different questions?

The text first simply states that 80% agreement among humans is good performance, which seemed quite low to me, given that the dataset was constructed using clear types of commonsense knowledge and detailed guidelines. The discussion later explains this in more detail, arguing, very interestingly, that there is no clear dividing line between commonsense knowledge and cultural influences. You need to have a comment like this earlier, or your readers will be very skeptical about the claim of high inter-annotator agreement.

Nice comment about interleaving of commonsense and cultural knowledge

Good point about limiting number of emotions as possible choices, and how that improves agreement

Generation as opposed to y/n answer: nice analysis, but are there numbers to accompany your findings?

Review #3
Anonymous submitted on 07/Sep/2022
Major Revision
Review Comment:

In this paper, the authors present a novel, theory-grounded benchmark for common sense reasoning.
I think the work is interesting and explores an important direction.

However, I think there are some issues with the presentation and with the design that might need to be solved before acceptance. First of all, the paper needs to be grounded in the existing NLP literature; the authors should better justify why this new dataset is better. Secondly, some of the choices need to be described in more detail. I understand there might be page-limit constraints, but in the rest of the review I will point out the things that I think require some work.


I believe this paper requires an introduction to the framework the authors use. I was not familiar with it and a more thorough explanation would have helped in understanding the paper. It would probably also help in justifying why the authors believe that this benchmark is better than past benchmarks.

Connected to the point above, more examples from the benchmark are needed in the paper.
I had to download the training and validation data to get a sense of what one could find inside (it was also not clear how to use the data for training, but that is another problem).

Related Work

I think the paper should cover a bit more of the work that has been done in NLP for reasoning and common sense. There are many datasets meant to evaluate general and commonsense reasoning in NLP; I'll just mention a couple: Winograd and WinoGrande, SuperGLUE, and the various NLI datasets (for example, abductive inference [https://arxiv.org/pdf/1908.05739.pdf] or ANLI) are all benchmarks meant to evaluate some of these properties. Benchmarks like ATOMIC should be described in detail. I believe the authors should make a wider comparison (maybe a table?) describing the differences between all the datasets, with sizes and properties.
Another interesting paper might be https://arxiv.org/pdf/1907.13528.pdf, where BERT is probed for commonsense on different datasets. BigBench (https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/k...) has an entire section that is specific to commonsense tasks.

As another element of discussion, there is now a rich discussion in NLP about what can be learned with language or without language (https://aclanthology.org/2020.emnlp-main.703/, https://aclanthology.org/2020.acl-main.463/).


Important details of the datasets seem to be missing: sentence/question length is not reported, and the distribution of correct answers is not reported. It is also not clear how the splits have been defined and how they ensure there is no overlap between train, validation, and test (e.g., is there any question in the training set that could give a hint about the correct answer for a question in the test set?).

"Correlation between annotators was greater than 0.9" - I'd suggest using some inter-reliability measures (Gwet's or Krippendorff's maybe).

"We created and provided a concise set of "guidelines to the human annotators" - can this be made available?

The benchmark is very small and after the entire release, only 4 topics will be available. This makes me wonder if good performance on this dataset can indeed suggest that a model "has learned commonsense" when there are so few examples.


How has T0++ been used?

I believe T0++ gets very good performance for a zero-shot model that has been pre-trained on data from different tasks. It makes me wonder whether GPT-3 can solve this task or not (not an issue with the paper itself, I am just curious).

No trained model has been used to test the benchmark and I am not sure which models the authors expect to use for this challenge. Can the authors include some few-shot learning methods as baselines?