Review Comment:
Text summary:
--
The article introduces a new dataset testing commonsense knowledge. Following current standard practice, it uses a question-answer format. What sets it apart from existing resources is that the dataset is based on an ontology of commonsense knowledge from a book by Gordon and Hobbs. Each question and answer in the new dataset is categorized with respect to a type of commonsense knowledge, with the eventual aim of coverage that is as comprehensive as possible. Dataset construction proceeded as follows: Given a topical domain, such as "bad weather", a first group of annotators designed a story context, along with questions and answers matching some commonsense category, with at least one correct answer per question. A separate set of annotators labeled each answer with a rating from 1-4 (where 1 = bad fit, 4 = very good fit). The aim of the dataset is to enable few-shot prediction, especially by neural-symbolic systems. The article reports on inter-annotator agreement on answer ratings, and on a benchmark system based on a large language model.
Overall assessment:
--
This seems a useful dataset; in particular, the ability to evaluate by type of commonsense knowledge sounds useful. However, the article has several larger issues. The discussion of related work is not a fair comparison. The main feature of the dataset, as touted in the introduction, is not measured in the benchmark evaluation. And there is no discussion of how naturalistic the hand-created data is, even though non-naturalistic formulations could in principle throw off the performance of large language models such as the one used in the benchmark evaluation.
Main comments on the data construction:
--
The choice of commonsense categories for the dataset makes a lot of sense. It is great that the authors did a prior annotation experiment to identify categories with good inter-annotator agreement, and it is good that they included emotion as one of the categories.
One issue that worries me, and that is not discussed at all in the paper, is how natural the text in the dataset (contexts, questions, answers) sounds, especially given that the annotation guidelines narrowly controlled what it could look like. The authors argue that because they target a few-shot setting, they do not need to worry about annotation artifacts, and that may be true. But if the texts do not sound natural, that may still affect results, especially for systems based on language models, such as the one used for the benchmark. How do the authors control for this possible issue?
It is good that ratings on the answers were collected on a graded scale, allowing annotators to express nuance. But it is odd that the authors do nothing at all with the original, nuanced annotation. For all purposes, both inter-annotator agreement and system prediction, human ratings are collapsed into 1-2 for "no" and 3-4 for "yes". Would it be possible to do an analysis of the graded ratings, for example to test whether some commonsense categories received more middling ratings than others?
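As a purely illustrative sketch of the kind of analysis I have in mind (the input format and function name are hypothetical, since the real data layout is not described in enough detail), one could compute the share of middling ratings per category from the raw 1-4 scores:

from collections import defaultdict

def middling_share_by_category(records):
    """records: iterable of (category, rating) pairs with raw 1-4 ratings.
    The input format is hypothetical; the real data layout may differ."""
    counts = defaultdict(lambda: [0, 0])          # category -> [middling count, total count]
    for category, rating in records:
        counts[category][0] += rating in (2, 3)   # "middling" = neither clearly bad nor clearly good fit
        counts[category][1] += 1
    return {cat: mid / total for cat, (mid, total) in counts.items()}

Categories with a noticeably higher middling share would be candidates for being inherently harder or less well defined.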
Main comments on the evaluation:
--
The evaluation of inter-annotator agreement is quite odd: "we use the labels of each annotator in turn as the ground-truth, and take the mode of the remaining k − 1 annotators as the human prediction." It should be the other way round: a single annotator is the human prediction, and an aggregate of all the others is the ground truth. That is the standard way to do it.
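To make the point concrete, here is a minimal Python sketch of the computation I would expect, assuming a matrix of already binarized labels (the variable names and data layout are hypothetical, and accuracy is used in place of F1 for brevity):

from collections import Counter

def leave_one_out_agreement(labels_per_item):
    """labels_per_item: list of lists; labels_per_item[i][j] is the (already
    binarized) label of annotator j on item i. Format is hypothetical."""
    k = len(labels_per_item[0])                    # number of annotators
    per_annotator = []
    for j in range(k):                             # annotator j plays the role of the system
        correct = 0
        for item in labels_per_item:
            prediction = item[j]                   # single annotator = human prediction
            others = item[:j] + item[j + 1:]
            gold = Counter(others).most_common(1)[0][0]   # majority of the rest = ground truth
            correct += prediction == gold
        per_annotator.append(correct / len(labels_per_item))
    return sum(per_annotator) / k                  # average accuracy over annotators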
Please show precision and recall too, not just F1. You argue that the T0 system has very low recall but you never actually show the numbers.
The introduction placed great emphasis on the fact that this dataset, unlike existing ones, can be used to test a system's commonsense knowledge separately by commonsense category for greater insight -- but for the benchmark T0 system, no results by category are reported. You need to either remove the claim of evaluation by category, or show the detailed per-category results for T0.
T0 small versus large: The text says that the largest model with the most pretraining data achieves the best performance -- but no numbers are shown for the smaller T0 system with more pretraining data. Can you show those numbers too?
Main comments on the argument made in the introduction:
--
The article references existing datasets with commonsense question-answer pairs, but dismisses them as "ad-hoc" and overall makes it sound as if those datasets were of lesser quality. This is not a fair argument. There is a reason that these commonsense datasets look the way they do: they are aimed at fine-tuning black-box neural systems, and they focus on naturalistic data.
There is an argument to be made for the framework chosen for the current dataset, and it can be made while being fair to existing datasets. It would go something like this: The categorization of commonsense knowledge into types, for a more in-depth analysis, could generate useful insights, much as the categorization of Natural Language Inference datapoints by linguistic phenomenon was very useful for testing systems' strengths and weaknesses. A downside of the highly controlled setup of the dataset presented in this article is that artifacts are likely, so the dataset is not set up to provide more than a few points of training data. This is a drawback because fine-tuning is known to help neural systems, but given the advantage of having a categorization of questions and answers, it is one the authors are willing to accept. Also, they hope that their dataset will be mainly useful for neural-symbolic approaches, which may not need fine-tuning on the end task (right?).
But this is not an argument that the current paper makes.
Detail comments:
--
[about the few-shot setup] "this forces the model to rely on more fundamental, neuro-symbolic aspects of common sense, including commonsense theory and semantics,"
Does it? You use a large language model for benchmarking, and do not show any use of a neural-symbolic system, even though you repeatedly mention this as the main application area.
The paper says that the Gordon-Hobbs theory provides a "formal semantics for the benchmark". What does that mean? It obviously does not mean that a model-theoretic or proof-theoretic semantics is provided for all questions and answers. I think it means that questions and answers are linked to a classification through an ontology, but how is that a "formal semantics" for the actual data?
p. 6
"To scale the benchmark, answers from all questions in a given category were combined into a global per-category set."
I did not understand this; please explain.
p. 6
"By using averaging, rather than annotations from only one individual, we were able to verify high agreement (correlation between annotators was greater than 0.9) and to obtain a more statistically rigorous annotation per QA instance."
You are not averaging; you use a majority vote. Also unclear: here you report a correlation, later only F-score. What is this correlation? Is it computed before or after the label collapse?
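To illustrate the ambiguity, the following sketch (assuming a ratings matrix of raw 1-4 scores; the data layout and function name are hypothetical) contrasts two readings: pairwise Pearson correlation on the raw graded ratings versus pairwise agreement on the collapsed binary labels. Stating which of these (or which other statistic) the 0.9 refers to would remove the confusion.

import numpy as np
from scipy.stats import pearsonr

def annotator_stats(ratings):
    """ratings: numpy array of shape (n_items, k_annotators) with raw 1-4 scores.
    Hypothetical data layout."""
    k = ratings.shape[1]
    correlations, agreements = [], []
    for i in range(k):
        for j in range(i + 1, k):
            # Reading 1: correlation on the raw graded ratings (before collapse)
            correlations.append(pearsonr(ratings[:, i], ratings[:, j])[0])
            # Reading 2: raw agreement on the collapsed binary labels (3-4 = "yes")
            agreements.append(np.mean((ratings[:, i] >= 3) == (ratings[:, j] >= 3)))
    return float(np.mean(correlations)), float(np.mean(agreements))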
The first dataset is said to have 331 q/a pairs -- but how many different questions?
The text first simply states that 80% agreement among humans is good performance -- which seemed quite low to me, given that the dataset was constructed using clear types of commonsense knowledge and close guidelines. The discussion later explains this in more detail, arguing, very interestingly, that there is no clear dividing line between commonsense knowledge and cultural influences. You need a comment like this earlier, or your readers will be very skeptical about the claim of high inter-annotator agreement.
Nice comment about interleaving of commonsense and cultural knowledge
Good point about limiting number of emotions as possible choices, and how that improves agreement
Generation as opposed to y/n answer: nice analysis, but are there numbers to accompany your findings?