Abstract:
Achieving machine common sense has been a longstanding problem within Artificial Intelligence. Thus far, benchmarks that are grounded in a theory of common sense and that can be used to conduct rigorous, semantic evaluations of commonsense reasoning (CSR) systems have been lacking. One expectation of the AI community is that neuro-symbolic reasoners can help bridge this gap toward more dependable systems with common sense. We propose a novel benchmark, called Theoretically-Grounded Commonsense Reasoning (TG-CSR), modeled as a set of question-answering instances, with each instance grounded in a semantic category of common sense, such as space, time, and emotions. The benchmark is few-shot, i.e., only a few training and validation examples are provided in the public release to preempt overfitting. Current evaluations suggest that TG-CSR is challenging even for state-of-the-art statistical models. Due to its semantic rigor, this benchmark can be used to evaluate the commonsense reasoning capabilities of neuro-symbolic systems.