Knowledge graphs for common-sense scientific question answering

Tracking #: 2949-4163

Authors: 
Guy Aglionby
Simone Teufel

Responsible editor: 
Guest Editors Commonsense 2021

Submission type: 
Ontology Description
Abstract: 
Knowledge graphs (KGs) can be used to structure the information necessary for a model to successfully answer questions. In this paper we specifically investigate the storage of common sense information that expresses properties of abstract concepts. Prior work has examined ontology design for specific kinds of common sense, but the general case is under-explored. We identify weaknesses in the structure of ConceptNet, the predominant resource for general common sense, and propose a new modular ontology for common sense — MOntCS — to store this information. MOntCS is designed to be suitable for structuring explanations for questions by limiting the complexity of concepts permitted. We draw on linguistic theory to ensure consistency and clarity in the relation set. We use MOntCS to structure the facts provided with WorldTree, a scientific common sense question answering dataset which originally stores information in tables, and release this as a resource for knowledge graph-augmented question answering. We show that, with an existing knowledge graph reasoning model, using this knowledge graph gives higher accuracy compared with three competitor knowledge graphs. We carry out an ablation study to identify which relation types most impact question answering performance, and study which properties of knowledge graphs correlate with higher performance. We provide empirical evidence for claims made in prior work that taxonomic relations may not be useful for common sense reasoning.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 14/Dec/2021
Suggestion:
Major Revision
Review Comment:

The paper suggests an ontology called MOntCS (Modular Ontology for Common Sense) to represent knowledge, with the goal of resolving some issues present in other knowledge bases. In particular, the authors target two issues they identify in ConceptNet's ontology: relation ambiguity and lack of structure.

The paper has a careful discussion of many of the aspects that influence the choice of relations and entity representation, and lands on a set of relations that tries to strike a good balance. There are also good comparisons to various other approaches (although AMR could also have been discussed?).

As a test case, they rewrite the WorldTree knowledge base in this ontology, using a mixture of automatic (for some tables) and manual methods. They take the largest connected component as the KB (a restriction that could presumably be relaxed if need be), make it available for download in a GitHub repository, and show that it is denser and more clustered than other KBs like ConceptNet and TupleKB. Finally, they evaluate the KB by incorporating it into a QA-GNN model, measuring answer accuracy on the WorldTree question set.
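
For concreteness, restricting to the largest connected component is a standard graph operation; a minimal sketch, assuming networkx and a hypothetical list of (head, relation, tail) triples (invented for illustration, not taken from the released resource):

```python
import networkx as nx

# Hypothetical (head, relation, tail) triples, for illustration only.
triples = [
    ("carnivore eat animal", "structural-agent", "carnivore"),
    ("carnivore eat animal", "structural-patient", "animal"),
    ("oak tree", "taxonomic", "tree"),  # disconnected from the triples above
]

# Build an undirected view of the KG, keeping relation labels as edge data.
g = nx.Graph()
for head, rel, tail in triples:
    g.add_edge(head, tail, relation=rel)

# Keep only the largest connected component as the KB.
largest = max(nx.connected_components(g), key=len)
kb = g.subgraph(largest).copy()
print(sorted(kb.nodes))  # ['animal', 'carnivore', 'carnivore eat animal']
```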

On the positive side, the descriptions, discussions and comparisons are clear, and a concrete KB has been constructed and shared.

However, the current evidence does not make the utility of this ontology very convincing:

- Evaluating on WorldTree QA accuracy is very indirect (and "unfair", as the WorldTree KB focuses only on knowledge for the correct answers, not the distractor options, giving a strong bias towards the correct answer). One motivation in the paper is to provide better explanations - it would be great to see at least qualitative explorations of this, if not some quantitative measures (which can be tricky).

- Applying the suggested ontology only to WorldTree is a bit narrow; it would be good to also discuss to what extent other knowledge sources, like ConceptNet, could be coerced into this ontology. How much would be covered, how much would be lost, etc.? Also, calling WorldTree just "common sense" is a bit of a stretch, since it specifically targets elementary scientific principles as well.

- Although some types of ambiguity are resolved with this ontology, others remain. E.g., word sense ambiguity could be an issue (two meanings of "bat" exist in the WT graph, for example), and the granularity of the connections might lead to incorrect paths. E.g., the "eat" node has many different agents (like "carnivore", "scavenger", "producer") and patients (like "acorn", "animal", "plant"), and only some of these are actually compatible (though these are indirectly specified in other links, like ["carnivore eat animal", structural-patient, "animal"]); see the sketch after this list.

- Some more specific limitations are also mentioned in the paper, such as the lack of an easy way to express negation and comparison - it is good that this is explicitly discussed (with some "workarounds"), but these limitations do seem to restrict the universality of the proposal.
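
To make the compatibility concern above concrete, here is a minimal sketch with hypothetical MOntCS-style triples (invented for illustration, not taken from the released KG), showing that a two-hop path through a bare "eat" node can pair an agent and patient that no compound node licenses:

```python
# Hypothetical structural links around compound "eat" nodes.
structural = [
    ("carnivore eat animal", "structural-agent", "carnivore"),
    ("carnivore eat animal", "structural-patient", "animal"),
    ("squirrel eat acorn", "structural-agent", "squirrel"),
    ("squirrel eat acorn", "structural-patient", "acorn"),
]

def licensed(agent: str, patient: str) -> bool:
    """True only if a single compound node links both the agent and patient."""
    agent_facts = {h for h, r, t in structural
                   if r == "structural-agent" and t == agent}
    patient_facts = {h for h, r, t in structural
                     if r == "structural-patient" and t == patient}
    return bool(agent_facts & patient_facts)

print(licensed("carnivore", "animal"))  # True: backed by a compound node
print(licensed("carnivore", "acorn"))   # False: a spurious agent->eat->patient path
```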

In summary, while the motivation of the paper is good, the current proposal does not seem all that convincing. Still, there are worthwhile ideas and suggestions throughout, and it could be more convincing with more concrete evidence from applying to other KBs or showing promise in generating useful explanations.

Review #2
Anonymous submitted on 23/Jan/2022
Suggestion:
Major Revision
Review Comment:

The authors propose MOntCS, a new ontology for structuring commonsense knowledge bases (KBs). MOntCS aims to address several shortcomings of popular KBs like ConceptNet, including the inconsistent granularity of events and the redundancy of relations. The design guidelines include restricting nodes to verb and noun phrases (to control specificity) and introducing structural relations to connect compound nodes to their constituents. In addition, the proposed ontology contains four classes of relations, including verbal (semantic), taxonomic, and affective relations, to cover a broad range of commonsense relations. The authors instantiate MOntCS on the WorldTree facts corpus [1] using semi-automated annotation. Experiments on WorldTree QA show that the KB obtained using MOntCS marginally improves over ConceptNet.
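
As an illustration of the compound-node decomposition just described, a minimal sketch using the "bakers give bread to customers" example from the paper (the tuple representation is assumed for illustration; 'beneficiary' and the structural relation names follow the paper's scheme as I understand it):

```python
# The compound fact is itself a (verb-phrase) node, linked to its constituent
# noun phrases by structural relations, plus a semantic 'beneficiary' link.
fact = "bakers give bread"
triples = [
    (fact, "structural-agent", "bakers"),    # who acts
    (fact, "structural-patient", "bread"),   # what is acted upon
    (fact, "beneficiary", "customer"),       # who benefits
]

# The constituents are recoverable from the compound node alone; note that no
# extra triple like ('bakers', 'own', 'bread') is implied by this decomposition.
constituents = [t for _, r, t in triples if r.startswith("structural-")]
print(constituents)  # ['bakers', 'bread']
```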

Commonsense reasoning is an increasingly popular area of research, and resources like new KBs and ontologies can be valuable for making progress. The proposed work highlights some important shortcomings of existing KBs, like the inconsistency in granularity and redundant relations. The new contribution, MOntCS, takes promising steps towards addressing these shortcomings. However, grounding some of the claims in experiments and analysis, and clarifying parts of the paper, would make it stronger.

## Areas of improvement

1. Lack of focus: the exact problems that the paper addresses could be better motivated. It would have been better if the said shortcomings (multiple relations and inconsistent granularities) were grounded in empirical analysis, but the current version offers little to no evidence for this. There are works, for example [2-3], which are strengthened by multiple relations between entities in ConceptNet. Thus, it might be useful to explain the said shortcomings better. Further, L19 states: "The design of ontologies for more general, open-ended common sense reasoning has been so far under-explored, and this is where our current paper's focus lies." However, MOntCS is inherently dependent on the available knowledge and, as such, does not address the lack of more general resources for common sense reasoning.

2. Experiments

I think the current experiments do not offer sufficient evidence for the utility of the proposed Ontology. In detail:

2.1) Table 9, Column 1 shows that MOntCS has only marginal gains over ConceptNet, despite having a 1:1 alignment with the task, as the authors note. Further, the fact that the task overlaps with the graph makes WorldTree QA an uninteresting (and perhaps unfair) candidate for evaluation. To establish that the proposed scheme is general, I suggest that the authors try MOntCS on at least one more task. For example, WIQA [4] might align well with MOntCS.

2.2) Table 9 should have another row with no graph at all, for a fair comparison. I suspect that neither of the graphs is helping (and some might be adding noise). Adding a no-graph baseline would clarify this.

2.3) Adding significance tests and repeating the experiments with different seeds would also help establish whether the difference between the graphs in columns 1 and 2 is real; a minimal sketch of such a test follows after this list.

2.4) Table 10 essentially shows that taxonomic relations are not helpful. Do they still need to be included? Further, column 1 of Table 10 indicates that the performance is essentially the same without any graph. This relates to the point made earlier about a no-graph baseline.
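
On point 2.3, a minimal sketch of a paired significance test over seeds, assuming per-seed accuracies are collected for two graphs (the numbers below are placeholders, not results from the paper):

```python
from scipy import stats

# Placeholder per-seed accuracies for two knowledge graphs, paired by seed.
acc_montcs = [0.612, 0.605, 0.619, 0.608, 0.615]
acc_conceptnet = [0.601, 0.607, 0.598, 0.604, 0.599]

# A paired t-test is appropriate because the runs share seeds and data splits.
t, p = stats.ttest_rel(acc_montcs, acc_conceptnet)
print(f"t = {t:.3f}, p = {p:.3f}")
```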

3. MOntCS as a tool for explanation
As the authors mention in Section 6, "Models are increasingly evaluated not just on performance, but on their ability to provide explanations for the choices they make. We design MOntCS to be a suitable medium for expressing explanations in the common sense question answering domain." However, the experiments section does not show any evidence that MOntCS can provide valuable explanations. Adding experiments on this front would significantly strengthen the paper.

## Grammar/typos, style, and presentation:

1. Page 2, L32: "Relation this scenario…"

2. Page 3, L43: "in the graph to take a.."

3. It might be better to add a citation for statements like "ConceptNet [10] is perhaps the most frequently used knowledge graph for common sense reasoning applications." This would allow dropping speculative phrases like "perhaps," which might not work well in this setting. Alternatively, you can rephrase them to something like "ConceptNet [10] is one of the most frequently used knowledge graphs for common sense reasoning applications." A similar statement is present on page 5, L11: "ConceptNet is most commonly used as the base knowledge graph, a subset of which is chosen for computational reasons."

4. "As a graph grows denser, it becomes easier to select relevant data that may otherwise require many hops to reach from the starting nodes." Similar to the above, this statement sounds general but will depend on the specifics of the underlying graph. The relevant information may or may not appear closer as the graph grows denser. Thus, it might be helpful to qualify this statement and explain why this is expected. Another such statement appears on Page 15, L37: "a path length of 2 as used in prior work is insufficient."

5. Page 8, L3: "additioanl"

6. Page 12, L34: "missing structural links where they were missing"

7. Page 15, L32: "However, because QA-GNN also includes this embedding within the GNN (figure 1, label '2'), the *langauge model can still be trained."

8. Section 3.4 is very nicely written and is one of the most interesting aspects of the paper: it clearly lays out the problems and presents possible solutions and design choices with motivating examples. I believe that some other parts of the paper (e.g., Section 5) could be improved with Section 3.4 as a reference.

9. Given the central role that WorldTree plays in this work, it is worth adding a sample table, either in the appendix or the main paper.

10. Section 5 could use a rewrite for clarity. Several statements are either made without appropriate citation ("a path length of 2 as used in prior work is insufficient") or are unclear, like "Ensuring fairness in this scenario is difficult."

## Questions:

Q1. Why is Causes (Table 5) not placed in Affective relations?

Q2: Would "semantic relations" be a better term for verbal relations?

Q3: Is redundancy necessarily a bad thing? The paper lists redundancy as one of the main shortcomings of ConceptNet. However, it is unclear why the responsibility for disambiguating the proper relation should not lie with the downstream application. Further, redundancy can sometimes be advantageous by capturing the multifaceted nature of the relation between two nodes.

Q4: The second shortcoming, the level of specificity of the nodes, relates to the underlying data source and is not so much a problem with the nodes themselves. Since the authors use WorldTree, isn't whether or not the derived KB is at the right level of granularity to a great extent a function of the granularity at which the facts in WorldTree are expressed?

Q5: Page 15, L27: "The design of QA-GNN does not ensure that the graph…difficult to know the extent to which they drive performance in this model." Doesn't this directly undermine the motivation for having a knowledge graph at all? Please also see my note on providing evidence for using MOntCS as an explanation.

[1] Jansen, Peter, Elizabeth Wainwright, Steven Marmorstein, and Clayton Morrison. “WorldTree: A Corpus of Explanation Graphs for Elementary Science Questions Supporting Multi-Hop Inference.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA), 2018. https://aclanthology.org/L18-1433.

[2] Xu, Yichong, Chenguang Zhu, Ruochen Xu, Yang Liu, Michael Zeng, and Xuedong Huang. “Fusing Context Into Knowledge Graph for Commonsense Question Answering.” In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 1201–7. Online: Association for Computational Linguistics, 2021. https://doi.org/10.18653/v1/2021.findings-acl.102.

[3] Wang, Han, Yang Liu, Chenguang Zhu, Linjun Shou, Ming Gong, Yichong Xu, and Michael Zeng. “Retrieval Enhanced Model for Commonsense Generation.” ArXiv:2105.11174 [Cs], May 24, 2021. http://arxiv.org/abs/2105.11174.

[4] Tandon, Niket, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, and Antoine Bosselut. “WIQA: A Dataset for ‘What If...’ Reasoning over Procedural Text.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 6076–85. Hong Kong, China: Association for Computational Linguistics, 2019. https://doi.org/10.18653/v1/D19-1629.

Review #3
By Aida Amini submitted on 08/Feb/2022
Suggestion:
Minor Revision
Review Comment:

The paper describes the authors' effort to define a modular ontology for common sense. Their ultimate goal is to increase the density of the final knowledge graph by structuring the information and lowering its complexity. They have utilized the relations in WorldTree and created structured relations over them. Their ontology contains categories such as structural relations, verbal relations, taxonomic relations, affective relations, and grammatical exception relations, and the paper carefully defines the relation types included in each category. Afterwards, they describe the annotation procedure as a combination of manual and automatic steps, and provide a comparison of the statistics of their KG with other common-sense knowledge bases. Finally, to assess the quality of the ontology and the collected KG, the paper describes experiments on the task of QA and an ablation study to show the contribution of each category of relations.

In terms of the quality and relevance of the described ontology, the paper provides a very structured ontology which helped improve the results of the QA task, yet it could be tested on more sub-tasks.

The paper is clear and readable, but as a suggestion, an outline of the paper could be added to the introduction.

The authors have shared a GitHub repository containing their final knowledge base, yet it is a little confusing and hard to navigate; it might be better to add a README file. In addition, the GitHub link does not include the scripts used for annotation. It would be good if the authors made those accessible as well, for other researchers to use.

The strengths and weaknesses of the article are as follows:

Strengths:
+ The authors gave better structure to the existing ontologies in order to create a dense knowledge graph, which ultimately will result in better inference.
+ Using the described ontology, the authors were able to use a combination of manual and automatic annotation, which reduces the manual effort needed to create the knowledge base.
+ The evaluation presented in the paper shows that utilizing more structured graphs (even with fewer nodes) can result in an improvement of the model's performance on the QA task.

Weaknesses:
- There are not many qualitative sample evaluations provided in the paper.
- For the evaluation, the comparison is drawn on the task of QA using QA-GNN, but it might be better to evaluate on a couple of other tasks to establish the performance of the new ontology and the KG created via it.

Questions for the Authors:
* In Section 3.3.2, it is mentioned that some concepts (e.g. distance and far) are somewhat similar; is there a threshold for determining similarity?
* In the sentence "bakers give bread to customers.", the authors mention that the resulting triple will be (‘bakers give bread’, beneficiary, ‘customer’), but does it also provide other artifact relations such as ('bakers', 'own', 'bread')?
* In Section 4.3, the authors performed post-processing in order to find and prune errors. What are the percentages of occurrence of those errors?
* For Section 4.3, are there any other types of errors remaining within the annotated samples that were not caught? It might be interesting to see more qualitative samples of the final datapoints.
* In Section 4.3, page 14, line 6, there is a mention of "In others this was genuine (‘open container’ is both a possible verb and noun phrase)", but it is not clear what the authors' approach is for these cases.
* Table 10 shows that the results of the KG analysis without taxonomic relations are even higher than with those relations. If possible, can the authors provide the samples that are marked correctly without taxonomic relations but not with them?
* The authors have started with the WorldTree relations; is this annotation approach applicable to ConceptNet?

Minor Comments:
* It might make the paper more readable if the authors add an outline of what to expect in each section and the overall flow of the paper.
* On page 7, line 18: "base classes are combined in" --> "to be combined"
* On page 8, line 31, the word "cause" appears twice.