Beyond Facts - a Survey and Conceptualisation of Claims in Online Discourse Analysis

Tracking #: 2638-3852

Authors: 
Katarina Boland
Pavlos Fafalios
Andon Tchechmedjiev
Stefan Dietze
Konstantin Todorov

Responsible editor: 
Philipp Cimiano

Submission type: 
Survey Article
Abstract: 
Analyzing statements of facts and claims in online discourse is the subject of a multitude of research areas. Methods from natural language processing and computational linguistics help investigate issues such as the spread of biased narratives and falsehoods on the Web. Related tasks include fact-checking, stance detection and argumentation mining. Knowledge-based approaches, in particular works in knowledge base construction and augmentation, are concerned with mining, verifying and representing factual knowledge. While all these fields are concerned with strongly related notions, such as claims, facts and evidence, terminology and conceptualisations used across and within communities vary heavily, making it hard to assess commonalities and relations of related works and how research in one field may contribute to address problems in another. We survey the state-of-the-art from a range of fields in this interdisciplinary area across a range of research tasks. We assess varying definitions and propose a conceptual model, Open Claims, for claims and related notions that takes into consideration their inherent complexity, distinguishing between their meaning, linguistic representation and context. We also introduce an implementation of this model by using established vocabularies and discuss applications across various tasks related to online discourse analysis.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Johannes Daxenberger submitted on 16/Jan/2021
Suggestion:
Minor Revision
Review Comment:

The survey tackles a challenging task: conceptualizing and integrating research from several disciplines about a concept which even within the same discipline is often not treated consistently: claims. This is a truly difficult, but probably worthwhile endeavor. Identifying and analyzing claims in online discourse is of high importance for fact checking, which (although not explicitly stated) is the driving force for the selection of material and concepts in the survey. As a result, it integrates NLP research on argument mining, stance detection and knowledge base construction/representation.

The survey is an interesting and informative piece of work and also a rather inspiring read.

My reasons for recommending revision are the following:

- Suitability as introductory text/clarity of presentation: As far as I can tell, the approach is neither theory-driven nor data-driven. It is bottom-up in the sense that it surveys literature from many areas (Sect. 3) and tries to derive a joint conceptualization from this (Sect. 4). It is also top-down in that it tries to map the conceptualization to "knowledge engineering tasks" (Sect. 5). However, in both directions, there is a break which prevents a clear connection: I didn't understand how the Open Claims model (from previous work of the same authors) was motivated from the survey in Section 3, nor how each of the relevant tasks around claim identification and analysis is mapped to (parts of) the Open Claims model. What is missing is a structured mapping between tasks from e.g. argument mining like argument extraction [1] and clustering [2] to concepts in the Open Claims model. While this might require certain simplifications, it would definitely increase the applicability and suitability as introductory text.
- Balance of presentation: As stated in the introduction, fact-checking as an application to claim identification and analysis motivates much of the material covered and the conceptualization and conclusions. This is not bad (probably even a necessary choice due to the complexity and broadness of the topic), however, it should be made more explicit (e.g. in the introduction). Argument mining is covered well, but rather superficially (probably ok given the aforementioned focus of the survey and existing surveys on argument mining). The distribution of venues in Fig. 2 also shows the predominance of NLP (and in particular, fact checking and argument mining) venues (as opposed to classic semantic web venues).
- Importance of the covered material: I wasn't fully convinced whether a common notion or conceptualization of claim across the various disciplines is really necessary and helpful in practice. Each of the NLP tasks outside of fact-checking you touch (argument mining and stance detection) might live (and so far has lived) fine without a joint notion of "claim". Don't get me wrong - I do tend to think it is helpful - but the only place in the survey where this became obvious to me was at the very end where you discuss notions of claim relation. Formalization of (dimensions of) claim relation/similarity is an area from which potentially all discussed tasks could benefit. I recommend going into more depth about this and adding motivating examples much earlier in the paper.

Minor comments:
- The intro would benefit from a motivating (textual) example (see last point above)
- The selection of journals (Table 1), in particular for NLP, was a bit surprising to me. I would have expected to see Computational Linguistics, TACL, JMLR etc. here
- Why didn’t you consider pre-print venues such as arXiv (Table 1)?
- Page 7 line 19 "Stahlhut" (single-authored)
- In many places, instead of just adding plain references ("[19] state that …"), it would substantially increase readability to have author names ("Miller et al. [19] state that …")

References to be added:
- Schiller et al. (2020). Stance Detection Benchmark: How Robust Is Your Stance Detection? arXiv.
- Al-Khatib et al. (2020). End-to-End Argumentation Knowledge Graph Construction. AAAI-20. (Sect. 5.1.1)
- Chen et al. (2019). Seeing Things from a Different Angle: Discovering Diverse Perspectives about Claims. ACL 2019. (Sect. 5.1.1)

[1] Shnarch et al. (2018). Will it blend? Blending Weak and Strong Labeled Data in a Neural Network for Argumentation Mining. ACL 2018.
[2] Reimers et al. (2019). Classification and Clustering of Arguments with Contextualized Word Embeddings. ACL 2019.

Review #2
By Tobias Kuhn submitted on 22/Jan/2021
Suggestion:
Minor Revision
Review Comment:

This manuscript provides a survey on the topics of claims, fact-checking, and argumentation, proposes a conceptual model on these topics, and then reviews existing information extraction and knowledge engineering tasks from the point of view of the introduced model. The paper is well written, easy to read, and generally well structured.

The topic is very interesting and relevant to the journal. In particular the fact that it brings together the approaches and viewpoints from natural language processing with the ones from formal modeling is very valuable.

The literature survey is well done and seems to have good coverage.

The model as introduced makes sense and seems to succeed in bringing the variety of existing works together. I have a number of more minor comments on the model below, but the overall structure is convincing.

The final discussion and review of existing information extraction and knowledge engineering tasks is also interesting and valuable, but could in my view be a bit better structured, in particular to emphasize the different connections to the conceptual model. These connections are indicated but it is a bit difficult to get a general overview of how the full set of existing tasks maps to the concepts of the model. It would, for example, be useful to have a version of Figure 5 where we can see where in this conceptual space the different tasks are located. I feel that they would cover the diagram to a large extent, which would be very nice to actually see.

Apart from this, I would like to raise the following important points:

- This seems to be an extended version of a previous paper of the authors [25]. The details of the extensions and differences to this earlier paper need to be made more explicit and more detailed in my opinion.

- I am missing a URL to a machine-readable version of the Open Claims model.

Below I provide a list of more minor issues. Overall, however, I judge the quality and value of this manuscript as very high and expect that a version that is revised accordingly should be accepted for publication.

Minor comments:

- sometimes acronyms are introduced but then never referred to, e.g. "pay-level domains (PLDs)"

- Generally, I would not introduce acronyms for short phrases like "argumentation mining (AM)" but use the full phrase throughout. Papers are often not read linearly, and then such acronyms can be confusing.

- "research focused on natural language claims": I think this research branch should be better labeled/described.

- lines 39-41 of page 1 column 1, and surroundings: the difference between the terms "utterance", "statement", and "sentence" is not clear here.

- "fact-checking sites [23, 24]": unclear how this relates to the main claim of the sentence about strongly diverging models.

- "This work is meant to facilitate an unambiguous representation of claims across various communities": I feel that "unambiguous" is a bit too strong a word in this context (as the representation of the claims is only unambiguous at a relatively shallow/syntactic level).

- "[...] comes to show that these works do not fully contribute to closing the terminological and conceptual gap that exists in and across fields": I find this only partially convincing. Good quality and coverage of a survey/model doesn't necessarily imply good uptake. From this argument, moreover, it's not clear what should make us confident that the presented survey won't follow the same fate. A more convincing argument, in my view, would be to say that all these existing surveys looked at claims/facts in a more narrow sense or more narrow domain than the overarching model/survey presented here.

- the quoted introductory definition of evidence as "text, e.g. web-pages and documents [...]" doesn't seem very helpful. In fact, "evidence is a kind of text" seems conceptually wrong. Similarly later on with "Stances are usually defined as text fragments [...]" (though it is clarified somewhat later in the same paragraph).

- "Closely related to this is the notion of a rumour.": There is a bit of a sharp transition here, as the previous sentences talk about scientific claims and evidence, and the term "rumour" doesn't seem to be closely related in this domain.

- Maybe the two short subsections 3.2.5 and 3.2.6 could be merged.

- I find it a bit confusing that in section 3.3 entitled "Summary" new references are introduced (e.g. "[121]"). I think "Summary" is not a good title here.

- "In the case of a fact extracted from a knowledge base, the speaker equals the knowledge base reporting the fact": I feel a bit uneasy about equating the knowledge base with the speaker role. What about a knowledge base that stores provenance information about the stored facts/claims, including who said it? Who would in this case be the "speaker"?

- "In contrast to a claim, it is not necessarily embedded in a discourse": unclear what "it" refers to here.

- "A representation can have the form of freetext, e.g., a sentence that best describes the proposition": Wouldn't in that case the representation also be an utterance?

- "one or more reviews, and iii) one or more attitudes": shouldn't that be "*zero* or more ..."?

- "attitude is an opinion on a given topic (e.g., a viewpoint)": Does that imply that "viewpoint" would be a subclass of "attitude"? And what about the notion of a "stance" as introduced earlier? Would that also be a kind of attitude?

- I find "Annotation" to be a confusing class name, as many of these things can be seen as annotations. I think something like "Linguistic Feature" would be more appropriate.

- "what was said (linguistic representation of claim utterance)": shouldn't the "what" also include the content of what was said, so the claim proposition?

- Maybe "Time" or "Date/Time" would be a more appropriate class name instead of "Date".

- "Author" seems to indicate that something was written down. Does the model also cover spoken utterances? I think there is no reason not to, in which case the name "Author" seems confusing.

- The relations described in section 4.2.4 don't seem to be depicted in Figure 5, but I think that would be helpful.

- Can the formal representation of a claim proposition also point to a set of RDF statements, for example by the use of a named graph?

- Section 4.3: You don't mention OWL here. Did you not use OWL for basic ontological restrictions, e.g. domain/range for relations? I think that would be useful.

- I was wondering whether section 5.3.1 is part of the model definition. Are these three relation types part of the model introduced earlier? If not, wouldn't it make sense to include them?

- Typos: "Ressource", "et al.."

Review #3
Anonymous submitted on 31/May/2021
Suggestion:
Major Revision
Review Comment:

The paper presents a systematic survey of the terminology used for tasks related to the automatic computation of the veracity of assertions in text and knowledge graphs. After a well-written introduction, the authors present the methodology of their survey incl. queries and review process. Nothing surprising in that chapter; it is solid hard work. Chapter 3 claims to present a survey of definitions. However, this chapter is more than a survey as the authors begin to critique some of the terminology used in previous works. In 3.1, they introduce facts as things known or proved to be true. Facts are considered immaterial. This is in line with some of the basic assumptions behind RDF. However, the claim that there is no difference between a fact and the statement of a fact in the RDF community is rather surprising given the definition of ontology by (Gruber, 1993) and its implications pertaining to RDF knowledge graphs being mere formalizations. It also remains unclear whether assertion or statement of assertion is to be preferred here. The definition of evidence and claim looks fine but the authors intertwine their thoughts on the correctness and consistency of terms, which sometimes makes it hard to spot whose semantics are currently being discussed, leading to partially unsound equivocations.

Q1: Facts are constituents of KGs: This holds in some formalisms but not in all. For example, nodes are constituents of property graphs. Please fix.
Q2: There is no distinction made between a fact and the statement of a fact: How is this statement related to reification in RDF and embedded triples in RDF Star?
Q3: "Checking whether facts are true" is an oxymoron: You assume a definition that was not assumed by the authors of the paper you mentioned (fallacy of equivocation). Please fix.
Q4: Definition of fact. In Section 3.1, you assume the definition at the beginning of the section to critique other uses of facts but some of these definitions are rather unspecific. For example, when can something be known? Do we know the ontologically same facts or do we actually know instantiations of some abstract ideal facts? Something that actually exists: Do contradictions exist? Epistemologically they do. Are they facts? It is unclear whether you mean that all definitions hold concurrently. Please state clearly which definition you go with and do point out the limitations of the definition you go for.
Q5: The definition of Topic in Table 2 is surprising. Why would a topic be a phrase? Should it not model a frame? Is a phrase enough for that?

The authors continue with their second core contribution, i.e., the ontology in chapter 4.
Q1: The authors claim that non-factual claims do not have truth values. Do they actually mean that such claims have unknown truth values?
Q2: Claim proposition: How does one model the meaning exactly?
Q3: Claim context: Why must it be a person uttering the claim? Would agent not be the better choice here?
Q4: Fig. 5: The mapping between direction of the arrows and the predicates is unclear. Please fix.

The use cases pointed out by the authors do make sense.

Overall, the paper is a very interesting attempt to unify diverse terminology across different domains. Clearly, such a unification is challenging and the authors take a systematic approach to address said challenge. The argumentation lines pertaining to semantics are not always perfectly sound (see Qs above) and a more formal model of the ontology (say an OWL-DL description) would have been of great help. Still, the results of the paper could serve as a basis for papers with clearer semantics and will probably see a community uptake by the more pragmatic subset. People concerned with epistemology and ontology might not share the authors' perspective on some of the questions addressed in the paper. However, this is just a claim.

Minor details:
et al.[ => et al. [
machine-interpretation => machine interpretation
use-case => use case
ressource => resource

Key evaluation criteria:
(1) Suitability as introductory text: Yes
(2) How comprehensive and how balanced is the presentation: Very comprehensive systematic survey. Some epistemological and ontological considerations missing.
(3) Readability and clarity of the presentation: Excellent
(4) Importance of the covered material to the broader Semantic Web community: Very high.
(A) No file to be found.