ProVe: A Pipeline for Automated Provenance Verification of Knowledge Graphs Against Textual Sources

Tracking #: 3467-4681

Gabriel Amaral
Odinaldo Rodrigues
Elena Simperl

Responsible editor: 
Guest Editors Wikidata 2022

Submission type: 
Full Paper

Abstract:
Knowledge Graphs are repositories of information that gather data from a multitude of domains and sources in the form of semantic triples, serving as a source of structured data for various crucial applications in the modern web landscape, from Wikipedia infoboxes to search engines. Such graphs mainly serve as secondary sources of information and depend on well-documented and verifiable provenance to ensure their trustworthiness and usability. However, their ability to systematically assess and assure the quality of this provenance, most crucially whether it properly supports the graph's information, relies mainly on manual processes that do not scale with size. ProVe aims at remedying this, consisting of a pipelined approach that automatically verifies whether a Knowledge Graph triple is supported by text extracted from its documented provenance. ProVe is intended to assist information curators and consists of four main steps involving rule-based methods and machine learning models: text extraction, triple verbalisation, sentence selection, and claim verification. ProVe is evaluated on a Wikidata dataset, achieving promising results overall and excellent performance on the binary classification task of detecting support from provenance, with 87.5% accuracy and 82.9% F1-macro on text-rich sources. The evaluation data and scripts used in this paper are available in GitHub and Figshare.
Solicited Reviews:
Review #1
By Michael Röder submitted on 26/Jun/2023
Review Comment:

# Publication Summary

The paper presents ProVe, an automatic approach to verify a given statement from a knowledge graph based on the references that are listed in the statement's provenance. The approach generates search phrases based on the statement and generates overlapping text passages from the reference documents. It is worth pointing out that ProVe is able to process references that contain plain text as well as HTML web pages. From this large set of passages, the Sentence Selection step assigns a relevance score to each passage and selects the top 5 passages for further processing. To this end, a BERT model is used for scoring. For each of these top passages, the Textual Entailment Recognition step assigns a score to each of the three FEVER classes, namely SUPPORTS, REFUTES, and NOT ENOUGH INFO. This step also relies on a BERT model. Finally, the Stance Aggregation step uses the class and relevance scores as well as further features from the passages as input and returns a final classification indicating whether the reference mentioned in the statement's provenance supports the statement or not. Three different aggregation methods are proposed: a weighted sum, a rule-based approach, and a classifier (Random Forests are used in the evaluation).
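To make the final step concrete, the aggregation over per-passage scores could be sketched as follows. This is a hypothetical weighted-sum variant with made-up field names and weighting, not ProVe's actual implementation; the paper's three strategies differ in detail.

```python
# Minimal sketch of a weighted-sum stance aggregation: each selected passage
# carries a relevance score (from Sentence Selection) and per-class
# probabilities (from Textual Entailment Recognition). The names and the
# weighting scheme here are illustrative assumptions.

def aggregate_stance(passages, threshold=0.0):
    """passages: list of dicts with 'relevance', 'p_support', 'p_refute'."""
    score = sum(p["relevance"] * (p["p_support"] - p["p_refute"])
                for p in passages)
    return "SUPPORTS" if score > threshold else "NOT SUPPORTED"

passages = [
    {"relevance": 0.9, "p_support": 0.8, "p_refute": 0.1},  # strong support
    {"relevance": 0.2, "p_support": 0.1, "p_refute": 0.7},  # weak refutation
]
print(aggregate_stance(passages))  # SUPPORTS
```

Relevance acts as a weight, so a barely relevant refuting passage cannot outvote a highly relevant supporting one.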

The authors also present the Wikidata Textual References (WTR) dataset. The dataset is based on Wikidata. First, 7M references are selected. The authors sample a subset and process the statements and their references with ProVe's Sentence Selection. Crowdsourcing workers annotate each of the system's selected passages with one of the three FEVER classes with respect to the statement. The same is done for a set of evidence for a given statement. Finally, the authors annotate the references themselves with one of the three classes. WTR contains 416 statement-reference pairs with 76 distinct Wikidata properties.

Section 5 of the paper contains a detailed evaluation. Where possible, the single steps of the approach are evaluated on their own. I won't go into the details of the evaluation. However, the main result of evaluating the complete pipeline is that ProVe performs best if the Stance Aggregation uses a) a Random Forests classifier and b) the classification is carried out as binary classification (supports yes/no) instead of a multi-class classification with all three FEVER classes. The authors also compare ProVe with two other systems that aim at a similar use case (although not exactly the same) and show that ProVe achieves a better F1-score on the WTR dataset.

Section 6 discusses results and limitations. For example, the presented version of ProVe takes neither qualifiers within the provenance nor negations within the extracted passages into account.

# Review Summary

## Originality

The main difference between ProVe and the related work is the usage of language models. It could be argued that their application is obvious and has already been used in the related area of fake news detection. However, to the best of my knowledge, there is no work that uses language models for this particular application area. I also think that the problem of training the models is solved in a reasonable way by using the FEVER dataset.

The dataset is a well-described resource that can be useful for the community in the future.

## Significance of the Results

The authors provide a detailed evaluation of ProVe. This includes a comparison to other systems of which one represents the state of the art of the calculation of a veracity score for a given triple. The evaluation results suggest that these systems struggle to fulfill the task while ProVe achieves better F1-scores on the WTR dataset. The WTR dataset itself can also have a significant impact in the future.

## Quality of Writing

Overall, the writing is good. Some smaller typos are listed at the end of the review.

## Open Science Data

The repeatability of the experiments seems to be good.

- The code of ProVe is linked in the paper. It seems to come with the necessary data (e.g., the BERT models) and intermediate results to repeat the experiments. The paper also includes values for the parameters of the approach.
- The dataset is available on Figshare and linked in the paper. The dataset is documented in Annex B of the paper and on Figshare. However, I found it surprising that the dataset itself (i.e., the zip file that can be downloaded from Figshare) does not contain any README file. It might be better to make the dataset file more "self-contained" by adding a README file which describes the data (e.g., it could simply have the same text that the authors already used for the description on Figshare).

## Conclusion

In my humble opinion, this submission should be accepted. The issues that have been raised in the reviews of the previous version have been fixed and the quality of the article has been improved. The additional evaluation results are convincing and further underline the significance of the work.

## Typos and smaller Errors

- p 4 l 8: "A sentence selection step then identifies" --> this step seems to be named "Evidence Selection" in Figure 1.
- p 4 l42: "through sentence a embedding model." --> "using a sentence embedding model."
- p 4 l49: "and main category of )." --> and main category of."; if the authors refer to exact properties in this line, they may want to align the formatting of the property names to the formatting used later on in the article (see p14 l4-l6 for example).
- p22 l39-40: The text mentions an AUC score of 0.85 while Figure 10 shows the value 0.8371.
- p25 l28: "ElasticSearch" --> "Elasticsearch"

Review #2
By Houcemeddine Turki submitted on 28/Jun/2023
Minor Revision
Review Comment:

I have been recently honored to read "ProVe: A Pipeline for Automated Provenance Verification of Knowledge Graphs Against Textual Sources". The work is well-written and addresses an interesting issue: fact-checking of statements in knowledge graphs. This topic is relevant to Wikidata, as it supports adding references to its statements for verifiability. The outstanding contribution of the work is the use of pre-trained language models for identifying references for the statements as well as supporting sentences inside the considered resources. The work achieves high accuracy and can be turned into a tool that can be deployed to add reference claims and supporting sentences to Wikidata. However, it would be useful to address a few minor issues before the final publication:
1. It is difficult to see how language models have been used in ProVe: it is not clear how they were pre-trained and applied. Part 2.2.3 implicitly provides insights into this. However, it would be better to explain it explicitly in Part 3.
2. There is only a brief explanation of how references are defined in Wikidata, even though Wikidata references are central to the evaluation. In fact, Reference URL is not always the property used to provide the link to the reference; sometimes PubMed ID and other properties serve this purpose. It would be useful to explain why the work is restricted to statements referenced via Reference URL.
3. Wikidata provides several properties that can be used to add supporting sentences, like section, verse, paragraph, or clause (P958) or quotation or excerpt (P7081). It would be interesting to explain how the method could be used to enrich Wikidata with supporting sentences.
4. The paper does not discuss its limitations. An evident one is the consideration of common relation types, which are easier to extract than more complicated ones. The authors could also address the limitations of using Wikidata for benchmarking.
Based on these points, I propose that the research paper be accepted for final publication after these minor revisions are applied.

Review #3
Anonymous submitted on 21/Jul/2023
Minor Revision
Review Comment:

This paper has two main contributions:
- a new dataset for Wikidata AFC, and an accompanying novel annotation guideline to that end, and
- a new AFC pipeline called ProVe, which is shown to outperform the state of the art on the above dataset.
I find the first contribution to be compelling and highly useful for future research in AFC.

However, I have some concerns regarding the proposed ProVe pipeline.
The pipeline lacks originality. More specifically, it comprises parts that are not novel, as they are reused from other work without significant modification. This alone is not a problem, as I agree with the authors that the overall pipeline is still useful. Moreover, it is shown to outperform the state of the art.
Nevertheless, I find some claims and decisions made regarding the buildup of the pipeline problematic. I understand that considering every single thing properly would warrant an ungodly amount of ablation study. Still, I will highlight them below in detail, especially those that can be justified through rewriting and/or revisiting the related work.

Next, I find the data and code provided to be incomplete. The external link given for text extraction leads to an empty page. Other than that, the materials seem to be there (I haven’t personally tested anything), but I have to say that they are messy. For instance, I’m not sure others need to see all these Jupyter Notebook files; a few modular Python files that can be called to run the different scenarios would be tidier. Finally, while I see that each component is nicely separated into folders, I don’t see instructions on how to seamlessly run everything end-to-end (or to somehow integrate it with Wikidata, as that is the goal of the work).

Other than that, I find the overall new dataset and the result discussion to be clear and well-written, though in some parts the pipeline results are not particularly strong (e.g., ~0.4 F1-score on sentence selection, 0.6–0.75 F1-score on RTE, and class imbalance on RTE with no reported attempt at addressing it). It would be great to have more discussion of these results and how to improve them in the future.

Detailed comments are below.
#1 Highlights related to claims and decisions.

#1.1 “HybridFC assumes sentences will be retrieved by either of those systems, ProVe relies on state of the art LMs for data-to-text conversion.” → HybridFC also uses a sota LM as part of its data-to-text conversion. Whether HybridFC uses FactCheck or not is independent of its use of a sota LM.

#1.2 “Lastly, evidence document corpora normally used in general AFC tend to have a standard structure or come from
a specific source. Both FEVER [25] and VitaminC [50] take their evidence sets from Wikipedia, with FEVER’s
even coming pre-segmented as individual and clean sentences. Vo and Lee [51] use web articles from
and only…. KGs, however, accept provenance from potentially any website domains. As such, unlike general AFC approaches, ProVe employs a text extraction step in order to retrieve and segment text from triples’ references.” → Other techniques such as HybridFC also perform snippet extractions. Although HybridFC is limited to Wikipedia, it does not rely on its structures but only on the texts, so in principle, it can be used for other websites. Therefore, I do not think that this warrants such a strong claim on ProVe’s side.

#1.3 “Additionally, explainability for task-specific graph architectures, like those of KGAT and DREAM, is harder to tackle than for generalist sequence-to-sequence LM architectures which are shared across the research community [52–54]. Slightly decreasing potential performance in favour of a simpler and more explainable pipeline, ProVe employs LMs for both sentence selection and claim verification.” → I don’t see enough evidence that the explainability of GNNs is harder than that of seq2seq NLP models, or that LMs are more explainable than GNNs. I have personally never seen any work that directly compares the two (the authors should provide references otherwise). If the authors tried to argue (by proxy) that there is (qualitatively or quantitatively) more progress on the explainability of NLP models than of GNN models, then this is a weak argument. Even qualitatively, to name just a few, there are GraphXAI or GraphLIME for GNNs’ explainability. The current ProVe also does not address explainability; it is still left to future work.

#1.4 “As a subtask in AFC on KGs, claim verbalisation is normally done through text patterns [27, 29] and by filling templates [28], both of which can either be distantly learned or manually crafted. ProVe is the first approach to utilise an LM for this subtask. Amaral et al. [55] shows that a T5 model fine-tuned on WebNLG achieves very good results when verbalising triples from Wikidata across many domains. ProVe follows suit by also using a T5.” → Amaral et al. “show”.
What’s the goal, or the purported benefit, of using a fine-tuned LM in verbalization? This is not so clear to me and it’s not well-argued (nor sufficiently supported by evidence) in the paper or in the references. Is it just for the sake of being different?

I’d argue that rule-based (text pattern) verbalization approaches can still work well as long as they are consistent and general enough, and that, with regard to AFC purposes, a downstream sentence selection method (like the one used in HybridFC or KGAT) can recognize the verbalization's similarity with the supporting sentence well enough. In the original WDV paper, I also don’t see any discussion of whether the T5 model is a surefire better alternative than a rule-based approach for AFC. Rather, looking at the ~80% adequacy and ~3–4/5 fluency (and the weak agreement between annotators) reported in the WDV paper makes me wonder whether it really is a good enough alternative.

As far as I understood, WDV data is also manually crafted and distantly supervised because it relies on the surface labels of Wikidata (which is crowd-labeled) of the claims to build the seq2seq data, so in this regard, this is also not something that is (in an obvious way) a benefit compared to the previous work.

From a broader PoV, while it might be true that within the AFC context, this is the first work that leverages a fine-tuned encoder-decoder model for triple verbalization, it has been done in other contexts many times. For example, in “Semantic Triples Verbalization with Generative Pre-Training Model” by Blinov (fine-tuned GPT-2 model) and “Denoising Pre-Training and Data Augmentation Strategies for Enhanced RDF Verbalization with Transformers” by Montella et al. (fine-tuned BART-like model). In the context of ontology alignment, many approaches have utilized some verbalization of triples and have leveraged language models to check whether a pair of verbalized triples are similar. For instance, one can check some submitted papers to the annual OAEI (Ontology Alignment Evaluation Initiative).

All-in-all, I’d suggest making the motivation for using LM here explicitly clearer, making the novelty softer, and citing other verbalization work in other contexts.

#1.5 “Following on KGAT’s [22] and DREAM’s [21] approach to FEVER’s sentence selection subtask, ProVe employs a large pre-trained BERT transformer. ProVe’s sentence selection BERT is fine-tuned on the FEVER dataset by adding to it a dropout and a linear layer, as well as a final hyperbolic tangent activation, making outputted scores range from −1 to 1. The loss is a pairwise margin ranking loss with the margin set to 1.” etc. → I looked at KGAT’s repository and this is exactly how it is done there. Does ProVe just reuse KGAT’s sentence selection approach, or is there any novelty here? From the textual explanation, I would think they (ProVe’s and KGAT’s sentence selection) are the same, but these paragraphs are too verbose to just re-explain what previous work has done while adding no additional context. If the authors want to be verbose, perhaps they could instead explain things that were unexplained in KGAT’s paper but reused in ProVe. For instance, why do they choose to go with tanh and a margin ranking loss, when it is also possible to have a standard softmax layer with a cross-entropy loss? In Equation 7, the negative tanh output (rho) is scaled to 0 anyway, right? So why not have a softmax and set the irrelevant pairs to 0 and relevant ones to 1? Also, why choose to have a dropout layer here, but not in the RTE model?

It is also implied that ProVe retrains (re-fine-tunes) the sentence selection subtask. But if it is exactly the same as KGAT, why re-fine-tune? Why not just use the already fine-tuned model? If I go further, if we are just using fine-tuned models, then why not explore newer/larger models that are tuned for sentence similarity and use that for sentence selection that is potentially better than the 3-year-old KGAT? For example, the SBERT models or many other models within the huggingface repository.

There is also the strange choice of using BERT-large, when KGAT showed that BERT-base is better on the dev-set and RoBERTa-large is better on both the dev-set and test-set. Why use the mediocre, less efficient option?

#1.6 About the RTE model:
There is a small (but important) contradiction to address. The textual explanation says that the concatenation is between a verbalized triple and a piece of evidence, but in formula 6 it is the other way around.

This is important because in BERT there is an issue with token length. If the concatenation is reversed, then the verbalized triple might be cut off which should never happen. However, cutting off a piece of evidence is also undesirable as you might cut off the important parts. How do the authors mitigate this? I don’t think this is explained.
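One common mitigation (my assumption, not something confirmed in the paper) is to budget the token length asymmetrically: keep the verbalised triple intact and truncate only the evidence side. A minimal sketch, with hypothetical token lists standing in for real tokenizer output:

```python
def build_pair(claim_tokens, evidence_tokens, max_len=512, specials=3):
    # Reserve room for the special tokens ([CLS], [SEP], [SEP] in BERT's
    # pair encoding); never truncate the claim, only cut the evidence
    # from the right if the combined sequence is too long.
    budget = max_len - specials - len(claim_tokens)
    return claim_tokens, evidence_tokens[:max(budget, 0)]

claim = ["paris", "is", "the", "capital", "of", "france"]
evidence = ["tok"] * 600  # an over-long passage
c, e = build_pair(claim, evidence, max_len=512)
print(len(c), len(e))  # 6 503
```

Hugging Face tokenizers expose the same behaviour via `truncation="only_second"`; cutting evidence from the right is still lossy, so passage windows short enough to fit would be the safer design.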

#1.7 About the stance aggregation: this section is too verbose. The 3 strategies are each very straightforward. I think the author should just stick with the best one they found, explain it as clearly (but concisely) as possible, and leave the rest in an appendix as an ablation study.

Still, I do have some concerns about strategies 2 and 3. In the second strategy, it is assumed that the supporting pieces of evidence take higher precedence than the refuting ones. Why? This is not clear/trivial. If I take an extreme example, if there is a single supporting piece of evidence but 10 refuting ones, the strategy will aggregate them all as supporting. How does this make sense?
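To make the concern concrete, here is a minimal sketch of such a supports-take-precedence rule (my reading of the second strategy, not the authors' code), applied to the extreme case above:

```python
def rule_based_stance(labels):
    # Supporting evidence takes precedence over refuting evidence,
    # as the reviewed strategy assumes; counts are ignored entirely.
    if "SUPPORTS" in labels:
        return "SUPPORTS"
    if "REFUTES" in labels:
        return "REFUTES"
    return "NOT ENOUGH INFO"

# One supporting passage against ten refuting ones still yields SUPPORTS,
# which is the counter-intuitive behaviour questioned above.
labels = ["SUPPORTS"] + ["REFUTES"] * 10
print(rule_based_stance(labels))  # SUPPORTS
```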

In the third strategy, the authors employ a simple classifier with the probabilities of the sentence selection and RTE models, as well as the number of pieces of evidence, as features. Why are these considered sufficiently good features for the intended model? We are dealing with textual data and have language models lying around; why not leverage these powerful models for this purpose as well? In theory, they should offer better “features” than the ones in strategy 3, shouldn’t they? Please motivate this classifier a bit more.
#2 Grammar and clarity

state-of-the-art not “state of the art” when used as an adjective (multiple occurrences in the paper).

"HybridFC converts retrieved evidence sentences into numeric vectors through sentence a embedding model.” → word order

“We define ontological predicates as those whose meaning serves to structure or describe the ontology itself, such as subclass of and main category of).” → dangling ‘)’

“These approaches score sentences based on relevance to the claim and use a supervised classifier to classify the entire web page.” → classify into what?

“Alternatively, triples with particular predicates can be easily selected.” → as an alternative to what?

“The existence of a document retrieval step depends on whether provenance exists or needs to be searched from a repository, with the former scenario dismissing the need for the step. This is the case for ProVe, but not for the DeFacto line, which search for web documents.” → which “searches”. Also, this is confusing to me, which “former scenario” removes the need for document retrieval? Also, how would one know beforehand whether provenance already exists within a repository? That doesn’t sound trivial.