ProVe: A Pipeline for Automated Provenance Verification of Knowledge Graphs against Textual Sources

Tracking #: 3296-4510

Gabriel Amaral
Odinaldo Rodrigues
Elena Simperl

Responsible editor: 
Guest Editors Wikidata 2022

Submission type: 
Full Paper
Abstract:
Knowledge Graphs are repositories of information that gather data from a multitude of domains and sources in the form of semantic triples, serving as a source of structured data for various crucial applications in the modern web landscape, from Wikipedia infoboxes to search engines. Such graphs mainly serve as secondary sources of information and depend on well-documented and verifiable provenance to ensure their trustworthiness and usability. However, their ability to systematically assess and assure the quality of this provenance, most crucially whether it properly supports the graph's information, relies mainly on manual processes that do not scale with size. ProVe aims at remedying this, consisting of a pipelined approach that automatically verifies whether a Knowledge Graph triple is supported by text extracted from its documented provenance. ProVe is intended to assist information curators and consists of four main steps involving rule-based methods and machine learning models: text extraction, triple verbalisation, sentence selection, and claim verification. ProVe is evaluated on a Wikidata dataset, achieving promising results overall and excellent performance on the binary classification task of detecting support from provenance, with 87.5% accuracy and 82.9% F1-macro on text-rich sources. The evaluation data and scripts used in this paper are available in GitHub and Figshare.
Decision/Status:
Major Revision

Solicited Reviews:
Review #1
By Christos Christodoulopoulos submitted on 08/Dec/2022
Major Revision
Review Comment:

### General comments
The paper is very well written. The related work, system architecture and dataset collection are documented in great detail, with very few gaps that would hinder replication.

The originality of the work is relatively limited to the synthesis of existing components (with minor alterations to fit the use case) and the main significance of the work comes from the generation of the dataset.

The main motivational use case for ProVe (use by human curators of KGs) is never explicitly tested. This is a critical limitation of this work since, at the moment, it's unclear whether the effort required to use the tool (given that it's not 100% accurate) is less than the type of check the authors did when annotating for reference-level verification. In addition, for triples without references (new and existing), a system like DeFacto would be more useful. Interestingly, on page 8 (line 12) it is mentioned that such retrieval modules could be plugged into ProVe, but this is never tested. Additionally, there isn't a compelling reason as to why the authors did not use all the components from a state-of-the-art verification system (like DREAM), with the only addition being the claim verbalisation component. This would not only demonstrate the generalisability of the approach, but would also allow the authors to produce better results. Finally, a direct comparison with DeFacto and FactCheck (replacing the retrieval output with the reference URL) on the WTR dataset would be ideal to showcase the originality of the ProVe system.

The dataset, while meticulously annotated, is very small, and the natural class imbalance (since all references of a KG triple should, in theory, support it) makes it less attractive for evaluating end-to-end verification systems. To balance the small size, it would be better to increase the quality and coverage as much as possible, for instance by manually verbalising the claims (or fixing the language of the auto-verbalised ones) and thereby including qualifiers, which is identified as an issue in section 6.3.

### Technical issues/questions
- Add a section to describe the relation between triples and references, at least in Wikidata since it's the target KG for this work: a) are all triples supported by references? (no; e.g. Q92637182) b) can there be more than one reference per triple? c) are there non-supporting (refuting) references by design? can triples be verified only based on existing knowledge, with internal consistency checks? All these properties are needed to establish the working assumptions of ProVe. Especially the prevalence of references (% of triples w/o references) is a strong motivation for a retrieval-based approach like DeFacto.

- Related to the previous question, is it reasonable to assume that there are references that explicitly refute a claim? Could references that don't support a claim (either textually or via other modalities) be due to changes on the source website? This would suggest that a better mechanism for ProVe would be to access archival versions of the pages (when they were first introduced as references).

- Regarding the websites where support is stated as text but not as sentences, the latest FEVER dataset (FEVEROUS; Aly et al. 2021) contains claims verifiable by tables and text. FEVEROUS also contains (partial) evidence for NEI claims. Training the system on FEVEROUS would be an interesting addition to the current work.

- It's not clear how the sample size of WTR was computed. In section 4.1.2 it is mentioned that "385 references represents a 95% confidence interval and a 5% margin of error" but, to me, this way of calculating sample size doesn't make sense for datasets of natural language. What is the statistic being measured with a 95% confidence interval? Whatever it is, it should be indicative of the variance of language, structure and size of the population (total number of unique references), but I don't see any such calculation being made.
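For reference, the 385 figure matches the standard Cochran formula for the sample size needed to estimate a population proportion, which is presumably what the authors used (an assumption on my part; the paper does not spell the calculation out):

```python
import math

def cochran_sample_size(z: float, margin: float, p: float = 0.5) -> int:
    """Minimum sample size for estimating a proportion:
    n = z^2 * p * (1 - p) / margin^2, rounded up.
    p = 0.5 is the worst-case (maximum-variance) assumption."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# 95% confidence (z ~ 1.96) and a 5% margin of error
print(cochran_sample_size(1.96, 0.05))  # 385
```

Note that this formula estimates a proportion over the population of references; it says nothing about the variance of language, structure, or content, which is exactly the reviewer's point.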

- It's not clear why in Figure 6 the passages are plotted against individual annotations and not the majority vote. The whole point of using multiple annotators is that the individual annotations cannot be trusted (especially given the relatively low kappa score).
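The majority-vote aggregation the reviewer refers to could be computed along these lines (a sketch; the label names and the author tie-break follow the paper's description, the function name is my own):

```python
from collections import Counter

def majority_vote(labels, tie_breaker):
    """Aggregate one passage's crowd labels; ties fall back to a
    tie-breaking label (per the paper, the authors' own judgement)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_breaker
    return counts[0][0]

print(majority_vote(["SUPP", "SUPP", "NEI"], tie_breaker="REF"))  # SUPP
print(majority_vote(["SUPP", "REF"], tie_breaker="NEI"))          # NEI
```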

- The scores in Table 3 for the 1.D support type are interesting. In theory there shouldn't be anything in the text (explicitly accessible by the model) to help with the classification. Does this mean that these numbers are the class imbalance bias? In that case, what does it say for the model that the AUC is higher for 1.D compared to 1.B?

- While I agree with the arguments in Section 6.4, the example given as evidence for the type of real refuting evidence can be classified as a type of entity substitution. Admittedly, the entities in FEVER are not likely to include dates but that wasn't explicitly prohibited. As such it's not clear whether FEVER isn't sufficient for finetuning a system like ProVe.

### Minor comments
- Page 1, line 40: "real and abstract entities" implies that abstract entities aren't real. While it's more of a philosophical discussion, I would argue that KGs contain mostly real entities unless we look at fictional characters. A better contrast would be "concrete and abstract entities".

- The term "non-ontological" predicate is used early in the paper (page 4, line 43) without a formal definition.

- State the number of triples contained in the Wikidata dump used for WTR (instead of saying "vast amounts"). How many triples (what %) are covered by the 20M unique references extracted?

- What are the percentages of references belonging to groups 2 and 3 (page 14)? How is the definition of the last group operationalised?

- Why are there 6 subtasks in T1 instead of 5 (one for each retrieved evidence)?

- Page 18, line 29/32: the site referenced there does not exist; the correct site is a different one. Also, it's not accurate to say that there are no sentence breaks on that website: the semicolons (;) act as sentence/phrase boundaries.

- The term TRE, used in Section 5.4 and elsewhere is not the term of art for textual entailment. I would suggest using either RTE (the original name of the task) or the more modern term NLI (natural language inference).

- Missing closing bracket on page 21, line 22.

- For the results in Section 5.4.2 replace the labels 1.A - 2.B with interpretable ones. It's hard to remember the meaning of each label without going back to their definitions much earlier in the paper.

- Make sure the references have correct capitalisation (e.g. BERT instead of Bert for [6])

### Resources
I cannot access the resources using the link provided. Following the link to download the zip file, I get the error: {"message": "Entity not found: file", "code": "EntityNotFound"}

Review #2
By Michael Röder submitted on 20/Dec/2022
Major Revision
Review Comment:

# Publication Summary

The paper presents ProVe, an automatic approach to verify a given statement from a knowledge graph based on the references that are listed in the statement's provenance. The approach generates search phrases based on the statement and generates overlapping text passages from the reference documents. It is worth pointing out that ProVe is able to process references that contain plain text as well as HTML web pages. From this large set of passages, the Sentence Selection step assigns a relevance score to each passage and selects the top 5 passages for further processing. To this end, a BERT model is used for scoring. For each of these top passages, the Textual Entailment Recognition step assigns a score to each of the three FEVER classes, namely SUPPORTS, REFUTES, and NOT ENOUGH INFO. This step also relies on a BERT model. Finally, the Stance Aggregation step uses the class and relevance scores as well as further features from the passages as input and returns a final classification result whether the reference mentioned in the statement's provenance supports the statement or not. Three different aggregation methods are proposed: a weighted sum, a rule-based approach and a classifier (Random Forests are used in the evaluation).
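The passage generation and top-5 selection summarised above can be sketched as follows (illustrative only: the relevance scorer is a stand-in for ProVe's BERT model, and the function names are hypothetical):

```python
def sliding_passages(sentences, max_window=2):
    """All contiguous, overlapping windows of 1..max_window sentences."""
    passages = []
    for n in range(1, max_window + 1):
        for i in range(len(sentences) - n + 1):
            passages.append(" ".join(sentences[i:i + n]))
    return passages

def select_top_k(passages, relevance_fn, k=5):
    """Sentence Selection: keep the k highest-scoring passages."""
    return sorted(passages, key=relevance_fn, reverse=True)[:k]

sentences = ["A was born in B.", "A studied at C.", "A died in D."]
passages = sliding_passages(sentences)  # 3 single- + 2 two-sentence windows
top = select_top_k(passages, relevance_fn=len)  # len() stands in for BERT
print(len(passages), len(top))  # 5 5
```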

The authors also present the Wikidata Textual References (WTR) dataset. The dataset is based on Wikidata. First, 7M references are selected. The authors sample a subset and process the statements and their references with ProVe's Sentence Selection. Crowdsourcing workers annotate each of the system's selected passages with one of the three FEVER classes with respect to the statement. The same is done for a set of evidence for a given statement. Finally, the authors annotate the references themselves with one of the three classes. WTR contains 416 statement-reference pairs with 76 distinct Wikidata properties.

Section 5 of the paper contains a detailed evaluation. Where possible, the single steps of the approach are evaluated on their own. I won't go into the details of the evaluation. However, the main result of evaluating the complete pipeline is that ProVe performs best if the Stance Aggregation uses a) a Random Forests classifier and b) the classification is carried out as binary classification (supports yes/no) instead of a multi-class classification with all three FEVER classes.

Section 6 discusses results and limitations. For example, the presented version of ProVe does neither take qualifiers within the provenance nor negations within the extracted passages into account.

# Review Summary

## Originality

The main difference between ProVe and the related work is the usage of language models. It could be argued that their application is obvious and has already been used in the related area of fake news detection. However, to the best of my knowledge, there is no work that uses language models for this particular application area. I also think that the problem of training the models is solved in a reasonable way by using the FEVER dataset.

The dataset is a well-described resource that can be useful for the community in the future.

## Significance of the Results

The WTR dataset can have a significant impact in the future. However, it is not possible to say anything about the significance of the evaluation results for ProVe. The performance of the system is not compared to any other system within Section 5.4. While Sections 5.1–5.3 look at intermediate results, and it is clear that these might not be compared to the related work (since other systems may not even produce a comparable intermediate result), it is not clear why the overall system's performance is not compared to the performance of the related work. This is a major issue which is further detailed in the next section.

## Quality of Writing

Overall, the writing is good. However, some parts of the paper are lengthy in comparison to their content and should be further improved. I think that this is a minor issue of the current state of the paper and can easily be solved. I listed several suggestions further below.

## Open Science Data

The repeatability of the experiments seems to be already good but can be further improved.

- The code of ProVe is linked in the paper. It seems to come with the necessary data (e.g., the BERT models) and intermediate results to repeat the experiments. Unfortunately, I didn't have the time to rerun anything, so it might be missing something that I am simply not aware of. However, the paper should contain some information about the parameters. For example, the paper contains the statement "Population Based Training was used to tune learning rate, batch sizes, and warmup ratio" (p11 l46–47). However, the parameters that produce the final evaluation results are not listed in the paper, and I also couldn't find them in the linked source code.
- The dataset is available on Figshare and linked in the paper. The dataset is documented in Annex B of the paper. However, I found it surprising that the dataset itself does not contain any README file. It might be better to have such a README as part of the dataset which describes the data and links to other sources, e.g., the source code of the paper.

## Conclusion

The submission is already in good shape. It has some minor issues in the following areas:
- References and Citations
- Inconsistencies
- Writing

The paper has a major issue with respect to its evaluation. The numbers rely on a newly created dataset and ProVe is not compared to any other system (details follow below). This is a major issue, which leads to my conclusion that a major revision is required for this submission.

# Major: Evaluation

At a first glance, the evaluation is very detailed. The single steps of the pipeline are evaluated one after the other. However, I see two gaps that make the evaluation fail its main purpose.
1. It focusses on a single use case while the introduction names three. As one of the paper's inconsistencies, this is a minor issue and is explained further below in the respective section.
2. The evaluation is mainly based on the WTR dataset created by the authors. The different variants of ProVe are compared in the final experiment. However, the main problem is that ProVe is not compared to any other system. This leads to the situation that although we can see accuracy values, it is not possible to argue whether these values are good or not. Maybe the WTR dataset is very simple and the values are low. Maybe the dataset is very hard and the values are good. This major problem of missing interpretability has to be fixed before the paper can be published. I see several ways in which this could be done and I am confident that the authors are able to choose and implement one (or even find a better way).

1. Compare to system(s) of the related work
Several times within the paper, DeFacto and FactCheck are named as systems that are closely related to ProVe. Hence, a comparison with these systems seems reasonable, especially since the third use case for ProVe (p2 l47–48) fits very well to the systems of DeFacto and FactCheck. The setup of the experiment would of course be up to the authors. However, as a co-author of FactCheck, I would like to point out that a comparison on the WTR dataset might be slightly unfair, since the dataset has a very large number of different properties compared to the number of triples (76 properties vs. 416 triple-reference pairs). This ratio would be fatal for DeFacto and FactCheck, which are only trained on this small number of examples, while parts of ProVe can make use of the much larger FEVER dataset. A comparison on an updated version of the FactBench dataset might be more reasonable.

2. Add a baseline
If ProVe is only made for the particular use case of classifying the stance of triple references and it does not seem to be possible to have a comparison to systems like DeFacto and FactCheck, the introduction of a reasonable baseline could help. A "standard" approach could outline the difficulty of the dataset and help interpret the accuracy value. However, the paper would have to include a) a reasonable baseline and b) an argumentation why the baseline can be used for comparison while other systems are not used.
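A minimal instance of such a baseline is the majority-class predictor, which also directly exposes how much of an accuracy score is due to class imbalance (a sketch with toy data, not actual WTR figures):

```python
from collections import Counter

def majority_class_baseline(train_labels, test_labels):
    """Always predict the most frequent training label; return it and
    the resulting test accuracy."""
    majority = Counter(train_labels).most_common(1)[0][0]
    accuracy = sum(y == majority for y in test_labels) / len(test_labels)
    return majority, accuracy

# Toy, skewed labels, mimicking that most references should support triples
train = ["SUPP"] * 8 + ["NEI"] * 2
test = ["SUPP"] * 7 + ["NEI"] * 2 + ["REF"]
print(majority_class_baseline(train, test))  # ('SUPP', 0.7)
```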

# Minor: References and Citations

- The following references in the bibliography refer to preprints on arXiv although the referenced papers have been published at research conferences. I think it would be better to acknowledge the achievement of the papers' authors in getting their work published at major conferences by citing the conference versions accordingly:
[3] COLING 2018
[7] EMNLP 2018
[10] ACL 2020
[11] ACL 2020
[12] ACL 2019
[13] EMNLP 2018
[14] NAACL 2018
[41] NAACL 2021
[42] EMNLP 2020
[47] NLP4ConvAI 2021
[55] EACL 2017

- The introduction of the paper lacks references to back up claims. For example, the first paragraph of the publication starts with an explanation of KGs. First, it refers to the term "knowledge base" which is not further defined. Second, the whole paragraph has one single citation which does not seem to be connected either to the explanation of KGs or to any of the examples in the paragraph. This is especially important since knowledge graphs have a central role within the paper but do not have any further definition apart from this paragraph. Similarly, the first two sentences of the second paragraph do not have any reference ([2], mentioned in the third sentence, does not seem to back up the claims in these two sentences).
- Citing the AdamW creators would be fair.

# Minor: Inconsistencies

- The introduction defines three different use cases for ProVe (p2 l45–l48): "Firstly, by assisting the detection of verifiability issues in existing references, bringing them to the attention of humans. Secondly, given a triple and its reference, it can promote re-usability of the reference by verifying it against neighbouring triples. Finally, given a new KG triple entered by editors or suggested by KG completion processes, it can analyse and suggest references." The remainder of the paper focuses solely on the first use case, and the second and third are never mentioned again. It would be fair if the authors either clearly stated that they focus on the first use case and that the others are outside the scope of the paper, or moved the second and third use cases into the future work section.
- Definition of "AFC" in the text vs. related work. The term "is commonly defined in the Natural Language Processing (NLP) domain as a broader category of tasks and sub-tasks [3–5] whose goal is to, given a textual claim and searchable document corpora as inputs, verify said claim’s veracity or support by collecting and reasoning over evidence." (p2 l38–40) Within the following paragraphs, the authors try to widen the definition of AFC to cover works that do neither rely on textual claims nor documents but only on triples and reference knowledge graphs, e.g., [35, 36]. However, this does not work since the definition that has been given before clearly excludes them. Table 1 points out this dilemma when it contains the task "triple prediction" with triples as input and paths as evidence. However, this task is not further defined throughout the paper. In general, I appreciate that the authors try to point out that there is related work that does not work with natural language sources and representations. However, I think it would be better if such approaches would be presented as part of a closely related field of research ("fact validation" or "fact checking" or "triple classification"; maybe not triple prediction as it can be misunderstood as link prediction which again leads to a lot of other works). In this context, I would also like to point out that the subtask "SS" in Table 1 for [31, 32, 35, 36] does not make sense to me. To the best of my knowledge, they do not use any sentence selection since they solely work with triples.
- Definition of "AFC on KGs". In Section 1, it is defined as "AFC on KGs takes a single KG triple and its documented provenance in the form of an external reference." Section 2 changes this definition to "Given a KG triple and either its documented external provenance or searchable external document corpora whose items could be used as provenance, AFC on KGs can be defined as the automated verification of said triple’s veracity or support by collecting and reasoning over evidence extracted from such actual or potential provenance." The difference between the two versions is that the first one excludes related work like DeFacto and FactCheck while the second is formulated in a way that includes them. However, there should be exactly one definition of this term that is consistent across the publication. In addition, I would like to encourage the authors to use the active voice in this situation, since (to the best of my knowledge) their work seems to be the only one so far that defines the term "AFC on KGs". Hence, a formulation like "We define AFC on KGs as..." would clearly communicate this.
- Figure 3: The arrows in the figure show that E is a result of the sentence selection. E is then used as input for the Stance Aggregation. However, in the textual description, E is already input for the TER step. Hence, the two descriptions differ in the sense that either the TER step is executed for all (v,p_i) pairs or only for the top 5 p_i.
- p6 l42: The sum of probabilities is not well-defined. I assume that the three probabilities for the i-th evidence together should sum to 1.0. However, this can easily be misunderstood since the notation does not make clear whether the sum runs over all k, all i, or both.
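One unambiguous way to state the reading the reviewer assumes (per-evidence distributions over the three FEVER classes; the notation below is my own, not the paper's):

```latex
% For each selected passage i, the TER class scores form a
% probability distribution over the three FEVER classes:
\forall i \in \{1, \dots, k\}: \quad
P_i(\mathrm{SUPP}) + P_i(\mathrm{REF}) + P_i(\mathrm{NEI}) = 1
```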

# Minor: Writing

I would like to point out that the general writing is good. The paper only contains a small number of writing errors (compared to its length). The errors I found are listed further below. However, the paper is written in a style that is slightly exhausting for the reader.

## Writing Style

The style in which this paper is written comes with two main drawbacks: long sentences with verbose formulations and repetitions.

- The paper comprises very long, verbose sentences with unnecessarily complicated formulations. While a reader of this review may have already noted that I tend to have very long sentences as well, this is typically discouraged in scientific publications. The sentences should be short and to the point. I do not ask the authors to rewrite the whole paper. However, it would be good if they could at least cut the longest sentences into several shorter sentences with a clearer structure. Some examples of long sentences or verbose formulations (this list is not complete; there are many more of them):
-- p10 l31–33: "Sentence selection consists of, given a claim, rank a set of sentences based on how relevant each is to the claim, where relevance is defined as contextual proximity e.g. similar entities or overlapping information." --> "Our sentence selection ranks the generated passages according to their relevance with respect to the given claim. We define relevance as contextual proximity, e.g., similar entities or overlapping information."
-- p10 l40-43: "Fine-tuning is achieved by feeding the model pairs of inputs, where the first element is a concatenation of a claim and a relevant sentence, while the second element is the same but with an irrelevant sentence instead, and training it to assign higher scores to the first element, such that the difference in scores between the pair is 1 (the margin)." --> "We fine tune the model by feeding pairs of inputs. The first element of a pair is a concatenation of a claim and a relevant sentence. The second element is the same claim but with an irrelevant sentence. We train the model to assign higher scores to the first element, such that the score difference is 1."
-- p18 l24: "ranging from as low as 1 to as high as 804" --> "ranging from 1 to 804"
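For illustration, the margin-based fine-tuning objective described in the second example above corresponds to a standard pairwise ranking loss (a sketch of the objective only; ProVe's actual training loop and scores come from the BERT model):

```python
def margin_ranking_loss(score_relevant, score_irrelevant, margin=1.0):
    """Pairwise ranking loss: zero only when the relevant passage
    outscores the irrelevant one by at least `margin`."""
    return max(0.0, margin - (score_relevant - score_irrelevant))

print(margin_ranking_loss(1.0, 0.5))  # 0.5 (gap below the margin: penalised)
print(margin_ranking_loss(2.0, 0.5))  # 0.0 (gap of 1.5 satisfies the margin)
```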

- While the overall structure of the paper is very good, parts of the content are repeated several times. Especially Section 5 repeats a lot of the content of Sections 3 and 4. Some examples (this list is not complete; there are more of them):
-- Terms are introduced repeatedly, e.g., the three types of stances are introduced two times within Section 3.1.
-- Section 3.1. gives the overview over the algorithm two times. First from p6 l37 to l46 and after that again from l47-p7 l6. I understand that in the first part, ProVe is explained as a single algorithm while the second part describes Figure 3. However, I would suggest combining these descriptions (if possible).
-- Section 5.1.1. is a repetition of results presented in [46]. From my point of view, stating that the approach for this part of ProVe has already been evaluated with a short summary of the result and a reference to [46] would be better.
-- In Section 5.3, it is explained three times that the top 5 passages from the evidence set are used. This is something that a) has already been clearly defined in Section 3.4. (e.g., equation 5) and b) could maybe be repeated once, but not three times in a section that focuses on the evaluation.
-- The first paragraph of Section repeats content from Section 4.2.
-- p21 l37: "Crowdsourced annotations are collected multiple times and aggregated through majority voting, with authors serving as tie-breakers." While this has already been described in Section 4 and a reader might be already used to this kind of repetition, this sentence may confuse readers since it might not be clear why crowdsourcing is applied _again_ (which is not the case).

## Typos and smaller Errors

- p1 l44: search engines results --> search engine results
- p2 l17: "well explored" --> "well-explored"
- p2 l27: state-of-the-art --> state of the art
- p2 l40: as well a --> as well as a
- Figure 1: It seems like the figure cuts off a part of line 4.
- p5 l28: but use --> but uses (or the previous "does" has to be "do")
- p6 l45: "as well as to calculate" something seems to be missing there.
- p7 l31: the meaning of v is not clear in that line. Is it the sentence or does it represent "the same information" that is expressed? A reader has to read further to understand that it refers to the sentence. I think that the v can be placed better beforehand to avoid such a misunderstanding.
- p9 l25: "KG triple" --> "in a KG triple" (or something similar)
- p9 l37: "text text"
- p10 l4: s_{j} has a strange whitespace in front of it.
- p10 l6: "a n-sized" --> "an n-sized"
- p11 l22: "a pre-trained BERT" --> "a pre-trained BERT model"
- p11 l42: "‘NOT ENOUGH INFO’" --> "‘NEI’"
- p12 l10: "weighted sum \sigma" --> naming the weighted sum sigma at this point doesn't seem to be correct to me.
- p12 l13: "probability" --> "probabilities"
- p12 l29: "y is 1 is" --> "y is 1 if"
- p12 l29: "the triple-reference is" --> "the triple-reference pair is"
- p13 l47-48: the double quotes around the URL seem to be two single quotes. That should be fixed.
- p14 l10: "according according" --> "according"
- p14 l13: "is carried" --> "is carried out"
- p15 l28: "were carried through" --> "were carried out through"
- p15 l28: "Its structured" --> "Its structure"
- p16 l48: "properly carried" --> "properly carried out"
- p18 l37: "a sanity check of by" --> "a sanity check by"
- p18 l47: "current state-of-the-art, on the" --> "current state of the art on the"
- p19 l8: "(-1 to 1)" --> "($-$1 to 1)"
- p20 l47: "is carried" --> "is carried out"
- p21 l44: "domain such as" --> "domains such as"
- p21 l49: "Using this argmax approach" --> referring to something with "this" at the beginning of a new subsection is not good since it is unclear what this word refers to. This should be formulated in a different way.
- p21 l51: "once can measure" --> "one can measure"
- p25 l20: "As the same time" --> "At the same time"
- p25 l35: The last sentence of 6.2. is incomplete.

# Further Suggestions and Comments

The following comments are "neutral" in the sense that they do not influence my final rating of the paper. Instead, they should be seen as suggestions to further improve the quality of the paper.

- Why is the maximum size of the sliding window only n=2? Are there any results with respect to larger window sizes? According to the argumentation, it would be possible to reach 0% irrelevant passages if we simply always select the complete document (assuming that the reference contains anything of relevance and that the LM is able to identify that). I assume that there is some limitation, and it would be good if the authors could point out why larger passages would not work.
- It is not necessary to introduce abbreviations several times (e.g., AFC, NLP, KG).
- If an abbreviation is introduced, it should be used consistently instead of the term that it abbreviates (e.g., SUPP, REF, NEI).
- I would suggest avoiding unnecessary "rating" adjectives like "simple" as long as the authors do not have a particular reason to use them (e.g., p12 l10/l43).
- In some parts, the text switches between tenses. This should be avoided.
- Vague formulations like "can be defined" should be avoided. Instead, it should always be clearly stated how the authors define the terms. (e.g., p19 l32: "one can define the value zero as the threshold")
- Most of the footnotes seem to be misplaced (e.g., 1–4). To the best of my knowledge, a footnote should either follow directly the name of the tool without any white space or it should be placed at the end of the sentence (as it has been done for footnote 5).
- Figure 1: I like that there is a clear separation of data and processes and that data is the input to a process which outputs new data. However, between "Document Retrieval" and "Evidence Selection", there is no data object. Maybe "Relevant Documents" could make sense between them.
- Figure 4: some text is very small, especially the text on the arrows.
- p10 l3: {S^{i}}_{j} --> S^{i}_{j} to ensure that i and j are beneath each other, or (better) S_{i,j} to avoid the confusion with exponents; if the authors prefer the first solution, I would like to point out, that the subscript is typically the start and the superscript is the end (e.g., integrals).
- p21 l27--31: It is typically good to choose a single format for numbers and to stick to it. If it is necessary to report four digits after the decimal point, it should be done for all numbers (i.e., "0.617"-->"0.6170"; "0.76110"-->"0.7611"). Presenting the results in a table might be better in this particular example.
- Figure 5: a dashed vertical line for the relevance score 0 might be helpful within the diagram.
- Figure 7: if the sentence selection assigned relevance scores in the range [-1,1], why is it possible that the line in the diagram has points with a value >0 outside of this range? I would suggest that the results are represented in a way that avoids this kind of misunderstandings.
- Figure 8: The names "Supports Model" and "Supports Crowd" should be replaced by only "Supports" since the axis labels clearly define that the rows show the crowd sourcing results while the columns show the TER model's class predictions.